Codebase RAG Pipeline

Retrieval-augmented question answering pipeline that indexes the vendored vLLM codebase and answers codebase questions with BM25 retrieval plus a local Qwen model.

Technical overview

The pipeline is split into four stages:

Ingestion – walk the repository, read supported text files, and split them into chunks.
Indexing – build a BM25 index over chunk contents.
Retrieval – return the top-k source spans for a question.
Generation – feed the retrieved context into Qwen/Qwen3-0.6B through transformers and generate a short answer.
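The indexing and retrieval stages can be sketched in a few lines of plain Python. This is a minimal illustration of Okapi BM25 scoring, not the bm25s implementation the pipeline actually uses; the class name TinyBM25, the tokenizer, and the parameter defaults (k1=1.5, b=0.75) are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric/underscore runs (a simplification)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class TinyBM25:
    """Illustrative Okapi BM25 index over a list of chunk strings."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / max(len(docs), 1)
        # Document frequency: number of chunks that contain each term.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(docs)
        self.idf = {
            term: math.log(1 + (n - freq + 0.5) / (freq + 0.5))
            for term, freq in df.items()
        }

    def score(self, query: str, i: int) -> float:
        """BM25 score of chunk i for the query."""
        tf, dl = self.tf[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int = 5) -> list[tuple[int, float]]:
        """Return (chunk_index, score) pairs for the k best-scoring chunks."""
        scored = [(i, self.score(query, i)) for i in range(len(self.tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy chunks standing in for the real corpus.
chunks = [
    "def run_server(args): start the OpenAI compatible server",
    "class Scheduler: decides which requests to run next",
    "Sliding window attention configuration notes",
]
index = TinyBM25(chunks)
top = index.search("OpenAI compatible server", k=2)
```

Here top holds two (chunk_index, score) pairs, with the first chunk ranked highest because it is the only one containing the query terms.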

Implementation details worth knowing:

Python files are chunked with a simple AST-aware split at def / class boundaries.
Markdown and text files use a sliding-window strategy.
Code and docs are indexed separately.
The generator disables Qwen3 thinking mode with enable_thinking=False so answers stay direct and concise.
Metadata for source spans is preserved as MinimalSource objects, which makes evaluation and traceability straightforward.
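The two chunking strategies above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function names chunk_python and chunk_text, the tuple layout, and the window sizes are hypothetical, and this version only keeps top-level def / class nodes (module-level statements and decorator lines are not captured).

```python
import ast


def chunk_python(source: str) -> list[tuple[int, int, str]]:
    """Return (start_line, end_line, text) for each top-level def/class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            start, end = node.lineno, node.end_lineno
            chunks.append((start, end, "\n".join(lines[start - 1:end])))
    return chunks


def chunk_text(text: str, size: int = 400, overlap: int = 100) -> list[tuple[int, int, str]]:
    """Sliding window over characters; returns (start, end, text) spans."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        end = min(start + size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = (
    "import os\n\n"
    "def f():\n    return 1\n\n"
    "class A:\n    def g(self):\n        return 2\n"
)
py_chunks = chunk_python(sample)
text_chunks = chunk_text("abcdefghij", size=4, overlap=2)
```

Keeping the start and end positions alongside each chunk is what lets a retrieved chunk be mapped back to a source span later.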

Prerequisites

Python 3.10
uv
A working local PyTorch installation with transformers
Enough disk space to unpack the vendored vLLM corpus and dataset archives

Installation

make install

The repository expects the raw codebase corpus under data/raw/ and the evaluation dataset archives under data/.

Run it

Build the index, then query the pipeline:

make index
uv run python -m student search "OpenAI compatible server" --k 5
uv run python -m student answer "How to configure an OpenAI server" --k 10

If you prefer the Makefile wrappers, the equivalent commands are make search and make answer.

Evaluation and metrics

The repo includes a local evaluator that reports Recall@1, Recall@3, Recall@5, and Recall@10 on the public question datasets.

Results reported here and in the evaluator docs:

Recall@5 on docs: 86%
Recall@5 on code: 48%
Required thresholds in the evaluator: 55% docs and 45% code at Recall@5
answer_dataset throughput: about 2m16s for 100 questions (~1.36 s/it)
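For readers unfamiliar with the metric, here is one way Recall@k could be computed. The matching rule is an assumption for illustration: a gold span counts as found when any of the top-k retrieved spans comes from the same file and overlaps its character range. The Span class is a hypothetical stand-in, and the repository's evaluator may define a hit differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical source span: file path plus a half-open character range."""
    path: str
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    """True when both spans are in the same file and their ranges intersect."""
    return a.path == b.path and a.start < b.end and b.start < a.end


def recall_at_k(retrieved: list[Span], gold: list[Span], k: int) -> float:
    """Fraction of gold spans hit by at least one of the top-k retrieved spans."""
    if not gold:
        return 0.0
    top = retrieved[:k]
    hits = sum(any(overlaps(r, g) for r in top) for g in gold)
    return hits / len(gold)


def mean_recall_at_k(runs: list[tuple[list[Span], list[Span]]], k: int) -> float:
    """Average recall_at_k over (retrieved, gold) pairs, one pair per question."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)


gold = [Span("a.py", 0, 100), Span("b.md", 50, 80)]
retrieved = [Span("c.py", 0, 10), Span("a.py", 90, 200), Span("b.md", 0, 60)]
scores = {k: recall_at_k(retrieved, gold, k) for k in (1, 2, 3)}
```

Under this rule the toy ranking misses both gold spans at k=1, finds one at k=2, and finds both at k=3.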

Input corpus and demo data

The repository vendors the vLLM 0.10.1 codebase under data/raw/vllm-0.10.1/. That corpus is the main indexing target.

For evaluation, the pipeline uses public question datasets in data/datasets/ and writes search outputs to data/output/search_results/ and answered outputs to data/output/search_results_and_answer/.

Design choices

bm25s for sparse retrieval and predictable ranking behavior
pydantic models for stable serialization of datasets, chunks, and answers
fire for lightweight CLI wiring
tqdm for dataset progress reporting
transformers + torch for local generation instead of a hosted API

Notes

No API keys or tokens are hardcoded in the implementation files.
The pipeline is optimized for fast lookup of answers to questions about the repository, not for free-form chat.
