Skip to content

GWVG/codebase-rag-pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Codebase RAG Pipeline

Retrieval-augmented question answering pipeline that indexes the vendored vLLM codebase and answers codebase questions with BM25 retrieval plus a local Qwen model.

Technical overview

The pipeline is split into four stages:

  1. Ingestion – walk the repository, read supported text files, and split them into chunks.
  2. Indexing – build a BM25 index over chunk contents.
  3. Retrieval – return the top-$k$ source spans for a question.
  4. Generation – feed the retrieved context into Qwen/Qwen3-0.6B through transformers and generate a short answer.

Implementation details worth knowing:

  • Python files are chunked with a simple AST-aware split by def / class boundaries.
  • Markdown and text files use a sliding-window strategy.
  • Code and docs are indexed separately.
  • The generator disables Qwen3 thinking mode with enable_thinking=False so answers stay direct and concise.
  • Metadata for source spans is preserved as MinimalSource objects, which makes evaluation and traceability straightforward.

Prerequisites

  • Python 3.10
  • uv
  • A working local PyTorch installation with transformers
  • Enough disk space to unpack the vendored vLLM corpus and dataset archives

Installation

make install

The repository expects the raw codebase corpus under data/raw/ and the evaluation dataset archives under data/.

Run it

Build the index, then query the pipeline:

make index
uv run python -m student search "OpenAI compatible server" --k 5
uv run python -m student answer "How to configure an OpenAI server" --k 10

If you prefer the Makefile wrappers, the equivalent commands are make search and make answer.

Evaluation and metrics

The repo includes a local evaluator that reports Recall@1, Recall@3, Recall@5, and Recall@10 on the public question datasets.

Published results in the README and evaluator docs:

  • Recall@5 on docs: 86%
  • Recall@5 on code: 48%
  • Required thresholds in the evaluator: 55% docs and 45% code at Recall@5
  • answer_dataset throughput: about 2m16s for 100 questions (~1.36 s/it)

Input corpus and demo data

The repository vendors the vLLM 0.10.1 codebase under data/raw/vllm-0.10.1/. That corpus is the main indexing target.

For evaluation, the pipeline uses public question datasets in data/datasets/ and writes search outputs to data/output/search_results/ and answered outputs to data/output/search_results_and_answer/.

Design choices

  • bm25s for sparse retrieval and predictable ranking behavior
  • pydantic models for stable serialization of datasets, chunks, and answers
  • fire for lightweight CLI wiring
  • tqdm for dataset progress reporting
  • transformers + torch for local generation instead of a hosted API

Notes

  • No API keys or tokens are hardcoded in the implementation files.
  • The repository is optimized for fast repository-question lookup, not for free-form chat.

About

Retrieval-augmented question answering pipeline that indexes the vendored vLLM codebase and answers codebase questions with BM25 retrieval plus a local Qwen model.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors