Codebase Analyzer

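Semantic code search and Q&A over GitHub repositories. Give it a repo URL and it downloads the repository, parses the Python files, chunks them by AST, embeds the chunks with sentence-transformers, and indexes them with FAISS HNSW. You can then search the code in under 200 ms or ask questions that an LLM answers from the retrieved chunks.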

How it works

POST /index  →  Download repo  →  Parse .py files  →  AST chunk (function/class)
                                                           ↓
              FAISS HNSW index  ←  Embed chunks  ←  sentence-transformers
                    ↓
POST /search  →  Embed query  →  FAISS top-k  →  Ranked code chunks (sub-200ms)
POST /ask     →  Embed query  →  FAISS top-k  →  LLM answers with context (RAG)

Key design decisions

AST-based chunking over character splits. Each chunk is a complete Python function or class extracted via ast.parse(), not an arbitrary 500-character window. Embeddings therefore capture real semantic units: a full function with its docstring, not half of one function and half of another. Files with syntax errors fall back to sliding-window chunking.
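A minimal sketch of this idea, using only the standard library. The function name `chunk_source`, the dictionary keys, and the restriction to top-level definitions are illustrative assumptions, not the contents of chunker.py. The fallback here uses plain non-overlapping windows as a stand-in for the sliding-window strategy.

```python
import ast


def chunk_source(source: str, window: int = 500) -> list[dict]:
    """Split Python source into function/class chunks, or character windows on syntax errors."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Fallback: fixed-size character windows when the file cannot be parsed.
        return [
            {"kind": "window", "name": None, "text": source[i:i + window]}
            for i in range(0, len(source), window)
        ]

    chunks = []
    for node in tree.body:  # top-level definitions only, to keep the sketch short
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks


sample = '''
def greet(name):
    """Return a greeting."""
    return f"Hello, {name}"

class Greeter:
    def run(self):
        return greet("world")
'''

chunks = chunk_source(sample)                                 # one function, one class
broken = chunk_source("def oops(:\n    pass\n", window=10)    # unparsable, so windows
```

Each parsed chunk carries its own line range, which is what lets search results point back to an exact location in the file.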

FAISS HNSW over IVF. A single repository typically produces fewer than 100k chunks. At this scale HNSW gives sub-10 ms search with high recall and needs no training step. IVF pays off at millions of vectors, which is overkill here.

Configurable embeddings. Default: all-MiniLM-L6-v2 via sentence-transformers (local, free, 384-dim). Set EMBEDDING_PROVIDER=openai for OpenAI text-embedding-3-small (1536-dim, paid). Swap models via env var — zero code changes.
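One way such a switch could be read from the environment is sketched below. The `EmbeddingConfig` class, the `load_embedding_config` function, and the value `local` for the default provider are hypothetical names for illustration. The model names and dimensions are the ones listed above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str
    model: str
    dim: int


# "local" is an assumed name for the default sentence-transformers provider.
_CONFIGS = {
    "local": EmbeddingConfig("local", "all-MiniLM-L6-v2", 384),
    "openai": EmbeddingConfig("openai", "text-embedding-3-small", 1536),
}


def load_embedding_config(env=None) -> EmbeddingConfig:
    """Pick the embedding backend from EMBEDDING_PROVIDER, defaulting to the local model."""
    env = os.environ if env is None else env
    provider = env.get("EMBEDDING_PROVIDER", "local").strip().lower()
    if provider not in _CONFIGS:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider!r}")
    if provider == "openai" and not env.get("OPENAI_API_KEY"):
        raise ValueError("EMBEDDING_PROVIDER=openai requires OPENAI_API_KEY")
    return _CONFIGS[provider]


default_config = load_embedding_config({})
openai_config = load_embedding_config(
    {"EMBEDDING_PROVIDER": "openai", "OPENAI_API_KEY": "sk-..."}
)
```

Because the two models produce vectors of different sizes (384 vs 1536), an index built with one provider cannot be queried with the other. Switching providers means re-indexing the repository.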

LLM fallback chain. Groq (primary, free) → Together AI → Anthropic Claude. If the primary LLM fails, the Q&A endpoint automatically routes to the fallback.

Async indexing. Indexing is a background task — POST /index returns immediately with a job_id. Poll /index/{job_id}/status for progress. No request timeouts on large repos.
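On the client side, polling could look like the sketch below. `wait_for_index` is a hypothetical helper that takes any callable returning the status JSON, so it can be tried without a running server. The `failed` status is an assumption, since only `queued` and `done` appear in this README.

```python
import time
from typing import Callable


def wait_for_index(
    fetch_status: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        # "failed" is an assumed terminal state; adjust to what the API reports.
        if status.get("status") in {"done", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"indexing still {status.get('status')!r} after {timeout_s}s")
        sleep(interval_s)


# Simulated responses from GET /index/{job_id}/status.
_states = iter([
    {"status": "queued"},
    {"status": "queued"},
    {"status": "done", "chunks_indexed": 847},
])
result = wait_for_index(lambda: next(_states), interval_s=0, sleep=lambda s: None)
```

In real use, `fetch_status` would issue the GET request and decode the JSON body.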

API

`POST /index` — Index a repository

{
  "repo_url": "https://github.com/psf/requests",
  "strategy": "function"
}

Returns { "job_id": "a1b2c3", "status": "queued" }. Strategies: function (default), class, sliding.
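A sketch of issuing this request from Python with the standard library. The base URL assumes the server is running locally on port 8000, as in the Setup section, and `build_index_request` is a hypothetical helper. The request is only constructed here; the commented lines show how it would be sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn on port 8000


def build_index_request(repo_url: str, strategy: str = "function") -> urllib.request.Request:
    """Build the POST /index request for a repository and chunking strategy."""
    if strategy not in {"function", "class", "sliding"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    body = json.dumps({"repo_url": repo_url, "strategy": strategy}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/index",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("https://github.com/psf/requests")

# To send it (requires the server to be running):
# with urllib.request.urlopen(request) as response:
#     job = json.loads(response.read())   # e.g. {"job_id": "...", "status": "queued"}
```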

`GET /index/{job_id}/status` — Poll indexing progress

{
  "status": "done",
  "chunks_indexed": 847,
  "embed_time_ms": 3200,
  "index_time_ms": 45,
  "index_size_mb": 1.24
}
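As a sanity check on these numbers, the sample index_size_mb is consistent with storing 847 float32 vectors of 384 dimensions. Whether the service actually measures size this way is an assumption. The HNSW graph links add some overhead on top of the raw vectors.

```python
def raw_vector_mib(num_chunks: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of the raw embedding matrix in MiB (float32 by default)."""
    return num_chunks * dim * bytes_per_value / (1024 * 1024)


sample_mib = raw_vector_mib(847, 384)         # about 1.24
ceiling_mib = raw_vector_mib(100_000, 384)    # about 146.5
```

Even at 100k chunks the raw vectors stay under 150 MiB at 384 dimensions, so the whole index fits comfortably in memory.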

`POST /search` — Semantic search

{
  "repo_url": "https://github.com/psf/requests",
  "query": "how are SSL certificates verified",
  "top_k": 5
}

Returns ranked code chunks with file path, function name, line numbers, and similarity score.

`POST /ask` — Q&A with context retrieval (RAG)

{
  "repo_url": "https://github.com/psf/requests",
  "question": "How does the retry mechanism work?"
}

Retrieves relevant chunks via FAISS, passes them as context to the LLM, and returns a natural-language answer with source citations.
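The context-assembly step can be sketched as below. The chunk fields (`path`, `start_line`, `end_line`, `text`), the prompt wording, and the character budget are all hypothetical; the actual prompt used by the service is not shown in this README.

```python
def build_prompt(question: str, chunks: list[dict], max_chars: int = 6000) -> str:
    """Number the retrieved chunks so the model can cite them as [n]."""
    parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        block = (
            f"[{i}] {chunk['path']}:{chunk['start_line']}-{chunk['end_line']}\n"
            f"{chunk['text']}\n"
        )
        if used + len(block) > max_chars:
            break  # stop once the context budget is spent
        parts.append(block)
        used += len(block)
    context = "\n".join(parts)
    return (
        "Answer the question using only the numbered code excerpts below. "
        "Cite excerpts as [n].\n\n"
        f"{context}\nQuestion: {question}\nAnswer:"
    )


retrieved = [
    {"path": "requests/adapters.py", "start_line": 10, "end_line": 12,
     "text": "def send(self, request):\n    ..."},
    {"path": "requests/sessions.py", "start_line": 40, "end_line": 44,
     "text": "def request(self, method, url):\n    ..."},
]
prompt = build_prompt("How does the retry mechanism work?", retrieved)
```

Chunks arrive ranked by similarity, so truncating from the end drops the least relevant context first.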

Project structure

Codebase-Analyzer/
├── app/
│   ├── github_utils.py     # Repo download + ZIP extraction
│   ├── file_analyzer.py    # Python file discovery and parsing
│   ├── chunker.py          # AST-based code chunking (function/class/sliding)
│   ├── embedder.py         # Configurable embeddings (sentence-transformers / OpenAI)
│   ├── indexer.py          # FAISS HNSW index with disk persistence
│   └── llm_client.py       # LLM client with fallback (Together AI / Anthropic)
├── main.py                 # FastAPI app — /index, /search, /ask, /health
├── requirements.txt
├── .env.example
└── .gitignore

Setup

git clone https://github.com/ananyavrm04/Codebase-Analyzer.git
cd Codebase-Analyzer
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Create .env:

# Required — at least one LLM
TOGETHER_API_KEY=your_key_here

# Optional — enables fallback LLM
ANTHROPIC_API_KEY=sk-ant-...

# Optional — use OpenAI embeddings instead of local
# EMBEDDING_PROVIDER=openai
# OPENAI_API_KEY=sk-...

Run:

uvicorn main:app --host 0.0.0.0 --port 8000

Stack

| Component | Technology |
| --- | --- |
| Framework | FastAPI |
| Embeddings | sentence-transformers (default) / OpenAI |
| Vector search | FAISS HNSW |
| Code parsing | Python `ast` module |
| LLM | Groq (primary) / Together AI / Anthropic (fallback chain) |
| Async | FastAPI BackgroundTasks + asyncio |

Performance

| Metric | Value |
| --- | --- |
| Search latency (FAISS) | < 10 ms |
| End-to-end search (embed + search) | < 200 ms |
| Indexing (1000 chunks) | ~3-5 s |
| Max repo size tested | 80 MB+ |
