#VectorMind
RAG-powered document Q&A — upload PDFs, ask questions in natural language, and get answers grounded in your documents with source citations.
Built with React + FastAPI + Qdrant Cloud + Claude. This app is designed to run locally on your machine (local embeddings require more RAM than free cloud tiers provide). See run_locally.txt for startup commands.
This README is the full technical reference: architecture, RAG pipeline and client/server flow.
- Overview
- Tech stack
- Quick start
- Project structure
- Where FastAPI fits
- System architecture
- Session identification
- RAG ingestion pipeline
- RAG query pipeline
- Upload flow (client + server)
- Chat flow (client + server)
- Data model
- API reference
- Configuration
- Testing
VectorMind is a Retrieval-Augmented Generation (RAG) application. Users upload PDF documents; the system extracts text, splits it into chunks, embeds them into a vector database, and answers questions by retrieving the most relevant passages and passing them to Anthropic Claude as context.
Problem it solves: General-purpose LLMs do not know your private documents. RAG lets you query proprietary PDFs without retraining a model — upload new files and the knowledge base updates immediately.
Key features:
- Landing page with drag-and-drop PDF upload
- Per-browser session isolation (no login required)
- Async background indexing with progress polling
- Streaming chat responses (Server-Sent Events)
- Source citations (filename, page, chunk preview)
- CORS, rate limiting, structured logging, Prometheus metrics
Runs locally only — not hosted online. The frontend shows a notice on the landing page; use run_locally.txt to start backend + frontend.
| Layer | Technology | Role |
|---|---|---|
| Frontend | React 19 + Vite | Landing page, drag-drop upload, streaming chat UI |
| API server | FastAPI + Uvicorn | REST + SSE endpoints (app/main.py) |
| LLM | Anthropic Claude (claude-sonnet-4-6) |
Answer generation via langchain-anthropic |
| Embeddings | HuggingFace all-MiniLM-L6-v2 (local, 384-dim) |
Semantic vectors for chunks and queries |
| Vector DB | Qdrant Cloud | Per-user collections, cosine similarity search |
| Reranker | CrossEncoder ms-marco-MiniLM-L-6-v2 |
Re-score top-20 vector hits down to top-5 |
| PDF parsing | PyMuPDF + PyPDF2 fallback + Tesseract OCR | Text extraction from scanned pages |
| Chat history | SQLite + SQLAlchemy | Persistent conversations and messages |
| Orchestration | LangChain | Text splitting, conversation memory, LLM wrapper |
- Python 3.12+
- Node.js 18+
- Anthropic API key
- Qdrant Cloud cluster (free tier works)
See run_locally.txt for copy-paste commands. Summary:
1. Install backend dependencies
pip install -r requirements.txt2. Configure environment
cp .env.example .envEdit .env with your keys:
ANTHROPIC_API_KEY=your_anthropic_api_key
QDRANT_URL=https://your-cluster-id.region.cloud.qdrant.io:6333
QDRANT_API_KEY=your_qdrant_api_key
CORS_ORIGINS=http://localhost:5173,http://127.0.0.1:51733. Start the backend (project root)
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload4. Start the frontend (new terminal)
cd frontend
npm install
npm run dev5. Open http://localhost:5173
The API entry point is app/main.py. Uvicorn loads the FastAPI app object directly.
frontend/.env can set VITE_API_BASE_URL=http://localhost:8000 — defaults to localhost if unset.
pdf_chat/
├── app/
│ ├── main.py # FastAPI app factory, middleware, router registration
│ ├── config.py # Settings from .env (pydantic-settings)
│ ├── session.py # X-Session-Id → guest User (per-browser isolation)
│ ├── database.py # SQLAlchemy engine, init_db()
│ ├── models.py # User, Conversation, Message tables
│ ├── routes/
│ │ ├── upload.py # POST /upload, GET /upload/status/{task_id}
│ │ ├── chat.py # POST /chat, /chat/stream, GET /status
│ │ ├── conversations.py # CRUD for chat threads + message history
│ │ └── health.py # GET /health
│ └── core/ # RAG engine (no HTTP knowledge)
│ ├── pdf_parser.py # PDF text extraction + OCR fallback
│ ├── rag.py # Chunking, retrieval, reranking, LLM calls
│ ├── vectorstore.py # Qdrant integration
│ ├── embeddings.py # HuggingFace embedding singleton
│ ├── cache.py # Answer TTL cache (1 hour)
│ └── tasks.py # Background upload job state
├── frontend/
│ └── src/
│ ├── App.jsx # Landing page + chat dashboard
│ ├── api/client.js # HTTP client, session ID, SSE parsing
│ └── components/
│ ├── PdfUpload.jsx # Drag-drop upload (hero + sidebar variants)
│ ├── ChatInterface.jsx
│ └── StatusBadge.jsx
├── tests/ # pytest suite
├── scripts/eval_rag.py # Golden-set RAG evaluation
├── run_locally.txt # Commands to start the full app
├── .env.example # Environment variable template
└── requirements.txt
Rule of thumb: app/routes/ handles HTTP; app/core/ handles AI/document logic. The React app never talks to Qdrant or Claude directly — everything goes through FastAPI.
FastAPI is the backend server. It is not a separate layer sitting beside another server — app/main.py creates the application, registers routes, applies middleware, and on startup initializes SQLite.
flowchart TB
subgraph client [React Client :5173]
Landing[Landing + PdfUpload]
ChatUI[ChatInterface]
APIClient["api/client.js X-Session-Id"]
end
subgraph fastapi [FastAPI Server :8000 app/main.py]
Routes["app/routes/ upload chat conversations"]
Session["app/session.py guest user"]
Middleware["CORS rate-limit metrics"]
end
subgraph core [RAG Engine app/core]
Parser[pdf_parser.py]
RAG[rag.py]
VS[vectorstore.py]
Emb[embeddings.py]
end
subgraph external [External Services]
Qdrant[(Qdrant)]
Claude[Anthropic API]
SQLite[(SQLite)]
end
Landing --> APIClient
ChatUI --> APIClient
APIClient -->|"HTTP REST + SSE"| Routes
Routes --> Session
Routes --> Parser
Routes --> RAG
RAG --> Emb
RAG --> VS
VS --> Qdrant
RAG --> Claude
Routes --> SQLite
What FastAPI does:
- Receives HTTP requests from the React client
- Resolves the browser session to a
Userrow (app/session.py) - Validates input (file size, question not empty, etc.)
- Persists conversations/messages to SQLite
- Delegates document AI work to
app/core/ - Returns JSON or SSE streams
What FastAPI does not do: embedding, vector search, PDF parsing, or LLM calls — those live in app/core/.
flowchart TB
subgraph browser [Browser]
ReactSPA["React SPA frontend/"]
LocalStorage["localStorage session_id"]
end
subgraph backend [Backend :8000]
Main["app/main.py"]
UploadRoute["routes/upload.py"]
ChatRoute["routes/chat.py"]
ConvRoute["routes/conversations.py"]
SessionMod["session.py"]
Tasks["core/tasks.py"]
RAGEngine["core/rag.py"]
VectorStore["core/vectorstore.py"]
PDFParser["core/pdf_parser.py"]
DB["database.py + models.py"]
end
subgraph infra [Infrastructure]
QdrantDB["Qdrant Cloud"]
SQLiteDB["pdf_chat.db"]
end
subgraph ml [ML Services]
MiniLM["HuggingFace MiniLM local"]
CrossEnc["CrossEncoder reranker local"]
ClaudeAPI["Anthropic Claude API"]
end
ReactSPA -->|"X-Session-Id"| Main
LocalStorage -.-> ReactSPA
Main --> UploadRoute
Main --> ChatRoute
Main --> ConvRoute
UploadRoute --> SessionMod
ChatRoute --> SessionMod
ConvRoute --> SessionMod
SessionMod --> DB
DB --> SQLiteDB
UploadRoute --> Tasks
Tasks --> PDFParser
Tasks --> RAGEngine
ChatRoute --> RAGEngine
RAGEngine --> VectorStore
VectorStore --> MiniLM
VectorStore --> QdrantDB
RAGEngine --> CrossEnc
RAGEngine --> ClaudeAPI
ConvRoute --> DB
| Component | Port / path | Responsibility |
|---|---|---|
| React (Vite) | :5173 |
UI, session ID, API calls |
| FastAPI | :8000 |
HTTP API, auth session, orchestration |
| Qdrant Cloud | HTTPS | Vector storage per user (pdf_chat_user_{id}) |
| SQLite | pdf_chat.db |
Users, conversations, messages |
| Claude | HTTPS API | Text generation |
There is no login or JWT. Each browser generates a UUID stored in localStorage and sends it on every request as the X-Session-Id header. The backend creates a guest User row on first visit.
sequenceDiagram
participant Browser
participant React as React client.js
participant API as FastAPI session.py
participant SQLite
Browser->>React: First visit
React->>React: crypto.randomUUID to localStorage
React->>API: GET /status + X-Session-Id
API->>SQLite: get_or_create guest-{uuid}@local
SQLite-->>API: User id=1
API-->>React: status JSON with user_id
Note over API,SQLite: user_id scopes Qdrant collection and chat history
- Same browser = same session = same documents and conversations
- Incognito / cleared storage = new session = empty state
- Not production-secure: anyone with a session UUID could access that guest's data (acceptable for local/demo)
What happens when a user uploads PDFs — from file bytes to searchable vectors.
flowchart LR
PDF[PDF file] --> Extract["pdf_parser.py PyMuPDF/OCR"]
Extract --> Pages["page text + metadata"]
Pages --> Chunk["rag.py RecursiveCharacterTextSplitter"]
Chunk --> Embed["embeddings.py MiniLM 384-dim"]
Embed --> Upsert["vectorstore.py Qdrant upsert"]
| Step | Module | What happens |
|---|---|---|
| 1. Upload | routes/upload.py |
Client sends multipart/form-data. Server validates PDF type, per-file size (50 MB), total size (200 MB), max 10 files. Returns 202 Accepted with task_id. |
| 2. Background job | core/tasks.py |
FastAPI BackgroundTasks runs process_upload_task. Status tracked in-memory: pending → processing → completed / failed. |
| 3. Extract | core/pdf_parser.py |
PyMuPDF extracts text per page. If a page has < 50 chars, Tesseract OCR runs on a rendered image. PyPDF2 is the fallback if PyMuPDF yields nothing. |
| 4. Chunk | core/rag.py |
RecursiveCharacterTextSplitter splits text into ~1000-character chunks with 200-character overlap. Each chunk gets metadata: filename, page, chunk_index. |
| 5. Embed | core/embeddings.py |
HuggingFace all-MiniLM-L6-v2 converts each chunk to a 384-dimensional vector. Runs locally — no embedding API cost. |
| 6. Store | core/vectorstore.py |
Vectors upserted into Qdrant collection pdf_chat_user_{user_id} with cosine distance. Payload stores content + metadata for retrieval. |
Chunk overlap ensures sentences split across chunk boundaries are still findable by semantic search.
What happens when a user asks a question — semantic search through answer generation.
flowchart TB
Q[User question] --> CacheCheck{Answer cached?}
CacheCheck -->|yes| CachedAnswer[Return cached answer]
CacheCheck -->|no| QEmbed[Embed query vector]
QEmbed --> Search["Qdrant cosine similarity top-20"]
Search --> Rerank["CrossEncoder rerank top-5"]
Rerank --> Prompt["Build prompt context + history"]
Prompt --> Claude[Claude generate]
Claude --> Answer[Answer + source citations]
Answer --> StoreCache[Cache answer 1hr]
Traditional search matches keywords. Semantic search matches meaning:
- The question is embedded into the same 384-dim vector space as document chunks
- Qdrant finds the 20 nearest neighbors by cosine similarity
- A CrossEncoder reranker re-scores those 20 pairs
(question, chunk)and keeps the best 5
This two-stage retrieve-then-rerank pattern is common in production RAG: vector search is fast but approximate; the cross-encoder is slower but more accurate.
Code path: retrieve_context() in rag.py → vectorstore.similarity_search(k=20) → rerank_documents(top_k=5).
The top-5 chunks are formatted as:
[Source 1: filename.pdf, page 3]
<chunk text>
Plus optional conversation history (last 6 messages from LangChain ConversationBufferMemory) and the user's question. System instructions in RAG_SYSTEM_INSTRUCTIONS enforce professional formatting, no emojis, and no raw source dumps in the answer body.
| Mechanism | Location | Lifetime |
|---|---|---|
| Answer cache | core/cache.py |
1 hour TTL, keyed by user_id + question |
| LangChain memory | rag.py _user_memories |
In-process; lost on server restart |
| Chat history | SQLite messages table |
Persistent across restarts |
sequenceDiagram
participant User
participant PdfUpload as PdfUpload.jsx
participant Client as api/client.js
participant API as routes/upload.py
participant Tasks as core/tasks.py
participant Parser as pdf_parser.py
participant RAG as core/rag.py
participant Qdrant
User->>PdfUpload: Drop PDF files
PdfUpload->>Client: uploadPdfs FormData
Client->>API: POST /upload + X-Session-Id
API->>API: Validate files, create task_id
API-->>Client: 202 task_id
API->>Tasks: BackgroundTasks process_upload_task
loop Poll every 2s
Client->>API: GET /upload/status/task_id
API-->>Client: pending / processing / completed
end
Tasks->>Parser: extract_pdf_pages per file
Parser-->>Tasks: page text tuples
Tasks->>RAG: chunk_documents
Tasks->>RAG: index_documents
RAG->>Qdrant: embed + upsert vectors
Tasks-->>Tasks: status completed
Client->>API: GET /status
API-->>Client: vectorstore_loaded true
Note over User,PdfUpload: App.jsx switches from landing to chat dashboard
sequenceDiagram
participant User
participant Chat as ChatInterface.jsx
participant Client as api/client.js
participant API as routes/chat.py
participant RAG as core/rag.py
participant Qdrant
participant Claude as Anthropic Claude
participant SQLite
User->>Chat: Type question, submit
Chat->>Client: streamChatMessage
Client->>API: POST /chat/stream + X-Session-Id
API->>SQLite: Save user message
API->>RAG: ask_question_stream user_id question
RAG->>Qdrant: similarity_search top-20
RAG->>RAG: rerank to top-5
RAG->>Claude: astream prompt with context
loop SSE tokens
Claude-->>RAG: token chunk
RAG-->>Client: data token
Client-->>Chat: Append to message bubble
end
RAG-->>Client: data done sources
Client-->>Chat: Render source citations
API->>SQLite: Save assistant message + sources_json
Non-streaming POST /chat follows the same retrieval path but returns a single JSON response via ask_question().
SQLite schema (app/models.py):
erDiagram
User ||--o{ Conversation : has
User ||--o{ Message : has
Conversation ||--o{ Message : contains
User {
int id PK
string email UK
string hashed_password
datetime created_at
}
Conversation {
int id PK
int user_id FK
string title
datetime created_at
datetime updated_at
}
Message {
int id PK
int user_id FK
int conversation_id FK
string role
text content
text sources_json
datetime created_at
}
Guest users use emails like guest-{uuid}@local. The user_id foreign key scopes both SQLite rows and the Qdrant collection name.
In-memory only (not in DB):
upload_tasksdict intasks.py— upload job progress_user_memoriesdict inrag.py— LangChain conversation buffer_answer_cachedict incache.py— recent Q&A pairs
| Method | Path | Auth | Description |
|---|---|---|---|
GET |
/ |
None | API root + endpoint list |
GET |
/health |
None | Health check + uptime |
GET |
/status |
X-Session-Id |
Vectorstore ready, document count |
POST |
/upload |
X-Session-Id |
Upload PDFs (async, returns task_id) |
GET |
/upload/status/{task_id} |
X-Session-Id |
Poll upload progress |
POST |
/chat |
X-Session-Id |
Ask question (JSON response) |
POST |
/chat/stream |
X-Session-Id |
Ask question (SSE stream) |
GET |
/conversations |
X-Session-Id |
List conversations |
POST |
/conversations |
X-Session-Id |
Create conversation |
GET |
/conversations/{id}/messages |
X-Session-Id |
Get message history |
GET |
/metrics |
None | Prometheus metrics |
Upload:
POST /upload
X-Session-Id: <uuid>
Content-Type: multipart/form-data
files: [pdf1, pdf2, ...]
Chat stream:
POST /chat/stream
X-Session-Id: <uuid>
Content-Type: application/json
{"question": "What is NLP?", "conversation_id": 1}
SSE response format:
data: {"token": "Natural "}
data: {"token": "Language "}
data: {"done": true, "sources": [{"content": "...", "metadata": {"filename": "...", "page": 3}}]}
| Endpoint | Limit |
|---|---|
/chat, /chat/stream |
20 requests / minute |
/upload |
5 requests / minute |
All settings in app/config.py, overridable via .env:
| Variable | Default | Description |
|---|---|---|
ANTHROPIC_API_KEY |
"" |
Required for Claude |
QDRANT_URL |
http://localhost:6333 |
Qdrant URL (use Cloud URL with :6333 for local dev) |
QDRANT_API_KEY |
"" |
Required for Qdrant Cloud |
QDRANT_COLLECTION_PREFIX |
pdf_chat |
Collection name prefix |
DATABASE_URL |
sqlite:///./pdf_chat.db |
SQLAlchemy connection |
CORS_ORIGINS |
http://localhost:5173,... |
Allowed frontend origins |
CLAUDE_MODEL |
claude-sonnet-4-6 |
Anthropic model ID |
CLAUDE_MAX_TOKENS |
2048 |
Max response tokens |
CLAUDE_TEMPERATURE |
0.2 |
LLM temperature |
CHUNK_SIZE |
1000 |
Characters per chunk |
CHUNK_OVERLAP |
200 |
Overlap between chunks |
RETRIEVAL_TOP_K |
20 |
Vector search candidates |
RERANK_TOP_K |
5 |
Chunks sent to Claude |
EMBEDDING_MODEL |
all-MiniLM-L6-v2 |
HuggingFace model |
RERANKER_MODEL |
ms-marco-MiniLM-L-6-v2 |
CrossEncoder model |
pytest tests/ -vTests mock Qdrant and use an in-memory SQLite database. Session headers are sent via X-Session-Id.
Optional dev dependencies:
pip install -r requirements-dev.txtMIT