This repository contains a fully local Retrieval-Augmented Generation (RAG) pipeline designed to extract, synthesize, and analyze information from dense academic PDFs.
I built this project to bridge the gap between large language models and static documents, with a focus on running smoothly on constrained hardware (the free tier of Google Colab, using a Tesla T4 GPU).
Building a reliable RAG system requires balancing retrieval accuracy with generation capabilities, all while keeping an eye on VRAM limits. Here is the breakdown of the pipeline:
- Chunking Strategy: The PDF is parsed into 800-character chunks with a 100-character overlap. This prevents context fragmentation and ensures that cross-page sentences aren't split abruptly.
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`. Chosen because it is extremely lightweight and fast while still producing semantically dense vectors.
- Vector Search: FAISS (Facebook AI Similarity Search) using L2 distance. The index runs entirely on the CPU, saving precious GPU memory for the generation model.
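The chunking and retrieval steps above can be sketched in a few lines. This is a dependency-free illustration, not the repository's code: toy vectors stand in for the all-MiniLM-L6-v2 embeddings, and the brute-force search simply computes the same squared-L2 ranking that `faiss.IndexFlatL2` performs.

```python
# Minimal sketch of the retrieval side: fixed-size chunking with overlap,
# plus a brute-force L2 search standing in for FAISS's IndexFlatL2.
# In the real pipeline, all-MiniLM-L6-v2 supplies 384-dim embeddings;
# here the vectors are toy placeholders so the example stays self-contained.

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into `size`-char windows, each sharing `overlap` chars
    with the previous one so sentences spanning a boundary survive intact."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def l2_search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Return indices of the k vectors closest to `query` in squared L2
    distance -- the same quantity faiss.IndexFlatL2 ranks by."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, query)) for vec in index]
    return sorted(range(len(index)), key=lambda i: dists[i])[:k]

chunks = chunk_text("x" * 2000)
print([len(c) for c in chunks])           # [800, 800, 600]

index = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_search(index, [1.0, 0.0], k=2))  # [1, 2]
```

Note how the second chunk starts 700 characters in, so its first 100 characters repeat the tail of the first chunk; that overlap is what keeps boundary-straddling sentences intact.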
Initially, I developed this pipeline using Flan-T5-Base (a Seq2Seq model). While it was fine for simple extraction, it struggled with "context confusion" when presented with long, concatenated chunks of text during multi-hop reasoning tasks.
To achieve deeper, abstractive reasoning without relying on paid APIs, I upgraded the system to Qwen2.5-3B-Instruct.
- Hardware Optimization: A 3-billion-parameter model normally demands substantial VRAM. Loading the model in half precision (`torch_dtype=torch.float16`) cuts the weight footprint to roughly 6GB, which lets it run on a standard 15GB T4 GPU without Out-Of-Memory (OOM) errors.
- Prompt Engineering: The implementation leverages standard chat templates (`apply_chat_template`) combined with a strict system prompt and a low temperature (`temperature=0.1`) to keep the LLM grounded and reduce hallucinations.
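The ~6GB figure follows directly from fp16 storing two bytes per parameter, and the grounding setup is just a system-prompt-plus-context message list. A quick sketch, with the caveat that the system-prompt wording below is illustrative rather than the repository's exact prompt:

```python
# Back-of-envelope check of the half-precision footprint claimed above:
# fp16 stores each weight in 2 bytes, so a 3B-parameter model needs
# roughly 6 GB for weights alone -- well inside a 15 GB T4, leaving
# headroom for activations and the KV cache.
params = 3_000_000_000                          # approximate parameter count
print(f"fp32: {params * 4 / 1024**3:.1f} GB")   # ~11.2 GB
print(f"fp16: {params * 2 / 1024**3:.1f} GB")   # ~5.6 GB

# The grounded prompt is a plain chat-message list; the system prompt text
# here is illustrative, not the repository's exact wording.
retrieved_context = "(concatenated top-k retrieved chunks would go here)"
question = "Explain the main advantages of the Transformer over RNNs."
messages = [
    {"role": "system", "content": "Answer strictly from the provided context. "
                                  "If the context does not contain the answer, say so."},
    {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
]
# With a Hugging Face tokenizer, this list is then rendered via:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False,
#                                        add_generation_prompt=True)
```

Keeping the retrieved context inside the user turn, with the behavioral constraints in the system turn, is what lets the low-temperature decoding stay anchored to the document.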
You can run this entire pipeline directly in a Google Colab notebook.
- Clone the repository and install the dependencies: `pip install -r requirements.txt`
- Upload your target PDF as `paper.pdf` in the working directory.
- Run the `RAG_Engine.ipynb` notebook. (Make sure your runtime is set to Hardware Accelerator: GPU / T4.)
To benchmark the system, I tested it against the foundational "Attention Is All You Need" paper. The model successfully demonstrates multi-hop reasoning (pulling context from different pages) and provides analytical, non-extractive answers.
Sample Query:
Explain the main advantages of the Transformer architecture over RNNs according to the paper.
System Output:
ANSWER: According to the provided context, the main advantage of the Transformer architecture over RNNs is that it computes representations of its input and output without using sequence-aligned RNNs or convolution. This implies that the Transformer avoids the sequential dependencies and memory limitations inherent in RNNs, which can be a significant drawback in handling long sequences.
SOURCES: Pages 2, 3
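A page-level citation line like the one above falls out naturally if each chunk carries the page it came from as metadata, and the pages of the retrieved chunks are deduplicated at answer time. A minimal sketch, where the `Chunk` class and `format_sources` helper are illustrative names, not the repository's actual API:

```python
# Hedged sketch of how a "SOURCES: Pages ..." line can be produced:
# every chunk keeps its source page, and the pages of the retrieved
# chunks are deduplicated and sorted for the citation line.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    page: int   # page of the PDF this chunk was extracted from

def format_sources(retrieved: list[Chunk]) -> str:
    pages = sorted({c.page for c in retrieved})
    return "SOURCES: Pages " + ", ".join(map(str, pages))

hits = [Chunk("...self-attention...", 3),
        Chunk("...recurrence...", 2),
        Chunk("...parallelizable...", 3)]
print(format_sources(hits))  # SOURCES: Pages 2, 3
```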
Pasha Ahmadi, M.Sc. Student in Computer Engineering