This repository contains a fully local Retrieval-Augmented Generation (RAG) pipeline designed to extract, synthesize, and analyze information from dense academic PDFs.
I built this project to bridge the gap between large language models and static documents, with a focus on running smoothly on constrained hardware (the free tier of Google Colab, using a Tesla T4 GPU).
Building a reliable RAG system requires balancing retrieval accuracy with generation capabilities, all while keeping an eye on VRAM limits. Here is the breakdown of the pipeline:
- Chunking Strategy: The PDF is parsed into 800-character chunks with a 100-character overlap. This prevents context fragmentation and ensures that cross-page sentences aren't split abruptly.
- Embeddings: `sentence-transformers/all-MiniLM-L6-v2`. Chosen because it is extremely lightweight and fast while still producing semantically dense vectors.
- Vector Search: FAISS (Facebook AI Similarity Search) using L2 distance. The index runs entirely on the CPU, saving precious GPU memory for the generation model.
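The chunking and retrieval steps above can be sketched in a few lines. This is a dependency-free illustration, not the repository's code: toy vectors stand in for the all-MiniLM-L6-v2 embeddings, and the brute-force search simply computes the same squared-L2 ranking that `faiss.IndexFlatL2` performs.

```python
# Minimal sketch of the retrieval side: fixed-size chunking with overlap,
# plus a brute-force L2 search standing in for FAISS's IndexFlatL2.
# In the real pipeline, all-MiniLM-L6-v2 supplies 384-dim embeddings;
# here the vectors are toy placeholders so the example stays self-contained.

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into `size`-char windows, each sharing `overlap` chars
    with the previous one so sentences spanning a boundary survive intact."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def l2_search(index: list[list[float]], query: list[float], k: int = 2) -> list[int]:
    """Return indices of the k vectors closest to `query` in squared L2
    distance -- the same quantity faiss.IndexFlatL2 ranks by."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, query)) for vec in index]
    return sorted(range(len(index)), key=lambda i: dists[i])[:k]

chunks = chunk_text("x" * 2000)
print([len(c) for c in chunks])           # [800, 800, 600]

index = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_search(index, [1.0, 0.0], k=2))  # [1, 2]
```

Note how the second chunk starts 700 characters in, so its first 100 characters repeat the tail of the first chunk; that overlap is what keeps boundary-straddling sentences intact.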
Initially, I developed this pipeline using Flan-T5-Base (a Seq2Seq model). While it was fine for simple extraction, it struggled with "context confusion" when presented with long, concatenated chunks of text during multi-hop reasoning tasks.
To achieve deeper, abstractive reasoning without relying on paid APIs, I upgraded the system to Qwen2.5-3B-Instruct.
- Hardware Optimization: A 3-billion-parameter model normally demands substantial VRAM. Loading the model in half precision (`torch_dtype=torch.float16`) cuts the weight footprint to roughly 6GB, which lets it run on a standard 15GB T4 GPU without Out-Of-Memory (OOM) errors.
- Prompt Engineering: The implementation leverages standard chat templates (`apply_chat_template`) combined with a strict system prompt and a low temperature (`temperature=0.1`) to keep the LLM grounded and reduce hallucinations.
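The ~6GB figure follows directly from fp16 storing two bytes per parameter, and the grounding setup is just a system-prompt-plus-context message list. A quick sketch, with the caveat that the system-prompt wording below is illustrative rather than the repository's exact prompt:

```python
# Back-of-envelope check of the half-precision footprint claimed above:
# fp16 stores each weight in 2 bytes, so a 3B-parameter model needs
# roughly 6 GB for weights alone -- well inside a 15 GB T4, leaving
# headroom for activations and the KV cache.
params = 3_000_000_000                          # approximate parameter count
print(f"fp32: {params * 4 / 1024**3:.1f} GB")   # ~11.2 GB
print(f"fp16: {params * 2 / 1024**3:.1f} GB")   # ~5.6 GB

# The grounded prompt is a plain chat-message list; the system prompt text
# here is illustrative, not the repository's exact wording.
retrieved_context = "(concatenated top-k retrieved chunks would go here)"
question = "Explain the main advantages of the Transformer over RNNs."
messages = [
    {"role": "system", "content": "Answer strictly from the provided context. "
                                  "If the context does not contain the answer, say so."},
    {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
]
# With a Hugging Face tokenizer, this list is then rendered via:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False,
#                                        add_generation_prompt=True)
```

Keeping the retrieved context inside the user turn, with the behavioral constraints in the system turn, is what lets the low-temperature decoding stay anchored to the document.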
You can run this entire pipeline directly in a Google Colab notebook.
- Clone the repository and install the dependencies: `pip install -r requirements.txt`
- Upload your target PDF as `paper.pdf` in the working directory.
- Run the `RAG_Engine.ipynb` notebook. (Make sure your runtime is set to Hardware Accelerator: GPU / T4.)
To benchmark the system, I tested it against the foundational "Attention Is All You Need" paper. The model successfully demonstrates multi-hop reasoning (pulling context from different pages) and provides analytical, non-extractive answers.
Sample Query:
Explain the main advantages of the Transformer architecture over RNNs according to the paper.
System Output:
ANSWER: According to the provided context, the main advantage of the Transformer architecture over RNNs is that it computes representations of its input and output without using sequence-aligned RNNs or convolution. This implies that the Transformer avoids the sequential dependencies and memory limitations inherent in RNNs, which can be a significant drawback in handling long sequences.
SOURCES: Pages 2, 3
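A page-level citation line like the one above falls out naturally if each chunk carries the page it came from as metadata, and the pages of the retrieved chunks are deduplicated at answer time. A minimal sketch, where the `Chunk` class and `format_sources` helper are illustrative names, not the repository's actual API:

```python
# Hedged sketch of how a "SOURCES: Pages ..." line can be produced:
# every chunk keeps its source page, and the pages of the retrieved
# chunks are deduplicated and sorted for the citation line.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    page: int   # page of the PDF this chunk was extracted from

def format_sources(retrieved: list[Chunk]) -> str:
    pages = sorted({c.page for c in retrieved})
    return "SOURCES: Pages " + ", ".join(map(str, pages))

hits = [Chunk("...self-attention...", 3),
        Chunk("...recurrence...", 2),
        Chunk("...parallelizable...", 3)]
print(format_sources(hits))  # SOURCES: Pages 2, 3
```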
Pasha Ahmadi, M.Sc. Student in Computer Engineering