📄 PDF RAG System
An end-to-end Retrieval-Augmented Generation (RAG) system that lets users upload PDF documents and ask questions about their content, with answers generated by a Large Language Model (LLM).
🚀 Features
📄 Upload and process PDF documents
🧠 Extract text, tables, and images
✂️ Intelligent text chunking for better context
🔍 Semantic search using vector embeddings
🤖 Context-aware answers using an LLM
📊 Table-aware understanding
📚 Source traceability with metadata
🌐 Interactive UI with Streamlit
🧱 Architecture
User → Upload PDF → Process → Store Embeddings → Ask Question → Get Answer
🧠 How It Works
🧾 Ingestion Phase
Upload PDF
Extract:
Text (PyMuPDF)
Tables (pdfplumber)
Chunk the content
Generate embeddings
Store in vector database
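The chunking step can be sketched as a sliding character window with overlap, so that a sentence cut at one chunk boundary still appears whole in the neighbouring chunk. This is a minimal illustration only: the function name `chunk_page`, the default sizes, and the metadata keys are assumptions, and the project's own chunking.py may work differently (for example by splitting on sentences or using a LangChain splitter).

```python
def chunk_page(text, page, chunk_size=500, overlap=50):
    """Split one page of text into overlapping character windows.

    Hypothetical helper: each chunk keeps its page number and start
    offset so an answer can later be traced back to its source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "page": page,
            "start": start,
        })
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A larger overlap tends to improve recall at the cost of storing more near-duplicate text, so the two sizes are worth tuning against real documents.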
💬 Query Phase
User enters a question
Convert query to embedding
Retrieve relevant chunks
Pass context to LLM
Generate answer
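The query steps above can be illustrated with a small, dependency-free sketch: embed the question, rank stored chunks by cosine similarity, and assemble a prompt that restricts the LLM to the retrieved context. The `embed` function here is a toy word-count vector standing in for a real embedding model, and the in-memory list stands in for ChromaDB; all function names and the prompt wording are assumptions, not the project's actual code.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Assemble a grounded prompt; page tags keep answers traceable."""
    context = "\n\n".join(f"[page {c['page']}] {c['text']}" for c in context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Instructing the model to answer only from the supplied context, and to say when the context is insufficient, is one common way to reduce hallucination; the resulting prompt string would then be sent to the LLM.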
🛠️ Tech Stack
Python
Streamlit
PyMuPDF
pdfplumber
LangChain
ChromaDB
Gemini API
📂 Project Structure
pdf_rag_project/
│
├── app.py
├── parser.py
├── tables.py
├── chunking.py
├── embeddings.py
├── vectordb.py
├── query.py
├── llm.py
│
├── data/
├── requirements.txt
└── .env
⚙️ Setup Instructions
- Clone the repo
git clone https://github.com/your-username/pdf-rag-system.git
cd pdf-rag-system
- Install dependencies
pip install -r requirements.txt
- Add API Key
Create a .env file:
GOOGLE_API_KEY=your_api_key_here
GOOGLE_API_KEY=your_api_key_here
- Run the app
streamlit run app.py
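A missing or placeholder API key is a common cause of startup failures, so it can help to check for it early. The sketch below assumes the contents of .env have already been loaded into the process environment (for instance with python-dotenv, which this README does not confirm the project uses); the helper name `get_api_key` is hypothetical.

```python
import os


def get_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the API key from the environment, failing early if it is unset.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Reject both an empty value and the placeholder from the setup steps.
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before running the app"
        )
    return key
```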
🧪 Example Queries
What is this document about?
Summarize page 2
What does the table show?
Extracting tables from PDFs
Handling unstructured PDFs
Improving retrieval accuracy
Reducing hallucination
📌 Author
Prabhat Singh
GitHub: https://github.com/prabhat310-bit