prabhat310-bit/PDF-RAG-Project

📄 PDF RAG System

An end-to-end Retrieval-Augmented Generation (RAG) system that allows users to upload PDF documents and ask questions based on their content using Large Language Models (LLMs).

🚀 Features

📄 Upload and process PDF documents
🧠 Extract text, tables, and images
✂️ Intelligent text chunking for better context
🔍 Semantic search using vector embeddings
🤖 Context-aware answers using an LLM
📊 Table-aware understanding
📚 Source traceability with metadata
🌐 Interactive UI with Streamlit

🧱 Architecture

User → Upload PDF → Process → Store Embeddings → Ask Question → Get Answer

🧠 How It Works
🧾 Ingestion Phase
  1. Upload PDF
  2. Extract:
    Text (PyMuPDF)
    Tables (pdfplumber)
  3. Chunk the content
  4. Generate embeddings
  5. Store in vector database
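
The chunking step above can be sketched roughly as follows. This is a minimal illustration rather than the code in chunking.py: it assumes the extracted text arrives as one string per page and uses fixed-size character windows with overlap. The `Chunk` class, the `chunk_pages` function, and the default sizes are hypothetical names and values chosen for the example. Keeping the page number on each chunk is one way to support source traceability later on.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int   # 1-based page number the text came from
    start: int  # character offset of the chunk within the page text


def chunk_pages(pages, chunk_size=500, overlap=50):
    """Split per-page text into overlapping character windows.

    `pages` is assumed to be a list of strings, one per PDF page.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        for start in range(0, len(page_text), step):
            piece = page_text[start:start + chunk_size]
            if piece.strip():  # skip windows that are only whitespace
                chunks.append(Chunk(piece, page_number, start))
            if start + chunk_size >= len(page_text):
                break  # this window already reached the end of the page
    return chunks
```

Overlap means a sentence cut at a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.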

💬 Query Phase
  1. User enters a question
  2. Convert query to embedding
  3. Retrieve relevant chunks
  4. Pass context to LLM
  5. Generate answer
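
To make the retrieval and prompting steps concrete, here is a small sketch in plain Python. It is not the project's query.py or llm.py: the project stores embeddings in ChromaDB and calls the Gemini API, while this example swaps in brute-force cosine similarity over an in-memory list and stops at building the prompt string. The function names and the prompt wording are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=3):
    """Return the k best (score, chunk_text) pairs.

    `index` is assumed to be a list of (vector, chunk_text) pairs.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for vec, text in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, retrieved):
    """Number the retrieved chunks and ask the model to stay within them."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, (_, text) in enumerate(retrieved, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Telling the model to answer only from the numbered context is one common way to limit hallucination, and the numbers give the answer something to cite.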

🛠️ Tech Stack

Python
Streamlit
PyMuPDF
pdfplumber
LangChain
ChromaDB
Gemini API

📂 Project Structure

pdf_rag_project/
├── app.py
├── parser.py
├── tables.py
├── chunking.py
├── embeddings.py
├── vectordb.py
├── query.py
├── llm.py
├── data/
├── requirements.txt
└── .env

⚙️ Setup Instructions

  1. Clone the repo
    git clone https://github.com/your-username/pdf-rag-system.git
    cd pdf-rag-system
  2. Install dependencies
    pip install -r requirements.txt
  3. Add API Key
    Create a .env file:
    GOOGLE_API_KEY=your_api_key_here
  4. Run the app
    streamlit run app.py
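
If the app fails at startup, a quick check that the key from step 3 is readable can help. The sketch below uses only the standard library and is not taken from the project, which may load .env differently. `read_env_file` and `get_api_key` are hypothetical helpers that handle only simple KEY=VALUE lines.

```python
import os


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GOOGLE_API_KEY")
    if not key and os.path.exists(path):
        key = read_env_file(path).get("GOOGLE_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the first LLM call.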

🧪 Example Queries
What is this document about?
Summarize page 2
What does the table show?

⚠️ Challenges & Learnings
Extracting tables from PDFs
Handling unstructured PDFs
Improving retrieval accuracy
Reducing hallucination

📌 Author

Prabhat Singh
GitHub: https://github.com/prabhat310-bit
