GitHub - prabhat310-bit/PDF-RAG-Project: This is an end-to-end RAG system that allows users to upload PDF documents and ask questions based on their content using gemini llm.

📄 PDF RAG System

An end-to-end Retrieval-Augmented Generation (RAG) system that lets users upload PDF documents and ask questions about their content, with answers generated by a Large Language Model (LLM) through the Gemini API.

🚀 Features

📄 Upload and process PDF documents
🧠 Extract text, tables, and images
✂️ Intelligent text chunking for better context
🔍 Semantic search using vector embeddings
🤖 Context-aware answers using LLM
📊 Table-aware understanding
📚 Source traceability with metadata
🌐 Interactive UI with Streamlit

🧱 Architecture

User → Upload PDF → Process → Store Embeddings → Ask Question → Get Answer

🧠 How It Works

🧾 Ingestion Phase
1. Upload PDF
2. Extract:
   - Text (PyMuPDF)
   - Tables (pdfplumber)
3. Chunk the content
4. Generate embeddings
5. Store in vector database
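
The extraction and chunking steps of the ingestion phase could look roughly like the sketch below. It is an illustration under assumptions, not the code in parser.py, tables.py, or chunking.py: the function names (extract_pages, table_to_text, chunk_text), the fixed-size character chunking, and the default sizes are all hypothetical. The project may well use a LangChain text splitter instead. The only library calls used are PyMuPDF's page.get_text() and pdfplumber's page.extract_tables().

```python
from typing import Iterator, List, Optional


def table_to_text(table: List[List[Optional[str]]]) -> str:
    """Flatten one extracted table into pipe-separated rows; empty cells become ''."""
    return "\n".join(
        " | ".join((cell or "").strip() for cell in row) for row in table
    )


def extract_pages(pdf_path: str) -> Iterator[dict]:
    """Yield one record per page: page number, plain text, flattened tables."""
    # Imported lazily so the pure-Python helpers work without the PDF libraries.
    import fitz  # PyMuPDF
    import pdfplumber

    with fitz.open(pdf_path) as doc, pdfplumber.open(pdf_path) as pdf:
        for page_no, (mu_page, pl_page) in enumerate(zip(doc, pdf.pages), start=1):
            yield {
                "page": page_no,  # kept as metadata for source traceability
                "text": mu_page.get_text(),
                "tables": [table_to_text(t) for t in pl_page.extract_tables()],
            }


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks. Keeping the page number next to each chunk is what makes it possible to cite sources later.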

💬 Query Phase
1. User enters a question
2. Convert query to embedding
3. Retrieve relevant chunks
4. Pass context to LLM
5. Generate answer
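
The retrieval and prompting steps of the query phase can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: the bag-of-words embed function is a toy stand-in for the real embedding model, the in-memory list stands in for ChromaDB, and the names retrieve and build_prompt and the prompt wording are hypothetical. The actual project would send the resulting prompt to the Gemini API.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: List[Dict], k: int = 2) -> List[Dict]:
    """Return the k chunks whose vectors are most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: List[Dict]) -> str:
    """Assemble a prompt that restricts the LLM to the retrieved context."""
    context = "\n\n".join(f"[page {c['page']}] {c['text']}" for c in context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    {"page": 1, "text": "The invoice total is 420 dollars."},
    {"page": 2, "text": "Shipping takes five business days."},
    {"page": 3, "text": "Returns are accepted within 30 days."},
]
top = retrieve("What is the invoice total?", chunks, k=1)
prompt = build_prompt("What is the invoice total?", top)
```

Tagging each context block with its page number lets the answer point back to its source, and the instruction to admit when the context is insufficient is one common way to discourage hallucinated answers.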

🛠️ Tech Stack

Python
Streamlit
PyMuPDF
pdfplumber
LangChain
ChromaDB
Gemini API

📂 Project Structure

pdf_rag_project/
│
├── app.py
├── parser.py
├── tables.py
├── chunking.py
├── embeddings.py
├── vectordb.py
├── query.py
├── llm.py
│
├── data/
├── requirements.txt
└── .env

⚙️ Setup Instructions

1. Clone the repo

    git clone https://github.com/prabhat310-bit/PDF-RAG-Project.git
    cd PDF-RAG-Project

2. Install dependencies

    pip install -r requirements.txt

3. Add your API key

    Create a .env file in the project root containing:

    GOOGLE_API_KEY=your_api_key_here

4. Run the app

    streamlit run app.py
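
If the app starts but every question fails, a missing or placeholder API key is a likely cause. The sketch below shows one way to read the .env file and fail early, using only the standard library. The helper names load_env_file and require_api_key are hypothetical and not part of this repository. In practice many projects use the python-dotenv package's load_dotenv() for the same job, and whether this project does so is not stated here.

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines and copy them into os.environ without overwriting."""
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip("'\"")
    for key, value in loaded.items():
        os.environ.setdefault(key, value)  # real environment variables win
    return loaded


def require_api_key() -> str:
    """Return GOOGLE_API_KEY or raise if it is missing or still the placeholder."""
    key = os.environ.get("GOOGLE_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("Set GOOGLE_API_KEY in .env before running the app")
    return key
```

Calling load_env_file() followed by require_api_key() at the top of app.py would surface a configuration problem immediately instead of at the first LLM call.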

🧪 Example Queries
What is this document about?
Summarize page 2
What does the table show?

⚠️ Challenges & Learnings
Extracting tables from PDFs
Handling unstructured PDFs
Improving retrieval accuracy
Reducing hallucination

📌 Author

Prabhat Singh
GitHub: https://github.com/prabhat310-bit
