147 changes: 147 additions & 0 deletions pageindex.egg-info/PKG-INFO
@@ -0,0 +1,147 @@
Metadata-Version: 2.4
Name: pageindex
Version: 0.1.0
Summary: Vectorless, reasoning-based RAG indexer
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai==1.101.0
Requires-Dist: pymupdf==1.26.4
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: python-dotenv==1.1.0
Requires-Dist: tiktoken==0.11.0
Requires-Dist: pyyaml==6.0.2
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Dynamic: license-file

<div align="center">

<a href="https://vectify.ai/pageindex" target="_blank">
<img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
</a>

<br/>
<br/>

<p align="center">
<a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

# PageIndex: Reasoning-Based Vectorless RAG

<p align="center"><b>Reasoning-native RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>

<h4 align="center">
<a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
<a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
<a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
<a href="https://docs.pageindex.ai">📚 Documentation</a>&nbsp; • &nbsp;
<a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
<a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact Us</a>&nbsp;
</h4>

</div>

<details open>
<summary><h3>📢 Latest Updates</h3></summary>

**🔥 Releases:**
- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like agentic platform for document analysis, built for professional long-context documents. Also available via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).

**📝 Articles:**
- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context tree index* that empowers LLMs to perform *reasoning-based, human-like retrieval* over long documents without a Vector DB or chunking.

**🧪 Cookbooks:**
- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, practical example of reasoning-based RAG using PageIndex. No vectors, no chunks, and human-like retrieval.
- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Vision-only RAG without OCR; a reasoning-native approach that acts directly over PDF page images.
</details>

---

# 📑 Introduction to PageIndex

Tired of poor retrieval accuracy with Vector DBs on long, professional documents? Traditional vector RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we need for retrieval is **relevance**, and relevance requires **reasoning**. When dealing with professional documents where domain knowledge and multi-step reasoning matter, similarity search often fails.

Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a reasoning-based, **Vectorless RAG** framework that builds a **hierarchical tree index** from long documents and prompts the LLM to **reason over this index** for **agentic, context-aware retrieval**.

---

# ⚙️ Package Usage

### 1. Install Dependencies

```bash
pip3 install --upgrade -r requirements.txt
pip3 install -e .
```

### 2. Provide your OpenAI API Key

Create a `.env` file in the root directory and add your API key:

```bash
OPENAI_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
pageindex --pdf_path /path/to/your/document.pdf
```

---

# 💻 Developer Guide

This section is for developers contributing to `PageIndex` or integrating it as a library.

### Development Setup

1. **Clone the repository:**
```bash
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
```

2. **Install development dependencies:**
```bash
pip install -e ".[dev]"
# Or, if the package itself is already installed, add only the test tools:
pip install pytest pytest-asyncio
```

3. **Run Tests:**
We use `pytest` for unit and integration testing.
```bash
pytest
```

### Project Structure

The project has been refactored into a modular library structure under `pageindex`.

- `pageindex/core/`: Core logic modules.
  - `llm.py`: LLM interactions and token counting.
  - `pdf.py`: PDF text extraction and processing.
  - `tree.py`: Tree data structure manipulation and recursion.
  - `logging.py`: Custom logging utilities.
- `pageindex/config.py`: Configuration loading and validation (Pydantic).
- `pageindex/cli.py`: Command Line Interface entry point.
- `pageindex/utils.py`: Facade for backward compatibility.

### Configuration

Configuration is loaded and validated by `pageindex/config.py`. Default settings live in `config.yaml`; you can override them with the `PAGEINDEX_CONFIG` environment variable or with CLI arguments.
Validation uses Pydantic, which checks the type of each setting.
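
A minimal sketch of how layered settings like these could be resolved. It rests on assumptions the README does not confirm: that `PAGEINDEX_CONFIG` holds the path to an alternative config file, and that CLI values take precedence over file values. The field names (`model`, `max_pages_per_node`) and the helpers `config_path` and `resolve_settings` are hypothetical, and a standard-library dataclass stands in for the Pydantic model so the example runs without third-party packages.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings; the field names are illustrative only."""

    model: str = "example-model"
    max_pages_per_node: int = 10

    def __post_init__(self) -> None:
        # Hand-rolled type check standing in for Pydantic validation.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )


def config_path(env: Mapping[str, str], default: str = "config.yaml") -> str:
    """Pick the config file, letting PAGEINDEX_CONFIG override the default (assumed)."""
    return env.get("PAGEINDEX_CONFIG") or default


def resolve_settings(
    file_values: Mapping[str, Any], cli_values: Mapping[str, Any]
) -> Settings:
    """Merge file values with CLI values; CLI options left unset (None) are ignored."""
    merged = dict(file_values)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    return Settings(**merged)
```

In this sketch, `config_path(os.environ)` would select the file to parse, and the parsed mapping would then be passed to `resolve_settings` together with the parsed CLI options.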

For the API reference, see [API_REFERENCE.md](docs/API_REFERENCE.md).

---

# ⭐ Support Us

Give us a star 🌟 if you like the project. Thank you!
28 changes: 28 additions & 0 deletions pageindex.egg-info/SOURCES.txt
@@ -0,0 +1,28 @@
LICENSE
README.md
pyproject.toml
pageindex/__init__.py
pageindex/cli.py
pageindex/config.py
pageindex/page_index.py
pageindex/page_index_md.py
pageindex/utils.py
pageindex.egg-info/PKG-INFO
pageindex.egg-info/SOURCES.txt
pageindex.egg-info/dependency_links.txt
pageindex.egg-info/entry_points.txt
pageindex.egg-info/requires.txt
pageindex.egg-info/top_level.txt
pageindex/core/__init__.py
pageindex/core/llm.py
pageindex/core/logging.py
pageindex/core/pdf.py
pageindex/core/tree.py
scripts/analyze_notebooks.py
scripts/local_client_adapter.py
scripts/refactor_notebooks_logic.py
scripts/verify_adapter.py
tests/conftest.py
tests/test_config.py
tests/test_llm.py
tests/test_tree.py
1 change: 1 addition & 0 deletions pageindex.egg-info/dependency_links.txt
@@ -0,0 +1 @@

2 changes: 2 additions & 0 deletions pageindex.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
[console_scripts]
pageindex = pageindex.cli:main
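
The `console_scripts` entry above means that installing the package creates a `pageindex` command that imports `pageindex.cli` and calls its `main` function. The real `pageindex/cli.py` is not shown in this diff, so the following is only a sketch of what such a function could look like, assuming nothing beyond the `--pdf_path` option from the README; `build_parser` and the body of `main` are hypothetical.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser; only --pdf_path is taken from the README."""
    parser = argparse.ArgumentParser(prog="pageindex")
    parser.add_argument("--pdf_path", required=True, help="Path to the PDF to index")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point; argv=None makes argparse read the process arguments."""
    args = build_parser().parse_args(argv)
    # Placeholder for the actual indexing call, which this diff does not show.
    print(f"Would index: {args.pdf_path}")
    return 0
```

Accepting an optional `argv` keeps the function callable from tests, for example `main(["--pdf_path", "doc.pdf"])`, while the installed command still reads its arguments from the shell.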
11 changes: 11 additions & 0 deletions pageindex.egg-info/requires.txt
@@ -0,0 +1,11 @@
openai==1.101.0
pymupdf==1.26.4
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.11.0
pyyaml==6.0.2
pydantic>=2.0

[dev]
pytest>=7.4.0
pytest-asyncio>=0.21.0
6 changes: 6 additions & 0 deletions pageindex.egg-info/top_level.txt
@@ -0,0 +1,6 @@
data
docs
notebooks
pageindex
scripts
tests
Comment on lines +1 to +6

Copilot AI Mar 26, 2026


These *.egg-info files look like generated packaging artifacts from a local build/install (and top_level.txt even lists non-importable directories like data/, docs/, tests/). Committing them is brittle and often incorrect across environments. Prefer removing pageindex.egg-info/ from version control and adding it to .gitignore, and manage dependencies/metadata via a real packaging config (e.g., pyproject.toml / setup.cfg).

Suggested change
- data
- docs
- notebooks
- pageindex
- scripts
- tests
+ pageindex
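
One way to check which of the listed names are actually importable, sketched under the assumption that a name counts only if it is a directory containing `__init__.py` (this deliberately ignores namespace packages and single-file modules). The helper name `importable_top_levels` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def importable_top_levels(repo_root: Path, names: Iterable[str]) -> List[str]:
    """Keep only the names that are regular packages directly under repo_root."""
    return [
        name
        for name in names
        if (repo_root / name / "__init__.py").is_file()
    ]
```

The names could be read with `Path("pageindex.egg-info/top_level.txt").read_text().split()` and passed in together with the repository root; any entry that is filtered out is a candidate for exclusion in the packaging config.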

Empty file added pageindex/core/__init__.py
Empty file.