
Feature/rag #29

Merged

filippostanghellini merged 7 commits into main from feature/rag
Mar 12, 2026

Conversation

@filippostanghellini (Owner)

Description

Add a fully local RAG (Retrieval-Augmented Generation) system to DocFinder, enabling users to ask questions about
their documents using a local LLM. No data leaves the user's machine. The system automatically selects the best
model based on available RAM, downloads it once, and uses GPU acceleration when available (Metal on Apple
Silicon, CUDA on NVIDIA). Context retrieval is page-aware: PDF uses real pages, Markdown splits on headings, Word
groups paragraphs, and plain text uses virtual pages.
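The RAM-based selection described above can be sketched as a simple tier table. The tier names come from this PR; the RAM thresholds below are illustrative assumptions, not DocFinder's actual cutoffs.

```python
# Hedged sketch of RAM-based model tier selection. Thresholds are
# assumptions for illustration; only the tier names come from the PR.
MODEL_TIERS = [
    (24, "Qwen2.5-7B"),    # plenty of RAM: largest tier
    (12, "Qwen2.5-3B"),    # mid-range machines
    (0, "Qwen2.5-1.5B"),   # low-RAM fallback
]


def select_model(total_ram_gb: float) -> str:
    """Return the first tier whose RAM threshold the machine meets."""
    for min_ram_gb, name in MODEL_TIERS:
        if total_ram_gb >= min_ram_gb:
            return name
    return MODEL_TIERS[-1][1]


print(select_model(32.0))  # Qwen2.5-7B
print(select_model(8.0))   # Qwen2.5-1.5B
```

Ordering the tiers from largest to smallest means the first matching threshold wins, so the biggest model that fits is always picked.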

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🎨 Code style update (formatting, renaming)
  • ♻️ Code refactor (no functional changes)
  • ⚡ Performance improvement
  • ✅ Test update
  • 🔧 Configuration change
  • 🏗️ Build/CI update

Changes Made

  • New src/docfinder/rag/ module — llm.py handles model selection (Qwen2.5-7B/3B/1.5B based on RAM), GGUF download
    via HuggingFace Hub, GPU detection (Metal/CUDA/CPU), and llama-cpp-python inference wrapper. engine.py provides
    RAGEngine class coordinating Searcher + LocalLLM with token-budget-aware context assembly.
  • Page-aware context retrieval — new get_context_by_page() in storage.py retrieves chunks by page/section and
    expands symmetrically to adjacent pages until the token budget is filled. Falls back to get_context_window() (±10
    chunks) for documents indexed without page metadata.
  • Page-aware chunking pipeline — chunk_text_stream_paged() in utils/text.py propagates page provenance through
    the chunking process. Each format uses its natural structure: PDF real pages, Markdown headings, Word paragraph
    groups (10), plain text virtual pages (~3000 chars).
  • RAG backend endpoints — /rag/models (list available models with recommended flag), /rag/download +
    /rag/download/status (background download with real-time byte-level progress via HF tqdm patching), /rag/chat
    (question answering over page-aware context window).
  • RAG Settings UI — "AI Chat (RAG)" section in Settings with enable/disable toggle, hardware detection, three
    model cards with size/RAM/recommended badge, real-time download progress bar that persists across tab switches.
  • Chat panel UI — slide-up panel from bottom-right on search results, with conversation history, auto-focus,
    Enter-to-send. Chat button conditionally visible only when RAG is enabled and model is downloaded.
  • Version bump to 2.0.0 across pyproject.toml, __init__.py, app.py, and CHANGELOG.md.
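The plain-text "virtual page" scheme mentioned above (~3000 characters per page) can be sketched as a fixed-size windowing pass. The function name and exact split are assumptions for illustration; the real pipeline feeds pages like these into chunk_text_stream_paged().

```python
from typing import Iterator, Tuple

# Hedged sketch of virtual paging for plain text (~3000 chars per page).
VIRTUAL_PAGE_CHARS = 3000


def iter_virtual_pages(text: str, page_chars: int = VIRTUAL_PAGE_CHARS) -> Iterator[Tuple[int, str]]:
    """Yield 1-based (page_number, text) pairs in fixed-size character windows."""
    for start in range(0, len(text), page_chars):
        yield start // page_chars + 1, text[start : start + page_chars]


pages = list(iter_virtual_pages("x" * 7000))
print([(n, len(t)) for n, t in pages])  # [(1, 3000), (2, 3000), (3, 1000)]
```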

Testing

  • Existing tests pass locally (198 passed, 1 skipped)
  • Added new tests for the changes (31 tests in test_rag.py)
  • Manual testing performed

Test details

  • TestGetContextWindow (8 tests) — fixed-window retrieval, clamping, ordering, edge cases
  • TestGetContextByPage (7 tests) — page-based retrieval, expansion, max_chars budget, ordering
  • TestChunkTextStreamPaged (3 tests) — page provenance through chunking pipeline
  • TestTxtPaged (3 tests) — plain text virtual paging
  • TestMdPaged (3 tests) — Markdown heading-based sections
  • TestDocxPaged (1 test) — Word paragraph grouping
  • TestSelectModel (5 tests) — RAM-based model tier selection
  • TestRAGEngineContextAssembly (2 tests) — context trimming and ordering
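The fixed-window fallback exercised by TestGetContextWindow (±10 chunks, clamped at document boundaries) can be sketched as follows. The signature is an assumption: the real get_context_window() works against the index store, not a plain id list.

```python
from typing import List

# Hedged sketch of a ±radius fixed-window retrieval with boundary clamping.
def context_window(chunk_ids: List[int], center: int, radius: int = 10) -> List[int]:
    """Return chunk ids within ``radius`` of ``center``, clamped to the
    document's bounds and returned in ascending order."""
    lo = max(center - radius, min(chunk_ids))
    hi = min(center + radius, max(chunk_ids))
    return [cid for cid in sorted(chunk_ids) if lo <= cid <= hi]


ids = list(range(30))  # a document with 30 chunks
print(context_window(ids, center=2))   # clamped at the start: 0..12
print(context_window(ids, center=15))  # full window: 5..25
```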

filippostanghellini and others added 5 commits March 11, 2026 23:26
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The model is recommended based on the computer's specs; the context is optimized for a trade-off between quality and efficiency.
Copilot AI review requested due to automatic review settings March 12, 2026 16:37
@github-actions

github-actions bot commented Mar 12, 2026

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 1 package(s) with unknown licenses.
See the Details below.

License Issues

pyproject.toml

| Package | Version | License | Issue Type |
| --- | --- | --- | --- |
| python-docx | >= 1.1.0 | Null | Unknown License |

OpenSSF Scorecard

| Package | Version | Score | Details |
| --- | --- | --- | --- |
| pip/python-docx | >= 1.1.0 | Unknown | Unknown |

Scanned Files

  • pyproject.toml

@codecov-commenter

codecov-commenter commented Mar 12, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 32.65306% with 594 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/docfinder/gui.py | 0.00% | 241 Missing ⚠️ |
| src/docfinder/web/app.py | 16.88% | 192 Missing ⚠️ |
| src/docfinder/ingestion/pdf_loader.py | 57.51% | 65 Missing ⚠️ |
| src/docfinder/rag/llm.py | 37.03% | 51 Missing ⚠️ |
| src/docfinder/rag/engine.py | 57.64% | 36 Missing ⚠️ |
| src/docfinder/index/indexer.py | 75.00% | 4 Missing ⚠️ |
| src/docfinder/utils/files.py | 75.00% | 2 Missing ⚠️ |
| src/docfinder/index/storage.py | 97.22% | 1 Missing ⚠️ |
| src/docfinder/settings.py | 0.00% | 1 Missing ⚠️ |
| src/docfinder/web/frontend.py | 83.33% | 1 Missing ⚠️ |


Copilot AI left a comment

Pull request overview

This PR adds a fully local RAG (Retrieval-Augmented Generation) capability to DocFinder and expands indexing beyond PDFs (TXT/MD/DOCX), with supporting backend endpoints, UI updates, and test coverage.

Changes:

  • Introduces docfinder.rag (local LLM selection/download + a RAG orchestration engine) and new RAG HTTP endpoints (/rag/models, /rag/download, /rag/chat).
  • Adds page-aware chunking + retrieval (chunk_text_stream_paged, get_context_by_page, get_context_window) to support better context assembly.
  • Expands ingestion/indexing and UX to support multi-format documents and adds a Spotlight-style quick-search panel on macOS.

Reviewed changes

Copilot reviewed 26 out of 28 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| tests/test_rag.py | New tests for context window/page retrieval, page-aware chunking, model selection, and RAG engine assembly. |
| tests/test_pdf_loader.py | Updates mocks to use the new paged text iterator for PDF chunk building. |
| tests/test_indexer.py | Updates patch points from PDF-only discovery to multi-document discovery. |
| tests/test_imports.py | Updates version assertion to 2.0.0. |
| tests/test_frontend.py | Updates frontend template loader signature to accept a template name. |
| tests/test_cli.py | Updates CLI messaging from “PDFs” to “supported documents”. |
| src/docfinder/web/templates/spotlight.html | Adds dedicated Spotlight panel HTML/JS for quick search. |
| src/docfinder/web/templates/index.html | Major UI redesign + adds in-app Spotlight overlay, indexing warnings, and RAG settings + chat panel. |
| src/docfinder/web/frontend.py | Template loader now supports multiple templates and serves /spotlight. |
| src/docfinder/web/app.py | Adds RAG endpoints, system RAM endpoint, indexing scan endpoint, exclude-path support, and embed batch sizing by RAM. |
| src/docfinder/utils/text.py | Adds chunk_text_stream_paged() for page-provenance chunking. |
| src/docfinder/utils/files.py | Adds supported extensions + iter_document_paths(); keeps iter_pdf_paths() for compatibility. |
| src/docfinder/settings.py | Changes default hotkey to <alt>+d. |
| src/docfinder/rag/llm.py | New local LLM management: model tiers, RAM-based selection, download, and llama-cpp wrapper. |
| src/docfinder/rag/engine.py | New RAG engine coordinating search + context assembly + generation. |
| src/docfinder/rag/__init__.py | Adds RAG package init. |
| src/docfinder/ingestion/pdf_loader.py | Generalizes ingestion to PDF/TXT/MD/DOCX; adds paged iterators and stores page in chunk metadata. |
| src/docfinder/index/storage.py | Search now returns document_id; adds get_context_window() + get_context_by_page(). |
| src/docfinder/index/indexer.py | Generalizes indexer to index supported docs; adds exclusions and embedding batch-size override support. |
| src/docfinder/gui.py | Adds native macOS Spotlight NSPanel + CGEventTap-based global hotkey handling; integrates with web endpoints. |
| src/docfinder/embedding/encoder.py | Adds optional batch_size override to EmbeddingModel.embed(). |
| src/docfinder/cli.py | Updates CLI to index supported document types instead of PDFs only. |
| src/docfinder/__init__.py | Bumps package version to 2.0.0. |
| pyproject.toml | Bumps version + adds python-docx dependency and [rag] extra for llama-cpp-python. |
| README.md | Updates messaging and demo media for multi-format + new UX. |
| CLAUDE.md | Adds repo command/architecture guidance for Claude Code tooling. |
| CHANGELOG.md | Adds 2.0.0 release notes and updates compare links. |


```python
        "error": None,
    }
```
Copilot AI Mar 12, 2026

RAG globals are accessed/modified from multiple threads without synchronization: _rag_llm_lock is defined but never used, and _rag_download is mutated inside _load_rag_llm() (run via asyncio.to_thread) while being read by /rag/download/status. This can lead to inconsistent status/progress and makes future changes riskier. Use _rag_llm_lock (or a dedicated lock) to guard updates to _rag_llm and _rag_download (or replace _rag_download with an immutable snapshot updated atomically).

Suggested change:

```python
def _get_rag_download_state() -> dict[str, Any]:
    """Return a thread-safe snapshot of the RAG download state."""
    with _rag_llm_lock:
        # Return a shallow copy so callers cannot mutate the shared dict
        return dict(_rag_download)
```

Copilot uses AI. Check for mistakes.
Comment on lines +288 to +309:

```python
    # Collect pages: start with center_page, then expand symmetrically
    collected: List[dict] = []
    total_chars = 0

    # Add center page first
    for c in chunks_with_page:
        if c["page"] == center_page:
            collected.append(c)
            total_chars += len(c["text"])

    # Expand to adjacent pages
    expand = 1
    while total_chars < max_chars:
        added = False
        for offset_page in (center_page - expand, center_page + expand):
            if offset_page < 0:
                continue
            for c in chunks_with_page:
                if c["page"] == offset_page and total_chars < max_chars:
                    collected.append(c)
                    total_chars += len(c["text"])
                    added = True
```
Copilot AI Mar 12, 2026

get_context_by_page() does repeated full scans of chunks_with_page inside the expansion loop (for offset_page ...: for c in chunks_with_page:). For documents with many chunks/pages this becomes O(n * pages_expanded) and can be noticeably slow in RAG. Consider pre-grouping chunks by page (e.g., page -> [chunks]) once, then appending from the adjacent pages without re-scanning the entire chunk list each time (and consider skipping page 0 by checking offset_page <= 0).

Suggested change (replacing the collection loop from the commented lines with a pre-grouped version):

```python
    # Group chunks by page for efficient access during expansion
    page_to_chunks = {}
    for c in chunks_with_page:
        page_to_chunks.setdefault(c["page"], []).append(c)

    # Collect pages: start with center_page, then expand symmetrically
    collected: List[dict] = []
    total_chars = 0

    # Add center page first
    for c in page_to_chunks.get(center_page, []):
        collected.append(c)
        total_chars += len(c["text"])

    # Expand to adjacent pages
    expand = 1
    while total_chars < max_chars:
        added = False
        for offset_page in (center_page - expand, center_page + expand):
            # Skip non-positive pages; page numbers are expected to be 1-based
            if offset_page <= 0:
                continue
            for c in page_to_chunks.get(offset_page, []):
                if total_chars >= max_chars:
                    break
                collected.append(c)
                total_chars += len(c["text"])
                added = True
```


```python
@app.post("/index/scan")
async def scan_index_paths(payload: ScanPayload) -> dict[str, Any]:
    """Scan paths for PDFs and return file stats without indexing."""
```
Copilot AI Mar 12, 2026

scan_index_paths() docstring says it scans paths for PDFs, but the implementation uses iter_document_paths() and returns counts by extension for multiple document types. Update the docstring to match the current behavior (supported documents rather than PDFs only).

Suggested change:

```python
    """Scan the given paths for supported document files and return file stats (including counts by extension) without indexing."""
```

Comment on lines +212 to +217:

```python
        pages = [(1, "A" * 60), (2, "B" * 60)]
        results = list(chunk_text_stream_paged(pages, max_chars=100, overlap=0))

        assert len(results) >= 1
        # First chunk starts on page 1
        assert results[0][1] == 1
```
Copilot AI Mar 12, 2026

test_chunk_spanning_pages only asserts the first chunk’s page number. Given the intended contract (“page_number is the page that contributed the start of the chunk”), add an assertion for the next yielded chunk when overlap=0 (it should start on page 2 in this setup). This would catch incorrect page provenance when chunk_text_stream_paged() slices the buffer past the end of the first page.

Suggested change:

```python
        pages = [(1, "A" * 60), (2, "B" * 60)]
        results = list(chunk_text_stream_paged(pages, max_chars=100, overlap=0))

        # We expect exactly two chunks: 100 chars, then 20 chars.
        assert len(results) == 2
        # First chunk starts on page 1
        assert results[0][1] == 1
        # Second chunk starts on page 2
        assert results[1][1] == 2
```


```python
    @staticmethod
    def _assemble_context_text(chunks: List[dict], results: List[SearchResult]) -> str:
        """Join chunk texts, trimming symmetrically if the total exceeds the budget."""
```
Copilot AI Mar 12, 2026

_assemble_context_text() docstring says it “trims symmetrically”, but the implementation truncates by stopping once the global character budget is reached while iterating documents/chunks in order. Either adjust the docstring to match the actual behavior, or implement true symmetric trimming around the highest-relevance chunks if that’s the intent.

Suggested change:

```python
        """Join chunk texts into a single context string, truncating once the
        global character budget is reached.

        Chunks are grouped by document path and ordered by their chunk index.
        Text is appended in that order until ``_MAX_CONTEXT_CHARS`` is exceeded;
        any remaining chunks are omitted (prefix-style trimming, not symmetric)."""
```

Comment on lines +56 to +78:

```python
    # Track which page the start of the buffer came from
    buf_page = 0
    step = max(max_chars - overlap, 1)

    for page_num, part in pages:
        if not buffer:
            buf_page = page_num
        buffer += part
        while len(buffer) >= max_chars:
            yield buffer[:max_chars], buf_page
            buffer = buffer[step:]
            # After slicing, the start of the buffer has shifted.
            # The overlap region still belongs to the old page, so
            # we keep buf_page unchanged — it's the page of the
            # beginning of the chunk. It will be updated when new
            # text is appended from the next page.
            if not buffer:
                buf_page = page_num

    if buffer:
        yield buffer, buf_page
```

Copilot AI Mar 12, 2026

chunk_text_stream_paged() can assign the wrong page_number once a chunk spans pages and overlap is small/zero. After yielding a chunk, buffer = buffer[step:] may advance the buffer start into a later page, but buf_page is intentionally left unchanged, so subsequent yielded chunks can be labeled with the previous page even though they start on the new page. Consider tracking page boundaries in the buffer (e.g., buffer as a deque of (page,text) segments) and updating buf_page when slicing crosses into the next segment, or recomputing the start page based on consumed lengths.

Suggested change (replacing the commented lines with a segment-tracking implementation):

```python
    # Track segments of the buffer with their originating page numbers.
    # Each entry is (start_index, end_index, page_num) relative to the current buffer.
    segments: list[tuple[int, int, int]] = []
    step = max(max_chars - overlap, 1)

    for page_num, part in pages:
        if not part:
            continue
        # Append new text and record its segment range in the buffer.
        start = len(buffer)
        buffer += part
        end = len(buffer)
        segments.append((start, end, page_num))

        # Emit chunks while we have enough buffered text.
        while len(buffer) >= max_chars and segments:
            # The page for the start of the chunk is the page of buffer index 0,
            # which is always the page of the first remaining segment.
            current_page = segments[0][2]
            yield buffer[:max_chars], current_page
            # Advance the buffer by `step` characters and update segments accordingly.
            consume = step
            buffer = buffer[consume:]
            if not buffer:
                segments = []
                break
            new_segments: list[tuple[int, int, int]] = []
            for seg_start, seg_end, seg_page in segments:
                seg_start -= consume
                seg_end -= consume
                # Drop segments fully consumed by the shift.
                if seg_end <= 0:
                    continue
                # Clamp partially consumed segments to start at 0.
                if seg_start < 0:
                    seg_start = 0
                new_segments.append((seg_start, seg_end, seg_page))
            segments = new_segments

    if buffer and segments:
        # Remaining buffer is shorter than max_chars; its start page is the
        # page of the first segment.
        yield buffer, segments[0][2]
```

Comment on lines +627 to +631:

```python
            # Stop the run loop so the thread exits
            Quartz.CFRunLoopStop(
                Quartz.CFRunLoopGetMain()
            )
        except Exception:
```
Copilot AI Mar 12, 2026

GlobalHotkeyManager.stop() disables the CGEventTap but attempts to stop the run loop via CFRunLoopStop(CFRunLoopGetMain()), which does not stop the tap thread’s CFRunLoopRun() (it’s running on the tap thread’s current run loop). This can leave the tap thread running and may cause hotkey reloads to create multiple active taps. Store the tap thread’s run loop (e.g., in self._tap_runloop) inside _run_tap() and stop that run loop in stop() (or signal the thread and call CFRunLoopStop on the correct loop).

Suggested change:

```python
            # Stop the tap thread's run loop so the thread exits.
            # Prefer a stored tap run loop if available; otherwise fall back.
            tap_runloop = getattr(self, "_tap_runloop", None)
            if tap_runloop is not None:
                Quartz.CFRunLoopStop(tap_runloop)
            else:
                Quartz.CFRunLoopStop(Quartz.CFRunLoopGetMain())
        except Exception:
            # Swallowing exceptions here preserves existing behavior,
            # but avoids crashing if stopping the run loop fails.
```

```python
    hf_hub_download(
        repo_id=spec.repo_id,
        filename=spec.filename,
        local_dir=str(dest_dir),
```
Copilot AI Mar 12, 2026

hf_hub_download() here uses local_dir=... but does not set local_dir_use_symlinks=False (unlike ensure_model() in docfinder.rag.llm). On Windows (and some locked-down environments) HF’s default symlink behavior can fail or require elevated privileges. Consider passing local_dir_use_symlinks=False for consistency and portability.

Suggested change:

```python
        local_dir=str(dest_dir),
        local_dir_use_symlinks=False,
```

filippostanghellini merged commit e2f2337 into main Mar 12, 2026
21 checks passed
filippostanghellini deleted the feature/rag branch March 13, 2026 12:16