Indexing optimization, UI improvement and v1.2.0 release preparation#28
Indexing optimization, UI improvement and v1.2.0 release preparation#28filippostanghellini merged 6 commits intomainfrom
Conversation
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Pull request overview
This PR prepares the v1.2.0 release by improving indexing/search performance, adding desktop-oriented UX (progress polling, folder picker, global hotkey + settings), and refreshing the web UI styling and interactions.
Changes:
- Speed up indexing via batched SQLite inserts, larger embedding/index batches, and a singleton embedder warmed at app startup.
- Add async indexing jobs with
/index/status/{job_id}polling + corresponding UI progress bar. - Introduce persistent user settings (global hotkey) and desktop integrations (pywebview folder picker, hotkey reload, macOS dock icon fix, Windows icon-parameter compatibility).
Reviewed changes
Copilot reviewed 11 out of 15 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/docfinder/web/templates/index.html | Major UI refresh; adds progress polling, settings tab + hotkey capture modal, folder picker integration, and new JS flows. |
| src/docfinder/web/app.py | Adds embedder singleton + warm startup, async indexing job registry/status endpoint, and settings GET/POST APIs. |
| src/docfinder/settings.py | New persistent settings module storing user preferences (hotkey + enabled flag). |
| src/docfinder/index/storage.py | Speeds up chunk inserts with executemany() batching. |
| src/docfinder/index/indexer.py | Adds progress callback support and increases chunk batch size. |
| src/docfinder/gui.py | Adds global hotkey manager (pynput), desktop JS API, and macOS/Windows icon handling improvements. |
| src/docfinder/embedding/encoder.py | Raises embedding batch size and removes manual mini-batching/GC loop. |
| pyproject.toml | Bumps version to 1.2.0; adds GUI deps for hotkey + macOS Cocoa framework. |
| README.md | Streamlines docs and updates run-from-source instructions. |
| Makefile | Adds setup, run, and run-web convenience targets. |
| CHANGELOG.md | Adds v1.2.0 release notes and link updates. |
| images/index-documents.png, images/database-documents.png | Screenshot assets referenced by docs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Pre-load the embedding model at startup so the first request is instant | ||
| await asyncio.to_thread(_get_embedder) |
There was a problem hiding this comment.
The lifespan hook eagerly preloads the embedding model on startup. In the test suite, the first request is often an unpatched call (e.g., empty-query validation), which will still trigger startup and load the real SentenceTransformer model, making tests slow/flaky. Consider guarding preload behind an env flag (disabled in tests) or making preload optional so startup validation-only requests don't force model initialization.
| # Pre-load the embedding model at startup so the first request is instant | |
| await asyncio.to_thread(_get_embedder) | |
| # Optionally pre-load the embedding model at startup so the first request is instant. | |
| # Controlled by DOCFINDER_PRELOAD_EMBEDDINGS (default: enabled). Set to "0"/"false"/"no" to disable. | |
| preload_flag = os.getenv("DOCFINDER_PRELOAD_EMBEDDINGS", "1").lower() | |
| if preload_flag not in ("0", "false", "no"): | |
| await asyncio.to_thread(_get_embedder) |
| job_id = str(uuid.uuid4()) | ||
| job: dict[str, Any] = { | ||
| "id": job_id, | ||
| "status": "running", | ||
| "processed": 0, | ||
| "total": 0, | ||
| "current_file": "", | ||
| "stats": None, | ||
| "error": None, | ||
| } | ||
| _index_jobs[job_id] = job |
There was a problem hiding this comment.
_index_jobs is an unbounded global registry and completed/errored jobs are never evicted. Repeated indexing will grow memory usage over time. Consider removing jobs after a TTL, capping the dict size, or deleting entries once the frontend has fetched final stats.
| @app.get("/index/status/{job_id}") | ||
| async def get_index_status(job_id: str) -> dict[str, Any]: | ||
| """Poll the status of a running or completed indexing job.""" | ||
| job = _index_jobs.get(job_id) | ||
| if job is None: | ||
| raise HTTPException(status_code=404, detail="Job not found") | ||
| return job |
There was a problem hiding this comment.
New async indexing workflow adds /index/status/{job_id}, but there are no tests validating job creation, polling transitions (running→complete/error), or 404 behavior. Since tests/test_web_app.py already covers /index, add coverage for the status endpoint to ensure progress polling works and stays stable.
| const safePath = r.path.replace(/'/g, "\\'"); | ||
| const title = r.title || r.path.split('/').pop(); | ||
| const pct = Math.round(r.score * 100); |
There was a problem hiding this comment.
safePath only escapes single quotes before being embedded into an inline JS string. On Windows, paths contain backslashes (e.g., C:\docs\a.pdf), which will be interpreted as escape sequences inside the generated JS and can corrupt the path or break the handler. Prefer avoiding inline onclick entirely (attach a click listener and store the path in a data attribute), or at minimum escape backslashes/newlines and generate the argument via JSON.stringify.
| const interval = setInterval(async () => { | ||
| try { | ||
| const statusRes = await fetch(`${apiBase}/index/status/${jobId}`); | ||
| const job = await statusRes.json(); | ||
| if (job.status === 'running') { | ||
| indexStatus.innerHTML = renderProgress(job); | ||
| } else if (job.status === 'complete') { | ||
| clearInterval(interval); | ||
| resolve(job); | ||
| } else if (job.status === 'error') { | ||
| clearInterval(interval); | ||
| reject(new Error(job.error || 'Indexing failed')); | ||
| } | ||
| } catch (err) { clearInterval(interval); reject(err); } | ||
| }, 600); |
There was a problem hiding this comment.
Using setInterval with an async callback can cause overlapping in-flight fetches if a request takes longer than the interval (600ms), leading to out-of-order UI updates and unnecessary load. Prefer a recursive setTimeout that schedules the next poll only after the previous one finishes (or guard with an inFlight flag).
| const interval = setInterval(async () => { | |
| try { | |
| const statusRes = await fetch(`${apiBase}/index/status/${jobId}`); | |
| const job = await statusRes.json(); | |
| if (job.status === 'running') { | |
| indexStatus.innerHTML = renderProgress(job); | |
| } else if (job.status === 'complete') { | |
| clearInterval(interval); | |
| resolve(job); | |
| } else if (job.status === 'error') { | |
| clearInterval(interval); | |
| reject(new Error(job.error || 'Indexing failed')); | |
| } | |
| } catch (err) { clearInterval(interval); reject(err); } | |
| }, 600); | |
| const poll = async () => { | |
| try { | |
| const statusRes = await fetch(`${apiBase}/index/status/${jobId}`); | |
| const job = await statusRes.json(); | |
| if (job.status === 'running') { | |
| indexStatus.innerHTML = renderProgress(job); | |
| setTimeout(poll, 600); | |
| } else if (job.status === 'complete') { | |
| resolve(job); | |
| } else if (job.status === 'error') { | |
| reject(new Error(job.error || 'Indexing failed')); | |
| } else { | |
| // Unknown status: poll again conservatively | |
| setTimeout(poll, 600); | |
| } | |
| } catch (err) { | |
| reject(err); | |
| } | |
| }; | |
| poll(); |
| @app.get("/settings") | ||
| async def get_settings() -> dict: | ||
| """Return current user settings.""" | ||
| return load_settings() | ||
|
|
||
|
|
||
| @app.post("/settings") | ||
| async def update_settings(payload: SettingsPayload) -> dict: | ||
| """Persist updated settings and return the full settings dict.""" | ||
| current = load_settings() | ||
| if payload.hotkey is not None: | ||
| current["hotkey"] = payload.hotkey | ||
| if payload.hotkey_enabled is not None: | ||
| current["hotkey_enabled"] = payload.hotkey_enabled | ||
| _save_settings(current) | ||
| return current |
There was a problem hiding this comment.
New /settings endpoints were added, but there are existing comprehensive tests for docfinder.web.app and no tests currently cover settings load/update behavior. Add tests for GET/POST /settings (including partial updates) to prevent regressions in persistence/serialization.
| resultsMeta.style.display = 'none'; | ||
| resultsList.innerHTML = '<div class="loading"><div class="spinner"></div><span>Searching…</span></div>'; | ||
|
|
There was a problem hiding this comment.
resultsList is a <ul>, but loading/empty/error states are injected as a <div> via innerHTML. This produces invalid list markup and can confuse screen readers. Render these states as <li> elements (or move them to a separate container outside the <ul>).
| document.addEventListener('keydown', e => { | ||
| if ((e.metaKey || e.ctrlKey) && e.key === 'k') { | ||
| e.preventDefault(); | ||
| document.querySelector('[data-tab="search"]').click(); | ||
| setTimeout(() => document.getElementById('query').focus(), 60); | ||
| } | ||
| }); |
There was a problem hiding this comment.
The global ⌘K/Ctrl+K handler runs even while the hotkey capture modal is open. Because it's registered earlier than the capture modal keydown handler, pressing ⌘K during capture will switch tabs instead of recording the combo. Add a guard (e.g., if #hotkey-modal is active, ignore) or register the capture handler in the capture phase so it always wins.
| try { | ||
| const statusRes = await fetch(`${apiBase}/index/status/${jobId}`); | ||
| const job = await statusRes.json(); | ||
| if (job.status === 'running') { | ||
| indexStatus.innerHTML = renderProgress(job); | ||
| } else if (job.status === 'complete') { | ||
| clearInterval(interval); | ||
| resolve(job); |
There was a problem hiding this comment.
Progress polling doesn't check statusRes.ok. If the status endpoint returns 404/500 (or a non-job JSON like {detail: ...}), job.status will be undefined and the interval will keep running forever. Handle non-2xx responses by clearing the interval and surfacing an error to the user.
|
|
||
| // Render stats | ||
| const res = await fetch(`${apiBase}/documents`); | ||
| const data = await res.json(); |
There was a problem hiding this comment.
The Documents load path parses JSON without checking res.ok. If the server returns an error payload, the code will try to read data.stats and likely throw a generic JS exception, which then becomes a low-quality UI error. Check res.ok and raise using data.detail (similar to the search and delete flows).
| const data = await res.json(); | |
| const data = await res.json(); | |
| if (!res.ok) { | |
| throw new Error(data.detail || `Failed to load documents (status ${res.status})`); | |
| } |
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
No description provided.