Six focused quality / robustness fixes across the search -> dedup -> filter ->
enrich -> download pipeline.
Search quality
- ranking: add a query-relevance axis. Dedup merged each source's locally
relevance-sorted list and discarded that ordering, so ranking fell back to
recency + citation alone, surfacing off-topic-but-highly-cited papers above
on-topic ones. Now score = relevance (query-term overlap with title >>
abstract, dominant) + recency + damped log10(citations). keywords is optional
so single-paper callers keep the old behaviour.
- dedup: normalise the title hash (strip punctuation/whitespace) and drop the
arXiv version suffix, so "Attention Is All You Need" / "...need." and
2401.00001v1 / v2 collapse to one record instead of splitting.
Filter correctness
- min_citations was accepted (CLI/MCP/Query) and documented as a filter but
only ever applied by the semantic_scholar plugin. Enforce it in the pipeline
for every source; papers whose source reports no count are kept (an unknown
count is not treated as zero).
Concurrency robustness
- enrich_collection: cap simultaneous per-paper enrichments with a semaphore
(default 4) so a large collection doesn't fire one Anthropic call + one PDF
download per paper at once and trip the API rate limit.
- download_pdfs: same semaphore cap, so several sources handing back PDF URLs
on the same publisher CDN don't burst it into an IP block.
Cleanup
- pipeline._enrich_one: drop the redundant `except (ThesisAgentsError,
Exception)` (Exception already covers it).
Gates: 588 tests pass (+10), ruff + bandit clean.
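A minimal sketch of the three-term score described in this commit, to make the ordering effect concrete. The weights (10.0, 0.25, 0.5), the dict-shaped paper, and the function names are illustrative assumptions, not the values or signatures used in the repository.

```python
import math
import re


def _terms(text):
    """Lower-cased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(keywords, title, abstract=""):
    """Share of query terms matched; a title hit counts far more than an abstract hit."""
    query = _terms(" ".join(keywords))
    if not query:
        return 0.0
    in_title = query & _terms(title)
    in_abstract = (query & _terms(abstract)) - in_title
    # 0.25 is a hypothetical down-weight for abstract-only matches.
    return (len(in_title) + 0.25 * len(in_abstract)) / len(query)


def score(paper, keywords=None, this_year=2024):
    """relevance (dominant) + recency + damped log10(citations)."""
    rel = 0.0
    if keywords:  # keywords is optional: without it the old two-term ranking applies
        rel = relevance(keywords, paper.get("title", ""), paper.get("abstract", ""))
    year = paper.get("year")
    recency = 1.0 / (1.0 + max(0, this_year - year)) if year else 0.0
    citations = math.log10(1 + (paper.get("citations") or 0))
    # Hypothetical weights: relevance dominates, citations are damped.
    return 10.0 * rel + recency + 0.5 * citations
```

With these assumed weights, a paper matching one of two query terms in its title outranks an off-topic paper with 100,000 citations, while calling `score` without keywords reproduces the recency + citation ordering.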
Continuation of the workflow-quality pass.
- dedup: canonicalise the first-author surname so name-format differences
("Vaswani, Ashish" / "Ashish Vaswani" / "A. Vaswani") don't split one
DOI-less paper into duplicates — complements the title/arXiv-version canon.
- pipeline: enforce the year range for ALL sources after dedup/rank, not only
inside the source plugins. Scrape sources (scholar / ieee) filter loosely,
so a pipeline guard keeps the result within [year_from, year_to]; papers
with an unknown year are kept.
- cli: expose --min-citations. The flag and the Query wiring were both missing,
so the filter (now pipeline-wide) was unreachable from the CLI. docs/cli.md
updated.
Gates: 592 tests pass (+4), ruff + bandit clean.
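The three canonicalisations used by dedup (title, arXiv version suffix, first-author surname) could look roughly like the sketch below. The helper names and the exact normalisation rules are assumptions for illustration; the repository's implementation may differ.

```python
import re
import unicodedata

_ARXIV_VERSION = re.compile(r"v\d+$")


def _fold(text):
    """Case-fold and decompose accents so 'É' and 'e' compare equal after stripping."""
    return unicodedata.normalize("NFKD", text).casefold()


def canonical_title(title):
    """Drop punctuation and whitespace so cosmetic variants hash identically."""
    return re.sub(r"[^a-z0-9]+", "", _fold(title))


def canonical_arxiv_id(arxiv_id):
    """Strip a trailing version suffix: 2401.00001v2 -> 2401.00001."""
    return _ARXIV_VERSION.sub("", arxiv_id.strip())


def canonical_surname(author):
    """Surname from 'Last, First', 'First Last', or 'F. Last' (assumed formats)."""
    name = author.strip()
    if "," in name:
        surname = name.split(",", 1)[0]
    else:
        parts = name.split()
        surname = parts[-1] if parts else ""
    return re.sub(r"[^a-z]+", "", _fold(surname))


def dedup_key(title, first_author):
    """Hypothetical key for a DOI-less paper: canonical title plus surname."""
    return (canonical_title(title), canonical_surname(first_author))
```

One known limit of this sketch: multi-word surnames ("van der Berg") canonicalise differently in the "Last, First" and "First Last" forms, so a real implementation would need an extra rule for name particles.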
The ranking/dedup/filter/concurrency workflow pass introduced two conventions
worth pinning so they don't bit-rot:
- code-quality-reviewer (Async & Concurrency): add the bounded-fan-out rule. A
stage that gathers one op over N papers (enrich / download / OA resolve)
escapes the per-source token buckets and must use a module-level Semaphore
cap. Includes the 25-paper 429 / IP-block rationale + anti-pattern + pattern.
- compliance-auditor (Core vs Source Plugins): add the query-semantics rule.
Query filter fields (min_citations / year / top_tier) are enforced in the
pipeline for all sources, never delegated to plugins. Documents the real
min_citations incident (only semantic_scholar honoured it) + anti-pattern.
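The bounded-fan-out rule can be illustrated with the sketch below. `bounded_gather` and `demo` are hypothetical names, and the semaphore is created per call here only so the example stays independent of which event loop runs it; the rule above asks for a module-level instance, which behaves the same way under a single loop.

```python
import asyncio

MAX_CONCURRENT = 4  # mirrors the default cap of 4 mentioned in this pass


async def bounded_gather(op, papers, limit=MAX_CONCURRENT):
    """Run op(paper) for every paper with at most `limit` calls in flight.

    Anti-pattern this replaces: asyncio.gather(*(op(p) for p in papers)),
    which starts one call per paper at once.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(paper):
        async with semaphore:
            return await op(paper)

    return await asyncio.gather(*(guarded(p) for p in papers))


async def demo(n_papers=25):
    """Return the peak number of simultaneous calls seen by a fake op."""
    state = {"active": 0, "peak": 0}  # dict counter, mutated by the double

    async def fake_op(paper):
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0)  # yield so other tasks get a chance to start
        state["active"] -= 1
        return paper

    await bounded_gather(fake_op, list(range(n_papers)))
    return state["peak"]
```

Calling `asyncio.run(demo())` reports a peak no higher than the cap, whereas the unbounded gather would let all 25 calls overlap.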
- S2583 (test_pdf_download / test_pipeline concurrency tests): the 'condition
always false' on 'assert 1 <= peak <= 3' was a static-analysis false positive.
Sonar can't see that 'peak' (a nonlocal int) is mutated by the monkeypatched
async double via download_pdfs/enrich_collection's gather. Switch the counter
to a dict so the mutation isn't const-propagated to 0; same fix applied to the
enrich test pre-emptively.
- S7503 (test_cli min-citations test): the fake_run_search / fake_shutdown
doubles replace awaited production fns and must stay async, so mark them
NOSONAR.
The per-test fake_run_search / fake_shutdown / monkeypatch boilerplate was the single biggest duplication source (118 dup lines, ~13% file density). Upgrade the existing patched_pipeline fixture to also capture the Query, and route the four query-asserting tests (source-default / exclude-source / min-citations / top-tier) through it instead of each re-rolling its own fake. -35 lines; the two async-stub NOSONARs now live once in the fixture.
Route the pdf-download and auto-enrich tests through the shared patched_pipeline fixture, and delete the now-redundant _fake_search_with_papers helper (it duplicated the fixture's run_search / shutdown stub). test_cli.py 926 -> 848 lines across this pass; the only remaining fake_run_search definitions are the fixture and _patch_search (which legitimately takes custom papers with controlled pdf_url).

Summary
Quality + robustness improvements across the automatic paper-research workflow
(search → dedup → filter → enrich → download), delivered in two commits. All
gates green: 592 tests pass, ruff + bandit clean.
Search quality
- Ranking: dedup discarded each source's relevance order, so the final sort
was recency + citation only (surfacing a 100k-citation off-topic paper above
an on-topic one). Now relevance (query-term overlap, title ≫ abstract) is the
dominant signal.
- Dedup: canonicalises the title (punctuation/whitespace), the arXiv version
suffix (v1/v2), and the first-author surname ("Vaswani, Ashish" /
"Ashish Vaswani" / "A. Vaswani"), so cosmetic cross-source differences no
longer split one paper into duplicates.
Filter correctness
- min_citations applies to every source: it was accepted + documented as a
threshold filter but only semantic_scholar honoured it. Now enforced in the
pipeline for all sources (unknown counts kept).
- --min-citations exposed on the CLI: the flag and the Query wiring were both
missing, so the (now pipeline-wide) filter was unreachable from the CLI.
Added + documented.
- Year range enforced within [year_from, year_to] once for all sources after
dedup/rank (scrape sources filter loosely); unknown years kept.
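As a rough illustration of the pipeline-level guards, the sketch below applies both filters in one place and keeps papers whose count or year is unknown. The `Paper` dataclass and the function names are assumptions made for the example, not the project's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:  # hypothetical stand-in for the pipeline's paper record
    title: str
    year: Optional[int] = None
    citations: Optional[int] = None


def passes_filters(paper, min_citations=None, year_from=None, year_to=None):
    """True unless a *known* value falls outside the requested bounds."""
    # An unknown citation count (None) is not treated as zero: keep the paper.
    if min_citations is not None and paper.citations is not None:
        if paper.citations < min_citations:
            return False
    # An unknown year is likewise kept rather than rejected.
    if paper.year is not None:
        if year_from is not None and paper.year < year_from:
            return False
        if year_to is not None and paper.year > year_to:
            return False
    return True


def apply_query_filters(papers, **bounds):
    """Run once after dedup/rank, regardless of which source produced a paper."""
    return [p for p in papers if passes_filters(p, **bounds)]
```

Because the check sits after dedup/rank, a source plugin that ignores or loosely applies a filter cannot leak out-of-range results into the final list.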
Concurrency robustness
- enrich_collection and download_pdfs each cap concurrency with a semaphore
(default 4), so a large collection doesn't fire one Anthropic call / one PDF
download per paper at once and trip an API / publisher-CDN rate limit.
Cleanup
- Dropped the redundant except (ThesisAgentsError, Exception) in
pipeline._enrich_one.
Every change ships with unit tests (relevance ordering; punctuation / version /
author-format dedup; pipeline min_citations + year guard across sources;
concurrency-cap assertions; CLI --min-citations wiring).