Improve the automatic paper-research workflow (ranking, dedup, filters, concurrency) by JE-Chen · Pull Request #17 · Integration-Automation/ThesisAgent

JE-Chen · 2026-06-05T14:59:51Z

Summary

Quality + robustness improvements across the automatic paper-research workflow
(search → dedup → filter → enrich → download), delivered in two commits. All
gates green: 592 tests pass, ruff + bandit clean.

Search quality

- Ranking scores query relevance: dedup discarded each source's local
  relevance order, so the final sort used recency + citations only (surfacing
  a 100k-citation off-topic paper above an on-topic one). Relevance
  (query-term overlap, title ≫ abstract) is now the dominant signal.
- Robust dedup: the dedup key now canonicalises the title (stripping
  punctuation/whitespace), drops the arXiv version suffix (v1/v2), and
  canonicalises the first-author surname ("Vaswani, Ashish" /
  "Ashish Vaswani" / "A. Vaswani"), so cosmetic cross-source differences no
  longer split one paper into duplicates.

Filter correctness

- min_citations applies to every source: it was accepted + documented as
  a threshold filter but only semantic_scholar honoured it. Now enforced in
  the pipeline for all sources (unknown counts kept).
- --min-citations exposed on the CLI: the flag and the Query wiring
  were both missing, so the (now pipeline-wide) filter was unreachable from
  the CLI. Added + documented.
- Year-range pipeline guard: enforce [year_from, year_to] once for all
  sources after dedup/rank (scrape sources filter loosely); unknown years kept.
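
The "unknown values are kept" semantics of the two pipeline guards can be sketched as follows. The `Paper` dataclass and `apply_query_filters` function are hypothetical stand-ins for this example; only the field names min_citations, year_from and year_to come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Paper:
    # Hypothetical record; the real model is not shown in this PR text.
    title: str
    year: Optional[int] = None
    citations: Optional[int] = None


def apply_query_filters(
    papers: Iterable[Paper],
    min_citations: Optional[int] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> list[Paper]:
    """Source-agnostic guard: drop only papers that are known to fail."""
    kept = []
    for paper in papers:
        # An unknown citation count is not treated as zero.
        if (min_citations is not None and paper.citations is not None
                and paper.citations < min_citations):
            continue
        # An unknown year passes the range check.
        if paper.year is not None:
            if year_from is not None and paper.year < year_from:
                continue
            if year_to is not None and paper.year > year_to:
                continue
        kept.append(paper)
    return kept
```

Running the guard once in the pipeline, rather than in each plugin, means a source that ignores or loosely applies a filter cannot leak out-of-range results.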

Concurrency robustness

enrich_collection and download_pdfs each cap concurrency with a
semaphore (default 4), so a large collection doesn't fire one Anthropic call /
one PDF download per paper at once and trip an API / publisher-CDN rate limit.
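
A semaphore-capped fan-out of this kind might look like the sketch below. `bounded_gather` and `DEFAULT_MAX_CONCURRENCY` are names invented for the example (only the default of 4 is taken from the description), and the real enrich_collection / download_pdfs may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

DEFAULT_MAX_CONCURRENCY = 4  # default cap quoted in the PR summary


async def bounded_gather(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = DEFAULT_MAX_CONCURRENCY,
) -> list[R]:
    """Run worker over every item, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # Tasks beyond the cap wait here instead of hitting the API/CDN.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run_one(item) for item in items))
```

All tasks are still created up front, but only `limit` of them are past the semaphore at any moment, which is what keeps a large collection from bursting the upstream service.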

Cleanup

Drop the redundant except (ThesisAgentsError, Exception) in
pipeline._enrich_one.

Every change ships with unit tests (relevance ordering; punctuation / version /
author-format dedup; pipeline min_citations + year guard across sources;
concurrency-cap assertions; CLI --min-citations wiring).

Six focused quality / robustness fixes across the search -> dedup -> filter -> enrich -> download pipeline.

Search quality
- ranking: add a query-relevance axis. Dedup merged each source's locally relevance-sorted list and discarded that ordering, so ranking fell back to recency + citation alone, surfacing off-topic-but-highly-cited papers above on-topic ones. Now score = relevance (query-term overlap with title >> abstract, dominant) + recency + damped log10(citations). keywords is optional so single-paper callers keep the old behaviour.
- dedup: normalise the title hash (strip punctuation/whitespace) and drop the arXiv version suffix, so "Attention Is All You Need" / "...need." and 2401.00001v1 / v2 collapse to one record instead of splitting.

Filter correctness
- min_citations was accepted (CLI/MCP/Query) and documented as a filter but only ever applied by the semantic_scholar plugin. Enforce it in the pipeline for every source; papers whose source reports no count are kept (an unknown count is not treated as zero).

Concurrency robustness
- enrich_collection: cap simultaneous per-paper enrichments with a semaphore (default 4) so a large collection doesn't fire one Anthropic call + one PDF download per paper at once and trip the API rate limit.
- download_pdfs: same semaphore cap, so several sources handing back PDF URLs on the same publisher CDN don't burst it into an IP block.

Cleanup
- pipeline._enrich_one: drop the redundant `except (ThesisAgentsError, Exception)` (Exception already covers it).

Gates: 588 tests pass (+10), ruff + bandit clean.
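
The score shape named in this commit message (relevance + recency + damped log10(citations), with relevance dominant and title weighted above abstract) could be sketched like this. Every weight, the 20-year recency window, and the function names are assumptions chosen for illustration; the PR text does not give the real constants.

```python
import math
import re
from typing import Optional, Sequence

# All constants below are illustrative assumptions, not the PR's values.
TITLE_WEIGHT = 3.0
ABSTRACT_WEIGHT = 1.0
RELEVANCE_WEIGHT = 10.0
RECENCY_WINDOW_YEARS = 20.0
CITATION_DAMPING = 6.0


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(keywords: Sequence[str], title: str, abstract: str = "") -> float:
    """Fraction of query terms found, with title hits weighted above abstract hits."""
    query = _terms(" ".join(keywords))
    if not query:
        return 0.0
    title_hit = len(query & _terms(title)) / len(query)
    abstract_hit = len(query & _terms(abstract)) / len(query)
    total = TITLE_WEIGHT + ABSTRACT_WEIGHT
    return (TITLE_WEIGHT * title_hit + ABSTRACT_WEIGHT * abstract_hit) / total


def score(title: str, abstract: str, year: Optional[int],
          citations: Optional[int], keywords: Optional[Sequence[str]] = None,
          current_year: int = 2026) -> float:
    # Without keywords the relevance axis is zero, i.e. the old ordering.
    rel = relevance(keywords, title, abstract) if keywords else 0.0
    recency = 0.0
    if year is not None:
        recency = max(0.0, 1.0 - (current_year - year) / RECENCY_WINDOW_YEARS)
    cites = math.log10(1 + citations) / CITATION_DAMPING if citations else 0.0
    return RELEVANCE_WEIGHT * rel + recency + cites
```

Under these assumed weights, an on-topic paper with 50 citations outranks an off-topic paper with 100,000 citations when keywords are supplied, while omitting keywords reproduces the recency + citation ordering.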

Continuation of the workflow-quality pass.

- dedup: canonicalise the first-author surname so name-format differences ("Vaswani, Ashish" / "Ashish Vaswani" / "A. Vaswani") don't split one DOI-less paper into duplicates. This complements the title/arXiv-version canon.
- pipeline: enforce the year range for ALL sources after dedup/rank, not only inside the source plugins. Scrape sources (scholar / ieee) filter loosely, so a pipeline guard keeps the result within [year_from, year_to]; papers with an unknown year are kept.
- cli: expose --min-citations. The flag and the Query wiring were both missing, so the filter (now pipeline-wide) was unreachable from the CLI. docs/cli.md updated.

Gates: 592 tests pass (+4), ruff + bandit clean.
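
One way to reduce the three author formats quoted above to a single surname is sketched below. `canonical_surname` is a hypothetical helper, and the "text before the comma, else the last token" heuristic is an assumption; it would not handle multi-word surnames such as "van der Waals" consistently.

```python
import re

_NON_WORD = re.compile(r"[\W_]+")


def canonical_surname(author: str) -> str:
    """Extract a case-folded surname from 'Last, First' or 'First Last'."""
    name = author.strip()
    if "," in name:
        # "Vaswani, Ashish" -> the part before the comma is the surname.
        surname = name.split(",", 1)[0]
    else:
        # "Ashish Vaswani" / "A. Vaswani" -> assume the last token.
        parts = name.split()
        surname = parts[-1] if parts else ""
    return _NON_WORD.sub("", surname.casefold())
```

Combining this surname with the canonical title gives DOI-less records a key that is stable across sources that format author names differently.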

The ranking/dedup/filter/concurrency workflow pass introduced two conventions worth pinning so they don't bit-rot:

- code-quality-reviewer (Async & Concurrency): add the bounded-fan-out rule. A stage that gathers one op over N papers (enrich / download / OA resolve) escapes the per-source token buckets and must use a module-level Semaphore cap. Includes the 25-paper 429 / IP-block rationale + anti-pattern + pattern.
- compliance-auditor (Core vs Source Plugins): add the query-semantics rule. Query filter fields (min_citations / year / top_tier) are enforced in the pipeline for all sources, never delegated to plugins. Documents the real min_citations incident (only semantic_scholar honoured it) + anti-pattern.

- S2583 (test_pdf_download / test_pipeline concurrency tests): the 'condition always false' on 'assert 1 <= peak <= 3' was a static-analysis false positive. Sonar can't see that 'peak' (a nonlocal int) is mutated by the monkeypatched async double via download_pdfs/enrich_collection's gather. Switch the counter to a dict so the mutation isn't const-propagated to 0; same fix applied to the enrich test pre-emptively.
- S7503 (test_cli min-citations test): the fake_run_search / fake_shutdown doubles replace awaited production fns and must stay async, so mark them NOSONAR.
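
The dict-based peak counter described for S2583 might look like the following sketch. `measure_peak` and `fake_download` are invented names, and the semaphore here stands in for the cap that the real download_pdfs / enrich_collection apply internally; the actual tests monkeypatch production functions instead.

```python
import asyncio


async def measure_peak(n_items: int, cap: int) -> int:
    """Return the highest number of doubles that were in flight at once."""
    # A dict rather than a nonlocal int: the mutation happens through item
    # assignment inside the async double, which static analysis is less
    # likely to constant-fold to 0.
    counter = {"active": 0, "peak": 0}
    semaphore = asyncio.Semaphore(cap)  # stand-in for the production cap

    async def fake_download(_item: int) -> None:
        async with semaphore:
            counter["active"] += 1
            counter["peak"] = max(counter["peak"], counter["active"])
            await asyncio.sleep(0)  # yield so other tasks can overlap
            counter["active"] -= 1

    await asyncio.gather(*(fake_download(i) for i in range(n_items)))
    return counter["peak"]


peak = asyncio.run(measure_peak(n_items=10, cap=3))
assert 1 <= peak <= 3
```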

The per-test fake_run_search / fake_shutdown / monkeypatch boilerplate was the single biggest duplication source (118 dup lines, ~13% file density). Upgrade the existing patched_pipeline fixture to also capture the Query, and route the four query-asserting tests (source-default / exclude-source / min-citations / top-tier) through it instead of each re-rolling its own fake. -35 lines; the two async-stub NOSONARs now live once in the fixture.

Route the pdf-download and auto-enrich tests through the shared patched_pipeline fixture, and delete the now-redundant _fake_search_with_papers helper (it duplicated the fixture's run_search / shutdown stub). test_cli.py 926 -> 848 lines across this pass; the only remaining fake_run_search definitions are the fixture and _patch_search (which legitimately takes custom papers with controlled pdf_url).

sonarqubecloud · 2026-06-05T15:48:43Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

JE-Chen added 3 commits June 5, 2026 22:59

Merge remote-tracking branch 'origin/main' into dev

a3003b3

JE-Chen changed the title ~~Improve the automatic paper-research workflow (ranking, dedup, min_citations, concurrency)~~ Improve the automatic paper-research workflow (ranking, dedup, filters, concurrency) Jun 5, 2026

JE-Chen added 4 commits June 5, 2026 23:17

JE-Chen merged commit 23d4844 into main Jun 5, 2026
13 checks passed

JE-Chen merged 7 commits into main from dev
