Improve the automatic paper-research workflow (ranking, dedup, filters, concurrency) #17

Merged: JE-Chen merged 7 commits into main from dev on Jun 5, 2026

@JE-Chen JE-Chen commented Jun 5, 2026

Summary

Quality + robustness improvements across the automatic paper-research workflow
(search → dedup → filter → enrich → download), delivered in two commits. All
gates green: 592 tests pass, ruff + bandit clean.

Search quality

  • Ranking scores query relevance — dedup discarded each source's local
    relevance order, so the final sort was recency + citation only (surfacing a
    100k-citation off-topic paper above an on-topic one). Now relevance
    (query-term overlap, title ≫ abstract) is the dominant signal.
  • Robust dedup — the dedup key now canonicalises the title (strip
    punctuation/whitespace), the arXiv version suffix (v1/v2), and the
    first-author surname ("Vaswani, Ashish" / "Ashish Vaswani" /
    "A. Vaswani"), so cosmetic cross-source differences no longer split one
    paper into duplicates.
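
The three canonicalisation steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the DOI > arXiv > title fallback order, and the tuple-shaped key are all assumptions made for the example.

```python
import re
import unicodedata
from typing import Optional, Tuple

_ARXIV_VERSION = re.compile(r"v\d+$", re.IGNORECASE)
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def _fold(text: str) -> str:
    """Lower-case, strip accents, and drop punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text).casefold()
    return _NON_ALNUM.sub("", decomposed)


def canonical_title(title: str) -> str:
    # "Attention Is All You Need" and "...need." fold to the same string.
    return _fold(title)


def canonical_arxiv_id(arxiv_id: str) -> str:
    # Drop a trailing version suffix such as v1 or v2.
    return _ARXIV_VERSION.sub("", arxiv_id.strip())


def canonical_surname(author: str) -> str:
    # "Surname, Given" keeps the part before the comma;
    # "Given Surname" and "G. Surname" keep the last token.
    name = author.strip()
    if "," in name:
        surname = name.split(",", 1)[0]
    else:
        parts = name.split()
        surname = parts[-1] if parts else ""
    return _fold(surname)


def dedup_key(
    title: str,
    first_author: str,
    arxiv_id: Optional[str] = None,
    doi: Optional[str] = None,
) -> Tuple[str, ...]:
    # Assumed priority: DOI, then arXiv id, then title + surname.
    if doi:
        return ("doi", doi.strip().lower())
    if arxiv_id:
        return ("arxiv", canonical_arxiv_id(arxiv_id))
    return ("title", canonical_title(title), canonical_surname(first_author))
```

With this sketch, a DOI-less record titled "Attention is all you need." by "A. Vaswani" gets the same key as "Attention Is All You Need" by "Vaswani, Ashish". Taking the last token as the surname is a simplification that would mishandle multi-word surnames.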

Filter correctness

  • min_citations applies to every source — it was accepted + documented as
    a threshold filter but only semantic_scholar honoured it. Now enforced in
    the pipeline for all sources (unknown counts kept).
  • --min-citations exposed on the CLI — the flag and the Query wiring
    were both missing, so the (now pipeline-wide) filter was unreachable from the
    CLI. Added + documented.
  • Year-range pipeline guard — enforce [year_from, year_to] once for all
    sources after dedup/rank (scrape sources filter loosely); unknown years kept.
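
A rough sketch of the pipeline-level guards described above, showing the "unknown values are kept" rule for both filters. The `Paper` dataclass and the function names are hypothetical stand-ins; only the filter semantics come from this PR.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Paper:
    """Hypothetical record; the real model has more fields."""
    title: str
    year: Optional[int] = None
    citations: Optional[int] = None


def passes_filters(
    paper: Paper,
    min_citations: Optional[int] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> bool:
    # An unknown citation count is not treated as zero: the paper is kept.
    if min_citations is not None and paper.citations is not None:
        if paper.citations < min_citations:
            return False
    # An unknown year is kept as well; known years must fall in range.
    if paper.year is not None:
        if year_from is not None and paper.year < year_from:
            return False
        if year_to is not None and paper.year > year_to:
            return False
    return True


def apply_query_filters(
    papers: Iterable[Paper],
    min_citations: Optional[int] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> List[Paper]:
    # Runs once after dedup/rank, regardless of which source found the paper.
    return [
        p for p in papers
        if passes_filters(p, min_citations, year_from, year_to)
    ]
```

Applying the guards in one place means a source plugin that filters loosely, or not at all, cannot leak out-of-range results into the final list.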

Concurrency robustness

  • enrich_collection and download_pdfs each cap concurrency with a
    semaphore (default 4), so a large collection doesn't fire one Anthropic call /
    one PDF download per paper at once and trip an API / publisher-CDN rate limit.
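
The semaphore cap can be sketched like this. `bounded_gather`, `MAX_CONCURRENCY`, and `fake_enrich` are illustrative names, not the project's; the default of 4 is taken from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

# Module-level cap (default taken from the PR description).
MAX_CONCURRENCY = 4


async def bounded_gather(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = MAX_CONCURRENCY,
) -> List[R]:
    """Run worker over items with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # Tasks beyond the cap wait here instead of hitting the API.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_enrich(paper_id: int) -> int:
    """Stand-in for a per-paper API call or PDF download."""
    await asyncio.sleep(0.01)
    return paper_id * 2
```

A call such as `asyncio.run(bounded_gather(range(25), fake_enrich))` still schedules all 25 tasks, but only four are inside `fake_enrich` at any moment. The sketch creates the semaphore per call to stay independent of the running event loop; the project may hold it elsewhere.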

Cleanup

  • Drop the redundant except (ThesisAgentsError, Exception) in
    pipeline._enrich_one.

Every change ships with unit tests (relevance ordering; punctuation / version /
author-format dedup; pipeline min_citations + year guard across sources;
concurrency-cap assertions; CLI --min-citations wiring).

JE-Chen added 3 commits June 5, 2026 22:59
Six focused quality / robustness fixes across the search -> dedup -> filter
-> enrich -> download pipeline.

Search quality
- ranking: add a query-relevance axis. Dedup merges each source's locally
  relevance-sorted list and discarded that ordering, so ranking fell back to
  recency + citation alone — surfacing off-topic-but-highly-cited papers above
  on-topic ones. Now score = relevance (query-term overlap with title >>
  abstract, dominant) + recency + damped log10(citations). keywords is
  optional so single-paper callers keep the old behaviour.
- dedup: normalise the title hash (strip punctuation/whitespace) and drop the
  arXiv version suffix, so "Attention Is All You Need" / "...need." and
  2401.00001v1 / v2 collapse to one record instead of splitting.
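
As an illustration of the scoring shape described above (relevance + recency + damped log10 of citations), here is a minimal sketch. The weights, the ten-year recency window, and the function names are assumptions chosen for the example; only the overall structure and the optional `keywords` argument come from the commit message.

```python
import math
import re
from datetime import date
from typing import Optional, Set

_TOKEN = re.compile(r"[a-z0-9]+")

# Illustrative weights (assumptions, not the project's real values).
RELEVANCE_WEIGHT = 10.0
TITLE_SHARE = 0.75        # title overlap counts far more than abstract
ABSTRACT_SHARE = 0.25
CITATION_DAMPING = 0.5
RECENCY_WINDOW_YEARS = 10


def _tokens(text: str) -> Set[str]:
    return set(_TOKEN.findall(text.lower()))


def relevance(keywords: str, title: str, abstract: str = "") -> float:
    """Fraction of query terms found, weighted title >> abstract."""
    terms = _tokens(keywords)
    if not terms:
        return 0.0
    title_overlap = len(terms & _tokens(title)) / len(terms)
    abstract_overlap = len(terms & _tokens(abstract)) / len(terms)
    return TITLE_SHARE * title_overlap + ABSTRACT_SHARE * abstract_overlap


def score(
    title: str,
    abstract: str = "",
    year: Optional[int] = None,
    citations: Optional[int] = None,
    keywords: Optional[str] = None,
    current_year: Optional[int] = None,
) -> float:
    this_year = current_year if current_year is not None else date.today().year
    recency = 0.0
    if year is not None:
        age = max(0, this_year - year)
        recency = max(0.0, 1.0 - age / RECENCY_WINDOW_YEARS)
    # log10 damping keeps a 100k-citation paper near 2.5, not 100000.
    cited = CITATION_DAMPING * math.log10(1 + max(0, citations or 0))
    # Without keywords the score falls back to recency + citations only.
    relevant = RELEVANCE_WEIGHT * relevance(keywords, title, abstract) if keywords else 0.0
    return relevant + recency + cited
```

Under these weights an on-topic paper with a dozen citations outranks an off-topic one with 100,000, while omitting `keywords` reproduces the old recency + citation ordering.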

Filter correctness
- min_citations was accepted (CLI/MCP/Query) and documented as a filter but
  only ever applied by the semantic_scholar plugin. Enforce it in the pipeline
  for every source; papers whose source reports no count are kept (an unknown
  count is not treated as zero).

Concurrency robustness
- enrich_collection: cap simultaneous per-paper enrichments with a semaphore
  (default 4) so a large collection doesn't fire one Anthropic call + one PDF
  download per paper at once and trip the API rate limit.
- download_pdfs: same semaphore cap, so several sources handing back PDF URLs
  on the same publisher CDN don't burst it into an IP block.

Cleanup
- pipeline._enrich_one: drop the redundant `except (ThesisAgentsError,
  Exception)` (Exception already covers it).

Gates: 588 tests pass (+10), ruff + bandit clean.

Continuation of the workflow-quality pass.

- dedup: canonicalise the first-author surname so name-format differences
  ("Vaswani, Ashish" / "Ashish Vaswani" / "A. Vaswani") don't split one
  DOI-less paper into duplicates — complements the title/arXiv-version canon.
- pipeline: enforce the year range for ALL sources after dedup/rank, not only
  inside the source plugins. Scrape sources (scholar / ieee) filter loosely,
  so a pipeline guard keeps the result within [year_from, year_to]; papers
  with an unknown year are kept.
- cli: expose --min-citations. The flag and the Query wiring were both missing,
  so the filter (now pipeline-wide) was unreachable from the CLI. docs/cli.md
  updated.
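
The CLI wiring can be pictured roughly as below. This sketch assumes `argparse` and a cut-down `Query` dataclass purely for illustration; the project's actual CLI framework, parser layout, and `Query` fields are not shown in this PR.

```python
import argparse
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Query:
    """Hypothetical stand-in for the real Query model."""
    keywords: str
    min_citations: Optional[int] = None


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="search")
    parser.add_argument("keywords")
    parser.add_argument(
        "--min-citations",
        type=int,
        default=None,
        help="drop papers with a known citation count below this value",
    )
    return parser


def query_from_args(argv: Sequence[str]) -> Query:
    args = build_parser().parse_args(argv)
    # Both halves matter: the flag must exist AND reach the Query.
    return Query(keywords=args.keywords, min_citations=args.min_citations)
```

For example, `query_from_args(["sparse attention", "--min-citations", "50"])` yields a `Query` whose `min_citations` is 50, and omitting the flag leaves it as `None` so the pipeline filter stays off.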

Gates: 592 tests pass (+4), ruff + bandit clean.
@JE-Chen JE-Chen changed the title from "Improve the automatic paper-research workflow (ranking, dedup, min_citations, concurrency)" to "Improve the automatic paper-research workflow (ranking, dedup, filters, concurrency)" Jun 5, 2026
JE-Chen added 4 commits June 5, 2026 23:17
The ranking/dedup/filter/concurrency workflow pass introduced two conventions
worth pinning so they don't bit-rot:

- code-quality-reviewer (Async & Concurrency): add the bounded-fan-out rule —
  a stage that gathers one op over N papers (enrich / download / OA resolve)
  escapes the per-source token buckets and must use a module-level Semaphore
  cap. Includes the 25-paper 429 / IP-block rationale + anti-pattern + pattern.

- compliance-auditor (Core vs Source Plugins): add the query-semantics rule —
  Query filter fields (min_citations / year / top_tier) are enforced in the
  pipeline for all sources, never delegated to plugins. Documents the real
  min_citations incident (only semantic_scholar honoured it) + anti-pattern.

- S2583 (test_pdf_download / test_pipeline concurrency tests): the
  'condition always false' on 'assert 1 <= peak <= 3' was a static-analysis
  false positive — Sonar can't see that 'peak' (a nonlocal int) is mutated by
  the monkeypatched async double via download_pdfs/enrich_collection's gather.
  Switch the counter to a dict so the mutation isn't const-propagated to 0;
  same fix applied to the enrich test pre-emptively.
- S7503 (test_cli min-citations test): the fake_run_search / fake_shutdown
  doubles replace awaited production fns and must stay async — mark NOSONAR.

The per-test fake_run_search / fake_shutdown / monkeypatch boilerplate was the
single biggest duplication source (118 dup lines, ~13% file density). Upgrade
the existing patched_pipeline fixture to also capture the Query, and route the
four query-asserting tests (source-default / exclude-source / min-citations /
top-tier) through it instead of each re-rolling its own fake. -35 lines; the
two async-stub NOSONARs now live once in the fixture.

Route the pdf-download and auto-enrich tests through the shared
patched_pipeline fixture, and delete the now-redundant
_fake_search_with_papers helper (it duplicated the fixture's run_search /
shutdown stub). test_cli.py 926 -> 848 lines across this pass; the only
remaining fake_run_search definitions are the fixture and _patch_search
(which legitimately takes custom papers with controlled pdf_url).
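
The S2583 fix in the commit notes above, which replaces a nonlocal int counter with a dict mutated by the async double, might look roughly like this in a concurrency-cap test. `make_tracking_double`, `run_capped`, and the example URLs are hypothetical; `run_capped` only stands in for the real `download_pdfs`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


def make_tracking_double(
    delay: float = 0.01,
) -> Tuple[Callable[[str], Awaitable[str]], Dict[str, int]]:
    """Build an async test double plus the dict it mutates."""
    # A dict instead of a nonlocal int: the mutation happens through
    # item assignment, which static analysis does not fold to 0.
    counter = {"active": 0, "peak": 0}

    async def fake_download(url: str) -> str:
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        try:
            await asyncio.sleep(delay)
        finally:
            counter["active"] -= 1
        return url

    return fake_download, counter


async def run_capped(
    urls: List[str], worker: Callable[[str], Awaitable[str]], limit: int
) -> List[str]:
    """Stand-in for the semaphore-capped stage under test."""
    semaphore = asyncio.Semaphore(limit)

    async def one(url: str) -> str:
        async with semaphore:
            return await worker(url)

    return await asyncio.gather(*(one(u) for u in urls))


def check_peak_is_bounded() -> int:
    fake, counter = make_tracking_double()
    urls = [f"https://example.org/{i}.pdf" for i in range(10)]
    asyncio.run(run_capped(urls, fake, limit=3))
    # Same shape as the assertion quoted in the commit message.
    assert 1 <= counter["peak"] <= 3
    return counter["peak"]
```

The double must remain `async` because it replaces an awaited function, which is also why the S7503 findings are suppressed rather than "fixed".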

sonarqubecloud Bot commented Jun 5, 2026

@JE-Chen JE-Chen merged commit 23d4844 into main Jun 5, 2026
13 checks passed