Integration-Automation · JE-Chen · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026
diff --git a/.claude/agents/rules/code-quality-reviewer.md b/.claude/agents/rules/code-quality-reviewer.md
@@ -51,6 +51,7 @@ You do NOT modify files. The parent agent decides whether to fix.
 - Per-source rate limits live in `thesisagents/fetchers/rate_limit.py` as a token-bucket decorator. Each source plugin declares its own bucket (`arxiv: 1 req/3s`, `semantic_scholar: 1 req/s`, `scholar: 1 req/10s with jitter`, etc.). Do NOT bypass the bucket — even retries go through it.
 - Streamlit runs the UI on a separate thread per session. Mutate `st.session_state` only, never module globals. Long-running export jobs are dispatched to the FastAPI backend and polled.
 - All fixture-recording, CLI exports, and tests use `asyncio.run` at the outermost layer and never inside library code.
+- **Bounded fan-out over a paper collection.** A per-source token bucket caps traffic *within* one source, but a stage that fans one operation out over N papers (PDF download, OA resolve, LLM enrich) escapes those buckets — several papers can hit the *same* external API (Anthropic) or publisher CDN (`dl.acm.org`) at once through different source clients. Any `asyncio.gather(*(coro(item) for item in collection.papers))` MUST wrap each task in a module-level `asyncio.Semaphore` cap (the project uses 4–5: `pipeline._ENRICH_CONCURRENCY`, `pdf_download._DOWNLOAD_CONCURRENCY`, `oa_resolver._CONCURRENCY`). **Why:** an unbounded fan-out of a 25-paper search fires 25 concurrent Anthropic calls (an easy 429) or 25 hits on one CDN (an IP block) — it looks fine at 3 papers and fails at 30. **Anti-pattern:** `await asyncio.gather(*(_download_one(p, d) for p in collection.papers))` with no semaphore. **Pattern:** define `sem = asyncio.Semaphore(max(1, concurrency))` and a `_bounded` wrapper that does `async with sem: return await coro(item)`.
 
 ## Security (review-level)
 

diff --git a/.claude/agents/rules/compliance-auditor.md b/.claude/agents/rules/compliance-auditor.md
@@ -46,6 +46,8 @@ The line between `thesisagents/` and `sources/<name>/` is **not** "anything sour
 
 **Loading source plugins:** at runtime `thesisagents/fetchers/base.py::load_fetcher(name)` does `importlib.import_module(f"thesisagents.sources.{name}")`. No more `sys.path` injection — sources are a regular sub-package of `thesisagents`. Plugins under `thesisagents/sources/<name>/` use **relative imports** internally (`from .fetcher import …`, `from .parser import …`, `from . import webrunner_backend`). Tests import them via the full path (`from thesisagents.sources.<name>.fetcher import …`) and reference module-string targets the same way (e.g. `monkeypatch.setattr("thesisagents.sources.acm.fetcher.get_client", …)`).
 
+**Query semantics are enforced in core, never delegated to plugins.** Every filter field on `Query` (`min_citations`, `year_from` / `year_to`, `top_tier_only`) MUST be applied by the pipeline (`thesisagents/core/pipeline.py::run_search`, after dedup + rank) so it holds for **all** sources uniformly. A plugin MAY *additionally* push the filter to its upstream API as an optimisation (e.g. `semantic_scholar` sends `minCitationCount`), but core must never *rely* on plugins to do it. **Why (real incident):** `min_citations` was accepted by CLI/MCP/`Query`, documented as a threshold filter, and implemented by only the `semantic_scholar` plugin — so for the other 14 sources it silently did nothing; `min_citations=50` still returned 2-citation arXiv / OpenAlex / DBLP papers. **Anti-pattern:** wiring a `Query` filter into one source's request params and assuming the result set is filtered. **Pattern:** filter in `run_search`, and keep records whose value is *unknown* (`citation_count is None` / `year is None`) rather than treating unknown as failing the threshold (a source not reporting a count must not silently drop the paper).
+
 **When in doubt:** "if a user installs ThesisAgents with the default `requirements.txt` and never enables a source plugin, should this source work?" Yes → core. No → source plugin.
 
 ---

diff --git a/docs/cli.md b/docs/cli.md
@@ -15,7 +15,7 @@ exclusive modes:
 thesisagents (--query KEYWORDS | --paper IDENTIFIER)
                 [--source SOURCES] [--exclude-source SOURCES]
                 [--max N]
-                [--year-from YEAR] [--year-to YEAR]
+                [--year-from YEAR] [--year-to YEAR] [--min-citations N]
                 [--export FORMATS]
                 [--out DIR]
                 [--filename-stem STEM]
@@ -39,6 +39,7 @@ thesisagents (--query KEYWORDS | --paper IDENTIFIER)
 | `--exclude-source` / `-x` | — | Comma-separated sources to remove from the mix, subtracted **after** `--source` resolves. The no-VPN gesture is to leave `--source` at its default and pass `--exclude-source ieee` — every other default source stays in. An unknown name (or excluding the whole mix) is an error. |
 | `--max` / `-n` | `25` | Range 1..200. |
 | `--year-from`, `--year-to` | — | Inclusive year filter. |
+| `--min-citations` | — | Drop papers below this citation count, enforced across **all** sources (not just Semantic Scholar). Papers whose source reports no count are kept. Omit for no minimum. |
 | `--export` / `-e` | mode-specific | Any of `pptx`, `xlsx`, `md`, `bib`, `json`, `ris`, `csv`, `csl`. **Default with `--query` is `pptx,xlsx,bib`; default with `--paper` is `pptx,bib`** (one-row Excel is busy work). Explicit `--export` always wins. `ris` (Zotero/Mendeley/EndNote), `csv` (flat table) and `csl` (CSL-JSON for Pandoc, written as `<stem>.csl.json`) are the interchange formats. |
 | `--list-sources`, `--list-exports` | — | Print the available search sources / export formats and exit (no query needed). |
 | `--out` / `-o` | `./exports` | Created if missing. |

diff --git a/tests/test_cli.py b/tests/test_cli.py
@@ -43,16 +43,28 @@ async def _fake_success(collection, out_dir):
 
 @pytest.fixture()
 def patched_pipeline(monkeypatch, sample_papers):
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
+    """Stub cli's run_search + shutdown_clients and return a dict that captures
+    the Query passed to run_search (``patched_pipeline["query"]``).
+
+    Tests that only need the pipeline stubbed ignore the return value; tests
+    that assert on the constructed Query read ``patched_pipeline["query"]``
+    instead of each re-rolling their own ``fake_run_search`` / ``fake_shutdown``
+    boilerplate.
+    """
+    captured: dict[str, Query] = {}
+
+    async def fake_run_search(query, **_kwargs):  # NOSONAR async stub
+        captured["query"] = query
         return PaperCollection(query=query, papers=tuple(sample_papers))
 
-    async def fake_shutdown() -> None:
+    async def fake_shutdown():  # NOSONAR async stub
         return None
 
     monkeypatch.setattr(cli_module, "run_search", fake_run_search)
     monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
     # Leave the autouse ``_stub_download_pdfs`` in place so the new per-paper
     # PPT gate sees every paper as downloadable.
+    return captured
 
 
 def test_cli_runs_end_to_end(tmp_path: Path, patched_pipeline, capsys):
@@ -248,28 +260,16 @@ def test_cli_rejects_doi_identifier_until_resolver_lands(tmp_path):
     assert code == 2
 
 
-def test_cli_source_default_is_multi_source(tmp_path, monkeypatch, sample_papers):
+def test_cli_source_default_is_multi_source(tmp_path, patched_pipeline):
     """When --source is omitted, run_search must be invoked across the
     DEFAULT_SOURCES mix, not just arxiv."""
     from thesisagents.core.constants import DEFAULT_SOURCES
 
-    captured: dict[str, Query] = {}
-
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
-        captured["query"] = query
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown() -> None:
-        return None
-
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
-
     code = cli_module.main(
         ["--query", "x", "--out", str(tmp_path), "--export", "bib"]
     )
     assert code == 0
-    assert captured["query"].sources == DEFAULT_SOURCES
+    assert patched_pipeline["query"].sources == DEFAULT_SOURCES
 
 
 def test_cli_list_sources(capsys):
@@ -293,31 +293,30 @@ def test_cli_list_exports(capsys):
     assert "pptx" in out
 
 
-def test_cli_exclude_source_prunes_default_mix(tmp_path, monkeypatch, sample_papers):
+def test_cli_exclude_source_prunes_default_mix(tmp_path, patched_pipeline):
     """--exclude-source subtracts from the resolved mix; the no-VPN path drops
     only ieee and keeps every other default source."""
     from thesisagents.core.constants import DEFAULT_SOURCES
 
-    captured: dict[str, Query] = {}
-
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:  # NOSONAR async stub
-        captured["query"] = query
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown() -> None:  # NOSONAR async stub
-        return None
-
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
-
     code = cli_module.main(
         ["--query", "x", "--out", str(tmp_path), "--export", "bib",
          "--exclude-source", "ieee"]
     )
     assert code == 0
-    assert "ieee" not in captured["query"].sources
+    assert "ieee" not in patched_pipeline["query"].sources
     expected = tuple(s for s in DEFAULT_SOURCES if s != "ieee")
-    assert captured["query"].sources == expected
+    assert patched_pipeline["query"].sources == expected
+
+
+def test_cli_min_citations_flows_into_query(tmp_path, patched_pipeline):
+    """--min-citations is parsed and passed through to the Query (it was
+    previously unreachable from the CLI)."""
+    code = cli_module.main(
+        ["--query", "x", "--out", str(tmp_path), "--export", "bib",
+         "--min-citations", "50"]
+    )
+    assert code == 0
+    assert patched_pipeline["query"].min_citations == 50
 
 
 def test_cli_exclude_unknown_source_errors(tmp_path):
@@ -337,90 +336,50 @@ def test_cli_exclude_all_sources_errors(tmp_path):
         )
 
 
-def test_cli_top_tier_filter_off_by_default(tmp_path, monkeypatch, sample_papers):
+def test_cli_top_tier_filter_off_by_default(tmp_path, patched_pipeline):
     """top_tier_only is OFF by default (broader coverage including IEEE / ACM
     workshops); --top-tier-only flips it on."""
-    captured: dict[str, Query] = {}
-
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
-        captured["query"] = query
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown() -> None:
-        return None
-
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
-
     code = cli_module.main(
         ["--query", "x", "--out", str(tmp_path), "--export", "bib"]
     )
     assert code == 0
-    assert captured["query"].top_tier_only is False
+    assert patched_pipeline["query"].top_tier_only is False
 
-    captured.clear()
+    patched_pipeline.clear()
     code = cli_module.main(
         ["--query", "x", "--top-tier-only", "--out", str(tmp_path), "--export", "bib"]
     )
     assert code == 0
-    assert captured["query"].top_tier_only is True
+    assert patched_pipeline["query"].top_tier_only is True
 
 
-def test_cli_default_triggers_pdf_download(tmp_path, monkeypatch, sample_papers):
+def test_cli_default_triggers_pdf_download(tmp_path, monkeypatch, patched_pipeline):
     """Default flag set should invoke download_pdfs; --no-pdf disables it."""
     calls: list[str] = []
 
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown() -> None:
-        return None
-
-    async def fake_download(_collection, _out_dir):
+    async def fake_download(_collection, _out_dir):  # NOSONAR async stub
         calls.append("called")
         return []
 
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
     monkeypatch.setattr(cli_module, "download_pdfs", fake_download)
-
     code = cli_module.main(
-        [
-            "--query", "x",
-            "--source", "arxiv",
-            "--out", str(tmp_path),
-            "--export", "bib",
-        ]
+        ["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
     )
     assert code == 0
     assert calls == ["called"]
 
 
-def test_cli_no_pdf_flag_skips_download(tmp_path, monkeypatch, sample_papers):
+def test_cli_no_pdf_flag_skips_download(tmp_path, monkeypatch, patched_pipeline):
     calls: list[str] = []
 
-    async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown() -> None:
-        return None
-
-    async def fake_download(_collection, _out_dir):
+    async def fake_download(_collection, _out_dir):  # NOSONAR async stub
         calls.append("called")
         return []
 
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
     monkeypatch.setattr(cli_module, "download_pdfs", fake_download)
-
     code = cli_module.main(
-        [
-            "--query", "x",
-            "--source", "arxiv",
-            "--no-pdf",
-            "--out", str(tmp_path),
-            "--export", "bib",
-        ]
+        ["--query", "x", "--source", "arxiv", "--no-pdf",
+         "--out", str(tmp_path), "--export", "bib"]
     )
     assert code == 0
     assert calls == []
@@ -747,21 +706,9 @@ async def fake_enrich(collection, language=None, model=None):  # noqa: ARG001
     return calls
 
 
-def _fake_search_with_papers(monkeypatch, sample_papers):
-    async def fake_run_search(query, **_kwargs):
-        return PaperCollection(query=query, papers=tuple(sample_papers))
-
-    async def fake_shutdown():
-        return None
-
-    monkeypatch.setattr(cli_module, "run_search", fake_run_search)
-    monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
-
-
-def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, sample_papers):
+def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, patched_pipeline):
     """ANTHROPIC_API_KEY in env + no --lightweight = auto-enrich fires."""
     monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
-    _fake_search_with_papers(monkeypatch, sample_papers)
     calls = _stub_enrich_collection(monkeypatch)
     code = cli_module.main(
         ["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
@@ -770,10 +717,9 @@ def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, sample_papers
     assert calls == ["called"]
 
 
-def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, sample_papers):
+def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, patched_pipeline):
     """--lightweight wins over the auto-enrich default."""
     monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
-    _fake_search_with_papers(monkeypatch, sample_papers)
     calls = _stub_enrich_collection(monkeypatch)
     code = cli_module.main(
         [
@@ -786,10 +732,9 @@ def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, sample_papers)
     assert calls == []
 
 
-def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, sample_papers):
+def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, patched_pipeline):
     """No ANTHROPIC_API_KEY → no Anthropic call, lightweight deck."""
     monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
-    _fake_search_with_papers(monkeypatch, sample_papers)
     calls = _stub_enrich_collection(monkeypatch)
     code = cli_module.main(
         ["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
@@ -798,11 +743,10 @@ def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, sample_papers):
     assert calls == []
 
 
-def test_cli_explicit_enrich_still_works(tmp_path, monkeypatch, sample_papers):
+def test_cli_explicit_enrich_still_works(tmp_path, monkeypatch, patched_pipeline):
     """--enrich runs even without a key in env (the explicit path used to
     error inside the API client; here we only check the CLI dispatch)."""
     monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
-    _fake_search_with_papers(monkeypatch, sample_papers)
     calls = _stub_enrich_collection(monkeypatch)
     code = cli_module.main(
         [
-Original file line number
+Diff line change
@@ Expand Up @@
     **Loading source plugins:** at runtime `thesisagents/fetchers/base.py::load_fetcher(name)` does `importlib.import_module(f"thesisagents.sources.{name}")`. No more `sys.path` injection — sources are a regular sub-package of `thesisagents`. Plugins under `thesisagents/sources/<name>/` use **relative imports** internally (`from .fetcher import …`, `from .parser import …`, `from . import webrunner_backend`). Tests import them via the full path (`from thesisagents.sources.<name>.fetcher import …`) and reference module-string targets the same way (e.g. `monkeypatch.setattr("thesisagents.sources.acm.fetcher.get_client", …)`).
+    **Query semantics are enforced in core, never delegated to plugins.** Every filter field on `Query` (`min_citations`, `year_from` / `year_to`, `top_tier_only`) MUST be applied by the pipeline (`thesisagents/core/pipeline.py::run_search`, after dedup + rank) so it holds for **all** sources uniformly. A plugin MAY *additionally* push the filter to its upstream API as an optimisation (e.g. `semantic_scholar` sends `minCitationCount`), but core must never *rely* on plugins to do it. **Why (real incident):** `min_citations` was accepted by CLI/MCP/`Query`, documented as a threshold filter, and implemented by only the `semantic_scholar` plugin — so for the other 14 sources it silently did nothing; `min_citations=50` still returned 2-citation arXiv / OpenAlex / DBLP papers. **Anti-pattern:** wiring a `Query` filter into one source's request params and assuming the result set is filtered. **Pattern:** filter in `run_search`, and keep records whose value is *unknown* (`citation_count is None` / `year is None`) rather than treating unknown as failing the threshold (a source not reporting a count must not silently drop the paper).
     **When in doubt:** "if a user installs ThesisAgents with the default `requirements.txt` and never enables a source plugin, should this source work?" Yes → core. No → source plugin.
     ---
@@ Expand Down @@