Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .claude/agents/rules/code-quality-reviewer.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,7 @@ You do NOT modify files. The parent agent decides whether to fix.
- Per-source rate limits live in `thesisagents/fetchers/rate_limit.py` as a token-bucket decorator. Each source plugin declares its own bucket (`arxiv: 1 req/3s`, `semantic_scholar: 1 req/s`, `scholar: 1 req/10s with jitter`, etc.). Do NOT bypass the bucket — even retries go through it.
- Streamlit runs the UI on a separate thread per session. Mutate `st.session_state` only, never module globals. Long-running export jobs are dispatched to the FastAPI backend and polled.
- All fixture-recording, CLI exports, and tests use `asyncio.run` at the outermost layer and never inside library code.
- **Bounded fan-out over a paper collection.** A per-source token bucket caps traffic *within* one source, but a stage that fans one operation out over N papers (PDF download, OA resolve, LLM enrich) escapes those buckets — several papers can hit the *same* external API (Anthropic) or publisher CDN (`dl.acm.org`) at once through different source clients. Any `asyncio.gather(*(coro(item) for item in collection.papers))` MUST wrap each task in a module-level `asyncio.Semaphore` cap (the project uses 4–5: `pipeline._ENRICH_CONCURRENCY`, `pdf_download._DOWNLOAD_CONCURRENCY`, `oa_resolver._CONCURRENCY`). **Why:** an unbounded fan-out of a 25-paper search fires 25 concurrent Anthropic calls (an easy 429) or 25 hits on one CDN (an IP block) — it looks fine at 3 papers and fails at 30. **Anti-pattern:** `await asyncio.gather(*(_download_one(p, d) for p in collection.papers))` with no semaphore. **Pattern:** define `sem = asyncio.Semaphore(max(1, concurrency))` and a `_bounded` wrapper that does `async with sem: return await coro(item)`.

## Security (review-level)

Expand Down
2 changes: 2 additions & 0 deletions .claude/agents/rules/compliance-auditor.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,8 @@ The line between `thesisagents/` and `sources/<name>/` is **not** "anything sour

**Loading source plugins:** at runtime `thesisagents/fetchers/base.py::load_fetcher(name)` does `importlib.import_module(f"thesisagents.sources.{name}")`. No more `sys.path` injection — sources are a regular sub-package of `thesisagents`. Plugins under `thesisagents/sources/<name>/` use **relative imports** internally (`from .fetcher import …`, `from .parser import …`, `from . import webrunner_backend`). Tests import them via the full path (`from thesisagents.sources.<name>.fetcher import …`) and reference module-string targets the same way (e.g. `monkeypatch.setattr("thesisagents.sources.acm.fetcher.get_client", …)`).

**Query semantics are enforced in core, never delegated to plugins.** Every filter field on `Query` (`min_citations`, `year_from` / `year_to`, `top_tier_only`) MUST be applied by the pipeline (`thesisagents/core/pipeline.py::run_search`, after dedup + rank) so it holds for **all** sources uniformly. A plugin MAY *additionally* push the filter to its upstream API as an optimisation (e.g. `semantic_scholar` sends `minCitationCount`), but core must never *rely* on plugins to do it. **Why (real incident):** `min_citations` was accepted by CLI/MCP/`Query`, documented as a threshold filter, and implemented by only the `semantic_scholar` plugin — so for the other 14 sources it silently did nothing; `min_citations=50` still returned 2-citation arXiv / OpenAlex / DBLP papers. **Anti-pattern:** wiring a `Query` filter into one source's request params and assuming the result set is filtered. **Pattern:** filter in `run_search`, and keep records whose value is *unknown* (`citation_count is None` / `year is None`) rather than treating unknown as failing the threshold (a source not reporting a count must not silently drop the paper).

**When in doubt:** "if a user installs ThesisAgents with the default `requirements.txt` and never enables a source plugin, should this source work?" Yes → core. No → source plugin.

---
Expand Down
3 changes: 2 additions & 1 deletion docs/cli.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ exclusive modes:
thesisagents (--query KEYWORDS | --paper IDENTIFIER)
[--source SOURCES] [--exclude-source SOURCES]
[--max N]
[--year-from YEAR] [--year-to YEAR]
[--year-from YEAR] [--year-to YEAR] [--min-citations N]
[--export FORMATS]
[--out DIR]
[--filename-stem STEM]
Expand All @@ -39,6 +39,7 @@ thesisagents (--query KEYWORDS | --paper IDENTIFIER)
| `--exclude-source` / `-x` | — | Comma-separated sources to remove from the mix, subtracted **after** `--source` resolves. The no-VPN gesture is to leave `--source` at its default and pass `--exclude-source ieee` — every other default source stays in. An unknown name (or excluding the whole mix) is an error. |
| `--max` / `-n` | `25` | Range 1..200. |
| `--year-from`, `--year-to` | — | Inclusive year filter. |
| `--min-citations` | — | Drop papers below this citation count, enforced across **all** sources (not just Semantic Scholar). Papers whose source reports no count are kept. Omit for no minimum. |
| `--export` / `-e` | mode-specific | Any of `pptx`, `xlsx`, `md`, `bib`, `json`, `ris`, `csv`, `csl`. **Default with `--query` is `pptx,xlsx,bib`; default with `--paper` is `pptx,bib`** (one-row Excel is busy work). Explicit `--export` always wins. `ris` (Zotero/Mendeley/EndNote), `csv` (flat table) and `csl` (CSL-JSON for Pandoc, written as `<stem>.csl.json`) are the interchange formats. |
| `--list-sources`, `--list-exports` | — | Print the available search sources / export formats and exit (no query needed). |
| `--out` / `-o` | `./exports` | Created if missing. |
Expand Down
146 changes: 45 additions & 101 deletions tests/test_cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,16 +43,28 @@ async def _fake_success(collection, out_dir):

@pytest.fixture()
def patched_pipeline(monkeypatch, sample_papers):
async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
"""Stub cli's run_search + shutdown_clients and return a dict that captures
the Query passed to run_search (``patched_pipeline["query"]``).

Tests that only need the pipeline stubbed ignore the return value; tests
that assert on the constructed Query read ``patched_pipeline["query"]``
instead of each re-rolling their own ``fake_run_search`` / ``fake_shutdown``
boilerplate.
"""
captured: dict[str, Query] = {}

async def fake_run_search(query, **_kwargs): # NOSONAR async stub
captured["query"] = query
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None:
async def fake_shutdown(): # NOSONAR async stub
return None

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
# Leave the autouse ``_stub_download_pdfs`` in place so the new per-paper
# PPT gate sees every paper as downloadable.
return captured


def test_cli_runs_end_to_end(tmp_path: Path, patched_pipeline, capsys):
Expand Down Expand Up @@ -248,28 +260,16 @@ def test_cli_rejects_doi_identifier_until_resolver_lands(tmp_path):
assert code == 2


def test_cli_source_default_is_multi_source(tmp_path, monkeypatch, sample_papers):
def test_cli_source_default_is_multi_source(tmp_path, patched_pipeline):
"""When --source is omitted, run_search must be invoked across the
DEFAULT_SOURCES mix, not just arxiv."""
from thesisagents.core.constants import DEFAULT_SOURCES

captured: dict[str, Query] = {}

async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
captured["query"] = query
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None:
return None

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)

code = cli_module.main(
["--query", "x", "--out", str(tmp_path), "--export", "bib"]
)
assert code == 0
assert captured["query"].sources == DEFAULT_SOURCES
assert patched_pipeline["query"].sources == DEFAULT_SOURCES


def test_cli_list_sources(capsys):
Expand All @@ -293,31 +293,30 @@ def test_cli_list_exports(capsys):
assert "pptx" in out


def test_cli_exclude_source_prunes_default_mix(tmp_path, monkeypatch, sample_papers):
def test_cli_exclude_source_prunes_default_mix(tmp_path, patched_pipeline):
"""--exclude-source subtracts from the resolved mix; the no-VPN path drops
only ieee and keeps every other default source."""
from thesisagents.core.constants import DEFAULT_SOURCES

captured: dict[str, Query] = {}

async def fake_run_search(query: Query, **_kwargs) -> PaperCollection: # NOSONAR async stub
captured["query"] = query
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None: # NOSONAR async stub
return None

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)

code = cli_module.main(
["--query", "x", "--out", str(tmp_path), "--export", "bib",
"--exclude-source", "ieee"]
)
assert code == 0
assert "ieee" not in captured["query"].sources
assert "ieee" not in patched_pipeline["query"].sources
expected = tuple(s for s in DEFAULT_SOURCES if s != "ieee")
assert captured["query"].sources == expected
assert patched_pipeline["query"].sources == expected


def test_cli_min_citations_flows_into_query(tmp_path, patched_pipeline):
"""--min-citations is parsed and passed through to the Query (it was
previously unreachable from the CLI)."""
code = cli_module.main(
["--query", "x", "--out", str(tmp_path), "--export", "bib",
"--min-citations", "50"]
)
assert code == 0
assert patched_pipeline["query"].min_citations == 50


def test_cli_exclude_unknown_source_errors(tmp_path):
Expand All @@ -337,90 +336,50 @@ def test_cli_exclude_all_sources_errors(tmp_path):
)


def test_cli_top_tier_filter_off_by_default(tmp_path, monkeypatch, sample_papers):
def test_cli_top_tier_filter_off_by_default(tmp_path, patched_pipeline):
"""top_tier_only is OFF by default (broader coverage including IEEE / ACM
workshops); --top-tier-only flips it on."""
captured: dict[str, Query] = {}

async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
captured["query"] = query
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None:
return None

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)

code = cli_module.main(
["--query", "x", "--out", str(tmp_path), "--export", "bib"]
)
assert code == 0
assert captured["query"].top_tier_only is False
assert patched_pipeline["query"].top_tier_only is False

captured.clear()
patched_pipeline.clear()
code = cli_module.main(
["--query", "x", "--top-tier-only", "--out", str(tmp_path), "--export", "bib"]
)
assert code == 0
assert captured["query"].top_tier_only is True
assert patched_pipeline["query"].top_tier_only is True


def test_cli_default_triggers_pdf_download(tmp_path, monkeypatch, sample_papers):
def test_cli_default_triggers_pdf_download(tmp_path, monkeypatch, patched_pipeline):
"""Default flag set should invoke download_pdfs; --no-pdf disables it."""
calls: list[str] = []

async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None:
return None

async def fake_download(_collection, _out_dir):
async def fake_download(_collection, _out_dir): # NOSONAR async stub
calls.append("called")
return []

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
monkeypatch.setattr(cli_module, "download_pdfs", fake_download)

code = cli_module.main(
[
"--query", "x",
"--source", "arxiv",
"--out", str(tmp_path),
"--export", "bib",
]
["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
)
assert code == 0
assert calls == ["called"]


def test_cli_no_pdf_flag_skips_download(tmp_path, monkeypatch, sample_papers):
def test_cli_no_pdf_flag_skips_download(tmp_path, monkeypatch, patched_pipeline):
calls: list[str] = []

async def fake_run_search(query: Query, **_kwargs) -> PaperCollection:
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown() -> None:
return None

async def fake_download(_collection, _out_dir):
async def fake_download(_collection, _out_dir): # NOSONAR async stub
calls.append("called")
return []

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)
monkeypatch.setattr(cli_module, "download_pdfs", fake_download)

code = cli_module.main(
[
"--query", "x",
"--source", "arxiv",
"--no-pdf",
"--out", str(tmp_path),
"--export", "bib",
]
["--query", "x", "--source", "arxiv", "--no-pdf",
"--out", str(tmp_path), "--export", "bib"]
)
assert code == 0
assert calls == []
Expand Down Expand Up @@ -747,21 +706,9 @@ async def fake_enrich(collection, language=None, model=None): # noqa: ARG001
return calls


def _fake_search_with_papers(monkeypatch, sample_papers):
async def fake_run_search(query, **_kwargs):
return PaperCollection(query=query, papers=tuple(sample_papers))

async def fake_shutdown():
return None

monkeypatch.setattr(cli_module, "run_search", fake_run_search)
monkeypatch.setattr(cli_module, "shutdown_clients", fake_shutdown)


def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, sample_papers):
def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, patched_pipeline):
"""ANTHROPIC_API_KEY in env + no --lightweight = auto-enrich fires."""
monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
_fake_search_with_papers(monkeypatch, sample_papers)
calls = _stub_enrich_collection(monkeypatch)
code = cli_module.main(
["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
Expand All @@ -770,10 +717,9 @@ def test_cli_auto_enriches_when_api_key_set(tmp_path, monkeypatch, sample_papers
assert calls == ["called"]


def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, sample_papers):
def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, patched_pipeline):
"""--lightweight wins over the auto-enrich default."""
monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
_fake_search_with_papers(monkeypatch, sample_papers)
calls = _stub_enrich_collection(monkeypatch)
code = cli_module.main(
[
Expand All @@ -786,10 +732,9 @@ def test_cli_lightweight_skips_auto_enrich(tmp_path, monkeypatch, sample_papers)
assert calls == []


def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, sample_papers):
def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, patched_pipeline):
"""No ANTHROPIC_API_KEY → no Anthropic call, lightweight deck."""
monkeypatch.delenv("ANTHROPIC_API_KEY", raising=False)
_fake_search_with_papers(monkeypatch, sample_papers)
calls = _stub_enrich_collection(monkeypatch)
code = cli_module.main(
["--query", "x", "--source", "arxiv", "--out", str(tmp_path), "--export", "bib"]
Expand All @@ -798,11 +743,10 @@ def test_cli_no_key_does_not_auto_enrich(tmp_path, monkeypatch, sample_papers):
assert calls == []


def test_cli_explicit_enrich_still_works(tmp_path, monkeypatch, sample_papers):
def test_cli_explicit_enrich_still_works(tmp_path, monkeypatch, patched_pipeline):
"""--enrich runs even without a key in env (the explicit path used to
error inside the API client; here we only check the CLI dispatch)."""
monkeypatch.setenv("ANTHROPIC_API_KEY", "test-key")
_fake_search_with_papers(monkeypatch, sample_papers)
calls = _stub_enrich_collection(monkeypatch)
code = cli_module.main(
[
Expand Down
Loading
Loading