Reliability hardening suite#261
Conversation
Add typed timeout errors for LLM and embedding calls and wrap async provider operations with asyncio.wait_for. Cover base, LiteLLM, and OpenRouter async paths with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add typed wrapping and error-level logging around high-risk broad exception paths while preserving debug tracebacks. Cover connection, provider, loader, pipeline, retrieval, and history validation error behavior.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add regression tests for async context manager cleanup, close failure propagation, and inner-exception preservation.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a typed latency budget error and enforce Context budgets before retrieval phases, helper I/O, graph config probes, Cypher calls, and completion LLM calls. Cover propagation and phase gating with regression tests.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an explicit integration marker, run marked real-FalkorDB tests in CI, document docker-compose usage, and expose the FalkorDB browser port in local compose.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Validate Python distributions before trusted PyPI publishing, upload release artifacts, enable manual docs deploys, and add Dependabot coverage for actions and Python dependencies.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds latency-budget enforcement and typed timeouts across providers and retrievals, threads ctx and timeouts through GraphRAG facade and strategies, introduces timeout helpers and new exception types, tightens history validation and logging, updates CI/workflows and docker/dev docs, and adds extensive tests for budgets, timeouts, and logging. ChangesLatency Budget Enforcement System
Sequence Diagram(s)sequenceDiagram
participant Client
participant GraphRAG
participant Context
participant Retrieval
participant Embedder
participant LLM
Client->>GraphRAG: retrieve(query, ctx)
GraphRAG->>Context: ensure_budget(graph-config)
GraphRAG->>GraphRAG: _validate_graph_config(ctx)
GraphRAG->>Context: ensure_budget(completion retrieval)
GraphRAG->>Retrieval: search(ctx)
Retrieval->>Context: ensure_budget(query embedding)
Retrieval->>Embedder: aembed_query(text, timeout=ctx.remaining_budget_seconds)
Retrieval->>Context: ensure_budget(chunk search)
Retrieval->>Retrieval: vector_store.search_chunks(...)
Retrieval->>Context: ensure_budget(rerank/embed)
Retrieval->>Embedder: aembed_documents(..., timeout=ctx.remaining_budget_seconds)
GraphRAG->>Context: ensure_budget(completion LLM call)
GraphRAG->>LLM: ainvoke_messages(messages, timeout=ctx.remaining_budget_seconds)
LLM-->>GraphRAG: response
GraphRAG-->>Client: result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Deep-dive review of the reliability hardening suite. Solid intent and real value, but three load-bearing issues need addressing before merge. Posted as line comments — summary below.
Blocking (Critical):
- C1:
connection.pysilently wraps every DB error inDatabaseError— breaking change for any downstreamexcept falkordb.<Specific>Errorclause, andFalkorDBConnectionis a re-exported public class. - C2: the latency budget is never wired into provider
timeout=. Once a slow LLM call is in flight, the budget can fly past the threshold and nothing aborts it. - C3:
aembed_documents(..., timeout=10)applies the timeout per batch (and per binary-split sub-call), not as an overall deadline — the public contract reads as 10s but wall-clock total can be 50s+.
Wins worth keeping: the MENTIONED_IN cosine-rank merge from #259, the history-validation refactor in api/main.py (error messages were wrong before, are right now), the secret-leakage test in test_providers.py, the PdfLoader open/finally fix. Tests pass cleanly (393/393 on touched files).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py (1)
161-166:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse an overall deadline for retried LLM calls.
Like the LiteLLM path, this treats
timeoutas per-attempt, not per-operation. With retries enabled, the request can exceed the caller's remaining latency budget by the backoff plus extra attempts. Recompute a remaining timeout from a monotonic deadline on each loop iteration so the wholeainvoke*call stays within the passed budget.Also applies to: 216-221
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py` around lines 161 - 166, The current call to wait_for_provider_call(client.chat.completions.create(**create_kwargs), timeout=timeout, ...) treats timeout as per-attempt; change the retry loop to compute a monotonic deadline once at the start and, before each attempt, recompute remaining_timeout = max(0, deadline - time.monotonic()) and pass that remaining_timeout into wait_for_provider_call (and into any backoff/wait logic) so the whole ainvoke/chat completion operation respects the original budget; apply the same deadline-based timeout recomputation to the other OpenRouter completion call site that uses client.chat.completions.create and model_name.graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py (1)
138-143:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMake
timeouta deadline across retries.These calls reuse the same
timeouton every retry attempt, soainvoke(..., timeout=2, max_retries=3)can still run well past 2 seconds once retry backoff is included. That breaks the new latency-budget plumbing, because callers now passctx.remaining_budget_secondsexpecting the whole provider operation to stay inside the remaining budget. Mirror thedeadline/remaining_timeout()pattern already used inaembed_documentsso retries and sleeps consume the same budget.Also applies to: 227-232
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py` around lines 138 - 143, The wait_for_provider_call invocation in LiteLLM methods (e.g., the acompletion call that uses self._completion_kwargs and model_name) currently reuses the same timeout for every retry, which can exceed a caller's overall latency budget; change the logic to compute a deadline/remaining timeout per attempt (mirror the remaining_timeout()/deadline pattern used in aembed_documents) so each retry and any backoff/sleep consumes from the same budget and you pass the updated remaining timeout into wait_for_provider_call (and any sleep/backoff) rather than the original static timeout; update both the completion path around acompletion and the other occurrence noted (lines ~227-232) to use this per-attempt remaining timeout.
🧹 Nitpick comments (1)
graphrag_sdk/tests/test_connection.py (1)
117-117: ⚡ Quick winTighten exception type assertions to match the “typed” intent.
Using
pytest.raises(Exception)is too broad and may pass on unintended exception paths. Use explicit exception classes inside_effectand assert those exact classes are propagated.✅ Suggested test hardening
- mock_graph.query = AsyncMock(side_effect=Exception("always fails")) + class AlwaysFailsError(Exception): + pass + mock_graph.query = AsyncMock(side_effect=AlwaysFailsError("always fails")) @@ - with pytest.raises(Exception, match="always fails"): + with pytest.raises(AlwaysFailsError, match="always fails"): await conn.query("MATCH (n) RETURN n") @@ - mock_graph.query = AsyncMock(side_effect=Exception("already indexed")) + class AlreadyIndexedError(Exception): + pass + mock_graph.query = AsyncMock(side_effect=AlreadyIndexedError("already indexed")) @@ - with pytest.raises(Exception, match="already indexed"): + with pytest.raises(AlreadyIndexedError, match="already indexed"): await conn.query("CREATE INDEX idx")Also applies to: 131-132
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/tests/test_connection.py` at line 117, The tests use pytest.raises(Exception, match="always fails") which is too broad; change the mocked side_effect to raise a specific exception class (e.g., RuntimeError or the actual custom exception used in the code) and update the pytest.raises call(s) to assert that exact exception type instead of Exception for the occurrences where pytest.raises(Exception, match="always fails") appears; ensure both instances (the shown pytest.raises call and the other occurrence later in the file tied to the same mock/side_effect) use the same explicit exception class so the test verifies the intended error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/providers/_timeout.py`:
- Around line 18-27: The validate_timeout function currently raises a raw
ValueError when timeout <= 0 which bypasses the typed timeout error contract;
change validate_timeout(timeout: float | None) to raise the module's
timeout_error (not ValueError) when timeout is not None and timeout <= 0,
including the timeout value in the message (e.g. f"timed out after
{timeout:.3g}s") so callers and the except block keep receiving the typed
timeout error; update references to validate_timeout and keep the existing await
asyncio.wait_for/except behavior unchanged.
In `@graphrag_sdk/tests/test_providers.py`:
- Around line 729-734: The wall-clock assertion in the test around
embedder.aembed_documents is too strict and can flake; relax the timing ceiling
or mock the clock: change the assertion `assert time.monotonic() - started <
0.05` to a looser bound (for example `< 0.1` or `< 0.2`) or replace
time.monotonic with a mocked clock to avoid scheduler jitter, leaving the rest
of the test (the with pytest.raises(EmbeddingTimeoutError) block and the `assert
mock_litellm.aembedding.await_count == 2`) unchanged.
---
Outside diff comments:
In `@graphrag_sdk/src/graphrag_sdk/core/providers/litellm.py`:
- Around line 138-143: The wait_for_provider_call invocation in LiteLLM methods
(e.g., the acompletion call that uses self._completion_kwargs and model_name)
currently reuses the same timeout for every retry, which can exceed a caller's
overall latency budget; change the logic to compute a deadline/remaining timeout
per attempt (mirror the remaining_timeout()/deadline pattern used in
aembed_documents) so each retry and any backoff/sleep consumes from the same
budget and you pass the updated remaining timeout into wait_for_provider_call
(and any sleep/backoff) rather than the original static timeout; update both the
completion path around acompletion and the other occurrence noted (lines
~227-232) to use this per-attempt remaining timeout.
In `@graphrag_sdk/src/graphrag_sdk/core/providers/openrouter.py`:
- Around line 161-166: The current call to
wait_for_provider_call(client.chat.completions.create(**create_kwargs),
timeout=timeout, ...) treats timeout as per-attempt; change the retry loop to
compute a monotonic deadline once at the start and, before each attempt,
recompute remaining_timeout = max(0, deadline - time.monotonic()) and pass that
remaining_timeout into wait_for_provider_call (and into any backoff/wait logic)
so the whole ainvoke/chat completion operation respects the original budget;
apply the same deadline-based timeout recomputation to the other OpenRouter
completion call site that uses client.chat.completions.create and model_name.
---
Nitpick comments:
In `@graphrag_sdk/tests/test_connection.py`:
- Line 117: The tests use pytest.raises(Exception, match="always fails") which
is too broad; change the mocked side_effect to raise a specific exception class
(e.g., RuntimeError or the actual custom exception used in the code) and update
the pytest.raises call(s) to assert that exact exception type instead of
Exception for the occurrences where pytest.raises(Exception, match="always
fails") appears; ensure both instances (the shown pytest.raises call and the
other occurrence later in the file tied to the same mock/side_effect) use the
same explicit exception class so the test verifies the intended error path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 46226887-d6df-4e6d-8508-2f1351babea6
📒 Files selected for processing (21)
docker-compose.ymlgraphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/core/connection.pygraphrag_sdk/src/graphrag_sdk/core/context.pygraphrag_sdk/src/graphrag_sdk/core/providers/_timeout.pygraphrag_sdk/src/graphrag_sdk/core/providers/base.pygraphrag_sdk/src/graphrag_sdk/core/providers/litellm.pygraphrag_sdk/src/graphrag_sdk/core/providers/openrouter.pygraphrag_sdk/src/graphrag_sdk/retrieval/reranking_strategies/cosine.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/chunk_retrieval.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/cypher_generation.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/entity_discovery.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/local.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/multi_path.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/relationship_expansion.pygraphrag_sdk/src/graphrag_sdk/retrieval/strategies/result_assembly.pygraphrag_sdk/tests/test_connection.pygraphrag_sdk/tests/test_facade.pygraphrag_sdk/tests/test_multi_path_retrieval.pygraphrag_sdk/tests/test_providers.pygraphrag_sdk/tests/test_retrieval.py
✅ Files skipped from review due to trivial changes (1)
- docker-compose.yml
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/src/graphrag_sdk/core/context.py`:
- Line 57: The method remaining_budget_seconds currently returns a tiny positive
1e-9 even when the budget is exhausted, which conflicts with budget_exceeded and
can allow a call to start with a non-zero timeout; update
remaining_budget_seconds (in the Context class/function named
remaining_budget_seconds) to return 0.0 when remaining milliseconds <= 0 (e.g.,
if remaining <= 0 return 0.0 else return remaining/1000.0) so an exhausted
budget yields exactly zero seconds and aligns with budget_exceeded semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6f48d8e8-50d7-492e-8564-7ec08c41f346
📒 Files selected for processing (3)
graphrag_sdk/src/graphrag_sdk/core/context.pygraphrag_sdk/src/graphrag_sdk/core/providers/_timeout.pygraphrag_sdk/tests/test_providers.py
Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Review
Testing
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Tests