241 issue expand document loader coverage#262
Conversation
…te() Changes the default chunker that ``GraphRAG.ingest()`` and ``GraphRAG.update()`` fall back to when the caller doesn't pass an explicit ``chunker=``. Was ``FixedSizeChunking()``; now ``SentenceTokenCapChunking()`` (sentence-aware, max_tokens=512, overlap_sentences=2 — the strategy's own defaults). Why --- ``FixedSizeChunking`` splits on a hard character window with no awareness of sentence, word, or paragraph boundaries. When the window cuts through an entity name, the per-chunk LLM extractor produces a stub entity for the fragment (``"Wayne Enterprises"`` → ``"Wayne En"`` in chunk N plus unparsable text in chunk N+1). These stubs never merge with their full forms during resolution because their embeddings differ enough that LLMVerifiedResolution scores them below the soft threshold. This silently inflates cypher counts and pollutes "which X" lists. The strategy that surfaced this — ``CypherFirstAggregationStrategy`` — was hitting a 6/7 ceiling on the internal aggregation benchmark with one question failing because of these stubs. Switching to ``SentenceTokenCapChunking`` cleared the benchmark to 7/7 stable across three runs, and the post-ingest graph state went from 11-14 organization nodes (including ``Glo`` / ``Initech System`` / ``Wayne En``) to exactly 10 clean orgs, and from 66-80 ``Person`` nodes (with ``Carla`` / ``Carla Okafor`` duplicates) to exactly 56 distinct persons — matching the corpus. A side benefit: sentence-aware chunks with 2-sentence overlap almost always keep a person's first mention in the same chunk as their later short-form references, so per-chunk FastCoref now binds ``Carla → Carla Okafor`` reliably. That eliminates the short-form-duplicate class too, not just the truncation stubs. Compatibility ------------- ``FixedSizeChunking`` remains exported and fully supported — callers who explicitly pass ``chunker=FixedSizeChunking()`` get unchanged behavior. Existing tests (748 passed, 24 skipped) pass without modification: no test in the suite asserts on chunk count or content shape from the default chunker, so switching defaults doesn't break the suite. Callers who relied on the previous default and want to keep it should pass ``chunker=FixedSizeChunking()`` explicitly. The docstrings call out the new default and reference ``FixedSizeChunking`` as the opt-in character-window alternative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdds DoclingBaseLoader and format-specific loaders (DOCX/XLSX/PPTX/HTML/CSV), exports them from the loaders package, adds an optional ChangesDocling-based document loaders and API updates
Sequence DiagramsequenceDiagram
participant Client
participant DoclingBaseLoader
participant DocumentConverter
participant GraphRAG
Client->>DoclingBaseLoader: load(source, ctx)
DoclingBaseLoader->>DoclingBaseLoader: asyncio.to_thread(_load_sync)
DoclingBaseLoader->>DocumentConverter: create & convert(source)
DocumentConverter-->>DoclingBaseLoader: iterated items (label, text, level)
DoclingBaseLoader->>GraphRAG: map items -> DocumentElement(s)
DoclingBaseLoader-->>Client: DocumentOutput(text, elements, metadata)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/pyproject.toml`:
- Around line 59-67: The extras specification is inconsistent: the standalone
"docling" extra pins docling>=2.91.0 while the "all" extras list contains
docling>=2.0.0; update the "all" extras entry to require the same minimum
(docling>=2.91.0) so installing graphrag-sdk[all] cannot pull an older
incompatible docling version—edit the "docling" entry inside the all array in
pyproject.toml to match the docling extra's minimum version.
In `@graphrag_sdk/src/graphrag_sdk/api/main.py`:
- Around line 329-333: Update the stale loader-default docstring that currently
reads "Loader: auto-detected from file extension (PDF or text)" so it reflects
the new extension routing; locate the docstring containing that exact phrase in
graphrag_sdk/api/main.py (the help/usage text shown in the diff) and replace it
with a concise description like "Loader: auto-detected from file extension (PDF,
DOCX, XLSX, PPTX, HTML/XHTML, CSV, or plain text)" so the user-facing docs list
the supported formats.
In `@graphrag_sdk/tests/test_docling_loaders.py`:
- Around line 24-27: The test is breaking the import system by having the
patched builtins.__import__ return None for non-target imports; change the patch
side_effect to delegate to the real importer for all names except the one you
want to simulate failing (i.e., capture the original_import =
builtins.__import__ and in the side_effect for the patch used around loader.load
call, raise ImportError("module not found") when name ==
"docling.document_converter" and otherwise return original_import(name, *args,
**kwargs)); keep the pytest.raises assertion around await loader.load(str(file),
ctx) unchanged and reference the patched builtins.__import__ side_effect that
delegates for everything except "docling.document_converter".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6566ed76-11b6-4ce9-8ef8-6fb548181321
📒 Files selected for processing (10)
graphrag_sdk/pyproject.tomlgraphrag_sdk/src/graphrag_sdk/api/main.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/__init__.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/csv_loader.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/docling_base.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/docx_loader.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/html_loader.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/pptx_loader.pygraphrag_sdk/src/graphrag_sdk/ingestion/loaders/xlsx_loader.pygraphrag_sdk/tests/test_docling_loaders.py
Path C in retrieve_chunks used `COLLECT(c)[..3]` with no ORDER BY, so hub entities (which can be MENTIONED_IN hundreds of chunks) returned an arbitrary 3 — almost never including the chunks most relevant to the current query. Add an ORDER BY on `vec.cosineDistance(c.embedding, query_vector)` before the COLLECT so per-entity chunk selection is query-aware. Refs FalkorDB#258 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stead of actual imports
…and add debug helpers
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
test_mock_import.py (1)
9-15: ⚡ Quick winAvoid global
sys.modulesmutation at import time.Lines 9-15 mutate global module state and execute prints on import, which can leak across tests/tools.
Proposed fix
-sys.modules["docling"] = MagicMock() -sys.modules["docling.datamodel"] = MagicMock() -sys.modules["docling.datamodel.document"] = mock_datamodel - -from docling.datamodel.document import DocItemLabel -print("DocItemLabel is:", DocItemLabel) -print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None)) +def main(): + modules = { + "docling": MagicMock(), + "docling.datamodel": MagicMock(), + "docling.datamodel.document": mock_datamodel, + } + with patch.dict("sys.modules", modules): + from docling.datamodel.document import DocItemLabel + print("DocItemLabel is:", DocItemLabel) + print("DocItemLabel.LIST_ITEM is:", getattr(DocItemLabel, "LIST_ITEM", None)) + +if __name__ == "__main__": + main()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test_mock_import.py` around lines 9 - 15, The test mutates global sys.modules and does prints at import time (sys.modules["docling"], sys.modules["docling.datamodel"], sys.modules["docling.datamodel.document"], mock_datamodel) and then imports DocItemLabel, which can leak state; change this to perform the module patching inside the test or a fixture using a temporary monkeypatch (e.g., monkeypatch.setitem(sys.modules, "docling", MagicMock()) and monkeypatch.setitem(... "docling.datamodel.document", mock_datamodel)) or use importlib and context-local injection before importing, remove the top-level print statements, and reference DocItemLabel only within the test body so mock_datamodel and DocItemLabel are created and torn down per-test instead of at import time.graphrag_sdk/test_monkeypatch.py (1)
19-19: ⚡ Quick winGuard script execution behind
if __name__ == "__main__":.Line 19 causes side effects on import. In a
test_*.pyfile this is easy to trigger unintentionally.Proposed fix
-asyncio.run(main()) +if __name__ == "__main__": + asyncio.run(main())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/test_monkeypatch.py` at line 19, Current top-level call asyncio.run(main()) in test_monkeypatch.py causes the module to execute on import; wrap that invocation in a standard entry-point guard by adding if __name__ == "__main__": and moving asyncio.run(main()) inside it so async def main() only runs when the file is executed directly, not when imported by pytest or other modules.graphrag_sdk/test_monkeypatch3.py (1)
16-19: ⚡ Quick winDon’t swallow all exceptions in the failure-path check.
Catching broad
Exceptionandpasshides regressions unrelated to the mocked import failure.Proposed fix
- try: - await asyncio.to_thread(load_sync) - except Exception as e: - pass # catch the mocked exception + with pytest.raises(ImportError, match="module not found"): + await asyncio.to_thread(load_sync)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@graphrag_sdk/test_monkeypatch3.py` around lines 16 - 19, The test currently swallows all exceptions around the await asyncio.to_thread(load_sync) call; change this to only accept the specific expected failure (e.g., ModuleNotFoundError or the mocked exception class) or use pytest.raises to assert the failure. Replace the broad "except Exception as e: pass" with either "except ModuleNotFoundError: pass" (or the exact mocked exception type used in the test) or wrap the await in "with pytest.raises(ExpectedException): await asyncio.to_thread(load_sync)" so unexpected exceptions are not hidden; ensure you reference the load_sync call and asyncio.to_thread in the updated test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@graphrag_sdk/test_debug.py`:
- Around line 27-28: The debug harness currently calls
loader._load_sync("dummy_path") which fails file-existence checks; replace the
dummy string with a real temporary file path (create a temp file via
tempfile.NamedTemporaryFile or tempfile.mkstemp, write minimal content that will
exercise the converter mapping logic, flush/close it) and pass that file's path
to loader._load_sync; ensure the temp file is cleaned up after the test and keep
the existing print(result.elements) to observe outputs.
In `@graphrag_sdk/test_monkeypatch2.py`:
- Around line 4-6: The function load_sync is declared async but is passed
directly to asyncio.to_thread at its call sites (asyncio.to_thread(load_sync)),
which causes to_thread to execute a callable that returns a coroutine instead of
running the import logic; change load_sync from async def load_sync() to a
synchronous def load_sync() that performs the import and returns "success" (so
the body runs inside asyncio.to_thread), or alternatively update the call sites
to await load_sync() (or wrap with asyncio.to_thread(lambda:
asyncio.run(load_sync())) if you must keep it async).
In `@test_import_mock.py`:
- Around line 6-10: The custom import hook _import currently calls __import__
directly for non-"fake" modules which will re-enter the patched
builtins.__import__ and cause infinite recursion; fix by saving the original
import function before patching (e.g., orig_import = builtins.__import__) and
have _import delegate to orig_import(name, *args, **kwargs) for non-"fake"
cases, then install _import via patch("builtins.__import__",
side_effect=_import) so normal imports use the saved original implementation.
In `@test_patch_dict.py`:
- Around line 3-5: Convert the top-level patch.dict snippet into a real pytest
test function (e.g., def test_patch_dict_imports_mocked_module():) that uses
unittest.mock.patch.dict to temporarily inject {"docling": MagicMock(),
"docling.document_converter": MagicMock()}, then imports (or
importlib.import_module) "docling.document_converter" and asserts that the
imported object is the MagicMock from sys.modules and that inside the context
sys.modules["docling.document_converter"] is the mock; after exiting the
patch.dict context assert sys.modules has been restored to its pre-test state
(original module present or key removed) to ensure cleanup.
In `@test_sys_modules.py`:
- Around line 3-6: The module-level mutations of sys.modules
(sys.modules['docling'] and sys.modules['docling.document_converter']) should be
moved into a test-scoped patch.dict to avoid cross-test pollution; wrap the
import of docling.document_converter inside a with patch.dict("sys.modules",
{"docling": MagicMock(), "docling.document_converter": MagicMock()}): block (or
use the patch.dict decorator) so the mocked entries are only present for the
duration of the test, and ensure the import statement (import
docling.document_converter) happens inside that patched context.
---
Nitpick comments:
In `@graphrag_sdk/test_monkeypatch.py`:
- Line 19: Current top-level call asyncio.run(main()) in test_monkeypatch.py
causes the module to execute on import; wrap that invocation in a standard
entry-point guard by adding if __name__ == "__main__": and moving
asyncio.run(main()) inside it so async def main() only runs when the file is
executed directly, not when imported by pytest or other modules.
In `@graphrag_sdk/test_monkeypatch3.py`:
- Around line 16-19: The test currently swallows all exceptions around the await
asyncio.to_thread(load_sync) call; change this to only accept the specific
expected failure (e.g., ModuleNotFoundError or the mocked exception class) or
use pytest.raises to assert the failure. Replace the broad "except Exception as
e: pass" with either "except ModuleNotFoundError: pass" (or the exact mocked
exception type used in the test) or wrap the await in "with
pytest.raises(ExpectedException): await asyncio.to_thread(load_sync)" so
unexpected exceptions are not hidden; ensure you reference the load_sync call
and asyncio.to_thread in the updated test.
In `@test_mock_import.py`:
- Around line 9-15: The test mutates global sys.modules and does prints at
import time (sys.modules["docling"], sys.modules["docling.datamodel"],
sys.modules["docling.datamodel.document"], mock_datamodel) and then imports
DocItemLabel, which can leak state; change this to perform the module patching
inside the test or a fixture using a temporary monkeypatch (e.g.,
monkeypatch.setitem(sys.modules, "docling", MagicMock()) and
monkeypatch.setitem(... "docling.datamodel.document", mock_datamodel)) or use
importlib and context-local injection before importing, remove the top-level
print statements, and reference DocItemLabel only within the test body so
mock_datamodel and DocItemLabel are created and torn down per-test instead of at
import time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: af7af7df-af2f-4640-a6c3-c5b56215e9ad
📒 Files selected for processing (11)
graphrag_sdk/dummy_pathgraphrag_sdk/pyproject.tomlgraphrag_sdk/test_debug.pygraphrag_sdk/test_monkeypatch.pygraphrag_sdk/test_monkeypatch2.pygraphrag_sdk/test_monkeypatch3.pygraphrag_sdk/tests/test_docling_loaders.pytest_import_mock.pytest_mock_import.pytest_patch_dict.pytest_sys_modules.py
Summary
This feature extends the loader interface using Docling to add support for DOCX, XLSX, PPTX, CSV, HTML, and XHTML.
Changes
Test Plan
pytest tests/ -q)ruff check src/)Notes
Currently, the
GraphRAG_SDKloader interface maps one loader to one extension. Since Docling supports multiple extensions, this PR updates the current loader implementation paradigm to support multi-format loaders.Summary by CodeRabbit
New Features
Chores
Tests