fix: address Copilot review on utils-split PR#182
fix: address Copilot review on utils-split PR#182denis-samatov wants to merge 8 commits intoVectifyAI:mainfrom
Conversation
- core/pdf.py: add KeyError fallback in get_page_tokens() for unknown model names, consistent with count_tokens() in core/llm.py - tests/conftest.py: fix sys.path to point at project root instead of non-existent '../src' directory - pageindex/utils.py: add backward-compatible wrappers for removed legacy symbols llm_completion and llm_acompletion so existing callers don't raise NameError at runtime
There was a problem hiding this comment.
Pull request overview
This PR follows up on the earlier feature/utils-split work by wiring pageindex to the new core/* modules, restoring legacy pageindex.utils symbols, and adding/adjusting tests + config loading to match the refactor.
Changes:
- Re-export
coremodules viapageindex/utils.pyand add legacyllm_completion/llm_acompletionwrappers. - Move
page_index.pyofffrom .utils import *onto explicitcore.*imports and update call sites accordingly. - Add
core/modules (llm,pdf,tree,logging), introduce Pydantic-based config loading, and add unit tests.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
pageindex/utils.py |
Re-exports core modules and adds backward-compat wrappers for removed legacy symbols. |
pageindex/page_index.py |
Switches to explicit imports from core/* and updates LLM call sites and option handling. |
pageindex/core/llm.py |
Adds OpenAI client helpers, token counting, and JSON extraction utilities. |
pageindex/core/pdf.py |
Adds PDF helpers and per-page token counting with a tiktoken fallback. |
pageindex/core/tree.py |
Adds tree/structure utilities and summary generation helpers. |
pageindex/core/logging.py |
Adds JSON file logger used by indexing pipeline. |
pageindex/config.py |
Introduces Pydantic-backed PageIndexConfig + loader. |
config.yaml |
Adds repo-level default configuration values. |
pyproject.toml |
Declares package metadata + pinned dependencies. |
tests/conftest.py |
Fixes sys.path injection to project root for tests. |
tests/test_tree.py |
Adds tests for new tree utilities. |
tests/test_llm.py |
Adds tests for JSON extraction and token counting. |
tests/test_config.py |
Adds tests for config loading and validation behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- core/tree.py: fix IndentationError in add_node_text() and add_node_text_with_labels() — extra space on node['text'] = ... line - core/pdf.py: guard against None from page.extract_text() in both extract_text_from_pdf() and get_page_tokens() with 'or ""' - utils.py: use functools.partial in run_in_executor fallback to safely pass kwargs; add return_finish_reason routing in llm_completion wrapper; remove unused _core_pkg import - page_index.py: load default ConfigLoader when opt is None in page_index_main() to prevent model=None crash; fix tob_ typo in single_toc_item_index_fixer()
JiwaniZakir
left a comment
There was a problem hiding this comment.
The default values in config.yaml and PageIndexConfig in pageindex/config.py are inconsistent in several places: toc_check_page_num is 20 vs. 3, max_page_num_each_node is 10 vs. 5, max_token_num_each_node is 20000 vs. 4000, and both if_add_doc_description and if_add_node_text are false in the YAML but True in the Pydantic model. Since ConfigLoader falls back to the Pydantic defaults when no config file is found (e.g., during testing or when the package is installed without the repo's config.yaml), this creates two distinct behavioral profiles depending on the deployment context — likely unintentional. The Pydantic field defaults should either match config.yaml exactly, or config.yaml should be the sole source of truth and the Pydantic defaults should be treated as last-resort fallbacks with that clearly documented.
Additionally, in ConfigLoader.load, filtering out None values from user_dict (if v is not None) silently prevents a caller from explicitly clearing a field that has a non-None default (e.g., setting api_key=None to override a key loaded from the YAML). Consider distinguishing between "not provided" and "explicitly set to None," possibly by using MISSING sentinel values or model_fields_set from Pydantic v2.
…utils-split # Conflicts: # examples/documents/conftest.py # examples/documents/test_config.py # examples/documents/test_llm.py # examples/documents/test_tree.py # pageindex/config.yaml # pageindex/utils.py
Summary
This PR addresses the automated Copilot review comments left on the
feature/utils-splitPR.Changes
pageindex/core/pdf.pyKeyErrorcrash:get_page_tokens()now wrapstiktoken.encoding_for_model()intry/except KeyErrorand falls back tocl100k_base— consistent with the same fallback logic already incore/llm.py'scount_tokens()tests/conftest.py../srcwith..(project root). The repo has nosrc/directory, so the previous path was silently adding a non-existent location to sys.path, masking potential import failurespageindex/utils.pyllm_completionandllm_acompletion. Without these, any code importing frompageindex.utilswould raiseNameErrorat runtime. The wrappers delegate to the newChatGPT_APIandChatGPT_API_asyncincore/llm, with a thread-executor fallback for the async pathTesting
All 11 unit tests pass: