Skip to content

fix: address Copilot review on utils-split PR#182

Open
denis-samatov wants to merge 8 commits intoVectifyAI:mainfrom
denis-samatov:fix/copilot-review-utils-split
Open

fix: address Copilot review on utils-split PR#182
denis-samatov wants to merge 8 commits intoVectifyAI:mainfrom
denis-samatov:fix/copilot-review-utils-split

Conversation

@denis-samatov
Copy link
Copy Markdown

Summary

This PR addresses the automated Copilot review comments left on the feature/utils-split PR.

Changes

pageindex/core/pdf.py

  • Fix KeyError crash: get_page_tokens() now wraps tiktoken.encoding_for_model() in try/except KeyError and falls back to cl100k_base — consistent with the same fallback logic already in core/llm.py's count_tokens()

tests/conftest.py

  • Fix broken sys.path: Replaced ../src with .. (project root). The repo has no src/ directory, so the previous path was silently adding a non-existent location to sys.path, masking potential import failures

pageindex/utils.py

  • Restore backward compatibility: Added explicit wrappers for the removed legacy symbols llm_completion and llm_acompletion. Without these, any code importing from pageindex.utils would raise NameError at runtime. The wrappers delegate to the new ChatGPT_API and ChatGPT_API_async in core/llm, with a thread-executor fallback for the async path

Testing

All 11 unit tests pass:

11 passed in 0.51s

- core/pdf.py: add KeyError fallback in get_page_tokens() for unknown
  model names, consistent with count_tokens() in core/llm.py
- tests/conftest.py: fix sys.path to point at project root instead of
  non-existent '../src' directory
- pageindex/utils.py: add backward-compatible wrappers for removed legacy
  symbols llm_completion and llm_acompletion so existing callers don't
  raise NameError at runtime
Copilot AI review requested due to automatic review settings March 26, 2026 13:12
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR follows up on the earlier feature/utils-split work by wiring pageindex to the new core/* modules, restoring legacy pageindex.utils symbols, and adding/adjusting tests + config loading to match the refactor.

Changes:

  • Re-export core modules via pageindex/utils.py and add legacy llm_completion / llm_acompletion wrappers.
  • Move page_index.py off from .utils import * onto explicit core.* imports and update call sites accordingly.
  • Add core/ modules (llm, pdf, tree, logging), introduce Pydantic-based config loading, and add unit tests.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pageindex/utils.py Re-exports core modules and adds backward-compat wrappers for removed legacy symbols.
pageindex/page_index.py Switches to explicit imports from core/* and updates LLM call sites and option handling.
pageindex/core/llm.py Adds OpenAI client helpers, token counting, and JSON extraction utilities.
pageindex/core/pdf.py Adds PDF helpers and per-page token counting with a tiktoken fallback.
pageindex/core/tree.py Adds tree/structure utilities and summary generation helpers.
pageindex/core/logging.py Adds JSON file logger used by indexing pipeline.
pageindex/config.py Introduces Pydantic-backed PageIndexConfig + loader.
config.yaml Adds repo-level default configuration values.
pyproject.toml Declares package metadata + pinned dependencies.
tests/conftest.py Fixes sys.path injection to project root for tests.
tests/test_tree.py Adds tests for new tree utilities.
tests/test_llm.py Adds tests for JSON extraction and token counting.
tests/test_config.py Adds tests for config loading and validation behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- core/tree.py: fix IndentationError in add_node_text() and
  add_node_text_with_labels() — extra space on node['text'] = ... line
- core/pdf.py: guard against None from page.extract_text() in both
  extract_text_from_pdf() and get_page_tokens() with 'or ""'
- utils.py: use functools.partial in run_in_executor fallback to safely
  pass kwargs; add return_finish_reason routing in llm_completion wrapper;
  remove unused _core_pkg import
- page_index.py: load default ConfigLoader when opt is None in
  page_index_main() to prevent model=None crash; fix tob_ typo in
  single_toc_item_index_fixer()
Copy link
Copy Markdown

@JiwaniZakir JiwaniZakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default values in config.yaml and PageIndexConfig in pageindex/config.py are inconsistent in several places: toc_check_page_num is 20 vs. 3, max_page_num_each_node is 10 vs. 5, max_token_num_each_node is 20000 vs. 4000, and both if_add_doc_description and if_add_node_text are false in the YAML but True in the Pydantic model. Since ConfigLoader falls back to the Pydantic defaults when no config file is found (e.g., during testing or when the package is installed without the repo's config.yaml), this creates two distinct behavioral profiles depending on the deployment context — likely unintentional. The Pydantic field defaults should either match config.yaml exactly, or config.yaml should be the sole source of truth and the Pydantic defaults should be treated as last-resort fallbacks with that clearly documented.

Additionally, in ConfigLoader.load, filtering out None values from user_dict (if v is not None) silently prevents a caller from explicitly clearing a field that has a non-None default (e.g., setting api_key=None to override a key loaded from the YAML). Consider distinguishing between "not provided" and "explicitly set to None," possibly by using MISSING sentinel values or model_fields_set from Pydantic v2.

…utils-split

# Conflicts:
#	examples/documents/conftest.py
#	examples/documents/test_config.py
#	examples/documents/test_llm.py
#	examples/documents/test_tree.py
#	pageindex/config.yaml
#	pageindex/utils.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants