Skip to content

Feat: Add Pydantic configuration validation#181

Open
denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
denis-samatov:feature/pydantic-config
Open

Feat: Add Pydantic configuration validation#181
denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
denis-samatov:feature/pydantic-config

Conversation

@denis-samatov
Copy link
Copy Markdown

Building on the modularization of the core logic, this PR introduces Pydantic for robust, type-safe configuration loading and validation.

Changes:

  • Added pageindex/config.py leveraging Pydantic to validate settings.
  • Updated config.yaml to use standard boolean values (true/false) instead of string representations ("yes"/"no").
  • Updated page_index.py to utilize the new ConfigLoader seamlessly.
  • Added comprehensive unit tests in pytest for the configuration loader.

Note: This PR is based on the new modular core structure and is intended to be reviewed/merged after the utils.py refactoring PR.

Copilot AI review requested due to automatic review settings March 26, 2026 12:09
Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Pydantic-based configuration schema/loader and continues the modularization of the codebase by splitting prior utils.py functionality into pageindex/core/* modules, along with new pytest unit tests.

Changes:

  • Added pageindex/config.py with PageIndexConfig (Pydantic) and ConfigLoader for validated config loading/merging.
  • Refactored core logic into pageindex/core/{llm,pdf,tree,logging}.py and updated pageindex/utils.py as a compatibility façade.
  • Updated pageindex/page_index.py to import from the new core modules and adjusted various LLM call sites; added pytest tests and packaging metadata updates.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
pageindex/config.py New Pydantic config model + loader (env/cwd/repo discovery).
config.yaml Switches config flags to real YAML booleans (true/false).
pageindex/core/llm.py New OpenAI client wrapper + token counting + JSON extraction helpers.
pageindex/core/pdf.py PDF text extraction + per-page tokenization utilities.
pageindex/core/tree.py Tree manipulation helpers migrated from utils.py with typing/docstrings.
pageindex/core/logging.py JSON logger extracted into core module.
pageindex/utils.py Backwards-compat re-export of core modules via import *.
pageindex/page_index.py Updated imports + multiple logic edits around TOC generation/fixing flow.
tests/test_config.py New unit tests validating config loading and Pydantic validation failures.
tests/test_llm.py New unit tests for JSON extraction and token counting.
tests/test_tree.py New unit tests for tree building/flattening and node id assignment.
tests/conftest.py Adds ../src to sys.path for tests.
pyproject.toml Declares project metadata and dependencies (incl. pydantic>=2.0).
pageindex.egg-info/* Adds generated packaging artifacts to the repo.
Comments suppressed due to low confidence (3)

pageindex/page_index.py:325

  • last_complete is only updated when new_complete.startswith('```json'). If the model returns valid JSON without a fenced code block (common), last_complete never changes and the loop will never progress. Always extract/append the returned JSON fragment (with/without fences), or handle both cases explicitly.
        if new_complete.startswith('```json'):
            new_complete =  get_json_content(new_complete)
            last_complete = last_complete+new_complete

        if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)

pageindex/page_index.py:824

  • process_and_check_item runs inside asyncio.gather(...), but it calls the synchronous ChatGPT_API(...) via single_toc_item_index_fixer. This will block the event loop and reduce concurrency. Consider making single_toc_item_index_fixer async (using ChatGPT_API_async) or running the sync call in a thread pool.
        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

pageindex/page_index.py:827

  • Inside process_and_check_item, list_index (the TOC item index) is overwritten in the page loop (list_index = page_index - start_index). As a result, the returned list_index no longer points to the TOC item being fixed, and later updates can write to the wrong element. Use a different variable name for the page-list index (e.g., page_list_idx) and keep the original item index intact.
        page_contents=[]
        for page_index in range(prev_correct, next_correct+1):
            # Add bounds checking to prevent IndexError
            list_index = page_index - start_index
            if list_index >= 0 and list_index < len(page_list):
                page_text = f"<physical_index_{page_index}>\n{page_list[list_index][0]}\n<physical_index_{page_index}>\n\n"
                page_contents.append(page_text)
            else:
                continue
        content_range = ''.join(page_contents)
        
        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

        return {
            'list_index': list_index,
            'title': incorrect_item['title'],

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants