Feat: Add Pydantic configuration validation by denis-samatov · Pull Request #181 · VectifyAI/PageIndex

denis-samatov · 2026-03-26T12:09:17Z

Building on the modularization of the core logic, this PR introduces Pydantic for robust, type-safe configuration loading and validation.

Changes:

Added pageindex/config.py leveraging Pydantic to validate settings.
Updated config.yaml to use standard boolean values (true/false) instead of string representations ("yes"/"no").
Updated page_index.py to utilize the new ConfigLoader seamlessly.
Added comprehensive unit tests in pytest for the configuration loader.

Note: This PR is based on the new modular core structure and is intended to be reviewed/merged after the utils.py refactoring PR.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copilot

Pull request overview

This PR introduces a Pydantic-based configuration schema/loader and continues the modularization of the codebase by splitting prior utils.py functionality into pageindex/core/* modules, along with new pytest unit tests.

Changes:

Added pageindex/config.py with PageIndexConfig (Pydantic) and ConfigLoader for validated config loading/merging.
Refactored core logic into pageindex/core/{llm,pdf,tree,logging}.py and updated pageindex/utils.py as a compatibility façade.
Updated pageindex/page_index.py to import from the new core modules and adjusted various LLM call sites; added pytest tests and packaging metadata updates.

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 11 comments.

Show a summary per file

File	Description
`pageindex/config.py`	New Pydantic config model + loader (env/cwd/repo discovery).
`config.yaml`	Switches config flags to real YAML booleans (`true`/`false`).
`pageindex/core/llm.py`	New OpenAI client wrapper + token counting + JSON extraction helpers.
`pageindex/core/pdf.py`	PDF text extraction + per-page tokenization utilities.
`pageindex/core/tree.py`	Tree manipulation helpers migrated from `utils.py` with typing/docstrings.
`pageindex/core/logging.py`	JSON logger extracted into core module.
`pageindex/utils.py`	Backwards-compat re-export of core modules via `import *`.
`pageindex/page_index.py`	Updated imports + multiple logic edits around TOC generation/fixing flow.
`tests/test_config.py`	New unit tests validating config loading and Pydantic validation failures.
`tests/test_llm.py`	New unit tests for JSON extraction and token counting.
`tests/test_tree.py`	New unit tests for tree building/flattening and node id assignment.
`tests/conftest.py`	Adds `../src` to `sys.path` for tests.
`pyproject.toml`	Declares project metadata and dependencies (incl. `pydantic>=2.0`).
`pageindex.egg-info/*`	Adds generated packaging artifacts to the repo.

Comments suppressed due to low confidence (3)

pageindex/page_index.py:325

last_complete is only updated when new_complete.startswith('```json'). If the model returns valid JSON without a fenced code block (common), last_complete never changes and the loop will never progress. Always extract/append the returned JSON fragment (with/without fences), or handle both cases explicitly.

        if new_complete.startswith('```json'):
            new_complete =  get_json_content(new_complete)
            last_complete = last_complete+new_complete

        if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)

pageindex/page_index.py:824

process_and_check_item runs inside asyncio.gather(...), but it calls the synchronous ChatGPT_API(...) via single_toc_item_index_fixer. This will block the event loop and reduce concurrency. Consider making single_toc_item_index_fixer async (using ChatGPT_API_async) or running the sync call in a thread pool.

        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

pageindex/page_index.py:827

Inside process_and_check_item, list_index (the TOC item index) is overwritten in the page loop (list_index = page_index - start_index). As a result, the returned list_index no longer points to the TOC item being fixed, and later updates can write to the wrong element. Use a different variable name for the page-list index (e.g., page_list_idx) and keep the original item index intact.

        page_contents=[]
        for page_index in range(prev_correct, next_correct+1):
            # Add bounds checking to prevent IndexError
            list_index = page_index - start_index
            if list_index >= 0 and list_index < len(page_list):
                page_text = f"<physical_index_{page_index}>\n{page_list[list_index][0]}\n<physical_index_{page_index}>\n\n"
                page_contents.append(page_text)
            else:
                continue
        content_range = ''.join(page_contents)
        
        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

        return {
            'list_index': list_index,
            'title': incorrect_item['title'],

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pageindex/page_index.py

pageindex/core/logging.py

pageindex/config.py

config.yaml

pageindex.egg-info/PKG-INFO

pageindex/page_index.py

denis-samatov added 2 commits March 26, 2026 18:59

refactor: modularize utils.py into core package

a10ad51

feat: add pydantic configuration and validation

67f6cdf

Copilot AI review requested due to automatic review settings March 26, 2026 12:09

claude bot reviewed Mar 26, 2026

View reviewed changes

Copilot started reviewing on behalf of denis-samatov March 26, 2026 12:09 View session

Copilot AI reviewed Mar 26, 2026

View reviewed changes

denis-samatov added 2 commits March 26, 2026 19:50

fix(page_index): address Copilot review feedback

afecf2f

fix: logic tweaks, boolean conditionals, and added retry limits

d2bed43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat: Add Pydantic configuration validation#181

Feat: Add Pydantic configuration validation#181
denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
denis-samatov:feature/pydantic-config

denis-samatov commented Mar 26, 2026

Uh oh!

claude bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

denis-samatov commented Mar 26, 2026

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants