Feat: Add Pydantic configuration validation#181
Open
denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
Open
Feat: Add Pydantic configuration validation#181denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
denis-samatov wants to merge 4 commits intoVectifyAI:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a Pydantic-based configuration schema/loader and continues the modularization of the codebase by splitting prior utils.py functionality into pageindex/core/* modules, along with new pytest unit tests.
Changes:
- Added
pageindex/config.pywithPageIndexConfig(Pydantic) andConfigLoaderfor validated config loading/merging. - Refactored core logic into
pageindex/core/{llm,pdf,tree,logging}.pyand updatedpageindex/utils.pyas a compatibility façade. - Updated
pageindex/page_index.pyto import from the new core modules and adjusted various LLM call sites; added pytest tests and packaging metadata updates.
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
pageindex/config.py |
New Pydantic config model + loader (env/cwd/repo discovery). |
config.yaml |
Switches config flags to real YAML booleans (true/false). |
pageindex/core/llm.py |
New OpenAI client wrapper + token counting + JSON extraction helpers. |
pageindex/core/pdf.py |
PDF text extraction + per-page tokenization utilities. |
pageindex/core/tree.py |
Tree manipulation helpers migrated from utils.py with typing/docstrings. |
pageindex/core/logging.py |
JSON logger extracted into core module. |
pageindex/utils.py |
Backwards-compat re-export of core modules via import *. |
pageindex/page_index.py |
Updated imports + multiple logic edits around TOC generation/fixing flow. |
tests/test_config.py |
New unit tests validating config loading and Pydantic validation failures. |
tests/test_llm.py |
New unit tests for JSON extraction and token counting. |
tests/test_tree.py |
New unit tests for tree building/flattening and node id assignment. |
tests/conftest.py |
Adds ../src to sys.path for tests. |
pyproject.toml |
Declares project metadata and dependencies (incl. pydantic>=2.0). |
pageindex.egg-info/* |
Adds generated packaging artifacts to the repo. |
Comments suppressed due to low confidence (3)
pageindex/page_index.py:325
last_completeis only updated whennew_complete.startswith('```json'). If the model returns valid JSON without a fenced code block (common),last_completenever changes and the loop will never progress. Always extract/append the returned JSON fragment (with/without fences), or handle both cases explicitly.
if new_complete.startswith('```json'):
new_complete = get_json_content(new_complete)
last_complete = last_complete+new_complete
if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
pageindex/page_index.py:824
process_and_check_itemruns insideasyncio.gather(...), but it calls the synchronousChatGPT_API(...)viasingle_toc_item_index_fixer. This will block the event loop and reduce concurrency. Consider makingsingle_toc_item_index_fixerasync (usingChatGPT_API_async) or running the sync call in a thread pool.
physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
# Check if the result is correct
check_item = incorrect_item.copy()
check_item['physical_index'] = physical_index_int
check_result = await check_title_appearance(check_item, page_list, start_index, model)
pageindex/page_index.py:827
- Inside
process_and_check_item,list_index(the TOC item index) is overwritten in the page loop (list_index = page_index - start_index). As a result, the returnedlist_indexno longer points to the TOC item being fixed, and later updates can write to the wrong element. Use a different variable name for the page-list index (e.g.,page_list_idx) and keep the original item index intact.
page_contents=[]
for page_index in range(prev_correct, next_correct+1):
# Add bounds checking to prevent IndexError
list_index = page_index - start_index
if list_index >= 0 and list_index < len(page_list):
page_text = f"<physical_index_{page_index}>\n{page_list[list_index][0]}\n<physical_index_{page_index}>\n\n"
page_contents.append(page_text)
else:
continue
content_range = ''.join(page_contents)
physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
# Check if the result is correct
check_item = incorrect_item.copy()
check_item['physical_index'] = physical_index_int
check_result = await check_title_appearance(check_item, page_list, start_index, model)
return {
'list_index': list_index,
'title': incorrect_item['title'],
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Building on the modularization of the core logic, this PR introduces
Pydanticfor robust, type-safe configuration loading and validation.Changes:
true/false) instead of string representations ("yes"/"no").ConfigLoaderseamlessly.pytestfor the configuration loader.Note: This PR is based on the new modular core structure and is intended to be reviewed/merged after the utils.py refactoring PR.