
fix: prevent KeyError crash and context exhaustion in TOC processing#188

Open
Shreyansh1729 wants to merge 2 commits into VectifyAI:main from Shreyansh1729:fix/issue-163-toc-logic-error

Conversation


@Shreyansh1729 commented Mar 27, 2026

Summary

Fixes a critical bug where processing large documents (~800+ pages) causes a KeyError: 'completed' crash and context window exhaustion during TOC transformation retries.

Root Cause

Two independent issues combined to break large document processing:

  1. KeyError on truncated LLM output: When extract_json() returns {} due to malformed or truncated JSON, direct dictionary access like json_content['completed'] throws a KeyError instead of gracefully handling the failure.

  2. Context window exhaustion in retry loops: Both extract_toc_content() and toc_transformer() rebuilt the full prompt context on every retry iteration — embedding the entire raw TOC text AND the growing accumulated response into each new prompt. For 800-page documents, this quickly exceeds the model's 128k token limit.

Changes Made

  • Replace all 5 unsafe json_content['key'] accesses with .get('key', default) to handle empty/malformed LLM responses gracefully
  • Refactor extract_toc_content() retry loop to grow chat_history incrementally — only the delta/new content is appended per iteration, not the full accumulated response
  • Refactor toc_transformer() retry loop to use chat_history instead of re-embedding the entire raw TOC and incomplete JSON directly in each prompt
  • Move the initial truncation (last_complete[:position+2]) to run once before the retry loop, preventing valid accumulated content from being cut on subsequent passes
  • Add explicit None guard for physical_index before calling convert_physical_index_to_int
  • Raise explicit Exception on max retries so callers fail fast instead of receiving partial results
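A minimal sketch of the first and last bullets, with hypothetical function and key names (the real code lives in page_index.py; `table_of_contents` here is an assumed key for illustration):

```python
def read_toc_fields(json_content: dict):
    """Defensive reads: extract_json() may return {} on truncated output,
    so every access uses .get() with a safe default instead of ['key']."""
    completed = json_content.get('completed', False)
    toc_items = json_content.get('table_of_contents', [])  # hypothetical key
    return completed, toc_items

def run_with_retries(call, max_retries=3):
    """Fail fast with an explicit error once retries are exhausted,
    instead of handing callers a partial result."""
    for _ in range(max_retries):
        result = call()
        if result.get('completed', False):
            return result
    raise Exception(f"TOC processing failed after {max_retries} retries")
```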

Testing

14 mock-based tests added in tests/test_issue_163.py:

  • 8 tests verify robust key access (empty, malformed, valid responses)
  • 4 tests verify extract_toc_content retry behavior (first-try success, continuation, max retries raises exception, incremental history growth)
  • 2 tests verify toc_transformer retry behavior (first-try success, missing key handling)
pytest tests/test_issue_163.py -v
======================== 14 passed in 2.80s ========================
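The mock-based style of these tests might look like the following sketch (illustrative only; names are hypothetical and the real tests live in tests/test_issue_163.py):

```python
from unittest.mock import MagicMock

def check_completed(llm_call) -> bool:
    """Hypothetical wrapper: call the LLM and read the flag defensively."""
    json_content = llm_call()
    return json_content.get('completed', False)

def test_empty_response_is_handled():
    # Mock an LLM whose output failed to parse: extract_json() returned {}.
    llm_call = MagicMock(return_value={})
    assert check_completed(llm_call) is False

def test_valid_response():
    llm_call = MagicMock(return_value={'completed': True})
    assert check_completed(llm_call) is True
```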

Closes #163


@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

martindecz pushed a commit to martindecz/PageIndex that referenced this pull request Mar 29, 2026
- Upstream main (28542de) — latest VectifyAI/PageIndex
- PR VectifyAI#188: fix prevent KeyError crash and context exhaustion in TOC processing
- Fix list_index shadowing (VectifyAI#167) already in upstream
- LiteLLM integration (VectifyAI#168) already in upstream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@JiwaniZakir JiwaniZakir left a comment


The single_toc_item_index_fixer change in page_index.py (line 744) swaps a KeyError for a potential TypeError: json_content.get('physical_index') returns None when the key is absent, and then convert_physical_index_to_int(None) is called. Whether this actually resolves the crash depends entirely on how convert_physical_index_to_int handles None — worth checking and possibly providing an explicit fallback value (e.g., .get('physical_index', 0)) rather than silently passing None downstream.
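The hazard described above can be sketched like this (the converter body is a hypothetical stand-in; only the guard pattern is the point):

```python
def convert_physical_index_to_int(physical_index):
    # Hypothetical stand-in for the real helper: assume it parses
    # tags like "<physical_index_12>" and would blow up on None.
    return int(str(physical_index).strip('<>').rsplit('_', 1)[-1])

def safe_physical_index(json_content):
    physical_index = json_content.get('physical_index')
    if physical_index is None:
        # Explicit guard: never pass None downstream into int parsing.
        return None
    return convert_physical_index_to_int(physical_index)
```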

The for/else refactor in both extract_toc_content and toc_transformer is a clear improvement over the manual attempt counter, but the else branch now only logs a warning instead of raising. This means callers silently receive a partial or potentially malformed result after max retries, which could cause confusing failures further downstream. The original raise Exception was at least explicit about the failure; a silent degraded return should at minimum be documented in the function's docstring or validated before returning.

In toc_transformer's loop, last_complete = last_complete[:position+2] runs on every iteration — including iterations where last_complete already contains successfully appended content from a previous pass. This means valid accumulated output could be silently truncated on each retry cycle if rfind('}') finds an interior } rather than the terminal one.
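The fix suggested here is to hoist the trim out of the loop so it runs exactly once on the initial incomplete response. A minimal sketch of that one-shot trim (hypothetical helper name):

```python
def truncate_once(last_complete: str) -> str:
    """Trim a truncated JSON string back to just past its last complete
    object. Intended to run ONCE, before the retry loop, so later
    iterations never cut valid accumulated content."""
    position = last_complete.rfind('}')
    if position != -1:
        last_complete = last_complete[:position + 2]
    return last_complete
```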

- Restore explicit Exception on max retries instead of silent warning
- Move truncation logic before the retry loop so it only runs once
  on the initial incomplete response, not on every iteration
- Add explicit None guard for physical_index before passing to
  convert_physical_index_to_int to prevent potential TypeError
- Update test to expect Exception on max retries
@Shreyansh1729

Thanks for the detailed review. All three points are valid — pushed fixes in ce67562:

  1. physical_index None guard: Added an explicit if physical_index is None: return None check before calling convert_physical_index_to_int, so None is never passed downstream.

  2. Silent degradation on max retries: Restored raise Exception(...) in both extract_toc_content and toc_transformer so callers get an explicit failure instead of a partial result.

  3. Truncation on every iteration: Moved the last_complete[:position+2] logic to run once before the retry loop, so it only trims the initial incomplete response rather than potentially cutting valid accumulated content on subsequent passes.

@Shreyansh1729

@JiwaniZakir check it now

@JiwaniZakir

The .get('key', default) changes are the right fix for the KeyError issue. For the context exhaustion fix, it would be worth confirming the refactored retry logic passes only the incremental/delta content on subsequent iterations rather than re-embedding the full accumulated response — that's the critical part for large document handling. The .gitignore discrepancy @KylinMountain flagged should be corrected before merge to keep the description accurate.

@Shreyansh1729

Confirmed on both points:

  1. Incremental chat_history: The retry loops append only the new delta content per iteration (chat_history.append({...new_response})), not the full accumulated response. The LLM sees the conversation history naturally without re-embedding the entire TOC or accumulated JSON in the prompt.
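The incremental pattern described here can be sketched as follows (hypothetical `llm` callable and completion check; the real loop lives in extract_toc_content):

```python
def retry_with_history(llm, base_prompt, continue_prompt, max_retries=5):
    """Grow chat_history by the delta only; never re-embed the
    accumulated response or the raw TOC in later prompts."""
    chat_history = [{"role": "user", "content": base_prompt}]
    accumulated = ""
    for _ in range(max_retries):
        new_response = llm(continue_prompt, chat_history)
        accumulated += new_response
        # Append only this iteration's chunk to the history.
        chat_history.append({"role": "assistant", "content": new_response})
        chat_history.append({"role": "user", "content": continue_prompt})
        if accumulated.rstrip().endswith("]"):  # crude completion check
            return accumulated
    raise Exception("max retries exceeded")
```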

  2. Description updated: Fixed the PR description to accurately reflect the current behavior — raise Exception on max retries, initial truncation moved before the loop, and the None guard for physical_index.

@Shreyansh1729

@JiwaniZakir @KylinMountain, it is ready for merge.

@JiwaniZakir

The closing rationale is sound — changing the fallback from {} to [] is the wrong layer to fix this; the real contract violation is that extract_json() can return either type depending on the input, which forces every caller to defensively handle both. A cleaner fix would be to make extract_json() always return a consistent type (or raise), and fix the 5 unsafe dict accesses with .get() as originally proposed without touching the return type. The context exhaustion issue still needs a dedicated fix regardless of how the KeyError is resolved.
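The stricter contract proposed here could look like this sketch (hypothetical name; the real extract_json() currently returns {} on failure):

```python
import json

def extract_json_strict(text: str):
    """Hypothetical stricter variant: raise on parse failure instead of
    silently returning {}, so parse errors surface at the source rather
    than as missing-key bugs downstream."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM response is not valid JSON: {e}") from e
```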

@Shreyansh1729

Valid point about the extract_json() contract. Changing its return type to always raise on failure (instead of returning {}) would be a cleaner long-term fix, but it touches every caller across the codebase — that's a larger refactor.

For this PR, the scope is intentionally narrow: fix the crash from #163 with minimal risk. The .get() calls are defensive regardless of how extract_json() behaves, and the context exhaustion fix is independent of that.

Happy to open a follow-up PR to normalize extract_json() to raise on parse failure if that's preferred over the current silent {} return.

@JiwaniZakir

The context exhaustion fix is the more critical of the two — passing the full accumulated response on each retry is quadratic in cost for large documents. To confirm the refactored retry logic is correct, check that the new prompt on iteration N contains only the remaining unparsed portion of the TOC, not the concatenation of all prior partial responses. The .get() changes are straightforward and correct regardless. Worth noting: if extract_json() returns {} due to a truncated response mid-retry, the fix prevents the crash but the retry will still re-attempt with stale context unless the prompt construction is also fixed.

@Shreyansh1729

Confirmed — the prompt on iteration N contains only the fixed continuation string, not the accumulated response.

extract_toc_content (line 181): llm_completion(prompt=continue_prompt, chat_history=chat_history), where continue_prompt is a static string. chat_history grows via .append(new_response), so only the delta from each iteration is added; the full accumulated response is never re-embedded in the prompt.

toc_transformer (line 307): same pattern. continue_prompt is static and chat_history.append(new_complete) adds only the new chunk. The original code embedded both {toc_content} and {last_complete} directly in an f-string prompt on every iteration; that was the quadratic cost, now removed.

@Shreyansh1729
@JiwaniZakir

@JiwaniZakir

The .get('key', default) fix for the KeyError is straightforward and correct. On the context exhaustion issue, appending only the delta per iteration to chat_history is the right approach — but worth verifying the base prompt (with the full TOC text) isn't being reconstructed on each retry call, since that alone can exhaust the window on 800-page docs regardless of how history is managed. If the TOC is being passed as a parameter into the retry loop rather than captured in a closure or outer scope, each iteration may still be re-embedding it.

@Shreyansh1729

The base prompt (containing the full TOC text) is constructed once at line 160-165 and stored in chat_history[0] at line 174. It is never reconstructed inside the loop.

The retry loop (line 180-187) only passes continue_prompt — a static string with no embedded content. The chat_history list grows by appending {new_response} (delta only), so the full TOC text exists exactly once in element [0] of the history, not re-embedded per iteration.

For toc_transformer, same pattern: toc_content appears once in chat_history[0] (line 295). The old code had an f-string that re-embedded both {toc_content} and {last_complete} on every retry — that was the quadratic cost eliminated by this refactor.

@Shreyansh1729

@JiwaniZakir

@JiwaniZakir

The incremental chat_history approach is the right call — appending only the new delta per iteration keeps the retry cost linear rather than quadratic. One thing worth verifying before merge: confirm the retry loop correctly tracks the offset into the raw TOC so each iteration's prompt starts from where the last response left off, not from the beginning of the unparsed block. If that boundary tracking is solid, the context exhaustion fix is sound and this is good to merge. The extract_json() contract cleanup @JiwaniZakir raised is worth a follow-up issue, but agreed it doesn't belong in this PR.

@Shreyansh1729

Boundary tracking is handled through the conversation context itself. Here's the exact message sequence the LLM sees on iteration N:

[0] user: "Extract TOC from: {full_toc_text}"     ← set once, line 174
[1] assistant: "Section 1... Section 2..."           ← initial partial output
[2] user: "please continue..."                      ← static continuation prompt
[3] assistant: "Section 3... Section 4..."           ← delta from iteration 1
[4] user: "please continue..."
[5] assistant: "Section 5..."                        ← delta from iteration 2
...
[N] user: "please continue..."                      ← current iteration

Each assistant message contains only its own chunk. The LLM sees the full conversation thread and naturally continues from where the last response ended. The offset is implicit in the chat sequence — the model knows exactly where it left off because it can see its own prior outputs as separate messages.

llm_completion (utils.py line 36) constructs messages as list(chat_history) + [current_prompt], so the full history is preserved in order without re-embedding any content.
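That assembly step can be sketched in two lines (hypothetical wrapper name; the real code is in utils.py):

```python
def build_messages(chat_history, current_prompt):
    """Messages for the next call: full history in order, plus the
    current prompt as the final user turn. Nothing is re-embedded."""
    return list(chat_history) + [{"role": "user", "content": current_prompt}]
```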

@Shreyansh1729

@JiwaniZakir I think this is ready now.

@JiwaniZakir

The narrow scope is reasonable for an emergency fix, but the extract_json() contract issue is worth tracking — returning {} on failure silently converts parse errors into missing-key bugs downstream, which is exactly what this PR is patching around. A follow-up that makes extract_json() raise (or return None) would eliminate this class of bug structurally rather than requiring defensive .get() calls at every call site. On the retry logic, one thing to verify: if chat_history carries the full accumulated response across iterations and the model's reply on iteration N is also appended before the next call, confirm the total tokens in chat_history are being tracked or bounded — otherwise the quadratic growth problem reappears just one layer up.

@Shreyansh1729

The total tokens in chat_history do grow with each iteration — that's inherent to any multi-turn continuation since the LLM needs prior context to continue coherently. But the growth is linear: each chunk appears exactly once in the history.

The old code was quadratic — on iteration N, the prompt f-string embedded the full accumulated output ({last_complete}), and then llm_completion sent that as a new user message on top of existing history. So the same content was duplicated across layers, and the total tokens scaled as O(N * total_output).

Now it's O(total_output) — each chunk exists once. For an 800-page document, if the TOC output is ~10k tokens split across 5 retries, the final call sees ~10k tokens of history vs the old code seeing ~30k+ due to duplication. Context limits could still be hit on extremely large TOCs, but that would require a fundamentally different approach (chunking the input TOC itself) — a separate concern from this fix.
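The arithmetic behind the ~30k vs ~10k comparison checks out on the assumed numbers (10k tokens of output split evenly across 5 retries):

```python
# Assumed figures from the discussion: 10k tokens of TOC output
# produced in five equal retry chunks.
chunks = [2000] * 5

# Old scheme: iteration N re-embeds everything produced so far,
# so the cost is the sum of growing prefixes (quadratic in N).
old_total = sum(sum(chunks[:n + 1]) for n in range(len(chunks)))

# New scheme: each chunk appears exactly once in the history (linear).
new_total = sum(chunks)

print(old_total, new_total)  # old_total == 30000, new_total == 10000
```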

Agreed on the extract_json() follow-up, will open a separate PR for that.

@Shreyansh1729

@JiwaniZakir check it

@Shreyansh1729

@JiwaniZakir @KylinMountain, I think it's ready for merge.



Development

Successfully merging this pull request may close these issues.

toc_transformer crashes on large TOCs (~800 page PDF) - JSON truncation + context overflow

2 participants