Skip to content

Add dots OCR PDF example and harden vLLM inference path#115

Open
Hynek Kydlíček (hynky1999) wants to merge 14 commits into
mainfrom
codex/pdf-file-rendered-pages
Open

Add dots OCR PDF example and harden vLLM inference path#115
Hynek Kydlíček (hynky1999) wants to merge 14 commits into
mainfrom
codex/pdf-file-rendered-pages

Conversation

@hynky1999
Copy link
Copy Markdown
Collaborator

Summary

  • add a simplified dots.mocr PDF OCR example using PdfFile rendering and blank-page verification
  • harden the inference client for empty chat responses and simplify active request metrics
  • make the example configurable for alternate input parquet datasets and skip invalid PDFs

Verification

  • uv run python -m py_compile src/refiner/inference/client.py src/refiner/inference/generate.py tests/test_inference.py examples/docling_issues_ocr.py
  • uv run pytest tests/test_inference.py -q

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds PDF document support to the Refiner library, introducing a pdf module for lazy page rendering via pypdfium2, updated documentation, and a new OCR example script. It also refactors inference response parsing to handle nested chat message structures and flattens vllm service configurations. Feedback identifies a concurrency issue where a global lock is held during asynchronous yields in PDF rendering and recommends adding exception logging in the example script to improve debuggability.

Comment thread src/refiner/pdf/types.py
Comment on lines +56 to +69
async with _pdfium_lock:
doc = pdfium.PdfDocument(pdf_bytes)
try:
page_count = len(doc)
for index in range(page_count):
page = doc[index]
try:
image = page.render(scale=scale).to_pil()
image.load()
finally:
page.close()
yield RenderedPdfPage(index=index, image=image)
finally:
doc.close()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _pdfium_lock is held across the yield statement. In an asynchronous generator, this means the lock remains acquired while the consumer of the generator is processing the yielded page (e.g., performing OCR or calling an LLM). Since this is a global lock, it will effectively serialize PDF rendering across the entire process, preventing other tasks from rendering pages even if they are processing different documents.

To improve concurrency, consider acquiring and releasing the lock for each page rendering individually. While this requires re-opening the document for each page, the overhead is typically negligible compared to the rasterization process and the benefit of allowing interleaved rendering across multiple tasks.

Suggested change
async with _pdfium_lock:
doc = pdfium.PdfDocument(pdf_bytes)
try:
page_count = len(doc)
for index in range(page_count):
page = doc[index]
try:
image = page.render(scale=scale).to_pil()
image.load()
finally:
page.close()
yield RenderedPdfPage(index=index, image=image)
finally:
doc.close()
async with _pdfium_lock:
doc = pdfium.PdfDocument(pdf_bytes)
page_count = len(doc)
doc.close()
for index in range(page_count):
async with _pdfium_lock:
doc = pdfium.PdfDocument(pdf_bytes)
page = doc[index]
try:
image = page.render(scale=scale).to_pil()
image.load()
finally:
page.close()
doc.close()
yield RenderedPdfPage(index=index, image=image)

Comment thread examples/docling_issues_ocr.py Outdated
page_images.append(image_url)
if not text or text.startswith("The "):
suspect_blank_pages.append(page.index)
except Exception as exc:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Catching a broad Exception here without logging the traceback makes it difficult to diagnose why a PDF might be failing in a large-scale pipeline. It is recommended to log the exception using a logger (e.g., loguru.logger.exception) so that issues like missing dependencies, corrupted files, or out-of-memory errors can be identified and addressed.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb1e2c52ec

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread src/refiner/pdf/types.py
Comment on lines +56 to +57
async with _pdfium_lock:
doc = pdfium.PdfDocument(pdf_bytes)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Release PDFium lock before yielding rendered pages

Holding _pdfium_lock across the generator yield means the lock stays held while callers do downstream awaits (for example, per-page OCR calls), so one slow consumer blocks all other PDF rendering globally. In the new OCR pipeline this effectively collapses map_async concurrency to a single active PDF and can stall throughput badly on multi-page inputs.

Useful? React with 👍 / 👎.

Comment thread examples/docling_issues_ocr.py Outdated
page_images.append(image_url)
if not text or text.startswith("The "):
suspect_blank_pages.append(page.index)
except Exception as exc:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid swallowing inference failures as invalid PDFs

This broad except Exception wraps both PDF decoding and generate(...) calls, so transient model/service failures are re-labeled as invalid_pdf=True and then dropped by keep_valid_pdf. That silently discards valid documents and hides runtime inference outages as data-quality issues.

Useful? React with 👍 / 👎.

Comment thread examples/docling_issues_ocr.py Outdated
Comment on lines +151 to +157
if not suspect_pages:
return row.update(
{
"blank_pages": [],
"has_blank_pages": False,
}
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Drop temporary image payloads on early blank-page return

When suspect_blank_pages is empty, the early return updates flags but does not remove _page_images. Because later filters keep these rows, full base64 page images are written to output for the common no-suspect case, causing avoidable memory/storage bloat and much larger parquet artifacts.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant