Add dots OCR PDF example and harden vLLM inference path by hynky1999 · Pull Request #115 · macrodata-labs/refiner

Hynek Kydlíček (hynky1999) · 2026-04-28T21:39:57Z

Summary

add a simplified dots.mocr PDF OCR example using PdfFile rendering and blank-page verification
harden the inference client for empty chat responses and simplify active request metrics
make the example configurable for alternate input parquet datasets and skip invalid PDFs
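A minimal sketch of what tolerating empty chat responses could look like, assuming an OpenAI-compatible chat-completion payload (`choices[0].message.content`), which is the shape vLLM's chat endpoint returns. The helper name `extract_chat_text` is hypothetical and is not taken from `src/refiner/inference/client.py`.

```python
from typing import Any, Mapping


def extract_chat_text(response: Mapping[str, Any]) -> str:
    """Return the assistant text from a chat-completion payload, or "" if absent."""
    choices = response.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        # An empty completion is treated as empty text rather than an error.
        return ""
    if isinstance(content, str):
        return content.strip()
    # Assumption: some servers return a list of typed parts instead of a string.
    parts = [str(p.get("text") or "") for p in content if isinstance(p, dict)]
    return "".join(parts).strip()
```

Returning an empty string instead of raising lets a caller decide how to interpret the result, for example by flagging the page as a suspected blank.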

Verification

uv run python -m py_compile src/refiner/inference/client.py src/refiner/inference/generate.py tests/test_inference.py examples/docling_issues_ocr.py
uv run pytest tests/test_inference.py -q

gemini-code-assist

Code Review

This pull request adds PDF document support to the Refiner library, introducing a pdf module for lazy page rendering via pypdfium2, updated documentation, and a new OCR example script. It also refactors inference response parsing to handle nested chat message structures and flattens vllm service configurations. Feedback identifies a concurrency issue where a global lock is held during asynchronous yields in PDF rendering and recommends adding exception logging in the example script to improve debuggability.

gemini-code-assist · 2026-04-28T21:41:42Z

+        async with _pdfium_lock:
+            doc = pdfium.PdfDocument(pdf_bytes)
+            try:
+                page_count = len(doc)
+                for index in range(page_count):
+                    page = doc[index]
+                    try:
+                        image = page.render(scale=scale).to_pil()
+                        image.load()
+                    finally:
+                        page.close()
+                    yield RenderedPdfPage(index=index, image=image)
+            finally:
+                doc.close()


The _pdfium_lock is held across the yield statement. In an asynchronous generator, this means the lock remains acquired while the consumer of the generator is processing the yielded page (e.g., performing OCR or calling an LLM). Since this is a global lock, it will effectively serialize PDF rendering across the entire process, preventing other tasks from rendering pages even if they are processing different documents.

To improve concurrency, consider acquiring and releasing the lock for each page rendering individually. While this requires re-opening the document for each page, the overhead is typically negligible compared to the rasterization process and the benefit of allowing interleaved rendering across multiple tasks.

Suggested change

-        async with _pdfium_lock:
-            doc = pdfium.PdfDocument(pdf_bytes)
-            try:
-                page_count = len(doc)
-                for index in range(page_count):
-                    page = doc[index]
-                    try:
-                        image = page.render(scale=scale).to_pil()
-                        image.load()
-                    finally:
-                        page.close()
-                    yield RenderedPdfPage(index=index, image=image)
-            finally:
-                doc.close()
+        async with _pdfium_lock:
+            doc = pdfium.PdfDocument(pdf_bytes)
+            page_count = len(doc)
+            doc.close()
+        for index in range(page_count):
+            async with _pdfium_lock:
+                doc = pdfium.PdfDocument(pdf_bytes)
+                page = doc[index]
+                try:
+                    image = page.render(scale=scale).to_pil()
+                    image.load()
+                finally:
+                    page.close()
+                    doc.close()
+            yield RenderedPdfPage(index=index, image=image)

gemini-code-assist · 2026-04-28T21:41:42Z

+            page_images.append(image_url)
+            if not text or text.startswith("The "):
+                suspect_blank_pages.append(page.index)
+    except Exception as exc:


Catching a broad Exception here without logging the traceback makes it difficult to diagnose why a PDF might be failing in a large-scale pipeline. It is recommended to log the exception using a logger (e.g., loguru.logger.exception) so that issues like missing dependencies, corrupted files, or out-of-memory errors can be identified and addressed.
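A small sketch of the logging recommendation, using the standard library `logging` module in place of loguru so that it runs without extra dependencies. The function `process_pdf`, the logger name, and the header check are hypothetical stand-ins for the example script's per-row step.

```python
import logging

logger = logging.getLogger("ocr_example")  # hypothetical logger name


def process_pdf(pdf_id: str, pdf_bytes: bytes) -> dict:
    """Flag failures on the row while keeping the traceback in the logs."""
    try:
        # Placeholder for real decoding and OCR work.
        if not pdf_bytes.startswith(b"%PDF-"):
            raise ValueError("missing %PDF- header")
        return {"id": pdf_id, "invalid_pdf": False}
    except Exception:
        # logger.exception logs at ERROR level and appends the active traceback.
        logger.exception("failed to process PDF %s", pdf_id)
        return {"id": pdf_id, "invalid_pdf": True}
```

The row is still flagged, but the cause of the failure is now recoverable from the logs.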

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb1e2c52ec

chatgpt-codex-connector · 2026-04-28T21:43:40Z

+        async with _pdfium_lock:
+            doc = pdfium.PdfDocument(pdf_bytes)


Release PDFium lock before yielding rendered pages

Holding _pdfium_lock across the generator yield means the lock stays held while callers do downstream awaits (for example, per-page OCR calls), so one slow consumer blocks all other PDF rendering globally. In the new OCR pipeline this effectively collapses map_async concurrency to a single active PDF and can stall throughput badly on multi-page inputs.

chatgpt-codex-connector · 2026-04-28T21:43:40Z

+            page_images.append(image_url)
+            if not text or text.startswith("The "):
+                suspect_blank_pages.append(page.index)
+    except Exception as exc:


Avoid swallowing inference failures as invalid PDFs

This broad except Exception wraps both PDF decoding and generate(...) calls, so transient model/service failures are re-labeled as invalid_pdf=True and then dropped by keep_valid_pdf. That silently discards valid documents and hides runtime inference outages as data-quality issues.
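One way to keep the two failure modes apart is to narrow the `try` block to decoding only. The sketch below assumes a dedicated exception type and a synchronous `generate` callable for brevity; `InvalidPdfError`, `decode_pages`, and `ocr_row` are hypothetical names, not the example script's actual functions.

```python
from typing import Callable, List


class InvalidPdfError(Exception):
    """Hypothetical error raised only when the PDF bytes cannot be decoded."""


def decode_pages(pdf_bytes: bytes) -> List[str]:
    # Placeholder for real PDF rendering.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise InvalidPdfError("not a PDF")
    return ["page-0"]


def ocr_row(pdf_bytes: bytes, generate: Callable[[str], str]) -> dict:
    try:
        pages = decode_pages(pdf_bytes)
    except InvalidPdfError:
        # Only decoding problems mark the row as a bad document.
        return {"invalid_pdf": True, "texts": []}
    # Inference runs outside the try block, so service errors propagate
    # to the caller instead of being relabeled as invalid PDFs.
    texts = [generate(page) for page in pages]
    return {"invalid_pdf": False, "texts": texts}
```

A transient model outage then surfaces as an exception that a retry or failure policy can handle, rather than silently shrinking the output dataset.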

chatgpt-codex-connector · 2026-04-28T21:43:40Z

+    if not suspect_pages:
+        return row.update(
+            {
+                "blank_pages": [],
+                "has_blank_pages": False,
+            }
+        )


Drop temporary image payloads on early blank-page return

When suspect_blank_pages is empty, the early return updates flags but does not remove _page_images. Because later filters keep these rows, full base64 page images are written to output for the common no-suspect case, causing avoidable memory/storage bloat and much larger parquet artifacts.
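A sketch of the early-return fix using plain dictionaries as a stand-in for the pipeline's row type, whose API for dropping a column is not shown in this conversation. `TEMP_COLUMNS`, `without_temp_columns`, and `mark_no_blank_pages` are hypothetical names.

```python
TEMP_COLUMNS = ("_page_images",)  # temporary payloads that should not reach the output


def without_temp_columns(row: dict) -> dict:
    """Return a copy of the row with temporary columns removed."""
    return {k: v for k, v in row.items() if k not in TEMP_COLUMNS}


def mark_no_blank_pages(row: dict) -> dict:
    """Early-return path: set the blank-page flags and drop image payloads."""
    cleaned = without_temp_columns(row)
    cleaned.update({"blank_pages": [], "has_blank_pages": False})
    return cleaned
```

Applying the same cleanup helper on every return path keeps the base64 images out of the written parquet regardless of which branch a row takes.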

Hynek Kydlíček (hynky1999) added 13 commits April 23, 2026 15:26

add pdf rendered page primitive

00fe0f6

add docling issues pdf ocr example

a415166

match supported vllm config values

54fae55

flatten vllm service config kwargs

36ab388

use cloud vllm presets in ocr example

51d188c

Handle dots OCR chat responses

b656dbb

Align dots OCR prompt with upstream example

4be9c1f

Reduce dots OCR completion budget

b0a0929

Use PNG payloads for dots OCR

c53b4f2

Align dots OCR PDF rendering with upstream parser

c5ba541

Treat empty dots OCR responses as blank pages

5997a63

Skip invalid PDFs in dots OCR example

9b799e2

Expose only active inference request metric

cb1e2c5

gemini-code-assist Bot reviewed Apr 28, 2026

chatgpt-codex-connector Bot reviewed Apr 28, 2026

Remove dots OCR example and docs change

83b46a0

Hynek Kydlíček (hynky1999) wants to merge 14 commits into main from codex/pdf-file-rendered-pages
