Add dots OCR PDF example and harden vLLM inference path#115
Add dots OCR PDF example and harden vLLM inference path#115Hynek Kydlíček (hynky1999) wants to merge 14 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds PDF document support to the Refiner library, introducing a pdf module for lazy page rendering via pypdfium2, updated documentation, and a new OCR example script. It also refactors inference response parsing to handle nested chat message structures and flattens vllm service configurations. Feedback identifies a concurrency issue where a global lock is held during asynchronous yields in PDF rendering and recommends adding exception logging in the example script to improve debuggability.
| async with _pdfium_lock: | ||
| doc = pdfium.PdfDocument(pdf_bytes) | ||
| try: | ||
| page_count = len(doc) | ||
| for index in range(page_count): | ||
| page = doc[index] | ||
| try: | ||
| image = page.render(scale=scale).to_pil() | ||
| image.load() | ||
| finally: | ||
| page.close() | ||
| yield RenderedPdfPage(index=index, image=image) | ||
| finally: | ||
| doc.close() |
There was a problem hiding this comment.
The _pdfium_lock is held across the yield statement. In an asynchronous generator, this means the lock remains acquired while the consumer of the generator is processing the yielded page (e.g., performing OCR or calling an LLM). Since this is a global lock, it will effectively serialize PDF rendering across the entire process, preventing other tasks from rendering pages even if they are processing different documents.
To improve concurrency, consider acquiring and releasing the lock for each page rendering individually. While this requires re-opening the document for each page, the overhead is typically negligible compared to the rasterization process and the benefit of allowing interleaved rendering across multiple tasks.
| async with _pdfium_lock: | |
| doc = pdfium.PdfDocument(pdf_bytes) | |
| try: | |
| page_count = len(doc) | |
| for index in range(page_count): | |
| page = doc[index] | |
| try: | |
| image = page.render(scale=scale).to_pil() | |
| image.load() | |
| finally: | |
| page.close() | |
| yield RenderedPdfPage(index=index, image=image) | |
| finally: | |
| doc.close() | |
| async with _pdfium_lock: | |
| doc = pdfium.PdfDocument(pdf_bytes) | |
| page_count = len(doc) | |
| doc.close() | |
| for index in range(page_count): | |
| async with _pdfium_lock: | |
| doc = pdfium.PdfDocument(pdf_bytes) | |
| page = doc[index] | |
| try: | |
| image = page.render(scale=scale).to_pil() | |
| image.load() | |
| finally: | |
| page.close() | |
| doc.close() | |
| yield RenderedPdfPage(index=index, image=image) |
| page_images.append(image_url) | ||
| if not text or text.startswith("The "): | ||
| suspect_blank_pages.append(page.index) | ||
| except Exception as exc: |
There was a problem hiding this comment.
Catching a broad Exception here without logging the traceback makes it difficult to diagnose why a PDF might be failing in a large-scale pipeline. It is recommended to log the exception using a logger (e.g., loguru.logger.exception) so that issues like missing dependencies, corrupted files, or out-of-memory errors can be identified and addressed.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb1e2c52ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
| async with _pdfium_lock: | ||
| doc = pdfium.PdfDocument(pdf_bytes) |
There was a problem hiding this comment.
Release PDFium lock before yielding rendered pages
Holding _pdfium_lock across the generator yield means the lock stays held while callers do downstream awaits (for example, per-page OCR calls), so one slow consumer blocks all other PDF rendering globally. In the new OCR pipeline this effectively collapses map_async concurrency to a single active PDF and can stall throughput badly on multi-page inputs.
Useful? React with 👍 / 👎.
| page_images.append(image_url) | ||
| if not text or text.startswith("The "): | ||
| suspect_blank_pages.append(page.index) | ||
| except Exception as exc: |
There was a problem hiding this comment.
Avoid swallowing inference failures as invalid PDFs
This broad except Exception wraps both PDF decoding and generate(...) calls, so transient model/service failures are re-labeled as invalid_pdf=True and then dropped by keep_valid_pdf. That silently discards valid documents and hides runtime inference outages as data-quality issues.
Useful? React with 👍 / 👎.
| if not suspect_pages: | ||
| return row.update( | ||
| { | ||
| "blank_pages": [], | ||
| "has_blank_pages": False, | ||
| } | ||
| ) |
There was a problem hiding this comment.
Drop temporary image payloads on early blank-page return
When suspect_blank_pages is empty, the early return updates flags but does not remove _page_images. Because later filters keep these rows, full base64 page images are written to output for the common no-suspect case, causing avoidable memory/storage bloat and much larger parquet artifacts.
Useful? React with 👍 / 👎.
Summary
Verification