Add Dots PDF OCR example by hynky1999 · Pull Request #109 · macrodata-labs/refiner

Hynek Kydlíček (hynky1999) · 2026-04-22T13:18:44Z

Summary

add a simplified Refiner PDF OCR example using rednote-hilab/dots.mocr
filter repeated OCR output and run Qwen/Qwen3.5-4B blank-page checks for pages starting with "The "
document the OCR example in inference docs

Verification

uv run ruff format examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
uv run ruff check examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
uv run ty check examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
uv run pytest tests/test_ocr_dots_pdf_example.py
pre-commit hook passed

gemini-code-assist

Code Review

This pull request introduces a new PDF OCR example using Dots OCR and Qwen VLM models, along with documentation updates for prewarmed models and associated tests. Feedback focuses on correcting absolute local file paths in the documentation links, replacing magic numbers with named constants for better maintainability, improving the robustness of the model response parsing logic, and adopting dynamic paths in tests to avoid filesystem side effects.

gemini-code-assist · 2026-04-22T13:20:22Z


 - Direct endpoint example: [examples/inference_endpoint.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_endpoint.py)
 - VLLM-backed example: [examples/inference_vllm.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_vllm.py)
+- PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py)


The link to the PDF OCR example contains an absolute path from a local environment (/Users/hynky/...). This should be updated to a relative path to ensure the link works correctly in the documentation. Note that the preceding links on lines 98 and 99 also appear to have this issue and should be corrected as well.

Suggested change

- PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py)

- PDF OCR example: [examples/ocr_dots_pdf.py](examples/ocr_dots_pdf.py)

gemini-code-assist · 2026-04-22T13:20:22Z

+def has_repetition(text: str) -> bool:
+    lines = [line.strip() for line in text.splitlines() if len(line.strip()) >= 12]
+    line_counts: dict[str, int] = {}
+    for line in lines:
+        line_counts[line] = line_counts.get(line, 0) + 1
+        if line_counts[line] >= 3:
+            return True
+
+    normalized = " ".join(text.split())
+    if len(normalized) < 120:
+        return False
+
+    words = normalized.split()
+    for window_size in (4, 6, 8):
+        if _has_repeated_word_window(words, window_size=window_size, min_repeats=4):
+            return True
+    return False


The has_repetition function uses several magic numbers (e.g., 12, 3, 120, 4, 6, 8) as thresholds for its heuristic repetition check. Defining these as named constants at the top of the file or adding comments to explain the rationale behind these values would improve the maintainability and clarity of the example for users who might need to tune these parameters.

gemini-code-assist · 2026-04-22T13:20:22Z

+    except json.JSONDecodeError:
+        lowered = cleaned.lower()
+        return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered


The fallback logic for parsing the blank page response is quite fragile as it relies on exact string matches for specific JSON-like patterns. If the model's output format varies slightly (e.g., different spacing or capitalization), this check might fail. Consider using a more robust approach, such as a regular expression, to extract the is_blank_page value.

gemini-code-assist · 2026-04-22T13:20:22Z

+    example = _load_example_module()
+    from refiner.services.discovery import collect_pipeline_services
+
+    pipeline = example.build_pipeline(["sample.pdf"], output_path="output/test.jsonl")


The test uses a hardcoded output path output/test.jsonl. Although the pipeline is not executed in this specific test, it is better practice to use a temporary directory or a dynamic path to avoid potential side effects on the filesystem and to follow testing best practices.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65541769ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

chatgpt-codex-connector · 2026-04-22T13:22:05Z

+        return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered
+    if not isinstance(parsed, dict):
+        return False
+    return bool(parsed.get("is_blank_page"))


Require a real boolean for is_blank_page

parse_blank_page_response currently returns bool(parsed.get("is_blank_page")), which treats any non-empty string as True. If the model responds with valid JSON but stringified values (for example {"is_blank_page": "false"}), this function marks the page as blank and keep_after_blank_check drops a non-blank page from output, corrupting OCR results.

Useful? React with 👍 / 👎.

add dots pdf ocr example

6554176

gemini-code-assist Bot reviewed Apr 22, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Dots PDF OCR example#109

Add Dots PDF OCR example#109
Hynek Kydlíček (hynky1999) wants to merge 1 commit into
mainfrom
codex/inference-services-binding

Hynek Kydlíček (hynky1999) commented Apr 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	- PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py)
	- PDF OCR example: [examples/ocr_dots_pdf.py](examples/ocr_dots_pdf.py)

Conversation

Hynek Kydlíček (hynky1999) commented Apr 22, 2026

Summary

Verification

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant