Skip to content

Add Dots PDF OCR example#109

Open
Hynek Kydlíček (hynky1999) wants to merge 1 commit into
mainfrom
codex/inference-services-binding
Open

Add Dots PDF OCR example#109
Hynek Kydlíček (hynky1999) wants to merge 1 commit into
mainfrom
codex/inference-services-binding

Conversation

@hynky1999
Copy link
Copy Markdown
Collaborator

Summary

  • add a simplified Refiner PDF OCR example using rednote-hilab/dots.mocr
  • filter repeated OCR output and run Qwen/Qwen3.5-4B blank-page checks for pages starting with "The "
  • document the OCR example in inference docs

Verification

  • uv run ruff format examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
  • uv run ruff check examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
  • uv run ty check examples/ocr_dots_pdf.py tests/test_ocr_dots_pdf_example.py
  • uv run pytest tests/test_ocr_dots_pdf_example.py
  • pre-commit hook passed

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new PDF OCR example using Dots OCR and Qwen VLM models, along with documentation updates for prewarmed models and associated tests. Feedback focuses on correcting absolute local file paths in the documentation links, replacing magic numbers with named constants for better maintainability, improving the robustness of the model response parsing logic, and adopting dynamic paths in tests to avoid filesystem side effects.

Comment thread docs/inference.md

- Direct endpoint example: [examples/inference_endpoint.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_endpoint.py)
- VLLM-backed example: [examples/inference_vllm.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_vllm.py)
- PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The link to the PDF OCR example contains an absolute path from a local environment (/Users/hynky/...). This should be updated to a relative path to ensure the link works correctly in the documentation. Note that the preceding links on lines 98 and 99 also appear to have this issue and should be corrected as well.

Suggested change
- PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py)
- PDF OCR example: [examples/ocr_dots_pdf.py](examples/ocr_dots_pdf.py)

Comment thread examples/ocr_dots_pdf.py
Comment on lines +142 to +158
def has_repetition(text: str) -> bool:
lines = [line.strip() for line in text.splitlines() if len(line.strip()) >= 12]
line_counts: dict[str, int] = {}
for line in lines:
line_counts[line] = line_counts.get(line, 0) + 1
if line_counts[line] >= 3:
return True

normalized = " ".join(text.split())
if len(normalized) < 120:
return False

words = normalized.split()
for window_size in (4, 6, 8):
if _has_repeated_word_window(words, window_size=window_size, min_repeats=4):
return True
return False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The has_repetition function uses several magic numbers (e.g., 12, 3, 120, 4, 6, 8) as thresholds for its heuristic repetition check. Defining these as named constants at the top of the file or adding comments to explain the rationale behind these values would improve the maintainability and clarity of the example for users who might need to tune these parameters.

Comment thread examples/ocr_dots_pdf.py
Comment on lines +168 to +170
except json.JSONDecodeError:
lowered = cleaned.lower()
return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback logic for parsing the blank page response is quite fragile as it relies on exact string matches for specific JSON-like patterns. If the model's output format varies slightly (e.g., different spacing or capitalization), this check might fail. Consider using a more robust approach, such as a regular expression, to extract the is_blank_page value.

example = _load_example_module()
from refiner.services.discovery import collect_pipeline_services

pipeline = example.build_pipeline(["sample.pdf"], output_path="output/test.jsonl")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test uses a hardcoded output path output/test.jsonl. Although the pipeline is not executed in this specific test, it is better practice to use a temporary directory or a dynamic path to avoid potential side effects on the filesystem and to follow testing best practices.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65541769ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread examples/ocr_dots_pdf.py
return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered
if not isinstance(parsed, dict):
return False
return bool(parsed.get("is_blank_page"))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Require a real boolean for is_blank_page

parse_blank_page_response currently returns bool(parsed.get("is_blank_page")), which treats any non-empty string as True. If the model responds with valid JSON but stringified values (for example {"is_blank_page": "false"}), this function marks the page as blank and keep_after_blank_check drops a non-blank page from output, corrupting OCR results.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant