Add Dots PDF OCR example#109
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new PDF OCR example using Dots OCR and Qwen VLM models, along with documentation updates for prewarmed models and associated tests. Feedback focuses on correcting absolute local file paths in the documentation links, replacing magic numbers with named constants for better maintainability, improving the robustness of the model response parsing logic, and adopting dynamic paths in tests to avoid filesystem side effects.
|
|
||
| - Direct endpoint example: [examples/inference_endpoint.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_endpoint.py) | ||
| - VLLM-backed example: [examples/inference_vllm.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/inference_vllm.py) | ||
| - PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py) |
There was a problem hiding this comment.
The link to the PDF OCR example contains an absolute path from a local environment (/Users/hynky/...). This should be updated to a relative path to ensure the link works correctly in the documentation. Note that the preceding links on lines 98 and 99 also appear to have this issue and should be corrected as well.
| - PDF OCR example: [examples/ocr_dots_pdf.py](/Users/hynky/.codex/worktrees/ab9a/refiner/examples/ocr_dots_pdf.py) | |
| - PDF OCR example: [examples/ocr_dots_pdf.py](examples/ocr_dots_pdf.py) |
| def has_repetition(text: str) -> bool: | ||
| lines = [line.strip() for line in text.splitlines() if len(line.strip()) >= 12] | ||
| line_counts: dict[str, int] = {} | ||
| for line in lines: | ||
| line_counts[line] = line_counts.get(line, 0) + 1 | ||
| if line_counts[line] >= 3: | ||
| return True | ||
|
|
||
| normalized = " ".join(text.split()) | ||
| if len(normalized) < 120: | ||
| return False | ||
|
|
||
| words = normalized.split() | ||
| for window_size in (4, 6, 8): | ||
| if _has_repeated_word_window(words, window_size=window_size, min_repeats=4): | ||
| return True | ||
| return False |
There was a problem hiding this comment.
The has_repetition function uses several magic numbers (e.g., 12, 3, 120, 4, 6, 8) as thresholds for its heuristic repetition check. Defining these as named constants at the top of the file or adding comments to explain the rationale behind these values would improve the maintainability and clarity of the example for users who might need to tune these parameters.
| except json.JSONDecodeError: | ||
| lowered = cleaned.lower() | ||
| return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered |
There was a problem hiding this comment.
The fallback logic for parsing the blank page response is quite fragile as it relies on exact string matches for specific JSON-like patterns. If the model's output format varies slightly (e.g., different spacing or capitalization), this check might fail. Consider using a more robust approach, such as a regular expression, to extract the is_blank_page value.
| example = _load_example_module() | ||
| from refiner.services.discovery import collect_pipeline_services | ||
|
|
||
| pipeline = example.build_pipeline(["sample.pdf"], output_path="output/test.jsonl") |
There was a problem hiding this comment.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65541769ce
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "Codex (@codex) review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".
| return '"is_blank_page": true' in lowered or "is_blank_page: true" in lowered | ||
| if not isinstance(parsed, dict): | ||
| return False | ||
| return bool(parsed.get("is_blank_page")) |
There was a problem hiding this comment.
Require a real boolean for is_blank_page
parse_blank_page_response currently returns bool(parsed.get("is_blank_page")), which treats any non-empty string as True. If the model responds with valid JSON but stringified values (for example {"is_blank_page": "false"}), this function marks the page as blank and keep_after_blank_check drops a non-blank page from output, corrupting OCR results.
Useful? React with 👍 / 👎.
Summary
Verification