Add fuzzy label matching for PDF/HTML metadata extraction by sabriguenes · Pull Request #66 · cppalliance/wg21-paperflow

sabriguenes · 2026-05-01T09:45:41Z

Metadata labels in WG21 papers sometimes contain typos ("Repy-to", "Documnet Number", "Auther") that an exact regex cannot match, so metadata is lost during conversion. This changeset adds typo-tolerant label recovery to both the PDF and HTML extraction paths.

Core (tomd)

similarity.py: add _symmetric_similarity (max of the SequenceMatcher ratio over both argument orders) and fuzzy_match_label (best match against a known label set, threshold 0.82)
wg21.py: replace all _LABEL_RE.match() calls with _is_label_line(), a two-stage wrapper (exact regex first, fuzzy fallback second) using _FUZZY_LABEL_TARGETS frozenset
extract.py: extend _match_field with fuzzy fallback via _ALL_SYNONYMS inverted dict; warn on fuzzy matches
cleanup.py: update import from _LABEL_RE to _is_label_line for consistency
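The similarity.py item above can be sketched roughly as follows. This is a minimal illustration built on `difflib.SequenceMatcher`, not the code from the PR: the function bodies, the normalization step, and the sample label set are assumptions, and the max-length guard mentioned in the review walkthrough is omitted.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def symmetric_similarity(a: str, b: str) -> float:
    """Return the larger SequenceMatcher ratio of the two argument orders."""
    forward = SequenceMatcher(None, a, b).ratio()
    backward = SequenceMatcher(None, b, a).ratio()
    return max(forward, backward)


def fuzzy_match_label(
    candidate: str, known: Iterable[str], threshold: float = 0.82
) -> Optional[str]:
    """Return the best-scoring known label at or above threshold, else None."""
    norm = candidate.strip().lower()  # assumed normalization
    best_label: Optional[str] = None
    best_score = 0.0
    for label in known:
        score = symmetric_similarity(norm, label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Illustrative label set only; the real targets live in wg21.py / extract.py.
KNOWN = frozenset({"reply-to", "author", "document number", "date"})
print(fuzzy_match_label("Repy-to", KNOWN))
print(fuzzy_match_label("Introduction", KNOWN))
```

Taking the maximum over both argument orders makes the score independent of which string is passed first, which `SequenceMatcher.ratio()` alone does not guarantee.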

CLI (paperlint)

logutil.py: add rich_console_handler context manager that temporarily swaps the console logger to a RichHandler tied to a specific Console instance, so log output does not collide with Rich progress bars
progress.py: change Console() to Console(stderr=True); wrap progress context with rich_console_handler for coordinated output

Tests

test_similarity.py: 6 new tests for fuzzy_match_label + symmetry
test_html_extract.py: 5 new tests for HTML fuzzy label fallback

Docs

PDF_ARCH.md: new "Fuzzy label recovery" subsection
HTML_ARCH.md: document fuzzy fallback in "Shared label mapping"
CLAUDE.md (tomd): update file map entries for similarity.py and extract.py

Summary by CodeRabbit

New Features
- Metadata extraction now tolerates and recovers typoed field labels and can extract/deobfuscate plain-text author/email entries.
- Progress display activates rich console logging during progress sessions.
Documentation
- Architecture docs extended to describe fuzzy label-matching behavior and recovery workflow.
Tests
- Added tests covering fuzzy label matching and end-to-end metadata extraction cases.

coderabbitai · 2026-05-01T09:46:54Z

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9ba777e-6bc0-4c81-aa36-7180ec4ac769

📥 Commits

Reviewing files that changed from the base of the PR and between 5101500 and f1c4bba.

📒 Files selected for processing (13)

packages/paperlint/src/paperlint/logutil.py
packages/paperlint/src/paperlint/progress.py
packages/tomd/src/tomd/CLAUDE.md
packages/tomd/src/tomd/HTML_ARCH.md
packages/tomd/src/tomd/PDF_ARCH.md
packages/tomd/src/tomd/lib/__init__.py
packages/tomd/src/tomd/lib/html/extract.py
packages/tomd/src/tomd/lib/pdf/cleanup.py
packages/tomd/src/tomd/lib/pdf/wg21.py
packages/tomd/src/tomd/lib/similarity.py
packages/tomd/tests/test_html_extract.py
packages/tomd/tests/test_similarity.py
packages/tomd/tests/test_wg21.py

📝 Walkthrough

Walkthrough

Adds fuzzy label matching (symmetric SequenceMatcher, 0.82 threshold) and integrates it into PDF/HTML metadata extraction paths; supplies new tests. Adds a context-managed Rich console handler and uses it from paperlint progress callbacks to enable Rich output on stderr during progress sessions.

Changes

Cohort / File(s)	Summary
Logging & Console Integration `packages/paperlint/src/paperlint/logutil.py`, `packages/paperlint/src/paperlint/progress.py`	New `rich_console_handler(console: Console)` context manager that temporarily swaps the paperlint console handler for a RichHandler; `progress_callbacks` now creates a stderr `Console` and enters `rich_console_handler` during progress sessions.
Fuzzy Matching Core `packages/tomd/src/tomd/lib/similarity.py`	Adds `fuzzy_match_label(candidate, known, threshold=0.82) -> str
HTML Extraction Updates `packages/tomd/src/tomd/lib/html/extract.py`	Adds `_extract_plaintext_authors` fallback, deobfuscation pass for emails, introduces `_ALL_SYNONYMS` and fuzzy fallback in `_match_field` that logs fuzzy matches and returns canonical fields.
PDF Extraction & Cleanup `packages/tomd/src/tomd/lib/pdf/wg21.py`, `packages/tomd/src/tomd/lib/pdf/cleanup.py`	Replace direct `_LABEL_RE` usage with `_is_label_line()` predicate that performs exact-then-fuzzy detection; enables colon-split fuzzy label parsing and continuation handling for reply/author blocks; imports updated accordingly.
Documentation / Architecture `packages/tomd/src/tomd/CLAUDE.md`, `packages/tomd/src/tomd/HTML_ARCH.md`, `packages/tomd/src/tomd/PDF_ARCH.md`	Document exact-then-fuzzy label detection, symmetric similarity scoring, 0.82 threshold, and changes to detection flow.
Tests `packages/tomd/tests/test_html_extract.py`, `packages/tomd/tests/test_similarity.py`, `packages/tomd/tests/test_wg21.py`	Adds tests covering fuzzy label mapping, symmetry property, edge cases, and functional extraction behavior when labels are misspelled.

Sequence Diagram(s)

sequenceDiagram
    participant Extractor as Document Extractor
    participant Fuzzy as fuzzy_match_label()
    participant Known as Known Labels
    participant Seq as SequenceMatcher

    Extractor->>Fuzzy: submit candidate label text
    Fuzzy->>Known: iterate known label set
    loop for each known label
        Fuzzy->>Seq: compute ratio(candidate, known)
        Fuzzy->>Seq: compute ratio(known, candidate)
        Seq-->>Fuzzy: return both ratios
        Fuzzy->>Fuzzy: take max (symmetric score)
        Fuzzy->>Fuzzy: apply max-length guard, compare >= 0.82
    end
    Fuzzy->>Extractor: return best canonical label or None

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Restructure into a uv workspace: paperstore/mailing/tomd/paperlint #53 — Operates on the same paperlint logger handlers and the _pwl_console_handler global that rich_console_handler() manipulates.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.22% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'Add fuzzy label matching for PDF/HTML metadata extraction' clearly and accurately describes the main change: introducing fuzzy matching for metadata labels across both PDF and HTML extraction paths.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

packages/tomd/src/tomd/PDF_ARCH.md (1)
219-227: 💤 Low value

LGTM — Clear and accurate documentation of the fuzzy label recovery feature.

The two-stage approach (exact regex first, fuzzy fallback second) is well explained, with appropriate technical details (symmetric SequenceMatcher, 0.82 threshold) and illustrative examples of recoverable typos. The source citations align with the code changes in wg21.py, cleanup.py, and similarity.py.
Optional: Add similarity.py to the appendix table

For improved discoverability, consider adding lib/similarity.py to the appendix table (lines 423-438) since it now provides a shared fuzzy-matching capability used by both PDF and HTML extraction paths:
 | [`lib/pdf/wg21.py`](lib/pdf/wg21.py) | [Page zero metadata](`#page-zero-metadata`) |
+| [`lib/similarity.py`](lib/similarity.py) | [Page zero metadata](`#page-zero-metadata`) fuzzy matching |
 | [`lib/pdf/table.py`](lib/pdf/table.py) | [Tables](`#tables`) |
However, since the table currently focuses on lib/pdf/* modules and the Sources line already cites similarity.py, this is purely a discoverability enhancement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/PDF_ARCH.md` around lines 219 - 227, Update the
appendix table in PDF_ARCH.md to include lib/similarity.py for discoverability:
add an entry referencing similarity.py (the module that contains
fuzzy_match_label) to the existing appendix table section so readers can find
the shared fuzzy-matching capability used by wg21.py and cleanup.py; ensure the
new row follows the table format used for other lib/pdf/* modules and include a
short description like "Shared fuzzy matching utilities (fuzzy_match_label)" and
a source pointer to lib/similarity.py.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 790e2eca-589e-4060-b4fe-ec9639f9d309

📥 Commits

Reviewing files that changed from the base of the PR and between 85f9960 and 574187f.

📒 Files selected for processing (11)

packages/paperlint/src/paperlint/logutil.py
packages/paperlint/src/paperlint/progress.py
packages/tomd/src/tomd/CLAUDE.md
packages/tomd/src/tomd/HTML_ARCH.md
packages/tomd/src/tomd/PDF_ARCH.md
packages/tomd/src/tomd/lib/html/extract.py
packages/tomd/src/tomd/lib/pdf/cleanup.py
packages/tomd/src/tomd/lib/pdf/wg21.py
packages/tomd/src/tomd/lib/similarity.py
packages/tomd/tests/test_html_extract.py
packages/tomd/tests/test_similarity.py

gregjkal

Mostly small things. The one I'm least sure about but want to flag: I think the reply-to continuation pass gets skipped on the fuzzy path because the gate at line 307 substring-checks for "reply" in the literal block text, which a Repy-to: typo wouldn't satisfy.

gregjkal · 2026-05-01T12:28:13Z

    re.IGNORECASE,
 )
+    # Fallback for PDF sources with typos in label text (e.g. "Repy-to", "Document Number")
+    # that _LABEL_RE cannot match exactly.


Comment is over-indented

gregjkal · 2026-05-01T12:28:13Z

+    "document number", "doc no", "title", "date", "audience", "subgroup",
+    "reply-to", "reply to", "author", "authors", "editor", "editors",
+    "target", "project", "email", "emails",
+})


This set and _LABEL_RE are now two sources of truth for the same vocabulary and could drift. e.g. _LABEL_RE has Doc\.?\s*No\.? but the set only has "doc no". Maybe a small unit test asserting every _LABEL_RE alternative has a fuzzy target, or derive one from the other?
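A drift guard along the lines suggested here could look like the sketch below. The regex, the target set, and the `normalize` helper are stand-ins invented for illustration; the real `_LABEL_RE` and `_FUZZY_LABEL_TARGETS` are only partially visible in this thread.

```python
import re
from difflib import SequenceMatcher

# Stand-ins for _LABEL_RE and _FUZZY_LABEL_TARGETS (assumed shapes).
LABEL_RE = re.compile(
    r"^(Document\s+Number|Doc\.?\s*No\.?|Reply[- ]to|Date|Audience)\s*:",
    re.IGNORECASE,
)
FUZZY_TARGETS = frozenset(
    {"document number", "doc no", "reply-to", "reply to", "date", "audience"}
)

MUST_MATCH = ["Document Number", "Doc. No.", "Doc No", "Reply-to", "Reply to", "Date", "Audience"]
MUST_REJECT = ["Introduction", "Abstract", "Motivation"]


def normalize(label: str) -> str:
    """Lowercase and drop punctuation other than hyphens (hypothetical helper)."""
    return re.sub(r"[^\w\s-]", "", label).strip().lower()


def fuzzy_hit(label: str, threshold: float = 0.82) -> bool:
    """True if any target scores at or above threshold in either argument order."""
    norm = normalize(label)
    return any(
        max(
            SequenceMatcher(None, norm, target).ratio(),
            SequenceMatcher(None, target, norm).ratio(),
        )
        >= threshold
        for target in FUZZY_TARGETS
    )


def check_vocabulary() -> None:
    """Assert both matchers agree on every listed label."""
    for label in MUST_MATCH:
        assert LABEL_RE.match(f"{label}: value"), label
        assert fuzzy_hit(label), label
    for label in MUST_REJECT:
        assert not LABEL_RE.match(f"{label}: value"), label
        assert not fuzzy_hit(label), label


check_vocabulary()
```

Keeping one shared list of labels and asserting it against both matchers means a label added to only one of them fails the test instead of silently drifting.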

gregjkal · 2026-05-01T12:28:13Z

+})
+
+## Split "Label: Value" where the label portion is 2-20 chars before the colon.
+_COLON_SPLIT_RE = re.compile(r"^([^:]{2,20}):\s*(.*)", re.DOTALL)


re.DOTALL looks unnecessary here: the input is single-line _clean(line.text) output, so the trailing (.*) never has to cross a newline. (The flag only changes what . matches; [^:]{2,20} behaves the same either way.) Not harmful, just suggests intent that doesn't apply.
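The effect of the flag on this pattern can be checked directly. The snippet below reuses the pattern from the diff; the sample strings and the `groups` helper are made up for the demonstration.

```python
import re

PATTERN = r"^([^:]{2,20}):\s*(.*)"
with_dotall = re.compile(PATTERN, re.DOTALL)
without_dotall = re.compile(PATTERN)


def groups(rx: "re.Pattern[str]", text: str):
    """Return the captured (label, value) pair, or None if there is no match."""
    m = rx.match(text)
    return m.groups() if m else None


single_line = "Reply-to: Jane Doe <jane@example.com>"
multi_line = "Reply-to: Jane Doe\nJohn Roe"

# Single-line input: the flag makes no difference.
print(groups(with_dotall, single_line) == groups(without_dotall, single_line))

# Multi-line input: only the DOTALL variant lets (.*) run past the newline.
print(groups(without_dotall, multi_line))
print(groups(with_dotall, multi_line))
```

Since `re.DOTALL` only affects `.`, the flag matters solely for the value group, and only when the input can contain a newline.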

gregjkal · 2026-05-01T12:28:13Z

+                            "Fuzzy PDF label match: %r -> %r",
+                            candidate, fuzzy_hit,
+                        )
+                        label = fuzzy_hit.title() if " " not in fuzzy_hit else fuzzy_hit


I think this .title() is effectively a no-op: _store_field immediately calls label.lower().strip() and routes via substring checks. Could probably just pass fuzzy_hit through.

gregjkal · 2026-05-01T12:28:13Z

                        break
                    next_text = _strip_bullets(_clean(next_block.lines[0].text)) if next_block.lines else ""
-                    if not next_text or _LABEL_RE.match(next_text):
+                    if not next_text or _is_label_line(next_text):


Possible bug worth checking: the gate that decides whether to enter this continuation pass (around line 307) substring-checks the original block text for "reply", which a typo like Repy-to: wouldn't satisfy. If so, fuzzy-matched reply-to labels would extract only the first line and skip continuation, i.e. the typo case this PR is trying to fix. A synthetic-block test would confirm.
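The suspected failure mode is easy to reproduce in isolation. The sketch below contrasts a raw-substring gate with one driven by the resolved label; both functions and the label set are hypothetical simplifications, not the code in wg21.py.

```python
from difflib import SequenceMatcher

REPLY_LABELS = frozenset({"reply-to", "reply to"})  # assumed reply-to spellings


def similarity(a: str, b: str) -> float:
    """Symmetric SequenceMatcher ratio."""
    return max(SequenceMatcher(None, a, b).ratio(), SequenceMatcher(None, b, a).ratio())


def raw_text_gate(block_text: str) -> bool:
    """Gate on the literal block text: misses typoed labels."""
    return "reply" in block_text.lower()


def resolved_label_gate(block_text: str, threshold: float = 0.82) -> bool:
    """Gate on the resolved label, so a fuzzy-matched label still qualifies."""
    label = block_text.split(":", 1)[0].strip().lower()
    return any(similarity(label, target) >= threshold for target in REPLY_LABELS)


typo_block = "Repy-to: Jane Doe <jane@example.com>"
print(raw_text_gate(typo_block))        # substring "reply" is absent
print(resolved_label_gate(typo_block))  # label resolves to a reply-to spelling
```

Deciding on the resolved label rather than the raw text keeps the continuation pass consistent with whatever the label matcher accepted.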

gregjkal · 2026-05-01T12:28:13Z

+                    fuzzy_hit = fuzzy_match_label(candidate, _FUZZY_LABEL_TARGETS)
+                    if fuzzy_hit is not None:
+                        _log.warning(
+                            "Fuzzy PDF label match: %r -> %r",


Format nit: the HTML side logs "Fuzzy label match: %r -> %r (field %r)"

Also, again recommend info rather than warning here.

gregjkal · 2026-05-01T12:28:13Z

+    # Fuzzy fallback when no exact synonym matched.
+    fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
+    if fuzzy_hit is not None:
+        _log.warning("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])


Probably should be info rather than warning. Fuzzy hit isn't really an error condition, and could be noisy on batch runs.

Bug fix

wg21.py: replace raw-text "reply" substring check at continuation gate with has_reply_label flag set from the resolved label name, so fuzzy-matched labels like "Repy-to" still trigger multi-block reply-to absorption

Nits (gregjkal)

wg21.py: fix over-indented comment above _FUZZY_LABEL_TARGETS
wg21.py: remove unnecessary re.DOTALL from _COLON_SPLIT_RE
wg21.py: drop .title() no-op on fuzzy_hit (immediately lowercased by _store_field)
wg21.py: align log format with HTML side, change warning to info
extract.py: change fuzzy match log from warning to info

Tests

test_wg21.py: add vocabulary drift guard (MUST_MATCH/MUST_REJECT lists asserted against both _LABEL_RE and fuzzy_match_label)
test_wg21.py: add fuzzy reply-to continuation test with synthetic "Repy-to:" block followed by continuation block

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/tomd/src/tomd/lib/html/extract.py`:
- Around line 92-94: The fallback that adds bare-text lines from an <address>
block currently treats any non-email line as a reply-to entry; update the branch
in _extract_handwritten_metadata() so it only adds author-like lines: implement
and call a small helper (e.g., _is_author_line(text)) that returns true for
plausible person names (for example, two or more words with initial capitals, no
date/affiliation keywords, and no digits) and/or matches an AUTHOR_RE, and only
then call _add(cleaned); alternatively restrict the fallback to known author
rows rather than all leftover lines to avoid adding
document/date/audience/affiliation entries.
- Around line 439-453: The table-based and hand-written HTML metadata extractors
should call the shared _match_field(label) instead of inspecting raw substrings
on the label; update those extractor code paths to call _match_field(label),
check its return (canonical field or None), and use that canonical name (from
_FIELD_SYNONYMS/_ALL_SYNONYMS via fuzzy_match_label) when assigning metadata
like document/date/audience/reply-to so typo-tolerant matching applies across
all HTML extractors.
- Around line 8-12: The current relative import in extract.py fails because
symbols like deobfuscate_email, parse_author_lines, DATE_RE, DOC_NUM_RE,
EMAIL_RE, _OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE are not exported
from tomd.lib; fix by either (A) changing the import in tomd/lib/html/extract.py
to import those symbols directly from the module that defines them (locate the
module that defines deobfuscate_email and parse_author_lines and import from it
instead of using "from .. import ..."), or (B) re-export the missing symbols
from packages/tomd/src/tomd/lib/__init__.py so "from .. import DATE_RE,
DOC_NUM_RE, EMAIL_RE, parse_author_lines, deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE" works; update tests/imports
accordingly and ensure fuzzy_match_label import remains from ..similarity.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py`:
- Around line 11-12: wg21.py and html/extract.py expect deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE to be available from the
package lib namespace but they aren't exported; fix by either (A) defining and
exporting these three symbols in the lib package __init__ (add definitions for
_OBFUSCATED_UNDERSCORE_RE and _OBFUSCATED_WORD_RE and the deobfuscate_email
function, then add them to __all__) or (B) change the imports in wg21.py and
html/extract.py to import those symbols from the module that actually implements
them; refer to the symbol names deobfuscate_email, _OBFUSCATED_UNDERSCORE_RE,
and _OBFUSCATED_WORD_RE when making the change.

In `@packages/tomd/src/tomd/PDF_ARCH.md`:
- Around line 219-226: The subsection currently uses bold text "Fuzzy label
recovery" which doesn't create an anchor for the existing link to
"#fuzzy-label-recovery"; replace the bolded line with a proper Markdown heading
(for example "### Fuzzy label recovery" or "## Fuzzy label recovery") so a
stable anchor is generated, keep the rest of the paragraph unchanged, and ensure
the heading text exactly matches the link target (fuzzy-label-recovery) so
references to `_is_label_line`, `_FUZZY_LABEL_TARGETS`, and `fuzzy_match_label`
resolve correctly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0443aed-68cd-478f-a5b6-a04dbc12513e

📥 Commits

Reviewing files that changed from the base of the PR and between 574187f and 5101500.

📒 Files selected for processing (4)

packages/tomd/src/tomd/PDF_ARCH.md
packages/tomd/src/tomd/lib/html/extract.py
packages/tomd/src/tomd/lib/pdf/wg21.py
packages/tomd/tests/test_wg21.py

coderabbitai · 2026-05-01T15:13:24Z

 def _match_field(label: str) -> str | None:
-    """Map a metadata label to its canonical field name, or None if unrecognized."""
+    """Map a metadata label to its canonical field name, or None if unrecognized.
+
+    Two-stage: exact synonym lookup first, then fuzzy fallback via
+    ``fuzzy_match_label`` from ``similarity.py``.
+    """
    norm = _normalize_label(label)
    for field, synonyms in _FIELD_SYNONYMS.items():
        if norm in synonyms:
            return field
+    # Fuzzy fallback when no exact synonym matched.
+    fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
+    if fuzzy_hit is not None:
+        _log.info("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])
+        return _ALL_SYNONYMS[fuzzy_hit]


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire _match_field() into the other HTML extractors too.

This helper now does the fuzzy recovery, but the hand-written and generic table paths still inspect raw substrings on label directly. As written, typo-tolerant matching only applies to the WG21 <dl> path, so table-based HTML metadata can still miss typoed document/date/audience/reply-to labels.

Based on learnings: "Use shared label matcher (_FIELD_SYNONYMS + _match_field) across HTML extractors with exact synonym lookup first, then fuzzy fallback via similarity.fuzzy_match_label for metadata extraction".
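A shared matcher of the kind recommended here could be structured as below. The synonym table, field names, and normalization are invented for illustration; only the overall shape (exact lookup through an inverted synonym map, then a fuzzy fallback) follows the diff above.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Set

# Hypothetical synonym table; the real _FIELD_SYNONYMS is not shown in the thread.
FIELD_SYNONYMS: Dict[str, Set[str]] = {
    "document": {"document number", "doc no", "document"},
    "date": {"date"},
    "audience": {"audience", "subgroup"},
    "reply_to": {"reply-to", "reply to", "author", "authors"},
}

# Inverted map: synonym -> canonical field, built once at import time.
ALL_SYNONYMS: Dict[str, str] = {
    synonym: field for field, synonyms in FIELD_SYNONYMS.items() for synonym in synonyms
}


def match_field(label: str, threshold: float = 0.82) -> Optional[str]:
    """Exact synonym lookup first, then fuzzy fallback over all synonyms."""
    norm = label.strip().rstrip(":").strip().lower()  # assumed normalization
    if norm in ALL_SYNONYMS:
        return ALL_SYNONYMS[norm]
    best: Optional[str] = None
    best_score = 0.0
    for synonym in ALL_SYNONYMS:
        score = max(
            SequenceMatcher(None, norm, synonym).ratio(),
            SequenceMatcher(None, synonym, norm).ratio(),
        )
        if score > best_score:
            best, best_score = synonym, score
    if best is not None and best_score >= threshold:
        return ALL_SYNONYMS[best]
    return None


# Every extractor would call match_field() instead of testing raw substrings.
for raw in ("Date", "Audiance:", "Documnet Number", "Abstract"):
    print(raw, "->", match_field(raw))
```

Routing the table-based and hand-written extractors through one function like this is what would give every HTML path the same typo tolerance.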

coderabbitai · 2026-05-01T15:13:24Z

+**Fuzzy label recovery**
+
+- Label detection uses a two-stage strategy: first an exact regex match via `_LABEL_RE`, then a fuzzy fallback via `_is_label_line` which delegates to `similarity.fuzzy_match_label`.
+- The fuzzy stage compares the candidate text against `_FUZZY_LABEL_TARGETS` (a frozen set of canonical labels: "Document Number", "Date", "Project", "Reply to", "Audience") using symmetric `SequenceMatcher` (max of both argument orders) with a threshold of 0.82.
+- This recovers metadata from PDFs with typos in label text (e.g. "Repy-to", "Documnet Number") that the exact regex cannot match.
+- All call sites that previously checked `_LABEL_RE.match()` directly now call `_is_label_line()` for consistent exact-then-fuzzy behavior.
+
+**Sources:** `_is_label_line`, `_FUZZY_LABEL_TARGETS` in [`lib/pdf/wg21.py`](lib/pdf/wg21.py); `fuzzy_match_label` in [`lib/similarity.py`](lib/similarity.py).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use a real heading for this subsection.

Line 438 now links to #fuzzy-label-recovery, but Lines 219-226 only introduce bold text, so no anchor is generated and the appendix link stays broken.

Suggested fix

-**Fuzzy label recovery**
+### Fuzzy label recovery

extract.py and wg21.py import deobfuscate_email, _OBFUSCATED_WORD_RE, and _OBFUSCATED_UNDERSCORE_RE from tomd.lib, but the definition in lib/__init__.py was not included in the original commit. This caused ImportError on CI while passing locally (uncommitted file present in working directory). Also extends parse_author_lines to deobfuscate "user at domain dot com" and "user_at_domain.com" anti-spam patterns before treating a line as a bare name.

Docs

CLAUDE.md: document deobfuscate_email and _extract_plaintext_authors in file map entries for lib/__init__.py and lib/html/extract.py
PDF_ARCH.md: add deobfuscation fallback to email enrichment section
HTML_ARCH.md: document _extract_plaintext_authors fallback and deobfuscation in _collect_metadata_emails bootstrap path

Tests (test_html_extract.py)

TestDeobfuscateEmail: 12 cases covering word-based ("at"/"dot"), underscore-based (_at_), rejection of prose and valid emails
TestExtractPlaintextAuthors: 4 cases for address tags, plain-text emails, td fallback, empty containers
TestHandwrittenObfuscatedReplyTo: integration test reproducing p2285r1 pattern (obfuscated emails in address tags)

gregjkal

Changes per my earlier review look good. Four new minor comments on the new deobfuscation code.

gregjkal · 2026-05-01T16:09:39Z

+                    from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE
+                    um = _OBFUSCATED_UNDERSCORE_RE.search(lt)
+                    wm = _OBFUSCATED_WORD_RE.search(lt)
+                    match_start = (um or wm).start()


Fragile: deobfuscate_email already found the match span internally and discarded it; this code re-runs both regexes to recover it. If the underscore family matches but fails EMAIL_RE.fullmatch and the word family is what actually succeeded, (um or wm).start() returns the wrong offset and the name slice is off. Same pattern duplicated in lib/__init__.py:497 and lib/html/extract.py:84. Maybe have deobfuscate_email return (email, span) so callers can slice without re-matching.
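One way to implement the suggestion is to have the function return the span of whichever pattern actually produced the email. The sketch below is an assumption-heavy illustration: the two obfuscation regexes are guesses at the general shape, since the real `_OBFUSCATED_UNDERSCORE_RE` and `_OBFUSCATED_WORD_RE` are not shown in the thread.

```python
import re
from typing import Optional, Tuple

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")
# Assumed patterns for "user_at_domain.com" and "user at domain dot com".
UNDERSCORE_RE = re.compile(r"([\w.+-]+)_at_([\w.-]+\.\w+)")
WORD_RE = re.compile(r"([\w.+-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)", re.IGNORECASE)


def deobfuscate_email(text: str) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return (email, (start, end)) for the pattern that succeeded, else None."""
    um = UNDERSCORE_RE.search(text)
    if um:
        email = f"{um.group(1)}@{um.group(2)}"
        if EMAIL_RE.fullmatch(email):
            return email, um.span()
    wm = WORD_RE.search(text)
    if wm:
        domain = re.sub(r"\s+dot\s+", ".", wm.group(2), flags=re.IGNORECASE)
        email = f"{wm.group(1)}@{domain}"
        if EMAIL_RE.fullmatch(email):
            return email, wm.span()
    return None


# The caller slices the name with the returned span instead of re-matching.
line = "Jane Doe jdoe at example dot com"
result = deobfuscate_email(line)
if result is not None:
    email, (start, _end) = result
    print(line[:start].strip(), email)
```

Because the span comes from the same match object that yielded the email, the name slice cannot disagree with the pattern family that was actually used.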

gregjkal · 2026-05-01T16:09:39Z


 EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")

+import logging as _logging


Belongs with top-of-file imports.

gregjkal · 2026-05-01T16:09:39Z

+                deob = deobfuscate_email(lt)
+                if deob:
+                    # Name is everything before the obfuscated email
+                    from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE


Belongs with top-of-file imports.

gregjkal · 2026-05-01T17:16:38Z

+                    if fuzzy_hit is not None:
+                        _log.info(
+                            "Fuzzy label match: %r -> %r (target %r)",
+                            candidate, fuzzy_hit, fuzzy_hit,


Same fuzzy_hit item twice. HTML side passes _ALL_SYNONYMS[fuzzy_hit] (canonical field). Maybe that's what you meant here too?

coderabbitai Bot reviewed May 1, 2026

docs(PDF_ARCH): add similarity.py to appendix table (c983d31)

gregjkal requested changes May 1, 2026

coderabbitai Bot reviewed May 1, 2026

sabriguenes added 2 commits May 1, 2026 17:21

sabriguenes force-pushed the sg/fuzzy-label-matching branch from a4e606e to f1c4bba May 1, 2026 15:29

gregjkal requested changes May 1, 2026