Add fuzzy label matching for PDF/HTML metadata extraction#66

Open
sabriguenes wants to merge 5 commits into cppalliance:main from sabriguenes:sg/fuzzy-label-matching
Contributor

@sabriguenes sabriguenes commented May 1, 2026

Metadata labels in WG21 papers sometimes contain typos ("Repy-to", "Documnet Number", "Auther") that exact regex cannot match, causing metadata loss during conversion. This changeset adds typo-tolerant label recovery across both PDF and HTML extraction paths.

Core (tomd)

  • similarity.py: add _symmetric_similarity (max of both SequenceMatcher argument orders) and fuzzy_match_label (best-match against known label set, threshold 0.82)
  • wg21.py: replace all _LABEL_RE.match() calls with _is_label_line(), a two-stage wrapper (exact regex first, fuzzy fallback second) using _FUZZY_LABEL_TARGETS frozenset
  • extract.py: extend _match_field with fuzzy fallback via _ALL_SYNONYMS inverted dict; warn on fuzzy matches
  • cleanup.py: update import from _LABEL_RE to _is_label_line for consistency
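
A minimal sketch of the symmetric scoring and best-match lookup described in the list above, assuming the comparison is done with `difflib.SequenceMatcher`. The function bodies, the lowercasing step, and the sample label set are illustrative assumptions rather than the code in similarity.py, and any extra guards the real implementation applies are omitted.

```python
from difflib import SequenceMatcher


def symmetric_similarity(a: str, b: str) -> float:
    """Max of both argument orders; ratio() may depend on argument order."""
    return max(
        SequenceMatcher(None, a, b).ratio(),
        SequenceMatcher(None, b, a).ratio(),
    )


def fuzzy_match_label(candidate, known, threshold=0.82):
    """Return the best-scoring known label at or above threshold, else None."""
    norm = candidate.strip().lower()  # assumption: case-insensitive comparison
    best_label, best_score = None, 0.0
    for label in known:
        score = symmetric_similarity(norm, label)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Sample label set for illustration only.
KNOWN = frozenset({"document number", "reply-to", "author", "authors", "date"})
```

With this sketch, "Auther" scores about 0.83 against "author", which is close to the 0.82 cut-off, so a noticeably higher threshold would stop recovering that typo.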

CLI (paperlint)

  • logutil.py: add rich_console_handler context manager that temporarily swaps the console logger to a RichHandler tied to a specific Console instance, so log output does not collide with Rich progress bars
  • progress.py: change Console() to Console(stderr=True); wrap progress context with rich_console_handler for coordinated output
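
The handler swap described above can be sketched with the standard library alone. In the PR the replacement is a RichHandler bound to a Console; here a plain `logging.StreamHandler` stands in so the example runs without Rich installed. The names `swapped_handler`, `demo`, and the logger name are hypothetical.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def swapped_handler(logger, current, replacement):
    """Temporarily route logger output through replacement instead of current."""
    logger.removeHandler(current)
    logger.addHandler(replacement)
    try:
        yield replacement
    finally:
        # Restore the original handler even if the body raises.
        logger.removeHandler(replacement)
        logger.addHandler(current)


def demo():
    plain, rich_like = io.StringIO(), io.StringIO()
    logger = logging.getLogger("paperlint.demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    console = logging.StreamHandler(plain)
    logger.addHandler(console)

    logger.info("before")
    with swapped_handler(logger, console, logging.StreamHandler(rich_like)):
        logger.info("during")  # goes only to the temporary handler
    logger.info("after")
    logger.removeHandler(console)
    return plain.getvalue(), rich_like.getvalue()
```

Restoring the handler in `finally` keeps the logger usable if the progress session exits with an exception.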

Tests

  • test_similarity.py: 6 new tests for fuzzy_match_label + symmetry
  • test_html_extract.py: 5 new tests for HTML fuzzy label fallback

Docs

  • PDF_ARCH.md: new "Fuzzy label recovery" subsection
  • HTML_ARCH.md: document fuzzy fallback in "Shared label mapping"
  • CLAUDE.md (tomd): update file map entries for similarity.py and extract.py

Summary by CodeRabbit

  • New Features

    • Metadata extraction now tolerates and recovers typoed field labels and can extract/deobfuscate plain-text author/email entries.
    • Progress display activates rich console logging during progress sessions.
  • Documentation

    • Architecture docs extended to describe fuzzy label-matching behavior and recovery workflow.
  • Tests

    • Added tests covering fuzzy label matching and end-to-end metadata extraction cases.

@coderabbitai

coderabbitai Bot commented May 1, 2026

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9ba777e-6bc0-4c81-aa36-7180ec4ac769

📥 Commits

Reviewing files that changed from the base of the PR and between 5101500 and f1c4bba.

📒 Files selected for processing (13)
  • packages/paperlint/src/paperlint/logutil.py
  • packages/paperlint/src/paperlint/progress.py
  • packages/tomd/src/tomd/CLAUDE.md
  • packages/tomd/src/tomd/HTML_ARCH.md
  • packages/tomd/src/tomd/PDF_ARCH.md
  • packages/tomd/src/tomd/lib/__init__.py
  • packages/tomd/src/tomd/lib/html/extract.py
  • packages/tomd/src/tomd/lib/pdf/cleanup.py
  • packages/tomd/src/tomd/lib/pdf/wg21.py
  • packages/tomd/src/tomd/lib/similarity.py
  • packages/tomd/tests/test_html_extract.py
  • packages/tomd/tests/test_similarity.py
  • packages/tomd/tests/test_wg21.py
📝 Walkthrough

Adds fuzzy label matching (symmetric SequenceMatcher, 0.82 threshold) and integrates it into PDF/HTML metadata extraction paths; supplies new tests. Adds a context-managed Rich console handler and uses it from paperlint progress callbacks to enable Rich output on stderr during progress sessions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Logging & Console Integration**<br>packages/paperlint/src/paperlint/logutil.py, packages/paperlint/src/paperlint/progress.py | New rich_console_handler(console: Console) context manager that temporarily swaps the paperlint console handler for a RichHandler; progress_callbacks now creates a stderr Console and enters rich_console_handler during progress sessions. |
| **Fuzzy Matching Core**<br>packages/tomd/src/tomd/lib/similarity.py | Adds `fuzzy_match_label(candidate, known, threshold=0.82) -> str |
| **HTML Extraction Updates**<br>packages/tomd/src/tomd/lib/html/extract.py | Adds _extract_plaintext_authors fallback, deobfuscation pass for emails, introduces _ALL_SYNONYMS and fuzzy fallback in _match_field that logs fuzzy matches and returns canonical fields. |
| **PDF Extraction & Cleanup**<br>packages/tomd/src/tomd/lib/pdf/wg21.py, packages/tomd/src/tomd/lib/pdf/cleanup.py | Replace direct _LABEL_RE usage with _is_label_line() predicate that performs exact-then-fuzzy detection; enables colon-split fuzzy label parsing and continuation handling for reply/author blocks; imports updated accordingly. |
| **Documentation / Architecture**<br>packages/tomd/src/tomd/CLAUDE.md, packages/tomd/src/tomd/HTML_ARCH.md, packages/tomd/src/tomd/PDF_ARCH.md | Document exact-then-fuzzy label detection, symmetric similarity scoring, 0.82 threshold, and changes to detection flow. |
| **Tests**<br>packages/tomd/tests/test_html_extract.py, packages/tomd/tests/test_similarity.py, packages/tomd/tests/test_wg21.py | Adds tests covering fuzzy label mapping, symmetry property, edge cases, and functional extraction behavior when labels are misspelled. |

Sequence Diagram(s)

sequenceDiagram
    participant Extractor as Document Extractor
    participant Fuzzy as fuzzy_match_label()
    participant Known as Known Labels
    participant Seq as SequenceMatcher

    Extractor->>Fuzzy: submit candidate label text
    Fuzzy->>Known: iterate known label set
    loop for each known label
        Fuzzy->>Seq: compute ratio(candidate, known)
        Fuzzy->>Seq: compute ratio(known, candidate)
        Seq-->>Fuzzy: return both ratios
        Fuzzy->>Fuzzy: take max (symmetric score)
        Fuzzy->>Fuzzy: apply max-length guard, compare >= 0.82
    end
    Fuzzy->>Extractor: return best canonical label or None

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hop through labels, fuzzy and bright,
Symmetric scores guiding me right.
Rich handlers shine on stderr’s stage,
Typos recovered, tests engage.
A little rabbit cheers the new flight.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Add fuzzy label matching for PDF/HTML metadata extraction' clearly and accurately describes the main change: introducing fuzzy matching for metadata labels across both PDF and HTML extraction paths. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
packages/tomd/src/tomd/PDF_ARCH.md (1)

219-227: 💤 Low value

LGTM — Clear and accurate documentation of the fuzzy label recovery feature.

The two-stage approach (exact regex first, fuzzy fallback second) is well explained, with appropriate technical details (symmetric SequenceMatcher, 0.82 threshold) and illustrative examples of recoverable typos. The source citations align with the code changes in wg21.py, cleanup.py, and similarity.py.

Optional: Add similarity.py to the appendix table

For improved discoverability, consider adding lib/similarity.py to the appendix table (lines 423-438) since it now provides a shared fuzzy-matching capability used by both PDF and HTML extraction paths:

 | [`lib/pdf/wg21.py`](lib/pdf/wg21.py) | [Page zero metadata](`#page-zero-metadata`) |
+| [`lib/similarity.py`](lib/similarity.py) | [Page zero metadata](`#page-zero-metadata`) fuzzy matching |
 | [`lib/pdf/table.py`](lib/pdf/table.py) | [Tables](`#tables`) |

However, since the table currently focuses on lib/pdf/* modules and the Sources line already cites similarity.py, this is purely a discoverability enhancement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/PDF_ARCH.md` around lines 219 - 227, Update the
appendix table in PDF_ARCH.md to include lib/similarity.py for discoverability:
add an entry referencing similarity.py (the module that contains
fuzzy_match_label) to the existing appendix table section so readers can find
the shared fuzzy-matching capability used by wg21.py and cleanup.py; ensure the
new row follows the table format used for other lib/pdf/* modules and include a
short description like "Shared fuzzy matching utilities (fuzzy_match_label)" and
a source pointer to lib/similarity.py.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 790e2eca-589e-4060-b4fe-ec9639f9d309

📥 Commits

Reviewing files that changed from the base of the PR and between 85f9960 and 574187f.

📒 Files selected for processing (11)
  • packages/paperlint/src/paperlint/logutil.py
  • packages/paperlint/src/paperlint/progress.py
  • packages/tomd/src/tomd/CLAUDE.md
  • packages/tomd/src/tomd/HTML_ARCH.md
  • packages/tomd/src/tomd/PDF_ARCH.md
  • packages/tomd/src/tomd/lib/html/extract.py
  • packages/tomd/src/tomd/lib/pdf/cleanup.py
  • packages/tomd/src/tomd/lib/pdf/wg21.py
  • packages/tomd/src/tomd/lib/similarity.py
  • packages/tomd/tests/test_html_extract.py
  • packages/tomd/tests/test_similarity.py

Collaborator

@gregjkal gregjkal left a comment

Mostly small things. The one I'm least sure about but want to flag: I think the reply-to continuation pass gets skipped on the fuzzy path because the gate at line 307 substring-checks for "reply" in the literal block text, which a Repy-to: typo wouldn't satisfy.

Comment thread packages/tomd/src/tomd/lib/pdf/wg21.py Outdated
re.IGNORECASE,
)
# Fallback for PDF sources with typos in label text (e.g. "Repy-to", "Document Number")
# that _LABEL_RE cannot match exactly.
Collaborator

Comment is over-indented

"document number", "doc no", "title", "date", "audience", "subgroup",
"reply-to", "reply to", "author", "authors", "editor", "editors",
"target", "project", "email", "emails",
})
Collaborator

This set and _LABEL_RE are now two sources of truth for the same vocabulary and could drift. e.g. _LABEL_RE has Doc\.?\s*No\.? but the set only has "doc no". Maybe a small unit test asserting every _LABEL_RE alternative has a fuzzy target, or derive one from the other?
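
One way the suggested drift guard could look, as a sketch only: `LABEL_RE`, `FUZZY_TARGETS`, the `normalize` helper, and both label lists are hypothetical stand-ins, since the full _LABEL_RE and _FUZZY_LABEL_TARGETS are not shown in this thread. The idea is to assert one shared vocabulary against both the regex and the fuzzy set so neither can change alone.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-ins for _LABEL_RE and _FUZZY_LABEL_TARGETS.
LABEL_RE = re.compile(
    r"^(?:Document\s+Number|Doc\.?\s*No\.?|Reply[- ]to|Date|Audience)\s*:",
    re.IGNORECASE,
)
FUZZY_TARGETS = frozenset(
    {"document number", "doc no", "reply-to", "reply to", "date", "audience"}
)

MUST_MATCH = ["Document Number", "Doc. No.", "Doc No", "Reply-to", "Reply to", "Date", "Audience"]
MUST_REJECT = ["Abstract", "Introduction", "Revision History"]


def normalize(label):
    """Lowercase, drop dots, collapse whitespace ("Doc. No." -> "doc no")."""
    return " ".join(label.lower().replace(".", " ").split())


def fuzzy_hit(label, threshold=0.82):
    norm = normalize(label)
    return any(
        max(SequenceMatcher(None, norm, t).ratio(),
            SequenceMatcher(None, t, norm).ratio()) >= threshold
        for t in FUZZY_TARGETS
    )


def find_drift():
    """Return (problem, label) pairs where the regex and fuzzy set disagree with the lists."""
    problems = []
    for label in MUST_MATCH:
        if not LABEL_RE.match(f"{label}: value"):
            problems.append(("regex misses", label))
        if not fuzzy_hit(label):
            problems.append(("fuzzy misses", label))
    for label in MUST_REJECT:
        if LABEL_RE.match(f"{label}: value") or fuzzy_hit(label):
            problems.append(("false positive", label))
    return problems
```

A test would then assert that `find_drift()` returns an empty list; adding a regex alternative without a matching fuzzy target shows up as a "fuzzy misses" entry once the label is added to `MUST_MATCH`.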

Comment thread packages/tomd/src/tomd/lib/pdf/wg21.py Outdated
})

## Split "Label: Value" where the label portion is 2-20 chars before the colon.
_COLON_SPLIT_RE = re.compile(r"^([^:]{2,20}):\s*(.*)", re.DOTALL)
Collaborator

re.DOTALL looks unnecessary here: input is single-line _clean(line.text) output and [^:]{2,20} already excludes newlines. Not harmful, just suggests intent that doesn't apply.
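
A short sketch of what the flag changes in practice, using the pattern quoted above. `re.DOTALL` only alters what `.` matches, so in this pattern it affects just the trailing `(.*)` group; a negated class such as `[^:]` is unaffected by the flag. The `split_label` helper and the sample strings are illustrative.

```python
import re

PLAIN = re.compile(r"^([^:]{2,20}):\s*(.*)")
WITH_DOTALL = re.compile(r"^([^:]{2,20}):\s*(.*)", re.DOTALL)


def split_label(pattern, text):
    """Return (label, value) if text looks like "Label: Value", else None."""
    m = pattern.match(text)
    return (m.group(1), m.group(2)) if m else None


single = "Repy-to: Jane Doe <jane@example.com>"
multi = "Date: 2026-05-01\nAudience: LEWG"

# Single-line input: both patterns agree.
same = split_label(PLAIN, single) == split_label(WITH_DOTALL, single)

# Multi-line input: only the DOTALL variant lets (.*) run past the newline.
plain_value = split_label(PLAIN, multi)[1]
dotall_value = split_label(WITH_DOTALL, multi)[1]
```

For single-line input the two patterns behave identically, which is consistent with dropping the flag when the caller only ever passes one cleaned line.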

Comment thread packages/tomd/src/tomd/lib/pdf/wg21.py Outdated
"Fuzzy PDF label match: %r -> %r",
candidate, fuzzy_hit,
)
label = fuzzy_hit.title() if " " not in fuzzy_hit else fuzzy_hit
Collaborator

I think this .title() is effectively a no-op: _store_field immediately calls label.lower().strip() and routes via substring checks. Could probably just pass fuzzy_hit through.

break
next_text = _strip_bullets(_clean(next_block.lines[0].text)) if next_block.lines else ""
-if not next_text or _LABEL_RE.match(next_text):
+if not next_text or _is_label_line(next_text):
Collaborator

Possible bug worth checking: the gate that decides whether to enter this continuation pass (around line 307) substring-checks the original block text for "reply", which a typo like Repy-to: wouldn't satisfy. If so, fuzzy-matched reply-to labels would extract only the first line and skip continuation, i.e. the typo case this PR is trying to fix. A synthetic-block test would confirm.
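
The suspected gap can be reproduced in isolation. The sketch below compares a gate that substring-checks the raw block text with one that checks the resolved label; `resolve_label`, both gate functions, and the sample targets are hypothetical and only mirror the behaviour described in this comment.

```python
from difflib import SequenceMatcher

TARGETS = ("reply-to", "reply to", "author", "date")  # sample labels


def resolve_label(raw_label, threshold=0.82):
    """Exact match first, then symmetric fuzzy fallback; None if nothing qualifies."""
    norm = raw_label.strip().lower()
    if norm in TARGETS:
        return norm
    scored = [
        (max(SequenceMatcher(None, norm, t).ratio(),
             SequenceMatcher(None, t, norm).ratio()), t)
        for t in TARGETS
    ]
    score, best = max(scored)
    return best if score >= threshold else None


def gate_on_raw_text(block_text):
    # Misses typos: "repy-to" does not contain "reply".
    return "reply" in block_text.lower()


def gate_on_resolved_label(block_text):
    # Decide from the label after exact/fuzzy resolution instead of the raw text.
    label, _, _ = block_text.partition(":")
    resolved = resolve_label(label)
    return resolved is not None and resolved.startswith("reply")
```

Under these assumptions a "Repy-to:" block fails the raw-text gate but passes the resolved-label gate, which is the case a synthetic-block test would need to cover.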

Comment thread packages/tomd/src/tomd/lib/pdf/wg21.py Outdated
fuzzy_hit = fuzzy_match_label(candidate, _FUZZY_LABEL_TARGETS)
if fuzzy_hit is not None:
_log.warning(
"Fuzzy PDF label match: %r -> %r",
Collaborator

Format nit: HTML side logs "Fuzzy label match: %r -> %r (field %r)"

Also, again recommend info rather than warning here.

# Fuzzy fallback when no exact synonym matched.
fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
if fuzzy_hit is not None:
_log.warning("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])
Collaborator

Probably should be info rather than warning. Fuzzy hit isn't really an error condition, and could be noisy on batch runs.

Bug fix:
- wg21.py: replace raw-text "reply" substring check at continuation
  gate with has_reply_label flag set from the resolved label name,
  so fuzzy-matched labels like "Repy-to" still trigger multi-block
  reply-to absorption

Nits (gregjkal):
- wg21.py: fix over-indented comment above _FUZZY_LABEL_TARGETS
- wg21.py: remove unnecessary re.DOTALL from _COLON_SPLIT_RE
- wg21.py: drop .title() no-op on fuzzy_hit (immediately lowercased
  by _store_field)
- wg21.py: align log format with HTML side, change warning to info
- extract.py: change fuzzy match log from warning to info

Tests:
- test_wg21.py: add vocabulary drift guard (MUST_MATCH/MUST_REJECT
  lists asserted against both _LABEL_RE and fuzzy_match_label)
- test_wg21.py: add fuzzy reply-to continuation test with synthetic
  "Repy-to:" block followed by continuation block
sabriguenes pushed a commit to sabriguenes/wg21-paperflow that referenced this pull request May 1, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/tomd/src/tomd/lib/html/extract.py`:
- Around line 92-94: The fallback that adds bare-text lines from an <address>
block currently treats any non-email line as a reply-to entry; update the branch
in _extract_handwritten_metadata() so it only adds author-like lines: implement
and call a small helper (e.g., _is_author_line(text)) that returns true for
plausible person names (for example, two or more words with initial capitals, no
date/affiliation keywords, and no digits) and/or matches an AUTHOR_RE, and only
then call _add(cleaned); alternatively restrict the fallback to known author
rows rather than all leftover lines to avoid adding
document/date/audience/affiliation entries.
- Around line 439-453: The table-based and hand-written HTML metadata extractors
should call the shared _match_field(label) instead of inspecting raw substrings
on the label; update those extractor code paths to call _match_field(label),
check its return (canonical field or None), and use that canonical name (from
_FIELD_SYNONYMS/_ALL_SYNONYMS via fuzzy_match_label) when assigning metadata
like document/date/audience/reply-to so typo-tolerant matching applies across
all HTML extractors.
- Around line 8-12: The current relative import in extract.py fails because
symbols like deobfuscate_email, parse_author_lines, DATE_RE, DOC_NUM_RE,
EMAIL_RE, _OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE are not exported
from tomd.lib; fix by either (A) changing the import in tomd/lib/html/extract.py
to import those symbols directly from the module that defines them (locate the
module that defines deobfuscate_email and parse_author_lines and import from it
instead of using "from .. import ..."), or (B) re-export the missing symbols
from packages/tomd/src/tomd/lib/__init__.py so "from .. import DATE_RE,
DOC_NUM_RE, EMAIL_RE, parse_author_lines, deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE" works; update tests/imports
accordingly and ensure fuzzy_match_label import remains from ..similarity.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py`:
- Around line 11-12: wg21.py and html/extract.py expect deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE to be available from the
package lib namespace but they aren't exported; fix by either (A) defining and
exporting these three symbols in the lib package __init__ (add definitions for
_OBFUSCATED_UNDERSCORE_RE and _OBFUSCATED_WORD_RE and the deobfuscate_email
function, then add them to __all__) or (B) change the imports in wg21.py and
html/extract.py to import those symbols from the module that actually implements
them; refer to the symbol names deobfuscate_email, _OBFUSCATED_UNDERSCORE_RE,
and _OBFUSCATED_WORD_RE when making the change.

In `@packages/tomd/src/tomd/PDF_ARCH.md`:
- Around line 219-226: The subsection currently uses bold text "Fuzzy label
recovery" which doesn't create an anchor for the existing link to
"#fuzzy-label-recovery"; replace the bolded line with a proper Markdown heading
(for example "### Fuzzy label recovery" or "## Fuzzy label recovery") so a
stable anchor is generated, keep the rest of the paragraph unchanged, and ensure
the heading text exactly matches the link target (fuzzy-label-recovery) so
references to `_is_label_line`, `_FUZZY_LABEL_TARGETS`, and `fuzzy_match_label`
resolve correctly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0443aed-68cd-478f-a5b6-a04dbc12513e

📥 Commits

Reviewing files that changed from the base of the PR and between 574187f and 5101500.

📒 Files selected for processing (4)
  • packages/tomd/src/tomd/PDF_ARCH.md
  • packages/tomd/src/tomd/lib/html/extract.py
  • packages/tomd/src/tomd/lib/pdf/wg21.py
  • packages/tomd/tests/test_wg21.py

Comment thread packages/tomd/src/tomd/lib/html/extract.py
Comment thread packages/tomd/src/tomd/lib/html/extract.py
Comment on lines 439 to +453
 def _match_field(label: str) -> str | None:
-    """Map a metadata label to its canonical field name, or None if unrecognized."""
+    """Map a metadata label to its canonical field name, or None if unrecognized.
+
+    Two-stage: exact synonym lookup first, then fuzzy fallback via
+    ``fuzzy_match_label`` from ``similarity.py``.
+    """
     norm = _normalize_label(label)
     for field, synonyms in _FIELD_SYNONYMS.items():
         if norm in synonyms:
             return field
+    # Fuzzy fallback when no exact synonym matched.
+    fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
+    if fuzzy_hit is not None:
+        _log.info("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])
+        return _ALL_SYNONYMS[fuzzy_hit]

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire _match_field() into the other HTML extractors too.

This helper now does the fuzzy recovery, but the hand-written and generic table paths still inspect raw substrings on label directly. As written, typo-tolerant matching only applies to the WG21 <dl> path, so table-based HTML metadata can still miss typoed document/date/audience/reply-to labels.

Based on learnings: "Use shared label matcher (_FIELD_SYNONYMS + _match_field) across HTML extractors with exact synonym lookup first, then fuzzy fallback via similarity.fuzzy_match_label for metadata extraction".
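
A sketch of what routing a table-style extractor through one shared matcher might look like. The synonym table, `normalize_label`, `match_field`, and `extract_table_rows` are hypothetical; the real _FIELD_SYNONYMS contents and the table extractor are not shown in this thread.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; the real _FIELD_SYNONYMS is not shown here.
FIELD_SYNONYMS = {
    "document": {"document number", "doc no", "document"},
    "date": {"date"},
    "audience": {"audience"},
    "reply_to": {"reply-to", "reply to", "author", "authors"},
}
# Inverted view: synonym -> canonical field.
ALL_SYNONYMS = {syn: field for field, syns in FIELD_SYNONYMS.items() for syn in syns}


def normalize_label(label):
    return label.strip().rstrip(":").strip().lower()


def match_field(label, threshold=0.82):
    norm = normalize_label(label)
    if norm in ALL_SYNONYMS:  # stage 1: exact synonym
        return ALL_SYNONYMS[norm]
    best, best_score = None, 0.0  # stage 2: fuzzy fallback
    for syn in ALL_SYNONYMS:
        score = max(SequenceMatcher(None, norm, syn).ratio(),
                    SequenceMatcher(None, syn, norm).ratio())
        if score > best_score:
            best, best_score = syn, score
    return ALL_SYNONYMS[best] if best_score >= threshold else None


def extract_table_rows(rows):
    """Route every (label, value) row through the shared matcher."""
    meta = {}
    for label, value in rows:
        field = match_field(label)
        if field is not None:
            meta.setdefault(field, value)  # first occurrence wins
    return meta
```

Because every extractor calls the same function, a typo such as "Documnet Number:" resolves to the same canonical field regardless of which HTML layout it appears in.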

Comment thread packages/tomd/src/tomd/lib/pdf/wg21.py
Comment on lines +219 to +226
**Fuzzy label recovery**

- Label detection uses a two-stage strategy: first an exact regex match via `_LABEL_RE`, then a fuzzy fallback via `_is_label_line` which delegates to `similarity.fuzzy_match_label`.
- The fuzzy stage compares the candidate text against `_FUZZY_LABEL_TARGETS` (a frozen set of canonical labels: "Document Number", "Date", "Project", "Reply to", "Audience") using symmetric `SequenceMatcher` (max of both argument orders) with a threshold of 0.82.
- This recovers metadata from PDFs with typos in label text (e.g. "Repy-to", "Documnet Number") that the exact regex cannot match.
- All call sites that previously checked `_LABEL_RE.match()` directly now call `_is_label_line()` for consistent exact-then-fuzzy behavior.

**Sources:** `_is_label_line`, `_FUZZY_LABEL_TARGETS` in [`lib/pdf/wg21.py`](lib/pdf/wg21.py); `fuzzy_match_label` in [`lib/similarity.py`](lib/similarity.py).

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use a real heading for this subsection.

Line 438 now links to #fuzzy-label-recovery, but Lines 219-226 only introduce bold text, so no anchor is generated and the appendix link stays broken.

Suggested fix
-**Fuzzy label recovery**
+### Fuzzy label recovery

extract.py and wg21.py import deobfuscate_email, _OBFUSCATED_WORD_RE,
and _OBFUSCATED_UNDERSCORE_RE from tomd.lib, but the definition in
lib/__init__.py was not included in the original commit. This caused
ImportError on CI while passing locally (uncommitted file present in
working directory).

Also extends parse_author_lines to deobfuscate "user at domain dot com"
and "user_at_domain.com" anti-spam patterns before treating a line as a
bare name.

Docs:
- CLAUDE.md: document deobfuscate_email and _extract_plaintext_authors
  in file map entries for lib/__init__.py and lib/html/extract.py
- PDF_ARCH.md: add deobfuscation fallback to email enrichment section
- HTML_ARCH.md: document _extract_plaintext_authors fallback and
  deobfuscation in _collect_metadata_emails bootstrap path

Tests (test_html_extract.py):
- TestDeobfuscateEmail: 12 cases covering word-based ("at"/"dot"),
  underscore-based (_at_), rejection of prose and valid emails
- TestExtractPlaintextAuthors: 4 cases for address tags, plain-text
  emails, td fallback, empty containers
- TestHandwrittenObfuscatedReplyTo: integration test reproducing
  p2285r1 pattern (obfuscated emails in address tags)
@sabriguenes sabriguenes force-pushed the sg/fuzzy-label-matching branch from a4e606e to f1c4bba Compare May 1, 2026 15:29
Collaborator

@gregjkal gregjkal left a comment

Changes per my earlier review look good. Four new minor comments on the new deobfuscation code.

from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE
um = _OBFUSCATED_UNDERSCORE_RE.search(lt)
wm = _OBFUSCATED_WORD_RE.search(lt)
match_start = (um or wm).start()
Collaborator

Fragile: deobfuscate_email already found the match span internally and discarded it; this code re-runs both regexes to recover it. If the underscore family matches but fails EMAIL_RE.fullmatch and the word family is what actually succeeded, (um or wm).start() returns the wrong offset and the name slice is off. Same pattern duplicated in lib/__init__.py:497 and lib/html/extract.py:84. Maybe have deobfuscate_email return (email, span) so callers can slice without re-matching.
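
A sketch of the suggested `(email, span)` return shape. `EMAIL_RE` is the pattern quoted in this review; `UNDERSCORE_RE`, `WORD_RE`, and `split_name_and_email` are hypothetical stand-ins, since the real _OBFUSCATED_* regexes are not shown here. The sketch does not attempt to reject ordinary prose that happens to contain "at" and "dot".

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")  # as quoted in this review

# Hypothetical patterns; the real _OBFUSCATED_* regexes may differ.
UNDERSCORE_RE = re.compile(r"([A-Za-z0-9.+-]+)_at_([\w.-]+)")
WORD_RE = re.compile(r"([\w.+-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)", re.IGNORECASE)


def deobfuscate_email(text):
    """Return (email, (start, end)) for the first pattern that validates, else None."""
    for pattern in (UNDERSCORE_RE, WORD_RE):
        m = pattern.search(text)
        if m is None:
            continue
        domain = re.sub(r"\s+dot\s+", ".", m.group(2), flags=re.IGNORECASE)
        email = f"{m.group(1)}@{domain}"
        if EMAIL_RE.fullmatch(email):
            # The span comes from the pattern that actually produced the email.
            return email, m.span()
    return None


def split_name_and_email(text):
    """Slice the name using the returned span instead of re-running the regexes."""
    hit = deobfuscate_email(text)
    if hit is None:
        return None
    email, (start, _end) = hit
    return text[:start].strip(), email


# Underscore pattern matches "foo_at_bar" but fails validation; word pattern succeeds.
tricky = "see foo_at_bar Jane Doe jane at example dot com"
```

On the `tricky` string, re-matching with `(um or wm).start()` would pick the underscore match's offset, while the returned span points at the text that was actually converted.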


EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")

import logging as _logging
Collaborator

Belongs with top-of-file imports.

deob = deobfuscate_email(lt)
if deob:
# Name is everything before the obfuscated email
from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE
Collaborator

Belongs with top-of-file imports.

if fuzzy_hit is not None:
_log.info(
"Fuzzy label match: %r -> %r (target %r)",
candidate, fuzzy_hit, fuzzy_hit,
Collaborator

Same fuzzy_hit item twice. HTML side passes _ALL_SYNONYMS[fuzzy_hit] (canonical field). Maybe that's what you meant here too?
