Add fuzzy label matching for PDF/HTML metadata extraction #66
sabriguenes wants to merge 5 commits into cppalliance:main
Conversation
Metadata labels in WG21 papers sometimes contain typos ("Repy-to",
"Documnet Number", "Auther") that exact regex cannot match, causing
metadata loss during conversion. This changeset adds typo-tolerant
label recovery across both PDF and HTML extraction paths.
Core (tomd):
- similarity.py: add _symmetric_similarity (max of both SM argument
orders) and fuzzy_match_label (best-match against known label set,
threshold 0.82)
- wg21.py: replace all _LABEL_RE.match() calls with _is_label_line(),
a two-stage wrapper (exact regex first, fuzzy fallback second) using
_FUZZY_LABEL_TARGETS frozenset
- extract.py: extend _match_field with fuzzy fallback via _ALL_SYNONYMS
inverted dict; warn on fuzzy matches
- cleanup.py: update import from _LABEL_RE to _is_label_line for
consistency
CLI (paperlint):
- logutil.py: add rich_console_handler context manager that temporarily
swaps the console logger to a RichHandler tied to a specific Console
instance, so log output does not collide with Rich progress bars
- progress.py: change Console() to Console(stderr=True); wrap progress
context with rich_console_handler for coordinated output
Tests:
- test_similarity.py: 6 new tests for fuzzy_match_label + symmetry
- test_html_extract.py: 5 new tests for HTML fuzzy label fallback
Docs:
- PDF_ARCH.md: new "Fuzzy label recovery" subsection
- HTML_ARCH.md: document fuzzy fallback in "Shared label mapping"
- CLAUDE.md (tomd): update file map entries for similarity.py and
extract.py
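As a rough sketch of what the similarity helpers could look like (illustrative only, reconstructed from the bullets above and the review walkthrough; the real `similarity.py` may differ in normalization details and in how the length guard is expressed):

```python
from difflib import SequenceMatcher

_FUZZY_THRESHOLD = 0.82

def _symmetric_similarity(a: str, b: str) -> float:
    """SequenceMatcher.ratio() is order-sensitive; take the max of both orders."""
    return max(
        SequenceMatcher(None, a, b).ratio(),
        SequenceMatcher(None, b, a).ratio(),
    )

def fuzzy_match_label(candidate, known):
    """Return the best-matching known label, or None if nothing clears the threshold."""
    cand = candidate.lower().strip()
    best_label, best_score = None, 0.0
    for label in known:
        m, n = len(cand), len(label)
        # Length guard: ratio() is bounded by 2*min/(m+n), so very different
        # lengths can never clear the threshold -- skip the comparison.
        if (m + n) == 0 or 2 * min(m, n) / (m + n) < _FUZZY_THRESHOLD:
            continue
        score = _symmetric_similarity(cand, label.lower())
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= _FUZZY_THRESHOLD else None
```

With this shape, `fuzzy_match_label("repy-to", {"reply-to", "date", "audience"})` recovers `"reply-to"`, while unrelated text stays below the 0.82 cutoff and returns `None`.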
📝 Walkthrough

Adds fuzzy label matching (symmetric SequenceMatcher, 0.82 threshold) and integrates it into the PDF/HTML metadata extraction paths; supplies new tests. Adds a context-managed Rich console handler and uses it from paperlint progress callbacks to enable Rich output on stderr during progress sessions.
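The "context-managed Rich console handler" mentioned here boils down to temporarily swapping a logger's handlers for the duration of a progress session. A minimal stdlib-only sketch of that pattern (the real `logutil.py` presumably installs a `rich.logging.RichHandler` bound to the stderr `Console`; the names below are illustrative):

```python
import io
import logging
from contextlib import contextmanager

@contextmanager
def swapped_handler(logger: logging.Logger, handler: logging.Handler):
    """Temporarily make `handler` the logger's only handler, restoring the
    originals on exit -- the same shape rich_console_handler would use with
    a RichHandler tied to a specific Console instance."""
    old = logger.handlers[:]
    logger.handlers = [handler]
    try:
        yield
    finally:
        logger.handlers = old

# Demo: log lines land on the temporary handler while the context is active.
log = logging.getLogger("paperlint.demo")
log.setLevel(logging.INFO)
buf = io.StringIO()
with swapped_handler(log, logging.StreamHandler(buf)):
    log.info("inside context")
print(buf.getvalue().strip())
```

Because the swap is scoped by the context manager, progress-bar output and log output never interleave on the same stream mid-render.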
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Extractor as Document Extractor
    participant Fuzzy as fuzzy_match_label()
    participant Known as Known Labels
    participant Seq as SequenceMatcher
    Extractor->>Fuzzy: submit candidate label text
    Fuzzy->>Known: iterate known label set
    loop for each known label
        Fuzzy->>Seq: compute ratio(candidate, known)
        Fuzzy->>Seq: compute ratio(known, candidate)
        Seq-->>Fuzzy: return both ratios
        Fuzzy->>Fuzzy: take max (symmetric score)
        Fuzzy->>Fuzzy: apply max-length guard, compare >= 0.82
    end
    Fuzzy->>Extractor: return best canonical label or None
```

Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning).
🧹 Nitpick comments (1)
packages/tomd/src/tomd/PDF_ARCH.md (1)
219-227: 💤 Low value

LGTM — Clear and accurate documentation of the fuzzy label recovery feature.

The two-stage approach (exact regex first, fuzzy fallback second) is well explained, with appropriate technical details (symmetric SequenceMatcher, 0.82 threshold) and illustrative examples of recoverable typos. The source citations align with the code changes in wg21.py, cleanup.py, and similarity.py.

Optional: Add similarity.py to the appendix table

For improved discoverability, consider adding lib/similarity.py to the appendix table (lines 423-438), since it now provides a shared fuzzy-matching capability used by both PDF and HTML extraction paths:

```diff
 | [`lib/pdf/wg21.py`](lib/pdf/wg21.py) | [Page zero metadata](#page-zero-metadata) |
+| [`lib/similarity.py`](lib/similarity.py) | [Page zero metadata](#page-zero-metadata) fuzzy matching |
 | [`lib/pdf/table.py`](lib/pdf/table.py) | [Tables](#tables) |
```

However, since the table currently focuses on lib/pdf/* modules and the Sources line already cites similarity.py, this is purely a discoverability enhancement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/tomd/src/tomd/PDF_ARCH.md` around lines 219 - 227, Update the appendix table in PDF_ARCH.md to include lib/similarity.py for discoverability: add an entry referencing similarity.py (the module that contains fuzzy_match_label) to the existing appendix table section so readers can find the shared fuzzy-matching capability used by wg21.py and cleanup.py; ensure the new row follows the table format used for other lib/pdf/* modules and include a short description like "Shared fuzzy matching utilities (fuzzy_match_label)" and a source pointer to lib/similarity.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 790e2eca-589e-4060-b4fe-ec9639f9d309
📒 Files selected for processing (11)
- packages/paperlint/src/paperlint/logutil.py
- packages/paperlint/src/paperlint/progress.py
- packages/tomd/src/tomd/CLAUDE.md
- packages/tomd/src/tomd/HTML_ARCH.md
- packages/tomd/src/tomd/PDF_ARCH.md
- packages/tomd/src/tomd/lib/html/extract.py
- packages/tomd/src/tomd/lib/pdf/cleanup.py
- packages/tomd/src/tomd/lib/pdf/wg21.py
- packages/tomd/src/tomd/lib/similarity.py
- packages/tomd/tests/test_html_extract.py
- packages/tomd/tests/test_similarity.py
gregjkal left a comment
Mostly small things. The one I'm least sure about but want to flag: I think the reply-to continuation pass gets skipped on the fuzzy path because the gate at line 307 substring-checks for "reply" in the literal block text, which a Repy-to: typo wouldn't satisfy.
```python
    re.IGNORECASE,
)
    # Fallback for PDF sources with typos in label text (e.g. "Repy-to", "Document Number")
    # that _LABEL_RE cannot match exactly.
```
| "document number", "doc no", "title", "date", "audience", "subgroup", | ||
| "reply-to", "reply to", "author", "authors", "editor", "editors", | ||
| "target", "project", "email", "emails", | ||
| }) |
This set and _LABEL_RE are now two sources of truth for the same vocabulary and could drift. e.g. _LABEL_RE has Doc\.?\s*No\.? but the set only has "doc no". Maybe a small unit test asserting every _LABEL_RE alternative has a fuzzy target, or derive one from the other?
```python
})

# Split "Label: Value" where the label portion is 2-20 chars before the colon.
_COLON_SPLIT_RE = re.compile(r"^([^:]{2,20}):\s*(.*)", re.DOTALL)
```
re.DOTALL looks unnecessary here: input is single-line _clean(line.text) output and [^:]{2,20} already excludes newlines. Not harmful, just suggests intent that doesn't apply.
| "Fuzzy PDF label match: %r -> %r", | ||
| candidate, fuzzy_hit, | ||
| ) | ||
| label = fuzzy_hit.title() if " " not in fuzzy_hit else fuzzy_hit |
I think this .title() is effectively a no-op: _store_field immediately calls label.lower().strip() and routes via substring checks. Could probably just pass fuzzy_hit through.
```diff
             break
         next_text = _strip_bullets(_clean(next_block.lines[0].text)) if next_block.lines else ""
-        if not next_text or _LABEL_RE.match(next_text):
+        if not next_text or _is_label_line(next_text):
```
Possible bug worth checking: the gate that decides whether to enter this continuation pass (around line 307) substring-checks the original block text for "reply", which a typo like Repy-to: wouldn't satisfy. If so, fuzzy-matched reply-to labels would extract only the first line and skip continuation, i.e. the typo case this PR is trying to fix. A synthetic-block test would confirm.
```python
        fuzzy_hit = fuzzy_match_label(candidate, _FUZZY_LABEL_TARGETS)
        if fuzzy_hit is not None:
            _log.warning(
                "Fuzzy PDF label match: %r -> %r",
```
Format nit: the HTML side logs "Fuzzy label match: %r -> %r (field %r)".
Also, again recommend info rather than warning here.
```python
    # Fuzzy fallback when no exact synonym matched.
    fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
    if fuzzy_hit is not None:
        _log.warning("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])
```
Probably should be info rather than warning. Fuzzy hit isn't really an error condition, and could be noisy on batch runs.
Bug fix:
- wg21.py: replace raw-text "reply" substring check at continuation gate with has_reply_label flag set from the resolved label name, so fuzzy-matched labels like "Repy-to" still trigger multi-block reply-to absorption

Nits (gregjkal):
- wg21.py: fix over-indented comment above _FUZZY_LABEL_TARGETS
- wg21.py: remove unnecessary re.DOTALL from _COLON_SPLIT_RE
- wg21.py: drop .title() no-op on fuzzy_hit (immediately lowercased by _store_field)
- wg21.py: align log format with HTML side, change warning to info
- extract.py: change fuzzy match log from warning to info

Tests:
- test_wg21.py: add vocabulary drift guard (MUST_MATCH/MUST_REJECT lists asserted against both _LABEL_RE and fuzzy_match_label)
- test_wg21.py: add fuzzy reply-to continuation test with synthetic "Repy-to:" block followed by continuation block
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/tomd/src/tomd/lib/html/extract.py`:
- Around line 92-94: The fallback that adds bare-text lines from an <address>
block currently treats any non-email line as a reply-to entry; update the branch
in _extract_handwritten_metadata() so it only adds author-like lines: implement
and call a small helper (e.g., _is_author_line(text)) that returns true for
plausible person names (for example, two or more words with initial capitals, no
date/affiliation keywords, and no digits) and/or matches an AUTHOR_RE, and only
then call _add(cleaned); alternatively restrict the fallback to known author
rows rather than all leftover lines to avoid adding
document/date/audience/affiliation entries.
- Around line 439-453: The table-based and hand-written HTML metadata extractors
should call the shared _match_field(label) instead of inspecting raw substrings
on the label; update those extractor code paths to call _match_field(label),
check its return (canonical field or None), and use that canonical name (from
_FIELD_SYNONYMS/_ALL_SYNONYMS via fuzzy_match_label) when assigning metadata
like document/date/audience/reply-to so typo-tolerant matching applies across
all HTML extractors.
- Around line 8-12: The current relative import in extract.py fails because
symbols like deobfuscate_email, parse_author_lines, DATE_RE, DOC_NUM_RE,
EMAIL_RE, _OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE are not exported
from tomd.lib; fix by either (A) changing the import in tomd/lib/html/extract.py
to import those symbols directly from the module that defines them (locate the
module that defines deobfuscate_email and parse_author_lines and import from it
instead of using "from .. import ..."), or (B) re-export the missing symbols
from packages/tomd/src/tomd/lib/__init__.py so "from .. import DATE_RE,
DOC_NUM_RE, EMAIL_RE, parse_author_lines, deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE" works; update tests/imports
accordingly and ensure fuzzy_match_label import remains from ..similarity.
In `@packages/tomd/src/tomd/lib/pdf/wg21.py`:
- Around line 11-12: wg21.py and html/extract.py expect deobfuscate_email,
_OBFUSCATED_UNDERSCORE_RE, and _OBFUSCATED_WORD_RE to be available from the
package lib namespace but they aren't exported; fix by either (A) defining and
exporting these three symbols in the lib package __init__ (add definitions for
_OBFUSCATED_UNDERSCORE_RE and _OBFUSCATED_WORD_RE and the deobfuscate_email
function, then add them to __all__) or (B) change the imports in wg21.py and
html/extract.py to import those symbols from the module that actually implements
them; refer to the symbol names deobfuscate_email, _OBFUSCATED_UNDERSCORE_RE,
and _OBFUSCATED_WORD_RE when making the change.
In `@packages/tomd/src/tomd/PDF_ARCH.md`:
- Around line 219-226: The subsection currently uses bold text "Fuzzy label
recovery" which doesn't create an anchor for the existing link to
"#fuzzy-label-recovery"; replace the bolded line with a proper Markdown heading
(for example "### Fuzzy label recovery" or "## Fuzzy label recovery") so a
stable anchor is generated, keep the rest of the paragraph unchanged, and ensure
the heading text exactly matches the link target (fuzzy-label-recovery) so
references to `_is_label_line`, `_FUZZY_LABEL_TARGETS`, and `fuzzy_match_label`
resolve correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0443aed-68cd-478f-a5b6-a04dbc12513e
📒 Files selected for processing (4)
- packages/tomd/src/tomd/PDF_ARCH.md
- packages/tomd/src/tomd/lib/html/extract.py
- packages/tomd/src/tomd/lib/pdf/wg21.py
- packages/tomd/tests/test_wg21.py
```diff
 def _match_field(label: str) -> str | None:
-    """Map a metadata label to its canonical field name, or None if unrecognized."""
+    """Map a metadata label to its canonical field name, or None if unrecognized.
+
+    Two-stage: exact synonym lookup first, then fuzzy fallback via
+    ``fuzzy_match_label`` from ``similarity.py``.
+    """
     norm = _normalize_label(label)
     for field, synonyms in _FIELD_SYNONYMS.items():
         if norm in synonyms:
             return field
+    # Fuzzy fallback when no exact synonym matched.
+    fuzzy_hit = fuzzy_match_label(norm, _ALL_SYNONYMS.keys())
+    if fuzzy_hit is not None:
+        _log.info("Fuzzy label match: %r -> %r (field %r)", norm, fuzzy_hit, _ALL_SYNONYMS[fuzzy_hit])
+        return _ALL_SYNONYMS[fuzzy_hit]
```
Wire _match_field() into the other HTML extractors too.
This helper now does the fuzzy recovery, but the hand-written and generic table paths still inspect raw substrings on label directly. As written, typo-tolerant matching only applies to the WG21 <dl> path, so table-based HTML metadata can still miss typoed document/date/audience/reply-to labels.
Based on learnings: "Use shared label matcher (_FIELD_SYNONYMS + _match_field) across HTML extractors with exact synonym lookup first, then fuzzy fallback via similarity.fuzzy_match_label for metadata extraction".
```markdown
**Fuzzy label recovery**

- Label detection uses a two-stage strategy: first an exact regex match via `_LABEL_RE`, then a fuzzy fallback via `_is_label_line` which delegates to `similarity.fuzzy_match_label`.
- The fuzzy stage compares the candidate text against `_FUZZY_LABEL_TARGETS` (a frozen set of canonical labels: "Document Number", "Date", "Project", "Reply to", "Audience") using symmetric `SequenceMatcher` (max of both argument orders) with a threshold of 0.82.
- This recovers metadata from PDFs with typos in label text (e.g. "Repy-to", "Documnet Number") that the exact regex cannot match.
- All call sites that previously checked `_LABEL_RE.match()` directly now call `_is_label_line()` for consistent exact-then-fuzzy behavior.

**Sources:** `_is_label_line`, `_FUZZY_LABEL_TARGETS` in [`lib/pdf/wg21.py`](lib/pdf/wg21.py); `fuzzy_match_label` in [`lib/similarity.py`](lib/similarity.py).
```
Use a real heading for this subsection.
Line 438 now links to #fuzzy-label-recovery, but Lines 219-226 only introduce bold text, so no anchor is generated and the appendix link stays broken.
Suggested fix

```diff
-**Fuzzy label recovery**
+### Fuzzy label recovery
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```markdown
### Fuzzy label recovery

- Label detection uses a two-stage strategy: first an exact regex match via `_LABEL_RE`, then a fuzzy fallback via `_is_label_line` which delegates to `similarity.fuzzy_match_label`.
- The fuzzy stage compares the candidate text against `_FUZZY_LABEL_TARGETS` (a frozen set of canonical labels: "Document Number", "Date", "Project", "Reply to", "Audience") using symmetric `SequenceMatcher` (max of both argument orders) with a threshold of 0.82.
- This recovers metadata from PDFs with typos in label text (e.g. "Repy-to", "Documnet Number") that the exact regex cannot match.
- All call sites that previously checked `_LABEL_RE.match()` directly now call `_is_label_line()` for consistent exact-then-fuzzy behavior.

**Sources:** `_is_label_line`, `_FUZZY_LABEL_TARGETS` in [`lib/pdf/wg21.py`](lib/pdf/wg21.py); `fuzzy_match_label` in [`lib/similarity.py`](lib/similarity.py).
```
extract.py and wg21.py import deobfuscate_email, _OBFUSCATED_WORD_RE, and _OBFUSCATED_UNDERSCORE_RE from tomd.lib, but the definition in lib/__init__.py was not included in the original commit. This caused an ImportError on CI while passing locally (an uncommitted file was present in the working directory).

Also extends parse_author_lines to deobfuscate "user at domain dot com" and "user_at_domain.com" anti-spam patterns before treating a line as a bare name.
Docs:
- CLAUDE.md: document deobfuscate_email and _extract_plaintext_authors
in file map entries for lib/__init__.py and lib/html/extract.py
- PDF_ARCH.md: add deobfuscation fallback to email enrichment section
- HTML_ARCH.md: document _extract_plaintext_authors fallback and
deobfuscation in _collect_metadata_emails bootstrap path
Tests (test_html_extract.py):
- TestDeobfuscateEmail: 12 cases covering word-based ("at"/"dot"),
underscore-based (_at_), rejection of prose and valid emails
- TestExtractPlaintextAuthors: 4 cases for address tags, plain-text
emails, td fallback, empty containers
- TestHandwrittenObfuscatedReplyTo: integration test reproducing
p2285r1 pattern (obfuscated emails in address tags)
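The two anti-spam shapes named in this commit could be handled along these lines (an illustrative sketch; the real `_OBFUSCATED_WORD_RE`/`_OBFUSCATED_UNDERSCORE_RE` patterns and `deobfuscate_email` in `lib/__init__.py` may differ):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")

# "user_at_domain.com" -> user@domain.com
_OBFUSCATED_UNDERSCORE_RE = re.compile(r"([\w.+-]+)_at_([\w-]+(?:\.[\w-]+)+)")
# "user at domain dot com" -> user@domain.com
_OBFUSCATED_WORD_RE = re.compile(
    r"([\w.+-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)", re.IGNORECASE)

def deobfuscate_email(text):
    """Recover an email from common anti-spam spellings; None if nothing validates."""
    m = _OBFUSCATED_UNDERSCORE_RE.search(text)
    if m:
        candidate = f"{m.group(1)}@{m.group(2)}"
        if EMAIL_RE.fullmatch(candidate):
            return candidate
    m = _OBFUSCATED_WORD_RE.search(text)
    if m:
        domain = re.sub(r"\s+dot\s+", ".", m.group(2), flags=re.IGNORECASE)
        candidate = f"{m.group(1)}@{domain}"
        if EMAIL_RE.fullmatch(candidate):
            return candidate
    return None
```

Validating the rebuilt candidate with `EMAIL_RE.fullmatch` is what keeps prose like "we met at noon" from being misread as an address.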
a4e606e to f1c4bba (Compare)
```python
    from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE
    um = _OBFUSCATED_UNDERSCORE_RE.search(lt)
    wm = _OBFUSCATED_WORD_RE.search(lt)
    match_start = (um or wm).start()
```
Fragile: deobfuscate_email already found the match span internally and discarded it; this code re-runs both regexes to recover it. If the underscore family matches but fails EMAIL_RE.fullmatch and the word family is what actually succeeded, (um or wm).start() returns the wrong offset and the name slice is off. Same pattern duplicated in lib/__init__.py:497 and lib/html/extract.py:84. Maybe have deobfuscate_email return (email, span) so callers can slice without re-matching.
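The reviewer's suggestion, returning the winning span alongside the email so callers never re-run the regexes, might look like this (a sketch built on the same assumed regex shapes, not the project's actual code):

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")
_OBFUSCATED_UNDERSCORE_RE = re.compile(r"([\w.+-]+)_at_([\w-]+(?:\.[\w-]+)+)")
_OBFUSCATED_WORD_RE = re.compile(
    r"([\w.+-]+)\s+at\s+([\w-]+(?:\s+dot\s+[\w-]+)+)", re.IGNORECASE)

def deobfuscate_email(text):
    """Return (email, (start, end)) for the match that actually validated,
    or None -- so callers slice the author name from the winning match."""
    for pattern, rebuild in (
        (_OBFUSCATED_UNDERSCORE_RE, lambda m: f"{m.group(1)}@{m.group(2)}"),
        (_OBFUSCATED_WORD_RE, lambda m: f"{m.group(1)}@"
         + re.sub(r"\s+dot\s+", ".", m.group(2), flags=re.IGNORECASE)),
    ):
        m = pattern.search(text)
        if m:
            candidate = rebuild(m)
            if EMAIL_RE.fullmatch(candidate):
                return candidate, m.span()
            # fall through: this family matched but did not validate
    return None

line = "Jane Doe jane at example dot com"
email, (start, _end) = deobfuscate_email(line)
name = line[:start].strip()  # sliced from the winning match's span, no re-match
```

Because the span comes from the same match object that passed `EMAIL_RE.fullmatch`, the failure mode the reviewer describes (one family matching, the other validating) cannot produce a wrong offset.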
```python
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")

import logging as _logging
```
Belongs with top-of-file imports.
```python
    deob = deobfuscate_email(lt)
    if deob:
        # Name is everything before the obfuscated email
        from .. import _OBFUSCATED_UNDERSCORE_RE, _OBFUSCATED_WORD_RE
```
Belongs with top-of-file imports.
```python
    if fuzzy_hit is not None:
        _log.info(
            "Fuzzy label match: %r -> %r (target %r)",
            candidate, fuzzy_hit, fuzzy_hit,
```
Same fuzzy_hit item twice. HTML side passes _ALL_SYNONYMS[fuzzy_hit] (canonical field). Maybe that's what you meant here too?