
Resolve merge conflicts #65

Open
vinniefalco wants to merge 5 commits into cppalliance:main from vinniefalco:main

Conversation


@vinniefalco vinniefalco commented Apr 29, 2026

Note

Medium Risk
Touches core PDF/HTML conversion and front-matter canonicalization, so output diffs could be widespread and edge-case regressions are possible. Changes are mostly additive/sanitizing with new tests and fixtures to constrain behavior.

Overview
Standardizes tomd’s output front matter by adding revision, defaulting intent: info, always quoting title, and enforcing a fixed key order; mailing metadata fallback can now override document (e.g. replacing embedded D numbers with published P/N IDs) and reorders YAML after injection.
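The canonicalization pass described above can be sketched as follows. This is an illustrative stand-in, not tomd's actual API: the function name, and the exact key list, are assumptions based on the summary.

```python
# Hypothetical sketch of front-matter canonicalization: quoted title,
# intent defaulting, and a fixed key order. The real tomd implementation
# may differ in names and edge-case handling.
CANONICAL_ORDER = ["title", "document", "revision", "date", "intent", "audience", "reply-to"]

def canonicalize_front_matter(fields: dict) -> dict:
    out = dict(fields)
    # Always double-quote the title so YAML parsing is unambiguous.
    title = str(out.get("title", "")).strip().strip('"')
    out["title"] = f'"{title}"'
    # Default intent when the source document carries none.
    out.setdefault("intent", "info")
    # Emit known keys in the fixed order; unknown keys keep their
    # original relative order at the end.
    known = {k: out[k] for k in CANONICAL_ORDER if k in out}
    unknown = {k: v for k, v in out.items() if k not in CANONICAL_ORDER}
    return {**known, **unknown}
```

Under this scheme, a dict arriving as `{date, title, document}` serializes as `title, document, date, intent`, matching the fixed-order contract.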

Improves conversion quality across HTML/PDF: extracts missing document/title/date more aggressively (including PDF info dict), enriches reply-to via mailto/email correlation, overrides revision from filename when needed, strips duplicate paragraphs and redundant body metadata (including leading H1 and metadata tables), and expands HTML table handling (rowspan/colspan denormalization, mangled/nested table recovery, code-in-table extraction) while tagging lossy conversions with <!-- tomd:lossy-table --> and reporting them in QA.

Updates docs/tests and adds new gold-standard fixtures/tests to lock in the canonical front matter + structure expectations; extends .gitignore for local QA artifacts.

Reviewed by Cursor Bugbot for commit b9aeb0f.

Summary by CodeRabbit

  • New Features

    • Added lossy table detection and reporting in quality assurance.
    • Enhanced metadata extraction with automatic revision and intent inference.
    • Implemented paragraph deduplication and redundant metadata removal.
  • Bug Fixes

    • Improved handling of complex tables with rowspan and colspan attributes.
    • Better title and author extraction from PDFs and HTML documents.
  • Documentation

    • Updated conversion specifications for YAML front matter and table handling.

0101sg and others added 5 commits April 29, 2026 13:11
…dation

Add two new quality checks to qa.py: mojibake detection via ftfy.badness() with U+FFFD counting, and heading level skip detection following markdownlint MD001 semantics. Both penalties are capped (20 and 15 points respectively) to prevent single-issue score domination.
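The two checks can be sketched roughly as follows. This is a toy stand-in, not the actual qa.py: the real mojibake check uses ftfy's badness scoring, which this sketch approximates with a bare U+FFFD count, and the penalty weights here are assumptions.

```python
import re

def qa_heading_and_mojibake(md: str) -> dict:
    """Toy sketch of the two QA checks described above: count U+FFFD
    replacement characters, and detect heading-level skips following
    markdownlint MD001 semantics (a heading may go at most one level
    deeper than the previous heading). Ignores code fences for brevity."""
    fffd = md.count("\ufffd")
    skips = 0
    prev_level = 0
    for line in md.splitlines():
        m = re.match(r"(#{1,6})\s", line)
        if not m:
            continue
        level = len(m.group(1))
        if prev_level and level > prev_level + 1:
            skips += 1
        prev_level = level
    # Cap each penalty (20 and 15 points) so a single pervasive issue
    # cannot dominate the overall score.
    penalty = min(20, fffd * 2) + min(15, skips * 5)
    return {"fffd_count": fffd, "heading_skips": skips, "penalty": penalty}
```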

37 tests cover the new checks including edge cases for C++ templates, valid Unicode, real corpus patterns, and penalty cap boundaries.

Includes reference markdown fixtures (D4036, P2583R3) for future wording section validation, QA-001/QA-002 plan files, and the QA-001 implementation report with corpus verification results (270 papers, 5 real issues found).

Made-with: Cursor
Research-only plan covering the full tomd architecture (39 PDF techniques, 18 HTML techniques), exact corpus quality scores from qa-report.json (100 papers, 15 below score 90), YAML front-matter contract from DESIGN.md, 3 known crasher bugs from code review, downstream quote-verification risk from paperlint, complete documentation inventory (5 CLAUDE.md files, 12 tomd docs, 499 tests), and cross-references to QA-002 mpark/wg21 framework findings.

Made-with: Cursor
HTML reply-to extraction:
- Schultke generator: parse dl elements for metadata
- WG21 generator: iterative continuation-row parsing for multi-dd reply-to
- Field synonyms: map co-author/co-authors to reply-to
- Enrichment bootstrap: scan metadata region for mailto links when
  reply-to is empty (handles source typos like "Repy-to")
- dascandy/fiets generator: extract all authors from continuation rows

PDF reply-to extraction:
- Append instead of overwrite when multiple author-like labels exist
  (fixes p2000r5, p5000r0 where Author overwrote Reply-to)
- Handle separate Email: lines with name-email pairing (fixes p3427r3)
- Continuation block pairing: merge bare emails with preceding bare names
- Add _enrich_pdf_reply_to post-pass as safety net for missed emails
- Smart dedup: _is_already_present prevents "Alice" duplicating
  "Alice <alice@example.com>"
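The smart-dedup step can be illustrated with a minimal sketch. The function below is an assumption based on the description above, not the actual `_is_already_present`: it treats a bare name as a duplicate of an existing "Name <email>" entry.

```python
import re

def is_already_present(candidate: str, entries: list[str]) -> bool:
    """Illustrative dedup check: a candidate matches an existing entry
    when the full strings match, or when the name parts match after
    stripping a trailing <email> portion."""
    def name_of(entry: str) -> str:
        # Drop a trailing "<...>" email part, then normalize.
        return re.sub(r"\s*<[^>]*>\s*$", "", entry).strip().lower()
    cand = candidate.strip().lower()
    return any(
        cand == e.strip().lower() or name_of(candidate) == name_of(e)
        for e in entries
    )
```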

Testing and verification:
- 12 new PDF extraction unit tests (test_pdf_wg21.py)
- 16 golden fixtures updated for improved extraction
- 270 papers audited: 0 failures, 213 pass, 57 warnings (all expected)
- Stratified spot-check of 19 papers: 18 pass, 1 warn (mailing override)

All changes within packages/tomd/ only.

Made-with: Cursor

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

This PR extends the TOMD Markdown conversion pipeline with lossy table detection, canonical YAML front-matter formatting (including title quoting and revision extraction), enhanced metadata enrichment from PDFs and HTML sources, paragraph deduplication, and body cleanup utilities. It adds QA metrics for lossy tables and introduces comprehensive test coverage validating gold-standard document structures and WG21 wording conventions.

Changes

Cohort / File(s) — Summary
Configuration & Ignore
.gitignore
Extended to exclude Cursor editor state (.cursor/), QA/audit-related files/directories (leading-underscore patterns, _qa016_* outputs, QA JSONs), local plans/reports folders, and development files (CONCURRENCY-TODO.md, STATUS.md).
Core Documentation
CLAUDE.md, packages/tomd/src/tomd/CLAUDE.md, packages/tomd/src/tomd/README.md, packages/tomd/src/tomd/lib/html/ARCHITECTURE.md
Documented YAML front-matter schema with deterministic field ordering, lossy table marker behavior, table rendering strategies (code-table extraction, flat reconstruction, denormalization for rowspan/colspan), and QA reporting of lossy tables without score penalty.
API & Metadata Fallback
packages/tomd/src/tomd/api.py
Implemented authoritative field override logic (currently for document), forced title field quoting, YAML reordering to canonical order, and suppression of no-op returns when overrides trigger changes.
Core Library Utilities
packages/tomd/src/tomd/lib/__init__.py
Added paragraph deduplication with thresholds, revision extraction from document IDs, title normalization and double-quoting, intent defaulting to "info", leading H1 stripping based on fuzzy title matching, redundant body metadata removal, and updated canonical FRONT_MATTER_ORDER including revision.
HTML Conversion Pipeline
packages/tomd/src/tomd/lib/html/__init__.py, packages/tomd/src/tomd/lib/html/extract.py, packages/tomd/src/tomd/lib/html/render.py
Enhanced metadata enrichment (document/revision/title parsing from filenames and rendered content), leading --- block removal, paragraph deduplication, redundant metadata stripping, and H1 removal. Extended table rendering with rowspan/colspan denormalization, flat reconstruction for complex tables, lossy marker emission, and improved row/cell handling.
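Rowspan/colspan denormalization, as summarized above, expands spanning cells into a rectangular grid so a plain pipe table can be emitted. A toy sketch, assuming cells arrive as (text, rowspan, colspan) tuples rather than the BeautifulSoup nodes the real renderer consumes:

```python
def denormalize(rows):
    """Expand rowspan/colspan cells into a rectangular grid by repeating
    cell text into every grid slot the cell covers. Toy version: assumes
    well-formed input with no column gaps."""
    grid = []
    carry = {}  # column index -> (text, rows still covered below)
    for row in rows:
        out = []
        cells = list(row)
        i = 0
        col = 0
        while i < len(cells) or col in carry:
            if col in carry:
                # This column is still covered by a rowspan from above.
                text, left = carry[col]
                out.append(text)
                if left > 1:
                    carry[col] = (text, left - 1)
                else:
                    del carry[col]
                col += 1
                continue
            text, rowspan, colspan = cells[i]
            i += 1
            for _ in range(colspan):
                out.append(text)
                if rowspan > 1:
                    carry[col] = (text, rowspan - 1)
                col += 1
        grid.append(out)
    return grid
```

For example, a 2x3 table whose first cell spans two rows and whose second cell spans two columns flattens into two full three-cell rows.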
HTML Metadata Extraction
packages/tomd/src/tomd/lib/html/extract.py
Added mailto parsing utilities, post-processing enrichment pass (_enrich_reply_to), deduplication/merging of author entries across generators, context-based name/email pairing, and metadata-region email scanning with fallback recovery.
PDF Conversion Pipeline
packages/tomd/src/tomd/lib/pdf/__init__.py
Added metadata enrichment from PDF info-dict and filename patterns, revision reconciliation from filenames, title detection from headings, reply-to derivation from PDF author field with page-0 email detection and inference.
PDF Structure & Cleanup
packages/tomd/src/tomd/lib/pdf/structure.py, packages/tomd/src/tomd/lib/pdf/cleanup.py
Enhanced title detection with multi-signal gating (known headings, contents headers), page-0-specific thresholds for metadata stripping to prevent over-removal near section markers.
PDF Emission & QA
packages/tomd/src/tomd/lib/pdf/emit.py, packages/tomd/src/tomd/lib/pdf/qa.py
Added post-processing to deduplicate paragraphs, remove redundant metadata, and strip leading H1s (applied before and after cleanup). Introduced lossy_table_count metric and conditional issue reporting for lossy tables without score adjustment.
PDF WG21 Extraction
packages/tomd/src/tomd/lib/pdf/wg21.py
Broadened label recognition for metadata fields, enhanced _store_field to merge author/email data into reply-to with deduplication, improved title selection heuristics, added fallback date and reply-to extraction passes with email/name pairing and deduplication.
Test Fixtures – Golden Standard
packages/tomd/tests/fixtures/golden/*.golden.md, packages/tomd/tests/fixtures/d4036-gold-standard.md
Updated golden fixture front-matter to include normalized metadata fields (revision from document ID, intent: info, quoted title, document field, and structured reply-to lists). Removed redundant H1 headings. Added new d4036 gold-standard fixture with full article and structured front-matter.
Test Suite – API & Fallback
packages/tomd/tests/test_emit.py, packages/tomd/tests/test_fallback.py
Updated emit tests to expect quoted title and intent: info defaults with revision derivation. Added comprehensive fallback tests for YAML reordering, field override behavior, and multiline key handling.
Test Suite – HTML Extraction & Rendering
packages/tomd/tests/test_html_extract.py, packages/tomd/tests/test_html_render.py
Expanded to validate mailto extraction, reply-to enrichment logic, name/email pairing, and lossy table marker presence/absence. Added tests for denormalized table rendering and complex table handling.
Test Suite – Gold Standard & QA
packages/tomd/tests/test_gold_standard.py, packages/tomd/tests/test_qa.py, packages/tomd/tests/test_pdf_wg21.py, packages/tomd/tests/test_wg21.py
Added structural validation for front-matter format, markdown conventions, heading hierarchy, and WG21 wording constructs. Added lossy table metric and scoring tests. Added PDF reply-to enrichment and WG21 numeric-line filtering tests.
Test Suite – Paper Extraction
packages/tomd/tests/test_paper_extract.py
Updated to expect quoted title values and canonical field ordering with revision.
Planning & Reports
plans/QA-001-extend-qa-scoring.md, plans/QA-002-mpark-wording-support.md, plans/QA-003-tomd-deep-analysis.md, reports/QA-001-extend-qa-scoring.md
Added planning documents specifying mojibake detection, heading-level skip detection, mpark wording support, and deep-analysis research. Added QA-001 completion report documenting implementation and corpus-level results.

Sequence Diagram

sequenceDiagram
    actor Input as PDF/HTML Input
    participant Conversion as tomd Converter
    participant Metadata as Metadata Extractor
    participant Render as Renderer & Cleanup
    participant QA as QA Metrics
    actor Output as Markdown + YAML

    Input->>Conversion: PDF/HTML document
    Conversion->>Metadata: Extract metadata (title, document, authors)
    Metadata->>Metadata: Enrich from filename/content (revision, reply-to)
    Metadata->>Metadata: Deduplicate/merge author entries
    Conversion->>Render: Render tables (detect rowspan/colspan)
    Render->>Render: Apply denormalization/flat reconstruction
    alt Lossy rendering path
        Render->>Render: Emit <!-- tomd:lossy-table --> marker
    end
    Render->>Render: Deduplicate paragraphs
    Render->>Render: Strip redundant H1 & body metadata
    Render->>Output: Canonical YAML front-matter + body
    Output->>QA: Scan for lossy table markers
    QA->>QA: Count lossy_table_count
    QA->>Output: Report issues (no score penalty)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop! The tables hop with markers bright,
Revision fields now in proper sight,
Metadata flows from PDF to YAML true,
While H1s vanish—cleaned up fresh and new,
Paragraphs deduplicate with glee,
The pipeline hops as it should be!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Docstring Coverage — ⚠️ Warning
Explanation: Docstring coverage is 55.21%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

Title check — ❓ Inconclusive
Explanation: The PR title 'Resolve merge conflicts' is vague and does not convey meaningful information about the actual changeset, which contains substantial functional improvements to tomd (lossy table detection, metadata enrichment, paragraph deduplication, etc.) and new documentation/tests.
Resolution: Provide a more descriptive title that reflects the primary changes, such as 'Add lossy table markers and metadata enrichment to tomd pipeline', to accurately represent the scope of work.
✅ Passed checks (3 passed)
Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
Linked Issues check — ✅ Passed: check skipped because no linked issues were found for this pull request.
Out of Scope Changes check — ✅ Passed: check skipped because no linked issues were found for this pull request.




@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.



total is capped at 3 occurrences (keeps the first 3).

Headings and code fences are never dropped.
"""


Docstring contradicts actual dedup threshold constant

Medium Severity

The dedup_paragraphs docstring states paragraphs appearing "more than 3 times" are "capped at 3 occurrences," but the actual threshold is _MAX_PARAGRAPH_OCCURRENCES = 10. Either the constant is wrong (should be 3) or the docstring is wrong (should say 10). This likely resulted from a merge conflict where one branch changed the constant and another wrote the docstring with the old value.



for cell in tr.find_all(["th", "td"], recursive=False):
    col_count += int(cell.get("colspan", 1))
if col_count > max_cols:
    max_cols = col_count


Table denormalization undercounts columns with rowspan+colspan

Medium Severity

In _denormalize_table, the first pass computes max_cols by summing colspans per row independently, without accounting for columns consumed by rowspans from upper rows. When a row has both inherited rowspan cells AND its own cells, the effective column count exceeds the first-pass estimate, causing the grid to be too narrow and later cells to be silently dropped via the col_idx >= max_cols break.
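A corrected first pass would track rowspan carry-over while counting. The sketch below is illustrative only: cells are modeled as (rowspan, colspan) tuples rather than the soup nodes `_denormalize_table` actually walks.

```python
def effective_max_cols(rows):
    """Count the widest effective row, including columns consumed by
    rowspans inherited from rows above, which a naive per-row sum of
    colspans misses."""
    max_cols = 0
    pending = []  # (rows still covered below, colspan) for active rowspans
    for row in rows:
        carry = sum(cs for _, cs in pending)       # columns inherited from above
        cols = carry + sum(cs for _, cs in row)    # plus this row's own cells
        max_cols = max(max_cols, cols)
        # Age active rowspans, then register new ones from this row.
        pending = [(r - 1, cs) for r, cs in pending if r > 1]
        pending += [(rs - 1, cs) for rs, cs in row if rs > 1]
    return max_cols
```

With a 3-row-span, 2-column-wide cell in the first row, the naive per-row maximum is 3 while the effective width of the last row is 5; the naive estimate is exactly the undercount this finding describes.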



if bare_names and bare_emails and len(bare_names) == len(bare_emails):
    merged_entries: list[str] = []
    name_iter = iter(bare_names)
    email_iter = iter(bare_emails)


Unused variables name_iter and email_iter in enrichment

Low Severity

In _enrich_reply_to, name_iter and email_iter are assigned via iter() but never read. The actual iteration uses zip(bare_names, bare_emails) directly. These are dead stores left over from a refactor.



number = stem_m.group(2)
metadata["document"] = f"{prefix}{number}R{stem_rev}"
_log.debug("Overrode document revision from filename: %s -> %s",
           f"{doc_m.group(0)}", metadata["document"])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated _override_revision_from_filename across HTML and PDF

Low Severity

_override_revision_from_filename and its associated _PID_BASE_RE regex are defined identically in both lib/html/__init__.py and lib/pdf/__init__.py. This duplication increases maintenance burden and risks inconsistent bug fixes if one copy is updated without the other. The function belongs in the shared lib/__init__.py module.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

Note

Due to the large number of review comments, only Critical and Major severity comments were posted as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
packages/tomd/tests/fixtures/golden/p0957r8.golden.md (1)

1206-1224: ⚠️ Potential issue | 🟠 Major

Uncertain region markdown looks structurally broken (empty list item + heavily “wrapped” sentence).

Right after the updated <!-- tomd:uncertain:L1216-L1307 --> marker, there’s an isolated list marker - on its own line, and the following sentence is split into many single-word lines (If, typename, F::reflection_type, ...). This likely produces an empty list item and degrades the “unwrapped paragraph lines / correct list formatting” quality target.

Consider emitting the uncertain region in a way that preserves content but still:

  • attaches the bullet marker to the If ... text (e.g., - If typename ...)
  • joins the sentence back into a single logical line (or at least unbreaks words into a normal sentence form)
  • avoids extra blank lines around the marker
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p0957r8.golden.md` around lines 1206 -
1224, The uncertain-region output contains an isolated list marker and a heavily
wrapped sentence; locate the tomd uncertain block marked <!--
tomd:uncertain:L1216-L1307 --> and replace the broken fragment (the standalone
'-' and the subsequent single-word-wrapped lines starting with "If typename
F::reflection_type is not void...") with a single properly formatted bullet
whose text is one joined sentence, e.g. "- If typename F::reflection_type is not
void, it shall be constructible from std::in_place_type_t<P> in a constant
expression."; ensure there are no extra blank lines before/after the marker and
no empty list items remain.
packages/tomd/tests/fixtures/golden/p0533r9.golden.md (1)

1-10: ⚠️ Potential issue | 🟠 Major

Normalize audience to the canonical YAML form in this golden.

Line 7 is still quoted, so this fixture is not asserting the same serialization contract the formatter now targets. Updating the new fields without fixing that will keep the golden out of sync with the canonical front matter.

As per coding guidelines: Front matter title must be double-quoted; document, date, intent, audience must be unquoted; reply-to is a YAML list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p0533r9.golden.md` around lines 1 - 10,
The front-matter in the golden still quotes the audience field and may not match
the formatter's canonical form; update the YAML so that title is double-quoted
and document, date, intent, and audience are unquoted, and ensure reply-to is
rendered as a YAML list (preserve the existing reply-to array), i.e., change
"audience: \"LWG & CWG\"" to audience: LWG & CWG and verify title: "..." remains
quoted while document/date/intent are unquoted to match the canonical
front-matter expectations.
packages/tomd/tests/fixtures/golden/p3411r5.golden.md (1)

1-13: ⚠️ Potential issue | 🟠 Major

This fixture still encodes the old front-matter/body contract.

title at Line 2 is still unquoted, and the body starts with H1 headings at Lines 15 and 52. Updating only revision/intent will keep the goldens aligned with output the formatter is supposed to stop producing.

Based on learnings: Body headings in converted papers must start at H2. The front-matter title renders as H1, and no # H1 should appear in the body. As per coding guidelines: Front matter title must be double-quoted; document, date, intent, audience must be unquoted; reply-to is a YAML list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p3411r5.golden.md` around lines 1 - 13,
The front-matter uses the old contract: quote the title (change title: any_view
to title: "any_view") and keep document, date, intent, audience unquoted and
reply-to as a YAML list; then remove or demote body H1 headings (any leading "#
" headings in the markdown body) to H2 ("## ") so the front-matter title is the
only H1—ensure only revision and intent were intended changes and adjust the
fixture's front-matter and the two body headings accordingly.
packages/tomd/tests/fixtures/golden/p2728r11.golden.md (1)

1-10: ⚠️ Potential issue | 🟠 Major

This golden still blesses H1 body headings.

The body still uses # High-Level Overview, # UTF Primer, and other H1s, so updating only the front matter keeps this fixture asserting output that violates the promoted-title contract.

Based on learnings: Body headings in converted papers must start at H2. The front-matter title renders as H1, and no # H1 should appear in the body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p2728r11.golden.md` around lines 1 - 10,
The fixture's Markdown body contains H1 headings (e.g., "# High-Level Overview",
"# UTF Primer") while the front-matter title already renders as the document H1;
update the body so all top-level sections start at H2 by replacing leading
single '#' headings with '##' (or otherwise demoting any H1s in the body) and
ensure no "# " headings remain in the content while keeping the front-matter
title unchanged.
packages/tomd/tests/fixtures/golden/p4005r0.golden.md (1)

1-10: ⚠️ Potential issue | 🟠 Major

Canonicalize the whole front-matter/title shape here, not just revision/intent.

title at Line 2 is unquoted, and the body repeats it as an H1 at Line 14. If this golden is refreshed only partially, tests will continue to accept output that violates the current serialization rules.

Based on learnings: Body headings in converted papers must start at H2. The front-matter title renders as H1, and no # H1 should appear in the body. As per coding guidelines: Front matter title must be double-quoted; document, date, intent, audience must be unquoted; reply-to is a YAML list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p4005r0.golden.md` around lines 1 - 10,
The front-matter must be fully canonicalized: ensure the YAML keys 'title' is
double-quoted (e.g. title: "…"), keep 'document', 'date', 'intent', and
'audience' unquoted, and represent 'reply-to' as a YAML list; also remove the
duplicated H1 in the body so body headings start at H2. Locate the front-matter
block (fields: title, document, revision, date, intent, audience, reply-to) and
update the serialization to quote only title and format reply-to as a list, then
edit the body to eliminate the top-level "# ..." H1 so the first heading becomes
"## ...".
packages/tomd/src/tomd/lib/pdf/cleanup.py (1)

169-186: ⚠️ Potential issue | 🟠 Major

Apply the page-0 metadata exemption before span stripping too.

The new guard only preserves full-line matches. If page-0 metadata comes through as a merged spatial line, the line-level match fails, execution falls into the span loop, and matching spans above _page0_meta_y still get removed. That means the first-page WG21 metadata block can still be over-stripped.

🩹 Suggested guard
             if any(_matches(text, rp) for rp in line_patterns):
                 if block.page_num == 0 and line.bbox[1] < _page0_meta_y:
                     kept_lines.append(line)
                     continue
                 continue
+
+            if block.page_num == 0 and line.bbox[1] < _page0_meta_y:
+                kept_lines.append(line)
+                continue
 
             kept_spans = []
             stripped_any = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/cleanup.py` around lines 169 - 186, The page-0
metadata exemption currently only preserves whole lines matched by
line_patterns; modify the span-processing in the loop that iterates line.spans
(inside function handling block/line cleanup) to apply the same page-0 exemption
early: if block.page_num == 0 and line.bbox[1] < _page0_meta_y then skip
span-level pattern checks and append all spans to kept_spans (i.e., preserve the
merged spatial line as metadata), otherwise continue using _y_bucket(span.bbox)
-> _patterns_near(...) and _matches to decide stripping; reference symbols:
block, line, line_patterns, _page0_meta_y, line.spans, _y_bucket,
_patterns_near, _matches.
🟡 Minor comments (12)
packages/tomd/src/tomd/lib/html/ARCHITECTURE.md-106-114 (1)

106-114: ⚠️ Potential issue | 🟡 Minor

Fix the table-path summary for code tables.

Path 1 explicitly says the original table structure is dropped and fenced code blocks are emitted, so “All four paths emit standard CommonMark pipe tables” is inaccurate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/html/ARCHITECTURE.md` around lines 106 - 114,
Update the table-path summary in ARCHITECTURE.md to correct the statement that
"All four paths emit standard CommonMark pipe tables": explicitly note that
_render_code_table is an exception and emits fenced code blocks (dropping table
structure) while the other three paths emit pipe tables; reference the
_render_code_table, _render_table_flat, and _render_denormalized_table symbols
so the change is made adjacent to their descriptions and adjust the sentence
about lossiness/HTML marker accordingly (i.e., keep the lossy-table marker
behavior but exclude code-table from the "emit pipe tables" claim).
packages/tomd/tests/test_wg21.py-127-127 (1)

127-127: ⚠️ Potential issue | 🟡 Minor

Use _consumed for intentionally unused tuple values.

Both new tests unpack consumed and never use it, which triggers RUF059. Rename to _consumed to keep lint clean.

💡 Suggested patch
-    meta, consumed = extract_metadata_from_blocks([b])
+    meta, _consumed = extract_metadata_from_blocks([b])
@@
-    meta, consumed = extract_metadata_from_blocks([b])
+    meta, _consumed = extract_metadata_from_blocks([b])

Also applies to: 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_wg21.py` at line 127, The test unpacks the tuple
returned by extract_metadata_from_blocks([b]) into meta, consumed but never uses
consumed, triggering the unused-variable lint; change the unpack target from
consumed to _consumed in both occurrences (the call site using
extract_metadata_from_blocks([b]) at the second unpack and the other occurrence
noted) so the tuple is still captured but marked intentionally unused.
packages/tomd/tests/fixtures/golden/p1112r4.golden.md-2-12 (1)

2-12: ⚠️ Potential issue | 🟡 Minor

Remove leftover WG21 metadata line from body.

Front matter now carries the metadata, but the body still has a Reply-to ... Target ... metadata line. Keeping both in a golden fixture can preserve a normalization regression.

🧹 Suggested fixture cleanup
-
-Reply-to: Balog, Pal (pasa@lib.hu) Target: C++26
-

Based on learnings: WG21 metadata block becomes YAML front matter.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p1112r4.golden.md` around lines 2 - 12,
Remove the duplicated WG21 metadata line left in the document body: delete the
line starting with "Reply-to: Balog, Pal (pasa@lib.hu) Target: C++26" so the
information only exists in the YAML front matter
(title/document/revision/date/intent/audience/reply-to) and the golden fixture
no longer contains the redundant body metadata that can cause normalization
regressions.
packages/tomd/tests/fixtures/golden/p1068r11.golden.md-8-14 (1)

8-14: ⚠️ Potential issue | 🟡 Minor

Keep reply-to limited to contact entries.

Acknowledgements: Tomasz Kamiński is not a Name <email> contact, so this golden file now bakes metadata pollution into the expected output instead of the sanitized reply-to list. That should stay in the body, not in front matter.

🧹 Suggested fixture cleanup
 reply-to:
   - "Ilya Burylov <burylov@gmail.com>"
   - "Pavel Dyakov <pavel.dyakov@intel.com>"
   - "Ruslan Arutyunyan <ruslan.arutyunyan@intel.com>"
   - "Andrey Nikolaev <af.nikolaev@gmail.com>"
   - "Alina Elizarova <alina.elizarova@intel.com>"
-  - "Acknowledgements: Tomasz Kamiński"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/golden/p1068r11.golden.md` around lines 8 - 14,
The frontmatter reply-to array contains a non-contact entry "Acknowledgements:
Tomasz Kamiński" which pollutes the expected metadata; remove that exact string
from the reply-to list and move it into the document body (e.g., append a short
"Acknowledgements: Tomasz Kamiński" line in the markdown content) so reply-to
only contains Name <email> contact entries and the acknowledgement is preserved
in the body.
packages/tomd/tests/test_fallback.py-36-40 (1)

36-40: ⚠️ Potential issue | 🟡 Minor

Rename l here, Ruff E741 will fail this file.

These comprehensions and loops use the ambiguous name l, which Ruff flags as an error. Renaming them to line is enough.

Possible fix
-            l.split(":")[0]
-            for l in result.splitlines()
-            if l and not l.startswith((" ", "\t", "-"))
+            line.split(":")[0]
+            for line in result.splitlines()
+            if line and not line.startswith((" ", "\t", "-"))

Also applies to: 47-50, 64-64, 97-109, 165-168

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_fallback.py` around lines 36 - 40, Rename the
ambiguous loop/comprehension variable `l` to `line` in the test file so Ruff
error E741 is resolved; update the list comprehension that builds `keys` (the
comprehension using `for l in result.splitlines()`), and likewise replace `l`
with `line` in the other occurrences called out (around the other
comprehensions/loops at the ranges noted: 47-50, 64, 97-109, 165-168), keeping
the logic identical.
packages/tomd/src/tomd/lib/html/__init__.py-92-111 (1)

92-111: ⚠️ Potential issue | 🟡 Minor

Parse the front-matter boundary once instead of searching for the next ---.

Both H1-stripping passes use md.find("---", 4) to find the closing fence. If a title or another scalar contains ---, the body slice starts at the wrong place and strip_leading_h1() stops being reliable. Please split the document with a front-matter regex/helper here, then reuse that exact boundary in both passes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/html/__init__.py` around lines 92 - 111, The two
H1-stripping blocks repeatedly call md.find("---", 4) which can mis-locate the
front-matter end if `---` appears elsewhere; replace those duplicate searches by
extracting the front-matter boundary once (e.g., with a single regex or a helper
like find_front_matter_end(md)) and store the resulting index (fm_end) and body
start; then reuse that stored boundary for both uses of strip_leading_h1 and
when slicing md before/after calling strip_redundant_body_meta; update
references to md, metadata, title, strip_leading_h1 and
strip_redundant_body_meta to use the single computed fm_end/body_start so both
passes operate on the exact same split.
packages/tomd/tests/test_pdf_wg21.py-1-3 (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Use the repository's standard BSL header format on this new file.

This header is missing the usual file-author attribution, so it does not match the repo rule for new Python files.

As per coding guidelines: **/*.py: Use BSL-1.0 copyright headers on new .py files, attributed to the file author. Leave existing headers unchanged
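A minimal sketch of the expected header shape (the year and author name are placeholders; the repo's canonical wording may differ slightly):

```python
#
# Copyright (c) 2026 <file author>
#
# Distributed under the Boost Software License, Version 1.0.
# (See accompanying file LICENSE_1_0.txt or copy at
# https://www.boost.org/LICENSE_1_0.txt)
#
```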

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_pdf_wg21.py` around lines 1 - 3, Update the file
header in test_pdf_wg21.py to use the repository's standard Boost Software
License 1.0 (BSL-1.0) header format and include the required file-author
attribution; locate the top-of-file comment block and replace it with the
project's canonical BSL header that includes the file author (e.g., "Author:
<name>") while leaving any existing license text otherwise unchanged so the file
matches the repo rule for new .py files.
packages/tomd/src/tomd/api.py-126-132 (1)

126-132: ⚠️ Potential issue | 🟡 Minor

Keep reply-to last when reordering YAML.

_reorder_yaml_body() currently sorts unknown keys after reply-to, which disagrees with the canonical front-matter contract used by format_front_matter(). Any body that contains both custom keys and reply-to will be reordered incorrectly here.

Possible fix
-    priority = {k: i for i, k in enumerate(FRONT_MATTER_ORDER)}
-    fallback = len(FRONT_MATTER_ORDER)
-    sorted_keys = sorted(order, key=lambda k: (priority.get(k, fallback), order.index(k)))
+    priority = {
+        key: index
+        for index, key in enumerate(FRONT_MATTER_ORDER)
+        if key != "reply-to"
+    }
+    sorted_keys = sorted(
+        [key for key in order if key != "reply-to"],
+        key=lambda key: (priority.get(key, len(priority)), order.index(key)),
+    )
     lines: list[str] = []
-    for k in sorted_keys:
-        lines.extend(blocks[k])
+    for key in sorted_keys:
+        lines.extend(blocks[key])
+    if "reply-to" in blocks:
+        lines.extend(blocks["reply-to"])
     return "\n".join(lines)

Based on learnings: Front matter keys must be emitted in strict canonical order: title, document, revision, date, intent, audience, reply-to. Missing keys are skipped. Unknown keys appear after audience so reply-to is always last.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/api.py` around lines 126 - 132, _reorder_yaml_body()
currently assigns unknown keys a fallback index of len(FRONT_MATTER_ORDER),
which places them after "reply-to" and causes reply-to to not be last; change
the fallback to the index of "reply-to" in FRONT_MATTER_ORDER (or otherwise
special-case "reply-to" to always sort last) so unknown keys get the same
priority as keys that come before reply-to and "reply-to" is guaranteed to be
emitted last; update references in the function (priority, fallback,
sorted_keys) accordingly and ensure this matches format_front_matter() canonical
order.
packages/tomd/tests/test_gold_standard.py-94-95 (1)

94-95: ⚠️ Potential issue | 🟡 Minor

Rename l to avoid Ruff E741 in this new test module.

The new comprehensions at Lines 94, 101, and 174 use l, which Ruff flags as ambiguous.

Also applies to: 101-101, 174-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_gold_standard.py` around lines 94 - 95, The
comprehensions in test_gold_standard.py (e.g. the one that builds h1_lines using
"l") use the ambiguous variable name "l" which triggers Ruff E741; rename that
loop variable to a clearer identifier like "line" (and similarly update the
other comprehensions around the other occurrences) so the comprehensions (e.g.
the h1_lines comprehension and the two other list comprehensions at the noted
locations) use "line" instead of "l" to avoid the E741 warning and keep intent
clear.
packages/tomd/src/tomd/lib/pdf/__init__.py-190-195 (1)

190-195: ⚠️ Potential issue | 🟡 Minor

Keep enriching partially populated reply-to lists.

This returns as soon as any email is present, so a mixed list like ["Alice", "Bob <bob@example.com>"] never gets a chance to pair Alice with the page 0 email. The safety-net should only bail out when every existing entry already has an email.
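Possible predicate (a sketch; `EMAIL_RE` stands in for the module-level pattern already defined in `pdf/__init__.py`). Checking each entry individually makes the bail-out fire only when the whole list is already enriched:

```python
import re

# Stand-in for the module-level EMAIL_RE in pdf/__init__.py.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def should_skip_enrichment(reply_to: list[str]) -> bool:
    """Bail out only when every existing entry already carries an email."""
    return bool(reply_to) and all(EMAIL_RE.search(entry) for entry in reply_to)
```

With this shape, `["Alice", "Bob <bob@example.com>"]` still proceeds to pairing, while a fully populated list short-circuits as before.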

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/__init__.py` around lines 190 - 195, The
current logic in the reply-to enrichment prematurely returns if any email is
found because it checks existing_emails on the joined string; change it to only
bail when every existing entry already contains an email. Replace the block that
computes existing_joined/existing_emails and the early return with a per-entry
check using EMAIL_RE.search on each element of metadata.get("reply-to", [])
(e.g., compute entries = metadata.get("reply-to", []); if entries and
all(EMAIL_RE.search(e) for e in entries): return) so partial lists like
["Alice", "Bob <bob@example.com>"] will still be enriched.
packages/tomd/tests/test_html_render.py-433-433 (1)

433-433: ⚠️ Potential issue | 🟡 Minor

Rename l to avoid Ruff E741.

The new table assertions use l as a loop variable at Lines 433 and 530, which Ruff flags as ambiguous.

Also applies to: 530-530

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_html_render.py` at line 433, Replace the ambiguous
loop variable name `l` used in the list comprehensions that build `lines` from
`md.splitlines()` with a clearer name (for example `line` or `ln`) in both
occurrences (the comprehension currently written as [l for l in md.splitlines()
if l.startswith("|")]) and any other test assertions that iterate using `l`;
update all references to that variable in the same expression so Ruff E741 is
resolved and behavior remains unchanged.
plans/QA-003-tomd-deep-analysis.md-44-48 (1)

44-48: ⚠️ Potential issue | 🟡 Minor

Refresh the public-contract details in this analysis doc.

This currently documents a stale convert_paper signature, lists only 6 HTML generator families, and omits revision from the canonical front-matter order. Future work that uses this file as reference will start from the wrong API/spec surface.

Based on learnings, convert_paper(paper_id, store, *, write_prompts=True) is the public API entry point and the HTML pipeline supports eight generator families. As per coding guidelines, "Canonical YAML front matter for converted papers must have fixed field order: title, document, revision, date, intent, audience, reply-to".

Also applies to: 139-140, 314-317

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plans/QA-003-tomd-deep-analysis.md` around lines 44 - 48, Update the doc to
reflect the current public API and spec: change the documented entry point
signature to convert_paper(paper_id, store, *, write_prompts=True) (showing the
store param and keyword-only write_prompts), correct the HTML pipeline family
count to eight generator families, and refresh every occurrence of the canonical
YAML front-matter order to include revision in this exact sequence: title,
document, revision, date, intent, audience, reply-to; apply these edits wherever
the old signature, generator count, or front-matter ordering appear in the
analysis doc.
🧹 Nitpick comments (4)
.gitignore (1)

26-35: Risk: _*.py and _*.json are overly broad.

The patterns _*.py and _*.json will ignore any repo file (tracked or otherwise) whose basename starts with _, which can accidentally hide legitimate source/config files and also makes it easier for contributors to “miss” accidentally-created artifacts. Since you already have more specific QA ignores below (_qa016_renders/, _qa016_visual_review.html, qa-report*.json), consider narrowing these broad patterns to just the expected QA/local artifacts.

Suggested tightening (example)
- _*.py
- _*.json
+ _qa*.py
+ _qa*.json

(Adjust the prefixes to match your actual local artifacts; the goal is to avoid a blanket “leading underscore” ignore.)

[Because intent isn’t fully verifiable from this file alone, please confirm the repo never needs files that start with _ outside of local QA artifacts.]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore around lines 26 - 35, The two broad ignore patterns `_*.py` and
`_*.json` are too permissive; replace them with more specific patterns that
match only the local QA artifacts you actually produce (for example tighten to
`_qa*.py`, `_qa016_*.json`, or explicit filenames/directories) so legitimate
repo files that start with `_` aren’t accidentally ignored; update the
.gitignore entries currently showing `_*.py` and `_*.json` to narrowly match the
existing QA entries like `_qa016_renders/`, `_qa016_visual_review.html`, and
`qa-report*.json` (or explicit artifact names) and confirm no other project
files intentionally start with `_` outside those QA artifacts.
packages/tomd/tests/test_paper_extract.py (1)

176-177: Also assert revision in canonical front-matter expectation.

This test now checks document: P1 normalization from P1R0, but it does not verify that the extracted revision is preserved. Adding revision: 0 makes the test cover the full normalization contract.

✅ Suggested expectation update
         expected = (
             "---\n"
             'title: "Out Of Order"\n'
             "document: P1\n"
+            "revision: 0\n"
             "date: 2026-04-28\n"
             "intent: info\n"
             "audience: LEWG\n"
             "reply-to:\n"
             '  - "Alice <a@x>"\n'
             "---"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_paper_extract.py` around lines 176 - 177, Update the
test expectation in tests/test_paper_extract.py so the canonical front-matter
includes the extracted revision; specifically, where the test asserts
normalization from "P1R0" to "document: P1" also assert "revision: 0" is present
in the expected front-matter string (the block constructed in the failing test
case), ensuring the test for document normalization covers preservation of the
revision value.
packages/tomd/tests/test_gold_standard.py (1)

22-51: Assert the full YAML contract here, not just field presence.

Because _parse_front_matter() strips quotes and collapses everything into a dict, these gold tests cannot catch regressions in canonical key order, revision, or title quoting. This file is the right place to pin that contract.

As per coding guidelines, "Canonical YAML front matter for converted papers must have fixed field order: title, document, revision, date, intent, audience, reply-to" and "Front matter title must be double-quoted; document, date, intent, audience must be unquoted; reply-to is a YAML list".

Also applies to: 64-87
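One way to pin the order without a YAML dependency (a sketch; it mirrors the top-level key scan already used in test_fallback.py, and `CANONICAL` is the order stated in the guidelines):

```python
CANONICAL = ["title", "document", "revision", "date", "intent", "audience", "reply-to"]


def front_matter_keys(md: str) -> list[str]:
    """Top-level key names from a leading YAML front-matter block, in file order."""
    assert md.startswith("---\n"), "expected front matter at start of document"
    block = md[4:].split("\n---", 1)[0]
    keys = []
    for line in block.splitlines():
        # Skip list items and indented continuation lines; keep "key:" lines only.
        if line and not line.startswith((" ", "\t", "-")) and ":" in line:
            keys.append(line.split(":", 1)[0])
    return keys
```

A gold test can then assert `front_matter_keys(md) == [k for k in CANONICAL if k in front_matter_keys(md)]` plus a raw-text check that the `title:` line is double-quoted.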

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/test_gold_standard.py` around lines 22 - 51, The current
_parse_front_matter function in tests/test_gold_standard.py flattens and strips
the YAML (losing quote info, ordering, and list structure) so the gold tests
can't assert the full canonical front-matter contract; change the tests to
either (A) capture and assert the raw YAML front-matter block exactly (return
the raw block from _parse_front_matter or add a helper get_raw_front_matter) or
(B) parse with a YAML loader that preserves ordering and quoting (e.g.,
ruamel.yaml) and assert that the sequence of keys equals
["title","document","revision","date","intent","audience","reply-to"], that
title is double-quoted, document/date/intent/audience are unquoted scalars, and
reply-to is a list; update _parse_front_matter (or add a new parser used by the
gold tests) and the assertions in the tests to enforce those exact field formats
and order.
packages/tomd/src/tomd/lib/pdf/wg21.py (1)

341-348: Move these regexes to module scope.

_HEADING_RE and _EMAIL_LINE_RE are compiled inside extract_metadata_from_blocks(). This file already centralizes the other metadata patterns at module level, and keeping these local makes the known-section and metadata matching rules harder to maintain. As per coding guidelines, "Regex patterns for section numbers, known section names, list markers, and metadata fields must be precompiled at module level and defined in one place."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py` around lines 341 - 348, Move the two
regex compilations out of extract_metadata_from_blocks() and define them at
module scope alongside the other metadata patterns: create module-level
constants _HEADING_RE =
re.compile(r"^(?:Abstract|Contents|Table\s+of\s+Contents|Introduction|Foreword|Revision|Preamble|Overview|Motivation)\b",
re.IGNORECASE) and _EMAIL_LINE_RE = re.compile(r"^(.+?)\s*[<(](" +
EMAIL_RE.pattern + r")[)>]\s*$") (reusing the existing module-level EMAIL_RE),
then remove the local compilations inside extract_metadata_from_blocks() and
update that function to reference the module-level _HEADING_RE and
_EMAIL_LINE_RE symbols.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/tomd/src/tomd/lib/__init__.py`:
- Around line 289-300: The call to _REDUNDANT_META_RE.sub("", md) in
strip_redundant_body_meta currently runs over the whole document and can remove
legitimate later occurrences; change it to only operate on the small region
immediately after the YAML front matter (the same scope used by
_strip_metadata_table). Locate the end of the front matter (e.g., the closing
--- or closing ```yaml front-matter marker) and apply _REDUNDANT_META_RE.sub to
the substring between that end and the start of the body preamble (the same
slice _strip_metadata_table inspects), then reassemble md and call
_strip_metadata_table as before. Ensure you reference strip_redundant_body_meta,
_REDUNDANT_META_RE and _strip_metadata_table when making the change.
- Around line 64-102: The current dedup_paragraphs splits on "\n\n", which
breaks fenced code blocks containing blank lines; modify dedup_paragraphs to
first tokenize the markdown into blocks while preserving fenced code blocks by
scanning lines and tracking an in_fence boolean (toggle when encountering lines
that start with "```" or other fence markers), accumulating lines into the
current block and allowing internal blank lines when in_fence so a code fence
becomes a single block; then feed that blocks list into the existing Pass 1 and
Pass 2 logic (keeping use of _MIN_DEDUP_LENGTH, _MAX_PARAGRAPH_OCCURRENCES and
the _dedup_log) and ensure detection of code fences for the skip logic still
works (block startswith "```" or contains the opening fence).
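A sketch of the fence-aware tokenizer described above (illustrative, not the repo's code); blank lines flush a block only when we are outside a fence:

```python
def split_blocks(md: str) -> list[str]:
    """Split markdown into blank-line-separated blocks, keeping each fenced
    code block (including any blank lines inside it) as one block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
            current.append(line)
        elif line.strip() == "" and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```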
- Around line 257-262: The code currently removes any leading H1 when processing
headings even if front-matter title is missing; update the logic in the block
that uses stripped, title, title_clean and _titles_match so that you only delete
the H1 when title_clean is non-empty and _titles_match(h1_text, title_clean) is
true, otherwise demote a body H1 to H2 (replace lines[i] with "## " + h1_text)
instead of setting it to ""; preserve the existing title_clean =
title.strip().strip('"').strip() and h1_text = stripped[2:].strip() calculations
and only call _titles_match to decide deletion when title_clean is truthy.

In `@packages/tomd/src/tomd/lib/html/extract.py`:
- Around line 17-19: The _extract_mailto_email function currently only strips
the mailto prefix and leaves query parameters (e.g. ?subject=) in the returned
string; update _extract_mailto_email to remove the mailto/mailto:// prefix and
then strip any URL query part so it returns only the bare email address (stop at
'?' or use a URL parser), ensuring examples like
"mailto:alice@example.com?subject=x" become "alice@example.com" (refer to
_extract_mailto_email and _MAILTO_PREFIX_RE to find where to apply the change).
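A sketch of the stripping described above (the function name mirrors the one under review, but this body is illustrative):

```python
def extract_mailto_email(href: str) -> str:
    """Return the bare address from a mailto href, or "" for other schemes."""
    if not href.lower().startswith("mailto:"):
        return ""
    rest = href.split(":", 1)[1].lstrip("/")  # tolerate "mailto://" variants
    return rest.split("?", 1)[0]              # drop any ?subject=... query part
```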

In `@packages/tomd/src/tomd/lib/pdf/__init__.py`:
- Around line 203-208: Move the metadata regex compilations out of request-time
code and define _NAMED_EMAIL_RE, _BARE_EMAIL_RE, _TITLE_BOILERPLATE_RE, and
_AUTHOR_BOILERPLATE_RE at module scope so they are compiled once; find where
those names are currently created inline and replace those runtime re.compile
calls with references to the module-level compiled patterns, ensuring the same
pattern strings are used (e.g., reuse EMAIL_RE.pattern when composing
_NAMED_EMAIL_RE/_BARE_EMAIL_RE) and remove any duplicate compilations in the
other occurrence blocks that currently recreate these regexes.
- Around line 48-53: The current branch replaces the whole document id using
stem_m (which can change the paper series); instead, only replace the revision
suffix: when stem_rev is not None and stem_rev != doc_rev, build
metadata["document"] from the original doc match (use doc_m.group(1).upper() and
doc_m.group(2)) and append "R{stem_rev}", and update the _log.debug call to show
the old doc id -> new doc id; referenced symbols: stem_rev, doc_rev, stem_m,
doc_m, metadata["document"], _log.debug.

In `@packages/tomd/src/tomd/lib/pdf/structure.py`:
- Around line 416-428: The current block sets title_found and immediately
appends sec to structured then continues, preventing the later
heading-classification code from running; change it so that when is_large and
(is_known or is_section_num) you only set title_found = True (do not call
structured.append(sec) or continue) so the section falls through into the normal
heading classification path (which will assign SectionKind.HEADING); also ensure
KNOWN_SECTIONS includes the listed unnumbered titles (Abstract, Revision
History, References, Acknowledgements, Motivation, Wording, Proposed Wording,
Design Decisions) and that SECTION_NUM_RE/related checks remain unchanged.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py`:
- Around line 112-115: The branch handling label normalization currently maps
label_lower in ("audience", "subgroup", "target") to metadata["audience"], which
causes labels like "Project" (now accepted by _LABEL_RE) to be matched then
discarded; update that branch so it also includes "project" in the alias set
(e.g., ("audience", "subgroup", "target", "project")), or alternatively
normalize label_lower earlier before this dispatch; ensure you use the same
helpers (value_lines, _clean) and assign the cleaned value into
metadata["audience"] (or perform normalization to a separate key if intended).

In `@packages/tomd/tests/fixtures/d4036-gold-standard.md`:
- Around line 2-9: The front-matter for the canonical fixture is missing the
required revision field; update the YAML header in the fixture for document
"P4036R0" by adding revision: 0 so the canonical output matches the new
shape—ensure the top-matter now contains title, document, date, intent,
audience, reply-to and revision: 0 to prevent non-canonical markdown from being
blessed in tests.

---

Outside diff comments:
In `@packages/tomd/src/tomd/lib/pdf/cleanup.py`:
- Around line 169-186: The page-0 metadata exemption currently only preserves
whole lines matched by line_patterns; modify the span-processing in the loop
that iterates line.spans (inside function handling block/line cleanup) to apply
the same page-0 exemption early: if block.page_num == 0 and line.bbox[1] <
_page0_meta_y then skip span-level pattern checks and append all spans to
kept_spans (i.e., preserve the merged spatial line as metadata), otherwise
continue using _y_bucket(span.bbox) -> _patterns_near(...) and _matches to
decide stripping; reference symbols: block, line, line_patterns, _page0_meta_y,
line.spans, _y_bucket, _patterns_near, _matches.

In `@packages/tomd/tests/fixtures/golden/p0533r9.golden.md`:
- Around line 1-10: The front-matter in the golden still quotes the audience
field and may not match the formatter's canonical form; update the YAML so that
title is double-quoted and document, date, intent, and audience are unquoted,
and ensure reply-to is rendered as a YAML list (preserve the existing reply-to
array), i.e., change "audience: \"LWG & CWG\"" to audience: LWG & CWG and verify
title: "..." remains quoted while document/date/intent are unquoted to match the
canonical front-matter expectations.

In `@packages/tomd/tests/fixtures/golden/p0957r8.golden.md`:
- Around line 1206-1224: The uncertain-region output contains an isolated list
marker and a heavily wrapped sentence; locate the tomd uncertain block marked
<!-- tomd:uncertain:L1216-L1307 --> and replace the broken fragment (the
standalone '-' and the subsequent single-word-wrapped lines starting with "If
typename F::reflection_type is not void...") with a single properly formatted
bullet whose text is one joined sentence, e.g. "- If typename F::reflection_type
is not void, it shall be constructible from std::in_place_type_t<P> in a
constant expression."; ensure there are no extra blank lines before/after the
marker and no empty list items remain.

In `@packages/tomd/tests/fixtures/golden/p2728r11.golden.md`:
- Around line 1-10: The fixture's Markdown body contains H1 headings (e.g., "#
High-Level Overview", "# UTF Primer") while the front-matter title already
renders as the document H1; update the body so all top-level sections start at
H2 by replacing leading single '#' headings with '##' (or otherwise demoting any
H1s in the body) and ensure no "# " headings remain in the content while keeping
the front-matter title unchanged.

In `@packages/tomd/tests/fixtures/golden/p3411r5.golden.md`:
- Around line 1-13: The front-matter uses the old contract: quote the title
(change title: any_view to title: "any_view") and keep document, date, intent,
audience unquoted and reply-to as a YAML list; then remove or demote body H1
headings (any leading "# " headings in the markdown body) to H2 ("## ") so the
front-matter title is the only H1—ensure only revision and intent were intended
changes and adjust the fixture's front-matter and the two body headings
accordingly.

In `@packages/tomd/tests/fixtures/golden/p4005r0.golden.md`:
- Around line 1-10: The front-matter must be fully canonicalized: ensure the
YAML keys 'title' is double-quoted (e.g. title: "…"), keep 'document', 'date',
'intent', and 'audience' unquoted, and represent 'reply-to' as a YAML list; also
remove the duplicated H1 in the body so body headings start at H2. Locate the
front-matter block (fields: title, document, revision, date, intent, audience,
reply-to) and update the serialization to quote only title and format reply-to
as a list, then edit the body to eliminate the top-level "# ..." H1 so the first
heading becomes "## ...".

---

Minor comments:
In `@packages/tomd/src/tomd/api.py`:
- Around line 126-132: _reorder_yaml_body() currently assigns unknown keys a
fallback index of len(FRONT_MATTER_ORDER), which places them after "reply-to"
and causes reply-to to not be last; change the fallback to the index of
"reply-to" in FRONT_MATTER_ORDER (or otherwise special-case "reply-to" to always
sort last) so unknown keys get the same priority as keys that come before
reply-to and "reply-to" is guaranteed to be emitted last; update references in
the function (priority, fallback, sorted_keys) accordingly and ensure this
matches format_front_matter() canonical order.

In `@packages/tomd/src/tomd/lib/html/__init__.py`:
- Around line 92-111: The two H1-stripping blocks repeatedly call md.find("---",
4) which can mis-locate the front-matter end if `---` appears elsewhere; replace
those duplicate searches by extracting the front-matter boundary once (e.g.,
with a single regex or a helper like find_front_matter_end(md)) and store the
resulting index (fm_end) and body start; then reuse that stored boundary for
both uses of strip_leading_h1 and when slicing md before/after calling
strip_redundant_body_meta; update references to md, metadata, title,
strip_leading_h1 and strip_redundant_body_meta to use the single computed
fm_end/body_start so both passes operate on the exact same split.

In `@packages/tomd/src/tomd/lib/html/ARCHITECTURE.md`:
- Around line 106-114: Update the table-path summary in ARCHITECTURE.md to
correct the statement that "All four paths emit standard CommonMark pipe
tables": explicitly note that _render_code_table is an exception and emits
fenced code blocks (dropping table structure) while the other three paths emit
pipe tables; reference the _render_code_table, _render_table_flat, and
_render_denormalized_table symbols so the change is made adjacent to their
descriptions and adjust the sentence about lossiness/HTML marker accordingly
(i.e., keep the lossy-table marker behavior but exclude code-table from the
"emit pipe tables" claim).

In `@packages/tomd/src/tomd/lib/pdf/__init__.py`:
- Around line 190-195: The current logic in the reply-to enrichment prematurely
returns if any email is found because it checks existing_emails on the joined
string; change it to only bail when every existing entry already contains an
email. Replace the block that computes existing_joined/existing_emails and the
early return with a per-entry check using EMAIL_RE.search on each element of
metadata.get("reply-to", []) (e.g., compute entries = metadata.get("reply-to",
[]); if entries and all(EMAIL_RE.search(e) for e in entries): return) so partial
lists like ["Alice", "Bob <bob@example.com>"] will still be enriched.

In `@packages/tomd/tests/fixtures/golden/p1068r11.golden.md`:
- Around line 8-14: The frontmatter reply-to array contains a non-contact entry
"Acknowledgements: Tomasz Kamiński" which pollutes the expected metadata; remove
that exact string from the reply-to list and move it into the document body
(e.g., append a short "Acknowledgements: Tomasz Kamiński" line in the markdown
content) so reply-to only contains Name <email> contact entries and the
acknowledgement is preserved in the body.

In `@packages/tomd/tests/fixtures/golden/p1112r4.golden.md`:
- Around line 2-12: Remove the duplicated WG21 metadata line left in the
document body: delete the line starting with "Reply-to: Balog, Pal (pasa@lib.hu)
Target: C++26" so the information only exists in the YAML front matter
(title/document/revision/date/intent/audience/reply-to) and the golden fixture
no longer contains the redundant body metadata that can cause normalization
regressions.

In `@packages/tomd/tests/test_fallback.py`:
- Around line 36-40: Rename the ambiguous loop/comprehension variable `l` to
`line` in the test file so Ruff error E741 is resolved; update the list
comprehension that builds `keys` (the comprehension using `for l in
result.splitlines()`), and likewise replace `l` with `line` in the other
occurrences called out (around the other comprehensions/loops at the ranges
noted: 47-50, 64, 97-109, 165-168), keeping the logic identical.

In `@packages/tomd/tests/test_gold_standard.py`:
- Around line 94-95: The comprehensions in test_gold_standard.py (e.g. the one
that builds h1_lines using "l") use the ambiguous variable name "l" which
triggers Ruff E741; rename that loop variable to a clearer identifier like
"line" (and similarly update the other comprehensions around the other
occurrences) so the comprehensions (e.g. the h1_lines comprehension and the two
other list comprehensions at the noted locations) use "line" instead of "l" to
avoid the E741 warning and keep intent clear.

In `@packages/tomd/tests/test_html_render.py`:
- Line 433: Replace the ambiguous loop variable name `l` used in the list
comprehensions that build `lines` from `md.splitlines()` with a clearer name
(for example `line` or `ln`) in both occurrences (the comprehension currently
written as [l for l in md.splitlines() if l.startswith("|")]) and any other test
assertions that iterate using `l`; update all references to that variable in the
same expression so Ruff E741 is resolved and behavior remains unchanged.

In `@packages/tomd/tests/test_pdf_wg21.py`:
- Around line 1-3: Update the file header in test_pdf_wg21.py to use the
repository's standard Boost Software License 1.0 (BSL-1.0) header format and
include the required file-author attribution; locate the top-of-file comment
block and replace it with the project's canonical BSL header that includes the
file author (e.g., "Author: <name>") while leaving any existing license text
otherwise unchanged so the file matches the repo rule for new .py files.

In `@packages/tomd/tests/test_wg21.py`:
- Line 127: The test unpacks the tuple returned by
extract_metadata_from_blocks([b]) into meta, consumed but never uses consumed,
triggering the unused-variable lint; change the unpack target from consumed to
_consumed in both occurrences (the call site using
extract_metadata_from_blocks([b]) at the second unpack and the other occurrence
noted) so the tuple is still captured but marked intentionally unused.

In `@plans/QA-003-tomd-deep-analysis.md`:
- Around line 44-48: Update the doc to reflect the current public API and spec:
change the documented entry point signature to convert_paper(paper_id, store, *,
write_prompts=True) (showing the store param and keyword-only write_prompts),
correct the HTML pipeline family count to eight generator families, and refresh
every occurrence of the canonical YAML front-matter order to include revision in
this exact sequence: title, document, revision, date, intent, audience,
reply-to; apply these edits wherever the old signature, generator count, or
front-matter ordering appear in the analysis doc.

---

Nitpick comments:
In @.gitignore:
- Around line 26-35: The two broad ignore patterns `_*.py` and `_*.json` are too
permissive; replace them with more specific patterns that match only the local
QA artifacts you actually produce (for example tighten to `_qa*.py`,
`_qa016_*.json`, or explicit filenames/directories) so legitimate repo files
that start with `_` aren’t accidentally ignored; update the .gitignore entries
currently showing `_*.py` and `_*.json` to narrowly match the existing QA
entries like `_qa016_renders/`, `_qa016_visual_review.html`, and
`qa-report*.json` (or explicit artifact names) and confirm no other project
files intentionally start with `_` outside those QA artifacts.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py`:
- Around line 341-348: Move the two regex compilations out of
extract_metadata_from_blocks() and define them at module scope alongside the
other metadata patterns: create module-level constants _HEADING_RE =
re.compile(r"^(?:Abstract|Contents|Table\s+of\s+Contents|Introduction|Foreword|Revision|Preamble|Overview|Motivation)\b",
re.IGNORECASE) and _EMAIL_LINE_RE = re.compile(r"^(.+?)\s*[<(](" +
EMAIL_RE.pattern + r")[)>]\s*$") (reusing the existing module-level EMAIL_RE),
then remove the local compilations inside extract_metadata_from_blocks() and
update that function to reference the module-level _HEADING_RE and
_EMAIL_LINE_RE symbols.

In `@packages/tomd/tests/test_gold_standard.py`:
- Around line 22-51: The current _parse_front_matter function in
tests/test_gold_standard.py flattens and strips the YAML (losing quote info,
ordering, and list structure) so the gold tests can't assert the full canonical
front-matter contract; change the tests to either (A) capture and assert the raw
YAML front-matter block exactly (return the raw block from _parse_front_matter
or add a helper get_raw_front_matter) or (B) parse with a YAML loader that
preserves ordering and quoting (e.g., ruamel.yaml) and assert that the sequence
of keys equals
["title","document","revision","date","intent","audience","reply-to"], that
title is double-quoted, document/date/intent/audience are unquoted scalars, and
reply-to is a list; update _parse_front_matter (or add a new parser used by the
gold tests) and the assertions in the tests to enforce those exact field formats
and order.
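Option (A) can be sketched with a stdlib-only helper that returns the raw front-matter block verbatim, so the gold tests can assert quoting and key order byte-for-byte (the name `get_raw_front_matter` follows the suggestion above; the fence detection is illustrative):

```python
def get_raw_front_matter(md: str) -> str:
    """Return the raw YAML between the opening and closing --- fences.

    Returns "" when the document does not start with front matter.
    """
    lines = md.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return ""  # unterminated front matter
```

A test can then compare against the exact expected block, or derive the key sequence by splitting each line on `:`.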

In `@packages/tomd/tests/test_paper_extract.py`:
- Around line 176-177: Update the test expectation in
tests/test_paper_extract.py so the canonical front-matter includes the extracted
revision; specifically, where the test asserts normalization from "P1R0" to
"document: P1" also assert "revision: 0" is present in the expected front-matter
string (the block constructed in the failing test case), ensuring the test for
document normalization covers preservation of the revision value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06f432c7-2b83-4790-9643-8e9dac09388d

📥 Commits

Reviewing files that changed from the base of the PR and between 5510d10 and b9aeb0f.

⛔ Files ignored due to path filters (1)
  • packages/tomd/tests/fixtures/p2583r3-gold-standard.pdf is excluded by !**/*.pdf
📒 Files selected for processing (45)
  • .gitignore
  • CLAUDE.md
  • packages/tomd/src/tomd/CLAUDE.md
  • packages/tomd/src/tomd/README.md
  • packages/tomd/src/tomd/api.py
  • packages/tomd/src/tomd/lib/__init__.py
  • packages/tomd/src/tomd/lib/html/ARCHITECTURE.md
  • packages/tomd/src/tomd/lib/html/__init__.py
  • packages/tomd/src/tomd/lib/html/extract.py
  • packages/tomd/src/tomd/lib/html/render.py
  • packages/tomd/src/tomd/lib/pdf/__init__.py
  • packages/tomd/src/tomd/lib/pdf/cleanup.py
  • packages/tomd/src/tomd/lib/pdf/emit.py
  • packages/tomd/src/tomd/lib/pdf/qa.py
  • packages/tomd/src/tomd/lib/pdf/structure.py
  • packages/tomd/src/tomd/lib/pdf/wg21.py
  • packages/tomd/tests/fixtures/d4036-gold-standard.md
  • packages/tomd/tests/fixtures/golden/n5034.golden.md
  • packages/tomd/tests/fixtures/golden/p0533r9.golden.md
  • packages/tomd/tests/fixtures/golden/p0957r8.golden.md
  • packages/tomd/tests/fixtures/golden/p1068r11.golden.md
  • packages/tomd/tests/fixtures/golden/p1112r4.golden.md
  • packages/tomd/tests/fixtures/golden/p1122r3.golden.md
  • packages/tomd/tests/fixtures/golden/p2040r0.golden.md
  • packages/tomd/tests/fixtures/golden/p2728r11.golden.md
  • packages/tomd/tests/fixtures/golden/p3411r5.golden.md
  • packages/tomd/tests/fixtures/golden/p3556r0.golden.md
  • packages/tomd/tests/fixtures/golden/p3714r0.golden.md
  • packages/tomd/tests/fixtures/golden/p3911r2.golden.md
  • packages/tomd/tests/fixtures/golden/p3953r0.golden.md
  • packages/tomd/tests/fixtures/golden/p4005r0.golden.md
  • packages/tomd/tests/fixtures/golden/p4020r0.golden.md
  • packages/tomd/tests/test_emit.py
  • packages/tomd/tests/test_fallback.py
  • packages/tomd/tests/test_gold_standard.py
  • packages/tomd/tests/test_html_extract.py
  • packages/tomd/tests/test_html_render.py
  • packages/tomd/tests/test_paper_extract.py
  • packages/tomd/tests/test_pdf_wg21.py
  • packages/tomd/tests/test_qa.py
  • packages/tomd/tests/test_wg21.py
  • plans/QA-001-extend-qa-scoring.md
  • plans/QA-002-mpark-wording-support.md
  • plans/QA-003-tomd-deep-analysis.md
  • reports/QA-001-extend-qa-scoring.md
💤 Files with no reviewable changes (1)
  • packages/tomd/tests/fixtures/golden/p3911r2.golden.md

Comment on lines +64 to +102
def dedup_paragraphs(md: str) -> str:
    """Remove duplicate paragraphs from Markdown text.

    Two passes:
    1. Consecutive identical paragraphs are collapsed to one.
    2. Any paragraph longer than 40 chars appearing more than 3 times
       total is capped at 3 occurrences (keeps the first 3).

    Headings and code fences are never dropped.
    """
    blocks = md.split("\n\n")
    if len(blocks) <= 1:
        return md

    # Pass 1: consecutive dedup
    deduped: list[str] = [blocks[0]]
    for block in blocks[1:]:
        if block.strip() != deduped[-1].strip():
            deduped.append(block)

    # Pass 2: frequency cap for long paragraphs
    from collections import Counter
    counts: Counter[str] = Counter()
    result: list[str] = []
    for block in deduped:
        stripped = block.strip()
        is_heading = stripped.startswith("#")
        is_code = stripped.startswith("```")
        if is_heading or is_code or len(stripped) < _MIN_DEDUP_LENGTH:
            result.append(block)
            continue
        counts[stripped] += 1
        if counts[stripped] <= _MAX_PARAGRAPH_OCCURRENCES:
            result.append(block)

    removed = len(blocks) - len(result)
    if removed:
        _dedup_log.debug("Deduplication removed %d repeated paragraph(s)", removed)
    return "\n\n".join(result)

⚠️ Potential issue | 🟠 Major

Make deduplication fence-aware.

Splitting on "\n\n" breaks fenced code blocks that contain blank lines. Only the first chunk starts with the ``` fence, so the later chunks are treated as normal paragraphs and can be collapsed or frequency-capped. That can corrupt emitted code blocks.

Possible direction
-def dedup_paragraphs(md: str) -> str:
-    ...
-    blocks = md.split("\n\n")
+def dedup_paragraphs(md: str) -> str:
+    # Tokenize into fenced-code regions and non-code regions first.
+    # Only deduplicate paragraph blocks from non-code regions.
+    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/__init__.py` around lines 64 - 102, The current
dedup_paragraphs splits on "\n\n", which breaks fenced code blocks containing
blank lines; modify dedup_paragraphs to first tokenize the markdown into blocks
while preserving fenced code blocks by scanning lines and tracking an in_fence
boolean (toggle when encountering lines that start with "```" or other fence
markers), accumulating lines into the current block and allowing internal blank
lines when in_fence so a code fence becomes a single block; then feed that
blocks list into the existing Pass 1 and Pass 2 logic (keeping use of
_MIN_DEDUP_LENGTH, _MAX_PARAGRAPH_OCCURRENCES and the _dedup_log) and ensure
detection of code fences for the skip logic still works (block startswith "```"
or contains the opening fence).
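The fence-aware tokenization described in the prompt can be sketched as follows, assuming standard ``` fences (the helper name and fence handling are illustrative, not the actual tomd code):

```python
def split_blocks(md: str) -> list[str]:
    """Split markdown into blocks, keeping each fenced code block
    (including its internal blank lines) as a single block."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in md.split("\n"):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening/closing fence
            current.append(line)
            continue
        if not in_fence and line.strip() == "":
            if current:  # blank line outside a fence ends a block
                blocks.append("\n".join(current))
                current = []
            continue
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```

The existing Pass 1 / Pass 2 logic can then consume this list instead of `md.split("\n\n")`, and rejoin with "\n\n" as before.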

Comment on lines +257 to +262
if stripped.startswith("# ") and not stripped.startswith("## "):
    h1_text = stripped[2:].strip()
    title_clean = title.strip().strip('"').strip()
    if not title_clean or _titles_match(h1_text, title_clean):
        lines[i] = ""
        break

⚠️ Potential issue | 🟠 Major

Do not delete the first H1 when the front-matter title is missing.

Lines 257-262 currently drop any leading # heading if title is empty. That loses real content such as # Abstract. If the goal is to keep body headings at H2, only strip exact title duplicates; otherwise demote the heading to ## instead of removing it. As per coding guidelines, "Body headings in converted papers must start at H2. The front-matter title renders as H1; no # H1 in body".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/__init__.py` around lines 257 - 262, The code
currently removes any leading H1 when processing headings even if front-matter
title is missing; update the logic in the block that uses stripped, title,
title_clean and _titles_match so that you only delete the H1 when title_clean is
non-empty and _titles_match(h1_text, title_clean) is true, otherwise demote a
body H1 to H2 (replace lines[i] with "## " + h1_text) instead of setting it to
""; preserve the existing title_clean = title.strip().strip('"').strip() and
h1_text = stripped[2:].strip() calculations and only call _titles_match to
decide deletion when title_clean is truthy.
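The drop-or-demote rule suggested above can be sketched like this (the `_titles_match` stub is a stand-in for the real helper, which may normalize more aggressively):

```python
def _titles_match(a: str, b: str) -> bool:
    """Stand-in for the real matcher: case-insensitive exact match."""
    return a.strip().lower() == b.strip().lower()

def fix_leading_h1(lines: list[str], title: str) -> list[str]:
    """Drop the first body H1 only if it duplicates the front-matter
    title; otherwise demote it to H2 so real content survives."""
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("# ") and not stripped.startswith("## "):
            h1_text = stripped[2:].strip()
            title_clean = title.strip().strip('"').strip()
            if title_clean and _titles_match(h1_text, title_clean):
                lines[i] = ""  # exact duplicate of the title: remove
            else:
                lines[i] = "## " + h1_text  # real content: demote, keep
            break
    return lines
```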

Comment on lines +289 to +300
def strip_redundant_body_meta(md: str) -> str:
    """Remove body lines that duplicate YAML frontmatter metadata.

    Strips:
    - Standalone 'Document number: XXX' lines
    - Pipe tables immediately following front matter that contain only
      metadata fields (Document number, Date, Audience, Reply-to),
      followed by a --- HR separator
    """
    md = _REDUNDANT_META_RE.sub("", md)
    md = _strip_metadata_table(md)
    return md

⚠️ Potential issue | 🟠 Major

Limit redundant-metadata stripping to the post-front-matter preamble.

_REDUNDANT_META_RE.sub("", md) runs over the entire document, so any later body line that happens to start with Document number: will be deleted as well. This cleanup should be scoped to the small region immediately after front matter, similar to _strip_metadata_table(). Based on learnings "WG21 metadata block becomes YAML front matter."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/__init__.py` around lines 289 - 300, The call to
_REDUNDANT_META_RE.sub("", md) in strip_redundant_body_meta currently runs over
the whole document and can remove legitimate later occurrences; change it to
only operate on the small region immediately after the YAML front matter (the
same scope used by _strip_metadata_table). Locate the end of the front matter
(e.g., the closing --- or closing ```yaml front-matter marker) and apply
_REDUNDANT_META_RE.sub to the substring between that end and the start of the
body preamble (the same slice _strip_metadata_table inspects), then reassemble
md and call _strip_metadata_table as before. Ensure you reference
strip_redundant_body_meta, _REDUNDANT_META_RE and _strip_metadata_table when
making the change.

Comment on lines 56 to +63
    metadata = _extract.extract_metadata(soup, generator)
    if metadata and "document" not in metadata:
        from .. import DOC_NUM_RE
        stem_match = DOC_NUM_RE.search(path.stem)
        if stem_match:
            metadata["document"] = stem_match.group(1).upper()
    if metadata and "document" in metadata:
        _override_revision_from_filename(metadata, path)

⚠️ Potential issue | 🟠 Major

Apply the filename/title fallbacks even when metadata extraction returns {}.

Both fallbacks are gated on if metadata, so an empty dict skips them entirely. For generic or weakly detected HTML we could still infer document from the filename and title from the rendered body, yet we currently emit no front matter at all.

Possible fix
-    metadata = _extract.extract_metadata(soup, generator)
-    if metadata and "document" not in metadata:
+    metadata = _extract.extract_metadata(soup, generator) or {}
+    if "document" not in metadata:
         from .. import DOC_NUM_RE
         stem_match = DOC_NUM_RE.search(path.stem)
         if stem_match:
             metadata["document"] = stem_match.group(1).upper()
-    if metadata and "document" in metadata:
+    if "document" in metadata:
         _override_revision_from_filename(metadata, path)
@@
-    if metadata and "title" not in metadata:
+    if "title" not in metadata:
         h_match = re.search(r"^##\s+(.+)$", body_md, re.MULTILINE)
         if h_match:
             metadata["title"] = h_match.group(1).strip()

Based on learnings: WG21 metadata block becomes YAML front matter. Collapse multiple spaces, replace non-breaking spaces, normalize whitespace.

Also applies to: 72-75

Comment on lines +17 to +19
def _extract_mailto_email(href: str) -> str:
    """Normalize both ``mailto:`` and the invalid ``mailto://`` to a bare address."""
    return _MAILTO_PREFIX_RE.sub("", href)

⚠️ Potential issue | 🟠 Major

Strip mailto: query params before storing the address.

_extract_mailto_email() currently turns mailto:alice@example.com?subject=x into alice@example.com?subject=x. That leaks invalid addresses into front matter and breaks later email matching and deduplication.

Suggested fix
 def _extract_mailto_email(href: str) -> str:
     """Normalize both ``mailto:`` and the invalid ``mailto://`` to a bare address."""
-    return _MAILTO_PREFIX_RE.sub("", href)
+    address = _MAILTO_PREFIX_RE.sub("", href).split("?", 1)[0].split("#", 1)[0]
+    return address.strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/html/extract.py` around lines 17 - 19, The
_extract_mailto_email function currently only strips the mailto prefix and
leaves query parameters (e.g. ?subject=) in the returned string; update
_extract_mailto_email to remove the mailto/mailto:// prefix and then strip any
URL query part so it returns only the bare email address (stop at '?' or use a
URL parser), ensuring examples like "mailto:alice@example.com?subject=x" become
"alice@example.com" (refer to _extract_mailto_email and _MAILTO_PREFIX_RE to
find where to apply the change).

Comment on lines +48 to +53
    if stem_rev is not None and stem_rev != doc_rev:
        prefix = stem_m.group(1).upper()
        number = stem_m.group(2)
        metadata["document"] = f"{prefix}{number}R{stem_rev}"
        _log.debug("Overrode document revision from filename: %s -> %s",
                   f"{doc_m.group(0)}", metadata["document"])

⚠️ Potential issue | 🟠 Major

Only fix the revision suffix here, not the paper series.

If extraction yields N5034 and the filename is P5034R0..., this branch rewrites the document id to P5034R0. That changes the paper family, not just the revision, and produces wrong front matter.

Suggested fix
     stem_rev = stem_m.group(3)
     doc_rev = doc_m.group(3)
     if stem_rev is not None and stem_rev != doc_rev:
-        prefix = stem_m.group(1).upper()
+        if doc_m.group(1).upper() != stem_m.group(1).upper():
+            return
+        prefix = doc_m.group(1).upper()
         number = stem_m.group(2)
         metadata["document"] = f"{prefix}{number}R{stem_rev}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/__init__.py` around lines 48 - 53, The current
branch replaces the whole document id using stem_m (which can change the paper
series); instead, only replace the revision suffix: when stem_rev is not None
and stem_rev != doc_rev, build metadata["document"] from the original doc match
(use doc_m.group(1).upper() and doc_m.group(2)) and append "R{stem_rev}", and
update the _log.debug call to show the old doc id -> new doc id; referenced
symbols: stem_rev, doc_rev, stem_m, doc_m, metadata["document"], _log.debug.

Comment on lines +203 to +208
_NAMED_EMAIL_RE = re.compile(
    r"([A-Z][A-Za-z.''\- ]+?)\s*[<(](" + EMAIL_RE.pattern + r")[)>]"
)
_BARE_EMAIL_RE = re.compile(
    r"^\s*[<(]?(" + EMAIL_RE.pattern + r")[)>]?\s*$"
)

🛠️ Refactor suggestion | 🟠 Major

Hoist these metadata regexes to module scope.

_NAMED_EMAIL_RE, _BARE_EMAIL_RE, _TITLE_BOILERPLATE_RE, and _AUTHOR_BOILERPLATE_RE are metadata-field patterns, but they're compiled inside request-time code paths. The repo rule here is to define those regexes once at module level.

As per coding guidelines, "Regex patterns for section numbers, known section names, list markers, and metadata fields must be precompiled at module level and defined in one place."

Also applies to: 417-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/__init__.py` around lines 203 - 208, Move the
metadata regex compilations out of request-time code and define _NAMED_EMAIL_RE,
_BARE_EMAIL_RE, _TITLE_BOILERPLATE_RE, and _AUTHOR_BOILERPLATE_RE at module
scope so they are compiled once; find where those names are currently created
inline and replace those runtime re.compile calls with references to the
module-level compiled patterns, ensuring the same pattern strings are used
(e.g., reuse EMAIL_RE.pattern when composing _NAMED_EMAIL_RE/_BARE_EMAIL_RE) and
remove any duplicate compilations in the other occurrence blocks that currently
recreate these regexes.

Comment on lines 416 to +428
if not title_found:
    if (sec.font_size > body_size * _TITLE_SIZE_RATIO
            and not SECTION_NUM_RE.match(first_line)):
        metadata["title"] = first_line
    is_large = sec.font_size > body_size * _TITLE_SIZE_RATIO
    is_known = (first_line.lower().rstrip(":") in KNOWN_SECTIONS
                or first_line.lower() in ("contents", "table of contents"))
    is_section_num = bool(SECTION_NUM_RE.match(first_line))
    has_email = "@" in first_line
    is_date = bool(DATE_RE.match(first_line))
    too_long = len(first_line) > 120

    if is_large and (is_known or is_section_num):
        title_found = True
        structured.append(sec)
        continue

⚠️ Potential issue | 🟠 Major

Let large known/numbered sections fall through to heading classification.

Lines 425-428 append and continue before the normal heading path runs, so a large Abstract, References, or numbered clause stays a paragraph instead of becoming SectionKind.HEADING.

Suggested fix
-            if is_large and (is_known or is_section_num):
-                title_found = True
-                structured.append(sec)
-                continue
+            if is_large and (is_known or is_section_num):
+                title_found = True

Based on learnings: Known unnumbered sections (Abstract, Revision History, References, Acknowledgements, Motivation, Wording, Proposed Wording, Design Decisions) must be treated as top-level (##).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/structure.py` around lines 416 - 428, The
current block sets title_found and immediately appends sec to structured then
continues, preventing the later heading-classification code from running; change
it so that when is_large and (is_known or is_section_num) you only set
title_found = True (do not call structured.append(sec) or continue) so the
section falls through into the normal heading classification path (which will
assign SectionKind.HEADING); also ensure KNOWN_SECTIONS includes the listed
unnumbered titles (Abstract, Revision History, References, Acknowledgements,
Motivation, Wording, Proposed Wording, Design Decisions) and that
SECTION_NUM_RE/related checks remain unchanged.

Comment on lines +112 to +115
elif label_lower in ("audience", "subgroup", "target"):
    value = _clean(" ".join(value_lines))
    if value:
        metadata["audience"] = value

⚠️ Potential issue | 🟠 Major

Project: labels are parsed but then discarded.

_LABEL_RE now accepts Project, but this branch only normalizes audience, subgroup, and target. A paper that uses Project: will match the label and still lose the value. Please include "project" in this alias set, or normalize labels before dispatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/src/tomd/lib/pdf/wg21.py` around lines 112 - 115, The branch
handling label normalization currently maps label_lower in ("audience",
"subgroup", "target") to metadata["audience"], which causes labels like
"Project" (now accepted by _LABEL_RE) to be matched then discarded; update that
branch so it also includes "project" in the alias set (e.g., ("audience",
"subgroup", "target", "project")), or alternatively normalize label_lower
earlier before this dispatch; ensure you use the same helpers (value_lines,
_clean) and assign the cleaned value into metadata["audience"] (or perform
normalization to a separate key if intended).

Comment on lines +2 to +9
title: "Why Span Is Not Enough"
document: P4036R0
date: 2026-02-28
intent: info
audience: LEWG
reply-to:
- "Vinnie Falco <vinnie.falco@gmail.com>"
---

⚠️ Potential issue | 🟠 Major

Add the missing revision field to this canonical header.

This gold-standard fixture still encodes the old front-matter shape. For P4036R0, the canonical output now includes revision: 0, so leaving it out here will bless non-canonical markdown in future tests.

📌 Suggested fixture update
 ---
 title: "Why Span Is Not Enough"
 document: P4036R0
+revision: 0
 date: 2026-02-28
 intent: info
 audience: LEWG
 reply-to:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tomd/tests/fixtures/d4036-gold-standard.md` around lines 2 - 9, The
front-matter for the canonical fixture is missing the required revision field;
update the YAML header in the fixture for document "P4036R0" by adding revision:
0 so the canonical output matches the new shape—ensure the top-matter now
contains title, document, date, intent, audience, reply-to and revision: 0 to
prevent non-canonical markdown from being blessed in tests.
