feat(parser): two-phase block scanner (Phase 1) by luca-chen198 · Pull Request #33 · nodes-app/swift-markdown-engine

luca-chen198 · 2026-05-18T14:06:35Z

Summary

Replaces the single-pass regex tokenizer with a two-phase parser. A new `BlockScanner` walks the source line by line and emits typed `BlockSpan`s for block-level constructs (paragraph, ATX heading, fenced code, thematic break, link reference definition). The existing inline-regex pass then applies only to the content of blocks that allow inline formatting: any inline match that lands inside a non-inline block is dropped in a final filtering step.

This is largely foundation work, and the user-visible surface barely changes today. The motivation is to unblock Phase 2 (lists, blockquotes, tables, footnotes as proper container blocks) and to fix a class of overlap bugs structurally rather than by scattering heuristic checks across the tokenizer.
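
To make the line-by-line block phase concrete, here is a minimal sketch of such a scanner. It is not the PR's `BlockScanner`: the names `SketchBlockKind`, `SketchBlockSpan`, `atxLevel`, and `scanBlocks` are hypothetical, spans are tracked as line indices rather than character ranges, only backtick fences are recognised, and thematic breaks and link reference definitions are left out.

```swift
// Hypothetical sketch: names and shapes are illustrative, not the PR's actual types.
enum SketchBlockKind: Equatable {
    case paragraph
    case atxHeading(level: Int)
    case fencedCode
}

struct SketchBlockSpan: Equatable {
    let kind: SketchBlockKind
    let lines: ClosedRange<Int>   // zero-based line indices covered by the block
}

// Returns the heading level if the line starts with 1 to 6 hashes
// followed by a space or the end of the line.
func atxLevel(_ line: String) -> Int? {
    let hashes = line.prefix(while: { $0 == "#" }).count
    guard (1...6).contains(hashes) else { return nil }
    let rest = line.dropFirst(hashes)
    return (rest.isEmpty || rest.first == " ") ? hashes : nil
}

func scanBlocks(_ source: String) -> [SketchBlockSpan] {
    let lines = source.split(separator: "\n", omittingEmptySubsequences: false).map(String.init)
    var spans: [SketchBlockSpan] = []
    var paragraphStart: Int? = nil
    var index = 0

    // Closes the open paragraph (if any) so that it ends on the line before `end`.
    func flushParagraph(endingBefore end: Int) {
        if let start = paragraphStart {
            spans.append(SketchBlockSpan(kind: .paragraph, lines: start...(end - 1)))
            paragraphStart = nil
        }
    }

    while index < lines.count {
        let line = lines[index]
        if line.hasPrefix("```") {
            flushParagraph(endingBefore: index)
            // Consume up to the closing fence, or to the end of input if unclosed.
            var end = index + 1
            while end < lines.count && !lines[end].hasPrefix("```") { end += 1 }
            let last = min(end, lines.count - 1)
            spans.append(SketchBlockSpan(kind: .fencedCode, lines: index...last))
            index = last + 1
        } else if let level = atxLevel(line) {
            flushParagraph(endingBefore: index)
            spans.append(SketchBlockSpan(kind: .atxHeading(level: level), lines: index...index))
            index += 1
        } else if line.allSatisfy({ $0 == " " || $0 == "\t" }) {
            // A blank line ends the current paragraph.
            flushParagraph(endingBefore: index)
            index += 1
        } else {
            if paragraphStart == nil { paragraphStart = index }
            index += 1
        }
    }
    flushParagraph(endingBefore: lines.count)
    return spans
}
```

The point of the shape is that every block decision is made once, per line, before any inline regex runs, so later stages can consult the spans instead of re-deriving block context.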

What changes

Parser

New `Sources/MarkdownEngine/Parser/BlockScanner.swift`: line-walker classifying paragraph / ATX heading / fenced code / thematic break / link reference definition.
New `Sources/MarkdownEngine/Parser/BlockSpan.swift`: typed `BlockKind` enum with Phase-1 cases implemented and Phase-2 cases forward-declared (`.blockquote`, `.table`, `.tableRow`, `.tableCell`, `.list`, `.listItem`, `.footnoteDefinition`, `.definitionList`, `.htmlBlock`).
New `Sources/MarkdownEngine/Parser/BlockVisitor.swift`: depth-first visitor protocol for block trees (the default `walk(_:)` already recurses into children, so Phase-2 nested containers work without changes to callers).
`MarkdownTokenizer.parseTokens` now calls `BlockScanner.scan` first, runs the existing inline regexes, then filters inline tokens by block precedence: anything matching inside a fenced code block, thematic break, or link reference definition is dropped.
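
As an illustration of the visitor idea, the following sketch shows a protocol whose default `walk(_:)` performs a depth-first, pre-order traversal. The types `SketchBlock`, `SketchBlockVisitor`, and `NameCollector` are hypothetical stand-ins; the real `BlockVisitor` may use different names and signatures.

```swift
// Hypothetical sketch: the PR's BlockVisitor may differ in names and signatures.
struct SketchBlock {
    let name: String
    var children: [SketchBlock] = []
}

protocol SketchBlockVisitor {
    mutating func visit(_ block: SketchBlock)
}

extension SketchBlockVisitor {
    // Default depth-first, pre-order traversal: visit the block, then recurse into its children.
    mutating func walk(_ block: SketchBlock) {
        visit(block)
        for child in block.children {
            walk(child)
        }
    }
}

// A conforming visitor only implements `visit`; nesting is handled by the default `walk`.
struct NameCollector: SketchBlockVisitor {
    var names: [String] = []
    mutating func visit(_ block: SketchBlock) {
        names.append(block.name)
    }
}

let sketchTree = SketchBlock(name: "list", children: [
    SketchBlock(name: "listItem", children: [SketchBlock(name: "paragraph")]),
    SketchBlock(name: "listItem"),
])
var collector = NameCollector()
collector.walk(sketchTree)
```

Because recursion lives in the protocol extension, a visitor written against flat Phase-1 output would keep working once container blocks start carrying children.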

What's user-visible

LaTeX, wiki-link, image-embed, inline code, and emphasis regex matches inside fenced code blocks are now correctly suppressed (previously they could leak through under specific patterns — `$$x^2$$` or `[[link]]` inside triple-backtick fences could render as real LaTeX/links).
ATX heading marker space (the space between `#` and the title) now collapses cleanly when the heading is inactive. This was a regression introduced by an earlier Phase-1 commit and fixed in `d1d8c02`.
No new public API surface.
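
The fenced-code suppression above comes down to filtering inline tokens against block spans. A minimal sketch, assuming tokens and blocks both carry integer offset ranges and that any overlap with a non-inline block is enough to drop a token (the PR may use a stricter containment test). `SketchInlineToken`, `SketchBlockRegion`, and `filterInlineTokens` are hypothetical names.

```swift
// Hypothetical sketch: illustrative types, not the PR's token or span model.
struct SketchInlineToken: Equatable {
    let name: String
    let range: Range<Int>   // offsets into the source text
}

struct SketchBlockRegion {
    let range: Range<Int>
    let allowsInline: Bool  // false for fenced code, thematic breaks, link reference definitions
}

// Keeps only the tokens that do not overlap any block that forbids inline formatting.
func filterInlineTokens(_ tokens: [SketchInlineToken],
                        blocks: [SketchBlockRegion]) -> [SketchInlineToken] {
    let opaque = blocks.filter { !$0.allowsInline }
    return tokens.filter { token in
        !opaque.contains { $0.range.overlaps(token.range) }
    }
}
```

With a paragraph at `0..<20` and a fenced block at `20..<50`, a wiki-link token at `5..<13` survives while a LaTeX token at `25..<32` is dropped.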

What's deliberately NOT in this PR (Phase 2)

The following `BlockKind` cases are forward-declared but `BlockScanner` does not yet emit them — anyone building Phase 2 can flip them on without an API break:

`blockquote` — `> quoted text`
`list(ordered:)` / `listItem(indentColumns:)` — bullet / numbered lists as container blocks (task checkboxes still work via the separate regex pass in the styler)
`table` / `tableRow` / `tableCell(alignment:)` — pipe tables
`footnoteDefinition(label:)` — `[^1]: footnote text`
`definitionList`, `htmlBlock`

Thematic breaks and link reference definitions ARE detected by the block scanner but have no styler / no resolver yet — they render as their literal source text, identical to behaviour on `main`. The detection itself is groundwork for Phase 2.

What was tried then removed during review

Setext headings (`Title\n====` / `Title\n----`) were initially supported then removed. CommonMark §4.3 requires the underline to absorb all preceding paragraph lines into the heading, which is unintuitive for casual notes: a user typing `---` as a visual separator unexpectedly promotes the prior N lines into an H2. Bear, Apple Notes and Notion don't ship Setext for the same UX reason. ATX (`# Title`) covers the same case unambiguously.
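
With Setext gone, a line of dashes is only ever a thematic-break candidate. The line test from CommonMark §4.1 can be sketched as below; `isThematicBreak` is a hypothetical helper written from the spec's wording, not the function in this PR.

```swift
// Hypothetical sketch of the CommonMark §4.1 rule: up to three spaces of indentation,
// then three or more matching `-`, `_`, or `*` characters, optionally separated
// by spaces or tabs, and nothing else on the line.
func isThematicBreak(_ line: String) -> Bool {
    let indent = line.prefix(while: { $0 == " " }).count
    guard indent <= 3 else { return false }   // four or more spaces is indented code
    let body = line.dropFirst(indent)
    guard let marker = body.first, "-_*".contains(marker) else { return false }
    var count = 0
    for character in body {
        if character == marker {
            count += 1
        } else if character != " " && character != "\t" {
            return false                       // any other character disqualifies the line
        }
    }
    return count >= 3
}
```

Note that `====` never qualifies, which matches the behaviour described above for `Title\n====` once the Setext lookahead is removed.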

Trade-offs (honest)

~1500 net lines of new code for foundation that is mostly invisible at the user level today.
The visible payoff is small unless Phase 2 follows — primarily the fenced-code-overlap bug class and the heading-marker cleanup.
Test coverage was added during development and removed at the end of the review pass. The pre-existing public-API contract suite (`MarkdownEngineDecouplingTests`, 10 tests) stays. Phase 2 work should re-add tests for whatever it builds on top.

Test plan

`swift build` clean.
`swift test` — 10 tests in `MarkdownEngineDecouplingTests` pass.
Manual in an editor:
- `# Heading` — confirm the space between `#` and the title shrinks together with the `#` when the cursor leaves the heading.
- Paste a fenced code block containing `$$x^2$$` and `[[fake link]]` — both should render as plain code (no LaTeX render, no wiki-link styling).
- `Title\n----` — confirm this is paragraph + literal `----`, not an H2 (Setext is deliberately disabled).
- All existing markdown features (bold, italic, wiki links, images, LaTeX in paragraphs, task checkboxes, ATX headings of all levels, fenced code with language tag) render exactly as before.

🤖 Generated with Claude Code

The Setext underline lookahead (`Title\n====` / `Title\n----` rewriting the buffered paragraph into a heading) absorbs all preceding paragraph lines per CommonMark §4.3, which is unintuitive for casual notes: a user typing `---` as a visual separator unexpectedly promoted the prior N lines into a single H2. Bear, Apple Notes, and Notion don't support Setext for the same UX reason. ATX (`# Title`) covers the same use case unambiguously.

Removed:
- `setextUnderlineLevel` (entire function)
- `rewriteBufferAsHeading` (entire function; only Setext used it)
- The Setext lookahead in `BlockScanner.scan`
- The `paragraphBuffer.isEmpty` gate on thematic-break detection (since Setext is gone, thematic breaks can now interrupt paragraphs per CommonMark §4.1; the buffer is flushed first)
- 4 Setext-specific tests (`setextH1WithEqualsUnderline`, `setextH2WithDashUnderline`, `setextSpansMultipleParagraphLines`, `dashUnderlineAfterParagraphPrefersSetext`)
- `dashesAloneWithoutParagraphAreNotConsumedAsHeading` (redundant with `thematicBreakWithDashes`)

Added:
- `dashUnderlineAfterParagraphInterruptsAsThematicBreak`: verifies that `Title\n---` is now paragraph + thematic break
- `equalsUnderlineAfterParagraphIsNotAHeading`: verifies `Title\n===` stays a single multi-line paragraph

64 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pre-Phase-1 regex-based heading parser stored the leading space between `#` and content as `markerRanges[1]`, so the marker-shrink pass collapsed it together with the hashes when the heading wasn't active. The new `BlockScanner.atxHeading` only emitted the hashes, leaving the space at full width, visible as a small gap before the heading text once the cursor moved away. Recapture the whitespace range and append it as a secondary marker, keeping the existing `markerRanges[0].length == level` invariant the stylers rely on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
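
The marker recapture described in this commit could look roughly like the sketch below. `atxMarkerRanges(in:)` is a hypothetical helper, the ranges are relative to the start of a single line, and the use of `NSRange` is an assumption based on the `.length` property mentioned in the invariant.

```swift
import Foundation

// Hypothetical sketch: computes marker ranges for one ATX heading line.
// markerRanges[0] covers the hashes, so its length equals the heading level.
// markerRanges[1], when present, covers the whitespace between the hashes and the title.
func atxMarkerRanges(in line: String) -> (level: Int, markerRanges: [NSRange])? {
    let level = line.prefix(while: { $0 == "#" }).count
    guard (1...6).contains(level) else { return nil }

    let afterHashes = line.dropFirst(level)
    let gap = afterHashes.prefix(while: { $0 == " " || $0 == "\t" }).count
    // A heading with content needs whitespace after the hashes; `#Title` is not a heading.
    guard gap > 0 || afterHashes.isEmpty else { return nil }

    // `#`, space, and tab are each one UTF-16 unit, so character counts match NSRange lengths here.
    var ranges = [NSRange(location: 0, length: level)]
    if gap > 0 {
        ranges.append(NSRange(location: level, length: gap))
    }
    return (level: level, markerRanges: ranges)
}
```

Keeping the whitespace as a separate, second range means code that reads `markerRanges[0].length` to recover the level is unaffected, while a shrink pass that iterates over all marker ranges collapses the gap too.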

Drops `BlockScannerTests`, `BlockSpanTests`, `BlockVisitorTests`, and `ParseTokensGoldenTests`. The pre-Phase-1 `MarkdownEngineDecouplingTests` public-API contract suite stays, since Phase 1 didn't add API surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

luca-chen198 and others added 13 commits May 14, 2026 20:21

test: add golden snapshot tests for current parseTokens behavior

0787246

feat(parser): add BlockSpan / BlockKind / LinkReference data model

aedb756

feat(parser): BlockScanner skeleton with paragraph + ATX heading support

edf2e56

feat(parser): BlockScanner fenced code block support

e8aed47

refactor(parser): symmetric fence marker ranges, drop redundant fence…

450f225

…Length

feat(parser): BlockScanner Setext heading lookahead

70243b8

feat(parser): BlockScanner thematic break + link reference definitions

ac367ac

feat(parser): two-phase pipeline — block scanner drives parseTokens; …

f3ba617

…inline tokens filtered by block precedence

feat(parser): BlockVisitor protocol with default depth-first walk

6690ebc

fix(parser): preserve extractLanguage + Setext heading level via mark…

ca60b87

…er semantics

luca-chen198 linked an issue May 18, 2026 that may be closed by this pull request

Phase 1: Block-phase parser foundation #24

Open

luca-chen198 closed this May 20, 2026

luca-chen198 wants to merge 13 commits into main from phase-1-block-scanner
