
feat(parser): two-phase block scanner (Phase 1) #33

Closed
luca-chen198 wants to merge 13 commits into main from phase-1-block-scanner

@luca-chen198
Collaborator

Summary

Replaces the single-pass regex tokenizer with a two-phase parser: a new BlockScanner walks the source line-by-line and emits typed BlockSpans for block-level constructs (paragraph, ATX heading, fenced code, thematic break, link reference definition). The existing inline-regex pass then runs only over the content of blocks that allow inline formatting, and any inline match landing inside a non-inline-allowing block is dropped at the end.

This is largely foundation work: the user-visible surface barely changes today. The motivation is to unblock Phase 2 (lists, blockquotes, tables, footnotes as proper container blocks) and to fix a class of overlap bugs structurally rather than by accumulating ad hoc heuristic checks.

What changes

Parser

  • New `Sources/MarkdownEngine/Parser/BlockScanner.swift`: line-walker classifying paragraph / ATX heading / fenced code / thematic break / link reference definition.
  • New `Sources/MarkdownEngine/Parser/BlockSpan.swift`: typed `BlockKind` enum with Phase-1 cases implemented and Phase-2 cases forward-declared (`.blockquote`, `.table`, `.tableRow`, `.tableCell`, `.list`, `.listItem`, `.footnoteDefinition`, `.definitionList`, `.htmlBlock`).
  • New `Sources/MarkdownEngine/Parser/BlockVisitor.swift`: depth-first visitor protocol for block trees (the default `walk(_:)` already recurses into children, so Phase-2 nested containers work without changes to callers).
  • `MarkdownTokenizer.parseTokens` now calls `BlockScanner.scan` first, runs the existing inline regexes, then filters inline tokens by block precedence: anything matching inside a fenced code block, thematic break, or link reference definition is dropped.
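
To make the block-precedence filter concrete, here is a minimal sketch of the idea. The type shapes, the `allowsInline` property, and the function name `filterByBlockPrecedence` are assumptions made for illustration; the PR's actual `BlockSpan`, `BlockKind`, and `parseTokens` internals may differ.

```swift
// Hypothetical, simplified stand-ins for the PR's types.
// The real definitions in BlockSpan.swift may look different.
enum BlockKind {
    case paragraph
    case atxHeading(level: Int)
    case fencedCode
    case thematicBreak
    case linkReferenceDefinition

    /// Whether inline formatting may appear inside this kind of block.
    var allowsInline: Bool {
        switch self {
        case .paragraph, .atxHeading:
            return true
        case .fencedCode, .thematicBreak, .linkReferenceDefinition:
            return false
        }
    }
}

struct BlockSpan {
    let kind: BlockKind
    let range: Range<Int>   // offsets into the source text
}

struct InlineToken {
    let name: String
    let range: Range<Int>   // offsets into the source text
}

/// Drops every inline token that overlaps a block which forbids inline formatting.
func filterByBlockPrecedence(_ tokens: [InlineToken], blocks: [BlockSpan]) -> [InlineToken] {
    let opaqueBlocks = blocks.filter { !$0.kind.allowsInline }
    return tokens.filter { token in
        !opaqueBlocks.contains { $0.range.overlaps(token.range) }
    }
}
```

With a paragraph at offsets `0..<10` and a fenced code block at `10..<30`, a LaTeX match at `14..<21` would be dropped while an emphasis match at `2..<6` would survive. This sketch scans every opaque block per token; a real implementation could exploit the fact that block spans are sorted.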

What's user-visible

  • LaTeX, wiki-link, image-embed, inline code, and emphasis regex matches inside fenced code blocks are now correctly suppressed (previously they could leak through under specific patterns — `$$x^2$$` or `[[link]]` inside triple-backtick fences could render as real LaTeX/links).
  • ATX heading marker space (the space between `#` and the title) now collapses cleanly when the heading is inactive. The gap was a regression introduced by an earlier Phase-1 commit and is fixed in `d1d8c02`.
  • No new public API surface.
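
As a rough illustration of the heading-marker fix, the sketch below computes two marker ranges for an ATX heading line: the hashes first, then the whitespace that follows them. The first range's length equals the heading level, matching the `markerRanges[0].length == level` invariant mentioned in the fix's commit message. The function name `atxMarkerRanges` and its return shape are hypothetical, and the sketch ignores leading indentation and closing `#` sequences.

```swift
/// Hypothetical helper: returns the heading level and marker ranges for an
/// ATX heading line, or nil if the line is not a heading.
/// Ranges are character offsets within `line`.
func atxMarkerRanges(in line: String) -> (level: Int, markerRanges: [Range<Int>])? {
    let chars = Array(line)

    // Count the leading run of '#'.
    var hashEnd = 0
    while hashEnd < chars.count, chars[hashEnd] == "#" { hashEnd += 1 }
    let level = hashEnd
    guard (1...6).contains(level) else { return nil }

    // Consume the whitespace between the hashes and the title.
    var spaceEnd = hashEnd
    while spaceEnd < chars.count, chars[spaceEnd] == " " || chars[spaceEnd] == "\t" {
        spaceEnd += 1
    }

    // "#Title" (no space before content) is not a heading.
    if hashEnd < chars.count && spaceEnd == hashEnd { return nil }

    // The first marker is always the hashes, so its length equals the level.
    var ranges = [0..<level]
    // The whitespace is appended as a secondary marker so it can shrink too.
    if spaceEnd > hashEnd { ranges.append(hashEnd..<spaceEnd) }
    return (level, ranges)
}
```

For `## Title` this yields level 2 with marker ranges `0..<2` and `2..<3`, so a marker-shrink pass that collapses every marker range would hide the space together with the hashes.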

What's deliberately NOT in this PR (Phase 2)

The following `BlockKind` cases are forward-declared but `BlockScanner` does not yet emit them — anyone building Phase 2 can flip them on without an API break:

  • `blockquote` — `> quoted text`
  • `list(ordered:)` / `listItem(indentColumns:)` — bullet / numbered lists as container blocks (task checkboxes still work via the separate regex pass in the styler)
  • `table` / `tableRow` / `tableCell(alignment:)` — pipe tables
  • `footnoteDefinition(label:)` — `[^1]: footnote text`
  • `definitionList`, `htmlBlock`

Thematic breaks and link reference definitions ARE detected by the block scanner but have no styler / no resolver yet — they render as their literal source text, identical to behaviour on `main`. The detection itself is groundwork for Phase 2.
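
The forward-declared container kinds are meant to nest, which is why the visitor's default traversal matters. The following is a minimal sketch of a depth-first visitor with a default `walk(_:)`; the `Block` struct, the `BlockVisiting` protocol name, and the `NameCollector` example are assumptions for illustration and are not taken from `BlockVisitor.swift`.

```swift
// Hypothetical, simplified block tree node; a string stands in for a typed kind.
struct Block {
    let name: String
    var children: [Block] = []
}

protocol BlockVisiting {
    /// Called once per block, parents before their children.
    mutating func visit(_ block: Block)
}

extension BlockVisiting {
    /// Default depth-first, pre-order traversal. Conforming types only
    /// implement `visit(_:)` and get recursion into children for free.
    mutating func walk(_ block: Block) {
        visit(block)
        for child in block.children {
            walk(child)
        }
    }
}

/// Example visitor that records block names in visit order.
struct NameCollector: BlockVisiting {
    var names: [String] = []

    mutating func visit(_ block: Block) {
        names.append(block.name)
    }
}
```

Because recursion lives in the protocol extension, a caller that walks a flat Phase-1 tree today would also reach nested list items or blockquote children later without being rewritten, which is the property the PR description claims for `walk(_:)`.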

What was tried then removed during review

Setext headings (`Title\n====` / `Title\n----`) were initially supported then removed. CommonMark §4.3 requires the underline to absorb all preceding paragraph lines into the heading, which is unintuitive for casual notes: a user typing `---` as a visual separator unexpectedly promotes the prior N lines into an H2. Bear, Apple Notes and Notion don't ship Setext for the same UX reason. ATX (`# Title`) covers the same case unambiguously.
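
The behaviour after the Setext removal can be sketched as a line walker with no underline lookahead. This is a simplified illustration under stated assumptions, not the PR's `BlockScanner`: `SimpleBlock`, `isThematicBreak`, and `scanSimpleBlocks` are hypothetical names, only paragraphs and thematic breaks are modelled, and the thematic-break check ignores the CommonMark indentation limit.

```swift
enum SimpleBlock: Equatable {
    case paragraph([String])
    case thematicBreak
}

/// Simplified check: three or more of the same `-`, `*`, or `_`,
/// optionally separated by spaces or tabs, and nothing else on the line.
func isThematicBreak(_ line: String) -> Bool {
    let marks = line.filter { $0 != " " && $0 != "\t" }
    guard marks.count >= 3, let first = marks.first, "-*_".contains(first) else {
        return false
    }
    return marks.allSatisfy { $0 == first }
}

/// Walks the source line by line without any Setext lookahead.
func scanSimpleBlocks(_ source: String) -> [SimpleBlock] {
    var blocks: [SimpleBlock] = []
    var paragraphBuffer: [String] = []

    func flushParagraph() {
        if !paragraphBuffer.isEmpty {
            blocks.append(.paragraph(paragraphBuffer))
            paragraphBuffer.removeAll()
        }
    }

    let lines = source.split(separator: "\n", omittingEmptySubsequences: false).map(String.init)
    for line in lines {
        if isThematicBreak(line) {
            flushParagraph()               // the break interrupts the paragraph
            blocks.append(.thematicBreak)
        } else if line.allSatisfy({ $0 == " " || $0 == "\t" }) {
            flushParagraph()               // a blank line ends the paragraph
        } else {
            paragraphBuffer.append(line)   // "====" is ordinary paragraph text
        }
    }
    flushParagraph()
    return blocks
}
```

Under this sketch, `Title\n---` scans as a paragraph followed by a thematic break, and `Title\n===` stays a single two-line paragraph, so a `---` typed as a visual separator never promotes the preceding lines to a heading.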

Trade-offs (honest)

  • ~1500 net lines of new code for foundation that is mostly invisible at the user level today.
  • The visible payoff is small unless Phase 2 follows — primarily the fenced-code-overlap bug class and the heading-marker cleanup.
  • Test coverage was added during development and removed at the end of the review pass. The pre-existing public-API contract suite (`MarkdownEngineDecouplingTests`, 10 tests) stays. Phase 2 work should re-add tests for whatever it builds on top.

Test plan

  • `swift build` clean.
  • `swift test` — 10 tests in `MarkdownEngineDecouplingTests` pass.
  • Manual in an editor:
    • `# Heading` — confirm the space between `#` and the title shrinks together with the `#` when the cursor leaves the heading.
    • Paste a fenced code block containing `$$x^2$$` and `[[fake link]]` — both should render as plain code (no LaTeX render, no wiki-link styling).
    • `Title\n----` — confirm this is paragraph + literal `----`, not an H2 (Setext is deliberately disabled).
    • All existing markdown features (bold, italic, wiki links, images, LaTeX in paragraphs, task checkboxes, ATX headings of all levels, fenced code with language tag) render exactly as before.

🤖 Generated with Claude Code

luca-chen198 and others added 13 commits May 14, 2026 20:21
The Setext underline lookahead (`Title\n====` / `Title\n----` rewriting
the buffered paragraph into a heading) absorbs all preceding paragraph
lines per CommonMark §4.3 — which is unintuitive for casual notes: a
user typing `---` as a visual separator unexpectedly promoted the prior
N lines into a single H2.

Bear, Apple Notes, and Notion don't support Setext for the same UX
reason. ATX (`# Title`) covers the same use case unambiguously.

Removed:
- `setextUnderlineLevel` (entire function)
- `rewriteBufferAsHeading` (entire function — only Setext used it)
- The Setext lookahead in `BlockScanner.scan`
- The `paragraphBuffer.isEmpty` gate on thematic-break detection
  (since Setext is gone, thematic breaks can now interrupt
  paragraphs per CommonMark §4.1 — buffer is flushed first)
- 4 Setext-specific tests (`setextH1WithEqualsUnderline`,
  `setextH2WithDashUnderline`, `setextSpansMultipleParagraphLines`,
  `dashUnderlineAfterParagraphPrefersSetext`)
- `dashesAloneWithoutParagraphAreNotConsumedAsHeading` (redundant
  with `thematicBreakWithDashes`)

Added:
- `dashUnderlineAfterParagraphInterruptsAsThematicBreak` — verifies
  that `Title\n---` is now paragraph + thematic break
- `equalsUnderlineAfterParagraphIsNotAHeading` — verifies `Title\n===`
  stays a single multi-line paragraph

64 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pre-Phase-1 regex-based heading parser stored the leading space
between `#` and content as `markerRanges[1]`, so the marker-shrink pass
collapsed it together with the hashes when the heading wasn't active.
The new BlockScanner.atxHeading only emitted the hashes, leaving the
space at full width — visible as a small gap before the heading text
once the cursor moved away.

Recapture the whitespace range and append it as a secondary marker,
keeping the existing `markerRanges[0].length == level` invariant the
stylers rely on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops BlockScannerTests, BlockSpanTests, BlockVisitorTests, and
ParseTokensGoldenTests. The pre-Phase-1 MarkdownEngineDecouplingTests
public-API contract suite stays — Phase 1 didn't add API surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@luca-chen198 luca-chen198 linked an issue May 18, 2026 that may be closed by this pull request

Development

Successfully merging this pull request may close these issues.

Phase 1: Block-phase parser foundation

1 participant