feat(parser): two-phase block scanner (Phase 1) #33
Closed
luca-chen198 wants to merge 13 commits into
…inline tokens filtered by block precedence
The Setext underline lookahead (`Title\n====` / `Title\n----` rewriting the buffered paragraph into a heading) absorbs all preceding paragraph lines per CommonMark §4.3, which is unintuitive for casual notes: a user typing `---` as a visual separator unexpectedly promoted the prior N lines into a single H2. Bear, Apple Notes, and Notion don't support Setext for the same UX reason. ATX (`# Title`) covers the same use case unambiguously.

Removed:
- `setextUnderlineLevel` (entire function)
- `rewriteBufferAsHeading` (entire function; only Setext used it)
- The Setext lookahead in `BlockScanner.scan`
- The `paragraphBuffer.isEmpty` gate on thematic-break detection (since Setext is gone, thematic breaks can now interrupt paragraphs per CommonMark §4.1; the buffer is flushed first)
- 4 Setext-specific tests (`setextH1WithEqualsUnderline`, `setextH2WithDashUnderline`, `setextSpansMultipleParagraphLines`, `dashUnderlineAfterParagraphPrefersSetext`)
- `dashesAloneWithoutParagraphAreNotConsumedAsHeading` (redundant with `thematicBreakWithDashes`)

Added:
- `dashUnderlineAfterParagraphInterruptsAsThematicBreak`: verifies that `Title\n---` is now paragraph + thematic break
- `equalsUnderlineAfterParagraphIsNotAHeading`: verifies `Title\n===` stays a single multi-line paragraph

64 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
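The behaviour after this change can be illustrated with a minimal sketch. It is not the PR's `BlockScanner`: `SketchBlock`, `isThematicBreak`, and `scanSketch` are hypothetical names, and the sketch only distinguishes paragraphs from thematic breaks. It assumes a simplified reading of the CommonMark thematic-break rule (up to three leading spaces, then three or more identical `-`, `_`, or `*` markers).

```swift
// Hypothetical sketch, not the PR's BlockScanner: paragraphs and thematic breaks only.
enum SketchBlock: Equatable {
    case paragraph([String])
    case thematicBreak
}

/// Simplified thematic-break check: at most three leading spaces, then three or
/// more of the same marker (-, _ or *), optionally separated by spaces or tabs.
func isThematicBreak(_ line: String) -> Bool {
    let leading = line.prefix(while: { $0 == " " || $0 == "\t" })
    guard !leading.contains("\t"), leading.count <= 3 else { return false }
    let body = line.filter { $0 != " " && $0 != "\t" }
    guard let marker = body.first, "-_*".contains(marker) else { return false }
    return body.count >= 3 && body.allSatisfy { $0 == marker }
}

func scanSketch(_ source: String) -> [SketchBlock] {
    var blocks: [SketchBlock] = []
    var paragraphBuffer: [String] = []

    // Emit the buffered paragraph lines, if any, as one paragraph block.
    func flushParagraph() {
        if !paragraphBuffer.isEmpty {
            blocks.append(.paragraph(paragraphBuffer))
            paragraphBuffer.removeAll()
        }
    }

    let lines = source.split(separator: "\n", omittingEmptySubsequences: false).map(String.init)
    for line in lines {
        if isThematicBreak(line) {
            // No Setext lookahead: the break interrupts the paragraph,
            // so the buffer is flushed first and left untouched.
            flushParagraph()
            blocks.append(.thematicBreak)
        } else if line.allSatisfy({ $0 == " " || $0 == "\t" }) {
            // A blank line ends the current paragraph.
            flushParagraph()
        } else {
            paragraphBuffer.append(line)
        }
    }
    flushParagraph()
    return blocks
}
```

Under these assumptions, `scanSketch("Title\n---")` yields a paragraph followed by a thematic break, while `scanSketch("Title\n===")` yields one two-line paragraph, which mirrors what the two added tests are described as verifying.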
The pre-Phase-1 regex-based heading parser stored the leading space between `#` and content as `markerRanges[1]`, so the marker-shrink pass collapsed it together with the hashes when the heading wasn't active. The new `BlockScanner.atxHeading` only emitted the hashes, leaving the space at full width, visible as a small gap before the heading text once the cursor moved away. Recapture the whitespace range and append it as a secondary marker, keeping the existing `markerRanges[0].length == level` invariant the stylers rely on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
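A rough sketch of the marker layout described in this commit, under stated assumptions: `AtxHeadingSketch` and `atxHeadingSketch` are hypothetical names, ranges are UTF-16 offsets expressed as `NSRange`, and closing hash sequences are ignored. It is meant only to show the hashes as `markerRanges[0]` and the following whitespace as a secondary marker, not to reproduce `BlockScanner.atxHeading`.

```swift
import Foundation

// Hypothetical sketch of the marker layout; not the PR's actual types.
struct AtxHeadingSketch {
    let level: Int
    /// markerRanges[0] covers the hashes; markerRanges[1], when present,
    /// covers the whitespace between the hashes and the heading text.
    let markerRanges: [NSRange]
}

func atxHeadingSketch(_ line: String, lineStart: Int = 0) -> AtxHeadingSketch? {
    let units = Array(line.utf16)
    var index = 0

    // Allow up to three spaces of indentation.
    while index < units.count, index < 3, units[index] == 0x20 { index += 1 }

    // Count the run of '#' characters (0x23).
    let hashStart = index
    while index < units.count, units[index] == 0x23 { index += 1 }
    let level = index - hashStart
    guard (1...6).contains(level) else { return nil }

    // Capture the spaces or tabs that follow the hashes.
    let spaceStart = index
    while index < units.count, units[index] == 0x20 || units[index] == 0x09 { index += 1 }
    let spaceLength = index - spaceStart

    // The hashes must be followed by whitespace or by the end of the line.
    guard spaceLength > 0 || index == units.count else { return nil }

    // Invariant kept from the description above: markerRanges[0].length == level.
    var ranges = [NSRange(location: lineStart + hashStart, length: level)]
    if spaceLength > 0 {
        ranges.append(NSRange(location: lineStart + spaceStart, length: spaceLength))
    }
    return AtxHeadingSketch(level: level, markerRanges: ranges)
}
```

For `"## Title"` this sketch returns level 2 with one range over the two hashes and a second range over the single space, so a styler that shrinks every marker range would collapse both.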
Drops `BlockScannerTests`, `BlockSpanTests`, `BlockVisitorTests`, and `ParseTokensGoldenTests`. The pre-Phase-1 `MarkdownEngineDecouplingTests` public-API contract suite stays, since Phase 1 didn't add API surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Replaces the single-pass regex tokenizer with a two-phase parser: a new `BlockScanner` walks the source line-by-line and emits typed `BlockSpan`s for block-level constructs (paragraph, ATX heading, fenced code, thematic break, link reference definition). The existing inline-regex pass then runs only over the content of blocks that allow inline formatting, and any inline match landing inside a non-inline-allowing block is dropped at the end.

This is largely foundation work; the user-visible surface barely changes today. The motivation is to unblock Phase 2 (lists, blockquotes, tables, footnotes as proper container blocks) and to fix a class of overlap bugs structurally rather than by scattering heuristic checks.
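The final filtering step might look roughly like the sketch below. All names here (`BlockSketch`, `InlineMatchSketch`, `filterByBlockPrecedence`) are hypothetical and ranges are plain integer offsets. It also assumes that any overlap with a non-inline-allowing block is enough to drop a match, which may be stricter than what the PR implements.

```swift
// Hypothetical sketch of "inline tokens filtered by block precedence".
struct BlockSketch {
    let range: Range<Int>
    /// False for blocks such as fenced code, where inline formatting must not apply.
    let allowsInline: Bool
}

struct InlineMatchSketch: Equatable {
    let name: String
    let range: Range<Int>
}

/// Keeps only the inline matches that do not touch a non-inline-allowing block.
func filterByBlockPrecedence(
    _ matches: [InlineMatchSketch],
    blocks: [BlockSketch]
) -> [InlineMatchSketch] {
    matches.filter { match in
        !blocks.contains { block in
            !block.allowsInline && block.range.overlaps(match.range)
        }
    }
}

// Usage: a paragraph at offsets 0..<20 followed by a fenced code block at 20..<50.
let sketchBlocks = [
    BlockSketch(range: 0..<20, allowsInline: true),
    BlockSketch(range: 20..<50, allowsInline: false),
]
let sketchMatches = [
    InlineMatchSketch(name: "bold", range: 5..<12),   // inside the paragraph: kept
    InlineMatchSketch(name: "bold", range: 25..<32),  // inside the code fence: dropped
]
let keptMatches = filterByBlockPrecedence(sketchMatches, blocks: sketchBlocks)
```

Doing the check once against the block list is what makes the fix structural: an inline rule needs no knowledge of code fences, because precedence is decided by the block that owns the range.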
What changes
Parser
- `Sources/MarkdownEngine/Parser/BlockScanner.swift`: line-walker classifying paragraph / ATX heading / fenced code / thematic break / link reference definition.
- `Sources/MarkdownEngine/Parser/BlockSpan.swift`: typed `BlockKind` enum with Phase-1 cases implemented and Phase-2 cases forward-declared (`.blockquote`, `.table`, `.tableRow`, `.tableCell`, `.list`, `.listItem`, `.footnoteDefinition`, `.definitionList`, `.htmlBlock`).

What's user-visible
What's deliberately NOT in this PR (Phase 2)
The following `BlockKind` cases are forward-declared but `BlockScanner` does not yet emit them — anyone building Phase 2 can flip them on without an API break:
Thematic breaks and link reference definitions ARE detected by the block scanner but have no styler / no resolver yet — they render as their literal source text, identical to behaviour on `main`. The detection itself is groundwork for Phase 2.
What was tried then removed during review
Setext headings (`Title\n====` / `Title\n----`) were initially supported then removed. CommonMark §4.3 requires the underline to absorb all preceding paragraph lines into the heading, which is unintuitive for casual notes: a user typing `---` as a visual separator unexpectedly promotes the prior N lines into an H2. Bear, Apple Notes and Notion don't ship Setext for the same UX reason. ATX (`# Title`) covers the same case unambiguously.
Trade-offs (honest)
Test plan
🤖 Generated with Claude Code