Replace line-based parser with tokenizer and walker #22

Merged
JesseHerrick merged 28 commits into main from feat-tokenizer on Apr 21, 2026

@JesseHerrick JesseHerrick (Collaborator) commented Apr 13, 2026

Replaces the regex-over-joined-lines approach to Elixir parsing with a new single-pass tokenizer and token walker. This is a correctness and performance overhaul: the parse output contract is unchanged, but parsing is faster and the results are more accurate.

What changed

New tokenizer (internal/parser/tokenizer.go)

A hand-written Elixir lexer that produces a flat []Token stream. It handles strings, heredocs, sigils, and interpolation as atomic tokens, which eliminates the need for per-line comment stripping, string blanking, and multi-line join state tracking that the old parser required.
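
To illustrate why atomic string tokens remove the need for per-line comment stripping, here is a minimal sketch of a single-pass lexer. It is not the PR's tokenizer: the `TokKind` values, the `Token` fields, and the `Tokenize` function are assumptions made for this example, and sigils, interpolation, and char literals are deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// TokKind and Token are simplified stand-ins, not the PR's actual types.
type TokKind int

const (
	TokOther   TokKind = iota // any text that is not a string or a comment
	TokString                 // "..." or """...""" emitted as a single token
	TokComment                // '#' up to (not including) the newline
)

type Token struct {
	Kind TokKind
	Text string
	Line int // 1-based line on which the token starts
}

// Tokenize makes one pass over src. Strings and heredocs are consumed whole,
// so a '#' inside them can never start a comment token.
func Tokenize(src string) []Token {
	var toks []Token
	line, i := 1, 0
	for i < len(src) {
		start, kind := i, TokOther
		switch {
		case strings.HasPrefix(src[i:], `"""`): // heredoc
			kind = TokString
			if end := strings.Index(src[i+3:], `"""`); end >= 0 {
				i += 3 + end + 3
			} else {
				i = len(src) // unterminated: consume the rest
			}
		case src[i] == '"': // ordinary string
			kind = TokString
			i++
			for i < len(src) && src[i] != '"' {
				if src[i] == '\\' && i+1 < len(src) {
					i++ // skip the escaped character as well
				}
				i++
			}
			if i < len(src) {
				i++ // closing quote
			}
		case src[i] == '#': // comment runs to end of line
			kind = TokComment
			for i < len(src) && src[i] != '\n' {
				i++
			}
		default: // plain text up to the next string or comment start
			for i < len(src) && src[i] != '"' && src[i] != '#' {
				i++
			}
		}
		text := src[start:i]
		toks = append(toks, Token{Kind: kind, Text: text, Line: line})
		line += strings.Count(text, "\n")
	}
	return toks
}

func main() {
	src := "@doc \"\"\"\nSee [setup](#setup).\n\"\"\"\ndef run, do: :ok # real comment\n"
	for _, t := range Tokenize(src) {
		if t.Kind == TokComment {
			fmt.Printf("comment on line %d: %s\n", t.Line, t.Text)
		}
	}
}
```

In this sketch the `#setup` anchor inside the heredoc stays part of the string token, so only the trailing `# real comment` is reported as a comment.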

Rewritten parser (internal/parser/parser_tokenized.go)

A token walker replaces the old regex loop. Because the tokenizer handles quoting correctly, the walker never needs to guess whether it's inside a string — it just skips those token kinds.
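
A rough sketch of what "just skips those token kinds" can look like in a walker. The token kinds and the `CollectAliases` function are hypothetical and operate on a hand-built token slice; the real walker in parser_tokenized.go handles far more constructs.

```go
package main

import "fmt"

// Simplified, hypothetical token model for illustration only.
type TokKind int

const (
	TokIdent TokKind = iota
	TokModule
	TokString
	TokComment
	TokEOL
)

type Token struct {
	Kind TokKind
	Text string
	Line int
}

// CollectAliases walks the stream once and returns the module that follows
// each `alias` identifier. String and comment tokens are skipped outright,
// so text such as "alias Fake" inside a string is never inspected.
func CollectAliases(toks []Token) []string {
	var out []string
	for i := 0; i < len(toks); i++ {
		switch toks[i].Kind {
		case TokString, TokComment:
			continue // opaque to the walker
		case TokIdent:
			if toks[i].Text == "alias" && i+1 < len(toks) && toks[i+1].Kind == TokModule {
				out = append(out, toks[i+1].Text)
				i++ // the module token has been consumed
			}
		}
	}
	return out
}

func main() {
	toks := []Token{
		{TokString, `"alias Fake"`, 1},
		{TokEOL, "\n", 1},
		{TokIdent, "alias", 2},
		{TokModule, "MyApp.Repo", 2},
		{TokComment, "# alias Ghost", 2},
	}
	fmt.Println(CollectAliases(toks))
}
```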

LSP extraction functions (internal/lsp/elixir.go)

parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts are all ported to the tokenizer. This removes 15 compiled regexes and ~250 lines of heredoc/line-joining state machine code.

Multi-line alias block support

Added ExtractAliasBlockParent and wired it into definition/hover/references/completion so modules inside alias Parent.{ Child, Other } blocks resolve correctly.
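
The idea can be sketched as a lookup from a cursor line to the parent of the enclosing brace block. The function below is an assumption-laden illustration, not the signature of ExtractAliasBlockParent: it takes a hand-built token slice, assumes the dotted parent name arrives as one module token, and treats every line from the opening brace through the closing brace as inside the block.

```go
package main

import "fmt"

// Hypothetical token model for this example.
type TokKind int

const (
	TokIdent TokKind = iota
	TokModule
	TokDot
	TokLBrace
	TokRBrace
	TokComma
)

type Token struct {
	Kind TokKind
	Text string
	Line int
}

// AliasBlockParent reports the parent module of the `alias Parent.{ ... }`
// block whose braces span the given line, if there is one.
func AliasBlockParent(toks []Token, line int) (string, bool) {
	for i := 0; i+3 < len(toks); i++ {
		if toks[i].Kind != TokIdent || toks[i].Text != "alias" ||
			toks[i+1].Kind != TokModule || toks[i+2].Kind != TokDot ||
			toks[i+3].Kind != TokLBrace {
			continue
		}
		parent, open := toks[i+1].Text, toks[i+3].Line
		for j := i + 4; j < len(toks); j++ {
			if toks[j].Kind == TokRBrace {
				if line >= open && line <= toks[j].Line {
					return parent, true
				}
				break // block closed before reaching the line; keep scanning
			}
		}
	}
	return "", false
}

// sampleTokens models:
//
//	1: defmodule MyApp.Web do
//	2:   alias MyApp.{
//	3:     Accounts,
//	4:
//	5:     Billing
//	6:   }
func sampleTokens() []Token {
	return []Token{
		{TokIdent, "defmodule", 1}, {TokModule, "MyApp.Web", 1}, {TokIdent, "do", 1},
		{TokIdent, "alias", 2}, {TokModule, "MyApp", 2}, {TokDot, ".", 2}, {TokLBrace, "{", 2},
		{TokModule, "Accounts", 3}, {TokComma, ",", 3},
		{TokModule, "Billing", 5},
		{TokRBrace, "}", 6},
	}
}

func main() {
	for _, line := range []int{1, 3, 4} {
		parent, ok := AliasBlockParent(sampleTokens(), line)
		fmt.Println(line, parent, ok)
	}
}
```

With a parent in hand, a child such as `Accounts` can be resolved as `MyApp.Accounts` for definition, hover, references, and completion.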

Recursive delegate chain following

LookupFollowDelegate is now recursive (depth limit 5), so multi-hop chains like Payments → Billing → Handlers resolve to the actual implementation rather than stopping at the first defdelegate.
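
A minimal sketch of depth-limited delegate following. The `Target` type, the map-based store, and `FollowDelegate` are hypothetical; only the recursion and the depth limit of 5 come from the description above.

```go
package main

import "fmt"

// Target identifies a function by module and name (hypothetical shape).
type Target struct {
	Module, Func string
}

const maxDelegateDepth = 5

// FollowDelegate resolves t through defdelegate hops until it reaches a
// target that is not itself a delegate, or until the depth limit is hit.
// The limit also guarantees termination on accidental cycles.
func FollowDelegate(delegates map[Target]Target, t Target, depth int) Target {
	next, ok := delegates[t]
	if !ok || depth >= maxDelegateDepth {
		return t
	}
	return FollowDelegate(delegates, next, depth+1)
}

func main() {
	delegates := map[Target]Target{
		{"Payments", "run"}: {"Billing", "run"},
		{"Billing", "run"}:  {"Handlers", "run"},
	}
	fmt.Println(FollowDelegate(delegates, Target{"Payments", "run"}, 0))
}
```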

Bug fixes

  • # inside heredoc markdown links was misread as a comment, which cascaded into line merges that swallowed entire defmacro __using__ bodies (broke use-chain resolution in some modules)
  • Multi-line bracket expressions: missed refs and produced incorrect line numbers
  • require Module, as: Name didn't register aliases for go-to-definition
  • Multi-hop defdelegate chains (A → B → C) stopped at the intermediate delegate instead of resolving to the final target
  • Multi-line alias blocks (alias Parent.{ Child }): go-to-definition, hover, references, and completion didn't resolve child modules
  • use Module, opts spanning multiple lines didn't parse the opts correctly

Performance

On real-world .ex files (geomean across 5 files):

| Metric | Before | After |
|---|---|---|
| Time per file | 993 µs | 362 µs |
| Throughput | 84 MB/s | 231 MB/s |

2.7x faster, with better correctness.

Notes

  • IndexVersion bumped 10 → 11 (parse output differs in edge cases; existing indexes will be rebuilt on next startup)

Note

High Risk
High risk because it rewrites core LSP parsing/extraction logic (cursor expression, aliases/imports/uses, __using__ analysis, hover/doc parsing) and introduces new caching paths; subtle token-walking edge cases could impact definition/hover/completion/references across the server.

Overview
Replaces multiple line/regex-based LSP text extractors with tokenizer-backed implementations, introducing TokenizedFile (reusable tokens + line offsets) and updating cursor logic to be token-aware for expression extraction, call-context detection (including no-paren calls), bare-call scanning, and module-attribute lookups.

Rewrites alias/import/use and __using__ parsing to walk tokens (including multi-line opts, heredocs/sigils, dynamic unquote opt-bindings, and helper quote do delegation), and adds ExtractAliasBlockParent to resolve modules inside multi-line alias Parent.{...} blocks.

Adds document-level token caching in DocumentStore, migrates hover doc/moduledoc extraction to tokenized parsing via new elixir_docs.go, and substantially expands/updates tests to cover regressions around heredocs, multi-line constructs, scope tracking, and depth/forward-progress guarantees.
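
One plausible shape for document-level token caching is sketched below: tokens are built lazily and dropped whenever the document text changes. Everything here is an assumption for illustration, including the field names, the `Set`/`Tokenized` methods, and the whitespace split standing in for the real tokenizer; the PR's DocumentStore and TokenizedFile may differ.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// TokenizedFile is a simplified stand-in: tokens plus line start offsets.
type TokenizedFile struct {
	Tokens      []string
	LineOffsets []int // byte offset at which each line begins
}

type document struct {
	text      string
	tokenized *TokenizedFile // nil until first requested after an edit
}

type DocumentStore struct {
	mu            sync.Mutex
	docs          map[string]*document
	tokenizeCalls int // instrumentation for this example only
}

func NewDocumentStore() *DocumentStore {
	return &DocumentStore{docs: make(map[string]*document)}
}

// Set stores new text and discards any cached tokens for the URI.
func (s *DocumentStore) Set(uri, text string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.docs[uri] = &document{text: text}
}

// Tokenized returns the cached TokenizedFile, building it on first use.
func (s *DocumentStore) Tokenized(uri string) *TokenizedFile {
	s.mu.Lock()
	defer s.mu.Unlock()
	doc, ok := s.docs[uri]
	if !ok {
		return nil
	}
	if doc.tokenized == nil {
		s.tokenizeCalls++
		offsets := []int{0}
		for i := 0; i < len(doc.text); i++ {
			if doc.text[i] == '\n' {
				offsets = append(offsets, i+1)
			}
		}
		// strings.Fields stands in for the real Elixir tokenizer.
		doc.tokenized = &TokenizedFile{Tokens: strings.Fields(doc.text), LineOffsets: offsets}
	}
	return doc.tokenized
}

func main() {
	store := NewDocumentStore()
	store.Set("file:///a.ex", "alias Foo\nalias Bar\n")
	store.Tokenized("file:///a.ex") // hover
	store.Tokenized("file:///a.ex") // definition reuses the same tokens
	fmt.Println("tokenize calls:", store.tokenizeCalls)
}
```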

Reviewed by Cursor Bugbot for commit 94c001b. Bugbot is set up for automated code reviews on this repo.

Swap ParseText from regex-over-joined-lines to a single-pass token walker
that consumes []Token from a new Elixir tokenizer. The tokenizer handles
strings, heredocs, sigils, and interpolation as single tokens, eliminating
the need for line joining, StripCommentsAndStrings per line, and multi-line
sigil/heredoc state tracking.

Results on real-world .ex files (geomean across 5 files):
- 2.7x faster (993µs → 362µs per file)
- Throughput 84 MB/s → 231 MB/s

Also improves correctness: the old line-based approach missed refs in
multi-line bracket expressions and produced wrong line numbers for refs
inside joined lines.

Bump IndexVersion 10 → 11 (parse output differs in edge cases).
@JesseHerrick JesseHerrick changed the base branch from fix/parser-edge-cases to main April 13, 2026 03:05
… following

Replace line-based regex parsing with tokenizer-based token walking in
parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines,
ExtractImports, ExtractUses, and ExtractUsesWithOpts. This eliminates
15 compiled regexes, ~250 lines of heredoc/line-joining state machine,
and fixes a regression where bracketDepth treated # in heredoc markdown
links as comments — cascading into file-wide line merges that swallowed
defmacro __using__ bodies (broke args_schema use-chain resolution).

Make LookupFollowDelegate recursive (depth limit 5) so multi-hop
delegate chains like PaymentsHub → FundFlowExecution → Handlers resolve
to the actual implementation instead of stopping at the intermediate
defdelegate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Advance inline-def parsing past parameter lists and def bodies so nested statements are not treated as top-level __using__/quote statements, and keep defmodule scope tracking active when `do` appears on the next line to avoid misattributed aliases.
…form

processModuleDef stopped scanning at TokEOL, so `defmodule Foo\ndo` left
TokDo to the main loop (double-counting depth) and misattributed functions
after the inner module's end. Now scans past EOL with statement-boundary
guards to avoid stealing a later module's TokDo.

Also always emits the module Definition even for `, do:` one-liners so
they are tracked in the store. No frame is pushed for inline modules
since there is no do..end scope.
When collectModuleName encounters a non-TokModule token inside a
multi-alias brace block (e.g. an atom or number), it returns without
advancing the position. The three brace-scanning loops in
parseTextFromTokens, parseHelperQuoteBlock, and parseUsingBody were
missing the forward-progress guard that extractAliasesFromText already
had, causing them to spin forever. Add the same `if nk == k { k++ }`
guard to all three sites.
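
The guard described above can be sketched as follows. The token kinds, the stub `collectModuleName`, and `scanBraceChildren` are simplified assumptions; only the contract (return the same position when nothing was consumed) and the `nk == k` check mirror the commit message.

```go
package main

import "fmt"

// Hypothetical token model for this example.
type TokKind int

const (
	TokModule TokKind = iota
	TokAtom
	TokComma
	TokRBrace
)

type Token struct {
	Kind TokKind
	Text string
}

// collectModuleName returns the module name at k and the position after it.
// For any other token it returns ("", k): the position is NOT advanced.
func collectModuleName(toks []Token, k int) (string, int) {
	if k < len(toks) && toks[k].Kind == TokModule {
		return toks[k].Text, k + 1
	}
	return "", k
}

// scanBraceChildren reads the children of `Parent.{ ... }`, starting just
// after the opening brace.
func scanBraceChildren(toks []Token, k int) []string {
	var children []string
	for k < len(toks) && toks[k].Kind != TokRBrace {
		name, nk := collectModuleName(toks, k)
		if nk == k {
			k++ // forward-progress guard: skip the token nobody consumed
			continue
		}
		children = append(children, name)
		k = nk
	}
	return children
}

func main() {
	// Models: alias Parent.{ Child, :oops, Other }
	toks := []Token{
		{TokModule, "Child"}, {TokComma, ","},
		{TokAtom, ":oops"}, {TokComma, ","},
		{TokModule, "Other"}, {TokRBrace, "}"},
	}
	fmt.Println(scanBraceChildren(toks, 0))
}
```

Without the guard, the loop in this sketch would stay on the first comma forever, because `collectModuleName` keeps returning the same position.
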
The `require Module, as: Name` syntax was not being parsed, so modules
aliased via require couldn't be resolved for go-to-definition. Updated
the parser and LSP alias extraction to handle this pattern.
The tokenizer emits `do:` as `TokIdent("do") + TokColon` via isKeywordKey,
never as `TokDo + TokColon`. Only block-opening `do` (without trailing
colon) produces TokDo, so checking if TokDo is followed by TokColon is
unreachable.

Made-with: Cursor
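
To make the `do` versus `do:` distinction concrete, here is a small sketch of the classification. The helper names and the exact rule (a colon directly after the identifier, followed by whitespace or end of input) are assumptions for illustration and are not taken from the tokenizer's isKeywordKey.

```go
package main

import (
	"fmt"
	"strings"
)

// isKeywordKeyAt reports whether the identifier ending just before src[end]
// is a keyword-list key: immediately followed by ':' and then whitespace or
// end of input. (Assumed rule for this sketch.)
func isKeywordKeyAt(src string, end int) bool {
	if end >= len(src) || src[end] != ':' {
		return false
	}
	if end+1 == len(src) {
		return true
	}
	switch src[end+1] {
	case ' ', '\t', '\n', '\r':
		return true
	}
	return false
}

// classifyDo returns the token kinds emitted for the `do` at src[start].
func classifyDo(src string, start int) []string {
	end := start + len("do")
	if isKeywordKeyAt(src, end) {
		return []string{"TokIdent", "TokColon"} // `do:` keyword key
	}
	return []string{"TokDo"} // block-opening `do`
}

func main() {
	for _, src := range []string{"def f, do: 1", "def f do"} {
		start := strings.LastIndex(src, "do")
		fmt.Printf("%q -> %v\n", src, classifyDo(src, start))
	}
}
```

Because the keyword-key check runs first, a `TokDo` followed by `TokColon` cannot occur in this model, which is why the removed check was unreachable.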
Backslash-escaped newlines (\\\n) inside strings, heredocs, sigils,
interpolations, and char literals were skipped with i += 2 without
incrementing the line counter. This caused all subsequent tokens to
report line numbers that were too low, producing wrong go-to-definition
targets (e.g. landing on line 588 instead of 594 in ecto/schema.ex).

Fixed all 7 affected scan sites: scanStringContent, scanHeredocContent,
scanInterpolation (2 sites), scanSigilContent (2 branches), and the
main-loop char literal path. Added regression tests for each.
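
A minimal sketch of the fix for one scan site, assuming a simplified string scanner. `scanString` is a hypothetical stand-in for functions like scanStringContent and ignores interpolation; the point is only that the `i += 2` skip must also bump the line counter when the escaped character is a newline.

```go
package main

import "fmt"

// scanString scans a double-quoted string that starts at src[i] == '"'.
// It returns the index just past the closing quote and the updated line
// counter.
func scanString(src string, i, line int) (int, int) {
	i++ // opening quote
	for i < len(src) {
		switch src[i] {
		case '\\':
			// The escape skips two bytes. If the second byte is a newline,
			// the line counter must still advance (this was the bug).
			if i+1 < len(src) && src[i+1] == '\n' {
				line++
			}
			i += 2
		case '\n':
			line++
			i++
		case '"':
			return i + 1, line
		default:
			i++
		}
	}
	return len(src), line // unterminated string
}

func main() {
	src := "\"first \\\nsecond\"\ndef after, do: :ok"
	end, line := scanString(src, 0, 1)
	fmt.Printf("string ends at byte %d, scanner is on line %d\n", end, line)
}
```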
Centralize block/alias token scans across parser and LSP to prevent drift, and move hover doc extraction to tokenized paths with added regression tests.
@JesseHerrick JesseHerrick self-assigned this Apr 14, 2026
This restores hover docs for non-quoted sigil forms and avoids false go-to-definition hits on attribute reference sites.
Prevent signature-help call detection from treating keywords like `if`
as callable expressions in no-paren contexts, and add a regression test
to ensure keyword forms do not produce false call contexts.
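
The check can be sketched as a filter on the candidate callee. Both the `keywordForms` set and the `NoParenCallee` helper are hypothetical, and the list of excluded words is illustrative rather than the one used in the server.

```go
package main

import "fmt"

// keywordForms lists words that should never be offered as a no-paren call
// target. The exact set is an assumption for this sketch.
var keywordForms = map[string]bool{
	"if": true, "unless": true, "case": true, "cond": true, "with": true,
	"for": true, "fn": true, "do": true, "end": true, "when": true,
	"and": true, "or": true, "not": true, "in": true,
}

// NoParenCallee inspects the tokens of a statement such as `send_email user`
// and returns the callee name when the first token looks like a function
// being called without parentheses.
func NoParenCallee(stmt []string) (string, bool) {
	if len(stmt) < 2 {
		return "", false // no argument follows, so there is no call context
	}
	if keywordForms[stmt[0]] {
		return "", false // `if ready?` is a keyword form, not a call
	}
	return stmt[0], true
}

func main() {
	fmt.Println(NoParenCallee([]string{"send_email", "user"}))
	fmt.Println(NoParenCallee([]string{"if", "ready?"}))
}
```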
- ExtractAliasBlockParent: assert parent on both Accounts and blank lines;
  assert defmodule line is not inside the block.
- ExtractAliasesInScope: cover require ... as pairs on the same line as alias;
  document nextPos/for-loop regression.
- parseUsingBody: add quote-body case for two semicolon-separated alias as forms.

Made-with: Cursor
…/defimpl

- Fix off-by-one bug in parseUsingBody's while-style loop where
  `i = nextPos - 1` was incorrectly compensating for non-existent
  post-increment (the loop has manual increments, not auto-increment)
- Remove duplicated nextSig/collectModuleName closures in parseUsingBody,
  use the shared tokNextSig/tokCollectModuleName helpers instead
- Remove unused tokText alias (only used once, replaced with direct call)
- Add TokDefprotocol and TokDefimpl to extractEnclosingModuleFromTokens
  so __MODULE__ resolves correctly inside protocol/impl blocks

Made-with: Cursor
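
The off-by-one in the first bullet of that commit is easy to reproduce in isolation. The sketch below uses a hypothetical string-token loop, not the real parseUsingBody: in a loop with manual increments, the position returned by a consumer is assigned directly, because no post-statement will add one afterwards.

```go
package main

import "fmt"

// consumeAlias consumes `alias <Module>` starting at i and returns the index
// of the first token after the statement.
func consumeAlias(toks []string, i int) int {
	if i+1 < len(toks) {
		return i + 2
	}
	return i + 1
}

// topLevelOthers returns every token that is not part of an alias statement.
func topLevelOthers(toks []string) []string {
	var others []string
	i := 0
	for i < len(toks) { // while-style loop: no automatic i++
		if toks[i] == "alias" {
			nextPos := consumeAlias(toks, i)
			// Correct for a manual-increment loop. `i = nextPos - 1` would
			// only be right in a `for ...; i++` loop, and here it would
			// re-read the module name as a separate statement.
			i = nextPos
			continue
		}
		others = append(others, toks[i])
		i++
	}
	return others
}

func main() {
	fmt.Println(topLevelOthers([]string{"alias", "Foo", "def", "alias", "Bar"}))
}
```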

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit a34256a.

@JesseHerrick JesseHerrick merged commit 39914e8 into main Apr 21, 2026
5 checks passed