Replace line-based parser with tokenizer and walker by JesseHerrick · Pull Request #22 · remoteoss/dexter

JesseHerrick · 2026-04-13T02:49:28Z

Replaces the regex-over-joined-lines approach to Elixir parsing with a new single-pass tokenizer and token walker. This is a correctness and performance overhaul: the parse output contract is unchanged, but results are more accurate and parsing is about 2.7x faster.

What changed

New tokenizer (internal/parser/tokenizer.go)

A hand-written Elixir lexer that produces a flat []Token stream. It handles strings, heredocs, sigils, and interpolation as atomic tokens, which eliminates the need for per-line comment stripping, string blanking, and multi-line join state tracking that the old parser required.
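
To illustrate why atomic string tokens make per-line comment stripping unnecessary, here is a minimal sketch. The `Kind`, `Token`, and `Tokenize` names are hypothetical and are not the API in internal/parser/tokenizer.go; the sketch only handles double-quoted strings, `#` comments, and whitespace-separated words.

```go
package main

// Kind and Token are hypothetical stand-ins for illustration only;
// they are not the types defined by this PR.
type Kind int

const (
	KindWord Kind = iota
	KindString
	KindComment
)

type Token struct {
	Kind Kind
	Text string
	Line int // 1-based line where the token starts
}

// Tokenize splits src into words, double-quoted strings, and # comments.
// A '#' inside a string never starts a comment, because the whole string
// is consumed as one token before the comment rule can see it.
func Tokenize(src string) []Token {
	var toks []Token
	line := 1
	i := 0
	for i < len(src) {
		c := src[i]
		switch {
		case c == '\n':
			line++
			i++
		case c == ' ' || c == '\t' || c == '\r':
			i++
		case c == '"':
			start, startLine := i, line
			i++ // opening quote
			for i < len(src) && src[i] != '"' {
				if src[i] == '\\' && i+1 < len(src) {
					if src[i+1] == '\n' {
						line++ // an escaped newline still ends a source line
					}
					i += 2
					continue
				}
				if src[i] == '\n' {
					line++
				}
				i++
			}
			if i < len(src) {
				i++ // closing quote
			}
			toks = append(toks, Token{KindString, src[start:i], startLine})
		case c == '#':
			start := i
			for i < len(src) && src[i] != '\n' {
				i++
			}
			toks = append(toks, Token{KindComment, src[start:i], line})
		default:
			start := i
			for i < len(src) && !isDelim(src[i]) {
				i++
			}
			toks = append(toks, Token{KindWord, src[start:i], line})
		}
	}
	return toks
}

// isDelim reports whether c ends a word in this simplified lexer.
func isDelim(c byte) bool {
	return c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '"' || c == '#'
}
```

A walker over this stream can ignore `KindString` and `KindComment` tokens entirely, which is the property the real tokenizer relies on for heredocs and sigils as well.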

Rewritten parser (internal/parser/parser_tokenized.go)

A token walker replaces the old regex loop. Because the tokenizer handles quoting correctly, the walker never needs to guess whether it's inside a string — it just skips those token kinds.

LSP extraction functions (internal/lsp/elixir.go)

parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts all ported to the tokenizer. Removes 15 compiled regexes and ~250 lines of heredoc/line-joining state machine.

Multi-line alias block support

Added ExtractAliasBlockParent and wired it into definition/hover/references/completion so modules inside alias Parent.{ Child, Other } blocks resolve correctly.
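
As a rough sketch of the idea (not the actual ExtractAliasBlockParent implementation), the lookup below finds the parent module for a line that falls inside an `alias Parent.{ ... }` block. The `tok` type and `aliasBlockParent` function are hypothetical, and the sketch assumes the parent name, `.`, `{`, and `}` arrive as separate tokens.

```go
package main

// tok is a hypothetical token: its text and 1-based source line.
type tok struct {
	text string
	line int
}

// aliasBlockParent reports the parent module of the `alias Parent.{ ... }`
// block whose braces span the given line. It assumes the token sequence
// "alias", parent name, ".", "{", children..., "}".
func aliasBlockParent(toks []tok, line int) (string, bool) {
	for i := 0; i+3 < len(toks); i++ {
		if toks[i].text != "alias" || toks[i+2].text != "." || toks[i+3].text != "{" {
			continue
		}
		openLine := toks[i+3].line
		for j := i + 4; j < len(toks); j++ {
			if toks[j].text != "}" {
				continue
			}
			if line >= openLine && line <= toks[j].line {
				return toks[i+1].text, true
			}
			break // this block closed without containing the line
		}
	}
	return "", false
}
```

With the parent in hand, a child such as `Accounts` on its own line can be resolved as `Parent.Accounts`, and lines outside the braces report no parent.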

Recursive delegate chain following

LookupFollowDelegate is now recursive (depth limit 5), so multi-hop chains like Payments → Billing → Handlers resolve to the actual implementation rather than stopping at the first defdelegate.
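
A minimal sketch of depth-limited delegate following, assuming a hypothetical in-memory table in place of the real store. `delegates` and `followDelegate` are illustrative names; only the depth limit of 5 comes from this PR.

```go
package main

// maxDelegateDepth mirrors the depth limit of 5 described above.
const maxDelegateDepth = 5

// delegates is a hypothetical lookup table: "Module.func" -> delegate target.
// Names absent from the map are treated as real implementations.
type delegates map[string]string

// followDelegate resolves name through a chain of delegates. It stops at
// the first name that is not a delegate, or once the depth limit is hit,
// which also guarantees termination on cyclic chains.
func followDelegate(d delegates, name string, depth int) string {
	if depth >= maxDelegateDepth {
		return name
	}
	target, ok := d[name]
	if !ok {
		return name
	}
	return followDelegate(d, target, depth+1)
}
```

Calling `followDelegate(d, "Payments.charge", 0)` on a table that maps Payments to Billing and Billing to Handlers returns the Handlers entry rather than the intermediate one.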

Bug fixes

- `#` inside heredoc markdown links was misread as a comment, which cascaded into line merges that swallowed entire defmacro __using__ bodies (broke use-chain resolution in some modules)
- Multi-line bracket expressions: missed refs and produced incorrect line numbers
- `require Module, as: Name` didn't register aliases for go-to-definition
- Multi-hop defdelegate chains (A → B → C) stopped at the intermediate delegate instead of resolving to the final target
- Multi-line alias blocks (`alias Parent.{ Child }`): go-to-definition, hover, references, and completion didn't resolve child modules
- `use Module, opts` spanning multiple lines didn't parse the opts correctly

Performance

On real-world .ex files (geomean across 5 files):

Metric	Before	After
Time per file	993 µs	362 µs
Throughput	84 MB/s	231 MB/s

2.7x faster, with better correctness.

Notes

IndexVersion bumped 10 → 11 (parse output differs in edge cases; existing indexes will be rebuilt on next startup)

Note

High Risk
High risk because it rewrites core LSP parsing/extraction logic (cursor expression, aliases/imports/uses, __using__ analysis, hover/doc parsing) and introduces new caching paths; subtle token-walking edge cases could impact definition/hover/completion/references across the server.

Overview
Replaces multiple line/regex-based LSP text extractors with tokenizer-backed implementations, introducing TokenizedFile (reusable tokens + line offsets) and updating cursor logic to be token-aware for expression extraction, call-context detection (including no-paren calls), bare-call scanning, and module-attribute lookups.
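
The "line offsets" part of that description can be sketched as follows. This is an assumption about the general technique, not the actual TokenizedFile layout; `lineOffsets` and `lineAt` are hypothetical helpers.

```go
package main

import "sort"

// lineOffsets returns the byte offset at which each line of src starts.
// The first entry is always 0.
func lineOffsets(src string) []int {
	offs := []int{0}
	for i := 0; i < len(src); i++ {
		if src[i] == '\n' {
			offs = append(offs, i+1)
		}
	}
	return offs
}

// lineAt returns the 0-based line containing byte offset pos, using a
// binary search for the first line start that lies beyond pos.
func lineAt(offs []int, pos int) int {
	return sort.Search(len(offs), func(i int) bool { return offs[i] > pos }) - 1
}
```

Computing the offsets once per document lets cursor lookups map between byte positions and lines in logarithmic time without rescanning the text.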

Rewrites alias/import/use and __using__ parsing to walk tokens (including multi-line opts, heredocs/sigils, dynamic unquote opt-bindings, and helper quote do delegation), and adds ExtractAliasBlockParent to resolve modules inside multi-line alias Parent.{...} blocks.

Adds document-level token caching in DocumentStore, migrates hover doc/moduledoc extraction to tokenized parsing via new elixir_docs.go, and substantially expands/updates tests to cover regressions around heredocs, multi-line constructs, scope tracking, and depth/forward-progress guarantees.

Reviewed by Cursor Bugbot for commit 94c001b.

Swap ParseText from regex-over-joined-lines to a single-pass token walker that consumes []Token from a new Elixir tokenizer. The tokenizer handles strings, heredocs, sigils, and interpolation as single tokens, eliminating the need for line joining, StripCommentsAndStrings per line, and multi-line sigil/heredoc state tracking.

Results on real-world .ex files (geomean across 5 files):
- 2.7x faster (993µs → 362µs per file)
- Throughput 84 MB/s → 231 MB/s

Also improves correctness: the old line-based approach missed refs in multi-line bracket expressions and produced wrong line numbers for refs inside joined lines.

Bump IndexVersion 10 → 11 (parse output differs in edge cases).

… following

Replace line-based regex parsing with tokenizer-based token walking in parseUsingBody, parseHelperQuoteBlock, extractAliasesFromLines, ExtractImports, ExtractUses, and ExtractUsesWithOpts. This eliminates 15 compiled regexes and ~250 lines of heredoc/line-joining state machine, and fixes a regression where bracketDepth treated # in heredoc markdown links as comments, cascading into file-wide line merges that swallowed defmacro __using__ bodies (broke args_schema use-chain resolution).

Make LookupFollowDelegate recursive (depth limit 5) so multi-hop delegate chains like PaymentsHub → FundFlowExecution → Handlers resolve to the actual implementation instead of stopping at the intermediate defdelegate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Advance inline-def parsing past parameter lists and def bodies so nested statements are not treated as top-level __using__/quote statements, and keep defmodule scope tracking active when `do` appears on the next line to avoid misattributed aliases.

…form

processModuleDef stopped scanning at TokEOL, so `defmodule Foo\ndo` left TokDo to the main loop (double-counting depth) and misattributed functions after the inner module's end. Now scans past EOL with statement-boundary guards to avoid stealing a later module's TokDo.

Also always emits the module Definition even for `, do:` one-liners so they are tracked in the store. No frame is pushed for inline modules since there is no do..end scope.

When collectModuleName encounters a non-TokModule token inside a multi-alias brace block (e.g. an atom or number), it returns without advancing the position. The three brace-scanning loops in parseTextFromTokens, parseHelperQuoteBlock, and parseUsingBody were missing the forward-progress guard that extractAliasesFromText already had, causing them to spin forever. Add the same `if nk == k { k++ }` guard to all three sites.
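
The guard can be sketched as below, using plain strings as tokens. `collectModuleName`, `isModuleName`, and `scanBraceBlock` here are simplified, hypothetical versions written for illustration; only the `nk == k` check reflects the fix described above.

```go
package main

// collectModuleName is a simplified stand-in: it returns the module name at
// toks[k] and the next position, or returns k unchanged when toks[k] is not
// a module name (for example an atom or a number).
func collectModuleName(toks []string, k int) (string, int) {
	if k < len(toks) && isModuleName(toks[k]) {
		return toks[k], k + 1
	}
	return "", k
}

// isModuleName treats any token starting with an uppercase ASCII letter
// as a module name. This is an assumption made for the sketch.
func isModuleName(s string) bool {
	return s != "" && s[0] >= 'A' && s[0] <= 'Z'
}

// scanBraceBlock collects module names between "{" at toks[start] and the
// matching "}".
func scanBraceBlock(toks []string, start int) []string {
	var names []string
	k := start + 1
	for k < len(toks) && toks[k] != "}" {
		if toks[k] == "," {
			k++
			continue
		}
		name, nk := collectModuleName(toks, k)
		if name != "" {
			names = append(names, name)
		}
		if nk == k {
			k++ // forward-progress guard: skip the token that was not consumed
		} else {
			k = nk
		}
	}
	return names
}
```

Without the `nk == k` branch, a token such as `:atom` would leave `k` unchanged and the loop would never terminate.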

The `require Module, as: Name` syntax was not being parsed, so modules aliased via require couldn't be resolved for go-to-definition. Updated the parser and LSP alias extraction to handle this pattern.

The tokenizer emits `do:` as `TokIdent("do") + TokColon` via isKeywordKey, never as `TokDo + TokColon`. Only block-opening `do` (without trailing colon) produces TokDo, so checking if TokDo is followed by TokColon is unreachable.

Made-with: Cursor

Backslash-escaped newlines (\\\n) inside strings, heredocs, sigils, interpolations, and char literals were skipped with i += 2 without incrementing the line counter. This caused all subsequent tokens to report line numbers that were too low, producing wrong go-to-definition targets (e.g. landing on line 588 instead of 594 in ecto/schema.ex). Fixed all 7 affected scan sites: scanStringContent, scanHeredocContent, scanInterpolation (2 sites), scanSigilContent (2 branches), and the main-loop char literal path. Added regression tests for each.
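
The class of bug is easy to reproduce in a small string scanner. The sketch below is a hypothetical `scanString`, not one of the seven real scan sites; it shows the escape path counting the newline before skipping two bytes.

```go
package main

// scanString scans a double-quoted string starting at src[i] == '"'.
// It returns the index just past the closing quote and the number of
// newlines consumed, so the caller can advance its line counter.
func scanString(src string, i int) (end int, newlines int) {
	i++ // opening quote
	for i < len(src) {
		switch src[i] {
		case '\\':
			if i+1 < len(src) {
				if src[i+1] == '\n' {
					newlines++ // the escaped newline still ends a source line
				}
				i += 2
				continue
			}
			i++
		case '\n':
			newlines++
			i++
		case '"':
			return i + 1, newlines
		default:
			i++
		}
	}
	return i, newlines // unterminated string: consumed to end of input
}
```

Dropping the `newlines++` in the escape branch reproduces the symptom: every token after the string reports a line number that is one too low per escaped newline.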

Centralize block/alias token scans across parser and LSP to prevent drift, and move hover doc extraction to tokenized paths with added regression tests.

This restores hover docs for non-quoted sigil forms and avoids false go-to-definition hits on attribute reference sites.

Prevent signature-help call detection from treating keywords like `if` as callable expressions in no-paren contexts, and add a regression test to ensure keyword forms do not produce false call contexts.

- ExtractAliasBlockParent: assert parent on both Accounts and blank lines; assert defmodule line is not inside the block.
- ExtractAliasesInScope: cover require ... as pairs on the same line as alias; document nextPos/for-loop regression.
- parseUsingBody: add quote-body case for two semicolon-separated alias as forms.

Made-with: Cursor

…/defimpl

- Fix off-by-one bug in parseUsingBody's while-style loop where `i = nextPos - 1` was incorrectly compensating for non-existent post-increment (the loop has manual increments, not auto-increment)
- Remove duplicated nextSig/collectModuleName closures in parseUsingBody, use the shared tokNextSig/tokCollectModuleName helpers instead
- Remove unused tokText alias (only used once, replaced with direct call)
- Add TokDefprotocol and TokDefimpl to extractEnclosingModuleFromTokens so __MODULE__ resolves correctly inside protocol/impl blocks

Made-with: Cursor

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit a34256a.

JesseHerrick added 3 commits April 12, 2026 19:50

Fix parser: handle multi-line and edge case parsing

03cc708

Handle more edge-cases

1ecda73