Parser: fix exponential parse time on compound chains #2344
Merged
Dandandan
approved these changes
May 20, 2026
Contributor
Author
Copilot does not seem to beat the AI bubble allegations lol
iffyio
reviewed
May 21, 2026
iffyio
approved these changes
May 21, 2026
iffyio
left a comment
Contributor
LGTM! Thanks @LucaCappelletti94!
This was referenced May 22, 2026
3 tasks
moshap-firebolt
added a commit
to firebolt-analytics/datafusion-sqlparser-rs
that referenced
this pull request
Jun 9, 2026
Mirrors the position-keyed failure cache pattern from apache#2344, apache#2350, apache#2352.

`parse_table_factor`'s `(` arm speculatively parses a derived table; on failure it rewinds and tries `parse_table_and_joins` (nested-join). Both arms recurse back into `parse_table_factor` consuming the next `(`, so on pathological inputs like `SELECT 1 FROM (((((...` each level re-runs the speculative arm and work doubles at each level. With 30 nested parens this takes >7s; with 50, the libFuzzer per-test timeout fires (>1300s seen in CI).

Cache the parser position at which `parse_derived_table_factor` was already attempted and failed. The next time `parse_table_factor` reaches that position (via the nested-join arm's recursive descent), skip the speculative call and go straight to the fallback. The cache only stores positions where a non-`RecursionLimitExceeded` failure occurred, so the recursion-limit guard still propagates.

Regression test: `parse_table_factor_paren_chain_no_exponential_blowup` runs the parse on a worker thread and asserts it returns within 5 s; pre-fix it hangs the libFuzzer worker for >20 minutes on a 666-byte ClickHouse seed surfaced by the `sql_parser_dialects` fuzz harness.

Bench: `parse_table_factor_paren_chain/chain_{10,20,30}`.

Drive-by: add the missing comma in `criterion_group!` between `parse_compound_keyword_chain` and `parse_prefix_keyword_call_chain` (was a parse error preventing the new bench from registering).
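The position-keyed failure cache described in this commit message can be illustrated with a toy parser. This is only a sketch under stated assumptions: `ToyParser`, `parse_factor`, `parse_speculative`, `failed_speculative` and `count_calls` are hypothetical names, the grammar is invented for the example, and none of it is the actual sqlparser-rs code. It counts calls to `parse_factor` so the difference between the cached and uncached behaviour is visible.

```rust
use std::collections::HashSet;

/// Toy backtracking parser over bytes. Hypothetical; not sqlparser-rs code.
struct ToyParser<'a> {
    input: &'a [u8],
    pos: usize,
    /// Positions where the speculative arm was already tried and failed.
    failed_speculative: HashSet<usize>,
    use_cache: bool,
    /// Number of `parse_factor` invocations, to make the blowup measurable.
    calls: u64,
}

impl<'a> ToyParser<'a> {
    fn new(input: &'a str, use_cache: bool) -> Self {
        ToyParser {
            input: input.as_bytes(),
            pos: 0,
            failed_speculative: HashSet::new(),
            use_cache,
            calls: 0,
        }
    }

    fn peek(&self) -> Option<u8> {
        self.input.get(self.pos).copied()
    }

    fn eat(&mut self, byte: u8) -> bool {
        if self.peek() == Some(byte) {
            self.pos += 1;
            true
        } else {
            false
        }
    }

    /// Speculative arm: '(' factor ')' '!'. The caller rewinds on failure.
    fn parse_speculative(&mut self) -> bool {
        self.eat(b'(') && self.parse_factor() && self.eat(b')') && self.eat(b'!')
    }

    /// factor := '(' factor ')' '!' | '(' factor ')' | 'x'
    fn parse_factor(&mut self) -> bool {
        self.calls += 1;
        let start = self.pos;
        if self.eat(b'x') {
            return true;
        }
        if self.peek() != Some(b'(') {
            return false;
        }
        // Skip the speculative arm if it already failed at this position.
        let known_failure = self.use_cache && self.failed_speculative.contains(&start);
        if !known_failure {
            if self.parse_speculative() {
                return true;
            }
            self.pos = start; // rewind before trying the fallback
            if self.use_cache {
                self.failed_speculative.insert(start);
            }
        }
        // Fallback arm: '(' factor ')'. It re-enters `parse_factor` at the
        // same inner position the speculative arm already visited.
        let ok = self.eat(b'(') && self.parse_factor() && self.eat(b')');
        if !ok {
            self.pos = start;
        }
        ok
    }
}

/// Parses `depth` nested parens around `x`; returns (accepted, call count).
fn count_calls(depth: usize, use_cache: bool) -> (bool, u64) {
    let input = format!("{}x{}", "(".repeat(depth), ")".repeat(depth));
    let mut parser = ToyParser::new(&input, use_cache);
    let ok = parser.parse_factor() && parser.pos == input.len();
    (ok, parser.calls)
}
```

Calling `count_calls(16, false)` and `count_calls(16, true)` accepts the same input both times, but the uncached call count roughly doubles with each extra nesting level while the cached count grows only polynomially. The toy omits the recursion-limit handling the commit message mentions; a real cache would have to avoid recording those failures.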
`parse_compound_expr` recursed into `parse_subexpr` after every `.`, which re-walks the rest of the chain inside a rollback boundary. On inputs like `IF a.b.c...x.#` the work doubled per chain element, giving 2^N parse times. Switching the inner call to `parse_prefix` removes the redundant traversal: the outer loop already walks the chain, so the resulting AST is identical on valid SQL.

Measured on `PostgreSqlDialect` with three inputs from a libFuzzer corpus, release build:

- `iF i.D.i.:Fi....` (65 B)
- `iF i.D.i.i. ... .*~` (58 B)
- `if-stf-localtclocal33alt....` (306 B)

Regression test in `tests/sqlparser_postgres.rs` runs a 30-element chain with a 5 s timeout. Pre-fix it hits the timeout, post-fix it finishes in well under a millisecond.
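A timeout-guarded regression test of the kind described above could be structured roughly as follows. This is a sketch using only the standard library; `run_with_timeout` and `compound_chain` are hypothetical helpers invented for illustration, not names from the PR, and the exact input the real test uses is an assumption.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a worker thread. Returns `Some(result)` if it finishes
/// within `limit`, or `None` on timeout (or if the worker panicked).
/// A timed-out worker is left detached; it cannot be cancelled from here.
fn run_with_timeout<T, F>(limit: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore send errors: the receiver is gone if we already timed out.
        let _ = tx.send(work());
    });
    rx.recv_timeout(limit).ok()
}

/// Hypothetical input builder: `IF a.a. ... .a.#` with `n` chain elements,
/// modelled on the `IF a.b.c...x.#` shape mentioned in the description.
fn compound_chain(n: usize) -> String {
    format!("IF {}.#", vec!["a"; n].join("."))
}
```

In an actual test the closure would presumably call the parser on `compound_chain(30)` (for example via `Parser::parse_sql` with `PostgreSqlDialect`) and the test would assert that `run_with_timeout(Duration::from_secs(5), ...)` returns `Some(_)`. Running the parse on a separate thread means a regression shows up as a failed assertion instead of a hung test run.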