[SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers by LuciferYang · Pull Request #56318 · apache/spark

LuciferYang · 2026-06-04T09:07:09Z

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce the size of generated Java code in whole-stage codegen).

RegExpExtract and RegExpExtractAll currently inline their entire match-extraction logic into the Java generated by doGenCode: a find() / toMatchResult() / checkGroupIndex / group(idx) block, plus a while loop and ArrayList accumulation for the *All variant. This duplicates logic that already exists in the interpreted nullSafeEval path and emits a large block into every generated class that uses these functions.

This PR extracts that logic into two shared helpers on the existing object RegExpExtractBase (placed next to checkGroupIndex, which the generated code already calls the same way):

RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String
RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData

Both nullSafeEval and doGenCode now call these helpers, so the generated Java is a single method call instead of an inline block. This mirrors the approach already used by RegExpReplace (RegExpUtils.replace, SPARK-57255 / #56315). The helpers are placed on RegExpExtractBase because checkGroupIndex is already located there.
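As a rough illustration of the shape of the two helpers, here is a minimal, Spark-free sketch in plain Java. It is not the code in this PR: it uses String and List<String> in place of UTF8String and GenericArrayData, the class name ExtractHelpersSketch and the exception type are hypothetical, and returning an empty string for a non-match or an unmatched group is an assumption about the existing behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the shared helpers; not Spark's actual code. */
final class ExtractHelpersSketch {

    /** Stand-in for checkGroupIndex: rejects an index outside [0, groupCount]. */
    static void checkGroupIndex(String prettyName, int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                prettyName + ": group index " + idx + " is out of range [0, " + groupCount + "]");
        }
    }

    /** First match only: find(), toMatchResult(), index check, group(idx). */
    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            return ""; // assumption: a non-match yields an empty string
        }
        MatchResult mr = matcher.toMatchResult();
        checkGroupIndex(prettyName, mr.groupCount(), idx);
        String group = mr.group(idx);
        return group == null ? "" : group; // assumption: unmatched group -> empty
    }

    /** Every match: the same steps inside a while loop, accumulated in a list. */
    static List<String> extractAll(Matcher matcher, int idx, String prettyName) {
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            MatchResult mr = matcher.toMatchResult();
            checkGroupIndex(prettyName, mr.groupCount(), idx);
            String group = mr.group(idx);
            results.add(group == null ? "" : group);
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("([0-9]+)");
        System.out.println(extract(p.matcher("id-123-45"), 1, "regexp_extract"));
        System.out.println(extractAll(p.matcher("id-123-45"), 1, "regexp_extract_all"));
    }
}
```

With helpers of this shape, both the interpreted path and the generated code only need to pass in a prepared Matcher, the group index, and the function name.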

RegExpInStr (a third RegExpExtractBase subclass) is intentionally left unchanged: it returns the match start position rather than an extracted group, so these helpers do not apply.
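To make the difference concrete, a position-returning lookup might look like the sketch below. This is plain java.util.regex, not Spark code; the class name InStrSketch, the 1-based position, and the 0 result for a non-match are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of a position-returning lookup; not Spark's code. */
final class InStrSketch {

    /** Assumption: 1-based start of the first match, or 0 when nothing matches. */
    static int instr(Matcher matcher) {
        return matcher.find() ? matcher.start() + 1 : 0;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[0-9]+");
        System.out.println(instr(p.matcher("ab12"))); // position of "12"
        System.out.println(instr(p.matcher("abcd"))); // no match
    }
}
```

The result is an integer position rather than group text, so a helper returning an extracted string or array would not fit.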

The unused java.util.regex.MatchResult import and the now-dead codegen locals (matchResult, matchResults, arrayClass) are removed.

Why are the changes needed?

Smaller generated methods reduce JIT/Janino pressure and the risk of hitting the 64KB method limit in wide whole-stage-codegen stages. Measured with debugCodegen() on a single-expression stage (spark.range(1000).selectExpr(...)):

| Plan | `maxMethodCodeSize` | `maxConstantPoolSize` |
| --- | --- | --- |
| `regexp_extract(cast(id as string), '([0-9]+)', 1)` | 415 -> 357 (-14.0%) | 260 -> 239 (-8.1%) |
| `regexp_extract_all(cast(id as string), '([0-9])', 1)` | 569 -> 477 (-16.2%) | 320 -> 285 (-10.9%) |
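The percentages are relative changes against the "before" value. A small sketch that recomputes them from the before/after figures reported above (the class and method names are hypothetical):

```java
import java.util.Locale;

/** Recomputes the relative size changes from the reported before/after values. */
final class ReductionSketch {

    /** Relative change in percent; negative means the value shrank. */
    static double percentChange(int before, int after) {
        return (after - before) * 100.0 / before;
    }

    static String format(int before, int after) {
        return String.format(Locale.ROOT, "%d -> %d (%.1f%%)",
            before, after, percentChange(before, after));
    }

    public static void main(String[] args) {
        System.out.println(format(415, 357)); // regexp_extract, maxMethodCodeSize
        System.out.println(format(260, 239)); // regexp_extract, maxConstantPoolSize
        System.out.println(format(569, 477)); // regexp_extract_all, maxMethodCodeSize
        System.out.println(format(320, 285)); // regexp_extract_all, maxConstantPoolSize
    }
}
```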

Does this PR introduce any user-facing change?

No. This is a behavior-preserving refactor. The interpreted and codegen paths produce identical results, including the INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX error contract (the group-index check still runs only after a successful match, so a non-matching input never throws).
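The ordering in that contract can be demonstrated with a small standalone sketch. It is not Spark code: the class name GroupIndexContractSketch is hypothetical, IllegalArgumentException stands in for the INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX error, and the empty-string result for a non-match is an assumption.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo of "validate the group index only after a match". */
final class GroupIndexContractSketch {

    static String extract(Matcher matcher, int idx) {
        if (!matcher.find()) {
            return ""; // no match: the group index is never validated
        }
        MatchResult mr = matcher.toMatchResult();
        if (idx < 0 || idx > mr.groupCount()) {
            // stand-in for the REGEX_GROUP_INDEX error
            throw new IllegalArgumentException("invalid group index: " + idx);
        }
        String group = mr.group(idx);
        return group == null ? "" : group;
    }

    /** Returns true if extracting group idx from input raises the index error. */
    static boolean throwsOnExtract(String input, int idx) {
        try {
            extract(Pattern.compile("([0-9]+)").matcher(input), idx);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Group 5 does not exist in "([0-9]+)", which has a single group.
        System.out.println(throwsOnExtract("abc", 5)); // non-matching input
        System.out.println(throwsOnExtract("a1", 5));  // matching input
    }
}
```

Only the matching input reaches the index check, so an out-of-range index on a non-matching row goes unnoticed in both paths.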

How was this patch tested?

Existing RegexpExpressionsSuite tests for RegExpExtract and RegExpExtractAll pass (they exercise both interpreted and codegen via checkEvaluation, including the REGEX_GROUP_INDEX error path), and scalastyle is clean. No new test is needed because the refactor preserves behavior and the existing tests already cover both paths.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Commit d5442ae: [SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers

LuciferYang wants to merge 1 commit into apache:master from LuciferYang:regexpextract-codegen-helper
