[SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers #56318

Open
LuciferYang wants to merge 1 commit into apache:master from LuciferYang:regexpextract-codegen-helper


What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce the size of generated Java code in whole-stage codegen).

RegExpExtract and RegExpExtractAll inline the entire match-result-extraction logic into the Java generated by doGenCode: a find() / toMatchResult() / checkGroupIndex / group(idx) block, plus a while loop and an ArrayList accumulation for the *All variant. This duplicates logic that already exists in their interpreted nullSafeEval path and emits a large block into every generated class that uses these functions.

This PR extracts that logic into two shared helpers on the existing object RegExpExtractBase (placed next to checkGroupIndex, which the generated code already calls the same way):

  • RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String
  • RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData

Both nullSafeEval and doGenCode now call these helpers, so the generated Java is a single method call instead of an inline block. This mirrors the approach already used by RegExpReplace (RegExpUtils.replace, SPARK-57255 / #56315), reusing RegExpExtractBase here because checkGroupIndex is already co-located there.
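As a rough illustration of the shape of such helpers, the following is a minimal, self-contained Java sketch. It is not the code in this PR: the real helpers live in Scala on RegExpExtractBase and return UTF8String and GenericArrayData, whereas this sketch assumes plain String and List<String>, a hypothetical class name RegexExtractSketch, and a simplified checkGroupIndex that throws IllegalArgumentException instead of the Spark error class. Returning an empty string for a non-matching input or an unmatched optional group is likewise an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the shared helpers; names and types are assumptions.
final class RegexExtractSketch {

    // Simplified stand-in for checkGroupIndex: rejects indices outside [0, groupCount].
    static void checkGroupIndex(String prettyName, int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                prettyName + ": group index " + idx + " is outside [0, " + groupCount + "]");
        }
    }

    // First match only. The index check runs after find() succeeds,
    // so a non-matching input returns "" without ever validating idx.
    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            return "";
        }
        checkGroupIndex(prettyName, matcher.groupCount(), idx);
        String group = matcher.group(idx);
        // group is null when the pattern matched but an optional group did not participate.
        return group == null ? "" : group;
    }

    // All matches: the same per-match logic, accumulated into a list.
    static List<String> extractAll(Matcher matcher, int idx, String prettyName) {
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            checkGroupIndex(prettyName, matcher.groupCount(), idx);
            String group = matcher.group(idx);
            results.add(group == null ? "" : group);
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("([0-9]+)");
        System.out.println(extract(digits.matcher("abc123def"), 1, "regexp_extract"));
        System.out.println(extractAll(digits.matcher("a1b22c333"), 1, "regexp_extract_all"));
    }
}
```

With helpers of this shape, both the interpreted path and the generated code reduce to one call site that passes the matcher, the group index, and the function name.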

RegExpInStr (a third RegExpExtractBase subclass) is intentionally left unchanged: it returns the match start position rather than an extracted group, so these helpers do not apply.

The unused java.util.regex.MatchResult import and the now-dead codegen locals (matchResult, matchResults, arrayClass) are removed.

Why are the changes needed?

Smaller generated methods reduce JIT and Janino compilation pressure and lower the risk of hitting the JVM's 64KB per-method bytecode limit in wide whole-stage-codegen stages. Measured with debugCodegen() on a single-expression stage (spark.range(1000).selectExpr(...)):

| Plan | maxMethodCodeSize | maxConstantPoolSize |
| --- | --- | --- |
| regexp_extract(cast(id as string), '([0-9]+)', 1) | 415 -> 357 (-14.0%) | 260 -> 239 (-8.1%) |
| regexp_extract_all(cast(id as string), '([0-9])', 1) | 569 -> 477 (-16.2%) | 320 -> 285 (-10.9%) |

Does this PR introduce any user-facing change?

No. This is a behavior-preserving refactor. The interpreted and codegen paths produce identical results, including the INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX error contract (the group-index check still runs only after a successful match, so a non-matching input never throws).
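To make the ordering in that error contract concrete, here is a small Java sketch that contrasts checking the group index after a successful match (the order described above) with a hypothetical eager check before matching. The class GroupIndexOrderSketch, both method names, and the use of IllegalArgumentException in place of INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of why the position of the index check matters.
final class GroupIndexOrderSketch {

    // Simplified stand-in for the group-index validation.
    static void check(int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                "group index " + idx + " is outside [0, " + groupCount + "]");
        }
    }

    // Order described in the PR: validate only once find() has succeeded.
    static String checkAfterMatch(String input, String regex, int idx) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return ""; // non-matching input never reaches the check
        }
        check(m.groupCount(), idx);
        String group = m.group(idx);
        return group == null ? "" : group;
    }

    // Hypothetical alternative, NOT what the PR does: validate before matching.
    // Matcher.groupCount() depends only on the pattern, so this is possible,
    // but it would throw even for inputs that do not match.
    static String checkBeforeMatch(String input, String regex, int idx) {
        Matcher m = Pattern.compile(regex).matcher(input);
        check(m.groupCount(), idx);
        if (!m.find()) {
            return "";
        }
        String group = m.group(idx);
        return group == null ? "" : group;
    }
}
```

For a valid index the two orderings agree; they differ only when the index is out of range and the input does not match, which is the case the refactor has to keep from throwing.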

How was this patch tested?

Existing RegexpExpressionsSuite tests for RegExpExtract and RegExpExtractAll pass (they exercise both interpreted and codegen via checkEvaluation, including the REGEX_GROUP_INDEX error path), and scalastyle is clean. No new test is needed because the refactor preserves behavior and the existing tests already cover both paths.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
