[SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers #56318
Open
LuciferYang wants to merge 1 commit into
What changes were proposed in this pull request?
This is a sub-task of SPARK-56908 (reduce the size of generated Java code in whole-stage codegen).
RegExpExtract and RegExpExtractAll inline the entire match-result-extraction logic into the generated Java produced by doGenCode (a find() / toMatchResult() / checkGroupIndex / group(idx) block, plus a while loop and an ArrayList accumulation for the *All variant). This duplicates the logic that already exists in their nullSafeEval interpreted path and emits a large block into every generated class that uses these functions.
This PR extracts that logic into two shared helpers on the existing object RegExpExtractBase (placed next to checkGroupIndex, which the generated code already calls the same way):
- RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String
- RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData
Both nullSafeEval and doGenCode now call these helpers, so the generated Java is a single method call instead of an inline block. This mirrors the approach already used by RegExpReplace (RegExpUtils.replace, SPARK-57255 / #56315), reusing RegExpExtractBase here because checkGroupIndex is already co-located there.
RegExpInStr (a third RegExpExtractBase subclass) is intentionally left unchanged: it returns the match start position rather than an extracted group, so these helpers do not apply.
The unused java.util.regex.MatchResult import and the now-dead codegen locals (matchResult, matchResults, arrayClass) are removed.
Why are the changes needed?
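To make the size argument concrete, here is a minimal standalone sketch of what two shared extract helpers of this shape could look like. It is not the code from this PR: the class name ExtractHelpersSketch is hypothetical, String and List<String> stand in for UTF8String and GenericArrayData, IllegalArgumentException stands in for the Spark REGEX_GROUP_INDEX error, and returning an empty string for a group that did not participate in the match is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the shared helpers; not Spark's actual implementation.
final class ExtractHelpersSketch {

    // Stand-in for checkGroupIndex: rejects an index outside [0, groupCount].
    static void checkGroupIndex(String prettyName, int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                prettyName + ": group index " + idx + " is outside [0, " + groupCount + "]");
        }
    }

    // First match only. The index check runs after find() succeeds.
    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            return "";
        }
        checkGroupIndex(prettyName, matcher.groupCount(), idx);
        String group = matcher.group(idx);
        return group == null ? "" : group; // assumption: unmatched optional group -> ""
    }

    // Every match: the while loop and list accumulation live here, not at the call site.
    static List<String> extractAll(Matcher matcher, int idx, String prettyName) {
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            checkGroupIndex(prettyName, matcher.groupCount(), idx);
            String group = matcher.group(idx);
            results.add(group == null ? "" : group);
        }
        return results;
    }

    public static void main(String[] args) {
        // Each call site is one statement, regardless of how large the helper body is.
        Matcher one = Pattern.compile("([0-9]+)").matcher("id-42-7");
        System.out.println(extract(one, 1, "regexp_extract"));        // prints 42

        Matcher all = Pattern.compile("([0-9])").matcher("407");
        System.out.println(extractAll(all, 1, "regexp_extract_all")); // prints [4, 0, 7]
    }
}
```

With helpers like these, each caller contributes a single method invocation to its enclosing method, while the loop, the index check, and the null handling are compiled once in the helper.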
Smaller generated methods reduce JIT/Janino pressure and the risk of hitting the 64KB method limit in wide whole-stage-codegen stages. Measured with debugCodegen() on a single-expression stage (spark.range(1000).selectExpr(...)):
Metrics compared: maxMethodCodeSize, maxConstantPoolSize
Expressions measured:
- regexp_extract(cast(id as string), '([0-9]+)', 1)
- regexp_extract_all(cast(id as string), '([0-9])', 1)
Does this PR introduce any user-facing change?
No. This is a behavior-preserving refactor. The interpreted and codegen paths produce identical results, including the INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX error contract (the group-index check still runs only after a successful match, so a non-matching input never throws).
How was this patch tested?
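The REGEX_GROUP_INDEX contract described above (the index is validated only after a successful match) is the behavior any test of this refactor has to pin down. The following standalone illustration is not taken from RegexpExpressionsSuite: the class GroupIndexContractSketch and its methods are hypothetical, String stands in for UTF8String, and IllegalArgumentException stands in for the Spark error.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the "check only after a match" contract.
final class GroupIndexContractSketch {

    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            return ""; // no match: the group index is never validated
        }
        if (idx < 0 || idx > matcher.groupCount()) {
            throw new IllegalArgumentException(
                prettyName + ": invalid group index " + idx);
        }
        String group = matcher.group(idx);
        return group == null ? "" : group;
    }

    // Reports whether extracting group idx from input raises the index error.
    static boolean throwsOn(String input, int idx) {
        try {
            extract(Pattern.compile("([0-9]+)").matcher(input), idx, "regexp_extract");
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsOn("abc", 5)); // false: no match, bad index is ignored
        System.out.println(throwsOn("a1", 5));  // true: match found, index 5 > 1 group
        System.out.println(throwsOn("a1", 1));  // false: match found, index is valid
    }
}
```

The first case is the one a refactor could silently break: moving the index check ahead of find() would make a non-matching input start throwing.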
Existing RegexpExpressionsSuite tests for RegExpExtract and RegExpExtractAll pass (they exercise both the interpreted and codegen paths via checkEvaluation, including the REGEX_GROUP_INDEX error path), and scalastyle is clean. No new test is needed because the refactor preserves behavior and the existing tests already cover both paths.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code