[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper#56315
Conversation
…atch/replace loop into a shared helper RegExpReplace inlined the match/replace loop (matcher.region, the find/appendReplacement/appendTail loop, and the invalidRegexpReplaceError construction in a try/catch) in both nullSafeEval and doGenCode, so every generated whole-stage class carried that block plus the error-construction constant-pool entries. Move the loop into a shared `RegExpUtils.replace(matcher, subject, replacement, regexp, rep, pos)` that both eval and codegen call; doGenCode now emits a single call. The eval-only cached `result` StringBuilder field is dropped (the helper allocates its own; codegen always allocated per call anyway). For a single regexp_replace whole-stage stage, this drops maxMethodCodeSize from 551 to 435 (-21%) and maxConstantPoolSize from 277 to 244 (-12%), with the loop body compiled once instead of per stage. Behavior is unchanged. Adds a throwing-path test (a replacement referencing a non-existent group -> INVALID_REGEXP_REPLACE), which the tree previously did not cover.
…lace so subject.toString is computed once The initial commit had RegExpUtils.replace take a pre-built Matcher, so the caller built the matcher from subject.toString and the helper recomputed subject.toString for the region/error - a duplicate decode per row in the eval path (the codegen path already decoded twice). Make replace take the cached Pattern and build the matcher itself, computing source = subject.toString once. To give codegen the cached pattern term without building a matcher, split initLastPatternCode out of initLastMatcherCode (the latter now composes the former + the matcher line); the other callers (RLike/RegExpExtract/RegExpExtractAll) are unchanged. This also removes the matcher build from every generated class. For a single regexp_replace whole-stage stage the numbers improve to maxMethodCodeSize 551 -> 423 (-23%) and maxConstantPoolSize 277 -> 236 (-15%).
| } else { | ||
| s | ||
| } | ||
| RegExpUtils.replace(pattern, s.asInstanceOf[UTF8String], lastReplacement, |
There was a problem hiding this comment.
I would prefer using Java String as the method parameter here. So that we can avoid all these .asInstanceOf[UTF8String]
There was a problem hiding this comment.
Done in 8c24f32 — RegExpUtils.replace now takes source and regexp as String, so the nullSafeEval call site no longer needs the .asInstanceOf[UTF8String] casts. I also dropped the now-redundant rep parameter and reuse replacement for the error message, since they're the same value (lastReplacement is built from the rep argument). Thanks for the review!
There was a problem hiding this comment.
Follow-up on this, plus one open question for a joint call:
Pushed edcd5c6: I also dropped the regexp parameter entirely and now use pattern.pattern() for the error message. Since compileRegexPattern compiles the raw regexp without escaping (collation is carried in the int flags, not the pattern text), pattern.pattern() round-trips the original regexp exactly, and it's only evaluated on the rare error path -- this removes the eager per-row regexp.toString() that the String-param version would otherwise have added in the generated code.
Open question on the remaining subject argument: to make the eval site fully cast-free I pass it as String (s.toString). One consequence is that the out-of-range no-op branch now returns UTF8String.fromString(source) instead of the original subject bytes. For valid UTF-8 this is byte-identical, and it's consistent with the match path (which already builds its result via UTF8String.fromString(...)). The only divergence is for malformed UTF-8 subjects (e.g. cast(binary as string) with invalid bytes), where the String round-trip substitutes U+FFFD; passing it as String also drops the checked cast, so a hypothetical type mismatch would no longer fail fast.
I can instead keep subject: UTF8String (returning it unchanged on the out-of-range path -> exact bytes + checked cast) at the cost of one remaining .asInstanceOf[UTF8String] at the call site. Do you prefer (a) the fully cast-free String version as-is, or (b) keeping subject: UTF8String for byte-exact passthrough? Happy to go either way.
gengliangwang
left a comment
There was a problem hiding this comment.
LGTM except for one comment
…s.replace to drop UTF8String casts
…egExpUtils.replace by using pattern.pattern() for the error message
What changes were proposed in this pull request?
This is a sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen).
RegExpReplaceinlined the same match/replace loop in bothnullSafeEvalanddoGenCode— the matcher build, thewhile (find) { appendReplacement }loop with atry/catchthat buildsinvalidRegexpReplaceError(...), andappendTail. So every generated whole-stage class that usesregexp_replacecarried that ~20-line block plus the matcher build and the error-construction constant-pool entries.This PR moves it into a shared helper that takes the cached
Patternand builds the matcher itself:Both
evaland codegen call it;doGenCodeemits a singleRegExpUtils.replace(...)call (via the Scala-object static forwarder, the same mechanism as the generatedQueryExecutionErrors.xxxcalls). Building the matcher inside the helper meanssubject.toStringis computed exactly once per row (the previous inline forms decoded it twice). To give codegen the cached pattern term without building a matcher,initLastPatternCodeis split out ofinitLastMatcherCode(the latter now composes the former plus the matcher line); the other callers (RLike/RegExpExtract/RegExpExtractAll) are unchanged.The Pattern/replacement caching (the per-instance mutable state) stays at the call sites. The eval-only cached
resultStringBuilderfield is dropped — the helper allocates its own, and the codegen path always allocated one per call anyway.Why are the changes needed?
To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. Moving the loop + matcher build + error construction into one compiled method removes them from every generated class. Measured with
debugCodegen()onspark.range(1000).selectExpr("regexp_replace(cast(id as string), '0', 'x')"), a single-regexp_replacewhole-stage stage:maxMethodCodeSizemaxConstantPoolSizeThese are real (not Janino-foldable) reductions, replicated per stage class, with the loop body compiled once.
Does this PR introduce any user-facing change?
No. This is a refactor;
evalresults,nullable,dataType, and the error message/args are unchanged, so SQL output and golden files are unaffected.One internal trade-off: interpreted
evalnow allocates aStringBuilderper row instead of reusing a cached field. This matches what the codegen path always did, and is negligible relative to the regex matching; it removes a@transient lazy valfield.How was this patch tested?
RegexpExpressionsSuite"RegexReplace" — existing eval + codegen coverage, plus a new throwing-path case (a replacement referencing a non-existent group raisesINVALID_REGEXP_REPLACE), which the tree previously did not cover;checkErrorInExpressionexercises it in both interpreted and codegen modes. The full suite (17 tests, incl. the SPARK-22570 global-variable-count test and the otherinitLastMatcherCodecallers) passes.StringFunctionsSuite(regexp_replace) andCollationSQLRegexpSuite.Follow-up:
RegExpExtract/RegExpExtractAllshare a similar inline-loop shape and could get the same treatment in a separate sub-task.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.8)