[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper by LuciferYang · Pull Request #56315 · apache/spark

LuciferYang · 2026-06-04T05:39:26Z

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen).

RegExpReplace inlined the same match/replace loop in both nullSafeEval and doGenCode — the matcher build, the while (find) { appendReplacement } loop with a try/catch that builds invalidRegexpReplaceError(...), and appendTail. So every generated whole-stage class that uses regexp_replace carried that ~20-line block plus the matcher build and the error-construction constant-pool entries.

This PR moves it into a shared helper that takes the cached Pattern and builds the matcher itself:

object RegExpUtils {
  def replace(pattern: Pattern, subject: UTF8String, replacement: String,
              regexp: UTF8String, rep: UTF8String, pos: Int): UTF8String = { ... }
}

Both eval and codegen call it; doGenCode emits a single RegExpUtils.replace(...) call (via the Scala-object static forwarder, the same mechanism as the generated QueryExecutionErrors.xxx calls). Building the matcher inside the helper means subject.toString is computed exactly once per row (the previous inline forms decoded it twice). To give codegen the cached pattern term without building a matcher, initLastPatternCode is split out of initLastMatcherCode (the latter now composes the former plus the matcher line); the other callers (RLike / RegExpExtract / RegExpExtractAll) are unchanged.

The Pattern/replacement caching (the per-instance mutable state) stays at the call sites. The eval-only cached result StringBuilder field is dropped — the helper allocates its own, and the codegen path always allocated one per call anyway.

Why are the changes needed?

To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. Moving the loop + matcher build + error construction into one compiled method removes them from every generated class. Measured with debugCodegen() on spark.range(1000).selectExpr("regexp_replace(cast(id as string), '0', 'x')"), a single-regexp_replace whole-stage stage:

Metric	Before	After
`maxMethodCodeSize`	551	423 (-23%)
`maxConstantPoolSize`	277	236 (-15%)

These are real (not Janino-foldable) reductions, replicated per stage class, with the loop body compiled once.

Does this PR introduce any user-facing change?

No. This is a refactor; eval results, nullable, dataType, and the error message/args are unchanged, so SQL output and golden files are unaffected.

One internal trade-off: interpreted eval now allocates a StringBuilder per row instead of reusing a cached field. This matches what the codegen path always did, and is negligible relative to the regex matching; it removes a @transient lazy val field.

How was this patch tested?

RegexpExpressionsSuite "RegexReplace" — existing eval + codegen coverage, plus a new throwing-path case (a replacement referencing a non-existent group raises INVALID_REGEXP_REPLACE), which the tree previously did not cover; checkErrorInExpression exercises it in both interpreted and codegen modes. The full suite (17 tests, incl. the SPARK-22570 global-variable-count test and the other initLastMatcherCode callers) passes.
StringFunctionsSuite (regexp_replace) and CollationSQLRegexpSuite.

Follow-up: RegExpExtract / RegExpExtractAll share a similar inline-loop shape and could get the same treatment in a separate sub-task.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…atch/replace loop into a shared helper RegExpReplace inlined the match/replace loop (matcher.region, the find/appendReplacement/appendTail loop, and the invalidRegexpReplaceError construction in a try/catch) in both nullSafeEval and doGenCode, so every generated whole-stage class carried that block plus the error-construction constant-pool entries. Move the loop into a shared `RegExpUtils.replace(matcher, subject, replacement, regexp, rep, pos)` that both eval and codegen call; doGenCode now emits a single call. The eval-only cached `result` StringBuilder field is dropped (the helper allocates its own; codegen always allocated per call anyway). For a single regexp_replace whole-stage stage, this drops maxMethodCodeSize from 551 to 435 (-21%) and maxConstantPoolSize from 277 to 244 (-12%), with the loop body compiled once instead of per stage. Behavior is unchanged. Adds a throwing-path test (a replacement referencing a non-existent group -> INVALID_REGEXP_REPLACE), which the tree previously did not cover.

…lace so subject.toString is computed once The initial commit had RegExpUtils.replace take a pre-built Matcher, so the caller built the matcher from subject.toString and the helper recomputed subject.toString for the region/error - a duplicate decode per row in the eval path (the codegen path already decoded twice). Make replace take the cached Pattern and build the matcher itself, computing source = subject.toString once. To give codegen the cached pattern term without building a matcher, split initLastPatternCode out of initLastMatcherCode (the latter now composes the former + the matcher line); the other callers (RLike/RegExpExtract/RegExpExtractAll) are unchanged. This also removes the matcher build from every generated class. For a single regexp_replace whole-stage stage the numbers improve to maxMethodCodeSize 551 -> 423 (-23%) and maxConstantPoolSize 277 -> 236 (-15%).

LuciferYang · 2026-06-04T07:00:27Z

cc @gengliangwang

gengliangwang · 2026-06-04T22:08:28Z

-    } else {
-      s
-    }
+    RegExpUtils.replace(pattern, s.asInstanceOf[UTF8String], lastReplacement,


I would prefer using Java String as the method parameter here. So that we can avoid all these .asInstanceOf[UTF8String]

Done in 8c24f32 — RegExpUtils.replace now takes source and regexp as String, so the nullSafeEval call site no longer needs the .asInstanceOf[UTF8String] casts. I also dropped the now-redundant rep parameter and reuse replacement for the error message, since they're the same value (lastReplacement is built from the rep argument). Thanks for the review!

Follow-up on this, plus one open question for a joint call:

Pushed edcd5c6: I also dropped the regexp parameter entirely and now use pattern.pattern() for the error message. Since compileRegexPattern compiles the raw regexp without escaping (collation is carried in the int flags, not the pattern text), pattern.pattern() round-trips the original regexp exactly, and it's only evaluated on the rare error path -- this removes the eager per-row regexp.toString() that the String-param version would otherwise have added in the generated code.

Open question on the remaining subject argument: to make the eval site fully cast-free I pass it as String (s.toString). One consequence is that the out-of-range no-op branch now returns UTF8String.fromString(source) instead of the original subject bytes. For valid UTF-8 this is byte-identical, and it's consistent with the match path (which already builds its result via UTF8String.fromString(...)). The only divergence is for malformed UTF-8 subjects (e.g. cast(binary as string) with invalid bytes), where the String round-trip substitutes U+FFFD; passing it as String also drops the checked cast, so a hypothetical type mismatch would no longer fail fast.

I can instead keep subject: UTF8String (returning it unchanged on the out-of-range path -> exact bytes + checked cast) at the cost of one remaining .asInstanceOf[UTF8String] at the call site. Do you prefer (a) the fully cast-free String version as-is, or (b) keeping subject: UTF8String for byte-exact passthrough? Happy to go either way.

gengliangwang

LGTM except for one comment

…s.replace to drop UTF8String casts

…egExpUtils.replace by using pattern.pattern() for the error message

LuciferYang added 2 commits June 4, 2026 13:25

LuciferYang mentioned this pull request Jun 4, 2026

[SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers #56318

Open

gengliangwang reviewed Jun 4, 2026

View reviewed changes

gengliangwang approved these changes Jun 4, 2026

View reviewed changes

LuciferYang added 2 commits June 5, 2026 10:36

[SPARK-57255][SQL][FOLLOWUP] Use Java String parameters in RegExpUtil…

8c24f32

…s.replace to drop UTF8String casts

[SPARK-57255][SQL][FOLLOWUP] Avoid eager per-row regexp.toString in R…

edcd5c6

…egExpUtils.replace by using pattern.pattern() for the error message

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper#56315

[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper#56315
LuciferYang wants to merge 4 commits into
apache:masterfrom
LuciferYang:regexpreplace-codegen-helper

LuciferYang commented Jun 4, 2026 •

edited

Loading

Uh oh!

LuciferYang commented Jun 4, 2026

Uh oh!

gengliangwang Jun 4, 2026

Uh oh!

LuciferYang Jun 5, 2026

Uh oh!

LuciferYang Jun 5, 2026

Uh oh!

gengliangwang left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LuciferYang commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

LuciferYang commented Jun 4, 2026

Uh oh!

gengliangwang Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gengliangwang left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LuciferYang commented Jun 4, 2026 •

edited

Loading