Skip to content

[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper#56315

Open
LuciferYang wants to merge 4 commits into
apache:masterfrom
LuciferYang:regexpreplace-codegen-helper
Open

[SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper#56315
LuciferYang wants to merge 4 commits into
apache:masterfrom
LuciferYang:regexpreplace-codegen-helper

Conversation

@LuciferYang
Copy link
Copy Markdown
Contributor

@LuciferYang LuciferYang commented Jun 4, 2026

What changes were proposed in this pull request?

This is a sub-task of SPARK-56908 (reduce generated Java size in whole-stage codegen).

RegExpReplace inlined the same match/replace loop in both nullSafeEval and doGenCode — the matcher build, the while (find) { appendReplacement } loop with a try/catch that builds invalidRegexpReplaceError(...), and appendTail. So every generated whole-stage class that uses regexp_replace carried that ~20-line block plus the matcher build and the error-construction constant-pool entries.

This PR moves it into a shared helper that takes the cached Pattern and builds the matcher itself:

object RegExpUtils {
  def replace(pattern: Pattern, subject: UTF8String, replacement: String,
              regexp: UTF8String, rep: UTF8String, pos: Int): UTF8String = { ... }
}

Both eval and codegen call it; doGenCode emits a single RegExpUtils.replace(...) call (via the Scala-object static forwarder, the same mechanism as the generated QueryExecutionErrors.xxx calls). Building the matcher inside the helper means subject.toString is computed exactly once per row (the previous inline forms decoded it twice). To give codegen the cached pattern term without building a matcher, initLastPatternCode is split out of initLastMatcherCode (the latter now composes the former plus the matcher line); the other callers (RLike / RegExpExtract / RegExpExtractAll) are unchanged.

The Pattern/replacement caching (the per-instance mutable state) stays at the call sites. The eval-only cached result StringBuilder field is dropped — the helper allocates its own, and the codegen path always allocated one per call anyway.

Why are the changes needed?

To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. Moving the loop + matcher build + error construction into one compiled method removes them from every generated class. Measured with debugCodegen() on spark.range(1000).selectExpr("regexp_replace(cast(id as string), '0', 'x')"), a single-regexp_replace whole-stage stage:

Metric Before After
maxMethodCodeSize 551 423 (-23%)
maxConstantPoolSize 277 236 (-15%)

These are real (not Janino-foldable) reductions, replicated per stage class, with the loop body compiled once.

Does this PR introduce any user-facing change?

No. This is a refactor; eval results, nullable, dataType, and the error message/args are unchanged, so SQL output and golden files are unaffected.

One internal trade-off: interpreted eval now allocates a StringBuilder per row instead of reusing a cached field. This matches what the codegen path always did, and is negligible relative to the regex matching; it removes a @transient lazy val field.

How was this patch tested?

  • RegexpExpressionsSuite "RegexReplace" — existing eval + codegen coverage, plus a new throwing-path case (a replacement referencing a non-existent group raises INVALID_REGEXP_REPLACE), which the tree previously did not cover; checkErrorInExpression exercises it in both interpreted and codegen modes. The full suite (17 tests, incl. the SPARK-22570 global-variable-count test and the other initLastMatcherCode callers) passes.
  • StringFunctionsSuite (regexp_replace) and CollationSQLRegexpSuite.

Follow-up: RegExpExtract / RegExpExtractAll share a similar inline-loop shape and could get the same treatment in a separate sub-task.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

…atch/replace loop into a shared helper

RegExpReplace inlined the match/replace loop (matcher.region, the
find/appendReplacement/appendTail loop, and the invalidRegexpReplaceError
construction in a try/catch) in both nullSafeEval and doGenCode, so every
generated whole-stage class carried that block plus the error-construction
constant-pool entries.

Move the loop into a shared `RegExpUtils.replace(matcher, subject, replacement,
regexp, rep, pos)` that both eval and codegen call; doGenCode now emits a single
call. The eval-only cached `result` StringBuilder field is dropped (the helper
allocates its own; codegen always allocated per call anyway).

For a single regexp_replace whole-stage stage, this drops maxMethodCodeSize from
551 to 435 (-21%) and maxConstantPoolSize from 277 to 244 (-12%), with the loop
body compiled once instead of per stage. Behavior is unchanged.

Adds a throwing-path test (a replacement referencing a non-existent group ->
INVALID_REGEXP_REPLACE), which the tree previously did not cover.
…lace so subject.toString is computed once

The initial commit had RegExpUtils.replace take a pre-built Matcher, so the
caller built the matcher from subject.toString and the helper recomputed
subject.toString for the region/error - a duplicate decode per row in the eval
path (the codegen path already decoded twice).

Make replace take the cached Pattern and build the matcher itself, computing
source = subject.toString once. To give codegen the cached pattern term without
building a matcher, split initLastPatternCode out of initLastMatcherCode (the
latter now composes the former + the matcher line); the other callers
(RLike/RegExpExtract/RegExpExtractAll) are unchanged.

This also removes the matcher build from every generated class. For a single
regexp_replace whole-stage stage the numbers improve to maxMethodCodeSize
551 -> 423 (-23%) and maxConstantPoolSize 277 -> 236 (-15%).
@LuciferYang
Copy link
Copy Markdown
Contributor Author

cc @gengliangwang

} else {
s
}
RegExpUtils.replace(pattern, s.asInstanceOf[UTF8String], lastReplacement,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer using Java String as the method parameter here. So that we can avoid all these .asInstanceOf[UTF8String]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8c24f32RegExpUtils.replace now takes source and regexp as String, so the nullSafeEval call site no longer needs the .asInstanceOf[UTF8String] casts. I also dropped the now-redundant rep parameter and reuse replacement for the error message, since they're the same value (lastReplacement is built from the rep argument). Thanks for the review!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up on this, plus one open question for a joint call:

Pushed edcd5c6: I also dropped the regexp parameter entirely and now use pattern.pattern() for the error message. Since compileRegexPattern compiles the raw regexp without escaping (collation is carried in the int flags, not the pattern text), pattern.pattern() round-trips the original regexp exactly, and it's only evaluated on the rare error path -- this removes the eager per-row regexp.toString() that the String-param version would otherwise have added in the generated code.

Open question on the remaining subject argument: to make the eval site fully cast-free I pass it as String (s.toString). One consequence is that the out-of-range no-op branch now returns UTF8String.fromString(source) instead of the original subject bytes. For valid UTF-8 this is byte-identical, and it's consistent with the match path (which already builds its result via UTF8String.fromString(...)). The only divergence is for malformed UTF-8 subjects (e.g. cast(binary as string) with invalid bytes), where the String round-trip substitutes U+FFFD; passing it as String also drops the checked cast, so a hypothetical type mismatch would no longer fail fast.

I can instead keep subject: UTF8String (returning it unchanged on the out-of-range path -> exact bytes + checked cast) at the cost of one remaining .asInstanceOf[UTF8String] at the call site. Do you prefer (a) the fully cast-free String version as-is, or (b) keeping subject: UTF8String for byte-exact passthrough? Happy to go either way.

Copy link
Copy Markdown
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one comment

…egExpUtils.replace by using pattern.pattern() for the error message
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants