Improve performance of regexps in IAST and query obfuscator #11649

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into master from malvarez/iast-migrate-regexp-re2j
Jun 23, 2026

Conversation

@manuel-alvarez-alvarez

@manuel-alvarez-alvarez manuel-alvarez-alvarez commented Jun 15, 2026

Member

What Does This Do

  • Migrate the IAST evidence-redaction regexps to RE2/J for linear-time matching, and bound how much evidence is analyzed and serialized.
  • Replace the query obfuscator's while (matcher.find()) + per-match Strings.replace loop (O(N×Q)) with a single Matcher.appendReplacement/appendTail pass (O(Q)).
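The second bullet can be illustrated with a minimal sketch of the single-pass technique. It is not the code from `QueryObfuscator.java`: the pattern, the replacement token, and the class and method names below are assumptions chosen for illustration. Each match is copied into the output exactly once through `appendReplacement`, and `appendTail` copies whatever follows the last match, so the input is never rebuilt per match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SinglePassObfuscation {
  // Hypothetical pattern and replacement token, used only for this sketch.
  private static final Pattern SENSITIVE = Pattern.compile("(?i)(?:password|token)=[^&]+");
  private static final String REDACTED = "<redacted>";

  static String obfuscate(String query) {
    Matcher matcher = SENSITIVE.matcher(query);
    if (!matcher.find()) {
      return query; // nothing to redact, avoid allocating a buffer
    }
    // StringBuffer keeps the sketch compatible with Java 8.
    StringBuffer out = new StringBuffer(query.length());
    do {
      // Copies the text since the previous match, then the replacement.
      matcher.appendReplacement(out, Matcher.quoteReplacement(REDACTED));
    } while (matcher.find());
    // Copies the remainder of the input after the last match.
    matcher.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // prints "user=bob&<redacted>&<redacted>&page=2"
    System.out.println(obfuscate("user=bob&password=hunter2&token=abc123&page=2"));
  }
}
```

`Matcher.quoteReplacement` matters when the replacement text could contain `$` or `\`, which `appendReplacement` would otherwise interpret as group references or escapes.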

Motivation

This change guarantees the regexp matching (and the query obfuscator's replacement) is always linear in the input length, reducing CPU spent on these paths during trace post-processing.

Additional Notes

Results of the provided benchmark:

  • Before the patch, some regular expressions exhibited non-linear growth:
image
  • After the patch, all regular expressions grow linearly:
image

Contributor Checklist

  • Format the title according to the contribution guidelines
  • Assign the type: and (comp: or inst:) labels in addition to any other useful labels
  • Avoid using close, fix, or any linking keywords when referencing an issue
    Use solves instead, and assign the PR milestone to the issue
  • Update the CODEOWNERS file on source file addition, migration, or deletion
  • Update public documentation with any new configuration flags or behaviors
  • Add your completed PR to the merge queue by commenting /merge. You can also:
    • Customize the commit message associated with the merge with /merge --commit-message "..."
    • Remove your PR from the merge queue with /merge -c
    • Skip all merge queue checks with /merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)
    • Get more information in this doc

Jira ticket: APPSEC-68339

@manuel-alvarez-alvarez manuel-alvarez-alvarez force-pushed the malvarez/iast-migrate-regexp-re2j branch from 95f7550 to 9cff660 Compare June 15, 2026 14:59
@manuel-alvarez-alvarez manuel-alvarez-alvarez changed the title perf: improve performance of regexps in IAST and query obfuscator Improve performance of regexps in IAST and query obfuscator Jun 15, 2026
Copilot AI left a comment

Pull request overview

This PR focuses on improving runtime performance and worst-case behavior of regexp-heavy code paths used in query obfuscation and IAST evidence redaction by switching several tokenizers to RE2J and reducing repeated string copying during replacements.

Changes:

  • Optimized query obfuscation replacement logic to avoid repeated full-string rebuilds during iterative replacements.
  • Migrated IAST “sensitive analyzer” tokenizers from java.util.regex to RE2J and adjusted patterns accordingly (including Oracle/Postgres SQL literal handling).
  • Added IAST tokenizer JMH benchmarks and introduced an evidence redaction iteration budget aligned with the existing truncation max length.
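The iteration budget in the last bullet can be sketched as follows. This is only an illustration of the idea, not the logic in `EvidenceAdapter.java`: the `Range` type, the `redact` method, and the `***` placeholder are hypothetical, and the sketch assumes the sensitive ranges arrive sorted, non-overlapping, and within the bounds of the evidence.

```java
import java.util.Arrays;
import java.util.List;

public class RedactionBudget {
  /** Hypothetical half-open range [start, end) marking a sensitive span. */
  static final class Range {
    final int start;
    final int end;

    Range(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Redacts sensitive ranges but never looks past maxLength characters of the evidence. */
  static String redact(String evidence, List<Range> sensitive, int maxLength) {
    int limit = Math.min(maxLength, evidence.length());
    StringBuilder out = new StringBuilder();
    int consumed = 0; // characters of the original evidence handled so far
    for (Range range : sensitive) {
      if (range.start >= limit) {
        break; // budget exhausted: later ranges would be truncated away anyway
      }
      out.append(evidence, consumed, range.start);
      out.append("***");
      consumed = Math.min(range.end, limit);
    }
    out.append(evidence, consumed, limit);
    return out.toString();
  }

  public static void main(String[] args) {
    List<Range> ranges = Arrays.asList(new Range(2, 4), new Range(8, 10));
    System.out.println(redact("abcdefghij", ranges, 6)); // prints "ab***ef"
  }
}
```

The point of the early `break` is that work done on ranges beyond the truncation limit never reaches the serialized output, so skipping it bounds the cost by the limit rather than by the evidence size.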

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| dd-trace-core/src/main/java/datadog/trace/core/tagprocessor/QueryObfuscator.java | Reworks query obfuscation to use matcher append APIs to reduce repeated string copying. |
| dd-trace-api/src/main/java/datadog/trace/api/ConfigDefaults.java | Exposes default IAST redaction patterns for cross-module fallback use. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/AbstractRegexTokenizer.java | Switches base tokenizer regex engine to RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/UrlRegexpTokenizer.java | Updates URL tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/LdapRegexTokenizer.java | Updates LDAP tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/CommandRegexpTokenizer.java | Switches command tokenizer to RE2J patterns. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/HeaderRegexpTokenizer.java | Switches header tokenizer to use RE2J Pattern. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SqlRegexpTokenizer.java | Refactors SQL tokenizer to avoid unsupported regex features and handle dialect specifics efficiently under RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SensitiveHandlerImpl.java | Compiles configurable redaction patterns with RE2J and adds fallback compilation behavior. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/model/json/EvidenceAdapter.java | Adds a max-consumed budget to stop redaction iteration once truncation limit is reached. |
| dd-java-agent/agent-iast/src/jmh/java/com/datadog/iast/sensitive/SensitiveTokenizerBenchmark.java | Adds JMH benchmarks covering pathological tokenizer inputs. |
| dd-java-agent/agent-iast/build.gradle | Adds RE2J dependency and excludes it from the shaded artifact to avoid duplication. |
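The "unsupported regex features" in the `SqlRegexpTokenizer.java` entry include back-references, which Oracle `q'...'` literals would otherwise call for (something like `q'(.).*?\1'`). A minimal sketch of the alternative, enumerating the delimiters explicitly, is shown here. It uses `java.util.regex` as a stand-in so it runs without extra dependencies; the pattern avoids back-references, so the same syntax should be acceptable to RE2/J, but that is an assumption and has not been checked against the PR's code. The delimiter set and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OracleQuoteEnumeration {
  // Hypothetical (open, close) delimiter pairs; the real tokenizer's set may differ.
  private static final char[][] DELIMITERS = {
    {'[', ']'}, {'{', '}'}, {'(', ')'}, {'<', '>'}, {'!', '!'}, {'#', '#'}, {'|', '|'}
  };

  /** Builds one alternative per delimiter pair instead of using a back-reference. */
  static Pattern buildPattern() {
    StringBuilder alternation = new StringBuilder();
    for (char[] pair : DELIMITERS) {
      if (alternation.length() > 0) {
        alternation.append('|');
      }
      alternation
          .append("[qQ]'\\").append(pair[0]) // q' followed by the escaped opening delimiter
          .append("(?s:.*?)")                 // shortest body, newlines included
          .append('\\').append(pair[1])       // escaped closing delimiter
          .append('\'');
    }
    return Pattern.compile(alternation.toString());
  }

  static List<String> findLiterals(String sql) {
    List<String> literals = new ArrayList<>();
    Matcher matcher = buildPattern().matcher(sql);
    while (matcher.find()) {
      literals.add(matcher.group());
    }
    return literals;
  }

  public static void main(String[] args) {
    // prints [q'[it's]', Q'!a!b!']
    System.out.println(findLiterals("SELECT q'[it's]' FROM dual WHERE x = Q'!a!b!'"));
  }
}
```

The trade-off is that only the enumerated delimiters are recognized, whereas a back-reference accepts any repeated character; in exchange the pattern stays within what a linear-time engine can match.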

@manuel-alvarez-alvarez manuel-alvarez-alvarez added the tag: performance, comp: asm iast, and type: enhancement labels Jun 15, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 15, 2026

Contributor

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
| --- | --- |
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Candidate | master | Δ (95% CI of mean) |
| --- | --- | --- | --- |
| startup:insecure-bank:iast:Agent | 14.00 s | 13.93 s | [-0.3%; +1.2%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.95 s | 13.05 s | [-1.6%; +0.1%] (no difference) |
| startup:petclinic:appsec:Agent | 16.89 s | 16.66 s | [+0.4%; +2.3%] (maybe worse) |
| startup:petclinic:iast:Agent | 16.85 s | 16.98 s | [-1.7%; +0.3%] (no difference) |
| startup:petclinic:profiling:Agent | 16.88 s | 16.93 s | [-1.3%; +0.7%] (no difference) |
| startup:petclinic:sca:Agent | 16.92 s | 16.85 s | [-0.5%; +1.4%] (no difference) |
| startup:petclinic:tracing:Agent | 16.09 s | 16.05 s | [-0.6%; +1.1%] (no difference) |

Commit: 92ebc2a6 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@manuel-alvarez-alvarez manuel-alvarez-alvarez marked this pull request as ready for review June 15, 2026 15:47
@manuel-alvarez-alvarez manuel-alvarez-alvarez requested review from a team as code owners June 15, 2026 15:47
@dd-octo-sts dd-octo-sts Bot added the tag: ai generated label Jun 15, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9cff660f50

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

@bric3 bric3 left a comment

Contributor

I haven't reviewed the code changes much but left some comments on re2j.

Also, I wonder if https://github.com/DataDog/java-reggie may be considered for this task, if it can handle the job.

Comment thread dd-java-agent/agent-iast/build.gradle Outdated
Comment thread dd-java-agent/agent-iast/build.gradle Outdated
Comment thread dd-java-agent/agent-iast/build.gradle Outdated
Comment thread dd-trace-api/src/main/java/datadog/trace/api/ConfigDefaults.java

@jandro996 jandro996 left a comment

Member

Some non-blocking comments, thanks for this!

@manuel-alvarez-alvarez

Member Author

@codex review

Copilot AI left a comment

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/UrlRegexpTokenizer.java:9

  • The RFC link in the javadoc is malformed (missing the closing quote/angle bracket after the URL), which breaks generated docs and IDE navigation.
/**
 * @see <a href="https://www.rfc-editor.org/rfc/rfc1738>Uniform Resource Locators (URL)</a>
 */

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6898657415

Comment thread gradle/libs.versions.toml Outdated

@jbachorik jbachorik left a comment

Contributor

A general comment: do you have an estimate of how big an improvement we are talking about? Could you perhaps add JMH microbenchmarks for the affected patterns, with sample input comparing JDK vs re2j, and attach the results to the PR description?

@manuel-alvarez-alvarez

Member Author

> A general comment: do you have an estimate of how big an improvement we are talking about? Could you perhaps add JMH microbenchmarks for the affected patterns, with sample input comparing JDK vs re2j, and attach the results to the PR description?

@jbachorik yes of course:

Before:
image

After:
image

These are coming from the provided benchmark.

@bric3

bric3 commented Jun 23, 2026

Contributor

@manuel-alvarez-alvarez IMHO, the benchmark pics deserve to be in the main description:

#11649 (comment)

@manuel-alvarez-alvarez

Member Author

> @manuel-alvarez-alvarez IMHO, the benchmark pics deserve to be in the main description:

#11649 (comment)

Hello @bric3, added the pics to the PR description, thanks for the feedback.

@manuel-alvarez-alvarez manuel-alvarez-alvarez added this pull request to the merge queue Jun 23, 2026
@manuel-alvarez-alvarez manuel-alvarez-alvarez removed this pull request from the merge queue due to a manual request Jun 23, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 23, 2026

Contributor

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Jun 23, 2026


View all feedbacks in Devflow UI.

2026-06-23 10:45:33 UTC ℹ️ Start processing command /merge


2026-06-23 10:45:38 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).


2026-06-23 10:47:59 UTC ⚠️ MergeQueue: This merge request build was cancelled

manuel.alvarezalvarez@datadoghq.com cancelled this merge request build

@manuel-alvarez-alvarez

Member Author

/merge -c

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Jun 23, 2026


View all feedbacks in Devflow UI.

2026-06-23 10:47:42 UTC ℹ️ Start processing command /merge -c

Migrate the IAST evidence-redaction regexps to RE2/J for linear-time
matching. RE2/J has no back-references, so the SQL tokenizer is reworked
to find Postgres dollar-quoted literals via a precomputed tag index
(binary search) and to enumerate Oracle q'...' delimiters explicitly
instead of relying on a back-reference. Configured redaction patterns
that are valid under java.util.regex but unsupported by RE2/J fall back
to the defaults instead of failing to compile.

Replace the query obfuscator's `while (matcher.find())` + per-match
`Strings.replace` loop (O(N*Q)) with a single appendReplacement /
appendTail pass (O(Q)).

Add JUnit 5 tests for the tokenizers and the obfuscator, a tokenizer
JMH benchmark, and migrate SensitiveHandlerTest from Groovy to JUnit 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
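
The commit message above mentions finding Postgres dollar-quoted literals through a precomputed tag index and a binary search. One way that idea could look is sketched below; it is an assumption about the approach, not the code from `SqlRegexpTokenizer.java`. A single back-reference-free pass records the offset of every `$tag$` token, grouped by tag, and each opening token is then paired with the next token carrying the same tag. The class and method names are hypothetical, `java.util.regex` stands in for RE2/J, and edge cases such as `$` inside identifiers are ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DollarQuoteIndex {
  // A dollar-quote delimiter: $$ or $tag$. Finding these needs no back-reference.
  private static final Pattern TAG = Pattern.compile("\\$(?:[A-Za-z_][A-Za-z0-9_]*)?\\$");

  /** Returns the [start, end) offsets of each dollar-quoted literal, in order of appearance. */
  static List<int[]> findLiterals(String sql) {
    List<Integer> starts = new ArrayList<>();
    List<String> tags = new ArrayList<>();
    // tag text -> ascending offsets at which that tag occurs
    Map<String, List<Integer>> index = new HashMap<>();
    Matcher matcher = TAG.matcher(sql);
    while (matcher.find()) {
      starts.add(matcher.start());
      tags.add(matcher.group());
      index.computeIfAbsent(matcher.group(), key -> new ArrayList<>()).add(matcher.start());
    }

    List<int[]> literals = new ArrayList<>();
    int cursor = 0; // end of the last literal emitted
    for (int i = 0; i < starts.size(); i++) {
      int open = starts.get(i);
      if (open < cursor) {
        continue; // this token sits inside a literal that was already emitted
      }
      String tag = tags.get(i);
      List<Integer> sameTag = index.get(tag);
      // First occurrence of the same tag at or after the end of the opening token.
      int found = Collections.binarySearch(sameTag, open + tag.length());
      int closeIndex = found >= 0 ? found : -found - 1;
      if (closeIndex < sameTag.size()) {
        int end = sameTag.get(closeIndex) + tag.length();
        literals.add(new int[] {open, end});
        cursor = end;
      }
    }
    return literals;
  }

  public static void main(String[] args) {
    String sql = "SELECT $$a$$, $fn$ body $$ x $$ $fn$, 1";
    for (int[] literal : findLiterals(sql)) {
      System.out.println(sql.substring(literal[0], literal[1]));
    }
  }
}
```

For the sample input in `main`, the sketch reports `$$a$$` and `$fn$ body $$ x $$ $fn$`: the inner `$$` tokens are skipped because they fall inside the `$fn$` literal.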
@manuel-alvarez-alvarez manuel-alvarez-alvarez force-pushed the malvarez/iast-migrate-regexp-re2j branch from cdce1ba to 92ebc2a Compare June 23, 2026 11:07
@manuel-alvarez-alvarez manuel-alvarez-alvarez added this pull request to the merge queue Jun 23, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 23, 2026

Contributor

/merge

@gh-worker-devflow-routing-ef8351

gh-worker-devflow-routing-ef8351 Bot commented Jun 23, 2026


View all feedbacks in Devflow UI.

2026-06-23 13:20:50 UTC ℹ️ Start processing command /merge


2026-06-23 13:20:56 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in master is approximately 1h (p90).


2026-06-23 14:32:14 UTC ℹ️ MergeQueue: This merge request was merged

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 23, 2026
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit b95abba into master Jun 23, 2026
583 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the malvarez/iast-migrate-regexp-re2j branch June 23, 2026 14:32
@github-actions github-actions Bot added this to the 1.64.0 milestone Jun 23, 2026

Labels

comp: asm iast, tag: ai generated, tag: performance, type: enhancement

5 participants