Improve performance of regexps in IAST and query obfuscator#11649
Conversation
95f7550 to
9cff660
Compare
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR focuses on improving runtime performance and worst-case behavior of regexp-heavy code paths used in query obfuscation and IAST evidence redaction by switching several tokenizers to RE2J and reducing repeated string copying during replacements.
Changes:
- Optimized query obfuscation replacement logic to avoid repeated full-string rebuilds during iterative replacements.
- Migrated IAST “sensitive analyzer” tokenizers from
java.util.regexto RE2J and adjusted patterns accordingly (including Oracle/Postgres SQL literal handling). - Added IAST tokenizer JMH benchmarks and introduced an evidence redaction iteration budget aligned with the existing truncation max length.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| dd-trace-core/src/main/java/datadog/trace/core/tagprocessor/QueryObfuscator.java | Reworks query obfuscation to use matcher append APIs to reduce repeated string copying. |
| dd-trace-api/src/main/java/datadog/trace/api/ConfigDefaults.java | Exposes default IAST redaction patterns for cross-module fallback use. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/AbstractRegexTokenizer.java | Switches base tokenizer regex engine to RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/UrlRegexpTokenizer.java | Updates URL tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/LdapRegexTokenizer.java | Updates LDAP tokenizer to RE2J and RE2-style named groups. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/CommandRegexpTokenizer.java | Switches command tokenizer to RE2J patterns. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/HeaderRegexpTokenizer.java | Switches header tokenizer to use RE2J Pattern. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SqlRegexpTokenizer.java | Refactors SQL tokenizer to avoid unsupported regex features and handle dialect specifics efficiently under RE2J. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/SensitiveHandlerImpl.java | Compiles configurable redaction patterns with RE2J and adds fallback compilation behavior. |
| dd-java-agent/agent-iast/src/main/java/com/datadog/iast/model/json/EvidenceAdapter.java | Adds a max-consumed budget to stop redaction iteration once truncation limit is reached. |
| dd-java-agent/agent-iast/src/jmh/java/com/datadog/iast/sensitive/SensitiveTokenizerBenchmark.java | Adds JMH benchmarks covering pathological tokenizer inputs. |
| dd-java-agent/agent-iast/build.gradle | Adds RE2J dependency and excludes it from the shaded artifact to avoid duplication. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9cff660f50
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
bric3
left a comment
There was a problem hiding this comment.
I haven't reviewed the code changes much but left some coment on re2j.
Also, I wonder if https://github.com/DataDog/java-reggie may be considered for this task, if it can handle the job.
jandro996
left a comment
There was a problem hiding this comment.
Some non blocking comments, thanks for this!
|
@codex review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
dd-java-agent/agent-iast/src/main/java/com/datadog/iast/sensitive/UrlRegexpTokenizer.java:9
- The RFC link in the javadoc is malformed (missing the closing quote/angle bracket after the URL), which breaks generated docs and IDE navigation.
/**
* @see <a href="https://www.rfc-editor.org/rfc/rfc1738>Uniform Resource Locators (URL)</a>
*/
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6898657415
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
jbachorik
left a comment
There was a problem hiding this comment.
A general comment - do you have an estimation about how big improvement we are talking? Can you, perhaps, add JMH microbenchmarks for the affected patterns and sample input for JDK vs re2j and attach the results in the PR desc?
@jbachorik yes of course: These are coming from the provided benchmark. |
|
@manuel-alvarez-alvarez Imho, the benchmark pic deserve to be in the main description: |
Hello @bric3, added the pics to the PR description, thanks for the feedback. |
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
manuel.alvarezalvarez@datadoghq.com cancelled this merge request build |
|
/merge -c |
|
View all feedbacks in Devflow UI.
|
Migrate the IAST evidence-redaction regexps to RE2/J for linear-time matching. RE2/J has no back-references, so the SQL tokenizer is reworked to find Postgres dollar-quoted literals via a precomputed tag index (binary search) and to enumerate Oracle q'...' delimiters explicitly instead of relying on a back-reference. Configured redaction patterns that are valid under java.util.regex but unsupported by RE2/J fall back to the defaults instead of failing to compile. Replace the query obfuscator's `while (matcher.find())` + per-match `Strings.replace` loop (O(N*Q)) with a single appendReplacement / appendTail pass (O(Q)). Add JUnit 5 tests for the tokenizers and the obfuscator, a tokenizer JMH benchmark, and migrate SensitiveHandlerTest from Groovy to JUnit 5. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cdce1ba to
92ebc2a
Compare
|
/merge |
|
View all feedbacks in Devflow UI.
The expected merge time in
|


What Does This Do
while (matcher.find())+ per-matchStrings.replaceloop (O(N×Q)) with a singleMatcher.appendReplacement/appendTailpass (O(Q)).Motivation
This change guarantees the regexp matching (and the query obfuscator's replacement) is always linear in the input length, reducing CPU spent on these paths during trace post-processing.
Additional Notes
Results of the provided benchmark:
Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issue/merge. You can also:/merge --commit-message "..."/merge -c/merge -f --reason "reason"; please use this judiciously, as some checks do not run at the PR-level (note: the PR still needs to be mergeable, this will only skip the pre-merge build)Jira ticket: APPSEC-68339