🍒 11649 - Improve performance of regexps in IAST and query obfuscator #11710
manuel-alvarez-alvarez wants to merge 1 commit into
Migrate the IAST evidence-redaction regexps to RE2/J for linear-time matching. RE2/J has no back-references, so the SQL tokenizer is reworked to find Postgres dollar-quoted literals via a precomputed tag index (binary search) and to enumerate Oracle q'...' delimiters explicitly instead of relying on a back-reference. Configured redaction patterns that are valid under java.util.regex but unsupported by RE2/J fall back to the defaults instead of failing to compile.

Replace the query obfuscator's `while (matcher.find())` + per-match `Strings.replace` loop (O(N*Q)) with a single appendReplacement / appendTail pass (O(Q)).

Add JUnit 5 tests for the tokenizers and the obfuscator, a tokenizer JMH benchmark, and migrate SensitiveHandlerTest from Groovy to JUnit 5.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 92ebc2a)
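As a rough illustration of the single appendReplacement / appendTail pass mentioned above, here is a minimal sketch. It uses `java.util.regex` so that it runs without extra dependencies. The class name `QueryObfuscatorSketch`, the `SENSITIVE` pattern, and the `<redacted>` placeholder are all hypothetical and are not taken from the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the tracer's actual obfuscator.
final class QueryObfuscatorSketch {
  // Assumed pattern: key=value pairs whose key looks sensitive.
  private static final Pattern SENSITIVE =
      Pattern.compile("(?i)(?:password|token|secret)=[^&]*");

  // Assumed placeholder; quoteReplacement guards against '$' and '\' in it.
  private static final String REPLACEMENT = Matcher.quoteReplacement("<redacted>");

  static String obfuscate(String query) {
    Matcher matcher = SENSITIVE.matcher(query);
    // StringBuffer keeps the sketch compatible with Java 8.
    StringBuffer out = new StringBuffer(query.length());
    while (matcher.find()) {
      // Copies the text since the previous match, then the replacement.
      matcher.appendReplacement(out, REPLACEMENT);
    }
    // Copies whatever follows the last match.
    matcher.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    // prints "user=bob&<redacted>&page=3&<redacted>"
    System.out.println(obfuscate("user=bob&password=hunter2&page=3&token=abc"));
  }
}
```

Each character of the input is copied into the output buffer once, whereas rebuilding the whole string on every match copies it again for each match found.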
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a004f21de6
    "\\$(?<ESCAPE>[^$]*?)\\$.*?\\$\\k<ESCAPE>\\$";
    private static final String ORACLE_ESCAPED_LITERAL = buildOracleEscapedLiteral();
    // $$ or $tag$ where tag is a SQL identifier
    private static final String POSTGRESQL_ESCAPED_LITERAL = "\\$(?:[a-zA-Z_]\\w*)?\\$";
Preserve PostgreSQL dollar-quote tags with non-ASCII identifiers
For PostgreSQL, dollar-quote tags follow unquoted identifier rules, which can include non-ASCII letters. Restricting the opener to ASCII means a valid literal like `SELECT $é$secret$é$` is no longer recognized as a dollar-quoted string; the new tokenizer then skips the unmatched `$...$` token and the literal body is not redacted. Please keep this pattern and the tag indexer aligned with PostgreSQL identifier characters rather than only `[a-zA-Z_]\w*`.
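One possible shape for such a fix is sketched below, under two assumptions: that PostgreSQL accepts any non-ASCII character inside a dollar-quote tag, and that the `\x{...}` code point escape is available in the regex engine in use (it is in `java.util.regex`, which this sketch relies on). The class `DollarQuoteSketch` and the method `firstLiteral` are hypothetical names. The closing tag is located with a plain `indexOf` scan rather than the precomputed tag index with binary search that the PR describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the PR's tokenizer.
final class DollarQuoteSketch {
  // $$ or $tag$; the tag may start with an ASCII letter, '_' or any
  // non-ASCII code point, and may continue with those plus ASCII digits.
  private static final Pattern OPENER = Pattern.compile(
      "\\$(?:[A-Za-z_\\x{80}-\\x{10FFFF}][A-Za-z0-9_\\x{80}-\\x{10FFFF}]*)?\\$");

  /** Returns {start, end} (end exclusive) of the first dollar-quoted literal, or null. */
  static int[] firstLiteral(String sql) {
    Matcher m = OPENER.matcher(sql);
    int from = 0;
    while (m.find(from)) {
      String tag = m.group();
      // No back-reference: search for the same tag text after the opener.
      int close = sql.indexOf(tag, m.end());
      if (close >= 0) {
        return new int[] {m.start(), close + tag.length()};
      }
      // Unmatched opener: retry one character later.
      from = m.start() + 1;
    }
    return null;
  }

  public static void main(String[] args) {
    String sql = "SELECT $\u00e9$secret$\u00e9$";
    int[] span = firstLiteral(sql);
    System.out.println(span == null ? "no literal" : sql.substring(span[0], span[1]));
  }
}
```

With the ASCII-only opener, the same input would match `$secret$` as an opener with no closing tag, which is the failure mode the review comment describes.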
Debugger benchmarks

Summary: Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 5 unstable metrics.

Request duration reports (CI 0.99; candidate=None, baseline=None)

| Scenario | Baseline | Baseline interval (µs) | Candidate | Candidate interval (µs) |
|---|---|---|---|---|
| noprobe | 330.65 µs | 309 to 353 | 338.283 µs | 304 to 373 |
| basic | 296.443 µs | 290 to 303 | 298.029 µs | 291 to 305 |
| loop | 8.982 ms | 8977 to 8987 | 8.983 ms | 8978 to 8988 |
Backport #11649 to release/v1.63.x