fix: use runes instead of codeUnits in AhoCorasick.search#32
Conversation
📝 WalkthroughWalkthroughFixed Unicode handling across the text-filtering stack: AhoCorasick.search now iterates Unicode runes (matching addWord), SafeTextFilter rewritten to operate on runes for normalization, matching, and slicing, and tests added to verify correct masking of out‑of‑BMP emoji. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)
107-150:⚠️ Potential issue | 🔴 CriticalConvert rune indices to code-unit indices in
search()return values — current indices cause incorrect text masking for non-BMP characters.Line 149 stores rune loop counter
ias the match end index. Downstream consumers (lines 214–215, 259, 289 insafe_text_filter.dart) use these indices with code-unit APIs:word.length,substring(), andcodeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.Proposed fix
- final units = textLower.runes.toList(); - - for (int i = 0; i < units.length; i++) { - final rune = units[i]; + int codeUnitEndIndex = -1; + for (final rune in textLower.runes) { + codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1; - if (current.outputs.isNotEmpty) { - matches.putIfAbsent(i, () => []).addAll(current.outputs); + if (current.outputs.isNotEmpty) { + matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs); }Also update the docstring to clarify UTF-16 code-unit indices.
|
@KirthiSaiT can you please only push the change regarding runes in this PR and add test cases as mentioned previously. |
- Updated SafeTextFilter to use runes consistently for searching and filtering. - Optimized rune conversions to maintain Aho-Corasick performance. - Added unit test for emoji filtering.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
lib/src/safe_text_filter.dart (1)
356-358: Performance: String allocation in hot path.
_isAlphanumericis called for every word boundary check during matching. Creating a newStringfor each rune has allocation overhead. For most use cases this is acceptable, but for high-throughput scenarios, consider caching common checks or using direct code point range comparisons for ASCII.♻️ Optional: Fast-path for ASCII alphanumerics
static bool _isAlphanumeric(int rune) { + // Fast path for ASCII + if ((rune >= 0x30 && rune <= 0x39) || // 0-9 + (rune >= 0x41 && rune <= 0x5A) || // A-Z + (rune >= 0x61 && rune <= 0x7A)) { // a-z + return true; + } + // Fallback for non-ASCII Unicode letters/digits return _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/src/safe_text_filter.dart` around lines 356 - 358, _isAlphanumeric currently allocates a new String via String.fromCharCode(rune) on every call; replace that hot-path allocation with an ASCII fast-path: check rune ranges for '0'-'9', 'A'-'Z', and 'a'-'z' and return true immediately for those, and only for non-ASCII code points fall back to _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@lib/src/safe_text_filter.dart`:
- Around line 356-358: _isAlphanumeric currently allocates a new String via
String.fromCharCode(rune) on every call; replace that hot-path allocation with
an ASCII fast-path: check rune ranges for '0'-'9', 'A'-'Z', and 'a'-'z' and
return true immediately for those, and only for non-ASCII code points fall back
to _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function
name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize
changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: aa5123ba-fb67-41d4-9d20-7672ec2cc758
📒 Files selected for processing (2)
lib/src/safe_text_filter.darttest/profanity_filter_test.dart
|
@master-wayne7 pushed only the bug fix alone! |
|
@KirthiSaiT the CI tests are failing, its a small formatting issue. Can you please push one more commit after running |
Do take a pull from develop branch to avoid any merge conflicts |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/aho_corasick_test.dart`:
- Around line 59-66: The test "case-insensitive search with Unicode" only adds
an emoji so it doesn't exercise case-folding; update the test to also add a real
word that changes under case operations (e.g., call ac.addWord('poop') or a
Unicode-sensitive token like 'straße') before ac.buildFailureLinks(), then call
ac.search('POOP 💩') and assert that the returned matches include the
case-folded token (verify matches contains 'poop' or the chosen token) so
ac.addWord, ac.buildFailureLinks, and ac.search actually validate
case-insensitive behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6cb02be5-4a1b-4e78-a91f-3e42db97b494
📒 Files selected for processing (2)
lib/src/safe_text_filter.darttest/aho_corasick_test.dart
🚧 Files skipped from review as they are similar to previous changes (1)
- lib/src/safe_text_filter.dart
|
@master-wayne7 the cli test cases all passed |
Description
addWordbuilds the trie using.runes(Unicode code points) butsearchwas traversing using
.codeUnits(UTF-16 code units). For characters aboveU+FFFF these produce different integer sequences, causing those words to
silently pass through the filter undetected.
Fixed by changing
textLower.codeUnitstotextLower.runes.toList()inAhoCorasick.searchso both methods use the same encoding.Related Issue
Closes #31
Type of Change
Checklist
Summary by CodeRabbit
Bug Fixes
Documentation
Tests