Skip to content

fix: use runes instead of codeUnits in AhoCorasick.search#32

Merged
master-wayne7 merged 3 commits into
master-wayne7:developfrom
KirthiSaiT:fix/aho-corasick-runes-codeunits
Apr 20, 2026
Merged

fix: use runes instead of codeUnits in AhoCorasick.search#32
master-wayne7 merged 3 commits into
master-wayne7:developfrom
KirthiSaiT:fix/aho-corasick-runes-codeunits

Conversation

@KirthiSaiT

@KirthiSaiT KirthiSaiT commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Description

addWord builds the trie using .runes (Unicode code points) but search
was traversing using .codeUnits (UTF-16 code units). For characters above
U+FFFF these produce different integer sequences, causing those words to
silently pass through the filter undetected.
Fixed by changing textLower.codeUnits to textLower.runes.toList() in
AhoCorasick.search so both methods use the same encoding.

Related Issue

Closes #31

Type of Change

  • Bug fix

Checklist

  • Code compiles
  • Tests added
  • No breaking changes

Summary by CodeRabbit

  • Bug Fixes

    • Improved Unicode handling so multibyte and non‑BMP characters (e.g., emojis) are matched, detected, and masked correctly across search and filtering.
  • Documentation

    • Clarified search behavior, case‑insensitivity, trie semantics, and match index semantics (zero‑based inclusive end indices).
  • Tests

    • Added tests covering Unicode/multibyte matches and masking, including out‑of‑BMP emoji scenarios.

@coderabbitai

coderabbitai Bot commented Apr 11, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Fixed Unicode handling across the text-filtering stack: AhoCorasick.search now iterates Unicode runes (matching addWord), SafeTextFilter rewritten to operate on runes for normalization, matching, and slicing, and tests added to verify correct masking of out‑of‑BMP emoji.

Changes

Cohort / File(s) Summary
Aho-Corasick trie
lib/src/aho_corasick.dart
Expanded documentation for TrieNode and AhoCorasick (semantics, fail links, outputs). Fixed Unicode bug by switching search from textLower.codeUnits to textLower.runes.toList(). Clarified API expectations for addWord and buildFailureLinks.
Safe text filtering (rune-based rewrite)
lib/src/safe_text_filter.dart
Replaced code-unit logic with rune-based normalization and matching: added _normalizeToRunes, changed normalizeText, containsBadWord, _hasMatch, _addMatchesForWord, filterText, and _isWordBoundary to operate on List<int> runes and use String.fromCharCodes for slicing. Adjusted alphanumeric checks to use runes.
Tests
test/profanity_filter_test.dart, test/aho_corasick_test.dart
Added tests validating masking of out‑of‑BMP emoji and comprehensive AhoCorasick Unicode/rune-indexed search behavior (ASCII, emoji, mixed terms, case-insensitivity, overlapping matches).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • master-wayne7

Poem

🐰 I hopped through runes both near and far,

Found emojis hiding like a star,
Now patterns meet where codepoints play,
Masks fall true in proper array,
A little hop, and bugs away!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: use runes instead of codeUnits in AhoCorasick.search' directly and accurately describes the primary change: fixing the bug where search used codeUnits instead of runes.
Linked Issues check ✅ Passed The PR directly addresses issue #31 by changing AhoCorasick.search to use runes instead of codeUnits, fixing supplementary Unicode character matching; all related code changes in aho_corasick.dart and supporting files are aligned with the bug fix objective.
Out of Scope Changes check ✅ Passed Changes to safe_text_filter.dart and profanity_filter_test.dart extend rune-based handling downstream to maintain consistency with the AhoCorasick fix; test additions validate the core bug fix for emoji/Unicode characters.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/src/aho_corasick.dart (1)

107-150: ⚠️ Potential issue | 🔴 Critical

Convert rune indices to code-unit indices in search() return values — current indices cause incorrect text masking for non-BMP characters.

Line 149 stores rune loop counter i as the match end index. Downstream consumers (lines 214–215, 259, 289 in safe_text_filter.dart) use these indices with code-unit APIs: word.length, substring(), and codeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.

Proposed fix
-    final units = textLower.runes.toList();
-
-    for (int i = 0; i < units.length; i++) {
-      final rune = units[i];
+    int codeUnitEndIndex = -1;
+    for (final rune in textLower.runes) {
+      codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1;

-      if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+      if (current.outputs.isNotEmpty) {
+        matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs);
       }

Also update the docstring to clarify UTF-16 code-unit indices.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff03c41b-f7a2-4293-85d6-3df476584066

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 11fa705.

📒 Files selected for processing (1)
  • lib/src/aho_corasick.dart

@master-wayne7

Copy link
Copy Markdown
Owner

@KirthiSaiT can you please only push the change regarding runes in this PR and add test cases as mentioned previously.

- Updated SafeTextFilter to use runes consistently for searching and filtering.
- Optimized rune conversions to maintain Aho-Corasick performance.
- Added unit test for emoji filtering.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
lib/src/safe_text_filter.dart (1)

356-358: Performance: String allocation in hot path.

_isAlphanumeric is called for every word boundary check during matching. Creating a new String for each rune has allocation overhead. For most use cases this is acceptable, but for high-throughput scenarios, consider caching common checks or using direct code point range comparisons for ASCII.

♻️ Optional: Fast-path for ASCII alphanumerics
 static bool _isAlphanumeric(int rune) {
+  // Fast path for ASCII
+  if ((rune >= 0x30 && rune <= 0x39) ||  // 0-9
+      (rune >= 0x41 && rune <= 0x5A) ||  // A-Z
+      (rune >= 0x61 && rune <= 0x7A)) {  // a-z
+    return true;
+  }
+  // Fallback for non-ASCII Unicode letters/digits
   return _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/src/safe_text_filter.dart` around lines 356 - 358, _isAlphanumeric
currently allocates a new String via String.fromCharCode(rune) on every call;
replace that hot-path allocation with an ASCII fast-path: check rune ranges for
'0'-'9', 'A'-'Z', and 'a'-'z' and return true immediately for those, and only
for non-ASCII code points fall back to
_unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function
name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize
changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@lib/src/safe_text_filter.dart`:
- Around line 356-358: _isAlphanumeric currently allocates a new String via
String.fromCharCode(rune) on every call; replace that hot-path allocation with
an ASCII fast-path: check rune ranges for '0'-'9', 'A'-'Z', and 'a'-'z' and
return true immediately for those, and only for non-ASCII code points fall back
to _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function
name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aa5123ba-fb67-41d4-9d20-7672ec2cc758

📥 Commits

Reviewing files that changed from the base of the PR and between 11fa705 and 60b09ee.

📒 Files selected for processing (2)
  • lib/src/safe_text_filter.dart
  • test/profanity_filter_test.dart

@KirthiSaiT

Copy link
Copy Markdown
Contributor Author

@master-wayne7 pushed only the bug fix alone!

@master-wayne7

Copy link
Copy Markdown
Owner

@KirthiSaiT the CI tests are failing, its a small formatting issue. Can you please push one more commit after running dart format . ?

@master-wayne7

master-wayne7 commented Apr 14, 2026

Copy link
Copy Markdown
Owner

@KirthiSaiT the CI tests are failing, its a small formatting issue. Can you please push one more commit after running dart format . ?

Do take a pull from develop branch to avoid any merge conflicts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/aho_corasick_test.dart`:
- Around line 59-66: The test "case-insensitive search with Unicode" only adds
an emoji so it doesn't exercise case-folding; update the test to also add a real
word that changes under case operations (e.g., call ac.addWord('poop') or a
Unicode-sensitive token like 'straße') before ac.buildFailureLinks(), then call
ac.search('POOP 💩') and assert that the returned matches include the
case-folded token (verify matches contains 'poop' or the chosen token) so
ac.addWord, ac.buildFailureLinks, and ac.search actually validate
case-insensitive behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6cb02be5-4a1b-4e78-a91f-3e42db97b494

📥 Commits

Reviewing files that changed from the base of the PR and between 60b09ee and 02bed1e.

📒 Files selected for processing (2)
  • lib/src/safe_text_filter.dart
  • test/aho_corasick_test.dart
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/src/safe_text_filter.dart

Comment thread test/aho_corasick_test.dart
@KirthiSaiT

Copy link
Copy Markdown
Contributor Author

@master-wayne7 the cli test cases all passed

@master-wayne7 master-wayne7 merged commit 872f009 into master-wayne7:develop Apr 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] emoji and supplementary Unicode characters silently fail to match in AhoCorasick search

2 participants