fix: use runes instead of codeUnits in AhoCorasick.search by KirthiSaiT · Pull Request #32 · master-wayne7/safe_text

KirthiSaiT · 2026-04-11T15:55:16Z

Description

addWord builds the trie using .runes (Unicode code points) but search
was traversing using .codeUnits (UTF-16 code units). For characters above
U+FFFF these produce different integer sequences, causing those words to
silently pass through the filter undetected.
Fixed by changing textLower.codeUnits to textLower.runes.toList() in
AhoCorasick.search so both methods use the same encoding.

Related Issue

Closes #31

Type of Change

Bug fix

Checklist

Code compiles
Tests added
No breaking changes

Summary by CodeRabbit

Bug Fixes
- Improved Unicode handling so multibyte and non‑BMP characters (e.g., emojis) are matched, detected, and masked correctly across search and filtering.
Documentation
- Clarified search behavior, case‑insensitivity, trie semantics, and match index semantics (zero‑based inclusive end indices).
Tests
- Added tests covering Unicode/multibyte matches and masking, including out‑of‑BMP emoji scenarios.

coderabbitai · 2026-04-11T15:55:30Z

📝 Walkthrough

Walkthrough

Fixed Unicode handling across the text-filtering stack: AhoCorasick.search now iterates Unicode runes (matching addWord), SafeTextFilter rewritten to operate on runes for normalization, matching, and slicing, and tests added to verify correct masking of out‑of‑BMP emoji.

Changes

Cohort / File(s)	Summary
Aho-Corasick trie `lib/src/aho_corasick.dart`	Expanded documentation for `TrieNode` and `AhoCorasick` (semantics, `fail` links, outputs). Fixed Unicode bug by switching `search` from `textLower.codeUnits` to `textLower.runes.toList()`. Clarified API expectations for `addWord` and `buildFailureLinks`.
Safe text filtering (rune-based rewrite) `lib/src/safe_text_filter.dart`	Replaced code-unit logic with rune-based normalization and matching: added `_normalizeToRunes`, changed `normalizeText`, `containsBadWord`, `_hasMatch`, `_addMatchesForWord`, `filterText`, and `_isWordBoundary` to operate on `List<int>` runes and use `String.fromCharCodes` for slicing. Adjusted alphanumeric checks to use runes.
Tests `test/profanity_filter_test.dart`, `test/aho_corasick_test.dart`	Added tests validating masking of out‑of‑BMP emoji and comprehensive AhoCorasick Unicode/rune-indexed search behavior (ASCII, emoji, mixed terms, case-insensitivity, overlapping matches).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

docs: improve API reference coverage for pub.dev #30 — similar changes switching AhoCorasick search iteration from UTF‑16 codeUnits to Unicode runes and related rune-based SafeTextFilter updates.

Suggested reviewers

master-wayne7

Poem

🐰 I hopped through runes both near and far,

Found emojis hiding like a star,
Now patterns meet where codepoints play,
Masks fall true in proper array,
A little hop, and bugs away!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: use runes instead of codeUnits in AhoCorasick.search' directly and accurately describes the primary change: fixing the bug where search used codeUnits instead of runes.
Linked Issues check	✅ Passed	The PR directly addresses issue `#31` by changing AhoCorasick.search to use runes instead of codeUnits, fixing supplementary Unicode character matching; all related code changes in aho_corasick.dart and supporting files are aligned with the bug fix objective.
Out of Scope Changes check	✅ Passed	Changes to safe_text_filter.dart and profanity_filter_test.dart extend rune-based handling downstream to maintain consistency with the AhoCorasick fix; test additions validate the core bug fix for emoji/Unicode characters.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/src/aho_corasick.dart (1)
107-150: ⚠️ Potential issue | 🔴 Critical

Convert rune indices to code-unit indices in search() return values — current indices cause incorrect text masking for non-BMP characters.

Line 149 stores rune loop counter i as the match end index. Downstream consumers (lines 214–215, 259, 289 in safe_text_filter.dart) use these indices with code-unit APIs: word.length, substring(), and codeUnitAt(). For non-BMP characters (emoji, certain Unicode scripts), a single rune encodes as 2 code units in UTF-16, causing index mismatch and incorrect boundary calculations.
Proposed fix
-    final units = textLower.runes.toList();
-
-    for (int i = 0; i < units.length; i++) {
-      final rune = units[i];
+    int codeUnitEndIndex = -1;
+    for (final rune in textLower.runes) {
+      codeUnitEndIndex += (rune > 0xFFFF) ? 2 : 1;

-      if (current.outputs.isNotEmpty) {
-        matches.putIfAbsent(i, () => []).addAll(current.outputs);
+      if (current.outputs.isNotEmpty) {
+        matches.putIfAbsent(codeUnitEndIndex, () => []).addAll(current.outputs);
       }
Also update the docstring to clarify UTF-16 code-unit indices.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff03c41b-f7a2-4293-85d6-3df476584066

📥 Commits

Reviewing files that changed from the base of the PR and between 40ca9ce and 11fa705.

📒 Files selected for processing (1)

lib/src/aho_corasick.dart

master-wayne7 · 2026-04-13T11:16:05Z

@KirthiSaiT can you please only push the change regarding runes in this PR and add test cases as mentioned previously.

- Updated SafeTextFilter to use runes consistently for searching and filtering. - Optimized rune conversions to maintain Aho-Corasick performance. - Added unit test for emoji filtering.

coderabbitai

🧹 Nitpick comments (1)

lib/src/safe_text_filter.dart (1)
356-358: Performance: String allocation in hot path.

_isAlphanumeric is called for every word boundary check during matching. Creating a new String for each rune has allocation overhead. For most use cases this is acceptable, but for high-throughput scenarios, consider caching common checks or using direct code point range comparisons for ASCII.
♻️ Optional: Fast-path for ASCII alphanumerics
 static bool _isAlphanumeric(int rune) {
+  // Fast path for ASCII
+  if ((rune >= 0x30 && rune <= 0x39) ||  // 0-9
+      (rune >= 0x41 && rune <= 0x5A) ||  // A-Z
+      (rune >= 0x61 && rune <= 0x7A)) {  // a-z
+    return true;
+  }
+  // Fallback for non-ASCII Unicode letters/digits
   return _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/src/safe_text_filter.dart` around lines 356 - 358, _isAlphanumeric
currently allocates a new String via String.fromCharCode(rune) on every call;
replace that hot-path allocation with an ASCII fast-path: check rune ranges for
'0'-'9', 'A'-'Z', and 'a'-'z' and return true immediately for those, and only
for non-ASCII code points fall back to
_unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function
name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize
changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@lib/src/safe_text_filter.dart`:
- Around line 356-358: _isAlphanumeric currently allocates a new String via
String.fromCharCode(rune) on every call; replace that hot-path allocation with
an ASCII fast-path: check rune ranges for '0'-'9', 'A'-'Z', and 'a'-'z' and
return true immediately for those, and only for non-ASCII code points fall back
to _unicodeLetterOrDigit.hasMatch(String.fromCharCode(rune)); keep the function
name _isAlphanumeric and _unicodeLetterOrDigit as the fallback to minimize
changes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aa5123ba-fb67-41d4-9d20-7672ec2cc758

📥 Commits

Reviewing files that changed from the base of the PR and between 11fa705 and 60b09ee.

📒 Files selected for processing (2)

lib/src/safe_text_filter.dart
test/profanity_filter_test.dart

KirthiSaiT · 2026-04-13T17:41:23Z

@master-wayne7 pushed only the bug fix alone!

master-wayne7 · 2026-04-14T04:49:07Z

@KirthiSaiT the CI tests are failing, its a small formatting issue. Can you please push one more commit after running dart format . ?

master-wayne7 · 2026-04-14T11:38:11Z

@KirthiSaiT the CI tests are failing, its a small formatting issue. Can you please push one more commit after running dart format . ?

Do take a pull from develop branch to avoid any merge conflicts

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/aho_corasick_test.dart`:
- Around line 59-66: The test "case-insensitive search with Unicode" only adds
an emoji so it doesn't exercise case-folding; update the test to also add a real
word that changes under case operations (e.g., call ac.addWord('poop') or a
Unicode-sensitive token like 'straße') before ac.buildFailureLinks(), then call
ac.search('POOP 💩') and assert that the returned matches include the
case-folded token (verify matches contains 'poop' or the chosen token) so
ac.addWord, ac.buildFailureLinks, and ac.search actually validate
case-insensitive behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6cb02be5-4a1b-4e78-a91f-3e42db97b494

📥 Commits

Reviewing files that changed from the base of the PR and between 60b09ee and 02bed1e.

📒 Files selected for processing (2)

lib/src/safe_text_filter.dart
test/aho_corasick_test.dart

🚧 Files skipped from review as they are similar to previous changes (1)

lib/src/safe_text_filter.dart

KirthiSaiT · 2026-04-14T17:49:19Z

@master-wayne7 the cli test cases all passed

fix: use runes instead of codeUnits in AhoCorasick.search

11fa705

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

fix: Unicode support and emoji filtering in Aho-Corasick

60b09ee

- Updated SafeTextFilter to use runes consistently for searching and filtering. - Optimized rune conversions to maintain Aho-Corasick performance. - Added unit test for emoji filtering.

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

fix: formatting and added Aho-Corasick Unicode tests

02bed1e

coderabbitai Bot reviewed Apr 14, 2026

View reviewed changes

Comment thread test/aho_corasick_test.dart

master-wayne7 approved these changes Apr 20, 2026

View reviewed changes

master-wayne7 merged commit 872f009 into master-wayne7:develop Apr 20, 2026
3 checks passed

Conversation

KirthiSaiT commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Type of Change

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

master-wayne7 commented Apr 13, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

KirthiSaiT commented Apr 13, 2026

Uh oh!

master-wayne7 commented Apr 14, 2026

Uh oh!

master-wayne7 commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

KirthiSaiT commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

KirthiSaiT commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading

master-wayne7 commented Apr 14, 2026 •

edited

Loading