fix: include raw remaining input as suffix in romaji_to_hiragana_predictively #5
Open
yukimemi wants to merge 1 commit into oguna:master from
When the trailing characters of a query are an incomplete romaji syllable (e.g. "s" at the end of "infurawbs"), the predictive suffix generator only produced kana alternatives ([さしすせそ…]) and never the plain ASCII character. This caused mixed kana+ASCII filenames like "インフラWBS" to be missed when querying "infurawbs": the generated pattern contained `インフラwb[サシスセソ…]` but not `インフラwbs`, so the case-insensitive match against "WBS" failed.

Fix: after building the kana suffix set, also insert the raw remaining input (`query.to_vec()`) as one extra suffix option. This lets callers (e.g. query_a_word_with_generator) generate the literal pattern "インフラwbs" in addition to the kana alternatives, which then matches "インフラWBS" via the (?i) flag.

Existing tests for romaji_to_hiragana_predictively_2/3/4 are updated to reflect that the raw remaining input is now included in the suffix list alongside the kana alternatives.
yukimemi added a commit to yukimemi/shun that referenced this pull request Mar 26, 2026

Use yukimemi/rustmigemo@14e8f56 which includes the fix for mixed kana+ASCII queries (e.g. "infurawbs" now matches "インフラWBS").

PR submitted upstream: oguna/rustmigemo#5

Will revert to oguna/rustmigemo once the PR is merged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
When a query contains a kana-convertible prefix followed by an incomplete romaji
syllable (a bare consonant), mixed kana+ASCII filenames fail to match.
Concrete example (found in a Japanese launcher app using rustmigemo):

| Filename | Query |
| --- | --- |
| `インフラWBS` | `infura` |
| `インフラWBS` | `wbs` |
| `インフラWBS` | `infurawbs` |

Root cause
`romaji_to_hiragana_predictively("infurawbs")` processes the string as:

1. `infura` → consumed into prefix as `いんふら`
2. `w`, `b` → pushed as-is (no romaji match)
3. `s` → triggers the predictive suffix block (`terminal_nodes_for_prefix("s").len() > 1`)

Inside the suffix block, only kana alternatives are inserted into the set (`[さ, し, す, せ, そ, …]`). The plain ASCII `s` is never added.

As a result, `query_a_word_with_generator` generates patterns like `インフラwb[サシスセソ…]` but not `インフラwbs`. The case-insensitive regex `(?i)インフラwb[サシスセソ…]` cannot match `インフラWBS` because `S ≠ サ`.

By contrast, querying `"wbs"` alone works because `query_a_word_with_generator` adds the literal word via `generator.add(&word_chars)` at the top, so `wbs` is already present as a pattern alternative. The problem only surfaces when the ASCII tail is preceded by a kana-converted prefix.
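
The mismatch described above can be reproduced with a small toy model. This is only a sketch: `any_candidate_matches` is a hypothetical helper, not part of rustmigemo, and it stands in for the `(?i)` regex match with a lowercase substring check.

```rust
/// Toy stand-in for the case-insensitive match done via the `(?i)` flag.
/// Hypothetical helper, not rustmigemo's API: it reports whether any
/// `prefix + suffix` candidate occurs in `filename`, ignoring case.
fn any_candidate_matches(filename: &str, prefix: &str, suffixes: &[&str]) -> bool {
    // Lowercasing affects the ASCII letters only; the kana are unchanged.
    let haystack = filename.to_lowercase();
    suffixes.iter().any(|suffix| {
        let candidate = format!("{}{}", prefix, suffix).to_lowercase();
        haystack.contains(&candidate)
    })
}
```

With the prefix `インフラwb` and only the katakana suffixes `サ シ ス セ ソ`, no candidate is found in `インフラWBS`. Once the raw `s` is among the suffixes, the candidate `インフラwbs` is found.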
Fix
One line added to `romaji_to_hiragana_predictively`: insert the raw remaining input (`query.to_vec()`) as an additional suffix option before returning.
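
A minimal sketch of the idea follows. It assumes the suffixes are collected in a set of `Vec<char>`. `build_suffixes` and its `kana_alternatives` parameter are hypothetical names for illustration, not the actual rustmigemo code; only the `query.to_vec()` insertion is taken from the change itself.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper modeling the suffix block. `kana_alternatives`
/// stands in for whatever the romaji lookup yields for the incomplete
/// trailing syllable held in `query`.
fn build_suffixes(query: &[char], kana_alternatives: &[&str]) -> BTreeSet<Vec<char>> {
    let mut set: BTreeSet<Vec<char>> = kana_alternatives
        .iter()
        .map(|kana| kana.chars().collect())
        .collect();
    // The added step: keep the raw remaining input as one more suffix option.
    set.insert(query.to_vec());
    set
}
```

Calling `build_suffixes(&['s'], &["さ", "し", "す", "せ", "そ"])` in this model yields the five kana entries plus the raw `['s']`.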
With this change the suffix set for the trailing `s` becomes `[さ, し, す, …, s]`, so `query_a_word_with_generator` also generates `インフラwbs` → `(?i)インフラwbs` matches `インフラWBS`. ✓

Side-effects analysis
- The extra suffix is redundant in the no-prefix case: when the query starts with an incomplete romaji syllable (`"ky"`, `"n"`, …) the raw input is already added to the regex generator via `generator.add(&word_chars)` at the top of `query_a_word_with_generator`, so the final regex pattern is unchanged.
- In the mixed case (`"infurawbs"`) the raw suffix produces genuinely new alternatives (`インフラwbs`, `いんふらwbs`) that were previously missing.

Tests
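- Existing tests `romaji_to_hiragana_predictively_2/3/4` are updated to include the raw remaining input in their expected results (behavior change is intentional and correct).
- `predictive_suffix_includes_raw_ascii_trailing_consonant`: verifies `"infurawbs"` yields `"s"` in suffixes
- `predictive_suffix_includes_raw_for_single_trailing_consonant`: verifies `"s"` alone yields `"s"` in suffixes

All 55 tests pass.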
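
The redundancy argument from the side-effects analysis can also be checked on a toy model. In this sketch `collect_patterns` is a hypothetical function, not rustmigemo's generator. It assumes the literal word is always added first, mirroring `generator.add(&word_chars)`, followed by every `prefix + suffix` combination.

```rust
use std::collections::BTreeSet;

/// Toy model of pattern collection (hypothetical, for illustration only):
/// the literal word goes in first, then each prefix + suffix alternative.
fn collect_patterns(word: &str, prefix: &str, suffixes: &[&str]) -> BTreeSet<String> {
    let mut patterns = BTreeSet::new();
    // Mirrors the literal-word insertion at the top of the query function.
    patterns.insert(word.to_string());
    for suffix in suffixes {
        patterns.insert(format!("{}{}", prefix, suffix));
    }
    patterns
}
```

For `"ky"` with an empty prefix, adding `"ky"` to the suffixes leaves the pattern set unchanged because the literal word is already present. For `"infurawbs"` with prefix `いんふらwb`, adding `"s"` introduces `いんふらwbs`, which was absent before.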