fix: include raw remaining input as suffix in romaji_to_hiragana_predictively#5

Open
yukimemi wants to merge 1 commit into oguna:master from yukimemi:fix/predictive-suffix-includes-raw-remaining

Problem

When a query contains a kana-convertible prefix followed by an incomplete romaji
syllable (a bare consonant), mixed kana+ASCII filenames fail to match.

Concrete example (found in a Japanese launcher app using rustmigemo):

| Filename | Query | Expected | Actual |
| --- | --- | --- | --- |
| インフラWBS | infura | match ✓ | match ✓ |
| インフラWBS | wbs | match ✓ | match ✓ |
| インフラWBS | infurawbs | match ✓ | no match ✗ |

Root cause

romaji_to_hiragana_predictively("infurawbs") processes the string as:

  1. infura → consumed into prefix as いんふら
  2. w, b → pushed as-is (no romaji match)
  3. s → triggers the predictive suffix block (terminal_nodes_for_prefix("s").len() > 1)

Inside the suffix block, only kana alternatives are inserted into the set
([さ, し, す, せ, そ, …]). The plain ASCII s is never added.

As a result, query_a_word_with_generator generates patterns like
インフラwb[サシスセソ…] but not インフラwbs. The case-insensitive regex
(?i)インフラwb[サシスセソ…] cannot match インフラWBS because S ≠ サ.

By contrast, querying "wbs" alone works because query_a_word_with_generator
adds the literal word via generator.add(&word_chars) at the top, so wbs is
already present as a pattern alternative. The problem only surfaces when the
ASCII tail is preceded by a kana-converted prefix.
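The mismatch can be reproduced without rustmigemo. The following is a minimal sketch using only the standard library: `matches_any` is a hypothetical stand-in for the generated case-insensitive regex (it is not part of rustmigemo), and it treats each `prefix + suffix` string as one pattern alternative.

```rust
/// Hypothetical stand-in for a `(?i)` regex built from `prefix` followed by
/// one of `suffixes`. Not rustmigemo's API; illustration only.
fn matches_any(target: &str, prefix: &str, suffixes: &[&str]) -> bool {
    // Lowercasing both sides approximates the case-insensitive flag.
    let target = target.to_lowercase();
    suffixes.iter().any(|s| {
        let candidate = format!("{}{}", prefix, s).to_lowercase();
        target.contains(&candidate)
    })
}
```

With the kana-only suffixes `["サ", "シ", "ス", "セ", "ソ"]` and the prefix `インフラwb`, the target `インフラWBS` is not matched. Adding `"s"` to the suffix list makes it match, which is the behavior this PR is after.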

Fix

One line added to romaji_to_hiragana_predictively — insert the raw remaining
input as an additional suffix option before returning:

// Also include the raw remaining input as a suffix option.
set.insert(query.to_vec());

With this change the suffix set for the trailing s becomes
[さ, し, す, …, s], so query_a_word_with_generator also generates
インフラwbs, and (?i)インフラwbs matches インフラWBS. ✓
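To illustrate the effect on the suffix set, here is a toy model of the predictive step. It is only a sketch under stated assumptions: `TABLE`, `predictive_suffixes`, and the `include_raw` flag are hypothetical names, and the lookup is a plain prefix scan over a tiny table rather than rustmigemo's actual romaji tree.

```rust
use std::collections::BTreeSet;

/// Tiny hypothetical romaji table; the real table is far larger.
const TABLE: &[(&str, &str)] = &[
    ("sa", "さ"), ("si", "し"), ("su", "す"), ("se", "せ"), ("so", "そ"),
    ("ka", "か"), ("ki", "き"),
];

/// Collects every kana whose romaji spelling starts with `remaining`.
/// With `include_raw` set, the raw remaining input is added as well,
/// mirroring the idea behind `set.insert(query.to_vec())`.
fn predictive_suffixes(remaining: &str, include_raw: bool) -> BTreeSet<String> {
    let mut set: BTreeSet<String> = TABLE
        .iter()
        .filter(|(romaji, _)| romaji.starts_with(remaining))
        .map(|(_, kana)| kana.to_string())
        .collect();
    if include_raw {
        // Assumption: the raw input is stored as-is next to the kana options.
        set.insert(remaining.to_string());
    }
    set
}
```

In this model, `predictive_suffixes("s", false)` holds only the five kana, while `predictive_suffixes("s", true)` additionally holds the literal `s`.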

Side-effects analysis

The extra suffix is redundant in the no-prefix case: when the query starts
with an incomplete romaji syllable ("ky", "n", …) the raw input is already
added to the regex generator via generator.add(&word_chars) at the top of
query_a_word_with_generator, so the final regex pattern is unchanged.

In the mixed case ("infurawbs") the raw suffix produces genuinely new
alternatives (インフラwbs, いんふらwbs) that were previously missing.
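Both cases can be checked with a small model of how pattern alternatives are assembled. This is a sketch, not the library's code: `build_patterns` is a hypothetical helper that always adds the literal word first (standing in for `generator.add(&word_chars)`) and then adds every `prefix + suffix` combination.

```rust
use std::collections::BTreeSet;

/// Hypothetical model of the pattern alternatives fed to the regex generator.
fn build_patterns(word: &str, prefixes: &[&str], suffixes: &[&str]) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    // The literal word is always present, as in the description above.
    out.insert(word.to_string());
    for p in prefixes {
        for s in suffixes {
            out.insert(format!("{}{}", p, s));
        }
    }
    out
}
```

For the query `"s"` with an empty prefix, adding the raw suffix `"s"` yields the same set as before, because the literal word was already there. For `"infurawbs"` with the prefixes `いんふらwb` and `インフラwb`, the raw suffix adds `いんふらwbs` and `インフラwbs`, which the kana-only suffix list never produces.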

Tests

  • Three existing predictive tests updated to include the raw suffix in expected
    results (behavior change is intentional and correct).
  • Two new regression tests added:
    • predictive_suffix_includes_raw_ascii_trailing_consonant — verifies "infurawbs" yields "s" in suffixes
    • predictive_suffix_includes_raw_for_single_trailing_consonant — verifies "s" alone yields "s" in suffixes

All 55 tests pass.

…ictively

When the trailing characters of a query are an incomplete romaji syllable
(e.g. "s" at the end of "infurawbs"), the predictive suffix generator only
produced kana alternatives ([さしすせそ…]) and never the plain ASCII character.

This caused mixed kana+ASCII filenames like "インフラWBS" to be missed when
querying "infurawbs": the generated pattern contained `インフラwb[サシスセソ…]`
but not `インフラwbs`, so the case-insensitive match against "WBS" failed.

Fix: after building the kana suffix set, also insert the raw remaining input
(`query.to_vec()`) as one extra suffix option.  This lets callers (e.g.
query_a_word_with_generator) generate the literal pattern "インフラwbs" in
addition to the kana alternatives, which then matches "インフラWBS" via the
(?i) flag.

Existing tests for romaji_to_hiragana_predictively_2/3/4 are updated to
reflect that the raw remaining input is now included in the suffix list
alongside the kana alternatives.
yukimemi added a commit to yukimemi/shun that referenced this pull request Mar 26, 2026
Use yukimemi/rustmigemo@14e8f56 which includes the fix for mixed
kana+ASCII queries (e.g. "infurawbs" now matches "インフラWBS").

PR submitted upstream: oguna/rustmigemo#5
Will revert to oguna/rustmigemo once the PR is merged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>