feat(cli): dictool candidates mine — extras dict candidate pool from SudachiDict #242

Merged
send merged 9 commits into main from feat/extras-candidates-mine
May 9, 2026

@send send commented May 9, 2026

Summary

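`dictool candidates mine` fetches SudachiDict-full, diffs it against the build dict, classifies the remaining entries into 3 buckets by POS, and writes one TSV per bucket to `engine/data/extras-candidates/`.

This tool covers the proactive side of the "I typed it and no candidate showed up" frustration. Mined results are not merged into the build dict automatically (this follows the lesson of the Sudachi bulk merge that was removed in #156). The workflow is: the user reviews the TSVs by hand, copies only the good rows into `dict_source/extras/.tsv`, and opens a PR.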

Buckets

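| File | POS | Count |
|---|---|---|
| `sudachi-place.tsv` | 名詞,固有名詞,地名,* | ~537K |
| `sudachi-common.tsv` | 名詞,普通名詞,一般,* | ~149K |
| `sudachi-other.tsv` | Everything else (verbs / adjectives / suffixes / ...) | ~1.2M |

Each TSV is sorted by Sudachi cost ascending (a proxy for frequency). Narrow down to a domain with grep: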

```bash
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head -50
```

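Mining limits (confirmed)

| Word | Result | Reason |
|---|---|---|
| 老抽 | Present in sudachi-common, but the reading is `ろうちゅう` (Chinese pinyin) | Does not match `らおちゅう`, which is what Japanese users type |
| 花椒 | Same; the reading is `ふぁーちう` | Already covered by curated |
| 冪等 / 藤椒 / 一保堂 / 雲母坂 | Not in Sudachi either | Curated entry required |
| 型推論 / 醤油 / 山椒 | Immediately usable candidates (readings `かたすいろん` / `じょうゆ` / `ざんしょう`) | Can be promoted |

→ Mining complements the curated layer; it does not replace it.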

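Design differences from the earlier Sudachi bulk merge (#156, 0a45265)

| | Before (Sudachi bulk merge) | Now (candidates mine) |
|---|---|---|
| Output | build dict (`lexime.dict`) | `extras-candidates/` (gitignored) |
| Automatic ingestion | yes | no |
| Quality control | not possible | user reviews the TSVs by hand |
| Removal risk | withdrawn because of top-1 noise | zero impact on the build dict |

The `fetch + parse` code itself is resurrected from 0a45265^ (~300 LoC). However, it lives in a dedicated `candidates/` module rather than `dict_source/`, and it is not registered in the `SOURCES` registry (so neither `dictool fetch --source sudachi` nor `compile --source sudachi` can be invoked).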

Layout

```
engine/crates/lex-cli/src/
├── candidates/
│   ├── mod.rs             # Bucket / classify / write_candidates
│   └── sudachi.rs         # fetch + 18-col CSV parse
└── commands/
    └── candidates_ops.rs  # mine() orchestration + default paths
```

Dependency: the `zip` crate is restored as a hard dep of lex-cli (Sudachi releases ship as ZIP). The IME runtime is unaffected (lex-cli is a build/dev tool).

CLI

```bash
dictool candidates mine [--build-dict PATH] [--cache-dir DIR] [--out-dir DIR] [--clean]
```

Defaults:

  • `--build-dict engine/data/lexime.dict`
  • `--cache-dir engine/data/.sudachi-cache` (gitignored)
  • `--out-dir engine/data/extras-candidates` (gitignored)
  • `--clean` wipes the output directory before the run

Test plan

  • `cargo fmt --all --check`
  • `cargo clippy --workspace --all-features -- -D warnings`
  • `cargo test --workspace --all-features` (25 lex-cli tests pass, +10 for candidates)
  • Ran `dictool candidates mine` for real and confirmed the 3-bucket split, the counts, and the Sudachi v20260428 fetch
  • Confirmed that immediately usable candidates such as 型推論 / 醤油 / 山椒 are present in sudachi-common
  • Confirmed that 一保堂 / 藤椒 / 雲母坂 are absent from Sudachi (confirming that curation is unavoidable)

🤖 Generated with Claude Code

Adds a `dictool candidates mine` subcommand that downloads SudachiDict-full,
diffs against the merged build dict, classifies each `(reading, surface)`
into one of three coarse buckets by Sudachi POS, and writes per-bucket TSVs
to `engine/data/extras-candidates/` (gitignored).

This addresses the proactive side of "I typed X and there was no candidate":
mining surfaces a pool of words Mozc lacks before the user encounters them,
without ever auto-merging into the build dict (which is what got Sudachi
bulk-merge removed in PR #156).
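
The diff step described above can be sketched as follows. This is a minimal illustration, assuming the build dict can be viewed as a set of `(reading, surface)` pairs; `missing_from_build_dict` and the tuple layout are hypothetical names for this sketch, not lex-cli's actual API.

```rust
use std::collections::HashSet;

/// Return the upstream `(reading, surface)` pairs that the build dict lacks,
/// preserving upstream order and dropping repeated pairs.
/// (Hypothetical helper; the real `mine()` works on lex-cli's own types.)
pub fn missing_from_build_dict<'a>(
    upstream: &'a [(String, String)],
    build_dict: &HashSet<(String, String)>,
) -> Vec<&'a (String, String)> {
    // Tracks pairs already emitted so each candidate appears once.
    let mut seen: HashSet<&(String, String)> = HashSet::new();
    let out: Vec<&'a (String, String)> = upstream
        .iter()
        .filter(|pair| !build_dict.contains(*pair))
        .filter(|pair| seen.insert(*pair))
        .collect();
    out
}
```

Nothing in this sketch writes back to the build dict, which mirrors the contract stated above: mined pairs only ever flow towards the review TSVs.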

## Why a separate module from `dict_source`

`dict_source` modules feed entries directly into `lexime.dict`. PR #156
(2026-02-20) deleted the Mozc+SudachiDict merge after it produced too much
top-1 noise — that lesson holds. Putting Sudachi into a new `candidates/`
module makes the contract explicit: mined output never enters the build
dict; the user reviews TSVs in `extras-candidates/` and hand-copies useful
rows into `dict_source/extras/<domain>.tsv`.

## Buckets

Sudachi POS doesn't carry semantic categories, so domain-tagged splits
(food / it / medical) aren't possible from the source alone. Coarse
syntactic split:

- `sudachi-place.tsv` — `名詞,固有名詞,地名,*` (place names; ~537K seed)
- `sudachi-common.tsv` — `名詞,普通名詞,一般,*` (general nouns; ~149K seed)
- `sudachi-other.tsv` — verbs/adjectives/suffixes/etc. (~1.2M, low signal)

Each file is sorted by Sudachi cost ascending so the highest-priority
candidates surface first when scanning. User filters via grep:

```
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head
```
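
The three-way split above could look roughly like this. The names `Bucket` and `classify` appear in the layout of this PR, but the signature and the `file_name` helper below are assumptions made for illustration.

```rust
/// Coarse bucket for a mined candidate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bucket {
    Place,
    Common,
    Other,
}

impl Bucket {
    /// Output file name for this bucket (hypothetical helper).
    pub fn file_name(self) -> &'static str {
        match self {
            Bucket::Place => "sudachi-place.tsv",
            Bucket::Common => "sudachi-common.tsv",
            Bucket::Other => "sudachi-other.tsv",
        }
    }
}

/// Classify by the first three Sudachi POS fields; later fields are ignored.
pub fn classify(pos: &[&str]) -> Bucket {
    match pos {
        ["名詞", "固有名詞", "地名", ..] => Bucket::Place,
        ["名詞", "普通名詞", "一般", ..] => Bucket::Common,
        // Verbs, adjectives, suffixes, and anything with too few fields.
        _ => Bucket::Other,
    }
}
```

For example, `classify(&["名詞", "固有名詞", "地名", "一般"])` yields `Bucket::Place`, while a verb row falls through to `Bucket::Other`.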

## Confirmed limits

Mining can't help when:

- The reading is borrowed cross-language: Sudachi has 老抽 with the reading
  `ろうちゅう` (Mandarin pinyin), but Japanese cooks type `らおちゅう`, so
  Sudachi's reading isn't useful.
- The word is too niche to be in Sudachi at all (一保堂, 雲母坂, and 冪等 are
  all absent). Curated TSV remains the only path for these.

So mining augments the curated layer rather than replacing it.

## Layout

```
engine/crates/lex-cli/src/
  candidates/
    mod.rs         # Bucket enum, classify(), write_candidates()
    sudachi.rs     # fetch + parse 18-col CSV (resurrected from PR #156)
  commands/
    candidates_ops.rs   # mine() orchestration, default paths
```

`fetch + parse` is the same code path as the sudachi.rs deleted in PR #156
(commit 0a45265^), now confined to candidate mining. The `zip` crate
dependency is reintroduced as a hard dep on lex-cli only; the IME runtime is
unaffected.
Copilot AI review requested due to automatic review settings May 9, 2026 05:57

Copilot AI left a comment

Pull request overview

Adds a new `dictool candidates mine` workflow to proactively mine potential `extras/` dictionary entries from SudachiDict-full without auto-merging them into the build dictionary, outputting gitignored per-bucket TSVs for manual review/promotion.

Changes:

  • Introduces the `lex_cli::candidates` module (Sudachi fetch + parse, POS-based bucketing, TSV output).
  • Adds the `dictool candidates mine` CLI subcommand and orchestration helpers (default paths, optional `--clean`).
  • Restores `zip` as a lex-cli dependency to handle Sudachi ZIP releases and updates docs around extras entry workflows.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `engine/crates/lex-cli/src/lib.rs` | Exposes new candidates module. |
| `engine/crates/lex-cli/src/dict_source/extras.rs` | Documents “friction” vs “proactive mining” paths for adding extras entries. |
| `engine/crates/lex-cli/src/commands/mod.rs` | Registers new candidates_ops command module. |
| `engine/crates/lex-cli/src/commands/candidates_ops.rs` | Implements mine() orchestration + default/cache/output path helpers. |
| `engine/crates/lex-cli/src/candidates/mod.rs` | Defines candidate data model, bucketing, and TSV writer. |
| `engine/crates/lex-cli/src/candidates/sudachi.rs` | Implements SudachiDict version discovery, download/extract, and CSV parsing. |
| `engine/crates/lex-cli/src/bin/dictool.rs` | Adds dictool candidates mine CLI wiring and flags. |
| `engine/crates/lex-cli/Cargo.toml` | Adds zip dependency for Sudachi ZIP extraction. |
| `engine/Cargo.lock` | Locks new zip dependency graph for lex-cli. |


Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/mod.rs Outdated
Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs Outdated
- (IMP) sudachi.rs:60 — version-aware cache: stamp/latest mismatch now
  wipes stale CSVs before re-downloading, instead of silently reusing
  old data while writing the new version into the stamp/output headers.
  Adds regression tests `read_stamp` / `stale_cache_csvs_wiped_on_version_mismatch`.
- (MIN perf) sudachi.rs:103 — stream `*_lex.csv` via `BufReader::lines`
  rather than slurping the whole (>100 MB) file into a String.
- (MIN perf) sudachi.rs:189 — stage downloaded ZIPs to `.tmp.zip` on
  disk and stream-extract from that file instead of buffering ZIP bytes
  + extracted CSVs simultaneously in memory.
- (MIN perf) mod.rs:126 — use `sort_by` with borrowed string compare
  to avoid cloning surface/reading at sort-key construction (millions
  of allocs at full-Sudachi scale).
- (MIN doc) candidates_ops.rs:78 — comment claimed cache lives outside
  `engine/data/`; aligned with the actual `engine/data/.sudachi-cache`
  path and explained the leading-dot convention.

Memory peak observation: ZIP-on-disk + line-streaming roughly halves
peak RSS during `dictool candidates mine` on Sudachi-full.
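
To illustrate the `sort_by` change in the list above: a comparator over borrowed `&str` fields avoids the per-element `String` clones that a `sort_by_key` tuple key would need. The `Candidate` fields and the tie-break order (cost, then surface, then reading) are assumptions for this sketch, not necessarily what `mod.rs` does.

```rust
use std::cmp::Ordering;

/// Simplified candidate row (field set assumed for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub reading: String,
    pub surface: String,
    pub cost: i32,
}

/// Order by cost ascending, then surface, then reading.
/// Only borrowed `&str`s are compared, so nothing is allocated.
fn by_cost_then_text(a: &Candidate, b: &Candidate) -> Ordering {
    a.cost
        .cmp(&b.cost)
        .then_with(|| a.surface.as_str().cmp(b.surface.as_str()))
        .then_with(|| a.reading.as_str().cmp(b.reading.as_str()))
}

pub fn sort_candidates(rows: &mut [Candidate]) {
    // By contrast, `sort_by_key(|c| (c.cost, c.surface.clone(), c.reading.clone()))`
    // would build an owned key, cloning two Strings, every time a key is needed.
    rows.sort_by(by_cost_then_text);
}
```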

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.

Comment thread engine/crates/lex-cli/src/candidates/mod.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/mod.rs Outdated
Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs Outdated
Comment thread engine/crates/lex-cli/Cargo.toml Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs
send added 3 commits May 9, 2026 15:10
`zip` (added in PR #242 for SudachiDict ZIP extraction) pulls in 4
crates that weren't previously exempt at `safe-to-deploy`:

- `arbitrary 1.4.2` (new)
- `bumpalo 3.19.1` (was `safe-to-run`; raised since zip's path
  needs deploy-grade)
- `derive_arbitrary 1.4.2` (new)
- `zopfli 0.8.3` (new)

Adds them to `engine/supply-chain/config.toml` as exemptions to
unblock CI `audit` job. Manually auditing 23k+ lines of zopfli /
arbitrary upstream is out of scope; if any of these crates publishes
a security concern, cargo-vet will surface it on the next run.
`zip 2.4.2` (added in PR #242 for SudachiDict ZIP extraction) ships a
`build.rs`. Reviewed: standard cfg-detection pattern, no network access
or codegen. Updates the build-script baseline so the audit job stops
flagging it.
- (IMP) sudachi.rs:68 — robust cache invalidation: `fetch` now treats
  ANY of {missing stamp, empty stamp, version mismatch, no stamp +
  orphan CSVs} as a fully invalid cache and wipes before re-downloading.
  R1's fix only handled version-mismatch; this closes the missing-
  stamp + stale-CSV path. Adds `test_stale_cache_invariant` covering
  all five state combinations and `test_any_csv_exists_detects_orphans`.
- (MIN perf) Cargo.toml:32 — switched from `zip = "2"` to `zip = "7"`,
  collapsing the two zip majors in Cargo.lock into one. Drops the
  arbitrary/derive_arbitrary transitive-dep tail that came in with
  zip 2.x's deflate path; their cargo-vet exemptions are removed.
- (MIN perf) mod.rs:156 — stream rows to `BufWriter<File>` instead of
  building the whole TSV (~100MB for full Sudachi) in a String first.
- (MIN perf) candidates_ops.rs:50 — drop the per-reading
  `HashSet<String>` allocation and clone; linear scan over the small
  Vec<DictEntry> from `dict.lookup` is faster for typical homophone
  counts.
- (MIN doc) mod.rs:94 — fixed the off-by-one in the `classify` POS
  column comment (was 4..=9, actually 5..=10 since col 4 is the
  normalized surface, not POS).

Build artifact cleanup: build-script-baseline.txt drops `zip` (zip
7.x has no build.rs, unlike zip 2.x).
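
The cache-invalidation rule from the first item above can be written as a small pure function. This is a sketch under stated assumptions: the function names are hypothetical, the stamp is modelled as the optional contents of a stamp file, and `any_csv_exists` stands in for the orphan-CSV check.

```rust
/// The cache is reusable only when a non-empty stamp equals the latest version.
pub fn cache_is_valid(stamp: Option<&str>, latest: &str) -> bool {
    match stamp.map(str::trim) {
        Some(v) => !v.is_empty() && v == latest,
        // Missing stamp: never trust whatever CSVs happen to be on disk.
        None => false,
    }
}

/// Whether stale files must be removed before re-downloading.
/// With no stamp and no CSVs there is simply nothing to wipe.
pub fn must_wipe(stamp: Option<&str>, latest: &str, any_csv_exists: bool) -> bool {
    !cache_is_valid(stamp, latest) && (stamp.is_some() || any_csv_exists)
}
```

Keeping the decision separate from the filesystem calls makes the state combinations easy to enumerate in a unit test.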

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Comment thread engine/supply-chain/config.toml
Comment thread engine/crates/lex-cli/src/candidates/mod.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs
Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs Outdated
…T CI)

- (CRIT CI) supply-chain/config.toml — add zlib-rs 0.6.2 exemption.
  zip 7's deflate path pulls in flate2 → zlib-rs, which wasn't covered
  by the previous R2 supply-chain edits and was breaking the audit job.
- (MIN doc/style) mod.rs:151 — emit each TSV header line via its own
  `writeln!` so source-code indentation doesn't bleed into the output
  as leading whitespace before `#`. Adds an assertion that header
  lines start at col 0.
- (MIN perf) sudachi.rs:185 — split the parsed row type into
  `CandidateRow` (no reading) for the multimap value and `Candidate`
  (with reading) only at the output site. The reading lives in the
  HashMap key alone, dropping ~1.9M duplicate String allocations at
  full-Sudachi scale.
- (MIN safety) sudachi.rs:245 — use a per-process unique tmp filename
  (`.tmp.{pid}.zip`) and install `TmpFileGuard` BEFORE the download
  block, so a failed `io::copy` no longer leaks a partial tmp and
  parallel `mine` runs against the same cache dir don't clobber each
  other.
- (MIN perf) candidates_ops.rs:57 — drop `cand.reading.clone()` (now
  the outer `reading` is the canonical key) and replace
  `pos.split('-').collect::<Vec>()` + `classify(&[..])` with a new
  `classify_pos_string` helper that does the split inline without
  allocating a Vec.

Tests: +1 `classify_pos_string_matches_classify`, header
col-0 assertion in `write_candidates_splits_buckets_and_sorts_by_cost`.
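
A minimal sketch of the RAII guard and per-process tmp name mentioned in the safety item above. `TmpFileGuard` and the `.tmp.{pid}.zip` pattern are named in this PR; the fields, methods, and `tmp_zip_path` helper below are assumptions, not the actual implementation.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Removes the wrapped path when dropped; a missing file is ignored.
pub struct TmpFileGuard {
    path: PathBuf,
}

impl TmpFileGuard {
    pub fn new(path: PathBuf) -> Self {
        Self { path }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TmpFileGuard {
    fn drop(&mut self) {
        // Best effort: after a successful rename nothing is left at this
        // path, so the error from remove_file is deliberately discarded.
        let _ = fs::remove_file(&self.path);
    }
}

/// Per-process staging name so parallel runs do not share one tmp file.
pub fn tmp_zip_path(cache_dir: &Path) -> PathBuf {
    cache_dir.join(format!(".tmp.{}.zip", std::process::id()))
}
```

Creating the guard before the download starts means an early `?` return from a failed copy still cleans up the partial file.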

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs
Comment thread engine/supply-chain/config.toml
- (IMP) sudachi.rs:225 — `parse_latest_version` now accepts the
  `YYYYMMDD_N` re-cut form (e.g. `20201023_2`). Previously the
  digit-only filter dropped these entries, so a future re-cut release
  at the top of the listing would silently fall back to an older
  date-only version. Adds `is_sudachi_version` helper, regression test
  with `_N` at the head of the listing, and a noise-filter test
  covering empty / dangling / double-underscore / alpha cases.
- (MIN test) sudachi.rs:499 — switch the parse_dir test fixture from
  a `\`-continued multi-line string (which leaks source-code
  indentation into the data) to `concat!` of separate literals, and
  add the missing `surface == "東京都"` assertion that would have
  caught the leakage.
- (MIN doc) supply-chain/config.toml:69 — annotate why bumpalo's
  exemption was raised from `safe-to-run` to `safe-to-deploy` (zip
  ZIP-extraction tool path; bumpalo not used at IME runtime, but
  cargo-vet evaluates the dep graph at deploy-grade).
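
The version filter from the first item above might look like this. `is_sudachi_version` is the helper name used in the commit message; the body is a sketch that assumes an 8-digit date optionally followed by `_` and one or more digits.

```rust
/// Accept `YYYYMMDD` or the re-cut form `YYYYMMDD_N` (N = one or more digits).
pub fn is_sudachi_version(s: &str) -> bool {
    // Split at the first underscore only, so "20201023__2" leaves "_2"
    // as the suffix and is rejected by the digit check below.
    let (date, suffix) = match s.split_once('_') {
        Some((date, n)) => (date, Some(n)),
        None => (s, None),
    };
    let all_digits = |t: &str| !t.is_empty() && t.bytes().all(|b| b.is_ascii_digit());
    date.len() == 8 && all_digits(date) && suffix.map_or(true, all_digits)
}
```

Under these assumptions `20201023_2` is accepted, while empty, dangling (`20201023_`), double-underscore, and alphabetic inputs are rejected.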

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs Outdated
Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs
Comment thread engine/supply-chain/config.toml Outdated
- (perf) sudachi.rs:152 — walk the split iterator with an enumerate
  match instead of `collect::<Vec<&str>>()` per CSV line. Drops
  ~3M Vec allocations on Sudachi-full and breaks early at col 11
  (cols 12..17 aren't read).
- (safety) sudachi.rs:297 — extract each CSV to a per-process
  `.{name}.extract.{pid}` temp path then atomically rename to the
  final csv_path. Two parallel `mine` runs against the same cache
  dir no longer race the prior exists-check + truncate-create
  pattern.
- (perf) candidates_ops.rs:69 — consume `upstream` by value so each
  `CandidateRow`'s `surface` and `pos` Strings can be moved into
  `Candidate` instead of cloned. Reading is still cloned per row
  (the HashMap key remains canonical across the outer loop).
- (doc) supply-chain/config.toml:73 — fix the bumpalo rationale.
  R4's comment claimed bumpalo only came in via build-side
  proc-macro tooling, but PR #242's `zip 7` switch added a
  `zip → zopfli → bumpalo` runtime path through dictool. Comment
  now lists both paths.
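
The allocation-free column walk from the first item above can be sketched like this. The column positions (0 = surface, 3 = cost, 5..=10 = POS, 11 = reading) follow the numbers quoted in this thread; `ParsedRow` and `parse_line` are hypothetical names, and the plain `split(',')` assumes no quoted fields containing commas.

```rust
/// Fields the miner needs from one lexicon CSV line (borrowed from the line).
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedRow<'a> {
    pub surface: &'a str,
    pub cost: i32,
    pub pos: [&'a str; 6],
    pub reading: &'a str,
}

/// Walk the split iterator once, without collecting into a Vec.
pub fn parse_line(line: &str) -> Option<ParsedRow<'_>> {
    let mut surface = "";
    let mut cost = None;
    let mut pos = [""; 6];
    let mut reading = None;
    for (i, field) in line.split(',').enumerate() {
        match i {
            0 => surface = field,
            3 => cost = field.parse::<i32>().ok(),
            5..=10 => pos[i - 5] = field,
            11 => {
                reading = Some(field);
                break; // Later columns are not needed.
            }
            _ => {}
        }
    }
    // Rows that are too short or carry a non-numeric cost yield None.
    Some(ParsedRow { surface, cost: cost?, pos, reading: reading? })
}
```

Returning borrowed `&str` fields keeps the parser itself allocation-free; the caller decides which fields to turn into owned `String`s.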

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs
Comment thread engine/crates/lex-cli/src/candidates/sudachi.rs
…und)

- (perf) candidates_ops.rs:60 — scope the `seen` dedupe HashSet to a
  single reading instead of growing it globally as `(reading, surface)`
  tuples. Surface duplicates are only meaningful within one reading
  (when Sudachi has multiple POS variants for the same row), so the
  per-reading set drops both the cross-reading entries and the
  per-row reading clone.
- (safety) sudachi.rs:321 — wrap `extract_tmp` with the existing
  `TmpFileGuard` RAII pattern so a failure in `io::copy` or the
  subsequent `fs::rename` doesn't leak a `.{basename}.extract.{pid}`
  file in the cache dir. Successful rename leaves nothing at the
  staged path, so the guard's drop is a no-op then.
@send send merged commit 0e399f2 into main May 9, 2026
10 checks passed
@send send deleted the feat/extras-candidates-mine branch May 9, 2026 07:07