feat(cli): dictool candidates mine — extras dict candidate pool from SudachiDict (#242)
Merged
Adds a `dictool candidates mine` subcommand that downloads SudachiDict-full, diffs against the merged build dict, classifies each `(reading, surface)` pair into one of three coarse buckets by Sudachi POS, and writes per-bucket TSVs to `engine/data/extras-candidates/` (gitignored). This addresses the proactive side of "I typed X and there was no candidate": mining surfaces a pool of words Mozc lacks before the user encounters them, without ever auto-merging into the build dict (which is what got the Sudachi bulk merge removed in PR #156).

## Why a separate module from `dict_source`

`dict_source` modules feed entries directly into `lexime.dict`. PR #156 (2026-02-20) deleted the Mozc+SudachiDict merge after it produced too much top-1 noise — that lesson holds. Putting Sudachi into a new `candidates/` module makes the contract explicit: mined output never enters the build dict; the user reviews TSVs in `extras-candidates/` and hand-copies useful rows into `dict_source/extras/<domain>.tsv`.

## Buckets

Sudachi POS doesn't carry semantic categories, so domain-tagged splits (food / it / medical) aren't possible from the source alone. Coarse syntactic split:

- `sudachi-place.tsv` — `名詞,固有名詞,地名,*` (place names; ~537K seed)
- `sudachi-common.tsv` — `名詞,普通名詞,一般,*` (general nouns; ~149K seed)
- `sudachi-other.tsv` — verbs/adjectives/suffixes/etc. (~1.2M, low signal)

Each file is sorted by Sudachi cost ascending so the highest-priority candidates surface first when scanning. The user filters via grep:

```
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head
```

## Confirmed limits

Mining can't help when:

- The reading is borrowed cross-language (Sudachi has 老抽 with reading `ろうちゅう`, but Japanese cooks type `らおちゅう` (from the Mandarin pinyin) — Sudachi's reading isn't useful).
- The word is too niche to be in Sudachi at all (一保堂, 雲母坂, 冪等 — all absent).

Curated TSVs remain the only path for these, so mining augments the curated layer rather than replacing it.
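The three-way bucket split above maps naturally onto a small enum plus a slice match on the Sudachi POS hierarchy. A minimal sketch (the `Bucket` and `classify` names come from the layout section; the exact signatures in the PR may differ):

```rust
// Sketch of the POS bucketing. Sudachi POS is a 6-level hierarchy,
// e.g. ["名詞", "固有名詞", "地名", "一般", "*", "*"].
#[derive(Debug, PartialEq, Eq)]
enum Bucket {
    Place,  // -> sudachi-place.tsv
    Common, // -> sudachi-common.tsv
    Other,  // -> sudachi-other.tsv
}

fn classify(pos: &[&str]) -> Bucket {
    match pos {
        ["名詞", "固有名詞", "地名", ..] => Bucket::Place,
        ["名詞", "普通名詞", "一般", ..] => Bucket::Common,
        _ => Bucket::Other,
    }
}

fn main() {
    assert_eq!(classify(&["名詞", "固有名詞", "地名", "一般", "*", "*"]), Bucket::Place);
    assert_eq!(classify(&["名詞", "普通名詞", "一般", "*", "*", "*"]), Bucket::Common);
    assert_eq!(classify(&["動詞", "一般", "*", "*", "*", "*"]), Bucket::Other);
}
```

The trailing `..` keeps the match stable across Sudachi's fourth-level POS variants.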
## Layout

```
engine/crates/lex-cli/src/
├── candidates/
│   ├── mod.rs            # Bucket enum, classify(), write_candidates()
│   └── sudachi.rs        # fetch + parse 18-col CSV (resurrected from PR #156)
└── commands/
    └── candidates_ops.rs # mine() orchestration, default paths
```

`fetch + parse` is the same code path as the deleted PR #156 sudachi.rs (commit 0a45265^), confined to candidate mining. The `zip` crate dependency is reintroduced as a hard dep on lex-cli only — the IME runtime is unaffected.
Pull request overview
Adds a new `dictool candidates mine` workflow to proactively mine potential `extras/` dictionary entries from SudachiDict-full without auto-merging them into the build dictionary, outputting gitignored per-bucket TSVs for manual review/promotion.
Changes:
- Introduces the `lex_cli::candidates` module (Sudachi fetch + parse, POS-based bucketing, TSV output).
- Adds the `dictool candidates mine` CLI subcommand and orchestration helpers (default paths, optional `--clean`).
- Restores `zip` as a `lex-cli` dependency to handle Sudachi ZIP releases and updates docs around extras entry workflows.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| engine/crates/lex-cli/src/lib.rs | Exposes new candidates module. |
| engine/crates/lex-cli/src/dict_source/extras.rs | Documents “friction” vs “proactive mining” paths for adding extras entries. |
| engine/crates/lex-cli/src/commands/mod.rs | Registers new candidates_ops command module. |
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Implements mine() orchestration + default/cache/output path helpers. |
| engine/crates/lex-cli/src/candidates/mod.rs | Defines candidate data model, bucketing, and TSV writer. |
| engine/crates/lex-cli/src/candidates/sudachi.rs | Implements SudachiDict version discovery, download/extract, and CSV parsing. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds dictool candidates mine CLI wiring and flags. |
| engine/crates/lex-cli/Cargo.toml | Adds zip dependency for Sudachi ZIP extraction. |
| engine/Cargo.lock | Locks new zip dependency graph for lex-cli. |
- (IMP) sudachi.rs:60 — version-aware cache: a stamp/latest mismatch now wipes stale CSVs before re-downloading, instead of silently reusing old data while writing the new version into the stamp/output headers. Adds regression tests `read_stamp` / `stale_cache_csvs_wiped_on_version_mismatch`.
- (MIN perf) sudachi.rs:103 — stream `*_lex.csv` via `BufReader::lines` rather than slurping the whole (>100 MB) file into a String.
- (MIN perf) sudachi.rs:189 — stage downloaded ZIPs to `.tmp.zip` on disk and stream-extract from that file instead of buffering ZIP bytes + extracted CSVs simultaneously in memory.
- (MIN perf) mod.rs:126 — use `sort_by` with a borrowed string compare to avoid cloning surface/reading at sort-key construction (millions of allocs at full-Sudachi scale).
- (MIN doc) candidates_ops.rs:78 — the comment claimed the cache lives outside `engine/data/`; aligned it with the actual `engine/data/.sudachi-cache` path and explained the leading-dot convention.

Memory peak observation: ZIP-on-disk + line-streaming roughly halves peak RSS during `dictool candidates mine` on Sudachi-full.
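The `BufReader::lines` change can be sketched like this; the generic reader, the `parse_lex_csv` helper name, and the returned tuple shape are illustrative, not the PR's exact code. Column positions follow the Sudachi 18-column lex CSV (0 = surface, 3 = cost, 5..=10 = POS, 11 = reading):

```rust
use std::io::{self, BufRead};

// Streaming parse sketch: one String allocation per line instead of
// slurping the >100 MB file into memory at once.
fn parse_lex_csv<R: BufRead>(reader: R) -> io::Result<Vec<(String, String, i32)>> {
    let mut rows = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let cols: Vec<&str> = line.split(',').collect();
        if cols.len() < 12 {
            continue; // malformed / short row
        }
        if let Ok(cost) = cols[3].parse::<i32>() {
            // (reading, surface, cost)
            rows.push((cols[11].to_string(), cols[0].to_string(), cost));
        }
    }
    Ok(rows)
}

fn main() -> io::Result<()> {
    let csv = "東京都,1,1,5000,東京都,名詞,固有名詞,地名,一般,*,*,トウキョウト\n";
    let rows = parse_lex_csv(io::Cursor::new(csv))?;
    assert_eq!(
        rows,
        vec![("トウキョウト".to_string(), "東京都".to_string(), 5000)]
    );
    Ok(())
}
```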
`zip` (added in PR #242 for SudachiDict ZIP extraction) pulls in 4 crates that weren't previously exempt at `safe-to-deploy`:

- `arbitrary 1.4.2` (new)
- `bumpalo 3.19.1` (was `safe-to-run`; raised since zip's path needs deploy-grade)
- `derive_arbitrary 1.4.2` (new)
- `zopfli 0.8.3` (new)

Adds them to `engine/supply-chain/config.toml` as exemptions to unblock the CI `audit` job. Manually auditing 23k+ lines of zopfli / arbitrary upstream is out of scope; if any of these crates publishes a security concern, cargo-vet will surface it on the next run.
`zip 2.4.2` (added in PR #242 for SudachiDict ZIP extraction) ships a `build.rs`. Reviewed: standard cfg-detection pattern, no network access or codegen. Updates the build-script baseline so the audit job stops flagging it.
- (IMP) sudachi.rs:68 — robust cache invalidation: `fetch` now treats
ANY of {missing stamp, empty stamp, version mismatch, no stamp +
orphan CSVs} as a fully invalid cache and wipes before re-downloading.
R1's fix only handled version-mismatch; this closes the missing-
stamp + stale-CSV path. Adds `test_stale_cache_invariant` covering
all five state combinations and `test_any_csv_exists_detects_orphans`.
- (MIN perf) Cargo.toml:32 — switched from `zip = "2"` to `zip = "7"`,
collapsing the two zip majors in Cargo.lock into one. Drops the
arbitrary/derive_arbitrary transitive-dep tail that came in with
zip 2.x's deflate path; their cargo-vet exemptions are removed.
- (MIN perf) mod.rs:156 — stream rows to `BufWriter<File>` instead of
building the whole TSV (~100MB for full Sudachi) in a String first.
- (MIN perf) candidates_ops.rs:50 — drop the per-reading
`HashSet<String>` allocation and clone; linear scan over the small
Vec<DictEntry> from `dict.lookup` is faster for typical homophone
counts.
- (MIN doc) mod.rs:94 — fixed the off-by-one in the `classify` POS
column comment (was 4..=9, actually 5..=10 since col 4 is the
normalized surface, not POS).
Build artifact cleanup: build-script-baseline.txt drops `zip` (zip
7.x has no build.rs, unlike zip 2.x).
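The cache-validity rule described above can be condensed into a couple of predicates. This is a hypothetical distillation for illustration — the real `fetch` inspects stamp files and CSVs on disk, not plain values:

```rust
// Valid only with a non-empty stamp that matches the latest version;
// every other state is treated as fully invalid.
fn cache_is_valid(stamp: Option<&str>, latest: &str) -> bool {
    matches!(stamp, Some(s) if !s.is_empty() && s == latest)
}

// Anything that could leak stale data gets wiped before re-download:
// a stamp in any bad state, or orphan CSVs with no stamp at all.
fn needs_wipe(stamp: Option<&str>, latest: &str, any_csv_exists: bool) -> bool {
    !cache_is_valid(stamp, latest) && (stamp.is_some() || any_csv_exists)
}

fn main() {
    let latest = "20260301";
    assert!(cache_is_valid(Some("20260301"), latest));  // healthy cache
    assert!(!cache_is_valid(Some("20250101"), latest)); // version mismatch
    assert!(!cache_is_valid(Some(""), latest));         // empty stamp
    assert!(!cache_is_valid(None, latest));             // missing stamp
    assert!(needs_wipe(None, latest, true));            // orphan CSVs, no stamp
    assert!(!needs_wipe(None, latest, false));          // truly empty cache
}
```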
- (CRIT CI) supply-chain/config.toml — add zlib-rs 0.6.2 exemption.
zip 7's deflate path pulls in flate2 → zlib-rs, which wasn't covered
by the previous R2 supply-chain edits and was breaking the audit job.
- (MIN doc/style) mod.rs:151 — emit each TSV header line via its own
`writeln!` so source-code indentation doesn't bleed into the output
as leading whitespace before `#`. Adds an assertion that header
lines start at col 0.
- (MIN perf) sudachi.rs:185 — split the parsed row type into
`CandidateRow` (no reading) for the multimap value and `Candidate`
(with reading) only at the output site. The reading lives in the
HashMap key alone, dropping ~1.9M duplicate String allocations at
full-Sudachi scale.
- (MIN safety) sudachi.rs:245 — use a per-process unique tmp filename
(`.tmp.{pid}.zip`) and install `TmpFileGuard` BEFORE the download
block, so a failed `io::copy` no longer leaks a partial tmp and
parallel `mine` runs against the same cache dir don't clobber each
other.
- (MIN perf) candidates_ops.rs:57 — drop `cand.reading.clone()` (now
the outer `reading` is the canonical key) and replace
`pos.split('-').collect::<Vec>()` + `classify(&[..])` with a new
`classify_pos_string` helper that does the split inline without
allocating a Vec.
Tests: +1 `classify_pos_string_matches_classify`, header
col-0 assertion in `write_candidates_splits_buckets_and_sorts_by_cost`.
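A guard of the kind described above could look like the following sketch; the `TmpFileGuard` name comes from the review, but the field layout and usage are illustrative:

```rust
use std::fs;
use std::path::PathBuf;

// Illustrative RAII guard: removes the staged file on drop, so an
// early return from a failed download never leaks a partial tmp file.
struct TmpFileGuard {
    path: PathBuf,
}

impl Drop for TmpFileGuard {
    fn drop(&mut self) {
        // Best-effort; "not found" after a successful rename is fine.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    // Per-process unique name, as in the review's `.tmp.{pid}.zip`.
    let tmp = std::env::temp_dir().join(format!(".tmp.{}.zip", std::process::id()));
    {
        let _guard = TmpFileGuard { path: tmp.clone() };
        fs::write(&tmp, b"partial download")?; // simulate an interrupted io::copy
    } // guard drops here and deletes the partial file
    assert!(!tmp.exists());
    Ok(())
}
```

Installing the guard before the download block is what closes the leak window: the drop runs on every exit path, including `?` early returns.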
- (IMP) sudachi.rs:225 — `parse_latest_version` now accepts the `YYYYMMDD_N` re-cut form (e.g. `20201023_2`). Previously the digit-only filter dropped these entries, so a future re-cut release at the top of the listing would silently fall back to an older date-only version. Adds an `is_sudachi_version` helper, a regression test with `_N` at the head of the listing, and a noise-filter test covering empty / dangling / double-underscore / alpha cases.
- (MIN test) sudachi.rs:499 — switch the parse_dir test fixture from a `\`-continued multi-line string (which leaks source-code indentation into the data) to `concat!` of separate literals, and add the missing `surface == "東京都"` assertion that would have caught the leakage.
- (MIN doc) supply-chain/config.toml:69 — annotate why bumpalo's exemption was raised from `safe-to-run` to `safe-to-deploy` (zip ZIP-extraction tool path; bumpalo is not used at IME runtime, but cargo-vet evaluates the dep graph at deploy-grade).
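A version filter matching the behavior above might look like this; the `is_sudachi_version` name comes from the review, but the implementation here is a guess at equivalent logic:

```rust
// Accept plain YYYYMMDD and the YYYYMMDD_N re-cut form; reject
// everything else (empty, dangling/double underscore, alpha noise).
fn is_sudachi_version(s: &str) -> bool {
    let (date, recut) = match s.split_once('_') {
        Some((d, n)) => (d, Some(n)),
        None => (s, None),
    };
    let all_digits = |t: &str| !t.is_empty() && t.bytes().all(|b| b.is_ascii_digit());
    date.len() == 8 && all_digits(date) && recut.map_or(true, all_digits)
}

fn main() {
    assert!(is_sudachi_version("20201023"));
    assert!(is_sudachi_version("20201023_2"));   // re-cut form
    assert!(!is_sudachi_version(""));            // empty
    assert!(!is_sudachi_version("20201023_"));   // dangling underscore
    assert!(!is_sudachi_version("20201023__2")); // double underscore
    assert!(!is_sudachi_version("latest"));      // alpha noise
}
```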
- (perf) sudachi.rs:152 — walk the split iterator with an enumerate
match instead of `collect::<Vec<&str>>()` per CSV line. Drops
~3M Vec allocations on Sudachi-full and breaks early at col 11
(cols 12..17 aren't read).
- (safety) sudachi.rs:297 — extract each CSV to a per-process
`.{name}.extract.{pid}` temp path then atomically rename to the
final csv_path. Two parallel `mine` runs against the same cache
dir no longer race the prior exists-check + truncate-create
pattern.
- (perf) candidates_ops.rs:69 — consume `upstream` by value so each
`CandidateRow`'s `surface` and `pos` Strings can be moved into
`Candidate` instead of cloned. Reading is still cloned per row
(the HashMap key remains canonical across the outer loop).
- (doc) supply-chain/config.toml:73 — fix the bumpalo rationale.
R4's comment claimed bumpalo only came in via build-side
proc-macro tooling, but PR #242's `zip 7` switch added a
`zip → zopfli → bumpalo` runtime path through dictool. Comment
now lists both paths.
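The stage-then-rename step can be sketched as below; `write_atomically` is a hypothetical stand-in for the streamed extraction, and the staged-name format mirrors the one quoted in the review:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Write to a per-process staged path, then rename atomically onto the
// final path, so parallel runs never observe a half-written CSV.
fn write_atomically(final_path: &Path, contents: &[u8]) -> io::Result<()> {
    let name = final_path.file_name().unwrap().to_string_lossy();
    let staged =
        final_path.with_file_name(format!(".{}.extract.{}", name, std::process::id()));
    fs::write(&staged, contents)?; // the real code streams via io::copy
    fs::rename(&staged, final_path) // atomic when on the same filesystem
}

fn main() -> io::Result<()> {
    let target = std::env::temp_dir().join(format!("demo.{}.csv", std::process::id()));
    write_atomically(&target, b"surface,cost\n")?;
    assert_eq!(fs::read(&target)?, b"surface,cost\n");
    fs::remove_file(&target)
}
```

The rename is only atomic on the same filesystem, which holds here because the staged path lives next to the final one in the cache dir.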
- (perf) candidates_ops.rs:60 — scope the `seen` dedupe HashSet to a
single reading instead of growing it globally as `(reading, surface)`
tuples. Surface duplicates are only meaningful within one reading
(when Sudachi has multiple POS variants for the same row), so the
per-reading set drops both the cross-reading entries and the
per-row reading clone.
- (safety) sudachi.rs:321 — wrap `extract_tmp` with the existing
`TmpFileGuard` RAII pattern so a failure in `io::copy` or the
subsequent `fs::rename` doesn't leak a `.{basename}.extract.{pid}`
file in the cache dir. Successful rename leaves nothing at the
staged path, so the guard's drop is a no-op then.
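The per-reading scoping can be illustrated with a toy helper; `dedupe_per_reading` and its input shape are invented for the example, not the PR's actual types:

```rust
use std::collections::HashSet;

// `seen` is recreated per reading group, so it never accumulates
// cross-reading (reading, surface) tuples and holds only &str keys.
fn dedupe_per_reading(groups: &[(&str, Vec<&str>)]) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for (reading, surfaces) in groups {
        let mut seen = HashSet::new(); // lives for one reading only
        for s in surfaces {
            if seen.insert(*s) {
                out.push((reading.to_string(), s.to_string()));
            }
        }
    }
    out
}

fn main() {
    let groups = [
        ("とうきょう", vec!["東京", "東京", "東亰"]), // duplicate surface from POS variants
        ("きょう", vec!["今日", "京"]),
    ];
    let out = dedupe_per_reading(&groups);
    assert_eq!(out.len(), 4); // one duplicate dropped within its reading
}
```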
Summary
`dictool candidates mine` fetches SudachiDict-full → diffs it against the build dict → classifies into 3 buckets by POS → writes per-bucket TSVs to `engine/data/extras-candidates/`.
A tool that fills the proactive side of the "I typed it and no candidate appeared" friction. Mining output is never auto-merged into the build dict (following the lesson of the full Sudachi merge removed in #156). The workflow: the user reviews the TSVs by hand → copies only the good rows into `dict_source/extras/<domain>.tsv` → opens a PR.
Buckets
Each TSV is sorted by Sudachi cost ascending (a proxy for frequency). Narrow down by domain with grep:
```bash
grep -E "椒|醤|油" engine/data/extras-candidates/sudachi-common.tsv | head -50
```
Mining limits (confirmed)
→ Mining does not replace the curated layer; it complements it.
Design differences from the previous full Sudachi merge (#156, 0a45265)
The `fetch + parse` code itself is resurrected from 0a45265^ (~300 LoC), but it lives in a dedicated `candidates/` module rather than `dict_source/` and is not registered in the `SOURCES` registry (so neither `dictool fetch --source sudachi` nor `compile --source sudachi` can invoke it).
Layout
```
engine/crates/lex-cli/src/
├── candidates/
│ ├── mod.rs # Bucket / classify / write_candidates
│ └── sudachi.rs # fetch + 18-col CSV parse
└── commands/
└── candidates_ops.rs # mine() orchestration + default paths
```
Dependency: the `zip` crate is restored as a hard dep of lex-cli (Sudachi releases ship as ZIPs). The IME runtime is unaffected (lex-cli is a build/dev tool).
CLI
```bash
dictool candidates mine [--build-dict PATH] [--cache-dir DIR] [--out-dir DIR] [--clean]
```
Defaults:
Test plan
🤖 Generated with Claude Code