2043 commits
0ae6f67
feat(serve): apply accepted engine fast-path at boot (#55 slice 2a)
Snider Jun 4, 2026
cc88c52
refactor(metal,mlx): collapse the accepted-gate list onto EngineFeatu…
Snider Jun 4, 2026
cc28adc
feat(metal,gemma4): model-declared engine features applied at load (#…
Snider Jun 4, 2026
451249e
refactor(metal): hollow GenerationStream + AsyncDecodePrefetch gates …
Snider Jun 4, 2026
84b9726
refactor(metal): hollow 6 gemma4 decode gates onto the runtime atomic…
Snider Jun 4, 2026
12fb976
refactor(metal): hollow PagedKVPrealloc + Router gates onto the runti…
Snider Jun 4, 2026
c96aade
test(go-mlx): delete the coverageTokens ritual — 5,297 lines of dead …
Snider Jun 4, 2026
a307e1a
docs(goal): replace GOAL.md with the cleanup punch-list — 34 gates + …
Snider Jun 4, 2026
2cbeef9
test(metal): merge 3 Good/Bad/Ugly triplets into single tests (#55 te…
Snider Jun 4, 2026
f7136c3
test(go-mlx): delete 1,278 fake 'compliance coverage' tests — ~16,400…
Snider Jun 4, 2026
4aa79b8
refactor(cmd/mlx): carve pack + bench commands out of main.go (#55)
Snider Jun 4, 2026
32513fc
refactor(cmd/mlx): carve the 4 profile commands out of main.go — 9,37…
Snider Jun 4, 2026
9c09cbc
refactor(metal): #55 gate cleanup + model-declared default_output_length
Snider Jun 5, 2026
9f43a88
refactor(gemma4): drop the final_logit_softcapping=30 fossil
Snider Jun 5, 2026
91cc084
fix(gemma4): read the model's real context, not the text backbone's
Snider Jun 5, 2026
0f3af16
feat(mlx): native Gemma-4 LoRA fuse + SFT loop, no-Python
Snider Jun 5, 2026
4713b14
refactor(gemma4): delete default_output_length — model has no text ou…
Snider Jun 5, 2026
0820ce0
refactor(gemma4): delete the config guess-stratum — model declares, g…
Snider Jun 5, 2026
d9f5018
refactor(mlx): single-owner Gemma-4 policy + shared LoRA adapter parser
Snider Jun 5, 2026
6beccc9
refactor(gemma4): LoRA policy as registry data + generic accessors, e…
Snider Jun 5, 2026
95907ce
refactor(gemma4): weight-name canonicalisation as registry data + gen…
Snider Jun 5, 2026
cd084ff
refactor(metal): route all chat formatting through chat.Format, delet…
Snider Jun 5, 2026
38e4d1a
refactor(metal): engine runtime predicates become model-declared capa…
Snider Jun 5, 2026
f3ef161
refactor(qwen3): move Qwen 3.6 detection out of the engine into the m…
Snider Jun 5, 2026
48b8de1
refactor(metal): rename the generic split-decode path off the Qwen3 name
Snider Jun 5, 2026
8fa0ac7
refactor: rename the fixed-cache plumbing generic — the engine's sett…
Snider Jun 5, 2026
9198dd9
refactor(metal): the MoE router kernels are generic — drop the gemma4…
Snider Jun 5, 2026
bf7213a
feat(gemma4): FeaturesOf — the architecture's feature surface, read f…
Snider Jun 5, 2026
c8607b7
refactor(gemma4): route the MoE load path through FeaturesOf
Snider Jun 5, 2026
c2c062e
refactor(metal): sort the residual family-name leaks
Snider Jun 5, 2026
9a45b07
refactor(mlx): unpin MTP path resolution from the official-reference …
Snider Jun 5, 2026
f00582f
refactor(mlx): remove the production/bench-as-CLI cluster (codex ralp…
Snider Jun 5, 2026
450022d
refactor(metal): nuke the native FFN-residual gate + kernel (gate-sou…
Snider Jun 6, 2026
e1d2afc
refactor(metal): nuke the native attention-residual-norm gate + kerne…
Snider Jun 6, 2026
5956bd9
refactor(metal): nuke the expert-id quantized-matvec gate cluster (ga…
Snider Jun 6, 2026
d53a148
refactor(metal): strip the MoE-router gate (keep the kernel — it is f…
Snider Jun 6, 2026
87d79ce
refactor(gemma4): strip the decode-core native dispatch to the generi…
Snider Jun 6, 2026
fd5430d
refactor(metal): nuke dead gemma4 decode-core kernels + their 5 runti…
Snider Jun 6, 2026
25ae496
refactor(metal): remove the dead Gemma4 decode-core C/Metal subgraph
Snider Jun 6, 2026
f3a857d
refactor(cmd/mlx): drop the slice-smoke harness subcommand
Snider Jun 6, 2026
2c705d7
refactor(cmd/mlx): drop the ffn-estimate harness subcommand
Snider Jun 6, 2026
421e000
refactor(cmd/mlx): gut the tuning subsystem — serve derives from the …
Snider Jun 6, 2026
bfbf5af
refactor(cmd/mlx): drop printUsage entries for deleted commands + dea…
Snider Jun 6, 2026
a8092ad
refactor(mlx): remove the dead RunLocalTuning runner (planner stays l…
Snider Jun 6, 2026
1c372dd
refactor(mlx): collapse the go-mlx bench/eval surface — harness stays…
Snider Jun 6, 2026
f79b726
refactor(metal,profile): single registry-owned Gemma-4 thinking defau…
Snider Jun 6, 2026
7c992e1
feat(mlx,serve): wire the native Gemma-4 MTP speculative-decode lane …
Snider Jun 6, 2026
0fad672
refactor(gemma4): declare attention class from config; derive UsesFix…
Snider Jun 6, 2026
e939319
refactor(gemma4): compose EngineFeatures from config — decide the q6 …
Snider Jun 6, 2026
9866ab0
refactor(metal,profile): registry-declared attached-only archs (kill …
Snider Jun 6, 2026
697f298
refactor(metal,profile): registry-owned architecture resolution (kill…
Snider Jun 6, 2026
cb5aea7
feat(metal,gemma4): model config selects the fixed-sliding KV cache, …
Snider Jun 6, 2026
09d78bc
feat(memory): derive context length from device memory + model, not a…
Snider Jun 6, 2026
1a2bcd1
fix(memory,model): size context from the true KV width (num_kv_heads …
Snider Jun 6, 2026
f00e4f2
fix(gemma4): define the audio tower from its declared config, delete …
Snider Jun 6, 2026
539ecab
fix(gemma4): vision config-led — stop guessing one model's dims for e…
Snider Jun 6, 2026
82621b9
refactor(metal): delete the dead gemma4 decode-core native-availabili…
Snider Jun 6, 2026
4389886
refactor(metal): de-fossil the gemma4-named engine helpers + gates → …
Snider Jun 6, 2026
2b2e07b
refactor(metal): rename the last engine validator generic (ValidateGe…
Snider Jun 6, 2026
e8d0dfa
refactor(metal): AllowGemma4ExtendedTargets → AllowExtendedTargets — …
Snider Jun 6, 2026
ccc388d
fix(metal): runtime gates never read ambient env — close the external…
Snider Jun 6, 2026
c2df5d0
fix(metal): strip ambient-env compute toggles — GELU / per-layer / mx…
Snider Jun 6, 2026
0468b74
fix(metal,mlx): strip the last GO_MLX_* env reads — zero ambient-env …
Snider Jun 6, 2026
b567d04
fix(memory): derive parallel slots + batch from device∩model — kill t…
Snider Jun 6, 2026
371c51c
refactor(memory): consolidate the per-token KV-width calc into one he…
Snider Jun 6, 2026
cd77788
fix(gemma4): delete defaultGemma4VisionConfig — the fabricate-a-confi…
Snider Jun 6, 2026
e9693db
test(memory): fix root-package planner tests to derived reality — b56…
Snider Jun 6, 2026
7cd065b
fix(metal,mlx): generate to the model's context when MaxTokens is uns…
Snider Jun 6, 2026
c711ea8
test(hf): fix PlanHFModelFits tests to derived-from-truth context — c…
Snider Jun 6, 2026
b65243a
fix(mlx): stop defaulting MaxTokens — resolve to context, not a guess…
Snider Jun 6, 2026
3f016f3
fix(gemma4): stop fabricating head_dim/global_head_dim/vocab_size — d…
Snider Jun 6, 2026
1478942
fix(models): stop fabricating vocab_size in gemma3/gptoss/kimi/mixtra…
Snider Jun 6, 2026
8a75e2a
fix(mixtral): derive NumLocalExperts from the routed-expert tensors —…
Snider Jun 6, 2026
e8d77c6
fix(dense): stop fabricating vocab_size in the shared dense parser — …
Snider Jun 6, 2026
56ffc3c
fix(dense): rope_theta default is transformers' 10000, not Qwen's 1e6
Snider Jun 6, 2026
b1800a0
refactor(chat): relocate family formatters to model packages + regist…
Snider Jun 6, 2026
ca50479
feat(serve): warn at boot that --draft MTP lane is greedy-only
Snider Jun 7, 2026
643a62c
refactor(metal): typed Gate enum replaces the GO_MLX_ENABLE_* string …
Snider Jun 7, 2026
cea2a2f
refactor(test): metal_runtime/model_eval build tags replace GO_MLX_RU…
Snider Jun 7, 2026
3d6cca7
refactor(test,blockcache): de-env model paths (HF-cache resolve) + bl…
Snider Jun 7, 2026
6c146f9
refactor(test): Track A — merge orphan root tests into their source's…
Snider Jun 7, 2026
f89f10c
refactor(grpo): extract GRPO training to its own subpackage (Track B …
Snider Jun 7, 2026
b70a13e
feat(mlx): export TokenizerImpl + NewTokenizer — bring-your-own-token…
Snider Jun 7, 2026
9702e10
refactor(distill): extract knowledge-distillation to its own subpacka…
Snider Jun 7, 2026
1615732
refactor(distill): extract knowledge-distillation to its own subpacka…
Snider Jun 7, 2026
f4d9e94
refactor(mlx): rename training.go → primitives.go; consolidate metala…
Snider Jun 7, 2026
e5cb560
merge: reconcile github dev into corrected line
Snider Jun 7, 2026
ab2dcf4
refactor(mlx): split inference_contract.go by concern (1227 → 352 + 2…
Snider Jun 7, 2026
c2f00a9
refactor(mlx): extract metal↔root conversions from backend.go (2302 →…
Snider Jun 7, 2026
e62f76f
refactor(mlx): move tensor-op facade from backend.go to primitives.go
Snider Jun 7, 2026
9f7cbfb
refactor(mlx): extract native model contract from backend.go → native…
Snider Jun 7, 2026
65cd0d0
refactor(mlx): extract Model LoRA methods from backend.go → model_lor…
Snider Jun 7, 2026
4227851
refactor(mlx): extract prompt-cache warming from backend.go → prompt_…
Snider Jun 7, 2026
ec3d831
refactor(mlx): extract the generate API from backend.go → generate.go
Snider Jun 7, 2026
ff77ae0
refactor(mlx): extract LoadConfig + LoadOptions from mlx.go → load_op…
Snider Jun 7, 2026
b43d2d0
refactor(mlx): extract GenerateConfig + GenerateOptions from mlx.go →…
Snider Jun 7, 2026
01a5db9
refactor(mlx): extract CPU packed-quant kernels → split_cpu_ffn_kerne…
Snider Jun 7, 2026
1204d2a
feat(model): neutral ResolveQuant — detect quant from the model's own…
Snider Jun 7, 2026
cb8186a
feat(serve): live GPU memory (active/cache/peak) on admin serve status
Snider Jun 7, 2026
b056208
test(metal): AX-11 BenchmarkGenerate_ContextGrowth — decode memory/th…
Snider Jun 7, 2026
2dffbfc
test(metal): AX-11 bench pins the leak to PagedKVCache (not the decod…
Snider Jun 7, 2026
b71115f
fix(memory): stop the planner selecting the broken PagedKVCache (serv…
Snider Jun 7, 2026
5541126
test(memory): M3Ultra96GB plan asserts default cache, not the broken …
Snider Jun 7, 2026
e315e55
test(mlx): root plan tests assert default cache, not the broken paged
Snider Jun 7, 2026
fbad7a7
refactor(mlx): rename SimpleSelfDistillation* -> SSD* (match the code…
Snider Jun 7, 2026
a8b6cda
refactor(hf): drop the filename quant heuristic — quant is read, not …
Snider Jun 7, 2026
77a6820
refactor(memory): tear out the quant-recommendation cluster — fit is …
Snider Jun 7, 2026
4f3e8d7
feat(serve): PlanModelFit derives a true weights+KV bytes-fit from th…
Snider Jun 7, 2026
3b27f9f
refactor(mlx): expose the NativeModel seam (public interface + NewMod…
Snider Jun 7, 2026
099a4d5
refactor(kvconv): extract the kv<->metal snapshot bridge to its own p…
Snider Jun 7, 2026
20baadd
refactor(kvconv): move the State-KV block-source builder into kvconv
Snider Jun 7, 2026
c46b48c
test(metal): AX-11 growth bench measures the default cache (the real …
Snider Jun 7, 2026
9b3dd2b
test(metal): AX-11 bench applies serve EngineFeatures + is model-para…
Snider Jun 7, 2026
7f6a02b
fix(metal): bound MLX allocator cache to model footprint (was ~half RAM)
Snider Jun 7, 2026
54494b1
perf(gemma4/mtp): batched speculative verify + anchor loop (one targe…
Snider Jun 7, 2026
1cdf2f9
feat(gemma4): load quantized (QAT) assistant drafters
Snider Jun 7, 2026
e823161
test(mlx): log MTP draft/verify duration split in the boost repro
Snider Jun 7, 2026
daaaa06
docs(plans): MTP multi-token decode path — diagnosis + build plan
Snider Jun 7, 2026
d0ce832
perf(metal): row-batch the quantized matvec for small decode batches
Snider Jun 8, 2026
f5053ee
docs(plans): mark MTP slice 1 (batched matvec) done — 0.75x->0.81x
Snider Jun 8, 2026
7e4bede
docs(plans): AX-11 decode benchmark matrix (current code, 2026-06-08)
Snider Jun 8, 2026
91e37ad
test(mlx): GO_MLX_SPEC_DRAFTTOKENS knob for MTP draft-depth sweeps
Snider Jun 8, 2026
3a3c8c3
docs(plans): AX-11 matrix — add 1b (gemma-3) row, both quants clear 100
Snider Jun 8, 2026
4cda776
feat(metal): fused multi-query decode attention kernel (speculative v…
Snider Jun 8, 2026
30450d9
Revert "feat(metal): fused multi-query decode attention kernel (specu…
Snider Jun 8, 2026
9fc4709
perf(metal): fuse the q6 last-token greedy output (parity-proven)
Snider Jun 8, 2026
4efd1b6
fix(gemma4/mtp): feed the EAGLE head the post-norm target feature
Snider Jun 8, 2026
0ae2126
docs(plans): AX-11 full matrix + MTP reality (current code, post-norm…
Snider Jun 8, 2026
87cbf91
refactor(metal): one quantised matvec kernel for every bit width
Snider Jun 8, 2026
8354a40
test(metal): per-token decode phase tracer — locates the real bottleneck
Snider Jun 8, 2026
fc26e51
perf(metal): restore the q6 Group64 fast-path on the main matvec
Snider Jun 8, 2026
0bf069a
docs(plans): AX-11 re-validation + tiered target (12b added, fc26e518)
Snider Jun 8, 2026
44a4549
test(gemma4/mtp): draft-acceptance diagnostic — localises the 0.41 wall
Snider Jun 8, 2026
4a2d587
docs(plans): AX-11 — MTP lever validated via QAT matched pairs
Snider Jun 8, 2026
4ae6766
feat(gemma4/mtp): support gemma4_unified_assistant drafters (12b/26b/…
Snider Jun 8, 2026
e14831a
docs(plans): AX-11 — complete MTP-QAT matrix across all five sizes
Snider Jun 8, 2026
3813276
docs(plans): AX-11 — q6 QAT MTP column (e2b q6 86→100 via MTP, clears…
Snider Jun 8, 2026
d3de0a1
perf(metal): pool DispatchOne output-shape — kill the per-dispatch me…
Snider Jun 8, 2026
08be005
perf(metal): KV cache updateInPlace reads dims via Dim(i), not Shape()
Snider Jun 8, 2026
2d92e0c
perf(metal): pool AsStrided shape/strides — drop the per-reshape cgo …
Snider Jun 8, 2026
1a18164
perf(metal): pool Reshape variadic shape — drop the per-token out-pro…
Snider Jun 8, 2026
66e8856
fix(metal/test): cap MLX allocator memory in TestMain — stop syntheti…
Snider Jun 8, 2026
89d6409
revert(metal/test): drop the TestMain MLX memory caps
Snider Jun 8, 2026
50665a0
perf(gguf): ReadInfo parses the token-vocab for count only (199,915 -…
Snider Jun 8, 2026
a12bb7d
fix(gemma4/lora): ApplyLoRA defaults empty modelType to gemma4_text
Snider Jun 8, 2026
6f58b0b
test(metal): AX-11 decode instruments — matvec-vs-gemm, router, exper…
Snider Jun 9, 2026
5449de9
perf(metal): enable the fused sliding-window attention by default
Snider Jun 9, 2026
ddaca34
perf(gemma4): route the MoE router top-k through the fused kernel
Snider Jun 9, 2026
54354a8
chore(metal): drop the orphaned residual+norm+add fusion scaffold
Snider Jun 9, 2026
9d4d705
perf(metal): route q4/q8/bitstream-q6 decode linears to MLX quantized…
Snider Jun 9, 2026
bb077ab
fix(metal): free rotating-cache trim-branch handles and bench-leaked …
Snider Jun 9, 2026
bb37ec9
feat(metal): WarmDecodeTokensPerSec — steady-state rate for plain decode
Snider Jun 9, 2026
184ca40
test(metal): perf invariants — routing, byte-ordering, zero-garbage, …
Snider Jun 9, 2026
11b3fb7
feat(metal/gemma4): speculative SAMPLING — temperature>0 MTP via reje…
Snider Jun 10, 2026
ec9db8a
perf(openai): SSE/NDJSON framing helpers cut allocs per streamed token
Snider Jun 10, 2026
cf390ba
perf(cmd/mlx): memorypretrain text-hash embedder 9218->4 allocs/op
Snider Jun 10, 2026
6ad2fed
perf(kv): Analyze coherence O(n^2)->O(n) closed-form + lock-count cap
Snider Jun 10, 2026
b3ab1dd
perf(dataset): typed Sample.Format replaces per-row Meta[format] map …
Snider Jun 10, 2026
1ff5385
docs(plans): gguf-native-metal plan, llama.cpp baseline gap matrix, r…
Snider Jun 10, 2026
1a242dd
feat(cmd/mlx): generate — bare decode tok/s for like-for-like benching
Snider Jun 10, 2026
e20eba2
feat(cmd/mlx): generate --temp for greedy-vs-sampled bench control
Snider Jun 10, 2026
5f7bd2d
docs(metal): strip frozen-benchmark narrative from q4 dispatch comments
Snider Jun 10, 2026
83210c7
feat(cmd/mlx): generate -trace — per-token decode time budget
Snider Jun 10, 2026
51802fc
feat(cmd/mlx): generate -trace shows the prefetch logits/cache split …
Snider Jun 10, 2026
88a1cea
feat(cmd/mlx): generate -state — the no-prompt-replay turn loop (wake…
Snider Jun 10, 2026
02a8724
test(state): grow the sleep/wake live harness into a full forensic suite
Snider Jun 10, 2026
c1d7fa9
fix(metal): restored KV caches own their storage — wake/sleep round-t…
Snider Jun 10, 2026
653f6da
feat(chat): continuation renders + exported Model.FormatChatPrompt/Fo…
Snider Jun 10, 2026
21a713b
feat(serve): conversation continuity — chats wake from slept state, n…
Snider Jun 10, 2026
00bdfad
feat(serve): conversation continuity ON by default — the serve is the…
Snider Jun 10, 2026
8be36fa
feat(metal): compiled decode MLP — mlx_compile proven on the live dec…
Snider Jun 10, 2026
705e5be
feat(metal): compiled MLP traces the FUSED kernels (#65 increment 2) …
Snider Jun 10, 2026
d62c2e1
feat(metal): whole-layer compiled decode — one closure apply per gemm…
Snider Jun 10, 2026
790907e
feat(metal): pipelined decode — overlap the graph encode with the GPU…
Snider Jun 10, 2026
fb607af
feat(metal): band-stepped KV storage — cache writes scale with fill, …
Snider Jun 10, 2026
1cc6dda
test(mtp): baseline measurement — the pipeline moved MTP's goalposts
Snider Jun 10, 2026
4e3cdc2
feat(metal): fill-band attention — the compiled decode attends the fi…
Snider Jun 10, 2026
6d3b709
feat(metal): masked-write speculation — the causal mask is the transa…
Snider Jun 10, 2026
86fb2e5
feat(metal): close the 31B encode chase — coverage probe + buffer-cap…
Snider Jun 10, 2026
5410060
feat(metal): dense-family bench sweep vs mlx-lm — kv-storage flag + f…
Snider Jun 10, 2026
8fe8a76
test(metal): coverage probe also reports the live KV storage dtype
Snider Jun 10, 2026
48eb590
fix(metal): half-precision KV runs on the KV-sharing models — mask ca…
Snider Jun 10, 2026
ff509f3
WIP(metal): bf16-native activation stream — mlx-lm parity reached, no…
Snider Jun 11, 2026
a85462c
test(metal): determinism bisect round 1 — not the MLP, not the pipeli…
Snider Jun 11, 2026
fc8d3fe
test(metal): determinism bisect round 2 — localised to the first deco…
Snider Jun 11, 2026
368bba0
test(metal): determinism bisect round 3 — mechanism identified: pool-…
Snider Jun 11, 2026
7782c61
fix(metal): bf16 non-determinism ROOT-CAUSED + FIXED — custom kernel …
Snider Jun 11, 2026
3b57288
fix(metal): MoE router matvec kernel name carries the input dtype — s…
Snider Jun 11, 2026
99281ee
perf(metal): traced MLP routes gemm-preferring configs to MLX gemm — …
Snider Jun 11, 2026
466c49b
perf(metal): compiled MTP verify blocks — the layer closures serve L=…
Snider Jun 11, 2026
20505fb
feat(serve): MTP lane streams per verify block — GenerateWithSink + t…
Snider Jun 11, 2026
c603b2a
perf(metal): batched MTP greedy acceptance — and the qmm small-M vall…
Snider Jun 11, 2026
af1e242
perf(metal): the 48ms found — FixedKVCache.TruncateTo kills the MTP r…
Snider Jun 11, 2026
ff8ee14
WIP: mlx v0.31.2 bump — builds + links; thread-local stream integrati…
Snider Jun 11, 2026
2a8acda
fix(metal): per-thread GPU encoder binding — the 0.31.2 thread-local …
Snider Jun 11, 2026
5e3fce0
feat(metal): mlx v0.31.2 landed — dedicated encoding thread, determin…
Snider Jun 11, 2026
1d40d21
perf(metal): compile the MoE decode — 26B-A4B 48.1 -> 114 tok/s (2.4x…
Snider Jun 11, 2026
25a8f77
perf(metal): the 12B MTP pair is the family sweet spot — story draft=…
Snider Jun 11, 2026
274d0c9
fix(metal): one-shot CLI path — stream registry replay, temp-stream U…
Snider Jun 11, 2026
60ba788
feat(metal): decode-lane provenance in Metrics — lane, eligibility re…
Snider Jun 11, 2026
f1e9d4e
perf(metal): one-shot generations ride the session machinery — serve …
Snider Jun 11, 2026
43a12c9
docs: RFC.diffusion-gemma — the block-diffusion spec, distilled first…
Snider Jun 11, 2026
86ab729
feat(gemma4): DiffusionGemma loads natively — Unit A of the block-dif…
Snider Jun 11, 2026
4e2a0cf
feat(gemma4): the block-diffusion denoising step — Unit B, live-proven
Snider Jun 11, 2026
d10a424
feat(gemma4): the block-diffusion generation loop + the diffuse verb …
Snider Jun 11, 2026
58b8f81
feat(metal): DiffusionGemma on the serve — block-diffusion behind the…
Snider Jun 11, 2026
8fd93d7
perf(gemma4): reference convergence doubles diffusion throughput — 13…
Snider Jun 11, 2026
f464eb9
perf(diffusion): ship the measured decode profile — canvas 64 / 16 st…
Snider Jun 11, 2026
8e70b06
feat(serve): -kv-cache reaches the fixed-cache regime; add the book-b…
Snider Jun 11, 2026
b6f1d81
feat(metal): zero-flag fast lane — default the fixed-cache regime; gu…
Snider Jun 11, 2026
c0213ea
fix(metal): restore prompt-cache snapshots into band-stepped storage,…
Snider Jun 11, 2026
e78ce5f
fix(metal): scalar item readers register thread streams before the it…
Snider Jun 11, 2026
b2c715e
fix(metal): open the dead band-grow gate — the serve's per-turn seria…
Snider Jun 11, 2026
d436fce
fix(metal): readers materialise through the worker — close the ensure…
Snider Jun 11, 2026
98e27bf
fix(metal): rotate retired storage generations at commit — kill the p…
Snider Jun 11, 2026
31062ea
feat(state): trusted-prefix sleep — graft parent blocks by reference,…
Snider Jun 11, 2026
8a8c831
test(mlx): per-phase turn split instrument — the #74 serve gap closed…
Snider Jun 11, 2026
4d49a65
fix(mtp-serve): honour the request's thinking override in pair Chat; …
Snider Jun 11, 2026
1abbb02
feat(serve): quarantine diffusion_gemma from the serve while the regi…
Snider Jun 11, 2026
8d58c49
test(dataset): finish the typed Sample.Format migration in the stream…
Snider Jun 11, 2026
ad6fff8
fix(metal): thread explicit PRNG keys through temperature sampling
Snider Jun 11, 2026
3edfc0f
style(metal): gofmt cache.go field alignment
Snider Jun 11, 2026
7cf5e2c
fix(kv): wake restore carries source cache geometry — snapshot v6 rec…
Snider Jun 11, 2026
4d328e8
style(go-mlx): gofmt alignment leftovers in five test files
Snider Jun 11, 2026
7f8ee71
docs(diffusion): add straggler model-card + vllm gist to the referenc…
Snider Jun 11, 2026
4933403
feat(gemma4): implement the audio Conformer encoder (Mantis #1839)
Snider Jun 11, 2026
46a7444
feat(gemma4): log-mel audio feature extractor with HF golden parity
Snider Jun 11, 2026
f89ee5e
feat(gemma4): audio splice — waveform to soft tokens in the multimoda…
Snider Jun 11, 2026
8367c7f
feat(cli): audio verb — answer a prompt about a WAV clip (Mantis #1839)
Snider Jun 12, 2026
bcd58f5
docs(gemma4): ghost-suppressor comments name the official 12B/26B/31B…
Snider Jun 12, 2026
669eec5
feat(gemma4): image/video front-end + vision CLI verb
Snider Jun 12, 2026
f31c8ae
perf(metal): past-cap FixedKVCache append via 3-op gather; document t…
Snider Jun 12, 2026
237a193
refactor(root): extract spine package — shared types + dispatch conve…
Snider Jun 12, 2026
015b01a
refactor(root): Token, Tokenizer, ModelInfoToMemory join spine (#63)
Snider Jun 12, 2026
7ece6d0
refactor(root): extract the session package — Session machinery leave…
Snider Jun 12, 2026
e511a44
refactor(root): LoRAConfig + metal conversions join spine (#63)
Snider Jun 12, 2026
8696679
refactor(root): extract the train package — SFT + SSD machinery leave…
Snider Jun 12, 2026
4d869a1
test(root): orphan sweep — every root test file pairs with its source…
Snider Jun 12, 2026
ca05ec1
docs: test-pairing map — the one-place unpaired-test list + regenerat…
Snider Jun 12, 2026
9e33586
refactor(root): organisation logic check — files named for what they …
Snider Jun 12, 2026
352715b
test(root): edge tidy — backend family consolidated, dead smoke lane …
Snider Jun 12, 2026
77cf439
test(gemma4): synthetic coverage for the diffusion lane (#69, #77 recon)
Snider Jun 12, 2026
7c1c771
test(metal): #77 retiree-theory probe — theory eliminated, cycle is s…
Snider Jun 12, 2026
8edc796
fix(gemma4): diffusion serve OOM — drop the allocator cache per reque…
Snider Jun 12, 2026
d3fe22c
fix(serve): restore the diffusion gate — the C015 field run falsified…
Snider Jun 12, 2026
8d98784
test(gemma4): #77 per-step probe — diffusion sampler FLAT at book sha…
Snider Jun 12, 2026
c415c99
fix(serve): diffusion models bypass conversation continuity — the AR …
Snider Jun 12, 2026
17433f6
feat(cli): discover proves the GPU pipeline — metallib provenance + k…
Snider Jun 12, 2026
b91f1f3
feat(serve): default port 11434 → 36911 — Lethean's own, never collid…
Snider Jun 12, 2026
9006bbb
fix(admin): HF download lane never worked — core.ReadAll yields strin…
Snider Jun 12, 2026
20 changes: 17 additions & 3 deletions .gitignore
@@ -1,18 +1,27 @@
# Build artifacts
build/
bin/
*.dylib
*.so
*.a

# `go build ./go/cmd/mlx/` without -o lands the binary at repo root.
# Convention is `go build -o bin/mlx` (bin/ already ignored above);
# this catches the shortcut form too.
/mlx

# CMake
CMakeCache.txt
CMakeFiles/
cmake_install.cmake
Makefile

-# CMake install output (keep headers for Go module consumers)
-dist/*
-!dist/include/
+# CMake install output
+dist/

# Local Go build/test shortcuts
/go/mlx
/*.test

# IDE
.idea/
@@ -22,6 +31,11 @@ dist/*
# macOS
.DS_Store

# lthn/desktop frontend dist — copied at build time by
# scripts/make-app-bundle.sh, embedded in cmd/mlx via go:embed.
# Single source of truth lives in lthn/desktop/frontend/.
go/cmd/mlx/frontend/dist/

# Knowledge base
KB/
.core/
12 changes: 12 additions & 0 deletions .gitmodules
@@ -22,3 +22,15 @@
path = external/go-io
url = https://github.com/dappcore/go-io.git
branch = dev
[submodule "external/go-ai"]
path = external/go-ai
url = https://github.com/dappcore/go-ai.git
branch = dev
[submodule "external/go-ml"]
path = external/go-ml
url = https://github.com/dappcore/go-ml.git
branch = dev
[submodule "external/go-cgo"]
path = external/go-cgo
url = https://github.com/dappcore/go-cgo.git
branch = dev
20 changes: 15 additions & 5 deletions AGENTS.md
@@ -14,7 +14,7 @@ All Go code lives under `go/`:
`nomlxlm` removes it)
- `go/cmd/violet/` and `go/pkg/daemon/` — local Violet Unix-socket sidecar
- `cpp/` — C++ side companion (CLion-side worktree)
-- `lib/mlx/` — upstream MLX submodule pinned at `v0.30.1`
+- `lib/mlx/` — upstream MLX submodule pinned at `v0.31.1`
- `patches/` — local patches against `lib/mlx` (manual apply only)
- `docs/`, `examples/` — markdown documentation and per-feature usage examples

@@ -25,6 +25,15 @@ Unsupported builds compile against the `*_stub.go` files and a stub
`MetalAvailable() bool` that returns false. Do not move CGO code out of
`go/internal/metal/`.

The native path targets [macOS Tahoe 26.0+](https://developer.apple.com/documentation/macos-release-notes/macos-26-release-notes)
on Apple Silicon. The floor is intentional: the Metal 4 API generation this
runner is built around shipped with macOS 26, including lower-overhead command
encoding, explicit compilation control, tensor resources, and machine-learning
passes. Keep build and test invocations aligned with that floor by passing
`-ldflags "-extldflags=-mmacosx-version-min=26.0"` when compiling native code.
See `docs/operator/deployment.md` and `docs/operator/metallib-and-variants.md`
for the full reference chain.
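
A minimal sketch of a build wrapper that keeps the floor in one place. The function names, the `MACOS_FLOOR` variable, and the `bin/mlx` output path are illustrative assumptions; only the `-ldflags` value and the `./go/cmd/mlx/` package path come from the text in this change.

```bash
# Sketch only: MACOS_FLOOR, native_ldflags and build_native are hypothetical names.
MACOS_FLOOR="${MACOS_FLOOR:-26.0}"

native_ldflags() {
    # Print the external-linker flag that pins the minimum macOS version.
    printf '%s%s' '-extldflags=-mmacosx-version-min=' "$1"
}

build_native() {
    # Run from the repository root; the output lands in the already-ignored bin/.
    go build -ldflags "$(native_ldflags "$MACOS_FLOOR")" -o bin/mlx ./go/cmd/mlx/
}
```

Calling `build_native` from the repository root would then build with the 26.0 floor unless `MACOS_FLOOR` is overridden in the environment.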

## Conventions

- UK English in code, comments, and docs (colour, organisation, behaviour)
@@ -47,10 +56,11 @@ model downloads.

## Sandboxing Notes

-Before handing off, run the repository gates from the brief with `GOWORK=off`.
-On sandboxed systems, set `GOCACHE` to a writable directory such as
-`/tmp/codex-go-mlx-cache` so Go can compile without touching the user
-cache. If the sandbox cannot resolve the bundled `mlx.metallib`, apply
+Before handing off, run the repository gates from the checked-in workspace; do
+not use `GOWORK=off` unless the user explicitly asks for an isolated module
+check. On sandboxed systems, set `GOCACHE` to a writable directory such as
+`/tmp/codex-go-mlx-cache` so Go can compile without touching the user cache.
+If the sandbox cannot resolve the bundled `mlx.metallib`, apply
`patches/mlx-metallib-path.patch` inside `lib/mlx` to enable the
`MLX_METALLIB_PATH` env-var override (not auto-applied).

7 changes: 4 additions & 3 deletions CLAUDE.md
@@ -44,17 +44,18 @@ After Mantis #1241, all Go code lives under `go/`:
```
go/ Go module root (dappco.re/go/mlx)
*.go Public root API: model, tokenizer, compute, training, eval, distill, GRPO, hf-fit, merge, gguf-quantize, kv-snapshot, lora-fuse
cmd/mlx/ CLI tool (built with `-o core-mlx`; consumers rename: lthn-mlx)
cmd/violet/ Unix-socket sidecar daemon
internal/metal/ All CGO code (mlx-c bindings)
mlxlm/ CGO-free Python subprocess backend
pkg/daemon/ Daemon implementation
-pkg/memvid/ Memvid storage CLI
+pkg/memvid/ Deprecated State codec compatibility shim
tests/ Integration tests
cpp/ C++ side (CLion-side companion)
docs/ Markdown documentation
examples/ Per-feature usage examples (markdown)
external/ Vendored core libraries
-lib/mlx/ Upstream mlx submodule (pinned at v0.30.1)
+lib/mlx/ Upstream mlx submodule (pinned at v0.31.1)
patches/ Local patches to lib/mlx (not auto-applied)
```

@@ -127,7 +128,7 @@ Architecture is detected from `config.json` (`model_type`) for safetensors and f

## Submodule Patches

-`lib/mlx` is pinned at upstream tag `v0.30.1`. Local patches that we do not upstream live in `patches/` as standalone diff files (e.g. `patches/mlx-metallib-path.patch` for the `MLX_METALLIB_PATH` env-var override). Patches are not auto-applied — run them inside the submodule manually when their function is needed:
+`lib/mlx` is pinned at upstream tag `v0.31.1`. Local patches that we do not upstream live in `patches/` as standalone diff files (e.g. `patches/mlx-metallib-path.patch` for the `MLX_METALLIB_PATH` env-var override). Patches are not auto-applied — run them inside the submodule manually when their function is needed:

```bash
git -C lib/mlx apply ../../patches/mlx-metallib-path.patch
119 changes: 119 additions & 0 deletions CLAUDE.operator.md
@@ -0,0 +1,119 @@
# CLAUDE.operator.md

Operator-facing guidance for **running** `lthn-mlx` in production. Companion to `CLAUDE.md` (developer-facing — architecture, build, contribute). If you arrived here mid-session needing to deploy, troubleshoot, or reason about distribution, you're in the right doc. If you arrived needing to add a model decoder or change cgo bindings, go to `CLAUDE.md`.

The operator audience is a future Cladius / Athena / Hephaestus session, *or* a human operator (Snider, ops-side) doing a deploy. Same mental model serves both — the difference is just whether the reader can edit code on the spot.

## Read order

1. **This file**, skim through "Operating principles" — calibrates what the binary is and isn't.
2. **`docs/operator/deployment.md`** — what you ship, how it runs, what to bind to.
3. **`docs/operator/metallib-and-variants.md`** — the variant question, the bundling strategy, the active CWD-resolution panic.
4. **`docs/operator/troubleshooting.md`** — the failure modes in lifecycle order, with fixes.
5. **`docs/operator/index.md`** — the full operator doc set + what's planned.

If you have ~3 minutes, read this file. If you have ~30 minutes, read all five.

## What lthn-mlx is

A single-process boundary that wraps native Apple Metal GPU inference (via mlx-c CGO bindings) and serves it as OpenAI / Anthropic / Ollama-compatible HTTP. Snider's framing, made explicit on 2026-05-25:

> **"The actual model is the binary, the rest is package."**

This is the load-bearing architecture decision. Everything that wants inference — `lthn` desktop, `pkg/lemma` in lthn/desktop, providers in `go-ai`, any OpenAI-compatible Python / TypeScript / curl client — talks to `lthn-mlx` over HTTP. There is no in-process library substitute for production. The binary is the boundary.

**One process. One model. One HTTP listener.** That's the unit. Multi-model deployments mean multiple processes on different ports plus a router in front (the `pkg/lemma` client is the canonical Go-side router).

The binary is built from `dappco.re/go/mlx/cmd/mlx`. Its default output name is `core-mlx`; consumers rename it to `lthn-mlx`. The module path is `dappco.re/go/mlx`.

## Operating principles

These are the load-bearing facts an operator needs in working memory. Each one shapes a deployment decision.

### 1. Apple Silicon only

`darwin/arm64` only: no Linux, no Intel macOS. The CGO files carry `//go:build darwin && arm64`; a stub returns `MetalAvailable() = false` everywhere else. One binary serves M1 / M2 / M3 / M4, in any chip class, on any macOS deployment target ≥13 (modulo the metallib variant matrix; see principle 5).

If the deployment target isn't Apple Silicon, you don't want `lthn-mlx` — you want a different go-inference backend (`go-rocm` for AMD GPUs, or the CGO-free `mlxlm` subprocess backend bundled in the same repo for Python-on-anything).

### 2. The binary needs the metallib

`mlx.metallib` (~107 MB, MetalLib v1.2.9, the compiled GPU kernel archive) must be findable at runtime. Today, until the bundling work lands, this means **setting `MLX_METALLIB_PATH` to an absolute path** before invoking. Not setting it is the single most common deployment failure — the binary starts, `/v1/health` passes, then panics inside `mlx_metal_load_library` on the first GPU dispatch.

```bash
export MLX_METALLIB_PATH=/opt/lthn-mlx/lib/mlx.metallib
lthn-mlx serve --model /opt/lthn-mlx/models/lemer-lite --addr :11434
```

The permanent fix is Path B bundling (embed via `//go:embed`, load via `MTLDevice newLibraryWithData:`). Until that ships, treat the env var as mandatory deployment config. See `docs/operator/metallib-and-variants.md` for the why and `docs/operator/troubleshooting.md` for the panic signature.
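
Because the failure only surfaces on the first GPU dispatch, it can help to check the env var before the service starts. The following is a minimal sketch of such a preflight check, not something `lthn-mlx` ships: the `checkMetallibPath` helper is hypothetical, and it only verifies that the path is set, absolute, and points at a regular file. It does not validate the archive contents.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// checkMetallibPath validates a candidate MLX_METALLIB_PATH value.
// Hypothetical helper: it checks the path only, not the metallib contents.
func checkMetallibPath(path string) error {
	if path == "" {
		return errors.New("MLX_METALLIB_PATH is not set")
	}
	if !filepath.IsAbs(path) {
		return fmt.Errorf("MLX_METALLIB_PATH must be absolute, got %q", path)
	}
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("metallib not readable: %w", err)
	}
	if !info.Mode().IsRegular() {
		return fmt.Errorf("%q is not a regular file", path)
	}
	return nil
}

func main() {
	// Exit non-zero so a service manager's pre-start hook can block the start.
	if err := checkMetallibPath(os.Getenv("MLX_METALLIB_PATH")); err != nil {
		fmt.Fprintln(os.Stderr, "preflight failed:", err)
		os.Exit(1)
	}
	fmt.Println("metallib preflight ok")
}
```

Run as a pre-start step, a check like this turns a panic on the first request into a refusal to start, which is easier to spot in service logs.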

### 3. Model loads lazily

`lthn-mlx serve` starts in under a second. The model loads on the **first request that needs it**, not at process start. This means:

- Liveness probes against `/v1/health` pass before the model is loaded. They are not readiness probes.
- The first inference request after start takes 2-15 seconds depending on model size and storage speed.
- For consistent first-request latency, pre-warm in the service manager's post-start hook with a one-token completion (see deployment.md).

There is no on-disk lock, no PID file, no recovery state. Restart is safe; the new process starts cold and lazy-loads. The service manager is responsible for single-instance enforcement.
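
The pre-warm step can be sketched as a small Go program that sends a one-token completion and treats any 2xx response as "model loaded". This is an illustration under assumptions rather than the documented hook: the `/v1/chat/completions` path is the standard OpenAI route and is assumed to be served here, the `model` value and the `warmupBody` / `prewarm` helpers are hypothetical, and `main` talks to an in-process stand-in server so the sketch runs without a GPU.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string        `json:"model"`
	Messages  []chatMessage `json:"messages"`
	MaxTokens int           `json:"max_tokens"`
}

// warmupBody builds a one-token chat completion request body.
func warmupBody(model string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model:     model,
		Messages:  []chatMessage{{Role: "user", Content: "ping"}},
		MaxTokens: 1,
	})
}

// prewarm posts the warm-up request and treats any 2xx status as success.
func prewarm(ctx context.Context, baseURL, model string) error {
	body, err := warmupBody(model)
	if err != nil {
		return err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("warm-up returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in for lthn-mlx so the sketch runs anywhere. In a real post-start
	// hook, baseURL would be something like http://127.0.0.1:11434.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer fake.Close()

	// Generous deadline: the first request pays the model-load cost.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := prewarm(ctx, fake.URL, "lemer-lite"); err != nil {
		fmt.Println("not ready:", err)
		return
	}
	fmt.Println("ready")
}
```

The same call doubles as a readiness probe: unlike `/v1/health`, it only succeeds once the model has actually loaded.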

### 4. HTTP surface is trusted-network only

`lthn-mlx serve` has no authentication, no rate limiting, no TLS. Default bind is `:11434` (matches Ollama). Bind to `127.0.0.1:11434` for same-machine, `0.0.0.0:11434` for LAN. **Production LAN exposure sits behind a reverse proxy** that handles auth and TLS (Caddy, nginx).

If you need authenticated remote access, that lives in `pkg/lemma` (the Go client) plus a tunnel / proxy / auth-gateway — not in `lthn-mlx` itself. Don't try to add auth to the serve binary; it would violate the boundary rule and duplicate work already done one layer up.
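
To make "auth lives one layer up" concrete, here is a rough sketch of a separate front process that checks a bearer token and forwards to the serve binary. It is a stand-in for the Caddy / nginx role described above, not a recommendation to replace them: `authorized` and `newAuthProxy` are hypothetical names, the token is hard-coded for illustration, TLS termination is omitted, and `main` uses in-process test servers in place of `lthn-mlx`.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// authorized reports whether an Authorization header carries the expected bearer token.
func authorized(header, token string) bool {
	got, ok := strings.CutPrefix(header, "Bearer ")
	if !ok || token == "" {
		return false
	}
	// Constant-time comparison avoids leaking the token through timing.
	return subtle.ConstantTimeCompare([]byte(got), []byte(token)) == 1
}

// newAuthProxy forwards authorized requests to upstream and rejects the rest with 401.
func newAuthProxy(upstream *url.URL, token string) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r.Header.Get("Authorization"), token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		r.Header.Del("Authorization") // keep the credential out of the upstream request
		proxy.ServeHTTP(w, r)
	})
}

func main() {
	// Stand-in for lthn-mlx bound to loopback.
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	defer upstream.Close()

	target, err := url.Parse(upstream.URL)
	if err != nil {
		panic(err)
	}
	front := httptest.NewServer(newAuthProxy(target, "s3cret"))
	defer front.Close()

	// One request without a token, one with the correct token.
	for _, tok := range []string{"", "s3cret"} {
		req, err := http.NewRequest(http.MethodGet, front.URL+"/v1/health", nil)
		if err != nil {
			panic(err)
		}
		if tok != "" {
			req.Header.Set("Authorization", "Bearer "+tok)
		}
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
		fmt.Println(resp.StatusCode)
	}
}
```

The point of the sketch is the shape: the serve binary stays bound to loopback and unauthenticated, while the credential check lives in a separate process in front of it.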

### 5. Variants matter at the toolchain axis, not the chip axis

Snider's question of 2026-05-25: "if the lib is different for different apple versions, we need to know the variants that need building." The chip family (M1/M2/M3/M4) is **not** a variant axis — Apple's Metal driver handles forward-compatibility from a single archive. What actually varies is the build-host toolchain: Metal language version ≥4.0 + macOS SDK ≥26.2 (Xcode 26+) unlocks the NAX kernel family for M4-class tensor coprocessors.

**Practical ship matrix:**

| Variant | Build host | Runs on | Use case |
|---------|------------|---------|----------|
| `mlx-baseline.metallib` | Any modern Xcode, deployment-min 13 | M1-M4 on macOS 13+ | Default ship today |
| `mlx-nax.metallib` | Xcode 26+, deployment-min 26 | M4-class on macOS 26+ only | Deferred to M4 optimisation lane |

Ship the baseline. The NAX variant is a future M4 fast-path optimisation, not a today-decision. Full evidence and the open questions (driver-side load behaviour for higher `min`, NAX dispatch gating on non-M4) are in `docs/operator/metallib-and-variants.md`.

### 6. Unified memory is the budget

On Apple Silicon there is no separate VRAM line item — the GPU and CPU share unified memory. The process budget includes: model weights, KV cache (scales linearly with `--context`), MLX allocator cache, plus everything else macOS is doing. A 7B model in 4-bit needs ~5 GB resident; the default 131k context can add several more.
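
The linear scaling with `--context` can be made concrete with a back-of-envelope estimate. The sketch below uses the usual dense-attention formula (one key and one value vector per layer, per KV head, per token). The model shape is hypothetical and chosen only to illustrate the arithmetic; the cache layout `lthn-mlx` actually uses for a given model may differ, so treat the result as a rough upper-level guide rather than a measurement.

```go
package main

import "fmt"

// kvCacheBytes estimates KV cache size for a dense-attention transformer:
// 2 (key + value) x layers x KV heads x head dim x bytes per element, per token.
func kvCacheBytes(layers, kvHeads, headDim, contextTokens, bytesPerElem int) int64 {
	perToken := int64(2) * int64(layers) * int64(kvHeads) * int64(headDim) * int64(bytesPerElem)
	return perToken * int64(contextTokens)
}

func main() {
	// Hypothetical model shape (not taken from any shipped model), fp16 cache.
	const layers, kvHeads, headDim, fp16 = 32, 4, 128, 2
	for _, ctx := range []int{8192, 32768, 131072} {
		gib := float64(kvCacheBytes(layers, kvHeads, headDim, ctx, fp16)) / (1 << 30)
		fmt.Printf("context %6d -> %.2f GiB\n", ctx, gib)
	}
}
```

With these assumed dimensions the cache costs 64 KiB per token, so a full 131072-token context comes to 8 GiB on top of the weights. Halving the context halves the figure, which is why `--context` is the first thing to lower on a memory-constrained machine.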

Tuning knobs live in `dappco.re/go/mlx` at the package level (`SetMemoryLimit`, `SetCacheLimit`, `SetWiredLimit`, `ClearCache`, `GetActiveMemory`, `GetPeakMemory`). They are **not** exposed as `serve` flags today — if you need them on the bundled CLI, file a feature ticket against `cmd/mlx/serve.go`. For now, custom integrations on top of `openai.NewMuxWithAdmin` can wire them directly.

Activity Monitor's "Memory" column is the right place to watch the process. `/v1/cache/stats` reports MLX's allocator view.

### 7. Graceful shutdown is signal-driven

SIGINT and SIGTERM both trigger `http.Server.Shutdown` with `--shutdown-timeout` (default 10s) as the drain deadline. After the deadline, the process exits. There is no explicit model-unload step — the OS reclaims Metal allocations on exit.

If you have long-running generations and need them to drain cleanly on bounce, raise `--shutdown-timeout` (30s-60s). If you need explicit teardown for an exotic daemon scenario, wire the `Sleep` admin callback in a custom integration.
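
For readers unfamiliar with the pattern, signal-driven draining with `http.Server.Shutdown` looks roughly like the sketch below. This is the generic standard-library idiom, not a copy of `cmd/mlx/serve.go`: `newServer` and `drainOn` are hypothetical names, the listen address is arbitrary, and the 10-second value simply mirrors the documented `--shutdown-timeout` default.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// newServer builds a minimal server with a health route (illustrative only).
func newServer(addr string) *http.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return &http.Server{Addr: addr, Handler: mux}
}

// drainOn blocks until done is closed, then gives in-flight requests up to
// drain to finish. Shutdown returns the context error if the deadline passes.
func drainOn(done <-chan struct{}, srv *http.Server, drain time.Duration) error {
	<-done
	ctx, cancel := context.WithTimeout(context.Background(), drain)
	defer cancel()
	return srv.Shutdown(ctx)
}

func main() {
	// The context is cancelled on SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	srv := newServer("127.0.0.1:8080")
	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("listen: %v", err)
		}
	}()

	if err := drainOn(ctx.Done(), srv, 10*time.Second); err != nil {
		log.Printf("drain deadline exceeded: %v", err)
	}
}
```

`Shutdown` stops accepting new connections immediately and waits for active requests, so the timeout only matters for generations still streaming when the signal arrives.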

## Mental model in one paragraph

`lthn-mlx serve` is a stateless, OpenAI-compatible HTTP server backed by Apple Metal GPU inference. It runs one model per process, lazy-loads that model on the first request, and shuts down gracefully on SIGINT or SIGTERM. It requires a findable `mlx.metallib` (via env var until bundling lands) and has no built-in auth or TLS, so it is designed for trusted-network use; a `pkg/lemma`-shaped routing layer one level up handles multi-model and remote-access patterns. The architecture insists on the binary as the only process boundary: everything else is packages talking to it over HTTP.

That paragraph plus the seven principles is the working mental model. Everything else in `docs/operator/` fills in the operator's view of specific concerns.

## What this doc does not cover

- **How the inference works inside.** That's `docs/architecture.md`, `docs/runtime/`, `docs/memory/`. Developer-side.
- **How to add a model architecture.** That's a decoder under `go/internal/metal/`. Developer-side.
- **How training works.** That's `docs/training.md`, `docs/distillation.md`, `docs/grpo.md`. Production-bench / research-side.
- **GOAL.md production-bench lane.** Separate concern with its own canonical brief.
- **Memory limits & cache tuning as a knob set.** Stubbed in `docs/operator/performance-tuning.md` — not yet written. Source of truth meanwhile: `go/internal/metal/backend.go:10-12` and the `mlx.Set*` package surface.

## When the docs and reality disagree

This doc and `docs/operator/*` describe behaviour. Behaviour changes. If you find a discrepancy between what `lthn-mlx serve` actually does and what these docs claim, **the code is right and the docs are wrong**. Fix the doc, or PR a comment-block on the responsible source file referencing this directory.

The maintenance discipline lives in `docs/operator/index.md` under "Maintenance discipline." Read it if you're about to merge a PR that touches `cmd/mlx/serve.go`, `go/openai/openai.go`, `go/openai/admin.go`, or `go/internal/metal/backend.go` — those four files are the operator-visible surface.

## Files this directory ships

- `CLAUDE.operator.md` (this file) — operator mental model
- `docs/operator/index.md` — operator doc index + planned slots
- `docs/operator/deployment.md` — what you ship + how it runs
- `docs/operator/metallib-and-variants.md` — bundling strategy + variant matrix
- `docs/operator/troubleshooting.md` — lifecycle-phase failure modes
10 changes: 8 additions & 2 deletions CMakeLists.txt
@@ -3,6 +3,11 @@ cmake_minimum_required(VERSION 3.24)
project(mlx)

set(CMAKE_OSX_DEPLOYMENT_TARGET "26.0" CACHE STRING "Minimum macOS version")
+set(CMAKE_CXX_STANDARD 23)
+set(CMAKE_CXX_STANDARD_REQUIRED ON)
+set(CMAKE_CXX_EXTENSIONS ON)
+
+include(${CMAKE_CURRENT_LIST_DIR}/cmake/CompilerCache.cmake)

if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_SOURCE_DIR}/dist" CACHE PATH "" FORCE)
@@ -11,13 +16,14 @@ endif()
set(MLX_BUILD_GGUF ON CACHE BOOL "" FORCE)
set(MLX_BUILD_SAFETENSORS ON CACHE BOOL "" FORCE)
set(MLX_C_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE)
-set(BUILD_SHARED_LIBS ON CACHE BOOL "" FORCE)
+set(BUILD_SHARED_LIBS OFF CACHE BOOL "" FORCE)

set(CMAKE_INSTALL_RPATH "@loader_path")

include(FetchContent)

-set(MLX_C_GIT_TAG "v0.4.1" CACHE STRING "")
+set(MLX_C_GIT_TAG "fba4470" CACHE STRING "") # mlx-c main: bindings regenerated for MLX 0.31.2 (v0.6.0 predates the 0.31.2 FFT API)
+set(FETCHCONTENT_SOURCE_DIR_MLX "${CMAKE_CURRENT_SOURCE_DIR}/lib/mlx" CACHE PATH "Local patched MLX source")

FetchContent_Declare(
mlx-c