Model Engine with restore KV by Snider · Pull Request #6 · dAppCore/go-inference

Snider · 2026-05-20T05:53:58Z

Summary by CodeRabbit

New Features
- OpenAI, Anthropic and Ollama provider adapters; GGUF model discovery; capability reporting; VQ and JANG quantisation tooling; scheduler with queuing/cancellation; probe event bus and observability; split‑inference planning; durable agent‑memory with on‑disk filestore; dataset/training/tuning/benchmarking and decode harness; parsers for reasoning/tools/thinking; expanded HTTP service endpoints and streaming support.
Documentation
- Comprehensive package & contract documentation, navigation/readme, contributor standards and full EUPL‑1.2 licence text.

Reference: core/api/LICENCE. Co-Authored-By: Cladius Maximus <cladius@lethean.io>

…block) - git submodule update on external/* to current dev tips - go.work paths fixed for Phase 1 /go/ subtree layout where stale - go.work go-version bumped 1.26.0 → 1.26.2 to match submodule floor Workspace-mode build (`go build ./...`) is the verification path. Some repos may surface transitive dep issues (api/go.sum checksum drift, etc.) which are separate cascade tickets — not blocking this metadata refresh. Co-Authored-By: Cladius Maximus <cladius@lethean.io>

Co-Authored-By: Virgil <virgil@lethean.io>

…336) go-inference gets the canonical service-registration shape per #1336. Naming divergence from canon required: the package already exposes `Register(b Backend)` as the well-known init-time backend-registration pattern (every backend init() calls inference.Register(metal.NewBackend())). Renaming would break every backend. So the canonical Core registration is `RegisterCore(c)` here; existing `Register(b Backend)` preserved untouched. Naming-divergence documented inline in service.go. inference.NewService(inference.Options{}) → factory for core.WithService inference.RegisterCore(c) → defaults shorthand inference.Register(b) → unchanged: backend self-registration v1 Options is empty since package behaviour is driven by the global Backend registry which is independently managed via init(). Smoke verified: - GOWORK=off go vet ./... — clean - TestNewService_RegistersInferenceService — PASS - TestRegisterCore_Imperative — PASS Co-Authored-By: Virgil <virgil@lethean.io>

Permanent service_test.go for canon shape (commit bbdaf88). Two cases: NewService(empty) round-trip + RegisterCore imperative shorthand. Note the RegisterCore name (not Register) preserves the existing `func Register(b Backend)` init-time backend self-registration pattern. Coverage sweep (#1387): 8th of 22. Co-Authored-By: Virgil <virgil@lethean.io>

Promote three areas to public packages alongside per-file documentation: - state/ — Wake/Sleep/Fork lifecycle, identity DTOs (Model/Tokenizer/ Adapter/Runtime/Sampler), Store/Resolver/Writer interfaces, InMemoryStore reference impl, filestore/ append-only backend. Identity types hoisted out of inference root; aliases preserved in identity.go for stable imports. - openai/responses.go, services.go — Responses API DTOs + embeddings, rerank, capabilities, cache, cancel handlers. - anthropic/, ollama/ — wire-compat DTO packages. - contracts.go promoted from internal to public: SchedulerModel, CancellableModel, CacheService, EmbeddingModel, RerankModel, ReasoningParser, ToolParser, ModelPackInspector + AgentMemory* aliases. - capability.go: 41 stable CapabilityID values, AlgorithmProfile, RuntimeMemoryLimits, CapabilityReporter. docs/ pass adds per-file documentation under docs/{package}/{file}.md so future readers can plan against shapes without reading code. 24 new docs covering state/ + openai/ + anthropic/ + ollama/ + inference/ root files plus package READMEs and a top-level index. Co-Authored-By: Virgil <virgil@lethean.io>

Lifts model-family reasoning + tool-call parsing out of go-mlx so every driver (mlx, rocm, cuda, tpu, future) inherits the same logic. Surface: - Hint{Architecture, AdapterName} — minimum selector input from drivers - Mode (Show/Hide/Capture) + Config + Chunk + Result — thinking-channel DTOs - OutputParser interface + Registry + ForHint(hint) — registry surface - NewProcessor(cfg, hint) + Filter(text, cfg, hint) — thinking-mode processor - Family(hint) + NormaliseKey(value) — selector helpers Built-in parsers: qwen, gemma, deepseek-r1, gpt-oss, minimax, mistral, kimi, glm, hermes, granite, generic fallback. Marker sets match the prior go-mlx implementation byte-for-byte. Driver side: a hint conversion (parser.Hint{Architecture, AdapterName} from each driver's local model info) and any tokenizer-using wrappers stay in the driver — FilterThinkingTokens in go-mlx is one such shim. Tests cover: family lookup across 11 architectures, reasoning parsing for qwen/gemma/gpt-oss markers, tool parsing tagged + JSON fallback + bad payloads, custom-parser registration, nil-receiver fallbacks, thinking-mode hide + show + capture + processor partial-flush. Co-Authored-By: Virgil <virgil@lethean.io>

Adds the missing probe vocabulary for request-scheduler observability: - ProbeEventScheduler — kind constant for queue/scheduler events - ProbePhaseQueue — phase constant for queue-side timing - ProbeScheduler — request-id, event, queue depth, queue/first-token/ total latency in millis, cancelled flag - Scheduler *ProbeScheduler field on ProbeEvent Drivers (go-mlx scheduler.go and downstream peers) emit through this shape so probe consumers branch on Kind/Phase and unwrap the typed payload uniformly. Co-Authored-By: Virgil <virgil@lethean.io>

Splits the JANG/JANGTQ + VQ codebook quant metadata out of go-mlx so every driver (mlx, rocm, cuda, tpu, future) inherits them. quant/jang/ - Info, Capabilities, TensorRole (+ consts), PackedProfile, PackedTensorDescriptor, BitOrderLSB0, EncodingAffine - ReadConfig(path), ParseConfig(data), ProfileBits(name), BuildPackedProfile, ClonePackedProfile, NewPackedTensorDescriptor, ValidatePackedTensor, DequantizePackedTensor, PackQuantizedValues - Reference CPU dequant + pack for parity tests vs native kernels. - Driver side: HF metadata inference helpers (inferJANGQuantizationFromHF / hfJANGGroupSize) stay in go-mlx as a thin file that imports this package — they depend on mlx.HFModelMetadata which itself isn't lifted yet. quant/codebook/ - Profile, TensorDescriptor, Type ("codebook"), FormatVQ ("vq") - ParseProfile(data), ReadProfile(path), NewTensorDescriptor, ValidateProfile, ValidateTensorDescriptor, ValidateTensorPayload, MatVec(desc, input, codes, table, bias), CloneProfile Symbol-namespace rename — package name takes the disambiguation slot: JANGQuantizationInfo → jang.Info JANGCapabilities → jang.Capabilities JANGPackedQuantizationProfile → jang.PackedProfile JANGPackedTensorDescriptor → jang.PackedTensorDescriptor NewJANGPackedTensorDescriptor → jang.NewPackedTensorDescriptor BuildJANGPackedQuantizationProfile → jang.BuildPackedProfile CodebookQuantizationProfile → codebook.Profile CodebookTensorDescriptor → codebook.TensorDescriptor ParseCodebookQuantizationProfile → codebook.ParseProfile CodebookVQMatVec → codebook.MatVec ... Tests ported — file-aware Test<File>_<Symbol>_<Variant> shape: parity round-trip, attention-wide-bits, unsupported-bits diagnostic, packed-length validation, profile build, descriptor validate-and- matvec, unaligned-shape rejection, out-of-range code diagnostic, JSON config parse. All green. Companion lift: model/minimax/m2 + moe expert_residency policy land in follow-up commits — m2 has safetensorIndex couplings, expert_ residency needs a budget-bytes refactor away from Apple-class enum. Co-Authored-By: Virgil <virgil@lethean.io>

Add eval package with interface-driven design: Sample/Batch/BatchConfig are opaque (any), Dataset is a Next-iterator interface, and Runner is a struct of callbacks the driver fills in (Info, LoadAdapter, BuildBatches, EvaluateBatch, BatchTokens, SampleText). eval.RunDataset orchestrates: sample collection, batch building (via runner), per-batch evaluation, metrics aggregation (loss + perplexity), and default + user-supplied quality probes. AdapterInfo is defined locally rather than imported from go-mlx/lora — keeps eval driver-neutral so go-rocm/go-cuda/etc. can also adopt without pulling go-mlx as a dependency. ResponseCoverageProbe is provided as an exported probe so driver wrappers can attach it without eval needing to know sample field shape. Co-Authored-By: Virgil <virgil@lethean.io>

Verb-shaped Runner: driver provides Generate + per-section Bench* callbacks (BenchPromptCache, BenchMemvidKVBlockWarm, BenchKVRestore, BenchStateBundle, BenchProbeOverhead, BenchSpeculativeDecode, BenchPromptLookupDecode). bench.Run orchestrates Info collection + generation timing + dispatches each enabled callback + assembles the Report. Report types are driver-neutral data: GenerationSummary/Sample, PromptCacheReport, MemvidKVBlockWarmReport, LatencyReport, StateBundleReport, ProbeReport (Events []any for opaque driver-event vocabularies), DecodeOptimisationReport, QualityReport. GenerationMetrics is a flat mirror of the driver's per-call metrics (PrefillTokensPerSec, DecodeTokensPerSec, PeakMemoryBytes, etc.) — same fields as go-mlx's Metrics struct so drivers populate it directly. PopulateMemvidKVBlockWarmBench is exposed so drivers can hand off the cross-cutting derived fields (Speedup, BreakEvenQuestions) once their capture/restore measurements are in. Co-Authored-By: Virgil <virgil@lethean.io>

Covers Run callback dispatch (verb-callbacks fire iff IncludeX flag is set and the callback is non-nil), Generate-error propagation, nil-context fallback, GenerationSummary aggregation (rates averaged, peaks maxed, total-duration fallback to elapsed), default + zero-config normalisation with independent slice clones, PopulateMemvidKVBlockWarmBench derived fields (speedup, saved-per-question, break-even), AdapterInfo.IsEmpty, GenerateOptions probe-sink passthrough + StopTokens clone, NonZeroDuration floor. Backfills the coverage gap left by deleting fast_eval_test.go, fast_eval_example_test.go, and workload_bench_test.go from go-mlx — those exercised the old raw-callback Runner shape; the verb-callback redesign needs tests against the bench package directly. Co-Authored-By: Virgil <virgil@lethean.io>

Lifts the decode-optimisation algorithm from go-mlx (decode_optimisation.go) into a self-contained driver-neutral package. Symbols rename per the folder-taxonomy rule that packages don't repeat their own prefix: RunSpeculativeDecode → decode.Speculative RunPromptLookupDecode → decode.PromptLookup DecodeOptimisationResult → decode.Result DecodeOptimisationMetrics → decode.Metrics SpeculativeDecodeConfig → decode.SpeculativeConfig PromptLookupDecodeConfig → decode.PromptLookupConfig DecodeGenerateFunc → decode.GenerateFunc DecodeGeneration → decode.Generation DecodeModeSpeculative → decode.ModeSpeculative DecodeModePromptLookup → decode.ModePromptLookup Token + GenerateConfig + Generation become decode-package types with a minimal ID/Text/Value surface — drivers convert their native token type at the boundary (same pattern as bench.AdapterInfo). Coverage: ports the original three tests + adds error-propagation + nil-context + token-equality + clone-independence + max-tokens-clamp + draft-tokens-clamp + utility checks. Sixteen tests, five examples, all green. Co-Authored-By: Virgil <virgil@lethean.io>

…odel Lifts the package-first request scheduler from go-mlx into a self-contained driver-neutral package. Symbols rename per the folder-taxonomy rule: ScheduledModel → scheduler.Model SchedulerConfig → scheduler.Config NewScheduledModel → scheduler.New scheduledJob → job (private) emitSchedulerProbe → (Model).emitProbe (private method) scheduledGenerateOptions → generateOptions (private) cloneSchedulerLabels → cloneLabels (private) scheduler.Model wraps an inference.TextModel with bounded queueing, cancellation, streaming backpressure, and ProbeEventScheduler probe emission. Worker pool sized by Config.MaxConcurrent; queue bounded by MaxQueue; per-request stream buffer set by StreamBuffer. Coverage: queue + latency probe, full-queue rejection, cancellation, Generate/Chat/Classify/BatchGenerate delegation, nil-scheduler defence paths, fallback cancel via inference.CancellableModel, Err propagation, generateOptions sampler conversion, cloneLabels defensive copy, millis helpers. Six tests, ten examples, all green. Co-Authored-By: Virgil <virgil@lethean.io>

Add project seed wake/continuation helpers, local tuning DTOs, and split-inference planning contracts for go-mlx agent workflows. Record first-token benchmark timing and Gemma channel thought markers so downstream runners can preserve long-context measurements and strip thinking history correctly. Co-Authored-By: Virgil <virgil@lethean.io>

coderabbitai · 2026-05-20T05:54:11Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a complete backend‑neutral inference surface: contracts, provider wire adapters (OpenAI/Anthropic/Ollama), parsers/thinking, scheduler, state stores (in‑memory + filestore) and agent‑memory, discovery/GGUF reader, evaluation/bench/decode tooling, quant modules (codebook/JANG), probes, extensive docs, tests, EUPL‑1.2 licence, and a submodule pointer update.

Changes

Single cohesive cohort

Layer / File(s)	Summary
Core contracts, capability and identity `go/contracts.go`, `go/capability.go`, `go/identity.go`, `go/service.go`	Adds portable inference contracts, capability reporting/inference, identity re-exports, sampler↔generate converters, and Core service registration helpers.
Provider adapters `go/openai/`, `go/anthropic/`, `go/ollama/*`	OpenAI chat/responses (SSE/non‑SSE), resolvers, stop/truncate/thinking extraction; Anthropic and Ollama wire types and translators; Responses/services handlers for embeddings/rerank/cache/cancel.
Parsing & thinking `go/parser/*`	Parser registry, reasoning marker sets, thinking processor (Show/Hide/Capture), tool-call parsing, and tests.
Scheduler `go/scheduler/*`	Request queueing wrapper around TextModel with cancellation, streaming, probe emission, and delegation; tests/examples.
State, filestore & agent memory `go/state/`, `go/state/filestore/`	State identity/bundle types, in‑memory store, append‑only filestore implementation (legacy support), agent-memory Wake/Sleep/Fork APIs, project-seed helpers, and tests.
Discovery & GGUF `go/discover.go`, `go/gguf.go`	Model discovery extensions, GGUF header/metadata reader and discovery helpers; tests.
Eval, bench, decode, dataset `go/eval/`, `go/bench/`, `go/decode/`, `go/dataset.go`	Dataset/evaluator, bench runner and reports, speculative/prompt‑lookup decode harness, dataset DTOs and examples/tests.
Quantisation tools `go/quant/codebook/`, `go/quant/jang/`	Codebook VQ profile parsing and matvec; JANG packed tensor descriptors, pack/unpack and dequantisation; tests.
Probe/observability `go/probe.go`, tests/examples	Typed ProbeEvent model, ProbeSink and ProbeBus fan‑out impl, and examples/tests.
Docs & Licence `docs/*/`, `LICENCE`	Adds EUPL‑1.2 licence and comprehensive documentation for inference, state, providers, parsers, service, and tools.
Submodule `external/go`	Advances external/go submodule pointer.
Tests & examples many `_test.go` and `example_test.go` files	Extensive unit tests and Go examples across all new modules.

Warning

Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption.

coderabbitai

Actionable comments posted: 1

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (35)

docs/inference/gguf.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the canonical SPDX header key to avoid licence-scanner misses.

SPDX-Licence-Identifier should be SPDX-License-Identifier. Please apply this across all newly added docs headers in this PR.
Proposed fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/inference/gguf.md` at line 1, The SPDX header in the new docs uses the
misspelled key "SPDX-Licence-Identifier"; update all occurrences to the
canonical "SPDX-License-Identifier" (replace the string literal
"SPDX-Licence-Identifier" with "SPDX-License-Identifier") across the newly added
documentation headers in this PR so the licence-scanner recognizes them.
docs/ollama/ollama.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the canonical SPDX header key to keep licence scanners working.

SPDX-Licence-Identifier is not the recognised SPDX key. Please use SPDX-License-Identifier exactly.
Suggested fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/ollama/ollama.md` at line 1, Replace the incorrect SPDX header key
`SPDX-Licence-Identifier` with the canonical `SPDX-License-Identifier` in the
file (update the header comment string exactly), ensuring the SPDX header uses
the exact spelling so licence scanners recognize it.
docs/state/project_seed.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Correct SPDX header token for this document.

Please switch to SPDX-License-Identifier; the current token is non-standard and may be ignored by compliance tooling.
Proposed fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/state/project_seed.md` at line 1, Update the SPDX header token at the
top of docs/state/project_seed.md by replacing the non-standard
"SPDX-Licence-Identifier" token with the correct "SPDX-License-Identifier" token
so compliance tools recognize the license header; locate the existing header
comment and change the token text only, preserving the same license value and
surrounding comment markers.
docs/state/memory.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix SPDX tag spelling to keep licence scanning reliable.

The header must use SPDX-License-Identifier (not SPDX-Licence-Identifier) for standards-compliant tooling.
Proposed fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/state/memory.md` at line 1, The file contains a misspelled SPDX header
"SPDX-Licence-Identifier" which breaks license scanners; locate the header
string "SPDX-Licence-Identifier" in docs/state/memory.md and replace it with the
correct token "SPDX-License-Identifier" so tooling recognizes the license.
docs/state/identity.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the exact SPDX header token (SPDX-License-Identifier).

SPDX-Licence-Identifier is not a valid SPDX tag. Automated licence tooling usually requires SPDX-License-Identifier exactly, so this may be missed by scanners.
Proposed fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/state/identity.md` at line 1, Replace the incorrect SPDX header token
`SPDX-Licence-Identifier` with the exact required token
`SPDX-License-Identifier` in the file (update the header comment so automated
license tooling recognizes it).
docs/state/store.md-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Standardise SPDX header spelling in this file as well.

Use SPDX-License-Identifier exactly to ensure licence scanners recognise the tag.
Proposed fix
-
+
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/state/store.md` at line 1, Replace the incorrect SPDX header text "" with the standard tag "" so licence scanners recognise it; update
the header in docs/state/store.md (look for the existing comment string) to use
"License" instead of "Licence".
go/bench/bench.go-395-411 (1)
395-411: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

normalizeConfig skips CachePrompt fallback for zero config.

When configZero(cfg) is true, Line 398 returns immediately, so CachePrompt never inherits Prompt. That can drive prompt-cache benches with an empty cache prompt.
Proposed fix
 func normalizeConfig(cfg Config) Config {
 	def := DefaultConfig()
 	if configZero(cfg) {
-		return def
+		cfg = def
 	}
 	if cfg.Prompt == "" {
 		cfg.Prompt = def.Prompt
 	}
@@
 	if cfg.CachePrompt == "" {
 		cfg.CachePrompt = cfg.Prompt
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/bench/bench.go` around lines 395 - 411, normalizeConfig currently returns
early when configZero(cfg) is true, which prevents CachePrompt from inheriting
Prompt and leaves CachePrompt empty; modify normalizeConfig (function
normalizeConfig and call to configZero) so that even when configZero(cfg) is
true you still set cfg.CachePrompt = cfg.Prompt (or the default prompt) before
returning: call DefaultConfig() as def, and if configZero(cfg) is true assign
cfg.Prompt = def.Prompt and cfg.CachePrompt = cfg.Prompt (or set CachePrompt to
def.Prompt if you prefer) then return cfg; ensure other fallbacks (MaxTokens,
Runs) remain unchanged.
go/dataset.go-172-174 (1)
172-174: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Evaluator.Evaluate should return core.Result, not error.

This public failure-capable production contract currently exposes (*EvalReport, error), which breaks the package-wide core.Result error-handling model.

As per coding guidelines, "Public production functions that can fail must return core.Result; callers must branch on r.OKand user.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/dataset.go` around lines 172 - 174, The Evaluator interface currently
exposes Evaluate(ctx context.Context, dataset DatasetStream, cfg EvalConfig)
(*EvalReport, error) which violates the package-wide error-handling model;
change the signature of Evaluator.Evaluate to return core.Result where
Result.Value will hold the *EvalReport on success. Update the Evaluator
interface declaration (Evaluator.Evaluate), all concrete implementations, and
every call site to construct and return core.Result (set r.OK/r.Value on success
or r with error information on failure) and to branch on r.OK before using
r.Value; ensure types remain DatasetStream, EvalConfig and that the returned
Value wraps *EvalReport.
go/decode/example_test.go-9-32 (1)
9-32: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Replace placeholder examples with real API usage examples.

These examples currently print labels only; they should exercise the corresponding public decode symbols so the example file documents actual behaviour.

As per coding guidelines, "Public symbols in .gomust have triplet tests in_test.goand usage examples in_example_test.go``".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/decode/example_test.go` around lines 9 - 32, The file's example functions
currently only print labels; replace each placeholder Example*
(ExampleSpeculative, ExamplePromptLookup, ExampleTokenEqual, ExampleTokensText,
ExampleCloneTokens) with real usage that calls the corresponding public decode
API (construct necessary inputs, call the public functions/types from package
decode such as Speculative, PromptLookup, TokenEqual, TokensText, CloneTokens)
and print their real return values or observable effects; ensure each example
imports and uses the actual symbols, exercises typical inputs, and includes the
correct "// Output: ..." comment showing the expected output so `go test`
example checks pass.
go/dataset_test.go-5-8 (1)
5-8: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Use testify assertions and assert.InDelta for float checks.

This file currently relies on custom checks and direct float equality; please switch to require/assert, and use assert.InDelta for floating-point comparisons.

As per coding guidelines, "**/*_test.go: Use testify/assert for general checks and testify/require for preconditions in tests" and "Use assert.InDelta for float comparisons in tests".

Also applies to: 130-133
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/dataset_test.go` around lines 5 - 8, Replace custom checks and direct
float equality in dataset_test.go tests with testify; add imports
"github.com/stretchr/testify/assert" and "github.com/stretchr/testify/require"
to the import block, use require.* for test preconditions (e.g., setup results)
and assert.InDelta for all floating-point comparisons instead of == or manual
delta checks; update assertions inside the test functions (look for any float
comparisons and calls like t.Fatal/t.Errorf or manual if checks) to use
assert.InDelta(actual, expected, delta) and require.NoError/require.NotNil where
appropriate.
go/decode/decode.go-115-116 (1)
115-116: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Public decode APIs should follow core.Result failure semantics.

Speculative and PromptLookup are public production entry points that can fail, but currently return Go error tuples instead of core.Result.

As per coding guidelines, "Public production functions that can fail must return core.Result; callers must branch on r.OKand user.Value only after success".

Also applies to: 163-164
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/decode/decode.go` around lines 115 - 116, The public functions Speculative
and PromptLookup should return core.Result instead of an (Result, error) tuple;
change their signatures to return core.Result, remove error returns, and convert
all failure paths to return a core.Result with OK=false and a suitable error
message in Value (or other agreed failure encoding), preserving existing Result
semantics (callers must check r.OK). Update any internal helpers called (e.g.,
places returning errors inside Speculative and PromptLookup) to produce
core.Result on failure or propagate errors by wrapping them into a failure
core.Result; ensure successful paths return core.Result with OK=true and Value
populated. Finally, update all callers/tests to branch on r.OK and access
r.Value only after success.
go/decode/decode_test.go-12-49 (1)
12-49: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Avoid direct float equality in tests and migrate to testify assertions.

Line 40 compares AcceptanceRate with direct equality; use assert.InDelta, with require/assert replacing manual t.Fatalf checks in this file.

As per coding guidelines, "**/*_test.go: Use testify/assert for general checks and testify/require for preconditions in tests" and "Use assert.InDelta for float comparisons in tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/decode/decode_test.go` around lines 12 - 49, Update the
TestSpeculative_AcceptsAndRejectsDraftTokens_Good test to use testify's
require/assert helpers: replace the precondition checks (e.g., err != nil and
Mode checks) with require.* (require.NoError, require.Equal) and replace plain
t.Fatalf comparisons with assert.* where appropriate; specifically use
assert.InDelta to compare result.Metrics.AcceptanceRate to 2.0/3.0 instead of
direct float equality, and convert the other t.Fatalf checks to require or
assert calls (for example require.Equal for exact matches and assert.NotZero for
durations). Keep the same semantics and reference symbols result.Metrics,
AcceptanceRate, TestSpeculative_AcceptsAndRejectsDraftTokens_Good, targetCalls,
and draftCalls when making the replacements.
go/contracts_test.go-77-225 (1)
77-225: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Adopt testify/assert and testify/require in these tests.

The file currently uses custom check* helpers throughout; please switch to require for preconditions and assert for behavioural checks to match the test contract consistently.

As per coding guidelines, "**/*_test.go: Use testify/assert for general checks and testify/require for preconditions in tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/contracts_test.go` around lines 77 - 225, Replace the custom check*
helpers with testify's require and assert in the tests: import
"github.com/stretchr/testify/require" and "github.com/stretchr/testify/assert",
use require.* for preconditions (e.g., replace checkNoError, checkTrue before
continuing and type assertion checks in TestContracts_OptionalInterfaces,
TestContracts_CacheService, TestContracts_EmbeddingAndRerank,
TestContracts_Parsers, TestContracts_ModelPackInspector,
TestContracts_AgentMemorySession) and use assert.* for behavioural/verifying
checks (e.g., replace checkEqual, checkLen, checkNotNil where the test can
continue after a failure) and update calls that reference the helper names
(checkTrue, checkNoError, checkEqual, checkLen, checkNotNil) accordingly; ensure
each test uses require at the top for mandatory conditions (type assertions and
returned err checks) and assert for subsequent assertions.
go/eval/eval.go-161-217 (1)
161-217: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Align failing public API and error text contract with project rules.

RunDataset is public and can fail, but returns error instead of core.Result, and several messages use mlx: / mixed case (EvaluateBatch, BuildBatches, LoRA). Please adapt this API to return core.Result and normalise messages to the required inference: lowercase message without trailing period pattern.

As per coding guidelines, "Public production functions that can fail must return core.Result" and "Error strings must use the format: fmt.Errorf(\"inference: lowercase message without trailing period\")".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/eval/eval.go` around lines 161 - 217, RunDataset is a public function that
returns error and emits non-conforming messages; change its return type to
core.Result (and adjust callers) and replace all core.NewError(...) returns with
fmt.Errorf(...) using the project's error format "inference: lowercase message
without trailing period"; also normalize all literal message text (e.g.,
loader/adapter/runner checks for EvaluateBatch, BuildBatches, LoadAdapter and
dataset warnings) to lowercase phrases starting with "inference:" (no trailing
periods) and ensure the function still sets report fields (Report, ModelInfo,
Adapter) before returning appropriate core.Result success values.
go/gguf_test.go-22-49 (1)
22-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use testify/assert and testify/require instead of custom check* helpers in tests.

These tests currently rely on custom helpers, but project test policy requires testify/assert for assertions and testify/require for preconditions.

As per coding guidelines, "Use testify/assert for general checks and testify/require for preconditions in tests".

Also applies to: 60-87
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/gguf_test.go` around lines 22 - 49, Replace the custom test helpers in
TestGGUF_ReadGGUFInfo_Bad and TestGGUF_DiscoverModels_Ugly with testify's
require/assert: import "github.com/stretchr/testify/require" and
"github.com/stretchr/testify/assert", use require.NoError(t, err) /
require.Error(t, err) and require.Equal(t, GGUFInfo{}, info) for preconditions,
and use assert.Len(t, models, 1) and assert.Equal(t, path, models[0].Path) /
assert.Equal(t, "gemma4_text", models[0].ModelType) / assert.Equal(t, "gguf",
models[0].Format) for non-fatal checks; replace checkNoError, checkError,
checkEqual, checkLen calls accordingly while keeping existing helpers like
writeMinimalGGUFAt, DiscoverModels, and ReadGGUFInfo unchanged.
go/ollama/ollama.go-111-113 (1)
111-113: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass Ollama's -1 and -2 semantics through to the inference layer.

GenerateOptions() silently drops non-positive NumPredict values. Since Ollama's API uses -1 for infinite generation and -2 to fill context, these should be forwarded to inference.WithMaxTokens() instead of ignored. Currently, NumPredict ≤ 0 falls through without any option, causing the inference layer to use its default 256-token limit—silently losing user intent.

The WithMaxTokens() function accepts negative values; pass all NumPredict values through.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/ollama/ollama.go` around lines 111 - 113, Update GenerateOptions so it
forwards all NumPredict values (including non-positive ones) to
inference.WithMaxTokens instead of only when NumPredict > 0: remove the
conditional that drops NumPredict ≤ 0 and always append
inference.WithMaxTokens(options.NumPredict) to opts (referencing
GenerateOptions, options.NumPredict, and inference.WithMaxTokens). This
preserves Ollama semantics for -1 and -2 by passing them through to the
inference layer.
go/openai/services.go-369-390 (1)
369-390: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Bound request-body reads in decodeServiceRequest.

This currently performs an unbounded io.ReadAll, which can be abused to consume memory. Add a max body size via http.MaxBytesReader (at handler entry) or io.LimitReader before reading.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/openai/services.go` around lines 369 - 390, The decodeServiceRequest
function uses an unbounded io.ReadAll on r.Body; change this to enforce a
maximum request size by either wrapping the request body with
http.MaxBytesReader at handler entry or by using io.LimitReader inside
decodeServiceRequest (e.g., wrap r.Body with io.LimitReader(r.Body,
MaxRequestBodySize) and define a shared MaxRequestBodySize constant). If the
read fails due to exceeding the limit, return an appropriate error (use
http.StatusRequestEntityTooLarge / 413) and preserve the existing JSON
validation/error handling (resultError, writeError) paths so oversized bodies
are rejected safely.
go/openai/openai.go-152-166 (1)
152-166: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a request-body size cap before decoding JSON.

io.ReadAll on an unbounded body allows a client to force large memory allocations. Wrap r.Body/body with http.MaxBytesReader (handler path) or io.LimitReader before reading.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/openai/openai.go` around lines 152 - 166, The DecodeRequest function
currently uses io.ReadAll(body) which can allocate unbounded memory; wrap the
incoming body with a size limiter before reading (e.g., replace io.ReadAll(body)
with io.ReadAll(io.LimitReader(body, maxBodySize))) where maxBodySize is a
reasonable constant (e.g., 1<<20 for 1 MiB) and return an explicit error if the
read hits the limit (or detect overflow by attempting to read one extra byte);
update DecodeRequest and keep calls to core.JSONUnmarshalString and resultError
unchanged so the function returns a clear "request body too large" error instead
of allowing unbounded allocation.
go/openai/responses.go-89-104 (1)
89-104: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Align ResponseGenerateOptions with the public core.Result failure contract.

This public fallible function returns error directly instead of the repository-wide core.Result pattern required by the coding guidelines for public production functions that can fail.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/openai/responses.go` around lines 89 - 104, Change ResponseGenerateOptions
to follow the repository-wide core.Result failure contract: update its return
type to ([]inference.GenerateOption, core.Result), call GenerateOptions(chatReq)
as before but capture its (opts, err) and convert any non-nil err into a
core.Result failure (e.g., core.ResultFromError or core.NewFailure) while
returning opts and core.Result for success; update all callers of
ResponseGenerateOptions to handle core.Result instead of an error.
go/openai/openai.go-151-211 (1)
151-211: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Convert public adapter functions to return core.Result.

DecodeRequest, ValidateRequest, GenerateOptions, and NormalizeStopSequences are public fallible functions that should return core.Result rather than raw error, following the pattern established in the coding guidelines. The exception for error values applies only to backend interface method implementations; these are utility functions that adapt wire formats and should follow the standard public API contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/openai/openai.go` around lines 151 - 211, Change the public adapter
functions DecodeRequest, ValidateRequest, GenerateOptions (and
NormalizeStopSequences) to return core.Result instead of raw error (and in
DecodeRequest the success value should be carried in the Result). For each
failure path currently returning an error, wrap the failure in core.E (or the
project's standard core error/result constructor) and return that core.Result;
for successful returns, return a core.Result representing success and containing
the value (e.g., the decoded ChatCompletionRequest or slice of
inference.GenerateOption). Update all call sites to handle core.Result (check
.OK / extract value) accordingly. Ensure function signatures and doc comments
are updated to reflect core.Result usage.
go/parser/markers.go-13-16 (1)
13-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gemma channel end marker appears malformed and will miss valid closes.

On Line 13–Line 16, the end delimiter is "<channel|>", but the channel format here is "<|channel>...". This likely prevents reasoning blocks from closing for those variants.
Proposed fix
-		{start: "<|channel>thought\n", ends: []string{"<channel|>"}, kind: "thinking"},
-		{start: "<|channel>thinking\n", ends: []string{"<channel|>"}, kind: "thinking"},
-		{start: "<|channel>reasoning\n", ends: []string{"<channel|>"}, kind: "reasoning"},
-		{start: "<|channel>analysis\n", ends: []string{"<channel|>"}, kind: "analysis"},
+		{start: "<|channel>thought\n", ends: []string{"<|channel|>"}, kind: "thinking"},
+		{start: "<|channel>thinking\n", ends: []string{"<|channel|>"}, kind: "thinking"},
+		{start: "<|channel>reasoning\n", ends: []string{"<|channel|>"}, kind: "reasoning"},
+		{start: "<|channel>analysis\n", ends: []string{"<|channel|>"}, kind: "analysis"},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/parser/markers.go` around lines 13 - 16, The end delimiter for the Gemma
"channel" markers is malformed ("<channel|>") and will fail to match the
corresponding start token "<|channel>..."; update the ends value for the entries
whose start is "<|channel>thought\n", "<|channel>thinking\n",
"<|channel>reasoning\n", and "<|channel>analysis\n" to the correct closing
string "<|channel|>" so those reasoning/thinking markers properly close.
go/parser/reasoning.go-28-35 (1)
28-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reasoning token offsets are shifted to the marker, not the reasoning payload.

On Line 28 and Line 34, StartToken uses tokenOffset before accounting for len(marker.start), so offsets for extracted reasoning are misaligned.
Proposed fix
-		if end < 0 {
+		segmentStart := tokenOffset + len(marker.start)
+		if end < 0 {
 			reasoning := trimReasoningText(afterStart)
 			if reasoning != "" {
-				segments = append(segments, inference.ReasoningSegment{Kind: marker.kind, Text: reasoning, StartToken: tokenOffset})
+				segments = append(segments, inference.ReasoningSegment{
+					Kind: marker.kind, Text: reasoning, StartToken: segmentStart,
+				})
 			}
 			break
 		}
 		reasoning := trimReasoningText(afterStart[:end])
 		if reasoning != "" {
-			segments = append(segments, inference.ReasoningSegment{Kind: marker.kind, Text: reasoning, StartToken: tokenOffset, EndToken: tokenOffset + end})
+			segments = append(segments, inference.ReasoningSegment{
+				Kind: marker.kind, Text: reasoning, StartToken: segmentStart, EndToken: segmentStart + end,
+			})
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/parser/reasoning.go` around lines 28 - 35, The StartToken/EndToken for
appended inference.ReasoningSegment are currently using tokenOffset (in the
segments append lines around the uses of marker.kind and
trimReasoningText(afterStart[:end])) which ignores the marker.start length;
adjust StartToken to tokenOffset + len(marker.start) and EndToken to tokenOffset
+ len(marker.start) + end (or omit EndToken when using the first append) so the
offsets point to the reasoning payload rather than the marker; update both
places where segments = append(inference.ReasoningSegment{... StartToken: ...})
are constructed (the block using reasoning :=
trimReasoningText(afterStart[:end]) and the earlier append) to use these
corrected calculations.
go/quant/codebook/codebook.go-69-282 (1)
69-282: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Align public failing APIs with core.Result contract.

Several public production functions in this file return Go error directly (ParseProfile, ReadProfile, NewTensorDescriptor, ValidateProfile, ValidateTensorDescriptor, MatVec, ValidateTensorPayload). The repository contract requires core.Result for public failure paths outside backend-local interfaces.

As per coding guidelines, "**/*.go: Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/quant/codebook/codebook.go` around lines 69 - 282, Change the public
functions ParseProfile, ReadProfile, NewTensorDescriptor, ValidateProfile,
ValidateTensorDescriptor, MatVec, and ValidateTensorPayload to return
core.Result instead of Go error/tuple results; ensure all success returns set
Result.OK = true and Result.Value to the successful value (e.g., *Profile for
ParseProfile/ReadProfile, TensorDescriptor for NewTensorDescriptor, []float32
for MatVec, or nil for pure validators) and all failure paths return Result.OK =
false with Result.Value containing the error object; update internal returns in
each function (e.g., returns that currently do "return nil, err" or "return err"
should become "return core.Result{OK:false, Value:err}") and success returns
(e.g., "return &profile, nil") should become "return core.Result{OK:true,
Value:&profile}" so callers can branch on r.OK and use r.Value only after
success.
go/probe.go-160-192 (1)
160-192: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Protect ProbeBus.sinks against concurrent Add/EmitProbe access.

ProbeBus mutates and iterates the same slice without synchronisation. If Add and EmitProbe run concurrently, this introduces a data race and unstable fan-out behaviour.
Suggested fix
 import (
+    "sync"
 )

 type ProbeBus struct {
+    mu    sync.RWMutex
     sinks []ProbeSink
 }

 func (b *ProbeBus) Add(sink ProbeSink) {
     if b == nil || sink == nil {
         return
     }
+    b.mu.Lock()
+    defer b.mu.Unlock()
     b.sinks = append(b.sinks, sink)
 }

 func (b *ProbeBus) EmitProbe(event ProbeEvent) {
     if b == nil {
         return
     }
-    for _, sink := range b.sinks {
+    b.mu.RLock()
+    sinks := append([]ProbeSink(nil), b.sinks...)
+    b.mu.RUnlock()
+    for _, sink := range sinks {
         if sink == nil {
             continue
         }
         sink.EmitProbe(event)
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/probe.go` around lines 160 - 192, ProbeBus currently mutates and iterates
the sinks slice without synchronization, causing data races when Add and
EmitProbe run concurrently; add a sync.RWMutex field (e.g., mu) to ProbeBus,
take mu.Lock() in Add when appending to sinks, and in EmitProbe take mu.RLock(),
copy the b.sinks slice to a local variable, release the RLock, then iterate over
the copied slice calling sink.EmitProbe(event) so iteration happens without
holding the lock and avoids races and potential deadlocks; update references to
ProbeBus.sinks usage in NewProbeBus, Add, and EmitProbe accordingly.
go/scheduler/scheduler.go-139-152 (1)
139-152: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve caller context when delegating cancellation to the base model.

CancelRequest currently discards the incoming context and always calls the base with context.Background(). That can ignore deadlines/cancellation from upstream control paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/scheduler/scheduler.go` around lines 139 - 152, CancelRequest is
discarding the incoming ctx by calling the base model with context.Background();
preserve and forward the provided context when delegating to the underlying
model. In the CancelRequest method, when checking m.base for
inference.CancellableModel and calling cancellable.CancelRequest, pass the
original ctx parameter instead of context.Background() so upstream
deadlines/cancellations propagate to the base model.
go/scheduler/scheduler.go-107-119 (1)
107-119: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject duplicate active request IDs before registration.

A caller-supplied ID can overwrite an existing active[id] entry, which breaks cancellation/visibility for the earlier job. Validate uniqueness (while active) and return an error on collision before register.

Also applies to: 348-352
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/scheduler/scheduler.go` around lines 107 - 119, The code currently assigns
a caller-supplied req.ID and then immediately registers the job with m.register,
which allows a new request to overwrite an existing active entry; before calling
m.register (both at this site and the other occurrence around lines 348–352),
check whether core.Trim(req.ID) != "" and if so verify uniqueness against the
scheduler's active map (e.g. m.active or the method that looks up active jobs)
under the same mutex used by register; if a collision exists return an error to
the caller instead of proceeding, otherwise proceed to set req.ID (or
m.nextRequestID()) and then call m.register so existing active jobs cannot be
overwritten.
go/split.go-166-244 (1)
166-244: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Use core.Result for public failure paths in split planning APIs.

PlanModelSlice and ValidateSplitInferencePlan currently return error, which diverges from the repository’s public error-handling contract. Please adapt these to core.Result-based returns for consistency with caller branching expectations.

As per coding guidelines, "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/split.go` around lines 166 - 244, Both PlanModelSlice and
ValidateSplitInferencePlan must use core.Result for public failure paths instead
of returning error; change their signatures to return core.Result (e.g.,
core.Result[ModelSlicePlan] or the project’s Result pattern) and convert every
early error return to construct and return a failing core.Result, while
successful paths return a successful core.Result wrapping the value. Update all
places inside PlanModelSlice (including errors from modelSlicePresetComponents
and the custom-components check) to return a failing core.Result with the
original error, and in ValidateSplitInferencePlan convert each
core.NewError/core.Errorf return into a failing core.Result; ensure the final
success path returns an OK core.Result. Also update callers to branch on r.OK
and use r.Value only on success.
go/scheduler/scheduler.go-62-92 (1)
62-92: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Worker goroutines have no shutdown path and can leak past Close().

Workers block on m.queue forever, and Close() only delegates to m.base.Close() without stopping worker loops. Add scheduler-owned shutdown (for example: done channel + sync.WaitGroup, and close/cancel on Close) so repeated lifecycle usage does not leak goroutines.

Also applies to: 273-278, 292-296
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/scheduler/scheduler.go` around lines 62 - 92, The worker goroutines
started in New leak because they block forever on m.queue and there is no
scheduler-owned shutdown; fix by adding shutdown primitives to Model (e.g., add
fields done chan struct{} and wg sync.WaitGroup), change the worker spawn loop
to a proper for i := 0; i < maxConcurrent; i++ and for each goroutine call
m.wg.Add(1) then go m.worker(i) where worker returns and calls m.wg.Done();
modify worker to select on m.queue and case <-m.done to exit cleanly; update
Close (method Close on Model) to signal shutdown by closing or cancelling m.done
and then calling m.wg.Wait() before delegating to m.base.Close(); ensure no
double-close races (create done once in New).
go/quant/jang/jang.go-125-337 (1)
125-337: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Failing public APIs should return core.Result instead of raw error.

ReadConfig, ParseConfig, NewPackedTensorDescriptor, ValidatePackedTensor, DequantizePackedTensor, and PackQuantizedValues expose error in public production paths. This diverges from the project’s required failure contract and makes caller handling inconsistent with the rest of the stack.

As per coding guidelines, "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/quant/jang/jang.go` around lines 125 - 337, Change the listed public
functions to return core.Result instead of raw error or (*T, error): update
signatures for ReadConfig, ParseConfig, NewPackedTensorDescriptor,
ValidatePackedTensor, DequantizePackedTensor, and PackQuantizedValues to return
a core.Result whose OK flag is true on success and Value contains the success
payload (e.g. *Info, *PackedTensorDescriptor, []float32, []byte) and false on
failure with Value holding the error; replace all direct returns of nil, err or
value, nil with appropriate core.Result values and update their callers (e.g.
finalize, BuildPackedProfile, callers of
ParseConfig/ReadConfig/NewPackedTensorDescriptor/etc.) to branch on r.OK and use
r.Value after success.
go/state/memory.go-37-50 (1)
37-50: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Manifest URIs are dropped, breaking ResolveURI for preloaded refs.

NewInMemoryStoreWithManifest stores refs but does not populate uris from ref.URI, so any URI supplied via manifest cannot be resolved later.
Suggested fix
 func NewInMemoryStoreWithManifest(chunks map[int]string, refs map[int]ChunkRef) *InMemoryStore {
 	copyMap := make(map[int]string, len(chunks))
 	nextID := 1
@@
 	refMap := make(map[int]ChunkRef, len(copyMap))
+	uriMap := make(map[string]int)
 	for id := range copyMap {
 		refMap[id] = ChunkRef{
 			ChunkID:        id,
 			FrameOffset:    uint64(id),
 			HasFrameOffset: true,
 			Codec:          CodecMemory,
 		}
 	}
 	for id, ref := range refs {
 		ref.ChunkID = id
 		refMap[id] = ref
+		if ref.URI != "" {
+			uriMap[ref.URI] = id
+		}
 		if id >= nextID {
 			nextID = id + 1
 		}
 	}
 	return &InMemoryStore{
 		chunks: copyMap,
 		data:   make(map[int][]byte),
 		refs:   refMap,
-		uris:   make(map[string]int),
+		uris:   uriMap,
 		nextID: nextID,
 	}
 }
Also applies to: 118-135
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/memory.go` around lines 37 - 50, NewInMemoryStoreWithManifest (and
the similar constructor around lines 118-135) populates refs but never fills the
InMemoryStore.uris map from each ref's URI, so ResolveURI cannot find preloaded
manifest URIs; update both constructors to, when iterating refs (the loop that
assigns ref.ChunkID and refMap[id] = ref), also check ref.URI and if non-empty
set uris[ref.URI] = id (ensuring you use the same uris map instance returned in
the struct literal), so that InMemoryStore.ResolveURI can resolve
manifest-supplied URIs for those refs.
go/state/filestore/store.go-56-111 (1)

56-111: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Public failing API methods should return core.Result, not (..., error).

The exported production surface here still returns Go error pairs across create/open/read/write paths. That diverges from the package-level result contract and makes call-site handling inconsistent.

As per coding guidelines, **/*.go: "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".

Also applies to: 129-149, 151-193, 195-329
go/state/project_seed.go-105-117 (1)
105-117: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

project_id is not guaranteed when labels/metadata are initially nil.

setProjectLabel exits on nil maps, so WakeRequest and continuation SleepRequest can silently miss project_id when no labels are supplied.
Suggested fix
-func setProjectLabel(labels map[string]string, projectID string) {
-	if labels == nil || projectID == "" {
-		return
-	}
+func setProjectLabel(labels map[string]string, projectID string) map[string]string {
+	if projectID == "" {
+		return labels
+	}
+	if labels == nil {
+		labels = make(map[string]string, 1)
+	}
 	if labels["project_id"] == "" {
 		labels["project_id"] = projectID
 	}
+	return labels
 }
-	labels := mergeStringMaps(s.Labels, opts.Labels)
-	setProjectLabel(labels, s.ProjectID)
+	labels := mergeStringMaps(s.Labels, opts.Labels)
+	labels = setProjectLabel(labels, s.ProjectID)
@@
-	metadata := mergeStringMaps(s.Metadata, opts.Metadata)
-	setProjectLabel(metadata, s.ProjectID)
-	labels := mergeStringMaps(s.Labels, opts.Labels)
-	setProjectLabel(labels, s.ProjectID)
+	metadata := mergeStringMaps(s.Metadata, opts.Metadata)
+	metadata = setProjectLabel(metadata, s.ProjectID)
+	labels := mergeStringMaps(s.Labels, opts.Labels)
+	labels = setProjectLabel(labels, s.ProjectID)
Also applies to: 157-160, 291-297
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/project_seed.go` around lines 105 - 117, The WakeRequest builder can
call setProjectLabel on a nil map (via mergeStringMaps when s.Labels and
opts.Labels are nil), causing project_id to be omitted; fix by ensuring the
labels map is initialized before setProjectLabel is called (e.g., after labels
:= mergeStringMaps(...), if labels == nil { labels = make(map[string]string) }),
and apply the same nil-check/initialization pattern where setProjectLabel is
used elsewhere (the analogous constructors/methods referenced in the review such
as the SleepRequest builder and the other occurrences that call
setProjectLabel).
go/state/filestore/store_test.go-18-23 (1)
18-23: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Align assertions with testify/require + testify/assert.

Please migrate these table and non-table checks to require for setup/preconditions and assert for expected outcomes to keep test style consistent with repo rules.

As per coding guidelines, **/*_test.go: "Use testify/assert for general checks and testify/require for preconditions in tests".

Also applies to: 26-35, 41-46, 49-69, 82-97, 104-143, 150-189, 196-223, 228-381
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/filestore/store_test.go` around lines 18 - 23, Replace t.Fatalf
precondition checks with testify/require and move expected-outcome checks to
testify/assert: for the Create() error check use require.NoError(t, err)
(precondition) and for the Path() comparison use assert.Equal(t, path,
store.Path()) (expected outcome). Import "github.com/stretchr/testify/require"
and "github.com/stretchr/testify/assert" and apply the same pattern to the other
test assertions in this file (lines covering the ranges mentioned) — use
require.* for setup/preconditions (e.g., error/creation checks) and assert.* for
equality/behavior checks (e.g., Path(), Read/Write results).
go/split_test.go-14-23 (1)
14-23: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Adopt testify/assert and testify/require in these tests.

These checks currently use custom helpers; please switch preconditions to require.* and behavioural assertions to assert.* to match the repo test contract.

As per coding guidelines, **/*_test.go: "Use testify/assert for general checks and testify/require for preconditions in tests".

Also applies to: 30-37, 43-49, 58-66, 71-73, 77-89, 94-103
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/split_test.go` around lines 14 - 23, Replace the custom test helpers with
testify's require/assert: treat the error precondition checkNoError(t, err) as
require.NoError(t, err); convert strict preconditions (if any) to require.* and
all behavioral checks to assert.*, e.g. replace checkEqual(t,
ModelSlicePresetClient, plan.Preset) with assert.Equal(t,
ModelSlicePresetClient, plan.Preset), checkTrue/checkFalse with
assert.True/assert.False, and keep require.* only for necessary preconditions
(err); update imports to include "github.com/stretchr/testify/assert" and
"github.com/stretchr/testify/require". Apply the same replacements for the other
indicated blocks (lines 30-37, 43-49, 58-66, 71-73, 77-89, 94-103) referencing
the same symbols (plan.HasComponent, plan.AttentionLocal,
plan.FFNRemoteCandidate, plan.SourcePath, etc.).
go/state/store.go-102-183 (1)
102-183: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Public failing helpers should return core.Result instead of (value, error).

Resolve, ResolveBytes, ResolveRefBytes, and ResolveURI are exported production helpers that can fail, but currently return error pairs. This diverges from the package contract and pushes mixed error semantics onto callers.

As per coding guidelines, "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/store.go` around lines 102 - 183, The four exported helpers Resolve,
ResolveBytes, ResolveRefBytes, and ResolveURI currently return (Chunk, error)
but must return core.Result per package guidelines; change each function
signature to return core.Result, on success return a core.Result with OK:true
and Value set to the Chunk, and on failure return a core.Result with OK:false
and the error (or appropriate error information) set; update all early error
returns inside Resolve, ResolveBytes, ResolveRefBytes, and ResolveURI to
construct and return a core.Result failure, and ensure the final successful
return builds a core.Result success with the Chunk as Value so callers can
branch on r.OK and use r.Value only after success.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e65b2d7f-2704-419b-a644-5c93b5d4f422

📥 Commits

Reviewing files that changed from the base of the PR and between 613cf98 and f0af335.

⛔ Files ignored due to path filters (1)

go.work is excluded by !**/*.work

📒 Files selected for processing (98)

LICENCE
docs/README.md
docs/anthropic/anthropic.md
docs/inference/README.md
docs/inference/capability.md
docs/inference/contracts.md
docs/inference/dataset.md
docs/inference/discover.md
docs/inference/gguf.md
docs/inference/identity.md
docs/inference/inference.md
docs/inference/local_tuning.md
docs/inference/options.md
docs/inference/probe.md
docs/inference/service.md
docs/inference/training.md
docs/ollama/ollama.md
docs/openai/README.md
docs/openai/openai.md
docs/openai/responses.md
docs/openai/services.md
docs/state/README.md
docs/state/agent_memory.md
docs/state/filestore.md
docs/state/identity.md
docs/state/memory.md
docs/state/project_seed.md
docs/state/store.md
external/go
go/anthropic/anthropic.go
go/anthropic/anthropic_test.go
go/bench/bench.go
go/bench/bench_test.go
go/capability.go
go/capability_example_test.go
go/capability_test.go
go/contracts.go
go/contracts_example_test.go
go/contracts_test.go
go/dataset.go
go/dataset_example_test.go
go/dataset_test.go
go/decode/decode.go
go/decode/decode_test.go
go/decode/example_test.go
go/discover.go
go/eval/eval.go
go/gguf.go
go/gguf_test.go
go/identity.go
go/identity_example_test.go
go/identity_test.go
go/ollama/ollama.go
go/ollama/ollama_test.go
go/openai/openai.go
go/openai/openai_test.go
go/openai/responses.go
go/openai/responses_test.go
go/openai/services.go
go/openai/services_test.go
go/parser/builtin.go
go/parser/markers.go
go/parser/reasoning.go
go/parser/reasoning_test.go
go/parser/registry.go
go/parser/registry_test.go
go/parser/selector.go
go/parser/thinking.go
go/parser/thinking_test.go
go/parser/tools.go
go/parser/tools_test.go
go/parser/types.go
go/probe.go
go/probe_example_test.go
go/probe_test.go
go/quant/codebook/codebook.go
go/quant/codebook/codebook_test.go
go/quant/jang/jang.go
go/quant/jang/jang_test.go
go/scheduler/example_test.go
go/scheduler/scheduler.go
go/scheduler/scheduler_test.go
go/service.go
go/service_test.go
go/split.go
go/split_example_test.go
go/split_test.go
go/state/agent_memory.go
go/state/filestore/store.go
go/state/filestore/store_test.go
go/state/identity.go
go/state/memory.go
go/state/project_seed.go
go/state/project_seed_test.go
go/state/state_test.go
go/state/store.go
go/tuning.go
go/tuning_test.go

Co-Authored-By: Virgil <virgil@lethean.io>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

go/bench/bench_test.go (2)
442-442: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Rename test suffix to match _Good/_Bad/_Ugly convention

TestAdapterInfo_IsEmpty_GoodBad does not follow the required suffix scheme. Split or rename it to compliant names (for example, separate _Good and _Bad tests).

As per coding guidelines: "Tests use the _Good/_Bad/_Ugly suffix convention: _Good for happy path, _Bad for expected error conditions, _Ugly for edge cases and surprising-but-valid behaviour".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/bench/bench_test.go` at line 442, Rename or split the test function
TestAdapterInfo_IsEmpty_GoodBad to follow the _Good/_Bad/_Ugly convention:
create two tests such as TestAdapterInfo_IsEmpty_Good (covering the happy-path
assertions for AdapterInfo.IsEmpty) and TestAdapterInfo_IsEmpty_Bad (covering
the expected-failure or invalid input cases), move the relevant assertions from
the existing TestAdapterInfo_IsEmpty_GoodBad into the appropriate new test
functions, and update any test callers or references accordingly.
46-507: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Switch assertions to testify/assert + testify/require

These tests currently use manual t.Fatal/t.Fatalf assertions throughout. Please convert to require for preconditions and assert for behavioural checks to match the project test standard.

As per coding guidelines: "Use testify/assert for general checks and testify/require for preconditions in tests".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/bench/bench_test.go` around lines 46 - 507, Replace manual
t.Fatal/t.Fatalf checks across the tests to use testify/require for
preconditions (e.g., error != nil, nil report, runner setup expectations) and
testify/assert for behavioral assertions (value equality/inequality, slices,
booleans) so tests follow the project standard; add imports for
"github.com/stretchr/testify/assert" and "github.com/stretchr/testify/require"
and in each test (e.g., TestRun_AggregatesGenerationSummary_Good,
TestRun_FallsBackToElapsedWhenTotalDurationZero_Good,
TestRun_RequiresGenerate_Bad, TestRun_PropagatesGenerateError_Bad,
TestRun_NilContextDefaultsToBackground_Good,
TestRun_PopulatesModelInfoFromCallback_Good,
TestRun_DispatchesVerbCallbacksWhenIncludeFlagsSet_Good,
TestRun_SkipsVerbCallbacksWhenIncludeFlagsFalse_Good,
TestRun_QualityChecks_Good, etc.) convert initial preconditions like err==nil or
report==nil to require.NoError/require.NotNil and convert subsequent comparisons
to assert.Equal/assert.True/assert.Len/assert.Contains as appropriate while
preserving the same failure messages/semantics.
go/bench/bench.go (1)
360-415: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Return core.Result from Run instead of (*Report, error)

Run is a public production function that can fail, but it currently returns a Go error pair. This breaks the project API contract and makes caller handling inconsistent with core.Result.

As per coding guidelines: "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/bench/bench.go` around lines 360 - 415, Change Run's signature to return
core.Result instead of (*Report, error); when an error occurs (e.g., missing
runner.Generate or runGeneration error) return a failing core.Result with the
error, and on success return a successful core.Result containing the *Report
value. Update all return sites in Run (the early core.NewError(...) return and
the runGeneration error return, plus the final success return) to construct
core.Result values (OK:false with the error for failures, OK:true with Value:
report for success). Leave the Report construction and population logic
unchanged, and ensure callers of Run now branch on the returned core.Result
(check r.OK before using r.Value).
go/state/filestore/store.go (1)
57-330: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Public failing Store APIs should use core.Result

The exported store surface uses (..., error) across many failing methods (Create, Open, Resolve*, Put*, etc.). That diverges from the repository’s required result contract for production APIs and should be normalised.

As per coding guidelines: "Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/filestore/store.go` around lines 57 - 330, The public store methods
(Create, Open, Path, ChunkCount, Close, Get, Resolve, ResolveURI, Put, PutBytes,
PutBytesStream, ResolveBytes, ResolveRefBytes) must be changed to return
core.Result instead of (..., error): update each function signature to return
core.Result whose Value holds the original successful return (e.g., *Store,
state.Chunk, state.ChunkRef, string, int, or nil) and on failure return a
failing core.Result built from existing error values (use core.E/core.NewError
as before); inside these functions replace direct error returns with
constructing and returning the appropriate core.Result (OK=false) and on success
wrap the return value in a successful core.Result (OK=true/Value=...), and
update any internal helper returns (e.g., resolveLocked, resolveBytesLocked,
rollbackWriteLocked usage sites) accordingly so callers branch on r.OK and use
r.Value.

🧹 Nitpick comments (1)

go/bench/bench_test.go (1)
23-44: ⚡ Quick win

Use shared test stubs instead of introducing newFakeRunner

newFakeRunner adds a parallel local stub pattern. Please reuse the existing stubBackend/stubTextModel test doubles from inference_test.go to keep test fixtures consistent.

As per coding guidelines: "Use existing stubBackend/stubTextModel from inference_test.go for tests rather than creating new stubs".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/bench/bench_test.go` around lines 23 - 44, Replace the local test stub
newFakeRunner by reusing the existing stubBackend/stubTextModel test doubles:
remove newFakeRunner and instead configure stubTextModel (used by stubBackend)
to return the desired generation outputs and errors (match generationText,
generationMetrics, generationError behavior) so the Runner.Generate behavior is
provided by stubTextModel/stubBackend; adapt tests to instantiate stubBackend
and set its underlying stubTextModel fields rather than creating a new Runner
with a custom Generate closure.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@go/bench/bench_test.go`:
- Line 442: Rename or split the test function TestAdapterInfo_IsEmpty_GoodBad to
follow the _Good/_Bad/_Ugly convention: create two tests such as
TestAdapterInfo_IsEmpty_Good (covering the happy-path assertions for
AdapterInfo.IsEmpty) and TestAdapterInfo_IsEmpty_Bad (covering the
expected-failure or invalid input cases), move the relevant assertions from the
existing TestAdapterInfo_IsEmpty_GoodBad into the appropriate new test
functions, and update any test callers or references accordingly.
- Around line 46-507: Replace manual t.Fatal/t.Fatalf checks across the tests to
use testify/require for preconditions (e.g., error != nil, nil report, runner
setup expectations) and testify/assert for behavioral assertions (value
equality/inequality, slices, booleans) so tests follow the project standard; add
imports for "github.com/stretchr/testify/assert" and
"github.com/stretchr/testify/require" and in each test (e.g.,
TestRun_AggregatesGenerationSummary_Good,
TestRun_FallsBackToElapsedWhenTotalDurationZero_Good,
TestRun_RequiresGenerate_Bad, TestRun_PropagatesGenerateError_Bad,
TestRun_NilContextDefaultsToBackground_Good,
TestRun_PopulatesModelInfoFromCallback_Good,
TestRun_DispatchesVerbCallbacksWhenIncludeFlagsSet_Good,
TestRun_SkipsVerbCallbacksWhenIncludeFlagsFalse_Good,
TestRun_QualityChecks_Good, etc.) convert initial preconditions like err==nil or
report==nil to require.NoError/require.NotNil and convert subsequent comparisons
to assert.Equal/assert.True/assert.Len/assert.Contains as appropriate while
preserving the same failure messages/semantics.

In `@go/bench/bench.go`:
- Around line 360-415: Change Run's signature to return core.Result instead of
(*Report, error); when an error occurs (e.g., missing runner.Generate or
runGeneration error) return a failing core.Result with the error, and on success
return a successful core.Result containing the *Report value. Update all return
sites in Run (the early core.NewError(...) return and the runGeneration error
return, plus the final success return) to construct core.Result values (OK:false
with the error for failures, OK:true with Value: report for success). Leave the
Report construction and population logic unchanged, and ensure callers of Run
now branch on the returned core.Result (check r.OK before using r.Value).

In `@go/state/filestore/store.go`:
- Around line 57-330: The public store methods (Create, Open, Path, ChunkCount,
Close, Get, Resolve, ResolveURI, Put, PutBytes, PutBytesStream, ResolveBytes,
ResolveRefBytes) must be changed to return core.Result instead of (..., error):
update each function signature to return core.Result whose Value holds the
original successful return (e.g., *Store, state.Chunk, state.ChunkRef, string,
int, or nil) and on failure return a failing core.Result built from existing
error values (use core.E/core.NewError as before); inside these functions
replace direct error returns with constructing and returning the appropriate
core.Result (OK=false) and on success wrap the return value in a successful
core.Result (OK=true/Value=...), and update any internal helper returns (e.g.,
resolveLocked, resolveBytesLocked, rollbackWriteLocked usage sites) accordingly
so callers branch on r.OK and use r.Value.

---

Nitpick comments:
In `@go/bench/bench_test.go`:
- Around line 23-44: Replace the local test stub newFakeRunner by reusing the
existing stubBackend/stubTextModel test doubles: remove newFakeRunner and
instead configure stubTextModel (used by stubBackend) to return the desired
generation outputs and errors (match generationText, generationMetrics,
generationError behavior) so the Runner.Generate behavior is provided by
stubTextModel/stubBackend; adapt tests to instantiate stubBackend and set its
underlying stubTextModel fields rather than creating a new Runner with a custom
Generate closure.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ad806ee-04a8-4ec8-ae4b-4002efc1f092

📥 Commits

Reviewing files that changed from the base of the PR and between f0af335 and feb256a.

📒 Files selected for processing (3)

go/bench/bench.go
go/bench/bench_test.go
go/state/filestore/store.go

Co-Authored-By: Virgil <virgil@lethean.io>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

go/state/identity.go (1)
58-58: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use UK spelling in comments (serialisable).

Please change “serializable” to “serialisable” for guideline consistency.

As per coding guidelines, Use UK English spelling (colour, organisation, serialise, licence) in code and comments.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/identity.go` at line 58, Update the comment above the SamplerConfig
declaration to use UK English spelling: replace the word "serializable" with
"serialisable" in the comment describing SamplerConfig so it reads
"SamplerConfig is the serialisable form of generation sampler settings." Ensure
the change targets the comment tied to the SamplerConfig symbol in
go/state/identity.go.
go/state/store.go (2)
105-186: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Public fallible helpers should return core.Result instead of (value, error).

Resolve, ResolveBytes, ResolveRefBytes, and ResolveURI are exported production functions that can fail, but they currently return Go error pairs rather than core.Result.

As per coding guidelines, Public production functions that can fail must return core.Result; callers must branch on r.OK and use r.Value only after success.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/store.go` around lines 105 - 186, The four exported helpers Resolve,
ResolveBytes, ResolveRefBytes and ResolveURI must return core.Result instead of
(Chunk, error): change their signatures to return core.Result and update all
return sites to produce a result with Ok true and Value set to the Chunk on
success, or Ok false with Error set to the appropriate error (e.g.
&ChunkNotFoundError{...}, &URIChunkNotFoundError{...}, or wrapped errors from
resolver.Resolve / resolver.ResolveBytes / resolver.ResolveRefBytes); keep the
same nil-context handling and interface checks for Resolver, BinaryResolver,
RefBinaryResolver and URIResolver and convert their error returns into failure
core.Result values. Ensure callers are updated to inspect r.OK and use r.Value
only on success.
13-99: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Standardise error messages to the required inference: format.

These error strings do not follow the required message format contract; please prefix and keep the message lowercase with no trailing period.

As per coding guidelines, Error strings must use the format: fmt.Errorf("inference: lowercase message without trailing period").
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/state/store.go` around lines 13 - 99, Update the error messages to the
required "inference:" format: change ErrChunkNotFound to use
core.NewError("inference: state chunk not found"); update
ChunkNotFoundError.Error() to return core.Sprintf("inference: state chunk %d not
found", e.ID); and update URIChunkNotFoundError.Error() to return "inference:
state chunk uri not found" when URI is empty and core.Sprintf("inference: state
chunk uri %q not found", e.URI) otherwise (use lowercase "uri" and no trailing
period). Ensure these changes reference the existing symbols ErrChunkNotFound,
ChunkNotFoundError.Error, and URIChunkNotFoundError.Error.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@go/state/identity.go`:
- Line 58: Update the comment above the SamplerConfig declaration to use UK
English spelling: replace the word "serializable" with "serialisable" in the
comment describing SamplerConfig so it reads "SamplerConfig is the serialisable
form of generation sampler settings." Ensure the change targets the comment tied
to the SamplerConfig symbol in go/state/identity.go.

In `@go/state/store.go`:
- Around line 105-186: The four exported helpers Resolve, ResolveBytes,
ResolveRefBytes and ResolveURI must return core.Result instead of (Chunk,
error): change their signatures to return core.Result and update all return
sites to produce a result with Ok true and Value set to the Chunk on success, or
Ok false with Error set to the appropriate error (e.g. &ChunkNotFoundError{...},
&URIChunkNotFoundError{...}, or wrapped errors from resolver.Resolve /
resolver.ResolveBytes / resolver.ResolveRefBytes); keep the same nil-context
handling and interface checks for Resolver, BinaryResolver, RefBinaryResolver
and URIResolver and convert their error returns into failure core.Result values.
Ensure callers are updated to inspect r.OK and use r.Value only on success.
- Around line 13-99: Update the error messages to the required "inference:"
format: change ErrChunkNotFound to use core.NewError("inference: state chunk not
found"); update ChunkNotFoundError.Error() to return core.Sprintf("inference:
state chunk %d not found", e.ID); and update URIChunkNotFoundError.Error() to
return "inference: state chunk uri not found" when URI is empty and
core.Sprintf("inference: state chunk uri %q not found", e.URI) otherwise (use
lowercase "uri" and no trailing period). Ensure these changes reference the
existing symbols ErrChunkNotFound, ChunkNotFoundError.Error, and
URIChunkNotFoundError.Error.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 406ae289-de22-40c2-a512-f57c6d05db21

📥 Commits

Reviewing files that changed from the base of the PR and between feb256a and 6cb95d7.

📒 Files selected for processing (11)

docs/README.md
docs/ollama/ollama.md
docs/state/README.md
docs/state/agent_memory.md
docs/state/filestore.md
docs/state/identity.md
docs/state/memory.md
docs/state/store.md
go/state/filestore/store_test.go
go/state/identity.go
go/state/store.go

✅ Files skipped from review due to trivial changes (7)

docs/state/README.md
docs/state/memory.md
docs/README.md
docs/state/filestore.md
docs/state/store.md
docs/state/agent_memory.md
docs/state/identity.md

… v0.10.0 GGUF metadata parsing calls readGGUFString once per key plus once per string-typed value: architecture, tokenizer.ggml.tokens (the full vocab of up to 256k entries on tokenisers like Gemma's), block names, file type, RoPE settings. Every call previously did `string(buf)` — a copy of a freshly-allocated, single-owner byte slice. core/go v0.10.0 exports the AsString primitive (zero-copy view). Lift that here. For a 256k-vocab model with average 8-byte tokens, this eliminates ~2 MB of avoidable allocations + copy work per model load. Also bumps core/go dep v0.9.0 → v0.10.0 to pick up the framework-wide perf round (Fs.validatePath cache, IPC AtomicPointer dispatch, Lock wrapper cache, ID single-buffer, CleanPath fast path, WriteString zero-copy, AsBytes/AsString SPOR file).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/go.mod`:
- Line 5: Remove the non-stdlib dependency by deleting the `require dappco.re/go
v0.10.0` entry from go.mod, remove any corresponding entries from go.sum (or run
`go mod tidy` to clean them), and update all non-test Go files that import core
"dappco.re/go" to either use equivalent stdlib packages or replace the imported
functionality with local code; specifically search for the import alias core
"dappco.re/go" and refactor callers that rely on its APIs to use
standard-library alternatives (e.g., context, net/http, encoding/json, io, os)
or in-repo implementations, then run `go vet`/`go build`/`go test` to verify
compilation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 30067291-0ce1-468b-9a16-f0a373d28259

📥 Commits

Reviewing files that changed from the base of the PR and between 6cb95d7 and 03a06d0.

⛔ Files ignored due to path filters (1)

go/go.sum is excluded by !**/*.sum

📒 Files selected for processing (2)

go/gguf.go
go/go.mod

TextModelCapabilities built its report.Capabilities slice with a literal 4-item initializer then appended ~13 times for each optional interface the model implements. Every append-past-capacity triggered a slice grow. For a FullSurface model (all optional interfaces detected), the loop hit 2-3 grows, each carrying a fresh backing- array allocation. Pre-size to maxTextModelCapabilities (28 — the upper bound across every optional-interface branch in this function). Eliminates the literal-4 alloc + all grow allocs. Bench delta (Apple M3 Ultra, -benchtime=500ms): Capability_TextModelCapabilities_Plain 3125ns → 403ns (7.7x) Capability_TextModelCapabilities_FullSurface 3479ns → 403ns (8.6x) Capability_TextModelCapabilities_Nil 1125ns → 20ns (cached zero path) Allocs (FullSurface): 3 → 1 Bytes (FullSurface): 2208 → 2048 Trade: ~1.7KB more bytes per call on Plain (we always allocate for the upper bound even when only 4 capabilities are emitted). Acceptable because TextModelCapabilities runs once per Backend.Load() reporting, not per request — model load is one-shot, the bytes delta is noise. Every backend (go-mlx, go-rocm, go-cuda) pays this per Load(); the 8x speedup propagates the moment they sync this dep. AX-11 alloc + behavioural budget added: TestCapability_AllocBudget_ TextModelCapabilities_FullSurface asserts (a) the returned report has the expected capability count for the fixture, and (b) alloc count stays at 1 (single backing slice). The error message in the gate tells future agents that adding a new optional-interface branch requires bumping maxTextModelCapabilities to keep the budget at 1 — mechanical CI gate instead of silent perf regression. Co-Authored-By: Virgil <virgil@lethean.io>

The jsonenc package (the wire encoder every adapter — anthropic, ollama, openai — calls per response field) had zero benchmark coverage. AX-11 § "Audit cadence" specifies hot-path functions ship with benchmarks. Per-response JSON encoding is the hottest path in the inference output layer. 12 bench scenarios planted across the public surface: AppendJSONString: ShortNoEscape / LongNoEscape / WithEscapes / AllEscapes / Empty AppendStringField: Typical AppendIntField: Typical AppendInt64Field: Typical AppendBoolField: Typical AppendFloat32Field: Typical AppendFloat32: Typical AppendFloat64: Typical Baseline measurements (Apple M3 Ultra, -benchtime=300ms, all 0 allocs): AppendJSONString_ShortNoEscape 34.36 ns/op AppendJSONString_LongNoEscape 535.60 ns/op (1 KiB body, bulk-copy) AppendJSONString_WithEscapes 252.00 ns/op AppendJSONString_AllEscapes 203.50 ns/op (112-char all-escapes) AppendJSONString_Empty 2.91 ns/op AppendStringField_Typical 9.52 ns/op AppendIntField_Typical 9.07 ns/op AppendInt64Field_Typical 13.46 ns/op AppendBoolField_Typical 0.81 ns/op (literal write) AppendFloat32Field_Typical 20.01 ns/op AppendFloat32_Typical 17.71 ns/op AppendFloat64_Typical 29.68 ns/op All 12 paths confirmed zero-allocation on a pre-sized caller buffer — the caller-provided-buf pattern is the whole point of this package, and the regression gate now mechanically enforces it. AX-11 zero-alloc budget gate added: TestAllocBudget_JSONEnc_Append NoAllocs, asserting allocs/call == 0 for 9 representative cases (both fast-path and edge cases). The error message tells future agents this is per-response hot path and surfaces the profile command for triage if it ever trips. The encoder code itself was already at the alloc/perf floor — this commit lands the AX-11 contract (the bench output now serves as the documented expected cost per the RFC: "benchmark output is a contract"). Future changes that quietly turn an Append* into an allocator now fail loud in CI. Co-Authored-By: Virgil <virgil@lethean.io>

The model/pack package owns the .model on-disk format every backend shells through to bundle + verify model artifacts. No bench coverage existed — AX-11 § "Audit cadence" requires hot-path functions ship with benchmarks. Landing the baseline IS the contract. 4 bench scenarios planted across the public surface: Pack_Hash_Typical — content hash on a fixture model dir Pack_Fingerprint_Typical — sha256 of Manifest.Identity() Pack_List_Typical — manifest decode + entry enumeration Pack_Inspect_Typical — manifest + structural inspection Baselines (Apple M3 Ultra, -benchmem): Pack_Hash_Typical 116 allocs / 12120 B / 69 µs Pack_Fingerprint_Typical 5 allocs / 729 B / 641 ns Pack_List_Typical 196 allocs / 26045 B / 29 µs Pack_Inspect_Typical 139 allocs / 23838 B / 22 µs Hash + List + Inspect all sit > 100 allocs/call — substantial optimisation ground (likely the same Fs-per-call pattern that Discover had pre-5f29441, where `(&core.Fs{}).NewUnrestricted()` fires inside the function body). Next commit will profile + cut. AX-11 alloc-budget gate added: TestAllocBudget_Pack_Hash_Typical. Initial ceiling 130 (matching measured 116 + ~10% headroom). Also asserts hash is deterministic across two runs + correct sha256 length (64 hex chars). Both regression gates in one test. buildFixturePack signature widened from *testing.T to testing.TB so benchmarks reuse the same fixture builder tests already use — no copy-paste of the file list. Trivial refactor; existing tests unaffected. Co-Authored-By: Virgil <virgil@lethean.io>

Every Hash/Pack/Unpack/List/Inspect call paid a fresh (&core.Fs{}).NewUnrestricted() construction. Same pattern as the Discover sync.Once Core cache landed in 5f29441 — Fs is stateless (no per-call context, no auth scope mutation), one handle serves every call. Bench delta (Apple M3 Ultra, -benchtime=300ms): Pack_Hash_Typical 116 → 112 allocs (-4, -3%) 12120 → 11920 B (-200) 69 → 68 µs (noise) Modest win — the remaining alloc floor is dominated by OS file I/O (Stat, ReadFile, WalkSeq) below this layer. Cutting further needs bigger architectural moves (mmap, single-syscall walk, pooled buffers) — RFC-class, not sweep-class. The 4 sites updated also propagate to Pack/Unpack/List/Inspect implicitly — every model bundling op shaves the same per-call Fs init cost. Backends doing repeated pack inspection (fleet sniff loops, model rail enumeration) compound the saving. AX-11 budget gate ratcheted: TestAllocBudget_Pack_Hash_Typical ceiling 130 → 120 (post-fix 112 + ~7% headroom for stdlib drift). Comment notes the remaining floor is OS I/O so future agents know where to look (or NOT look) for next-tier wins. Co-Authored-By: Virgil <virgil@lethean.io>

The chunkenc append* helpers fire per-token on the streaming serve path — serveStreaming emits one ChatCompletionChunk per content / thought delta in the SSE loop. Per-token cost scales with tok/s across every adapter consumer (lthn-mlx, openai-compat proxies, the OpenAI-shaped MCP bridge). Zero bench coverage existed. AX-11 RFC § "What counts as a hot path" lists "Per-token scoring" at the top of the hot-must-bench table. This commit lands the contract. 7 benches planted: AppendChunk_Priming 81 ns/op (once per response, role marker) AppendChunk_Delta_ShortToken 74 ns/op (in-loop hot path, 1-byte delta) AppendChunk_Delta_LongToken 82 ns/op (~30-byte delta, per-byte copy) AppendChunk_Terminating 69 ns/op (once per response, FinishReason) AppendChunkSSE_Delta_ShortToken 76 ns/op (with "data: " framing) AppendChunkSSE_Delta_LongToken 90 ns/op FrameSize_Delta 3 ns/op (pre-alloc helper, lookup-only) All paths confirmed zero-allocation on a pre-sized caller buffer — the caller-owned-buf pattern is fundamental to the per-token streaming design (no GC pressure per token). Per-token math: at 100 tok/s on a typical chat stream, this layer is responsible for 7.5µs/sec of CPU = 0.00075% — well below the inference/decode cost. Optimisation headroom here is small in the current design; the bench's value is regression detection, not further win-hunting. AX-11 zero-alloc budget gate added: TestAllocBudget_ChunkEnc_ AppendNoAllocs covers 4 representative cases (Priming, Delta_ ShortToken, Terminating, SSE_Delta). Error message tells future agents the regression scales per-token (a 1000-token stream pays 1000x any added alloc) and surfaces the profile command. Co-Authored-By: Virgil <virgil@lethean.io>

…locs on NewProcessor The flat []thinkingMarker view + its parallel start-set are read-only after construction. Building them on every NewProcessor call paid a slice alloc for each plus a re-grow when the per-marker ends count exceeded the start-marker count (GPT-OSS: 6 markers × 3 ends per). Hold both views on *builtinOutputParser (built once at registry init). markersForHint + NewProcessor now hand out the cached slice headers. Per-stream open cost: NewProcessor_Qwen 252.8 ns / 664 B / 4 allocs → 132.1 ns / 264 B / 2 allocs (-48% time, -60% bytes) NewProcessor_Gemma 398.1 ns / 1360 B / 4 allocs → 142.0 ns / 272 B / 2 allocs (-64% time, -80% bytes) MarkersForHint_Qwen 172.8 ns / 328 B / 2 allocs → 96.7 ns / 8 B / 1 alloc (-44% time, -98% bytes) MarkersForHint_Gemma 290.9 ns / 912 B / 2 allocs → 107.2 ns / 16 B / 1 alloc (-63% time, -98% bytes) MarkersForHint_GPTOSS 540.5 ns / 2128 B / 4 allocs → 144.9 ns / 16 B / 2 allocs (-73% time, -99% bytes) TDD anchor: TestAllocBudget_Thinking_MarkersForHint + TestAllocBudget_Thinking_NewProcessor gate the floor (1 / 2 / 2-3 allocs) against future regression. Floor is Family's core.Concat transient + (for dash-bearing arch hints) the NormaliseKey replace transient — neither is in the cached view path. Co-Authored-By: Virgil <virgil@lethean.io>

…-call path The common assistant response carries no tool calls (plain prose). The previous shape unconditionally allocated a strings.Builder + wrote the entire response into it + paid a .String() copy at return — even when the scan found zero tool-call markers and we could have returned the input string directly. Lazy-init the builder only on first tagged-block hit; fall through to returning text verbatim when no tags are found AND parseToolPayload finds no JSON-shaped payload. Pre-Grow to len(text) when we DO build, so the first WriteString sizes the buffer exactly once. Per-response no-call path: ParseText_NoCalls_Short not measured → 25.8 ns / 0 B / 0 alloc ParseText_NoCalls_Mid not measured → 58.8 ns / 0 B / 0 alloc ParseText_NoCalls_Long 1307 ns / 10240 B / 1 alloc → 397 ns / 0 B / 0 alloc (-70% time, -100% mem) Per-response one-call path also benefits from the Grow pre-size: ParseText_OneCall_Mid 1460 ns / 3277 B → 1346 ns / 2540 B (-22% bytes) ParseText_OneCall_Long 3667 ns / 18791 B → 2787 ns / 12027 B (-36% bytes) TDD anchor: TestAllocBudget_Tools_ParseText_NoCalls pins zero-alloc on the no-call path at Short/Mid/Long. A regression there scales per response — every assistant turn pays it. Co-Authored-By: Virgil <virgil@lethean.io>

…t — zero-alloc on no-marker response parseReasoningText always allocated a strings.Builder + wrote len(text) bytes into it + paid a .String() copy at return, even when the input carried zero reasoning markers (the common case for every assistant response from a non-reasoning model). Fuse the up-front probe with the first iteration's findReasoningStart so we don't pay for a duplicate scan when the response does carry a span. Short-circuit early-return with text verbatim when the scan misses. Per-response no-span path: ParseText_NoSpan_Qwen 244 ns / 1280 B / 1 alloc → 100 ns / 0 B / 0 alloc (-59% time, -100% mem) Span paths unchanged (3 allocs, same time at 2s benchtime — the fused probe reuses its result for the first iteration). TDD anchor: TestAllocBudget_Reasoning_ParseText_NoSpan pins zero-alloc on the no-span path at Short/Mid/Long. A regression there scales per response — every assistant turn from a non-reasoning model pays it. Co-Authored-By: Virgil <virgil@lethean.io>

…w allocs per model load ParseProfile built profile.Tensors via append() from a nil slice — the cap=0 doubling cascade paid log2(N) extra allocs + discarded backing arrays. Production codebook profiles carry one descriptor per quantised tensor (hundreds for Gemma/Qwen-class models); each unnecessary grow is real bytes off the model-open critical path. Per-load cost: ParseProfile_Large (7 tensors) 9288 ns / 7864 B / 89 allocs → 9018 ns / 6392 B / 86 allocs (-19% bytes, -3 allocs) TDD anchor: TestAllocBudget_Codebook_ParseProfile_TensorCount pins the 7-tensor floor at 88 (2-alloc margin for stdlib JSON drift). A regression there means the pre-size was removed and the doubling cascade is back. Co-Authored-By: Virgil <virgil@lethean.io>

…% mem on plain-token streaming ThinkingExtractor.drain unconditionally allocated two strings.Builders on every per-token Process call, even when only one channel would receive text. Plain non-reasoning tokens (the dominant case on any streaming response) paid for the thoughtDelta builder that never wrote a byte; markerStarts() rebuilt the marker-start slice on every miss- path of drain. Lazy-init the deltas via ensureBuilder; cache the marker-start slice at package level (read-only after init). Add a single-marker fast path so the inPaired + same-channel-thought branches don't pay the []string{marker} per-call slice alloc that splitSafeSuffix needs. Per-token streaming cost: Process_PlainTokenShort 179 ns / 88 B / 2 allocs → 167 ns / 40 B / 2 allocs (-54% mem) Process_LongPlainDelta 275 ns / 336 B / 2 allocs → 267 ns / 288 B / 2 allocs (-14% mem) Process_ChannelMarker 333 ns / 96 B / 3 allocs → 288 ns / 80 B / 4 allocs (-14% time, -17% mem, +1 alloc on marker tokens) Process_PairedThinkBlock 237 ns / 96 B / 3 allocs → 232 ns / 80 B / 4 allocs (-17% mem, +1 alloc on marker tokens) Plain tokens dominate every stream — the +1 alloc on rare marker- bearing tokens is bounded; the saved bytes scale per token. TDD anchor: TestAllocBudget_OpenAI_ThinkingExtractor_PlainToken pins the plain-path floor at 2 allocs (extractor + lazy contentDelta). A regression there scales per token. Co-Authored-By: Virgil <virgil@lethean.io>

…loc per Run report.Quality.Checks starts as nil; qualityChecks() already returns a pre-sized 2-element slice. The previous shape paid an extra append- into-nil grow that copied the 2 elements into a fresh backing array. Direct assignment hands the qualityChecks slice in. Per-Run cost: Run_Minimal 621 ns / 3328 B / 7 allocs → 622 ns / 3232 B / 6 allocs Run_TenRuns 2195 ns / 7619 B / 26 allocs → 2184 ns / 7523 B / 25 allocs Co-Authored-By: Virgil <virgil@lethean.io>

qwenMarkers / gemmaMarkers / gptOSSMarkers / genericMarkers rebuilt their full marker slice (plus per-marker ends []string literals) on every call. Consumers — newBuiltinOutputParser (defensive copy), registry init, parseReasoningText (read-only iteration) — never mutate the returned header, so we can hand back the same backing slice on every invocation. Before → after on Apple M3 Ultra (benchtime=2s): Generic 985.3 ns/op 288 B/op 5 allocs → 1.140 ns/op 0 B/op 0 allocs Qwen 261.7 ns/op 432 B/op 7 allocs → 1.140 ns/op 0 B/op 0 allocs Gemma 472.1 ns/op 1664 B/op 14 allocs → 1.137 ns/op 0 B/op 0 allocs GPTOSS 394.1 ns/op 1360 B/op 12 allocs → 1.162 ns/op 0 B/op 0 allocs Test_Markers_NoAllocs locks the budget at 0 allocs/op so any future change that drops the cache flips an immediate regression. Co-Authored-By: Virgil <virgil@lethean.io>

…ppend grow The size estimate undersized by 1-2 bytes on both populated branches: ContentOnly forgot the closing quote (estimated 11+len, actual 12+len), RolePriming forgot the leading comma between role and content fields (estimated 11+len, actual 13+len). Both shapes triggered a single append grow per call — an avoidable second allocation on the per-token SSE streaming hot path. Before → after on Apple M3 Ultra (benchtime=2s): ContentOnly 72.43 ns/op 72 B/op 2 allocs → 24.86 ns/op 24 B/op 1 alloc RolePriming 66.05 ns/op 96 B/op 2 allocs → 35.01 ns/op 48 B/op 1 alloc Empty 1.57 ns/op 0 B/op 0 allocs → 1.64 ns/op 0 B/op 0 allocs TestChatMessageDelta_Marshal_AllocBudget locks one alloc on the no-escape path so any future tweak that re-undersizes the envelope flips an immediate regression. The escape-heavy path (control chars or quotes in user content) still pays one grow; that's not the streaming hot case. Co-Authored-By: Virgil <virgil@lethean.io>

…— zero-alloc When the first tool_call tag in the stream never closes, the previous shape lazily-allocated a Builder, wrote pending[:idx] + pending[idx:] to it, then returned visible.String() — which by construction equals the original text. Two allocations (builder + String copy) and a 416-byte intermediate buffer on every per-flush call. Detect the unclosed-first-marker case (visible == nil at the end<0 branch) and return the original text slice directly. Multi-marker streams where an earlier block closed successfully still funnel through the builder path — the short-circuit only fires when the builder hasn't been needed yet. Before → after on Apple M3 Ultra (benchtime=2s): Unclosed 120.4 ns/op 416 B/op 2 allocs → 47.44 ns/op 0 B/op 0 allocs Other tool-call benches unchanged (OneCall_Short 1308→1299 ns, FiveCalls_Long 9156→9115 ns — measurement noise). Test_Tools_ParseText_Unclosed_ZeroAlloc locks the budget at zero so a future re-introduction of the builder hop fails immediately. Co-Authored-By: Virgil <virgil@lethean.io>

…der copy When the first reasoning marker in the stream never closes (model emitted `<think>...` then streaming cut off, or the partial-flush hit before the close tag landed), the previous shape always built a strings.Builder + wrote text[:idx] + paid String() to extract back the same bytes. Probe firstReasoningEnd BEFORE the builder alloc and return text[:idx] as a direct slice into the input on the unclosed first-marker path. Multi-marker streams with a later unclosed span still funnel through the builder (the in-loop unclosed branch was moved alongside the lookahead so the slice short-circuit doesn't shadow that path's prefix accumulation). Before → after on Apple M3 Ultra (benchtime=1s): Unclosed_Qwen 175.4 ns/op 80 B/op 2 allocs → 158.5 ns/op 64 B/op 1 alloc NoSpan (zero-marker) unchanged at 105.7 ns/op / 0 allocs; closed-span benches (Qwen/Gemma/GPTOSS/Generic × Tokens × Span%) all unchanged at 3 allocs (builder + buf + segments grow), within measurement noise. Test_Reasoning_ParseText_Unclosed_OneAlloc locks the budget at 1 alloc (the segment slice for the open span) so a future re-introduction of the builder allocation on this path fails immediately. Co-Authored-By: Virgil <virgil@lethean.io>

…faster The previous shape pipelined four string transforms in sequence: core.Trim + core.Lower + replaceAll('-', '_') + replaceAll('.', '_'). Each replaceAll allocates a fresh result string when the byte hits; core.Lower allocates when any ASCII A-Z is present. For wire keys like "Qwen-3.5" the chain landed at 3 allocations per call — NormaliseKey fires on every Registry.Lookup / LookupHint, which fires per generation request when callers don't cache a Registry. The replacement scans for any transformable byte (A-Z, '-', '.') and returns the trimmed input verbatim when none are present — zero allocation on the already-canonical-key fast path. When a transform IS needed, a single-pass walker lowercases ASCII letters and rewrites '-' / '.' to '_' into one make([]byte) buffer, returning the result via the standard byte-to-string conversion. Before → after on Apple M3 Ultra (benchtime=1s): AlreadyClean 21.70 ns/op 0 B/op 0 allocs → 5.28 ns/op 0 B/op 0 allocs MixedCase 42.45 ns/op 8 B/op 1 alloc → 20.08 ns/op 5 B/op 1 alloc NeedsReplace 98.45 ns/op 24 B/op 3 allocs → 24.07 ns/op 8 B/op 1 alloc Empty 15.09 ns/op 0 B/op 0 allocs → 2.39 ns/op 0 B/op 0 allocs Family_* benches inherit the win since Family calls NormaliseKey twice (architecture + adapter): Family_Qwen 71.57 → 37.97 ns/op Family_Gemma 81.85 → 47.17 ns/op Family_Granite 139.80 → 106.40 ns/op Family_Unknown 224.90 → 171.10 ns/op Family_QwenWithAdapter 103.60 → 67.54 ns/op Test_Selector_NormaliseKey_AllocBudget locks the alloc budget at 0 on the clean/empty path and 1 on the transform path so any future reintroduction of the chained-replaceAll shape flips immediately. Co-Authored-By: Virgil <virgil@lethean.io>

The variadic NewProbeBus(s1, s2, s3, s4) construction path called Add per sink, and Add did append on a nil-backed slice — paying the grow-doubling chain (nil → cap 1 → 2 → 4) on every entry past the first. Four sinks costed 4 allocations total (bus struct + 3 grow allocations); the pre-counted live-sink make collapses the trailing allocations into a single right-sized backing array. Nil-sink filtering moves up to the constructor so the bus.sinks slice never holds a nil entry the EmitProbe loop has to test against. Before → after on Apple M3 Ultra (benchtime=1s): NoSinks 15.82 ns/op 24 B/op 1 alloc → 15.64 ns/op 24 B/op 1 alloc OneSink 34.18 ns/op 40 B/op 2 allocs → 31.40 ns/op 40 B/op 2 allocs FourSinks 92.75 ns/op 136 B/op 4 allocs → 45.13 ns/op 88 B/op 2 allocs TestNewProbeBus_AllocBudget locks the 1-alloc empty / 2-alloc populated shape so a future reintroduction of the append-on-nil pattern flips an immediate regression. Co-Authored-By: Virgil <virgil@lethean.io>

… class The previous estimator reserved int64-worst-case digit widths (20) for both Created and Index, plus over-counted the delta envelope by 33 bytes regardless of which delta fields were populated. The priming/delta/final frames all landed in the 240/256-byte size class when their real emit was 159-188 bytes — paying for one allocator size class more than necessary on every per-token streaming SSE frame. The tightened estimator hardcodes the realistic reserves (Created = 10 digits / Unix seconds through year 2286; Index = 4 digits / 9999 n-best, well past any practical multi-choice stream) and uses a chatMessageDeltaSize helper that mirrors appendChatMessageDelta's exact three branches (empty / content-only / role+content). The closing `]` for the choices array was added (was missing — the previous over-estimate covered the bug). Per-iteration logic also fixed: leading comma between choices is now counted only when needed (was double-counted with the closing brace before). Before → after on Apple M3 Ultra (benchtime=2s): Priming 130.2 ns/op 240 B/op 1 alloc → 125.5 ns/op 192 B/op 1 alloc Delta 124.3 ns/op 240 B/op 1 alloc → 116.3 ns/op 192 B/op 1 alloc Final 113.7 ns/op 224 B/op 1 alloc → 112.4 ns/op 176 B/op 1 alloc Per-frame size class drops one tier (240→192, 224→176). For a 256-token stream that's ~12 KB less garbage per request. TestChunkSSEFrameSize_NeverUnderCounts locks the safety property — estimator must always be >= actual emit length across every realistic chunk shape (priming, delta, terminating, large-index, multi-choice, with-thought) — so a future "shave one more byte" change can't silently underflow and reintroduce per-frame grow allocations. Co-Authored-By: Virgil <virgil@lethean.io>

…n walks joinPath/cleanPath fired core.Env("DS") on every call to resolve the directory separator. Env walks a map fallback into os.Getenv when the key isn't set — the common case, since DS is only overridden by tests. Discover's walk hits joinPath/cleanPath per directory entry, so a NestedTree scan with hundreds of entries paid hundreds of Getenv calls just to read the same "/" value. sync.Once-cache the resolved separator since the override is a process-start fixture. The cache is set on first call and never invalidated. joinPath now resolves the separator once per call (passed into both core.Join and core.CleanPath) instead of fetching twice. Before → after on Apple M3 Ultra (benchtime=500ms): JoinPath_ThreeParts 82.51 ns/op → 70.18 ns/op (-15%) AbsolutePath_AlreadyAbsolute 48.08 ns/op → 42.10 ns/op (-13%) AbsolutePath_Relative 123.90 ns/op → 110.40 ns/op (-11%) Discover_NestedTree 175.30 ms/op → 147.86 ms/op (-16%) Discover_NoModels_TenJunkDirs 240.36 ms/op → 184.23 ms/op (-23%) Discover_ThreeSiblings 126.91 ms/op → 118.29 ms/op (-7%) The walk-level wins come from the per-entry cost reduction multiplying across the directory tree — every entry the walker touches paid the Getenv tax before. Co-Authored-By: Virgil <virgil@lethean.io>

roleBits iterates the six tensor roles and calls bitsForRole per role, which in turn called ProfileBits(info.Profile) inside firstPositive's argument list — ProfileBits fires core.Lower on the profile name, an allocating string copy when the name contains any uppercase byte (the canonical "JANGTQ" / "JANG_4" profile names do). Six per-role calls paid six Lower allocations for the same input. The map literal `out := map[string]int{}` also forced a grow on every insert beyond the first. Hoist ProfileBits to a single resolution at roleBits entry and pre-size the result map to len(roles) — bitsForRoleWithFallback takes the resolved fallback as a parameter so the six per-role calls share one Lower allocation. The standalone bitsForRole (used by NewPackedTensorDescriptor for one-off lookups) keeps the same shape; only the batch hot path changes. Before → after on Apple M3 Ultra (benchtime=1s): BuildPackedProfile 635.1 ns/op 560 B/op 13 allocs → 419.8 ns/op 520 B/op 8 allocs 33% faster, 38% fewer allocs. ClonePackedProfile (independent path) unchanged at 183 ns/op / 416 B/op / 3 allocs. Co-Authored-By: Virgil <virgil@lethean.io>

…+ packedFormat quantizationType and packedFormat each built a Lower(Concat(Profile, WeightFormat, Method)) fingerprint and walked the same 3 substring matches (jangtq, mxtq, jang). BuildPackedProfile called both per PackedProfile construction, paying two Concats and two Lowers for the same input bag. Hoist the fingerprint build into BuildPackedProfile and add two internal *_FromFingerprint helpers that take the shared lowered string. quantizationType + packedFormat keep the same self-contained shape for any future caller outside the hot loop. Before → after on Apple M3 Ultra (benchtime=1s): BuildPackedProfile 419.8 ns/op 520 B/op 8 allocs → 309.7 ns/op 472 B/op 6 allocs Cumulative against round-2 baseline: BuildPackedProfile 635.1 ns/op 560 B/op 13 allocs → 309.7 ns/op 472 B/op 6 allocs 51% faster, 54% fewer allocs end-to-end. Co-Authored-By: Virgil <virgil@lethean.io>

@500ms

…e builder + cap-1 segments parseReasoningText fired core.NewBuilder() with no Grow, so the visible builder paid a log2(N) buffer-doubling cascade as it filled — memprofile attributed ~65% of allocated bytes to strings.(*Builder).WriteString growth. The empty segments literal also grew from zero on first append. Pre-grow the builder to the first span's exact visible bound (len(text) - len(marker.start) - end - endSize) and pre-size segments to cap 1 (single <think> block is the dominant shape). before -> after (allocs/op . B/op . ns/op @500ms): qwen/Span50pct/Tokens256: 3 . 1408 . 371 -> 2 . 704 . 369 qwen/Span10pct/Tokens2048: 3 .15168 .2545 -> 2 . 9536 .2113 gemma/Span90pct/Tokens2048: 3 . 1728 .2149 -> 2 . 1216 .2270 3->2 allocs/op on every reasoning-bearing case; B/op down 30-50%; ns flat-to-faster. Fires once per generation, so saves 1 alloc + a third of the reasoning-parse bytes on every reasoning response. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

@300ms

…de->map->remarshal) parsedToolCall.Arguments + parsedFunction.Arguments were typed `any`, so encoding/json decoded each tool-call's arguments object into a map[string]any (one map alloc + interface boxing per value), and normaliseArgumentsJSON then re-marshalled that map straight back to JSON via JSONMarshalString. A pure JSON->map->JSON round-trip per call. Type them core.RawMessage (deferred-decode bytes): the object/array case is the captured bytes verbatim (no map, no re-encode); a JSON-string-encoded argument is unquoted to its inner JSON. before -> after (allocs/op . B/op . ns/op @300ms): ParseTools_OneCall: 20 . 1137 . 879 -> 14 . 712 . 665 ParseTools_FiveCalls: 110 . 6722 .5733 -> 61 . 4280 .4039 -30%/-45% allocs, -37%/-36% bytes, -24%/-30% ns. ParseTools fires once per assistant turn; a tool-calling agent loop pays this per step. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

@500ms

…lice baseTokens rebuilt the per-request option slice from the stored SamplerConfig via up to seven separate inference.WithX calls. Each WithX returns a fresh field-capturing closure (heap-allocated), so a populated sampler paid 1-7 closure allocs plus the backing-array alloc on every scheduled request. Fuse them into a single closure that applies the whole SamplerConfig (same conditional semantics — only override a base default when the sampler carries a meaningful value), and serve the zero-value greedy sampler from a cached slice (the old schedTempZeroOpt path still allocated the slice each call; caching the whole slice removes that too). before -> after (allocs/op . B/op . ns/op @500ms): GenerateOptions_4Fields: 5 . 128 . 53 -> 2 . 104 . 30 GenerateOptions_16StopTokens: 6 . ... -> 2 . 104 . 30 Schedule_1Token: 18 .1674 -> 17 .1610 Generate_256Tokens: 21 .34059 -> 20 .34067 generateOptions fires once per scheduled request; -3 to -4 allocs on a populated sampler, -1 on the full Generate/Schedule path, and the greedy burst-dispatch default is now allocation-free. Co-authored-by: Hephaestus <hephaestus@lthn.ai>

Co-Authored-By: Virgil <virgil@lethean.io>

go-mlx's serveAnthropicMessageStream emits an incomplete Anthropic SSE sequence — a bare MessageResponse as message_start (no {type,message} wrapper), no content_block_start/stop, no index on deltas, no usage on message_delta — so Claude Code's parser can't render against lthn-mlx. Add the typed, hand-rolled event builders (same caller-owns-buf shape as AppendMessageResponse): AppendMessageStartEvent (wrapped), AppendContentBlockStartEvent, AppendContentBlockDeltaEvent (index), AppendContentBlockStopEvent, AppendMessageDeltaEvent (usage + stop_sequence), plus MessageStop/Ping payloads. Tested standalone. Wiring the handler to emit the full sequence (go-mlx) is the next step. Co-Authored-By: Virgil <virgil@lethean.io>

…each content ThinkingExtractor scanned only the <|channel> OPEN marker to end a reasoning channel (gpt-oss style). Gemma4 terminates its thought channel with an explicit <channel|> CLOSE, after which the tokens are the visible answer. Without close-marker handling the answer was classified as thinking and chat-completions content came back EMPTY for any reasoning-heavy generation. Add channelCloseMarker; in the in-thought drain branch honour whichever of open/close appears first, switching back to the assistant channel on close; feed both markers to the boundary-safe suffix split so a partial <channel|> straddling a token boundary is held back, not mis-emitted as thinking. Shared primitive: serves every LEM Engine (lthn-mlx/amd/cuda/cpu) through the /v1/chat/completions handler, so the fix lands once for all drivers. Verified: new whole + chunked ThinkingExtractor tests (RED→GREEN); serve smoke on gemma-4-26B-A4B-4bit + E2B-4bit returns finish=stop with the answer in content (was empty). go-mlx #48. Co-Authored-By: Virgil <virgil@lethean.io>

… reasoning_effort) Expose the Gemma 4 reasoning toggle on /v1/chat/completions. Add GenerateConfig.EnableThinking (*bool) + WithEnableThinking to the shared inference options; parse chat_template_kwargs.enable_thinking (vLLM/SGLang convention) and reasoning_effort=="none" from the request (the hand-rolled decoder gains a nested-object walker); GenerateOptions threads the resolved override. Also fix the chat handler dropping fields: it rebuilt a partial ChatCompletionRequest for GenerateOptions, silently discarding the new (and any future) request fields — pass the decoded request through instead. Shared primitive: every LEM Engine consuming go-inference inherits the surface. Verified at the go-mlx serve — chat_template_kwargs{enable_thinking:false} and reasoning_effort:"none" both yield 0 reasoning tokens (direct answer) vs 179 with thinking on. Co-Authored-By: Virgil <virgil@lethean.io>

DefaultGenerateConfig hardcoded MaxTokens=256, so any caller that did not set an explicit cap had generation truncated at 256 tokens — a guess baked into the shared contract. MaxTokens is a caller-supplied output cap; absent (0) the backend resolves it to the model's context at generation time. Remove the default — 0 means "to context". Example + test updated to match. Co-Authored-By: Virgil <virgil@lethean.io>

sonarqubecloud · 2026-06-06T17:22:21Z

Quality Gate failed

Failed conditions
4.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Snider and others added 22 commits May 1, 2026 08:34

chore: add EUPL-1.2 LICENCE file (UK English canonical)

4734855

Reference: core/api/LICENCE. Co-Authored-By: Cladius Maximus <cladius@lethean.io>

feat: add shared inference contracts

82b08bc

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): add shared capability reports

c5feeca

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): add runtime-neutral model primitives

dfdedb0

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): add openai chat adapter

a881cc6

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): add openai chat adapter

b533090

Co-Authored-By: Virgil <virgil@lethean.io>

feat(api): add openai chat adapter

b9f4d46

Co-Authored-By: Virgil <virgil@lethean.io>

fix(bench): mirror full DecodeOptimisationResult/Metrics fields

e6513bf

fix(bench): use AdapterInfo struct instead of bare strings

4ab9de2

coderabbitai Bot requested changes May 20, 2026

View reviewed changes

Comment thread go/gguf.go Outdated

api(state): rename bench warm path

feb256a

Co-Authored-By: Virgil <virgil@lethean.io>

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

api(state): promote state naming

6cb95d7

Co-Authored-By: Virgil <virgil@lethean.io>

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

coderabbitai Bot requested changes May 21, 2026

View reviewed changes

Comment thread go/go.mod Outdated

Snider and others added 29 commits May 27, 2026 12:57

chore(deps): bump core/go → v0.10.3

92ba626

Co-Authored-By: Virgil <virgil@lethean.io>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model Engine with restore KV#6

Model Engine with restore KV#6
Snider wants to merge 209 commits into
mainfrom
dev

Snider commented May 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 20, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Snider commented May 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 6, 2026

Quality Gate failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Snider commented May 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading