diff --git a/TURBOQUANT_ISSUES.md b/TURBOQUANT_ISSUES.md
deleted file mode 100644
index b12f37eb..00000000
--- a/TURBOQUANT_ISSUES.md
+++ /dev/null
@@ -1,582 +0,0 @@
-# TurboQuant Implementation Tracker
-
-> Auto-generated from `prd.md` analysis on 2026-04-08.
-> Branch: `feature/turboquant`
-
-## Legend
-
-| Symbol | Meaning |
-|--------|---------|
-| DONE | Implemented and tested |
-| IN PROGRESS | Partially implemented |
-| TODO | Not started |
-
----
-
-## Step 1: SKaiNET Core Preparation (PRD sections 1-6)
-
-### Completed
-
-- [x] **Storage & placement abstractions** — `TensorStorage`, `TensorEncoding`, `BufferHandle`, `Placement`, `LogicalDType`
-- [x] **Zero-copy & ownership semantics** — Owned, Borrowed, Aliased, FileBacked, DeviceResident
-- [x] **Packed quant unification** — `PackedBlockStorage` contract with Q4_K, Q8_0, Ternary
-- [x] **Streaming GGUF loader** — `StreamingGGUFReader` + `StreamingGgufParametersLoader`
-- [x] **Memory planning & tracking** — `MemoryPlanner`, `MemoryTracker`, `ActiveMemoryTracker`
-- [x] **Transfer & materialization APIs** — `copyMaterialize()`, `copyToHost()`, `copyToDevice()`
-- [x] **DSL annotations** — `@Place`, `@Weights`
-- [x] **Benchmarks** — `StorageBenchmarks.kt` (Q4_K, Q8_0, Ternary dequant throughput)
-- [x] **Acceptance criteria tests** — `AcceptanceCriteriaTest.kt`
-
-- [x] **KV-cache subsystem** — `KvCacheStore` interface, `DefaultKvCacheStore`, `KvCacheConfig`, `KvCacheMemoryReport`
-- [x] **SDPA compressed K/V bridge** — `CompressedKvAttention` with dequant-on-read and raw storage paths
-- [x] **Quants.kt port complete** — `byteShapeToQuantShape`, `quantByteSize`, `isBlockQuantized`, `validateQuantizedBytes`
-- [x] **SafeTensors zero-copy loading** — `StorageAwareSafeTensorsLoader` with file-backed and borrowed modes
-
-### Remaining — None (Step 1 complete)
-
----
-
-### TQ-001: KV-Cache Subsystem
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 1, Requirement 4 |
-| **Priority** | High — blocks all Step 2 work |
-| **Dependencies** | None (Step 1 foundations complete) |
-
-**Description:**
-Create a `KvCacheStore` abstraction that supports append-by-token writes, layer/head addressing, compressed K/V block storage, backend-specific read/dequant flows, and asymmetric K/V policies.
-
-**Acceptance criteria:**
-- [ ] `KvCacheStore` interface defined with append, read, and eviction APIs
-- [ ] Layer and head indexing supported
-- [ ] Storage accepts any `TensorEncoding` (including future TurboQuant)
-- [ ] Backend-specific dequant dispatch is extensible
-- [ ] Asymmetric K/V encoding policies configurable per layer
-- [ ] Unit tests for append, read, eviction, and multi-head addressing
-
-**Key files to create/modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/storage/KvCacheStore.kt` (new)
-- Tests in `skainet-lang/skainet-lang-core/src/commonTest/kotlin/sk/ainet/lang/tensor/storage/`
-
----
-
-### TQ-002: SDPA Integration for Compressed K/V
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 1, Requirement 5 |
-| **Priority** | High — blocks TurboQuant SDPA path |
-| **Dependencies** | TQ-001 |
-
-**Description:**
-Extend `scaledDotProductAttention()` in `TensorOps.kt` to detect compressed K/V from `KvCacheStore`, decompress only the needed tiles on read, and provide a seam for fused dequant+attention.
-
-**Acceptance criteria:**
-- [ ] SDPA detects `TensorEncoding` of K/V inputs
-- [ ] Compressed K/V triggers dequant-on-read path
-- [ ] Only required tiles/blocks are decompressed (not full cache)
-- [ ] Extension point exists for backend-fused kernels
-- [ ] Tests with Q4_K and Q8_0 encoded K/V (as proxies before TurboQuant)
-
-**Key files to modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/TensorOps.kt`
-
----
-
-### TQ-003: Complete Quants.kt Port
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 1, Requirement 6 |
-| **Priority** | Medium |
-| **Dependencies** | None |
-
-**Description:**
-Complete the Python-to-Kotlin port of `Quants.kt` and `Constants.kt`. Added `byteShapeToQuantShape`, `quantElementCount`, `quantByteSize`, `isBlockQuantized`, `quantBlockSize`, `quantTypeSize`, `validateQuantizedBytes`. Removed stale TODO from `Constants.kt`.
-
-**Acceptance criteria:**
-- [ ] All quantization types from llama.cpp `quants.py` are ported
-- [ ] Multi-dimension shape utilities work correctly
-- [ ] `Constants.kt` port complete
-- [ ] Unit tests for each ported quant type
-
-**Key files to modify:**
-- `skainet-io/skainet-io-gguf/src/commonMain/kotlin/sk/ainet/io/gguf/Quants.kt`
-- `skainet-io/skainet-io-gguf/src/commonMain/kotlin/sk/ainet/io/gguf/Constants.kt`
-
----
-
-### TQ-004: SafeTensors Zero-Copy / Mapped Loading
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 1, Requirement 6 |
-| **Priority** | Medium |
-| **Dependencies** | None |
-
-**Description:**
-Allow SafeTensors loaders to wrap or map buffers instead of always converting to dense arrays. Should produce `TensorStorage` with `FileBacked` or `Borrowed` buffer handles where possible.
-
-**Acceptance criteria:**
-- [ ] SafeTensors loader can produce `TensorStorage` with `FileBacked` handles
-- [ ] No unnecessary heap copy for read-only weight access
-- [ ] Falls back to `Owned` copy when mutation is required
-- [ ] Integration test with real `.safetensors` file
-
-**Key files to modify:**
-- `skainet-io/skainet-io-safetensors/` (loader implementation)
-
----
-
-## Step 2: TurboQuant Introduction (PRD sections 1-5)
-
-### Completed
-
-- [x] **TQ-010: TurboQuant encoding types** — `TurboQuantPolar`, `TurboQuantPolarQjl` in `TensorEncoding`
-- [x] **TQ-011: Random rotation kernel** — `RandomRotation` with Walsh-Hadamard + sign flips
-- [x] **TQ-012: Scalar quantizer** — `ScalarQuantizer` with per-group scales, 2/3/4/8-bit
-- [x] **TQ-013: QJL residual** — `QjlResidual` with 1-4 bit residual encoding
-- [x] **TQ-014: Bit-packing** — `BitPacker` for 2/3/4/8-bit codes
-- [x] **TQ-015: KV block APIs** — `TurboQuantCodec` encode/decode + `TurboQuantKvCacheStore`
-- [x] **TQ-016: PolarOnly e2e** — Full pipeline: rotation → quant → pack → unpack → dequant → inverse rotation
-- [x] **TQ-017+018: SDPA write/read** — `CompressedKvAttention` + `TurboQuantKvCacheStore` integration
-- [x] **TQ-019: Role-aware K/V policies** — Asymmetric key/value configs in `TurboQuantKvCacheStore`
-- [x] **TQ-020: Presets** — `TurboQuantPresets` with safe-lowbit, balanced, experimental-max
-
-- [x] **TQ-021: DSL/annotation support** — `@KvCache`, `@KvCacheBypass` annotations
-- [x] **TQ-022: CPU SIMD optimization** — `JvmTurboQuantKernels` with Java Vector API
-- [x] **TQ-025: JMH benchmarks** — Encode/decode/pack/rotate/KV cache benchmarks
-
-### Remaining
-
-- [ ] **TQ-023: Metal/Apple Silicon backend** — Requires Metal shader development
-- [ ] **TQ-024: Fused dequant+attention kernels** — Depends on TQ-023
-
----
-
-### TQ-010: TurboQuant Encoding Types
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Product definition |
-| **Priority** | High — blocks all TurboQuant kernels |
-| **Dependencies** | None |
-
-**Description:**
-Add TurboQuant variants to the sealed `TensorEncoding` hierarchy: `TurboQuantPolar` (PolarOnly) and `TurboQuantPolarQjl` (PolarPlusQjl), with configurable bit budgets and block sizes.
-
-**Acceptance criteria:**
-- [ ] `TurboQuantPolar` encoding added to `TensorEncoding`
-- [ ] `TurboQuantPolarQjl` encoding added to `TensorEncoding`
-- [ ] Configurable: bits per element, block size, codebook variant
-- [ ] `bytesPerBlock` / `elementsPerBlock` computed correctly
-- [ ] Exhaustive `when` coverage in existing code updated
-
-**Key files to modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/storage/TensorEncoding.kt`
-
----
-
-### TQ-011: Random Rotation Kernel
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 1 |
-| **Priority** | High |
-| **Dependencies** | TQ-010 |
-
-**Description:**
-Implement random rotation generation in common Kotlin. This is the first stage of the TurboQuant pipeline — rotating input vectors before quantization.
-
-**Acceptance criteria:**
-- [ ] Deterministic random rotation matrix generation (seeded)
-- [ ] Correct orthogonality properties verified
-- [ ] Works for arbitrary head dimensions
-- [ ] Common Kotlin (no platform-specific code)
-- [ ] Unit tests verifying rotation properties
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/` (new package)
-
----
-
-### TQ-012: Scalar Quantization / Codebook Lookup
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 1 |
-| **Priority** | High |
-| **Dependencies** | TQ-011 |
-
-**Description:**
-Implement scalar quantization with codebook lookup for the rotated vectors. Supports configurable bit widths (2, 3, 4, 8).
-
-**Acceptance criteria:**
-- [ ] Quantize rotated vector to N-bit codes
-- [ ] Codebook lookup for dequantization
-- [ ] Supports 2-bit, 3-bit, 4-bit, and 8-bit configurations
-- [ ] Round-trip error within expected bounds per bit width
-- [ ] Unit tests with known reference vectors
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/ScalarQuantizer.kt` (new)
-
----
-
-### TQ-013: QJL Residual Stage
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 1 |
-| **Priority** | Medium — only needed for PolarPlusQjl variant |
-| **Dependencies** | TQ-012 |
-
-**Description:**
-Implement the QJL (Quantized Johnson-Lindenstrauss) residual stage for the PolarPlusQjl variant. This preserves inner-product accuracy by capturing quantization residuals.
-
-**Acceptance criteria:**
-- [ ] QJL projection of quantization residual
-- [ ] Inner-product error reduction verified vs PolarOnly
-- [ ] Configurable residual bit budget
-- [ ] Can be disabled (for PolarOnly path)
-- [ ] Unit tests comparing IP accuracy with/without QJL
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/QjlResidual.kt` (new)
-
----
-
-### TQ-014: Bit-Packing Kernel
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 1 |
-| **Priority** | High |
-| **Dependencies** | TQ-012 |
-
-**Description:**
-Implement bit-packing/unpacking for TurboQuant codes into compact byte arrays. Must support 2, 3, 4, and 8-bit packing.
-
-**Acceptance criteria:**
-- [ ] Pack N-bit codes into byte arrays
-- [ ] Unpack byte arrays back to codes
-- [ ] Round-trip correctness for all supported bit widths
-- [ ] Append-friendly (can pack incrementally per token)
-- [ ] Unit tests for boundary conditions and all bit widths
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/BitPacker.kt` (new)
-
----
-
-### TQ-015: KV Block Append/Read APIs
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 1 |
-| **Priority** | High |
-| **Dependencies** | TQ-001, TQ-014 |
-
-**Description:**
-Implement append and read APIs that connect TurboQuant encoding/decoding to the `KvCacheStore`. New tokens are compressed on write; stored blocks are decompressed on read.
-
-**Acceptance criteria:**
-- [ ] Append single token's K/V as TurboQuant-compressed block
-- [ ] Read and decompress arbitrary range of cached tokens
-- [ ] Supports both PolarOnly and PolarPlusQjl paths
-- [ ] Memory-efficient (no full cache decompression)
-- [ ] Integration test: append N tokens, read back, verify accuracy
-
-**Key files to create/modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/TurboQuantKvCodec.kt` (new)
-- Integrates with `KvCacheStore` from TQ-001
-
----
-
-### TQ-016: PolarOnly Variant Implementation
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Supported variants |
-| **Priority** | High — primary production variant |
-| **Dependencies** | TQ-011, TQ-012, TQ-014, TQ-015 |
-
-**Description:**
-Wire together rotation + scalar quantization + bit-packing into the complete PolarOnly end-to-end path. This is the backend-friendly variant without QJL.
-
-**Acceptance criteria:**
-- [ ] End-to-end: float vector in -> compressed bytes -> float vector out
-- [ ] Configurable bit budget (2, 3, 4 bits)
-- [ ] Accuracy within expected bounds for each bit budget
-- [ ] Works through KV append/read APIs
-- [ ] Benchmark: compression ratio and throughput
-
-**Key files to modify:**
-- Orchestration in `TurboQuantKvCodec.kt`
-
----
-
-### TQ-017: SDPA TurboQuant Write Path
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 2 |
-| **Priority** | High |
-| **Dependencies** | TQ-002, TQ-016 |
-
-**Description:**
-Integrate TurboQuant compression into the SDPA write path so K/V are automatically compressed when stored to the KV cache.
-
-**Acceptance criteria:**
-- [ ] SDPA stores K/V through TurboQuant compression when configured
-- [ ] Compression is transparent to callers of `scaledDotProductAttention`
-- [ ] Configurable per-layer (some layers can skip compression)
-- [ ] No hidden densification
-
-**Key files to modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/TensorOps.kt`
-
----
-
-### TQ-018: SDPA TurboQuant Read Path
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 2 |
-| **Priority** | High |
-| **Dependencies** | TQ-002, TQ-016 |
-
-**Description:**
-Integrate TurboQuant decompression into the SDPA read path so attention is computed against decompressed K/V tiles.
-
-**Acceptance criteria:**
-- [ ] SDPA reads and decompresses only required K/V tiles
-- [ ] Tile-level decompression (not full cache)
-- [ ] Correct attention scores compared to uncompressed baseline
-- [ ] Extension point for fused backend kernels
-
-**Key files to modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/TensorOps.kt`
-
----
-
-### TQ-019: Role-Aware K/V Policies
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 3 |
-| **Priority** | Medium |
-| **Dependencies** | TQ-001, TQ-016 |
-
-**Description:**
-Support independent compression policies for keys and values — different bit budgets, block sizes, and even different variants (e.g., Q8_0 for K + TurboQuant-4 for V).
-
-**Acceptance criteria:**
-- [ ] K and V policies configurable independently
-- [ ] Different bit budgets for K vs V
-- [ ] Mixed encoding (e.g., Q8_0-K + TurboQuant-V) supported
-- [ ] Per-layer policy override
-- [ ] Configuration validated at init time
-
-**Key files to modify:**
-- `KvCacheStore` from TQ-001
-- `TurboQuantKvCodec.kt` from TQ-015
-
----
-
-### TQ-020: Presets
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Presets |
-| **Priority** | Medium |
-| **Dependencies** | TQ-019 |
-
-**Description:**
-Implement named preset configurations:
-- **safe-lowbit**: Q8_0-K + TurboQuant-4-V
-- **balanced**: TurboQuant-4 / TurboQuant-4
-- **experimental-max**: TurboQuant-3 / TurboQuant-3
-
-**Acceptance criteria:**
-- [ ] Three named presets available
-- [ ] Presets resolve to concrete K/V policy configurations
-- [ ] Presets selectable via API and DSL
-- [ ] Documentation of expected quality/compression trade-offs
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/ops/turboquant/TurboQuantPresets.kt` (new)
-
----
-
-### TQ-021: DSL / Annotation Support
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Recommended implementation order item 7 |
-| **Priority** | Low |
-| **Dependencies** | TQ-020 |
-
-**Description:**
-Extend SKaiNET DSL/annotations (`@Place`, `@Weights`) to support TurboQuant KV cache configuration declaratively.
-
-**Acceptance criteria:**
-- [ ] Annotation-based TurboQuant configuration for KV cache
-- [ ] Preset selection via annotation
-- [ ] Per-layer override via annotation
-- [ ] Integrated with existing `PlacementAnnotations.kt`
-
-**Key files to modify:**
-- `skainet-lang/skainet-lang-core/src/commonMain/kotlin/sk/ainet/lang/tensor/storage/PlacementAnnotations.kt`
-
----
-
-### TQ-022: CPU SIMD Optimization
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Functional requirement 5 |
-| **Priority** | Medium |
-| **Dependencies** | TQ-016 |
-
-**Description:**
-Optimize TurboQuant kernels (rotation, quantization, bit-packing, dequant) with CPU SIMD using the same pattern as `JvmQuantizedVectorKernels.kt`.
-
-**Acceptance criteria:**
-- [ ] SIMD-optimized rotation kernel
-- [ ] SIMD-optimized quant/dequant kernels
-- [ ] Benchmark showing speedup over common Kotlin reference
-- [ ] Correctness matches reference implementation
-
-**Key files to create/modify:**
-- `skainet-backends/skainet-backend-cpu/src/jvmMain/kotlin/sk/ainet/exec/tensor/ops/` (new kernels)
-
----
-
-### TQ-023: Metal / Apple Silicon Backend
-
-| Field | Value |
-|---|---|
-| **Status** | TODO |
-| **PRD section** | Step 2, Functional requirement 5 |
-| **Priority** | Medium |
-| **Dependencies** | TQ-016 |
-
-**Description:**
-Implement Metal compute shaders for TurboQuant kernels targeting Apple Silicon unified memory.
-
-**Acceptance criteria:**
-- [ ] Metal shader for TurboQuant encode/decode
-- [ ] Unified memory path (no CPU-GPU copy for KV cache)
-- [ ] Correctness matches CPU reference
-- [ ] Benchmark on Apple Silicon
-
-**Key files to create:**
-- Metal backend (new shaders)
-
----
-
-### TQ-024: Fused Dequant + Attention Kernels
-
-| Field | Value |
-|---|---|
-| **Status** | TODO |
-| **PRD section** | Step 2, Functional requirement 5 |
-| **Priority** | Low — optimization after correctness |
-| **Dependencies** | TQ-018, TQ-022 or TQ-023 |
-
-**Description:**
-Fuse TurboQuant decompression with attention score computation to avoid materializing decompressed K/V.
-
-**Acceptance criteria:**
-- [ ] Fused kernel avoids intermediate K/V buffer
-- [ ] Correctness matches unfused path
-- [ ] Benchmark showing memory and latency improvement
-- [ ] At least one backend (CPU SIMD or Metal)
-
-**Key files to create:**
-- Backend-specific fused kernel implementations
-
----
-
-### TQ-025: TurboQuant Benchmarks
-
-| Field | Value |
-|---|---|
-| **Status** | DONE |
-| **PRD section** | Step 2, Acceptance criteria |
-| **Priority** | High — validates the whole effort |
-| **Dependencies** | TQ-016 |
-
-**Description:**
-Add JMH benchmarks for TurboQuant KV compression: encode throughput, decode throughput, compression ratio, attention accuracy degradation.
-
-**Acceptance criteria:**
-- [ ] Encode throughput benchmark (tokens/sec)
-- [ ] Decode throughput benchmark (tokens/sec)
-- [ ] Compression ratio measurement for each preset
-- [ ] Accuracy comparison vs uncompressed KV cache
-- [ ] Results documented
-
-**Key files to create:**
-- `skainet-lang/skainet-lang-core/src/jvmMain/kotlin/sk/ainet/lang/tensor/TurboQuantBenchmarks.kt` (new)
-
----
-
-## Dependency Graph
-
-```
-Step 1 remaining:
-  TQ-003 (Quants.kt) — independent
-  TQ-004 (SafeTensors) — independent
-  TQ-001 (KV-cache) — independent
-  TQ-002 (SDPA compressed K/V) — depends on TQ-001
-
-Step 2:
-  TQ-010 (Encoding types) — independent
-  TQ-011 (Rotation) — depends on TQ-010
-  TQ-012 (Scalar quant) — depends on TQ-011
-  TQ-013 (QJL residual) — depends on TQ-012
-  TQ-014 (Bit-packing) — depends on TQ-012
-  TQ-015 (KV append/read) — depends on TQ-001, TQ-014
-  TQ-016 (PolarOnly e2e) — depends on TQ-011, TQ-012, TQ-014, TQ-015
-  TQ-017 (SDPA write) — depends on TQ-002, TQ-016
-  TQ-018 (SDPA read) — depends on TQ-002, TQ-016
-  TQ-019 (K/V policies) — depends on TQ-001, TQ-016
-  TQ-020 (Presets) — depends on TQ-019
-  TQ-021 (DSL) — depends on TQ-020
-  TQ-022 (CPU SIMD) — depends on TQ-016
-  TQ-023 (Metal) — depends on TQ-016
-  TQ-024 (Fused kernels) — depends on TQ-018, TQ-022 or TQ-023
-  TQ-025 (Benchmarks) — depends on TQ-016
-```
-
-## Recommended Implementation Order
-
-1. **TQ-001** + **TQ-003** + **TQ-004** + **TQ-010** (parallel — no dependencies between them)
-2. **TQ-002** + **TQ-011** (after TQ-001 and TQ-010)
-3. **TQ-012** (after TQ-011)
-4. **TQ-013** + **TQ-014** (parallel, after TQ-012)
-5. **TQ-015** (after TQ-001 + TQ-014)
-6. **TQ-016** + **TQ-025** (PolarOnly e2e + benchmarks)
-7. **TQ-017** + **TQ-018** + **TQ-019** (SDPA integration + policies)
-8. **TQ-020** + **TQ-022** + **TQ-023** (presets + backend optimization)
-9. **TQ-021** + **TQ-024** (DSL + fused kernels — last)
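
Reviewer note on the removed tracker: its dependency graph and recommended implementation order can be cross-checked mechanically. The Kotlin sketch below is illustrative only and is not part of the SKaiNET codebase. The names `DepGroup`, `dependencies`, `recommendedOrder`, and `findUnsatisfied` are hypothetical, and the rule that a dependency must sit in a strictly earlier step is an assumption; the tracker itself does not say whether tasks listed in the same step run in parallel or in sequence.

```kotlin
// Hypothetical helper types and data, transcribed from the tracker's
// "Dependency Graph" and "Recommended Implementation Order" sections.
// A DepGroup lists alternatives: it is satisfied when at least one member
// is scheduled in a strictly earlier step (assumed rule, see lead-in).
typealias DepGroup = List<String>

val dependencies: Map<String, List<DepGroup>> = mapOf(
    "TQ-001" to emptyList(),
    "TQ-002" to listOf(listOf("TQ-001")),
    "TQ-003" to emptyList(),
    "TQ-004" to emptyList(),
    "TQ-010" to emptyList(),
    "TQ-011" to listOf(listOf("TQ-010")),
    "TQ-012" to listOf(listOf("TQ-011")),
    "TQ-013" to listOf(listOf("TQ-012")),
    "TQ-014" to listOf(listOf("TQ-012")),
    "TQ-015" to listOf(listOf("TQ-001"), listOf("TQ-014")),
    "TQ-016" to listOf(listOf("TQ-011"), listOf("TQ-012"), listOf("TQ-014"), listOf("TQ-015")),
    "TQ-017" to listOf(listOf("TQ-002"), listOf("TQ-016")),
    "TQ-018" to listOf(listOf("TQ-002"), listOf("TQ-016")),
    "TQ-019" to listOf(listOf("TQ-001"), listOf("TQ-016")),
    "TQ-020" to listOf(listOf("TQ-019")),
    "TQ-021" to listOf(listOf("TQ-020")),
    "TQ-022" to listOf(listOf("TQ-016")),
    "TQ-023" to listOf(listOf("TQ-016")),
    // "TQ-018, TQ-022 or TQ-023": TQ-018 plus either SIMD or Metal.
    "TQ-024" to listOf(listOf("TQ-018"), listOf("TQ-022", "TQ-023")),
    "TQ-025" to listOf(listOf("TQ-016")),
)

val recommendedOrder: List<List<String>> = listOf(
    listOf("TQ-001", "TQ-003", "TQ-004", "TQ-010"),
    listOf("TQ-002", "TQ-011"),
    listOf("TQ-012"),
    listOf("TQ-013", "TQ-014"),
    listOf("TQ-015"),
    listOf("TQ-016", "TQ-025"),
    listOf("TQ-017", "TQ-018", "TQ-019"),
    listOf("TQ-020", "TQ-022", "TQ-023"),
    listOf("TQ-021", "TQ-024"),
)

// Returns one message per dependency group that no strictly earlier step satisfies.
fun findUnsatisfied(
    deps: Map<String, List<DepGroup>>,
    order: List<List<String>>,
): List<String> {
    // Map each task ID to the zero-based index of the step it appears in.
    val stepOf = order.flatMapIndexed { i, step -> step.map { it to i } }.toMap()
    val problems = mutableListOf<String>()
    for ((task, groups) in deps) {
        val taskStep = stepOf[task]
        if (taskStep == null) {
            problems += "$task is not scheduled"
            continue
        }
        for (group in groups) {
            // An unscheduled dependency can never satisfy the group.
            val ok = group.any { dep -> (stepOf[dep] ?: Int.MAX_VALUE) < taskStep }
            if (!ok) problems += "$task needs ${group.joinToString(" or ")} in an earlier step"
        }
    }
    return problems
}

fun main() {
    findUnsatisfied(dependencies, recommendedOrder).forEach(::println)
}
```

Under the strict rule, the only finding is that TQ-025 shares step 6 with its dependency TQ-016. That is harmless if the two are meant to run back to back within the step, but it would matter if items within a step were assumed to be parallelizable.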