Merged
Introduce explicit storage model separating logical dtype from physical encoding, with first-class ownership tracking, placement metadata, and layout-preserving loaders. This prevents the runtime from silently collapsing quantized, borrowed, or file-backed tensors back into heap arrays.

Key additions:
- TensorStorage descriptor with LogicalDType, TensorEncoding, BufferHandle, and Placement contracts
- BufferHandle sealed hierarchy: Owned, Borrowed, Aliased, FileBacked, DeviceResident
- PackedBlockStorage interface unifying Q4_K and Q8_0 block formats
- MappedMemoryChunk + JVM mmap implementation for file-backed weights
- StreamingGgufParametersLoader with Q4_K/Q8_0 quantized type support
- Zero-copy wrapFloatArray/wrapIntArray/wrapByteArray factory methods
- Explicit copyMaterialize() and realizeAlias() materialization APIs
- MemoryPlanner with device fallback policy
- MemoryTracker for allocation observability and copy tracing
- StorageSpec for storage-aware factory routing beyond dtype-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
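The commit above names the descriptor and its sealed ownership hierarchy; a minimal sketch of what those shapes could look like is below. The names (TensorStorage, LogicalDType, TensorEncoding, BufferHandle and its five variants) come from the commit message, but every field and constructor here is an illustrative assumption, not the project's actual API.

```kotlin
// Hypothetical sketch of the storage descriptor; fields are assumptions.
enum class LogicalDType { F32, I32, U8 }

sealed class TensorEncoding {
    object Dense : TensorEncoding()
    object Q4_K : TensorEncoding()
    object Q8_0 : TensorEncoding()
}

sealed class BufferHandle {
    /** Heap memory owned by the runtime; safe to mutate and free. */
    data class Owned(val bytes: ByteArray) : BufferHandle()
    /** Memory owned by someone else (e.g. a loader's decode buffer). */
    data class Borrowed(val bytes: ByteArray) : BufferHandle()
    /** A view into another handle at a byte offset. */
    data class Aliased(val base: BufferHandle, val offset: Long, val length: Long) : BufferHandle()
    /** Bytes resolved lazily from a file offset (mmap on JVM). */
    data class FileBacked(val path: String, val offset: Long, val length: Long) : BufferHandle()
    /** Lives on an accelerator; host access requires an explicit copy. */
    data class DeviceResident(val device: String, val address: Long, val length: Long) : BufferHandle()
}

class TensorStorage(
    val dtype: LogicalDType,      // what the tensor "means"
    val encoding: TensorEncoding, // how the bytes are physically laid out
    val handle: BufferHandle,     // who owns the bytes and where they live
    val shape: LongArray,
)
```

Keeping dtype and encoding as separate axes is what lets a Q8_0-encoded tensor still report a logical F32 type instead of being materialized to a float heap array.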
…ionContext

SafeTensorsParametersLoader now uses wrapFloatArray/wrapIntArray instead of fromFloatArray/fromIntArray for freshly decoded arrays, eliminating a redundant copy. Added wrapFloatArray/wrapIntArray/wrapByteArray convenience methods to ExecutionContext so that borrow semantics are accessible to loaders.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
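The copy-vs-borrow distinction the commit relies on can be sketched as follows. FloatTensor here is a stand-in type invented for the example; only the from/wrap naming mirrors the commit.

```kotlin
// Illustrative-only sketch of fromFloatArray (copy) vs wrapFloatArray (borrow).
class FloatTensor(private val data: FloatArray) {
    operator fun get(i: Int): Float = data[i]

    companion object {
        /** Copying constructor: the tensor owns an independent buffer. */
        fun fromFloatArray(src: FloatArray) = FloatTensor(src.copyOf())

        /** Zero-copy constructor: the tensor borrows the caller's buffer.
         *  Safe when the caller hands over a freshly decoded array it will
         *  never touch again, which is exactly the loader's situation. */
        fun wrapFloatArray(src: FloatArray) = FloatTensor(src)
    }
}
```

The wrap variant is only sound because the loader's decode buffer has no other readers; a borrowed tensor observes any later mutation of the source array.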
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose loadTensorStorage() methods that return TensorStorage descriptors with borrowed byte buffers, instead of only raw ByteArrays. This lets callers work with the storage model (encoding, logical type, placement) directly, without manual conversion.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose loadTensorStorageMapped(), which returns a TensorStorage with a FileBacked BufferHandle pointing at the tensor's absolute file offset. This enables zero-heap-copy weight loading: the OS pages data in on demand via mmap when the runtime resolves the FileBacked handle.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
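On the JVM, the mmap path a FileBacked handle resolves through can be sketched with the standard FileChannel API. The helper name and the little-endian float layout are assumptions for the example; only `FileChannel.map` itself is a real platform API.

```kotlin
// Sketch: map a byte range of a weights file and read it as LE floats.
import java.io.File
import java.nio.ByteOrder
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

fun readMappedFloats(path: String, offset: Long, count: Int): FloatArray {
    FileChannel.open(File(path).toPath(), StandardOpenOption.READ).use { ch ->
        // The OS pages this region in on demand; nothing is copied to the
        // heap until the floats are actually read.
        val buf = ch.map(FileChannel.MapMode.READ_ONLY, offset, count * 4L)
            .order(ByteOrder.LITTLE_ENDIAN)
        return FloatArray(count) { buf.getFloat(it * 4) }
    }
}
```

Because the mapping starts at the tensor's absolute file offset, the reader never has to load the surrounding file into heap memory.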
ExecutionContext now provides memoryPlanner (defaults to CPU-only) and memoryTracker (defaults to null/disabled). Implementations can override these to enable placement resolution and allocation tracking during tensor creation and operation dispatch.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ternary2BitTensorData now implements PackedBlockStorage alongside Q4_K and Q8_0, completing the unification of all packed quantization formats under a single contract. It uses the TernaryPacked encoding and provides dequantizeBlock with scale-factor support.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
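A toy model of the PackedBlockStorage contract applied to a ternary format like the one above: four 2-bit codes per byte, one scale per block. The interface members mirror the commit's description, but the code-to-value mapping (0 → 0, 1 → +1, 2 → −1) and the bit layout are assumptions for illustration, not the project's wire format.

```kotlin
// Toy PackedBlockStorage with a ternary (-1/0/+1) block implementation.
interface PackedBlockStorage {
    val blockSize: Int
    val blockCount: Int
    /** Dequantize one block into [out] starting at [outOffset]. */
    fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int)
}

class TernaryBlocks(
    private val packed: ByteArray,   // four 2-bit codes per byte
    private val scales: FloatArray,  // one scale factor per block
    override val blockSize: Int,
) : PackedBlockStorage {
    override val blockCount: Int get() = scales.size

    override fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int) {
        require(blockIndex in 0 until blockCount) { "invalid block $blockIndex" }
        val scale = scales[blockIndex]
        val baseBit = blockIndex * blockSize * 2
        for (i in 0 until blockSize) {
            val bit = baseBit + i * 2
            val code = (packed[bit / 8].toInt() shr (bit % 8)) and 0b11
            out[outOffset + i] = when (code) {
                1 -> scale   // assumed mapping: 1 -> +1 * scale
                2 -> -scale  // assumed mapping: 2 -> -1 * scale
                else -> 0f
            }
        }
    }
}
```

Whatever the real encoding, the point of the shared contract is that Q4_K, Q8_0, and ternary data all answer the same dequantizeBlock call.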
Heap-backed MappedMemoryChunk implementation for JS, Wasm, and Native targets that lack native mmap support. It eagerly loads data from a RandomAccessSource but satisfies the MappedMemoryChunk contract, so code can be written against one interface across all KMP targets.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorage now exposes copyMaterialize(), copyToHost(), copyToDevice(), and repackTo() as explicit operations. copyMaterialize and copyToHost work for Owned/Borrowed buffers. copyToDevice and repackTo are stubs that throw until GPU/NPU backends and transcoding kernels are available.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CopyMaterializationStrategy and DenseTensorDataFactory's internal createFloatTensorData/createIntTensorData now report copy events to ActiveMemoryTracker.current when a tracker is active. This makes hidden copies visible in debug reports without requiring callers to manually instrument every copy site.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@place(device, memory, requirement) declares where a tensor should be allocated. @Weights marks immutable model weights that should be file-backed (mmap) when possible. The MemoryPlanner reads these annotations to make allocation decisions at runtime.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
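A hedged sketch of what such declarations could look like in Kotlin. The argument names mirror the commit (`device`, `memory`, `requirement`), but the annotation shapes, the capitalized `Place` spelling, the string-typed arguments, and the planner helper are all invented for the example.

```kotlin
// Hypothetical placement annotations plus a toy planner that reads them.
@Target(AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Place(val device: String, val memory: String, val requirement: String)

@Target(AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Weights  // immutable weights: prefer file-backed (mmap) allocation

@Place(device = "cpu", memory = "host", requirement = "preferred")
@Weights
class EmbeddingTable

/** A planner inspects the annotations to pick an allocation strategy. */
fun planAllocation(cls: Class<*>): String {
    val place = cls.getAnnotation(Place::class.java)
    val weights = cls.getAnnotation(Weights::class.java) != null
    return when {
        weights -> "file-backed"
        place != null -> "device=${place.device}"
        else -> "default-heap"
    }
}
```

RUNTIME retention is the key design point: it lets the planner make decisions when tensors are actually allocated rather than at compile time.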
README example now uses StreamingGGUFReader instead of the legacy GGUFReader. The docs guide adds a prominent streaming section with examples and notes that the legacy reader is not recommended for new code.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BufferAccessor provides byte-level read access to any BufferHandle. DefaultBufferResolver handles Owned/Borrowed/Aliased directly and delegates FileBacked to a platform-specific resolver. JvmFileBackedResolver maps FileBacked handles through JvmMappedMemoryChunk, completing the path from loadTensorStorageMapped() → mmap → byte access without heap loading.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests construct a minimal GGUF binary with F32 and Q8_0 tensors and verify the full pipeline: StreamingGGUFReader → loadTensorStorage → file-backed mmap resolution → byte-level access. Also verifies that MemoryTracker reports correct aggregate metrics for mixed models.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorageFactory.toTensorData() converts a TensorStorage back into a TensorData that existing backends can consume. Handles Dense FP32/INT32 (bytes → float/int array), Q4_K (→ Q4_KBlockTensorData), and Q8_0 (→ Q8_0BlockTensorData). Round-trip tests verify data integrity through TensorData → Storage → TensorData conversions.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
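The Dense FP32 leg of such a conversion reduces to reinterpreting raw little-endian bytes as floats and back. The helper names below are illustrative (only `ByteBuffer` is a real API), and little-endian byte order is an assumption, though it matches GGUF and safetensors conventions.

```kotlin
// Sketch: bytes <-> float round trip for the Dense FP32 storage case.
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun bytesToFloats(bytes: ByteArray): FloatArray {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    return FloatArray(bytes.size / 4) { buf.getFloat(it * 4) }
}

fun floatsToBytes(floats: FloatArray): ByteArray {
    val buf = ByteBuffer.allocate(floats.size * 4).order(ByteOrder.LITTLE_ENDIAN)
    floats.forEach(buf::putFloat)
    return buf.array()
}
```

A round trip through both helpers must be lossless, which is exactly what the commit's TensorData → Storage → TensorData tests assert for the dense case.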
TransferOpsTest (11 tests): covers copyMaterialize (owned/borrowed/file-backed/device-resident), copyToHost (identity and copy paths), copyToDevice (CPU delegation, GPU throws), and repackTo (same/different).

Q4KDequantizationTest (6 tests): covers dequantizeBlock with uniform codes, zero codes, nibble extraction, multi-block toFloatArray, out-of-bounds, and physical byte verification.

TernaryDequantizationTest (6 tests): covers dequantizeBlock for all -1s, all 0s, and all +1s with scale factors, mixed values matching toFloatArray, output offset writing, and invalid block index.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tiguous storage, loader

ActiveMemoryTrackerTest (5 tests): verifies that the global tracker hook captures copies from DenseTensorDataFactory, that a null tracker is safe, and that clear resets state.

FallbackMappedMemoryChunkTest (10 tests): covers readByte, readBytes, slice, nested slice offset composition, dataOffset, metadata, and close.

NonContiguousStorageTest (6 tests): verifies strides preservation, the isContiguous flag, that equals/hashCode include strides, and defaults.

StreamingGgufParametersLoaderTest (3 tests): end-to-end F32 and Q8_0 loading through the parameters loader, with progress callback verification.

Also fixes a StackOverflowError in TensorStorage.equals(), where the private contentEquals extension recursively called itself instead of calling LongArray.contentEquals.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Tekken tokenizer for Mistral models

A Tiktoken-based BPE tokenizer that parses Mistral's tekken.json format: base64-decoded byte vocab, implicit merge ordering from rank, a separate special-token ID space, and a Unicode-aware pre-tokenization regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
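The base64 vocab decoding step can be sketched as follows. The `VocabEntry` shape and field names are assumptions about how the tekken.json entries might be modeled after JSON parsing; only the rank-keyed, base64-decoded byte vocabulary comes from the commit description.

```kotlin
// Sketch: decode base64 vocab entries into a rank -> raw-bytes table.
// Merge priority is implicit: lower rank means the merge applies earlier.
import java.util.Base64

data class VocabEntry(val rank: Int, val tokenBase64: String)

fun decodeVocab(entries: List<VocabEntry>): Map<Int, ByteArray> =
    entries.associate { it.rank to Base64.getDecoder().decode(it.tokenBase64) }
```

Decoding to raw bytes rather than strings matters: BPE over bytes never hits an un-encodable character, and the rank ordering doubles as the merge table, so no separate merges list is needed.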
…ors zero-copy

Finish all remaining Step 1 PRD items for TurboQuant readiness:
- Add KvCacheStore interface with append-by-token, range reads, eviction, asymmetric K/V encoding policies, and a DefaultKvCacheStore implementation
- Add CompressedKvAttention bridge between KvCacheStore and SDPA, with full-tile dequant and raw-storage extension points for fused backends
- Complete the Quants.kt port: byteShapeToQuantShape, quantByteSize, isBlockQuantized, validateQuantizedBytes, and related utilities
- Add StorageAwareSafeTensorsLoader producing TensorStorage with FileBacked (zero-copy) or Borrowed handles
- Add TURBOQUANT_ISSUES.md tracker with 21 traceable issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
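The append-by-token / range-read contract from the first bullet can be modeled minimally as below. The method signatures are assumptions invented for the example; the real store additionally layers eviction and per-tensor encoding policies on top.

```kotlin
// Minimal in-memory model of a KV cache store's append/range-read contract.
interface KvCacheStore {
    val length: Int
    /** Append one token's key and value vectors. */
    fun append(key: FloatArray, value: FloatArray)
    /** Read token positions in [from, to). */
    fun readKeys(from: Int, to: Int): List<FloatArray>
    fun readValues(from: Int, to: Int): List<FloatArray>
}

class InMemoryKvCacheStore : KvCacheStore {
    private val keys = mutableListOf<FloatArray>()
    private val values = mutableListOf<FloatArray>()

    override val length get() = keys.size

    override fun append(key: FloatArray, value: FloatArray) {
        keys += key
        values += value
    }

    override fun readKeys(from: Int, to: Int) = keys.subList(from, to).toList()
    override fun readValues(from: Int, to: Int) = values.subList(from, to).toList()
}
```

Separating key and value reads is what makes asymmetric K/V encoding policies possible: each side can sit behind a different compression codec while callers keep the same range-read interface.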
Add complete TurboQuant implementation for KV-cache compression.

Core kernels (common Kotlin):
- RandomRotation: Walsh-Hadamard + random sign flips, O(d log d)
- ScalarQuantizer: per-group symmetric quantization, 2/3/4/8-bit
- BitPacker: compact bit-packing/unpacking for all bit widths
- QjlResidual: Quantized Johnson-Lindenstrauss residual stage

End-to-end codec:
- TurboQuantCodec: full encode/decode pipeline (PolarOnly + PolarPlusQjl)
- TurboQuantKvCacheStore: compressed KV cache with per-head TurboQuant blocks
- Asymmetric K/V policies (different bit budgets for keys vs. values)

Encoding types:
- TurboQuantPolar and TurboQuantPolarQjl added to the sealed TensorEncoding hierarchy

Presets:
- safe-lowbit (Q8_0-K + TurboQuant-4-V)
- balanced (TurboQuant-4 / TurboQuant-4)
- experimental-max (TurboQuant-3 / TurboQuant-3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
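The RandomRotation stage named above can be sketched as an in-place Walsh-Hadamard butterfly with random sign flips, which is O(d log d) as the commit states. The function name, the 1/√d normalization, and the sign-vector parameter are assumptions for the example; with normalization and no sign flips, applying the transform twice recovers the input, since H·H = d·I.

```kotlin
// Sketch: random sign flips + in-place Walsh-Hadamard transform, O(d log d).
import kotlin.math.sqrt

fun randomRotation(x: FloatArray, signs: BooleanArray) {
    val d = x.size
    require(d > 0 && d and (d - 1) == 0) { "dimension must be a power of two" }
    // Random sign flips decorrelate coordinates before the transform.
    for (i in 0 until d) if (signs[i]) x[i] = -x[i]
    // Walsh-Hadamard butterfly: log2(d) passes over the array.
    var h = 1
    while (h < d) {
        var i = 0
        while (i < d) {
            for (j in i until i + h) {
                val a = x[j]
                val b = x[j + h]
                x[j] = a + b
                x[j + h] = a - b
            }
            i += h * 2
        }
        h *= 2
    }
    // 1/sqrt(d) makes the rotation orthonormal (norm-preserving).
    val norm = 1f / sqrt(d.toFloat())
    for (i in 0 until d) x[i] *= norm
}
```

Norm preservation is why this helps quantization: the rotation spreads outlier coordinates across the whole vector, flattening the dynamic range the downstream scalar quantizer has to cover.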
- @kvcache and @KvCacheBypass annotations for declarative KV-cache compression configuration on attention layers
- JvmTurboQuantKernels: SIMD-accelerated abs-max, quantize, dequantize, and Walsh-Hadamard butterfly using the Java Vector API
- TurboQuantBenchmarks: JMH benchmarks for encode/decode throughput, bit-packing, random rotation, and KV-cache append/read performance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed task document covering:
- Metal compute shaders for TurboQuant encode/decode
- Fused dequant+SDPA kernel design (avoids materializing decompressed K/V)
- Unified-memory KV cache (zero CPU↔GPU copies on Apple Silicon)
- Kotlin/Native cinterop setup for Metal.framework
- 5-phase implementation plan with 20 subtasks
- Shader signatures and parameter structs
- Performance targets and acceptance criteria

Refs #452
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make TurboQuant adoption a one-liner for skainet-transformers:
- KvCacheStore.turboQuant("balanced", ...) factory method
- KvCacheStore.dense() and .fromPreset() convenience factories
- TurboQuantPresets.forModel() lookup by preset name + model dims
- KvCacheAnnotationResolver: resolve @kvcache annotations to stores
- TurboQuantUsage: documented integration guide with compilable examples
showing cache creation, attention layer wiring, and generation loop
Any GGUF model (LLaMA, Mistral, Gemma, Qwen) can use TurboQuant
immediately — it compresses the KV cache at runtime, not model weights.
Refs #452
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
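The preset-by-name lookup behind the one-liner adoption story can be illustrated with a toy registry. The three preset names and their K/V bit pairings come from the commits above; the `BitBudget` type, the registry object, and the numeric modeling of Q8_0 keys as 8 bits are invented for the example.

```kotlin
// Toy preset registry mapping names to asymmetric K/V bit budgets.
data class BitBudget(val keyBits: Int, val valueBits: Int)

object ToyTurboQuantPresets {
    private val presets = mapOf(
        // safe-lowbit keeps keys at Q8_0 precision (modeled here as 8 bits)
        "safe-lowbit" to BitBudget(keyBits = 8, valueBits = 4),
        "balanced" to BitBudget(keyBits = 4, valueBits = 4),
        "experimental-max" to BitBudget(keyBits = 3, valueBits = 3),
    )

    fun forName(name: String): BitBudget =
        presets[name] ?: error("unknown preset: $name")
}
```

Keys typically tolerate less quantization error than values (they feed the softmax logits), which is why the conservative preset spends its extra bits on the K side.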