43 changes: 43 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,48 @@
# Changelog

## [0.18.0] - 2026-04-08

### Added
- **TurboQuant KV-Cache Compression**: Runtime KV-cache compression for LLM inference using rotation-based quantization (Google Research TurboQuant paper). Supports PolarOnly and PolarPlusQjl variants with 2/3/4/8-bit encoding.
- `TurboQuantCodec`: End-to-end encode/decode pipeline (random rotation, scalar quantization, QJL residual, bit-packing).
- `TurboQuantKvCacheStore`: Compressed KV cache with per-head TurboQuant blocks and asymmetric K/V policies.
- `TurboQuantPresets`: Named presets — `safe-lowbit` (Q8_0-K + TQ4-V), `balanced` (TQ4/TQ4), `experimental-max` (TQ3/TQ3).
- `KvCacheStore.turboQuant("balanced", ...)`: One-line factory for skainet-transformers integration.
- `CompressedKvAttention`: SDPA bridge with FULL_TILE and RAW_STORAGE dequant strategies.
- `@KvCache` and `@KvCacheBypass` DSL annotations for declarative KV cache configuration.
- `KvCacheAnnotationResolver`: Resolve annotations to cache instances.
- `TurboQuantUsage`: Integration guide with compilable examples.
- **Memory Architecture Hardening**: First-class storage and placement abstractions for zero-copy, quantization-preserving tensor management.
- `TensorStorage`: Runtime descriptor replacing ad-hoc array passing (logical type, physical encoding, buffer ownership, placement).
- `TensorEncoding`: Sealed hierarchy — `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque`.
- `BufferHandle`: Five ownership modes — `Owned`, `Borrowed`, `Aliased`, `FileBacked`, `DeviceResident`.
- `Placement`: Device/memory-domain intent with fallback policies (`CPU_HEAP`, `MMAP_WEIGHTS`, `GPU_PREFERRED`).
- `LogicalDType`: Semantic numeric types separate from physical encoding.
- `PackedBlockStorage`: Unified contract for all packed quantized formats.
- `MemoryPlanner`, `MemoryTracker`, `ActiveMemoryTracker`: Placement resolution and copy diagnostics.
- **KV-Cache Subsystem**: `KvCacheStore` interface with append-by-token writes, layer/head addressing, eviction, and `DefaultKvCacheStore` (dense FP32 baseline).
- **Quantization-Preserving Loaders**: `StreamingGGUFReader` and `StreamingSafeTensorsReader` produce `TensorStorage` with `FileBacked` or `Borrowed` handles (no forced densification).
- `StorageAwareSafeTensorsLoader`: Zero-copy file-backed SafeTensors loading.
- Completed `Quants.kt` port: `byteShapeToQuantShape`, `quantByteSize`, `isBlockQuantized`, `validateQuantizedBytes`.
- **Tekken Tokenizer**: Mistral Tekken (tiktoken-based BPE) tokenizer support.
- **CPU SIMD TurboQuant Kernels**: `JvmTurboQuantKernels` with Java Vector API acceleration for abs-max, quantize, dequantize, and Walsh-Hadamard butterfly.
- **JMH Benchmarks**: TurboQuant encode/decode throughput, bit-packing, rotation, and KV cache append/read benchmarks (`TurboQuantBenchmarks.kt`).
- **Storage Benchmarks**: Dequantization throughput (Q4_K, Q8_0, Ternary), buffer accessor, and TensorData bridge benchmarks (`StorageBenchmarks.kt`).
- **New Ops**: `sin`, `cos`, `tanh`, `convTranspose1d`.
- **New Layers**: `TransposedConv1d`, `Snake` activation, `LayerScale`.
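Putting the additions above together, the one-line factory might be used as in the sketch below. The `KvCacheStore.turboQuant` name and the preset string come from this changelog; the trailing model-geometry parameters (elided as `...` above) are hypothetical placeholders, so consult `TurboQuantUsage` for the actual signature.

```kotlin
// Sketch only: factory name and preset come from the changelog;
// numLayers/numHeads/headDim are hypothetical stand-in parameters.
val kvCache: KvCacheStore = KvCacheStore.turboQuant(
    "balanced",      // TQ4 keys / TQ4 values per TurboQuantPresets
    numLayers = 32,  // hypothetical model-geometry parameters
    numHeads = 32,
    headDim = 128,
)
```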

### Changed
- **Streaming GGUF as Default**: `StreamingGGUFReader` is now the recommended GGUF loading path (memory-efficient, supports quantized types).
- **DSL Annotations**: Extended `PlacementAnnotations.kt` with `@KvCache(preset=...)` and `@KvCacheBypass` for TurboQuant configuration.
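A minimal sketch of the declarative configuration, for orientation: only the `@KvCache(preset = ...)` and `@KvCacheBypass` annotation names, the preset strings, and `KvCacheAnnotationResolver` are taken from this changelog; where the annotations attach and how the resolver is invoked are assumptions, not confirmed API.

```kotlin
// Hypothetical shape of the annotation-driven KV-cache DSL.
@KvCache(preset = "safe-lowbit")  // Q8_0 keys + TQ4 values
class ChatDecoder { /* transformer layers */ }

@KvCacheBypass                    // opt out: dense FP32 baseline cache
class DraftModel { /* small speculative-decoding model */ }

// KvCacheAnnotationResolver resolves annotations to cache instances
// (invocation shown here is illustrative only).
val cache = KvCacheAnnotationResolver.resolve(ChatDecoder::class)
```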

### Fixed
- **Int Overflow for Large Tensors**: Widened `StreamingTensorInfo.nBytes` and `StreamingSafeTensorInfo.sizeInBytes` from `Int` to `Long`, preventing silent overflow for tensors > 2 GB. Fixes loading of Gemma 4 E4B and future large models. (#452)
- **Legacy GGUFReader Overflow Guard**: Added explicit overflow check with actionable error message for tensors > 2 GB in the legacy eager loader.
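The class of bug fixed above is easy to reproduce in plain Kotlin: multiplying an `Int` element count by the element size wraps silently once the product exceeds `Int.MAX_VALUE`, while widening one operand to `Long` yields the true byte size.

```kotlin
fun main() {
    // 2^30 float32 elements, i.e. a 4 GiB tensor (plausible for large models)
    val elementCount = 1_073_741_824

    val wrapped = elementCount * 4   // Int arithmetic: 2^32 wraps to 0, silently
    val widened = elementCount * 4L  // Long arithmetic: correct byte size

    println(wrapped)  // 0
    println(widened)  // 4294967296
}
```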

### Dependencies
- io.github.kotest:kotest: 6.1.9 → 6.1.11.
- com.squareup:kotlinpoet: 2.2.0 → 2.3.0.

## [0.17.0] - 2026-03-25

### Added
19 changes: 11 additions & 8 deletions README.md
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):

```kotlin
dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.17.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.17.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.18.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.18.0")
}
```

@@ -107,6 +107,7 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,

- **ComputeGraphExecutor**: Optimized engine with fusion passes and trace-to-DAG bridging.
- **SDPA & Gather**: High-performance Scaled Dot-Product Attention and indexing operations.
+- **TurboQuant**: Runtime KV-cache compression (~8x at 4-bit) for long-context LLM inference. Presets: `safe-lowbit`, `balanced`, `experimental-max`. See `TurboQuantUsage` for the integration guide.

### Agentic AI Infrastructure

@@ -148,12 +149,14 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,

---

-## What's New in 0.17.0
+## What's New in 0.18.0

-- **Core Engine Focus** — Refactored the repository to focus on the core `ComputeGraph` framework, compiler, and backends. Extracted high-level LLM and transformer implementations to standalone repositories.
-- **LLM-as-DSL** — New high-level DSL for defining and running LLM architectures within the core framework.
-- **Optimized ComputeGraphExecutor** — New executor with support for fusion passes and trace-to-DAG bridging for faster inference.
-- **SDPA & Gather** — Implemented Scaled Dot-Product Attention and `gather`/`indexSelect` ops for improved performance.
+- **TurboQuant KV-Cache Compression** — Runtime KV-cache compression for LLM inference: ~8x memory reduction with 4-bit, works with any model (LLaMA, Mistral, Gemma, Qwen). One-line integration via `KvCacheStore.turboQuant("balanced", ...)`.
+- **Memory Architecture Hardening** — First-class storage/placement abstractions (`TensorStorage`, `TensorEncoding`, `BufferHandle`, `Placement`), zero-copy ownership semantics, quantization-preserving loaders.
+- **KV-Cache Subsystem** — Dedicated `KvCacheStore` with append-by-token writes, layer/head addressing, asymmetric K/V encoding policies, and `CompressedKvAttention` SDPA bridge.
+- **Mistral Tokenizer** — Tekken (tiktoken-based BPE) tokenizer support for Mistral models.
+- **Large Tensor Fix** — Fixed Int overflow in GGUF and SafeTensors loaders for tensors > 2 GB (Gemma 4 E4B support).
+- **CPU SIMD Kernels** — Java Vector API acceleration for TurboQuant encode/decode/rotation operations.

See [CHANGELOG.md](CHANGELOG.md) for the full release history.

@@ -162,7 +165,7 @@ See [CHANGELOG.md](CHANGELOG.md) for the full release history.
## Roadmap

- **Q1 2026**: Comprehensive documentation ✅
-- **Q2 2026**: Reference-based validation of computation correctness
+- **Q2 2026**: TurboQuant KV-cache compression ✅ (shipped in 0.18.0)
- **Q3 2026**: Agentic AI enhancements ✅ (tool calling shipped in 0.13.0; ongoing)
- **Q4 2026**: Federated learning support for multi-device training

3 changes: 1 addition & 2 deletions gradle.properties
@@ -1,6 +1,5 @@
GROUP=sk.ainet.core
-VERSION_NAME=0.17.0
-
+VERSION_NAME=0.18.0
POM_DESCRIPTION=SKaiNET

POM_URL=https://github.com/SKaiNET-developers/skainet/
@@ -73,7 +73,11 @@ public class StreamingSafeTensorsReader private constructor(
     * @return Raw bytes for the tensor
     */
    public fun loadTensorData(tensor: StreamingSafeTensorInfo): ByteArray {
-        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes.toInt())
    }

/**
@@ -85,7 +89,11 @@
     * @return Number of bytes read
     */
    public fun loadTensorData(tensor: StreamingSafeTensorInfo, buffer: ByteArray, offset: Int = 0): Int {
-        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes.toInt())
    }

// ========== TensorStorage Loading ==========
@@ -130,7 +138,7 @@ public class StreamingSafeTensorsReader private constructor(
        buffer = BufferHandle.FileBacked(
            path = filePath,
            fileOffset = tensor.absoluteDataOffset,
-            sizeInBytes = tensor.sizeInBytes.toLong()
+            sizeInBytes = tensor.sizeInBytes
        ),
        placement = Placement.MMAP_WEIGHTS
    )
@@ -396,7 +404,7 @@ public class StreamingSafeTensorsReader private constructor(

        val elementCount = if (shape.isEmpty()) 1L else shape.fold(1L) { acc, d -> acc * d }
        val bytesPerElement = SafeTensorsDataTypes.sizeOf(dtype) ?: 1
-        val sizeInBytes = (dataOffsets.second - dataOffsets.first).toInt()
+        val sizeInBytes = dataOffsets.second - dataOffsets.first
        val mappedDataType = SafeTensorsDataTypeMapper.toDataType(dtype)

        _tensors.add(
@@ -529,7 +537,7 @@ public data class StreamingSafeTensorInfo(
    /** End offset relative to data section */
    val dataOffsetEnd: Long,
    /** Size in bytes */
-    val sizeInBytes: Int,
+    val sizeInBytes: Long,
    /** Absolute byte offset in file */
    val absoluteDataOffset: Long
) {
@@ -342,7 +342,7 @@ public data class ShardedTensorInfo(
    val elementCount: Long get() = base.elementCount
    val dataOffsetStart: Long get() = base.dataOffsetStart
    val dataOffsetEnd: Long get() = base.dataOffsetEnd
-    val sizeInBytes: Int get() = base.sizeInBytes
+    val sizeInBytes: Long get() = base.sizeInBytes
    val absoluteDataOffset: Long get() = base.absoluteDataOffset
    val isUnknownType: Boolean get() = base.isUnknownType
@@ -271,7 +271,7 @@ class SafeTensorsWriterCommonTest {
        assertEquals(1, reader.tensors.size)
        val tensor = reader.tensors[0]
        assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)

        val readData = bytesToFloatArray(reader.loadTensorData(tensor))
        assertEquals(size, readData.size)
@@ -243,7 +243,7 @@ class StreamingSafeTensorsReaderTest {
        StreamingSafeTensorsReader.open(source).use { reader ->
            val tensor = reader.tensors[0]
            val tensorData = reader.loadTensorData(tensor)
-            assertEquals(tensor.sizeInBytes, tensorData.size)
+            assertEquals(tensor.sizeInBytes, tensorData.size.toLong())
        }
    }

@@ -302,7 +302,7 @@ class StreamingSafeTensorsReaderTest {
        StreamingSafeTensorsReader.open(source).use { reader ->
            for (tensor in reader.tensors) {
                val tensorData = reader.loadTensorData(tensor)
-                assertEquals(tensor.sizeInBytes, tensorData.size,
+                assertEquals(tensor.sizeInBytes, tensorData.size.toLong(),
                    "Size mismatch for tensor ${tensor.name}")
            }
        }
@@ -186,7 +186,7 @@ class SafeTensorsWriterTest {
        assertEquals(1, reader.tensors.size)
        val tensor = reader.tensors[0]
        assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)

        val readData = bytesToFloatArray(reader.loadTensorData(tensor))
        assertEquals(size, readData.size)
@@ -193,7 +193,7 @@ class StreamingSafeTensorsReaderJvmTest {
        var tensorIndex = 0
        for (tensor in reader.tensors) {
            val data = reader.loadTensorData(tensor)
-            assertEquals(tensor.sizeInBytes, data.size)
+            assertEquals(tensor.sizeInBytes, data.size.toLong())

            // Verify the pattern we wrote
            for (i in data.indices) {