diff --git a/CHANGELOG.md b/CHANGELOG.md
index ecf65b6e..9820e62f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,48 @@
 # Changelog
 
+## [0.18.0] - 2026-04-08
+
+### Added
+- **TurboQuant KV-Cache Compression**: Runtime KV-cache compression for LLM inference using rotation-based quantization (Google Research TurboQuant paper). Supports PolarOnly and PolarPlusQjl variants with 2/3/4/8-bit encoding.
+  - `TurboQuantCodec`: End-to-end encode/decode pipeline (random rotation, scalar quantization, QJL residual, bit-packing).
+  - `TurboQuantKvCacheStore`: Compressed KV cache with per-head TurboQuant blocks and asymmetric K/V policies.
+  - `TurboQuantPresets`: Named presets — `safe-lowbit` (Q8_0-K + TQ4-V), `balanced` (TQ4/TQ4), `experimental-max` (TQ3/TQ3).
+  - `KvCacheStore.turboQuant("balanced", ...)`: One-line factory for skainet-transformers integration.
+  - `CompressedKvAttention`: SDPA bridge with FULL_TILE and RAW_STORAGE dequant strategies.
+  - `@KvCache` and `@KvCacheBypass` DSL annotations for declarative KV cache configuration.
+  - `KvCacheAnnotationResolver`: Resolve annotations to cache instances.
+  - `TurboQuantUsage`: Documented integration guide with compilable examples.
+- **Memory Architecture Hardening**: First-class storage and placement abstractions for zero-copy, quantization-preserving tensor management.
+  - `TensorStorage`: Runtime descriptor replacing ad-hoc array passing (logical type, physical encoding, buffer ownership, placement).
+  - `TensorEncoding`: Sealed hierarchy — `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque`.
+  - `BufferHandle`: Five ownership modes — `Owned`, `Borrowed`, `Aliased`, `FileBacked`, `DeviceResident`.
+  - `Placement`: Device/memory-domain intent with fallback policies (`CPU_HEAP`, `MMAP_WEIGHTS`, `GPU_PREFERRED`).
+  - `LogicalDType`: Semantic numeric types separate from physical encoding.
+  - `PackedBlockStorage`: Unified contract for all packed quantized formats.
+  - `MemoryPlanner`, `MemoryTracker`, `ActiveMemoryTracker`: Placement resolution and copy diagnostics.
+- **KV-Cache Subsystem**: `KvCacheStore` interface with append-by-token writes, layer/head addressing, eviction, and `DefaultKvCacheStore` (dense FP32 baseline).
+- **Quantization-Preserving Loaders**: `StreamingGGUFReader` and `StreamingSafeTensorsReader` produce `TensorStorage` with `FileBacked` or `Borrowed` handles (no forced densification).
+  - `StorageAwareSafeTensorsLoader`: Zero-copy file-backed SafeTensors loading.
+  - Completed `Quants.kt` port: `byteShapeToQuantShape`, `quantByteSize`, `isBlockQuantized`, `validateQuantizedBytes`.
+- **Tekken Tokenizer**: Mistral Tekken (tiktoken-based BPE) tokenizer support.
+- **CPU SIMD TurboQuant Kernels**: `JvmTurboQuantKernels` with Java Vector API acceleration for abs-max, quantize, dequantize, and Walsh-Hadamard butterfly.
+- **JMH Benchmarks**: TurboQuant encode/decode throughput, bit-packing, rotation, and KV cache append/read benchmarks (`TurboQuantBenchmarks.kt`).
+- **Storage Benchmarks**: Dequantization throughput (Q4_K, Q8_0, Ternary), buffer accessor, and TensorData bridge benchmarks (`StorageBenchmarks.kt`).
+- **New Ops**: `sin`, `cos`, `tanh`, `convTranspose1d`.
+- **New Layers**: `TransposedConv1d`, `Snake` activation, `LayerScale`.
+
+### Changed
+- **Streaming GGUF as Default**: `StreamingGGUFReader` is now the recommended GGUF loading path (memory-efficient, supports quantized types).
+- **DSL Annotations**: Extended `PlacementAnnotations.kt` with `@KvCache(preset=...)` and `@KvCacheBypass` for TurboQuant configuration.
+
+### Fixed
+- **Int Overflow for Large Tensors**: Fixed `StreamingTensorInfo.nBytes` and `StreamingSafeTensorInfo.sizeInBytes` from `Int` to `Long`, preventing silent overflow for tensors > 2 GB. Fixes loading of Gemma 4 E4B and future large models. (#452)
+- **Legacy GGUFReader Overflow Guard**: Added explicit overflow check with actionable error message for tensors > 2 GB in the legacy eager loader.
+
+### Dependencies
+- io.github.kotest:kotest: 6.1.9 → 6.1.11.
+- com.squareup:kotlinpoet: 2.2.0 → 2.3.0.
+
 ## [0.17.0] - 2026-03-25
 
 ### Added
diff --git a/README.md b/README.md
index f57a0e16..377c5f95 100644
--- a/README.md
+++ b/README.md
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):
 
 ```kotlin
 dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.17.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.17.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.18.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.18.0")
 }
 ```
 
@@ -107,6 +107,7 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,
 
 - **ComputeGraphExecutor**: Optimized engine with fusion passes and trace-to-DAG bridging.
 - **SDPA & Gather**: High-performance Scaled Dot-Product Attention and indexing operations.
+- **TurboQuant**: Runtime KV-cache compression (~8x at 4-bit) for long-context LLM inference. Presets: `safe-lowbit`, `balanced`, `experimental-max`. See `TurboQuantUsage` for integration guide.
 
 ### Agentic AI Infrastructure
 
@@ -148,12 +149,14 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,
 
 ---
 
-## What's New in 0.17.0
+## What's New in 0.18.0
 
-- **Core Engine Focus** — Refactored the repository to focus on the core `ComputeGraph` framework, compiler, and backends. Extracted high-level LLM and transformer implementations to standalone repositories.
-- **LLM-as-DSL** — New high-level DSL for defining and running LLM architectures within the core framework.
-- **Optimized ComputeGraphExecutor** — New executor with support for fusion passes and trace-to-DAG bridging for faster inference.
-- **SDPA & Gather** — Implemented Scaled Dot-Product Attention and `gather`/`indexSelect` ops for improved performance.
+- **TurboQuant KV-Cache Compression** — Runtime KV-cache compression for LLM inference: ~8x memory reduction with 4-bit, works with any model (LLaMA, Mistral, Gemma, Qwen). One-line integration via `KvCacheStore.turboQuant("balanced", ...)`.
+- **Memory Architecture Hardening** — First-class storage/placement abstractions (`TensorStorage`, `TensorEncoding`, `BufferHandle`, `Placement`), zero-copy ownership semantics, quantization-preserving loaders.
+- **KV-Cache Subsystem** — Dedicated `KvCacheStore` with append-by-token writes, layer/head addressing, asymmetric K/V encoding policies, and `CompressedKvAttention` SDPA bridge.
+- **Mistral Tokenizer** — Tekken (tiktoken-based BPE) tokenizer support for Mistral models.
+- **Large Tensor Fix** — Fixed Int overflow in GGUF and SafeTensors loaders for tensors > 2 GB (Gemma 4 E4B support).
+- **CPU SIMD Kernels** — Java Vector API acceleration for TurboQuant encode/decode/rotation operations.
 
 See [CHANGELOG.md](CHANGELOG.md) for the full release history.
 
@@ -162,7 +165,7 @@ See [CHANGELOG.md](CHANGELOG.md) for the full release history.
 ## Roadmap
 
 - **Q1 2026**: Comprehensive documentation ✅
-- **Q2 2026**: Reference-based validation of computation correctness
+- **Q2 2026**: TurboQuant KV-cache compression ✅ (shipped in 0.18.0)
 - **Q3 2026**: Agentic AI enhancements ✅ (tool calling shipped in 0.13.0; ongoing)
 - **Q4 2026**: Federated learning support for multi-device training
 
diff --git a/gradle.properties b/gradle.properties
index 08a952b0..66cf78d4 100644
--- a/gradle.properties
+++ b/gradle.properties
@@ -1,6 +1,5 @@
 GROUP=sk.ainet.core
-VERSION_NAME=0.17.0
-
+VERSION_NAME=0.18.0
 POM_DESCRIPTION=SKaiNET
 POM_URL=https://github.com/SKaiNET-developers/skainet/
 
diff --git a/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReader.kt b/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReader.kt
index 8cfc7b24..99246a0e 100644
--- a/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReader.kt
+++ b/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReader.kt
@@ -73,7 +73,11 @@ public class StreamingSafeTensorsReader private constructor(
      * @return Raw bytes for the tensor
      */
     public fun loadTensorData(tensor: StreamingSafeTensorInfo): ByteArray {
-        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes.toInt())
     }
 
     /**
@@ -85,7 +89,11 @@ public class StreamingSafeTensorsReader private constructor(
      * @return Number of bytes read
      */
     public fun loadTensorData(tensor: StreamingSafeTensorInfo, buffer: ByteArray, offset: Int = 0): Int {
-        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes.toInt())
     }
 
     // ========== TensorStorage Loading ==========
@@ -130,7 +138,7 @@ public class StreamingSafeTensorsReader private constructor(
             buffer = BufferHandle.FileBacked(
                 path = filePath,
                 fileOffset = tensor.absoluteDataOffset,
-                sizeInBytes = tensor.sizeInBytes.toLong()
+                sizeInBytes = tensor.sizeInBytes
             ),
             placement = Placement.MMAP_WEIGHTS
         )
@@ -396,7 +404,7 @@ public class StreamingSafeTensorsReader private constructor(
 
             val elementCount = if (shape.isEmpty()) 1L else shape.fold(1L) { acc, d -> acc * d }
             val bytesPerElement = SafeTensorsDataTypes.sizeOf(dtype) ?: 1
-            val sizeInBytes = (dataOffsets.second - dataOffsets.first).toInt()
+            val sizeInBytes = dataOffsets.second - dataOffsets.first
             val mappedDataType = SafeTensorsDataTypeMapper.toDataType(dtype)
 
             _tensors.add(
@@ -529,7 +537,7 @@ public data class StreamingSafeTensorInfo(
     /** End offset relative to data section */
     val dataOffsetEnd: Long,
     /** Size in bytes */
-    val sizeInBytes: Int,
+    val sizeInBytes: Long,
     /** Absolute byte offset in file */
     val absoluteDataOffset: Long
 ) {
diff --git a/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingShardedSafeTensorsReader.kt b/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingShardedSafeTensorsReader.kt
index 3ee222e4..abd1b53d 100644
--- a/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingShardedSafeTensorsReader.kt
+++ b/skainet-io/skainet-io-safetensors/src/commonMain/kotlin/sk/ainet/io/safetensors/StreamingShardedSafeTensorsReader.kt
@@ -342,7 +342,7 @@ public data class ShardedTensorInfo(
     val elementCount: Long get() = base.elementCount
     val dataOffsetStart: Long get() = base.dataOffsetStart
     val dataOffsetEnd: Long get() = base.dataOffsetEnd
-    val sizeInBytes: Int get() = base.sizeInBytes
+    val sizeInBytes: Long get() = base.sizeInBytes
     val absoluteDataOffset: Long get() = base.absoluteDataOffset
     val isUnknownType: Boolean get() = base.isUnknownType
 
diff --git a/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterCommonTest.kt b/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterCommonTest.kt
index a6e03ae0..b898b979 100644
--- a/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterCommonTest.kt
+++ b/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterCommonTest.kt
@@ -271,7 +271,7 @@ class SafeTensorsWriterCommonTest {
         assertEquals(1, reader.tensors.size)
         val tensor = reader.tensors[0]
         assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)
 
         val readData = bytesToFloatArray(reader.loadTensorData(tensor))
         assertEquals(size, readData.size)
diff --git a/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderTest.kt b/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderTest.kt
index 4eaff11b..809f9943 100644
--- a/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderTest.kt
+++ b/skainet-io/skainet-io-safetensors/src/commonTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderTest.kt
@@ -243,7 +243,7 @@ class StreamingSafeTensorsReaderTest {
         StreamingSafeTensorsReader.open(source).use { reader ->
             val tensor = reader.tensors[0]
             val tensorData = reader.loadTensorData(tensor)
-            assertEquals(tensor.sizeInBytes, tensorData.size)
+            assertEquals(tensor.sizeInBytes, tensorData.size.toLong())
         }
     }
 
@@ -302,7 +302,7 @@ class StreamingSafeTensorsReaderTest {
         StreamingSafeTensorsReader.open(source).use { reader ->
             for (tensor in reader.tensors) {
                 val tensorData = reader.loadTensorData(tensor)
-                assertEquals(tensor.sizeInBytes, tensorData.size,
+                assertEquals(tensor.sizeInBytes, tensorData.size.toLong(),
                     "Size mismatch for tensor ${tensor.name}")
             }
         }
diff --git a/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterTest.kt b/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterTest.kt
index b3f12cf6..4c3a32c5 100644
--- a/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterTest.kt
+++ b/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/SafeTensorsWriterTest.kt
@@ -186,7 +186,7 @@ class SafeTensorsWriterTest {
         assertEquals(1, reader.tensors.size)
         val tensor = reader.tensors[0]
         assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)
 
         val readData = bytesToFloatArray(reader.loadTensorData(tensor))
         assertEquals(size, readData.size)
diff --git a/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderJvmTest.kt b/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderJvmTest.kt
index 0ae8050a..3d596a44 100644
--- a/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderJvmTest.kt
+++ b/skainet-io/skainet-io-safetensors/src/jvmTest/kotlin/sk/ainet/io/safetensors/StreamingSafeTensorsReaderJvmTest.kt
@@ -193,7 +193,7 @@ class StreamingSafeTensorsReaderJvmTest {
             var tensorIndex = 0
             for (tensor in reader.tensors) {
                 val data = reader.loadTensorData(tensor)
-                assertEquals(tensor.sizeInBytes, data.size)
+                assertEquals(tensor.sizeInBytes, data.size.toLong())
 
                 // Verify the pattern we wrote
                 for (i in data.indices) {