43 changes: 43 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,48 @@
# Changelog

## [0.18.0] - 2026-04-08

### Added
- **TurboQuant KV-Cache Compression**: Runtime KV-cache compression for LLM inference using rotation-based quantization (Google Research TurboQuant paper). Supports PolarOnly and PolarPlusQjl variants with 2/3/4/8-bit encoding.
- `TurboQuantCodec`: End-to-end encode/decode pipeline (random rotation, scalar quantization, QJL residual, bit-packing).
- `TurboQuantKvCacheStore`: Compressed KV cache with per-head TurboQuant blocks and asymmetric K/V policies.
- `TurboQuantPresets`: Named presets — `safe-lowbit` (Q8_0-K + TQ4-V), `balanced` (TQ4/TQ4), `experimental-max` (TQ3/TQ3).
- `KvCacheStore.turboQuant("balanced", ...)`: One-line factory for skainet-transformers integration.
- `CompressedKvAttention`: SDPA bridge with FULL_TILE and RAW_STORAGE dequant strategies.
- `@KvCache` and `@KvCacheBypass` DSL annotations for declarative KV cache configuration.
- `KvCacheAnnotationResolver`: Resolve annotations to cache instances.
- `TurboQuantUsage`: Integration guide with compilable examples.
- **Memory Architecture Hardening**: First-class storage and placement abstractions for zero-copy, quantization-preserving tensor management.
- `TensorStorage`: Runtime descriptor replacing ad-hoc array passing (logical type, physical encoding, buffer ownership, placement).
- `TensorEncoding`: Sealed hierarchy — `Dense`, `Q4_K`, `Q8_0`, `TernaryPacked`, `TurboQuantPolar`, `TurboQuantPolarQjl`, `Opaque`.
- `BufferHandle`: Five ownership modes — `Owned`, `Borrowed`, `Aliased`, `FileBacked`, `DeviceResident`.
- `Placement`: Device/memory-domain intent with fallback policies (`CPU_HEAP`, `MMAP_WEIGHTS`, `GPU_PREFERRED`).
- `LogicalDType`: Semantic numeric types separate from physical encoding.
- `PackedBlockStorage`: Unified contract for all packed quantized formats.
- `MemoryPlanner`, `MemoryTracker`, `ActiveMemoryTracker`: Placement resolution and copy diagnostics.
- **KV-Cache Subsystem**: `KvCacheStore` interface with append-by-token writes, layer/head addressing, eviction, and `DefaultKvCacheStore` (dense FP32 baseline).
- **Quantization-Preserving Loaders**: `StreamingGGUFReader` and `StreamingSafeTensorsReader` produce `TensorStorage` with `FileBacked` or `Borrowed` handles (no forced densification).
- `StorageAwareSafeTensorsLoader`: Zero-copy file-backed SafeTensors loading.
- Completed `Quants.kt` port: `byteShapeToQuantShape`, `quantByteSize`, `isBlockQuantized`, `validateQuantizedBytes`.
- **Tekken Tokenizer**: Mistral Tekken (tiktoken-based BPE) tokenizer support.
- **CPU SIMD TurboQuant Kernels**: `JvmTurboQuantKernels` with Java Vector API acceleration for abs-max, quantize, dequantize, and Walsh-Hadamard butterfly.
- **JMH Benchmarks**: TurboQuant encode/decode throughput, bit-packing, rotation, and KV cache append/read benchmarks (`TurboQuantBenchmarks.kt`).
- **Storage Benchmarks**: Dequantization throughput (Q4_K, Q8_0, Ternary), buffer accessor, and TensorData bridge benchmarks (`StorageBenchmarks.kt`).
- **New Ops**: `sin`, `cos`, `tanh`, `convTranspose1d`.
- **New Layers**: `TransposedConv1d`, `Snake` activation, `LayerScale`.
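Putting the additions above together, the one-line factory might be used as in the sketch below. The `KvCacheStore.turboQuant` name and the preset string come from this changelog; the trailing model-geometry parameters (elided as `...` above) are hypothetical placeholders, so consult `TurboQuantUsage` for the actual signature.

```kotlin
// Sketch only: factory name and preset come from the changelog;
// numLayers/numHeads/headDim are hypothetical stand-in parameters.
val kvCache: KvCacheStore = KvCacheStore.turboQuant(
    "balanced",      // TQ4 keys / TQ4 values per TurboQuantPresets
    numLayers = 32,  // hypothetical model-geometry parameters
    numHeads = 32,
    headDim = 128,
)
```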

### Changed
- **Streaming GGUF as Default**: `StreamingGGUFReader` is now the recommended GGUF loading path (memory-efficient, supports quantized types).
- **DSL Annotations**: Extended `PlacementAnnotations.kt` with `@KvCache(preset=...)` and `@KvCacheBypass` for TurboQuant configuration.
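A minimal sketch of the declarative configuration, for orientation: only the `@KvCache(preset = ...)` and `@KvCacheBypass` annotation names, the preset strings, and `KvCacheAnnotationResolver` are taken from this changelog; where the annotations attach and how the resolver is invoked are assumptions, not confirmed API.

```kotlin
// Hypothetical shape of the annotation-driven KV-cache DSL.
@KvCache(preset = "safe-lowbit")  // Q8_0 keys + TQ4 values
class ChatDecoder { /* transformer layers */ }

@KvCacheBypass                    // opt out: dense FP32 baseline cache
class DraftModel { /* small speculative-decoding model */ }

// KvCacheAnnotationResolver resolves annotations to cache instances
// (invocation shown here is illustrative only).
val cache = KvCacheAnnotationResolver.resolve(ChatDecoder::class)
```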

### Fixed
- **Int Overflow for Large Tensors**: Widened `StreamingTensorInfo.nBytes` and `StreamingSafeTensorInfo.sizeInBytes` from `Int` to `Long`, preventing silent overflow for tensors > 2 GB. Fixes loading of Gemma 4 E4B and future large models. (#452)
- **Legacy GGUFReader Overflow Guard**: Added explicit overflow check with actionable error message for tensors > 2 GB in the legacy eager loader.
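The class of bug fixed above is easy to reproduce in plain Kotlin: multiplying an `Int` element count by the element size wraps silently once the product exceeds `Int.MAX_VALUE`, while widening one operand to `Long` yields the true byte size.

```kotlin
fun main() {
    // 2^30 float32 elements, i.e. a 4 GiB tensor (plausible for large models)
    val elementCount = 1_073_741_824

    val wrapped = elementCount * 4   // Int arithmetic: 2^32 wraps to 0, silently
    val widened = elementCount * 4L  // Long arithmetic: correct byte size

    println(wrapped)  // 0
    println(widened)  // 4294967296
}
```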

### Dependencies
- io.github.kotest:kotest: 6.1.9 → 6.1.11.
- com.squareup:kotlinpoet: 2.2.0 → 2.3.0.

## [0.17.0] - 2026-03-25

### Added
19 changes: 11 additions & 8 deletions README.md
@@ -19,8 +19,8 @@ Add the core dependencies (Gradle Kotlin DSL):

```kotlin
dependencies {
-    implementation("sk.ainet.core:SKaiNET-lang-core:0.17.0")
-    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.17.0")
+    implementation("sk.ainet.core:SKaiNET-lang-core:0.18.0")
+    implementation("sk.ainet.core:SKaiNET-backend-cpu:0.18.0")
}
```

@@ -107,6 +107,7 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,

- **ComputeGraphExecutor**: Optimized engine with fusion passes and trace-to-DAG bridging.
- **SDPA & Gather**: High-performance Scaled Dot-Product Attention and indexing operations.
+- **TurboQuant**: Runtime KV-cache compression (~8x at 4-bit) for long-context LLM inference. Presets: `safe-lowbit`, `balanced`, `experimental-max`. See `TurboQuantUsage` for the integration guide.

### Agentic AI Infrastructure

@@ -148,12 +149,14 @@ SKaiNET is a modular ecosystem. While this repository contains the core engine,

---

-## What's New in 0.17.0
+## What's New in 0.18.0

-- **Core Engine Focus** — Refactored the repository to focus on the core `ComputeGraph` framework, compiler, and backends. Extracted high-level LLM and transformer implementations to standalone repositories.
-- **LLM-as-DSL** — New high-level DSL for defining and running LLM architectures within the core framework.
-- **Optimized ComputeGraphExecutor** — New executor with support for fusion passes and trace-to-DAG bridging for faster inference.
-- **SDPA & Gather** — Implemented Scaled Dot-Product Attention and `gather`/`indexSelect` ops for improved performance.
+- **TurboQuant KV-Cache Compression** — Runtime KV-cache compression for LLM inference: ~8x memory reduction with 4-bit, works with any model (LLaMA, Mistral, Gemma, Qwen). One-line integration via `KvCacheStore.turboQuant("balanced", ...)`.
+- **Memory Architecture Hardening** — First-class storage/placement abstractions (`TensorStorage`, `TensorEncoding`, `BufferHandle`, `Placement`), zero-copy ownership semantics, quantization-preserving loaders.
+- **KV-Cache Subsystem** — Dedicated `KvCacheStore` with append-by-token writes, layer/head addressing, asymmetric K/V encoding policies, and `CompressedKvAttention` SDPA bridge.
+- **Mistral Tokenizer** — Tekken (tiktoken-based BPE) tokenizer support for Mistral models.
+- **Large Tensor Fix** — Fixed Int overflow in GGUF and SafeTensors loaders for tensors > 2 GB (Gemma 4 E4B support).
+- **CPU SIMD Kernels** — Java Vector API acceleration for TurboQuant encode/decode/rotation operations.

See [CHANGELOG.md](CHANGELOG.md) for the full release history.

@@ -162,7 +165,7 @@ See [CHANGELOG.md](CHANGELOG.md) for the full release history.
## Roadmap

- **Q1 2026**: Comprehensive documentation ✅
-- **Q2 2026**: Reference-based validation of computation correctness
+- **Q2 2026**: TurboQuant KV-cache compression ✅ (shipped in 0.18.0)
- **Q3 2026**: Agentic AI enhancements ✅ (tool calling shipped in 0.13.0; ongoing)
- **Q4 2026**: Federated learning support for multi-device training

3 changes: 1 addition & 2 deletions gradle.properties
@@ -1,6 +1,5 @@
GROUP=sk.ainet.core
-VERSION_NAME=0.17.0
-
+VERSION_NAME=0.18.0
POM_DESCRIPTION=SKaiNET

POM_URL=https://github.com/SKaiNET-developers/skainet/
@@ -73,7 +73,11 @@ public class StreamingSafeTensorsReader private constructor(
     * @return Raw bytes for the tensor
     */
    public fun loadTensorData(tensor: StreamingSafeTensorInfo): ByteArray {
-        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, tensor.sizeInBytes.toInt())
    }

/**
@@ -85,7 +89,11 @@
     * @return Number of bytes read
     */
    public fun loadTensorData(tensor: StreamingSafeTensorInfo, buffer: ByteArray, offset: Int = 0): Int {
-        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes)
+        require(tensor.sizeInBytes <= Int.MAX_VALUE) {
+            "Tensor '${tensor.name}' is ${tensor.sizeInBytes} bytes (> 2 GB). " +
+                "Use loadTensorStorageMapped() for file-backed zero-copy access instead."
+        }
+        return source.readAt(tensor.absoluteDataOffset, buffer, offset, tensor.sizeInBytes.toInt())
    }

// ========== TensorStorage Loading ==========
@@ -130,7 +138,7 @@ public class StreamingSafeTensorsReader private constructor(
        buffer = BufferHandle.FileBacked(
            path = filePath,
            fileOffset = tensor.absoluteDataOffset,
-            sizeInBytes = tensor.sizeInBytes.toLong()
+            sizeInBytes = tensor.sizeInBytes
        ),
        placement = Placement.MMAP_WEIGHTS
    )
@@ -396,7 +404,7 @@ public class StreamingSafeTensorsReader private constructor(

        val elementCount = if (shape.isEmpty()) 1L else shape.fold(1L) { acc, d -> acc * d }
        val bytesPerElement = SafeTensorsDataTypes.sizeOf(dtype) ?: 1
-        val sizeInBytes = (dataOffsets.second - dataOffsets.first).toInt()
+        val sizeInBytes = dataOffsets.second - dataOffsets.first
        val mappedDataType = SafeTensorsDataTypeMapper.toDataType(dtype)

        _tensors.add(
@@ -529,7 +537,7 @@ public data class StreamingSafeTensorInfo(
    /** End offset relative to data section */
    val dataOffsetEnd: Long,
    /** Size in bytes */
-    val sizeInBytes: Int,
+    val sizeInBytes: Long,
    /** Absolute byte offset in file */
    val absoluteDataOffset: Long
) {
@@ -342,7 +342,7 @@ public data class ShardedTensorInfo(
    val elementCount: Long get() = base.elementCount
    val dataOffsetStart: Long get() = base.dataOffsetStart
    val dataOffsetEnd: Long get() = base.dataOffsetEnd
-    val sizeInBytes: Int get() = base.sizeInBytes
+    val sizeInBytes: Long get() = base.sizeInBytes
    val absoluteDataOffset: Long get() = base.absoluteDataOffset
    val isUnknownType: Boolean get() = base.isUnknownType
@@ -271,7 +271,7 @@ class SafeTensorsWriterCommonTest {
        assertEquals(1, reader.tensors.size)
        val tensor = reader.tensors[0]
        assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)

        val readData = bytesToFloatArray(reader.loadTensorData(tensor))
        assertEquals(size, readData.size)
@@ -243,7 +243,7 @@ class StreamingSafeTensorsReaderTest {
        StreamingSafeTensorsReader.open(source).use { reader ->
            val tensor = reader.tensors[0]
            val tensorData = reader.loadTensorData(tensor)
-            assertEquals(tensor.sizeInBytes, tensorData.size)
+            assertEquals(tensor.sizeInBytes, tensorData.size.toLong())
        }
    }

@@ -302,7 +302,7 @@ class StreamingSafeTensorsReaderTest {
        StreamingSafeTensorsReader.open(source).use { reader ->
            for (tensor in reader.tensors) {
                val tensorData = reader.loadTensorData(tensor)
-                assertEquals(tensor.sizeInBytes, tensorData.size,
+                assertEquals(tensor.sizeInBytes, tensorData.size.toLong(),
                    "Size mismatch for tensor ${tensor.name}")
            }
        }
@@ -186,7 +186,7 @@ class SafeTensorsWriterTest {
        assertEquals(1, reader.tensors.size)
        val tensor = reader.tensors[0]
        assertEquals(listOf(100L, 100L), tensor.shape)
-        assertEquals(size * 4, tensor.sizeInBytes)
+        assertEquals(size * 4L, tensor.sizeInBytes)

        val readData = bytesToFloatArray(reader.loadTensorData(tensor))
        assertEquals(size, readData.size)
@@ -193,7 +193,7 @@ class StreamingSafeTensorsReaderJvmTest {
        var tensorIndex = 0
        for (tensor in reader.tensors) {
            val data = reader.loadTensorData(tensor)
-            assertEquals(tensor.sizeInBytes, data.size)
+            assertEquals(tensor.sizeInBytes, data.size.toLong())

            // Verify the pattern we wrote
            for (i in data.indices) {