
Feature/turboquant#458

Merged
michalharakal merged 27 commits into develop from feature/turboquant
Apr 8, 2026

Conversation

@michalharakal Contributor

No description provided.

michalharakal and others added 27 commits April 6, 2026 17:59
Introduce explicit storage model separating logical dtype from physical
encoding, with first-class ownership tracking, placement metadata, and
layout-preserving loaders. This prevents the runtime from silently
collapsing quantized/borrowed/file-backed tensors back into heap arrays.

Key additions:
- TensorStorage descriptor with LogicalDType, TensorEncoding, BufferHandle,
  and Placement contracts
- BufferHandle sealed hierarchy: Owned, Borrowed, Aliased, FileBacked,
  DeviceResident
- PackedBlockStorage interface unifying Q4_K and Q8_0 block formats
- MappedMemoryChunk + JVM mmap implementation for file-backed weights
- StreamingGgufParametersLoader with Q4_K/Q8_0 quantized type support
- Zero-copy wrapFloatArray/wrapIntArray/wrapByteArray factory methods
- Explicit copyMaterialize() and realizeAlias() materialization APIs
- MemoryPlanner with device fallback policy
- MemoryTracker for allocation observability and copy tracing
- StorageSpec for storage-aware factory routing beyond dtype-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
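To make the ownership model above concrete, here is a minimal sketch of what a sealed BufferHandle hierarchy of this shape could look like. The subclass names come from the commit message; the fields and the `requiresHostCopy` helper are illustrative assumptions, not the actual skainet API.

```kotlin
// Hypothetical sketch of the BufferHandle hierarchy; field names are
// illustrative assumptions, not the library's real signatures.
sealed class BufferHandle {
    /** Heap buffer whose lifetime is owned by the tensor. */
    data class Owned(val bytes: ByteArray) : BufferHandle()

    /** View over memory owned by someone else (e.g. a decoder scratch buffer). */
    data class Borrowed(val bytes: ByteArray, val offset: Int, val length: Int) : BufferHandle()

    /** Alias of another handle, e.g. a reshaped or sliced view. */
    data class Aliased(val base: BufferHandle, val byteOffset: Long) : BufferHandle()

    /** Weights resolved lazily via mmap at an absolute file offset. */
    data class FileBacked(val path: String, val fileOffset: Long, val length: Long) : BufferHandle()

    /** Buffer living on an accelerator; host access requires an explicit copy. */
    data class DeviceResident(val device: String, val handle: Long) : BufferHandle()
}

fun requiresHostCopy(h: BufferHandle): Boolean = when (h) {
    is BufferHandle.Owned, is BufferHandle.Borrowed -> false
    is BufferHandle.Aliased -> requiresHostCopy(h.base)
    is BufferHandle.FileBacked -> false      // resolved via mmap, no heap copy
    is BufferHandle.DeviceResident -> true   // must cross the device boundary
}
```

Because the hierarchy is sealed, a `when` over it is exhaustive, which is what prevents the runtime from silently collapsing an unhandled handle kind back into a heap array.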
…ionContext

SafeTensorsParametersLoader now uses wrapFloatArray/wrapIntArray instead of
fromFloatArray/fromIntArray for freshly-decoded arrays, eliminating a
redundant copy. Added wrapFloatArray/wrapIntArray/wrapByteArray convenience
methods to ExecutionContext to make borrow semantics accessible to loaders.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
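The wrap-vs-from distinction can be sketched as follows. `FloatTensor` here is a stand-in type, not the actual skainet class; the point is only the borrow semantics: `wrap*` aliases the caller's array, `from*` copies it.

```kotlin
// Illustrative stand-in showing borrow vs copy semantics.
class FloatTensor private constructor(private val data: FloatArray, val copied: Boolean) {
    operator fun get(i: Int): Float = data[i]

    companion object {
        // Copies: safe when the caller may reuse or mutate the array afterwards.
        fun fromFloatArray(a: FloatArray) = FloatTensor(a.copyOf(), copied = true)

        // Borrows: zero-copy, correct for freshly decoded arrays the
        // loader will never touch again.
        fun wrapFloatArray(a: FloatArray) = FloatTensor(a, copied = false)
    }
}

fun main() {
    val decoded = floatArrayOf(1f, 2f, 3f)
    val wrapped = FloatTensor.wrapFloatArray(decoded)
    val copied = FloatTensor.fromFloatArray(decoded)
    decoded[0] = 42f
    // The wrapped tensor observes the mutation; the copied one does not.
    check(wrapped[0] == 42f)
    check(copied[0] == 1f)
}
```

This is exactly why `wrap*` is only correct for freshly decoded arrays: if the loader kept mutating the buffer, the borrowed tensor would see it.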
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose
loadTensorStorage() methods that return TensorStorage descriptors with
borrowed byte buffers, instead of only raw ByteArrays. This lets callers
work with the storage model (encoding, logical type, placement) directly
without manual conversion.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose
loadTensorStorageMapped() which returns a TensorStorage with a
FileBacked BufferHandle pointing at the tensor's absolute file offset.
This enables zero-heap-copy weight loading — the OS pages data on demand
via mmap when the FileBacked handle is resolved by the runtime.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
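On the JVM the resolution path for a FileBacked handle boils down to `java.nio.channels.FileChannel.map`. A minimal sketch (the real JvmMappedMemoryChunk is not shown; this function and its name are illustrative):

```kotlin
import java.io.File
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Hypothetical helper: resolve a file-backed region through mmap.
fun mapTensorBytes(path: String, fileOffset: Long, length: Long): ByteArray {
    FileChannel.open(File(path).toPath(), StandardOpenOption.READ).use { ch ->
        // The OS pages this region in on demand; nothing is read eagerly.
        val buf = ch.map(FileChannel.MapMode.READ_ONLY, fileOffset, length)
        val out = ByteArray(length.toInt())
        buf.get(out) // bytes only materialize when the caller explicitly asks
        return out
    }
}
```

Mapping at the tensor's absolute file offset is what makes this zero-heap-copy: no intermediate ByteArray of the whole file ever exists.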
ExecutionContext now provides memoryPlanner (defaults to CPU-only) and
memoryTracker (defaults to null/disabled). Implementations can override
these to enable placement resolution and allocation tracking during
tensor creation and operation dispatch.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ternary2BitTensorData now implements PackedBlockStorage alongside Q4_K
and Q8_0, completing the unification of all packed quantization formats
under a single contract. Uses TernaryPacked encoding and provides
dequantizeBlock with scale-factor support.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
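A minimal sketch of the unifying contract, shown for a Q8_0-style format. The block layout (32 int8 codes plus one scale per block) follows the GGUF convention, but the interface members and class below are illustrative assumptions, not the library's actual definitions.

```kotlin
// Sketch of a PackedBlockStorage-style contract over block-quantized data.
interface PackedBlockStorage {
    val blockSize: Int
    fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int)
}

// Q8_0-style blocks: 32 signed 8-bit codes, one float scale per block.
class Q8_0Blocks(
    private val codes: ByteArray,
    private val scales: FloatArray,
) : PackedBlockStorage {
    override val blockSize = 32

    override fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int) {
        val scale = scales[blockIndex]
        val base = blockIndex * blockSize
        for (i in 0 until blockSize) {
            // Symmetric dequantization: value = code * scale
            out[outOffset + i] = codes[base + i] * scale
        }
    }
}
```

A single block-oriented contract lets consumers (e.g. a fused matmul) iterate blocks without knowing whether the payload is Q4_K, Q8_0, or ternary-packed.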
Heap-backed MappedMemoryChunk implementation for JS, Wasm, and Native
targets that lack native mmap support. Eagerly loads data from a
RandomAccessSource but satisfies the MappedMemoryChunk contract so
code can be written against one interface across all KMP targets.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorage now exposes copyMaterialize(), copyToHost(), copyToDevice(),
and repackTo() as explicit operations. copyMaterialize and copyToHost work
for Owned/Borrowed buffers. copyToDevice and repackTo are stubs that throw
until GPU/NPU backends and transcoding kernels are available.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CopyMaterializationStrategy and DenseTensorDataFactory's internal
createFloatTensorData/createIntTensorData now report copy events
to ActiveMemoryTracker.current when a tracker is active. This makes
hidden copies visible in debug reports without requiring callers to
manually instrument every copy site.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
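The global-hook pattern described above can be sketched like this; `ActiveMemoryTracker`, `MemoryTracker`, and `createFloatTensorData` are illustrative stand-ins with assumed shapes, not the real API.

```kotlin
// Sketch of copy-event reporting through a nullable global tracker hook.
class MemoryTracker {
    var copiedBytes: Long = 0
        private set

    fun recordCopy(bytes: Long) { copiedBytes += bytes }
}

object ActiveMemoryTracker {
    var current: MemoryTracker? = null
}

// Hypothetical factory internals: the copy happens regardless, but an
// active tracker makes it visible; a null tracker is a safe no-op.
fun createFloatTensorData(values: FloatArray): FloatArray {
    val copy = values.copyOf()
    ActiveMemoryTracker.current?.recordCopy(copy.size.toLong() * Float.SIZE_BYTES)
    return copy
}
```

The null-safe call (`current?.recordCopy`) is what keeps the instrumentation free when tracking is disabled, so copy sites never need manual guards.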
@place(device, memory, requirement) declares where a tensor should
be allocated. @Weights marks immutable model weights that should be
file-backed (mmap) when possible. The MemoryPlanner reads these
annotations to make allocation decisions at runtime.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
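A hedged sketch of placement annotations in this spirit. The enum values and the planner predicate are assumptions for illustration; only the annotation names come from the commit message.

```kotlin
// Illustrative placement annotations; values and logic are assumptions.
enum class Device { CPU, GPU }
enum class MemoryKind { HEAP, MAPPED }

@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS)
annotation class Place(val device: Device, val memory: MemoryKind)

@Retention(AnnotationRetention.RUNTIME)
@Target(AnnotationTarget.CLASS)
annotation class Weights

@Place(Device.CPU, MemoryKind.MAPPED)
@Weights
class EmbeddingTable

// A planner can read the annotations with plain Java reflection:
fun preferMmap(clazz: Class<*>): Boolean {
    val place = clazz.getAnnotation(Place::class.java)
    val isWeights = clazz.getAnnotation(Weights::class.java) != null
    return isWeights || place?.memory == MemoryKind.MAPPED
}
```

Because the annotations carry RUNTIME retention, the planner can consult them at allocation time rather than requiring a compile-time code generation step.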
README example now uses StreamingGGUFReader instead of the legacy
GGUFReader. Docs guide adds a prominent streaming section with examples
and notes that the legacy reader is not recommended for new code.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BufferAccessor provides byte-level read access to any BufferHandle.
DefaultBufferResolver handles Owned/Borrowed/Aliased directly and
delegates FileBacked to a platform-specific resolver.

JvmFileBackedResolver maps FileBacked handles through JvmMappedMemoryChunk,
completing the path from loadTensorStorageMapped() → mmap → byte access
without heap loading.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
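The resolver split described above, direct handling for in-memory handles and platform delegation for FileBacked, can be sketched as below. The `Handle` and resolver types are simplified stand-ins, not the library's actual classes.

```kotlin
// Sketch of resolver delegation: in-memory handles answered directly,
// file-backed handles delegated to a platform-specific resolver (mmap on JVM).
sealed class Handle {
    data class Owned(val bytes: ByteArray) : Handle()
    data class FileBacked(val path: String, val offset: Long, val length: Long) : Handle()
}

fun interface FileBackedResolver {
    fun resolve(h: Handle.FileBacked): ByteArray
}

class DefaultBufferResolver(private val platform: FileBackedResolver) {
    fun readAll(h: Handle): ByteArray = when (h) {
        is Handle.Owned -> h.bytes                  // direct byte access
        is Handle.FileBacked -> platform.resolve(h) // delegate, e.g. to mmap
    }
}
```

Keeping the platform resolver behind a `fun interface` is what lets the JVM plug in an mmap-backed implementation while other targets substitute an eager-read fallback.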
Tests construct a minimal GGUF binary with F32 and Q8_0 tensors and
verify the full pipeline: StreamingGGUFReader → loadTensorStorage →
file-backed mmap resolution → byte-level access. Also verifies
MemoryTracker reports correct aggregate metrics for mixed models.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorageFactory.toTensorData() converts a TensorStorage back into
a TensorData that existing backends can consume. Handles Dense FP32/INT32
(bytes → float/int array), Q4_K (→ Q4_KBlockTensorData), and Q8_0
(→ Q8_0BlockTensorData). Round-trip tests verify data integrity through
TensorData → Storage → TensorData conversions.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
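The Dense FP32 leg of such a conversion is a little-endian reinterpretation of raw bytes (GGUF and safetensors both store tensor payloads little-endian). A self-contained sketch; the function names are illustrative, not the factory's real API:

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Raw little-endian bytes -> FloatArray, as in the Dense FP32 path.
fun bytesToFloats(bytes: ByteArray): FloatArray {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val out = FloatArray(bytes.size / 4)
    buf.asFloatBuffer().get(out)
    return out
}

// FloatArray -> raw little-endian bytes, for the reverse leg of the round trip.
fun floatsToBytes(values: FloatArray): ByteArray {
    val buf = ByteBuffer.allocate(values.size * 4).order(ByteOrder.LITTLE_ENDIAN)
    buf.asFloatBuffer().put(values)
    return buf.array()
}
```

A round trip through both functions must be bit-exact, which is the property the commit's TensorData → Storage → TensorData tests verify for the dense formats.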
TransferOpsTest: 11 tests covering copyMaterialize (owned/borrowed/
file-backed/device-resident), copyToHost (identity and copy paths),
copyToDevice (CPU delegation, GPU throws), repackTo (same/different).

Q4KDequantizationTest: 6 tests covering dequantizeBlock with uniform
codes, zero codes, nibble extraction, multi-block toFloatArray,
out-of-bounds, and physical byte verification.

TernaryDequantizationTest: 6 tests covering dequantizeBlock for all -1s,
all 0s, all +1s with scale factors, mixed values matching toFloatArray,
output offset writing, and invalid block index.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tiguous storage, loader

ActiveMemoryTrackerTest (5 tests): verifies global tracker hook captures
copies from DenseTensorDataFactory, null tracker is safe, clear resets.

FallbackMappedMemoryChunkTest (10 tests): covers readByte, readBytes,
slice, nested slice offset composition, dataOffset, metadata, close.

NonContiguousStorageTest (6 tests): verifies strides preservation,
isContiguous flag, equals/hashCode include strides, defaults.

StreamingGgufParametersLoaderTest (3 tests): end-to-end F32 and Q8_0
loading through the parameters loader, progress callback verification.

Also fixes StackOverflow in TensorStorage.equals() where the private
contentEquals extension recursively called itself instead of
LongArray.contentEquals.

Refs #451

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tiktoken-based BPE tokenizer that parses Mistral's tekken.json format:
base64-decoded byte vocab, implicit merge ordering from rank, separate
special token ID space, Unicode-aware pre-tokenization regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Tekken tokenizer for Mistral models
…ors zero-copy

Finish all remaining Step 1 PRD items for TurboQuant readiness:

- Add KvCacheStore interface with append-by-token, range reads, eviction,
  asymmetric K/V encoding policies, and DefaultKvCacheStore implementation
- Add CompressedKvAttention bridge between KvCacheStore and SDPA with
  full-tile dequant and raw-storage extension points for fused backends
- Complete Quants.kt port: byteShapeToQuantShape, quantByteSize,
  isBlockQuantized, validateQuantizedBytes, and related utilities
- Add StorageAwareSafeTensorsLoader producing TensorStorage with
  FileBacked (zero-copy) or Borrowed handles
- Add TURBOQUANT_ISSUES.md tracker with 21 traceable issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
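The append-by-token / range-read / eviction surface can be sketched as a minimal dense store. This is an illustrative stand-in for DefaultKvCacheStore, single-head and uncompressed; method names and shapes are assumptions.

```kotlin
// Minimal dense KV cache store: append per token, range reads, eviction.
class DefaultKvCacheStore(private val headDim: Int) {
    private val keys = mutableListOf<FloatArray>()
    private val values = mutableListOf<FloatArray>()

    val length: Int get() = keys.size

    fun append(k: FloatArray, v: FloatArray) {
        require(k.size == headDim && v.size == headDim)
        keys += k.copyOf()
        values += v.copyOf()
    }

    /** Range read [from, until) as a flat array, as SDPA would consume it. */
    fun readKeys(from: Int, until: Int): FloatArray {
        val out = FloatArray((until - from) * headDim)
        for (t in from until until) keys[t].copyInto(out, (t - from) * headDim)
        return out
    }

    /** Evict the oldest tokens, keeping only the most recent `keep`. */
    fun evictTo(keep: Int) {
        while (keys.size > keep) { keys.removeAt(0); values.removeAt(0) }
    }
}
```

The asymmetric K/V encoding policies layer on top of this surface: keys and values go through the same append/read contract but may be stored at different bit budgets.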
Add complete TurboQuant implementation as KV-cache compression:

Core kernels (common Kotlin):
- RandomRotation: Walsh-Hadamard + random sign flips, O(d log d)
- ScalarQuantizer: per-group symmetric quantization, 2/3/4/8-bit
- BitPacker: compact bit-packing/unpacking for all bit widths
- QjlResidual: Quantized Johnson-Lindenstrauss residual stage

End-to-end codec:
- TurboQuantCodec: full encode/decode pipeline (PolarOnly + PolarPlusQjl)
- TurboQuantKvCacheStore: compressed KV cache with per-head TurboQuant blocks
- Asymmetric K/V policies (different bit budgets for keys vs values)

Encoding types:
- TurboQuantPolar and TurboQuantPolarQjl added to sealed TensorEncoding

Presets:
- safe-lowbit (Q8_0-K + TurboQuant-4-V)
- balanced (TurboQuant-4 / TurboQuant-4)
- experimental-max (TurboQuant-3 / TurboQuant-3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
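The core math of the PolarOnly encode path can be sketched in a few lines: random sign flips followed by a normalized Walsh-Hadamard butterfly, then symmetric scalar quantization. The real TurboQuantCodec is considerably more elaborate (grouping, residual stage, packing); this shows only the two kernels named above, with assumed function names.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt
import kotlin.math.sqrt
import kotlin.random.Random

// Random sign flips + orthonormal Walsh-Hadamard rotation, O(d log d).
fun hadamardRotate(x: FloatArray, seed: Long): FloatArray {
    val d = x.size
    require(d and (d - 1) == 0) { "dimension must be a power of two" }
    val rng = Random(seed)
    val y = FloatArray(d) { i -> if (rng.nextBoolean()) x[i] else -x[i] }
    var h = 1
    while (h < d) {                      // standard fast-WHT butterfly
        var i = 0
        while (i < d) {
            for (j in i until i + h) {
                val a = y[j]; val b = y[j + h]
                y[j] = a + b; y[j + h] = a - b
            }
            i += 2 * h
        }
        h *= 2
    }
    val norm = 1f / sqrt(d.toFloat())    // orthonormal scaling preserves norms
    for (i in 0 until d) y[i] *= norm
    return y
}

// Symmetric per-group quantization to signed `bits`-wide levels (2/3/4/8).
fun quantize(x: FloatArray, bits: Int): Pair<IntArray, Float> {
    val maxLevel = (1 shl (bits - 1)) - 1
    val peak = x.maxOf { abs(it) }
    val scale = if (peak > 0f) peak / maxLevel else 1f
    val codes = IntArray(x.size) {
        (x[it] / scale).roundToInt().coerceIn(-maxLevel, maxLevel)
    }
    return codes to scale
}
```

Because the rotation is orthonormal, it preserves vector norms while spreading energy evenly across coordinates, which is what makes the subsequent low-bit symmetric quantization well-conditioned.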
- @kvcache and @KvCacheBypass annotations for declarative KV cache
  compression configuration on attention layers
- JvmTurboQuantKernels: SIMD-accelerated abs-max, quantize, dequantize,
  and Walsh-Hadamard butterfly using Java Vector API
- TurboQuantBenchmarks: JMH benchmarks for encode/decode throughput,
  bit-packing, random rotation, and KV cache append/read performance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed task document covering:
- Metal compute shaders for TurboQuant encode/decode
- Fused dequant+SDPA kernel design (avoids materializing decompressed K/V)
- Unified-memory KV cache (zero CPU↔GPU copies on Apple Silicon)
- Kotlin/Native cinterop setup for Metal.framework
- 5-phase implementation plan with 20 subtasks
- Shader signatures and parameter structs
- Performance targets and acceptance criteria

Refs #452

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make TurboQuant adoption a one-liner for skainet-transformers:

- KvCacheStore.turboQuant("balanced", ...) factory method
- KvCacheStore.dense() and .fromPreset() convenience factories
- TurboQuantPresets.forModel() lookup by preset name + model dims
- KvCacheAnnotationResolver: resolve @kvcache annotations to stores
- TurboQuantUsage: documented integration guide with compilable examples
  showing cache creation, attention layer wiring, and generation loop

Any GGUF model (LLaMA, Mistral, Gemma, Qwen) can use TurboQuant
immediately — it compresses the KV cache at runtime, not model weights.

Refs #452

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michalharakal michalharakal merged commit 5ed67ab into develop Apr 8, 2026
6 checks passed
@github-actions

github-actions Bot commented Apr 8, 2026

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

  • Operator documentation: docs/modules/operators/_generated_/
  • JSON schema output: operators.json

Artifacts:

  • Download the documentation-preview-458 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

@michalharakal michalharakal deleted the feature/turboquant branch April 8, 2026 15:37