Merged
Introduce explicit storage model separating logical dtype from physical encoding, with first-class ownership tracking, placement metadata, and layout-preserving loaders. This prevents the runtime from silently collapsing quantized, borrowed, or file-backed tensors back into heap arrays.

Key additions:
- TensorStorage descriptor with LogicalDType, TensorEncoding, BufferHandle, and Placement contracts
- BufferHandle sealed hierarchy: Owned, Borrowed, Aliased, FileBacked, DeviceResident
- PackedBlockStorage interface unifying Q4_K and Q8_0 block formats
- MappedMemoryChunk + JVM mmap implementation for file-backed weights
- StreamingGgufParametersLoader with Q4_K/Q8_0 quantized type support
- Zero-copy wrapFloatArray/wrapIntArray/wrapByteArray factory methods
- Explicit copyMaterialize() and realizeAlias() materialization APIs
- MemoryPlanner with device fallback policy
- MemoryTracker for allocation observability and copy tracing
- StorageSpec for storage-aware factory routing beyond dtype-only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
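The commit above names the descriptor and its sealed ownership hierarchy; a minimal sketch of what those shapes could look like is below. The names (TensorStorage, LogicalDType, TensorEncoding, BufferHandle and its five variants) come from the commit message, but every field and constructor here is an illustrative assumption, not the project's actual API.

```kotlin
// Hypothetical sketch of the storage descriptor; fields are assumptions.
enum class LogicalDType { F32, I32, U8 }

sealed class TensorEncoding {
    object Dense : TensorEncoding()
    object Q4_K : TensorEncoding()
    object Q8_0 : TensorEncoding()
}

sealed class BufferHandle {
    /** Heap memory owned by the runtime; safe to mutate and free. */
    data class Owned(val bytes: ByteArray) : BufferHandle()
    /** Memory owned by someone else (e.g. a loader's decode buffer). */
    data class Borrowed(val bytes: ByteArray) : BufferHandle()
    /** A view into another handle at a byte offset. */
    data class Aliased(val base: BufferHandle, val offset: Long, val length: Long) : BufferHandle()
    /** Bytes resolved lazily from a file offset (mmap on JVM). */
    data class FileBacked(val path: String, val offset: Long, val length: Long) : BufferHandle()
    /** Lives on an accelerator; host access requires an explicit copy. */
    data class DeviceResident(val device: String, val address: Long, val length: Long) : BufferHandle()
}

class TensorStorage(
    val dtype: LogicalDType,      // what the tensor "means"
    val encoding: TensorEncoding, // how the bytes are physically laid out
    val handle: BufferHandle,     // who owns the bytes and where they live
    val shape: LongArray,
)
```

Keeping dtype and encoding as separate axes is what lets a Q8_0-encoded tensor still report a logical F32 type instead of being materialized to a float heap array.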
…ionContext

SafeTensorsParametersLoader now uses wrapFloatArray/wrapIntArray instead of fromFloatArray/fromIntArray for freshly decoded arrays, eliminating a redundant copy. Added wrapFloatArray/wrapIntArray/wrapByteArray convenience methods to ExecutionContext so that borrow semantics are accessible to loaders.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
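The copy-vs-borrow distinction the commit relies on can be sketched as follows. FloatTensor here is a stand-in type invented for the example; only the from/wrap naming mirrors the commit.

```kotlin
// Illustrative-only sketch of fromFloatArray (copy) vs wrapFloatArray (borrow).
class FloatTensor(private val data: FloatArray) {
    operator fun get(i: Int): Float = data[i]

    companion object {
        /** Copying constructor: the tensor owns an independent buffer. */
        fun fromFloatArray(src: FloatArray) = FloatTensor(src.copyOf())

        /** Zero-copy constructor: the tensor borrows the caller's buffer.
         *  Safe when the caller hands over a freshly decoded array it will
         *  never touch again, which is exactly the loader's situation. */
        fun wrapFloatArray(src: FloatArray) = FloatTensor(src)
    }
}
```

The wrap variant is only sound because the loader's decode buffer has no other readers; a borrowed tensor observes any later mutation of the source array.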
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose loadTensorStorage() methods that return TensorStorage descriptors with borrowed byte buffers, instead of only raw ByteArrays. This lets callers work with the storage model (encoding, logical type, placement) directly, without manual conversion.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both StreamingGGUFReader and StreamingSafeTensorsReader now expose loadTensorStorageMapped(), which returns a TensorStorage with a FileBacked BufferHandle pointing at the tensor's absolute file offset. This enables zero-heap-copy weight loading: the OS pages data in on demand via mmap when the runtime resolves the FileBacked handle.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
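On the JVM, the mmap path a FileBacked handle resolves through can be sketched with the standard FileChannel API. The helper name and the little-endian float layout are assumptions for the example; only `FileChannel.map` itself is a real platform API.

```kotlin
// Sketch: map a byte range of a weights file and read it as LE floats.
import java.io.File
import java.nio.ByteOrder
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

fun readMappedFloats(path: String, offset: Long, count: Int): FloatArray {
    FileChannel.open(File(path).toPath(), StandardOpenOption.READ).use { ch ->
        // The OS pages this region in on demand; nothing is copied to the
        // heap until the floats are actually read.
        val buf = ch.map(FileChannel.MapMode.READ_ONLY, offset, count * 4L)
            .order(ByteOrder.LITTLE_ENDIAN)
        return FloatArray(count) { buf.getFloat(it * 4) }
    }
}
```

Because the mapping starts at the tensor's absolute file offset, the reader never has to load the surrounding file into heap memory.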
ExecutionContext now provides memoryPlanner (defaults to CPU-only) and memoryTracker (defaults to null/disabled). Implementations can override these to enable placement resolution and allocation tracking during tensor creation and operation dispatch.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ternary2BitTensorData now implements PackedBlockStorage alongside Q4_K and Q8_0, completing the unification of all packed quantization formats under a single contract. It uses the TernaryPacked encoding and provides dequantizeBlock with scale-factor support.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
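A toy model of the PackedBlockStorage contract applied to a ternary format like the one above: four 2-bit codes per byte, one scale per block. The interface members mirror the commit's description, but the code-to-value mapping (0 → 0, 1 → +1, 2 → −1) and the bit layout are assumptions for illustration, not the project's wire format.

```kotlin
// Toy PackedBlockStorage with a ternary (-1/0/+1) block implementation.
interface PackedBlockStorage {
    val blockSize: Int
    val blockCount: Int
    /** Dequantize one block into [out] starting at [outOffset]. */
    fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int)
}

class TernaryBlocks(
    private val packed: ByteArray,   // four 2-bit codes per byte
    private val scales: FloatArray,  // one scale factor per block
    override val blockSize: Int,
) : PackedBlockStorage {
    override val blockCount: Int get() = scales.size

    override fun dequantizeBlock(blockIndex: Int, out: FloatArray, outOffset: Int) {
        require(blockIndex in 0 until blockCount) { "invalid block $blockIndex" }
        val scale = scales[blockIndex]
        val baseBit = blockIndex * blockSize * 2
        for (i in 0 until blockSize) {
            val bit = baseBit + i * 2
            val code = (packed[bit / 8].toInt() shr (bit % 8)) and 0b11
            out[outOffset + i] = when (code) {
                1 -> scale   // assumed mapping: 1 -> +1 * scale
                2 -> -scale  // assumed mapping: 2 -> -1 * scale
                else -> 0f
            }
        }
    }
}
```

Whatever the real encoding, the point of the shared contract is that Q4_K, Q8_0, and ternary data all answer the same dequantizeBlock call.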
Heap-backed MappedMemoryChunk implementation for JS, Wasm, and Native targets that lack native mmap support. It eagerly loads data from a RandomAccessSource but satisfies the MappedMemoryChunk contract, so code can be written against one interface across all KMP targets.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorage now exposes copyMaterialize(), copyToHost(), copyToDevice(), and repackTo() as explicit operations. copyMaterialize and copyToHost work for Owned/Borrowed buffers. copyToDevice and repackTo are stubs that throw until GPU/NPU backends and transcoding kernels are available.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CopyMaterializationStrategy and DenseTensorDataFactory's internal createFloatTensorData/createIntTensorData now report copy events to ActiveMemoryTracker.current when a tracker is active. This makes hidden copies visible in debug reports without requiring callers to manually instrument every copy site.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@place(device, memory, requirement) declares where a tensor should be allocated. @Weights marks immutable model weights that should be file-backed (mmap) when possible. The MemoryPlanner reads these annotations to make allocation decisions at runtime.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
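A hedged sketch of what such declarations could look like in Kotlin. The argument names mirror the commit (`device`, `memory`, `requirement`), but the annotation shapes, the capitalized `Place` spelling, the string-typed arguments, and the planner helper are all invented for the example.

```kotlin
// Hypothetical placement annotations plus a toy planner that reads them.
@Target(AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Place(val device: String, val memory: String, val requirement: String)

@Target(AnnotationTarget.CLASS)
@Retention(AnnotationRetention.RUNTIME)
annotation class Weights  // immutable weights: prefer file-backed (mmap) allocation

@Place(device = "cpu", memory = "host", requirement = "preferred")
@Weights
class EmbeddingTable

/** A planner inspects the annotations to pick an allocation strategy. */
fun planAllocation(cls: Class<*>): String {
    val place = cls.getAnnotation(Place::class.java)
    val weights = cls.getAnnotation(Weights::class.java) != null
    return when {
        weights -> "file-backed"
        place != null -> "device=${place.device}"
        else -> "default-heap"
    }
}
```

RUNTIME retention is the key design point: it lets the planner make decisions when tensors are actually allocated rather than at compile time.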
README example now uses StreamingGGUFReader instead of the legacy GGUFReader. The docs guide adds a prominent streaming section with examples and notes that the legacy reader is not recommended for new code.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BufferAccessor provides byte-level read access to any BufferHandle. DefaultBufferResolver handles Owned/Borrowed/Aliased directly and delegates FileBacked to a platform-specific resolver. JvmFileBackedResolver maps FileBacked handles through JvmMappedMemoryChunk, completing the path from loadTensorStorageMapped() → mmap → byte access without heap loading.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests construct a minimal GGUF binary with F32 and Q8_0 tensors and verify the full pipeline: StreamingGGUFReader → loadTensorStorage → file-backed mmap resolution → byte-level access. Also verifies that MemoryTracker reports correct aggregate metrics for mixed models.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TensorStorageFactory.toTensorData() converts a TensorStorage back into a TensorData that existing backends can consume. Handles Dense FP32/INT32 (bytes → float/int array), Q4_K (→ Q4_KBlockTensorData), and Q8_0 (→ Q8_0BlockTensorData). Round-trip tests verify data integrity through TensorData → Storage → TensorData conversions.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
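The Dense FP32 leg of such a conversion reduces to reinterpreting raw little-endian bytes as floats and back. The helper names below are illustrative (only `ByteBuffer` is a real API), and little-endian byte order is an assumption, though it matches GGUF and safetensors conventions.

```kotlin
// Sketch: bytes <-> float round trip for the Dense FP32 storage case.
import java.nio.ByteBuffer
import java.nio.ByteOrder

fun bytesToFloats(bytes: ByteArray): FloatArray {
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    return FloatArray(bytes.size / 4) { buf.getFloat(it * 4) }
}

fun floatsToBytes(floats: FloatArray): ByteArray {
    val buf = ByteBuffer.allocate(floats.size * 4).order(ByteOrder.LITTLE_ENDIAN)
    floats.forEach(buf::putFloat)
    return buf.array()
}
```

A round trip through both helpers must be lossless, which is exactly what the commit's TensorData → Storage → TensorData tests assert for the dense case.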
TransferOpsTest (11 tests): covers copyMaterialize (owned/borrowed/file-backed/device-resident), copyToHost (identity and copy paths), copyToDevice (CPU delegation, GPU throws), and repackTo (same/different).

Q4KDequantizationTest (6 tests): covers dequantizeBlock with uniform codes, zero codes, nibble extraction, multi-block toFloatArray, out-of-bounds, and physical byte verification.

TernaryDequantizationTest (6 tests): covers dequantizeBlock for all -1s, all 0s, and all +1s with scale factors, mixed values matching toFloatArray, output offset writing, and invalid block index.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tiguous storage, loader

ActiveMemoryTrackerTest (5 tests): verifies that the global tracker hook captures copies from DenseTensorDataFactory, that a null tracker is safe, and that clear resets state.

FallbackMappedMemoryChunkTest (10 tests): covers readByte, readBytes, slice, nested slice offset composition, dataOffset, metadata, and close.

NonContiguousStorageTest (6 tests): verifies strides preservation, the isContiguous flag, that equals/hashCode include strides, and defaults.

StreamingGgufParametersLoaderTest (3 tests): end-to-end F32 and Q8_0 loading through the parameters loader, with progress callback verification.

Also fixes a StackOverflowError in TensorStorage.equals(), where the private contentEquals extension recursively called itself instead of calling LongArray.contentEquals.

Refs #451
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Tekken tokenizer for Mistral models

A Tiktoken-based BPE tokenizer that parses Mistral's tekken.json format: base64-decoded byte vocab, implicit merge ordering from rank, a separate special-token ID space, and a Unicode-aware pre-tokenization regex.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
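The base64 vocab decoding step can be sketched as follows. The `VocabEntry` shape and field names are assumptions about how the tekken.json entries might be modeled after JSON parsing; only the rank-keyed, base64-decoded byte vocabulary comes from the commit description.

```kotlin
// Sketch: decode base64 vocab entries into a rank -> raw-bytes table.
// Merge priority is implicit: lower rank means the merge applies earlier.
import java.util.Base64

data class VocabEntry(val rank: Int, val tokenBase64: String)

fun decodeVocab(entries: List<VocabEntry>): Map<Int, ByteArray> =
    entries.associate { it.rank to Base64.getDecoder().decode(it.tokenBase64) }
```

Decoding to raw bytes rather than strings matters: BPE over bytes never hits an un-encodable character, and the rank ordering doubles as the merge table, so no separate merges list is needed.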
…ors zero-copy

Finish all remaining Step 1 PRD items for TurboQuant readiness:
- Add KvCacheStore interface with append-by-token, range reads, eviction, asymmetric K/V encoding policies, and a DefaultKvCacheStore implementation
- Add CompressedKvAttention bridge between KvCacheStore and SDPA, with full-tile dequant and raw-storage extension points for fused backends
- Complete the Quants.kt port: byteShapeToQuantShape, quantByteSize, isBlockQuantized, validateQuantizedBytes, and related utilities
- Add StorageAwareSafeTensorsLoader producing TensorStorage with FileBacked (zero-copy) or Borrowed handles
- Add TURBOQUANT_ISSUES.md tracker with 21 traceable issues

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
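The append-by-token / range-read contract from the first bullet can be modeled minimally as below. The method signatures are assumptions invented for the example; the real store additionally layers eviction and per-tensor encoding policies on top.

```kotlin
// Minimal in-memory model of a KV cache store's append/range-read contract.
interface KvCacheStore {
    val length: Int
    /** Append one token's key and value vectors. */
    fun append(key: FloatArray, value: FloatArray)
    /** Read token positions in [from, to). */
    fun readKeys(from: Int, to: Int): List<FloatArray>
    fun readValues(from: Int, to: Int): List<FloatArray>
}

class InMemoryKvCacheStore : KvCacheStore {
    private val keys = mutableListOf<FloatArray>()
    private val values = mutableListOf<FloatArray>()

    override val length get() = keys.size

    override fun append(key: FloatArray, value: FloatArray) {
        keys += key
        values += value
    }

    override fun readKeys(from: Int, to: Int) = keys.subList(from, to).toList()
    override fun readValues(from: Int, to: Int) = values.subList(from, to).toList()
}
```

Separating key and value reads is what makes asymmetric K/V encoding policies possible: each side can sit behind a different compression codec while callers keep the same range-read interface.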
Add complete TurboQuant implementation for KV-cache compression.

Core kernels (common Kotlin):
- RandomRotation: Walsh-Hadamard + random sign flips, O(d log d)
- ScalarQuantizer: per-group symmetric quantization, 2/3/4/8-bit
- BitPacker: compact bit-packing/unpacking for all bit widths
- QjlResidual: Quantized Johnson-Lindenstrauss residual stage

End-to-end codec:
- TurboQuantCodec: full encode/decode pipeline (PolarOnly + PolarPlusQjl)
- TurboQuantKvCacheStore: compressed KV cache with per-head TurboQuant blocks
- Asymmetric K/V policies (different bit budgets for keys vs. values)

Encoding types:
- TurboQuantPolar and TurboQuantPolarQjl added to the sealed TensorEncoding hierarchy

Presets:
- safe-lowbit (Q8_0-K + TurboQuant-4-V)
- balanced (TurboQuant-4 / TurboQuant-4)
- experimental-max (TurboQuant-3 / TurboQuant-3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
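The RandomRotation stage named above can be sketched as an in-place Walsh-Hadamard butterfly with random sign flips, which is O(d log d) as the commit states. The function name, the 1/√d normalization, and the sign-vector parameter are assumptions for the example; with normalization and no sign flips, applying the transform twice recovers the input, since H·H = d·I.

```kotlin
// Sketch: random sign flips + in-place Walsh-Hadamard transform, O(d log d).
import kotlin.math.sqrt

fun randomRotation(x: FloatArray, signs: BooleanArray) {
    val d = x.size
    require(d > 0 && d and (d - 1) == 0) { "dimension must be a power of two" }
    // Random sign flips decorrelate coordinates before the transform.
    for (i in 0 until d) if (signs[i]) x[i] = -x[i]
    // Walsh-Hadamard butterfly: log2(d) passes over the array.
    var h = 1
    while (h < d) {
        var i = 0
        while (i < d) {
            for (j in i until i + h) {
                val a = x[j]
                val b = x[j + h]
                x[j] = a + b
                x[j + h] = a - b
            }
            i += h * 2
        }
        h *= 2
    }
    // 1/sqrt(d) makes the rotation orthonormal (norm-preserving).
    val norm = 1f / sqrt(d.toFloat())
    for (i in 0 until d) x[i] *= norm
}
```

Norm preservation is why this helps quantization: the rotation spreads outlier coordinates across the whole vector, flattening the dynamic range the downstream scalar quantizer has to cover.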
- @kvcache and @KvCacheBypass annotations for declarative KV-cache compression configuration on attention layers
- JvmTurboQuantKernels: SIMD-accelerated abs-max, quantize, dequantize, and Walsh-Hadamard butterfly using the Java Vector API
- TurboQuantBenchmarks: JMH benchmarks for encode/decode throughput, bit-packing, random rotation, and KV-cache append/read performance

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed task document covering:
- Metal compute shaders for TurboQuant encode/decode
- Fused dequant+SDPA kernel design (avoids materializing decompressed K/V)
- Unified-memory KV cache (zero CPU↔GPU copies on Apple Silicon)
- Kotlin/Native cinterop setup for Metal.framework
- 5-phase implementation plan with 20 subtasks
- Shader signatures and parameter structs
- Performance targets and acceptance criteria

Refs #452
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make TurboQuant adoption a one-liner for skainet-transformers:
- KvCacheStore.turboQuant("balanced", ...) factory method
- KvCacheStore.dense() and .fromPreset() convenience factories
- TurboQuantPresets.forModel() lookup by preset name + model dims
- KvCacheAnnotationResolver: resolve @kvcache annotations to stores
- TurboQuantUsage: documented integration guide with compilable examples
showing cache creation, attention layer wiring, and generation loop
Any GGUF model (LLaMA, Mistral, Gemma, Qwen) can use TurboQuant
immediately — it compresses the KV cache at runtime, not model weights.
Refs #452
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
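The preset-by-name lookup behind the one-liner adoption story can be illustrated with a toy registry. The three preset names and their K/V bit pairings come from the commits above; the `BitBudget` type, the registry object, and the numeric modeling of Q8_0 keys as 8 bits are invented for the example.

```kotlin
// Toy preset registry mapping names to asymmetric K/V bit budgets.
data class BitBudget(val keyBits: Int, val valueBits: Int)

object ToyTurboQuantPresets {
    private val presets = mapOf(
        // safe-lowbit keeps keys at Q8_0 precision (modeled here as 8 bits)
        "safe-lowbit" to BitBudget(keyBits = 8, valueBits = 4),
        "balanced" to BitBudget(keyBits = 4, valueBits = 4),
        "experimental-max" to BitBudget(keyBits = 3, valueBits = 3),
    )

    fun forName(name: String): BitBudget =
        presets[name] ?: error("unknown preset: $name")
}
```

Keys typically tolerate less quantization error than values (they feed the softmax logits), which is why the conservative preset spends its extra bits on the K side.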