Releases: zerfoo/ztensor

v1.8.0

29 Apr 04:05

1.8.0 (2026-04-29)

Features

  • compute: bulk-upload F32 weights to one device buffer (#103) (9ca83f6)

v1.7.0

28 Apr 06:25

1.7.0 (2026-04-20)

Features

  • compute: add StepScope + MarkStepBoundary for training-loop arena reset (6356dde)
  • compute: configurable GPU arena size via ZERFOO_ARENA_SIZE_GB (cab6067)
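
As a rough illustration of how an environment-driven arena size might be resolved, here is a minimal Go sketch. Only the variable name ZERFOO_ARENA_SIZE_GB comes from the release note; the helper `arenaSizeBytes`, the whole-GiB integer parsing, and the 2 GiB fallback are assumptions made for this example, not ztensor's actual behavior.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// arenaSizeBytes is a hypothetical helper, not part of ztensor.
// It interprets raw as a whole number of GiB and falls back to
// defaultGB when raw is empty, non-numeric, or not positive.
func arenaSizeBytes(raw string, defaultGB int64) int64 {
	gb, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || gb <= 0 {
		gb = defaultGB
	}
	return gb << 30 // GiB -> bytes
}

func main() {
	// The 2 GiB default is an illustrative assumption.
	size := arenaSizeBytes(os.Getenv("ZERFOO_ARENA_SIZE_GB"), 2)
	fmt.Printf("arena size: %d bytes\n", size)
}
```

Falling back to a default on malformed input keeps startup robust; a stricter variant could return an error instead so that typos in the variable are not silently ignored.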

v1.6.0

17 Apr 18:52

1.6.0 (2026-04-17)

Features

  • compute: T1.2 add ensureNotCapturing guard and ErrCaptureIncompatibleAllocation (18e1f5a)
  • compute: T2.1a add WithCapture helper for capture-aware graph lifecycle (d60c902)
  • compute: T2.2 capture-aware allocWeight routing via cudaMallocAsync (2a723b7)
  • compute: T2.3 pre-allocate workspace buffers at UploadWeights to avoid capture-time alloc (9f9eb5c)
  • cuda: T1.1 add StreamCaptureStatus purego binding (879cbc9)
  • graph: add LMHead to nonCapturableOps (07ba531)
  • graph: T4.1 add capture watchdog with 30s timeout and status sampling (b3066a5)
  • graph: T99.1.2 mark Gemma4PLECombinedProducer non-capturable (6c855a9)

Bug Fixes

  • graph: T98.2.3 don't pool-release pass-through node inputs (6ecf8db)

v1.5.0

10 Apr 23:43

1.5.0 (2026-04-10)

Features

  • compute: add AllocDeviceFloat32 and CopyToDevice to FusedEncoderProvider (8d6c90b)
  • compute: add fused PatchTST encoder layer CUDA kernels (4dfd46e)

Bug Fixes

  • compute: GPUEngine.Reshape honors dst argument (18a53fe)
  • compute: reuse dst GPU memory instead of allocating per call (#84) (26bbd49)
  • kernels: rename kernel_add in fused_encoder_bwd to avoid symbol clash (716bbd6)

v1.4.0

06 Apr 21:15

1.4.0 (2026-04-06)

Features

  • graph: add NewPJRTClient for external PJRT usage (c8db036)
  • graph: add PJRTPlan execution wrapper with KV cache state management (3e5cb40)

Bug Fixes

  • ci: exclude metal and pjrt from go vet (5a7fdc3)
  • kernels: update GemvQ5_0F32 test to match qhOffset/qsOffset signature (70f8fd5)

v1.3.0

03 Apr 01:09

1.3.0 (2026-04-03)

Features

  • graph: add CompilePJRT for PJRT backend compilation (dfd77a4)
  • pjrt: add buffer management (host-device transfer, readback, lifecycle) (9b5dc75)
  • pjrt: add KV cache I/O rewriting and executable cache (c8decc5)
  • pjrt: add PJRT C API purego bindings for plugin loading, client, and device (c675807)
  • pjrt: add program execution, serialization, and full StableHLO emitter (382ea0a)
  • pjrt: add StableHLO program compilation wrapper (7fcdde7)
  • stablehlo: add emitter for element-wise and unary ops (499cef2)
  • stablehlo: add emitter for MatMul and structural ops (13d87df)
  • stablehlo: add emitter for reductions and Softmax decomposition (c07b287)
  • stablehlo: add MLIR type system and SSA naming (7c68d1e)
  • stablehlo: add shape inference for arithmetic ops (cac094e)
  • stablehlo: add shape inference for structural ops (8bf132c)

Bug Fixes

  • pjrt: centralize internal/cuda import in pjrt.go (aa8c170)
  • pjrt: remove duplicate ccall/goStringN declarations (3e5fba9)

v1.2.0

02 Apr 07:26

1.2.0 (2026-04-01)

Features

  • cuda: add Q6_K, Q5_K, Q5_0 GPU dequant kernels for M>1 prefill (d57e37e)
  • cuda: add Q8 Gather kernel for GPU embedding lookup (30eb9c4)
  • tensor: add QuantizeQ4K for float32 to Q4_K quantization (d0d3a82)

Bug Fixes

  • compute: add Q4KStorage to UploadWeights F32 skip list (cc071b6)
  • compute: CPU dequant fallback for Q4_K when K%256!=0 (f50ffa7)
  • compute: use dequant+cuBLAS for Q4_K when K%256!=0 (5f21cbb)
  • compute: use pool-backed GPUStorage for pool allocations (4367330)
  • cuda: byte-wise loads in Q5_0 GEMV for ARM64 alignment (5f19e54)
  • kernels: check null function pointer in FusedSoftmaxVMulF32 (935ad61)
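
The two Q4_K fixes above hinge on the K%256 condition: Q4_K groups weights into super-blocks of 256 values, so a fused quantized kernel can only walk a row directly when K is a multiple of 256. The Go sketch below shows what such a dispatch decision could look like. The function `q4kMatMulPath` and the path labels are hypothetical names for illustration; the real routing inside ztensor may differ.

```go
package main

import "fmt"

// q4kSuperBlock is the number of values in one Q4_K super-block.
const q4kSuperBlock = 256

// q4kMatMulPath is a hypothetical dispatcher, not ztensor's API.
// It returns a label for the path a Q4_K MatMul with inner
// dimension k would take under the assumption described above.
func q4kMatMulPath(k int) string {
	if k > 0 && k%q4kSuperBlock == 0 {
		return "fused-q4k" // rows split cleanly into super-blocks
	}
	return "dequant-then-gemm" // dequantize to F32, then a dense GEMM
}

func main() {
	for _, k := range []int{4096, 1000} {
		fmt.Printf("K=%d -> %s\n", k, q4kMatMulPath(k))
	}
}
```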

Performance Improvements

  • cuda: separated GPU layout for Q5_0 GEMV (d456c39)

v1.1.3

01 Apr 04:34

1.1.3 (2026-04-01)

Bug Fixes

  • compute: add Q5_0Storage B-weight handling to CPU MatMul (e7927e5)
  • compute: Q5_0 GEMV byte-wise loads for ARM64 alignment (5c7ec7a)
  • compute: skip Q4Storage in UploadWeights F32 loop (revert overaggressive skip) (2e91650)
  • compute: skip transpose reshape fast-path for square matrices (eab19d0)

v1.1.2

31 Mar 06:18

1.1.2 (2026-03-31)

Bug Fixes

  • compute: upload CPU fallback MatMul results to GPU for device consistency (5bc914b)

v1.1.1

31 Mar 05:30

1.1.1 (2026-03-31)

Bug Fixes

  • cuda: remove float4 alignment requirement from gemv_q8_kernel (1313605)
  • cuda: remove float4 alignment requirement from gemv_q8_kernel (34aba3b)