Releases: m96-chan/PyGPUkit

v0.2.19 - FLUX.1 Image Generation

01 Jan 19:25 · 7adbe5f


Highlights

FLUX.1 Image Generation

Text-to-image generation with Black Forest Labs' FLUX.1 model:

  • Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
  • Flow matching Euler scheduler
  • GPU-native operations (transpose, batched matmul, RoPE)
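
The Euler scheduler integrates the transformer's predicted velocity along a discretized sigma schedule. A minimal NumPy sketch of the update rule (the `velocity` callable and the linear schedule are illustrative stand-ins, not PyGPUkit's API):

```python
import numpy as np

def euler_flow_matching(velocity, latents, num_steps=4):
    """Flow matching Euler sampling: integrate dx/dsigma = v from
    sigma=1 (noise) to sigma=0 (image). `velocity` stands in for the
    FLUX.1 transformer; the linear sigma schedule is an assumption."""
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)  # noise level per step
    x = latents
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity(x, sigma)            # model predicts the flow direction
        x = x + (sigma_next - sigma) * v  # explicit Euler step toward sigma=0
    return x
```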

Lazy Model Loading with Streaming

Memory-efficient model loading strategies:

  • StreamingStrategy.EAGER - Load all at once (default)
  • StreamingStrategy.PROGRESSIVE - Load during first forward
  • StreamingStrategy.LAYER_BY_LAYER - Minimal memory usage
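
A hedged usage sketch; the strategy names above come from this release, but the loader entry point shown here is an assumed name:

```python
# Hypothetical import path and loader; only the StreamingStrategy
# names are taken from these notes.
from pygpukit.models import StreamingStrategy, load_model

model = load_model(
    "flux1-schnell.safetensors",
    strategy=StreamingStrategy.LAYER_BY_LAYER,  # stream weights one layer at a time
)
```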

cuBLAS Dynamic Loader

  • Runtime DLL loading; no CUDA Toolkit required at build time
  • Auto-detection of cuBLASLt versions (13/12/11)
  • Graceful fallback to native kernels
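
The probing order can be pictured with plain ctypes; the library names below are the standard CUDA names for cuBLASLt 13/12/11, and returning None stands in for the native-kernel fallback:

```python
import ctypes

def load_cublaslt():
    """Probe cuBLASLt major versions at runtime; None signals fallback
    to native kernels. A sketch of the loader's strategy as described
    in the notes, using the standard CUDA library names."""
    candidates = [
        "cublasLt64_13.dll", "cublasLt64_12.dll", "cublasLt64_11.dll",  # Windows
        "libcublasLt.so.13", "libcublasLt.so.12", "libcublasLt.so.11",  # Linux
    ]
    for name in candidates:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None  # caller falls back to built-in native kernels

backend = "cuBLASLt" if load_cublaslt() else "native kernels"
```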

C++ Kernel Profiler

  • Built-in CUDA kernel profiling with minimal overhead
  • Per-kernel timing statistics
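
A sketch of the kind of per-kernel statistics the profiler accumulates; this Python mock-up illustrates the report, not the actual C++ interface:

```python
from collections import defaultdict

class KernelStats:
    """Accumulates per-kernel timings and prints summary statistics."""
    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, kernel_name, elapsed_ms):
        self.samples[kernel_name].append(elapsed_ms)

    def report(self):
        for name, times in sorted(self.samples.items()):
            total = sum(times)
            print(f"{name}: calls={len(times)} total={total:.3f}ms "
                  f"avg={total / len(times):.3f}ms "
                  f"min={min(times):.3f}ms max={max(times):.3f}ms")
```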

HuggingFace T5 Encoder

  • Sharded safetensors support
  • Full T5 encoder for FLUX/SD3
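
Sharded checkpoints follow the standard HuggingFace convention: a model.safetensors.index.json file maps tensor names to shard files. An illustrative loader using the safetensors package (not PyGPUkit's internal code):

```python
import json
from pathlib import Path
from safetensors import safe_open

def load_sharded(model_dir):
    """Resolve tensors across shards via the HF index file's weight_map."""
    index = json.loads(
        (Path(model_dir) / "model.safetensors.index.json").read_text()
    )
    shards = {}
    for name, shard in index["weight_map"].items():
        shards.setdefault(shard, []).append(name)  # group names by shard file
    tensors = {}
    for shard, names in shards.items():
        with safe_open(str(Path(model_dir) / shard), framework="np") as f:
            for name in names:
                tensors[name] = f.get_tensor(name)
    return tensors
```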

DiT Architecture

  • PixArt transformer with AdaLN-Zero
  • Self/cross attention with GQA
  • GEGLU FFN
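
AdaLN-Zero modulates a normalized input with condition-derived shift/scale and gates the sublayer output through a zero-initialized weight, so each block starts as the identity. A NumPy reference of the pattern (illustrative, not the kernel):

```python
import numpy as np

def adaln_zero(x, cond_shift, cond_scale, cond_gate, sublayer):
    """AdaLN-Zero block: normalize, modulate with conditioning, gate."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    h = (x - mu) / np.sqrt(var + 1e-6)        # LayerNorm without affine params
    h = h * (1 + cond_scale) + cond_shift     # condition-derived modulation
    return x + cond_gate * sublayer(h)        # zero-init gate: identity at start
```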

New GPU Operations

  • transpose_4d_0213, transpose_3d_012
  • gpu_batched_matmul, gpu_softmax, gpu_apply_rope
  • cross_attention, conv2d, group_norm
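
Reference semantics for the layout and math ops, inferred from the names (the 0213 suffix reads as an axis permutation); NumPy equivalents, not the CUDA implementations:

```python
import numpy as np

def transpose_4d_0213_ref(x):    # (a, b, c, d) -> (a, c, b, d)
    return np.ascontiguousarray(x.transpose(0, 2, 1, 3))

def batched_matmul_ref(a, b):    # (B, M, K) @ (B, K, N) -> (B, M, N)
    return a @ b

def softmax_ref(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)
```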

Known Issues

  • FLUX.1 performance needs optimization (#187)

Full Changelog

v0.2.18...v0.2.19

v0.2.18

30 Dec 22:59


What's New in v0.2.18

Major Codebase Refactoring

Complete modularization of the codebase for better maintainability:

  • Split monolithic files into modular .inl components
  • Reorganized matmul kernel directory structure
  • Standardized GEMM/GEMV naming conventions
  • Modular pybind11 bindings

Kokoro-82M TTS

Text-to-speech synthesis with Japanese/English support.

Positional Encoding Operations

New neural network operations for attention mechanisms:

Function              Description
pope_init_encoding    Sinusoidal positional encoding (PoPE)
pope_inplace          Apply additive encoding to Q/K
alibi_init_slopes     ALiBi head-specific slopes
alibi_compute_bias    ALiBi attention bias matrix
rope_init_ntk_aware   NTK-aware RoPE for context extension
rope_init_yarn        YaRN dimension-wise interpolation
rope_init_linear      Linear position interpolation
relu2                 ReLU squared activation (Primer)
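
As a reference for the ALiBi entries above, the standard formulation (Press et al.) that alibi_init_slopes/alibi_compute_bias presumably implement, sketched in NumPy:

```python
import numpy as np

def alibi_slopes_ref(num_heads):
    """Head h gets slope 2**(-8*(h+1)/num_heads) (power-of-two head counts)."""
    return np.array([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias_ref(seq_len, slopes):
    """Per-head linear distance penalty added to attention logits."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    # 0 on the diagonal, increasingly negative for distant past keys;
    # positions with k > q are handled by the causal mask, applied separately.
    return slopes[:, None, None] * (k - q)
```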

Unified Benchmark Suite

New scripts/benchmark.py for comprehensive performance testing.

QAT/Pruning/Sparsity Config

Model config support for quantization-aware training, pruning, and sparsity patterns.

Optimized BF16 GEMV

Achieves 98–101% of peak memory bandwidth for typical LLM dimensions.
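
GEMV is memory-bound, so efficiency is bytes moved over elapsed time, compared against peak bandwidth. A worked example with illustrative numbers (1792 GB/s is the RTX 5090 spec; the timing is hypothetical):

```python
def gemv_bandwidth_gbs(m, n, elapsed_s, bytes_per_el=2):
    """Effective bandwidth of a BF16 GEMV y = A @ x.
    Traffic is dominated by the m*n weight matrix (bf16 = 2 bytes)."""
    bytes_moved = m * n * bytes_per_el   # x and y are negligible for large m, n
    return bytes_moved / elapsed_s / 1e9

# Illustrative: an 8192x8192 GEMV finishing in 75 microseconds
eff = gemv_bandwidth_gbs(8192, 8192, 75e-6)  # ~1790 GB/s
utilization = eff / 1792                     # ~100% of RTX 5090 peak
```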

W8A16 GEMM Fix

Fixed MMA A-fragment register mapping. MoE models now produce correct output.


Full Changelog: v0.2.17...v0.2.18

v0.2.17

28 Dec 12:47


What's Changed

Full Changelog: v0.2.16...v0.2.17

v0.2.16 - MoE, Thinking Model, SM120 Kernels

28 Dec 04:42


What's New

Added

  • MoE (Mixture of Experts): Full Mixtral support with TopK routing and grouped GEMM (routing sketch after this list)
  • Thinking Model: Qwen3 <think>...</think> block parsing
  • GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
  • GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
  • Claude Code Skills: Build, benchmark, lint, test automation
  • Subagents: kernel-reviewer, perf-analyzer, api-designer
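
A NumPy sketch of the TopK routing step referenced above; illustrative only, the grouped GEMM itself runs in CUDA:

```python
import numpy as np

def topk_route(logits, k=2):
    """Mixtral-style routing: pick k experts per token and
    softmax-normalize their gate logits."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]   # top-k expert ids
    gates = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)             # renormalized weights
    return idx, gates  # per-row expert IDs feed the grouped GEMM
```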

Changed

  • Kernel directory restructure: {gemm|gemv}/{input}/{output}/{arch}/
  • Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)

Performance Highlights (RTX 5090)

Kernel            Performance
W8A8 GEMV         6–18× faster than BF16
W8A16 GEMM        SM120 CUTLASS support
Grouped GEMM v2   Per-row expert IDs for MoE

See CHANGELOG.md for full details.

v0.2.15

26 Dec 09:25


What's Changed

  • v0.2.15: FP8 I/O GEMM, Pure NVF4 (446 TFLOPS), New Math Ops by @m96-chan in #117

Full Changelog: v0.2.14...v0.2.15

v0.2.14

23 Dec 06:20


Full Changelog: v0.2.13...v0.2.14

v0.2.13

23 Dec 05:46


Full Changelog: v0.2.12...v0.2.13

v0.2.12

22 Dec 17:48


What's Changed

  • feat(audio): add advanced audio processing kernels (v0.2.12) by @m96-chan in #99

Full Changelog: v0.2.11...v0.2.12

v0.2.11

22 Dec 12:04


What's Changed

  • v0.2.11: Driver API, Dual CUDA Build, RTX 5090 Support by @m96-chan in #94
  • chore: bump version to 0.2.11 by @m96-chan in #95

Full Changelog: v0.2.10...v0.2.11

v0.2.10: Dynamic cuBLASLt + CUDA Graph Optimizations

18 Dec 12:42 · cbbf111


What's New in v0.2.10

Dynamic cuBLASLt Loading

  • Driver-only deployment: cuBLASLt is now loaded dynamically at runtime
  • No CUDA Toolkit installation required for end users
  • Automatic fallback to CUTLASS kernels when cuBLASLt is unavailable

Performance Optimizations

  • GEMM descriptor caching: 2.67x faster matmul operations (395ms → 148ms for 224 matmuls)
  • Zero-allocation decode path: Eliminated GPU memory allocations during token generation
  • CUDA Graph integration: Reduced kernel launch overhead
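
The descriptor cache keys expensive cuBLASLt descriptor setup on the problem shape, so repeated matmuls of the same shape skip it entirely. A Python sketch of the pattern (the real cache lives in C++):

```python
class MatmulDescriptorCache:
    """Cache GEMM descriptors by (M, N, K, dtype); the pattern behind
    the 2.67x gain. `create_descriptor` is a hypothetical stand-in for
    the expensive cuBLASLt setup call."""
    def __init__(self, create_descriptor):
        self._create = create_descriptor
        self._cache = {}

    def get(self, m, n, k, dtype):
        key = (m, n, k, dtype)
        if key not in self._cache:       # built once per unique shape
            self._cache[key] = self._create(m, n, k, dtype)
        return self._cache[key]          # reused on every later matmul
```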

LLM Inference Improvements

  • Flash Attention for prefill phase
  • Flash Decoding for decode phase
  • GPU-based top-k sampling with temperature scaling
  • Chat template support (Qwen, LLaMA, Mistral, ChatML)
  • Streaming generation with generate_stream()
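
A hedged usage sketch of streaming generation; generate_stream() is named in these notes, while the loader call and its arguments are assumptions:

```python
# Hypothetical import path and loader arguments.
from pygpukit.llm import load_model

model = load_model("qwen2.5-7b-instruct", chat_template="qwen")
prompt = "Explain CUDA graphs in one paragraph."
for token in model.generate_stream(prompt, max_new_tokens=128, temperature=0.7):
    print(token, end="", flush=True)  # tokens arrive as they are sampled
```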

New Features

  • GPUArray.narrow() for zero-copy tensor slicing
  • Fused QKV projection infrastructure
  • GPU position buffer for CUDA Graph replay without recapture
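
A hedged sketch of zero-copy slicing; narrow() is named above, while torch-style narrow(dim, start, length) semantics and the constructor shown are assumptions:

```python
# Hypothetical usage; the constructor and argument order are assumed.
from pygpukit import GPUArray

kv_cache = GPUArray.zeros((32, 4096, 128))  # assumed: (heads, max_len, head_dim)
seq_len = 1024
window = kv_cache.narrow(1, 0, seq_len)     # view of the first seq_len positions
# No copy is made: `window` aliases kv_cache memory, so kernels that
# write through it update the cache in place.
```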

Build System

  • SM100/SM120 (Blackwell) CUTLASS kernel infrastructure
  • CUDA 13.x compatibility improvements