Releases: m96-chan/PyGPUkit
v0.2.19 - FLUX.1 Image Generation
Highlights
FLUX.1 Image Generation
Text-to-image generation with Black Forest Labs' FLUX.1 model:
- Full FLUX.1-schnell transformer (19 joint + 38 single blocks)
- Flow matching Euler scheduler (update rule sketched after this list)
- GPU-native operations (transpose, batched matmul, RoPE)
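The flow matching Euler scheduler advances the latent along a velocity field predicted by the transformer, stepping a sigma schedule from pure noise toward the clean image. A minimal NumPy sketch of that update rule, with an illustrative `velocity_model` callable and linear sigma schedule standing in for PyGPUkit's actual scheduler API:

```python
import numpy as np

def flow_matching_euler(velocity_model, latent, num_steps=4):
    """Explicit Euler sampler for a flow matching model.

    The model predicts a velocity v(x_t, t); each step moves the sample
    along that velocity for the interval between adjacent sigmas.
    `velocity_model` stands in for the FLUX transformer forward pass.
    """
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)  # noise level 1 -> 0
    x = latent
    for i in range(num_steps):
        v = velocity_model(x, sigmas[i])      # predicted velocity at (x_t, t)
        dt = sigmas[i + 1] - sigmas[i]        # negative step toward sigma = 0
        x = x + v * dt                        # Euler update
    return x

# Toy usage with a stand-in "model" and a random latent.
latent = np.random.default_rng(0).standard_normal((4, 64))
out = flow_matching_euler(lambda x, t: x, latent, num_steps=4)
```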
Lazy Model Loading with Streaming
Memory-efficient model loading strategies:
- `StreamingStrategy.EAGER` - Load all at once (default)
- `StreamingStrategy.PROGRESSIVE` - Load during first forward
- `StreamingStrategy.LAYER_BY_LAYER` - Minimal memory usage (selection sketch follows this list)
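The three strategies trade startup latency against peak memory. A hedged sketch of how a loader could branch on them; only the `StreamingStrategy` member names come from the notes above, while `load_checkpoint` and its behavior are illustrative stand-ins rather than PyGPUkit's confirmed API:

```python
from enum import Enum, auto

class StreamingStrategy(Enum):
    # Member names mirror the release notes; values are illustrative.
    EAGER = auto()           # materialize every weight up front (default)
    PROGRESSIVE = auto()     # upload weights during the first forward pass
    LAYER_BY_LAYER = auto()  # keep only the active layer resident

def load_checkpoint(path, strategy=StreamingStrategy.EAGER):
    """Hypothetical loader showing where each strategy would branch."""
    if strategy is StreamingStrategy.EAGER:
        plan = "read all shards, upload every tensor now"
    elif strategy is StreamingStrategy.PROGRESSIVE:
        plan = "memory-map shards, upload each tensor on first use"
    else:  # LAYER_BY_LAYER
        plan = "upload one layer, run it, free it, continue"
    return {"path": path, "plan": plan}

print(load_checkpoint("flux1-schnell.safetensors",
                      StreamingStrategy.LAYER_BY_LAYER))
```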
cuBLAS Dynamic Loader
- Runtime DLL loading without compile-time CUDA Toolkit
- Auto-detection of cuBLASLt versions (13/12/11)
- Graceful fallback to native kernels
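The runtime-loading behavior is easy to illustrate: probe the cuBLASLt shared library for each supported CUDA major version and fall back when none load. A minimal ctypes sketch assuming NVIDIA's stock library names; PyGPUkit's actual loader is C++ and may probe additional paths:

```python
import ctypes
import sys

# Candidate cuBLASLt names for CUDA 13/12/11, newest first (assumed names).
_CANDIDATES = (
    ["cublasLt64_13.dll", "cublasLt64_12.dll", "cublasLt64_11.dll"]
    if sys.platform == "win32"
    else ["libcublasLt.so.13", "libcublasLt.so.12", "libcublasLt.so.11"]
)

def load_cublaslt():
    """Return a handle to the first cuBLASLt library that loads, else None."""
    for name in _CANDIDATES:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue
    return None  # caller falls back to native/CUTLASS kernels

lib = load_cublaslt()
print("cuBLASLt loaded" if lib else "falling back to native kernels")
```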
C++ Kernel Profiler
- Built-in CUDA kernel profiling with minimal overhead
- Per-kernel timing statistics
HuggingFace T5 Encoder
- Sharded safetensors support
- Full T5 encoder for FLUX/SD3
DiT Architecture
- PixArt transformer with AdaLN-Zero
- Self/cross attention with GQA
- GEGLU FFN (reference sketch after this list)
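GEGLU replaces the plain MLP with a gated variant: one projection passes through GELU and is multiplied elementwise by a second, ungated projection before the output projection. A NumPy reference of that formulation (shapes are illustrative; this is the standard GEGLU definition, not PyGPUkit's kernel):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as typically used in transformer FFNs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_out):
    """GEGLU feed-forward: (GELU(x @ W_gate) * (x @ W_up)) @ W_out."""
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))                 # hidden size 8
y = geglu_ffn(x,
              rng.standard_normal((8, 16)),     # gate projection
              rng.standard_normal((8, 16)),     # up projection
              rng.standard_normal((16, 8)))     # output projection
print(y.shape)  # (2, 8)
```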
New GPU Operations
`transpose_4d_0213`, `transpose_3d_012`, `gpu_batched_matmul`, `gpu_softmax`, `gpu_apply_rope`, `cross_attention`, `conv2d`, `group_norm` (CPU reference for a few of these below)
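The kernel names encode their intended semantics; as a CPU cross-check, NumPy equivalents for a few of them might look like the following (axis conventions are inferred from the names, not from PyGPUkit's signatures):

```python
import numpy as np

rng = np.random.default_rng(0)

# transpose_4d_0213: permute a (B, H, S, D) tensor with axes (0, 2, 1, 3),
# the usual head <-> sequence swap around attention.
x = rng.standard_normal((2, 4, 8, 16))           # (B, H, S, D)
x_0213 = np.transpose(x, (0, 2, 1, 3))           # (B, S, H, D)

# gpu_batched_matmul: one independent matmul per leading batch index.
a = rng.standard_normal((6, 8, 16))
b = rng.standard_normal((6, 16, 32))
c = a @ b                                        # (6, 8, 32)

# gpu_softmax: numerically stable softmax over the last axis.
logits = rng.standard_normal((2, 8))
z = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
```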
Known Issues
- FLUX.1 performance needs optimization (#187)
Full Changelog: v0.2.18...v0.2.19
v0.2.18
What's New in v0.2.18
Major Codebase Refactoring
Complete modularization of the codebase for better maintainability:
- Split monolithic files into modular `.inl` components
- Reorganized matmul kernel directory structure
- Standardized GEMM/GEMV naming conventions
- Modular pybind11 bindings
Kokoro-82M TTS
Text-to-speech synthesis with Japanese/English support.
Positional Encoding Operations
New neural network operations for attention mechanisms:
| Function | Description |
|---|---|
| `pope_init_encoding` | Sinusoidal positional encoding (PoPE) |
| `pope_inplace` | Apply additive encoding to Q/K |
| `alibi_init_slopes` | ALiBi head-specific slopes |
| `alibi_compute_bias` | ALiBi attention bias matrix |
| `rope_init_ntk_aware` | NTK-aware RoPE for context extension |
| `rope_init_yarn` | YaRN dimension-wise interpolation |
| `rope_init_linear` | Linear position interpolation |
| `relu2` | ReLU squared activation (Primer) |
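As a CPU reference for the two ALiBi entries in the table: the per-head slopes form a geometric sequence and the bias is the slope times the signed query-key distance. A NumPy sketch for a power-of-two head count (the GPU kernels' exact signatures may differ):

```python
import numpy as np

def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (power-of-two head counts)."""
    return np.array([2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)])

def alibi_bias(num_heads, seq_len):
    """Additive attention bias of shape (heads, query, key)."""
    slopes = alibi_slopes(num_heads)              # (H,)
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]        # key - query; <= 0 in the causal part
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(num_heads=8, seq_len=5)
print(bias.shape)  # (8, 5, 5); added to attention logits before softmax
```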
Unified Benchmark Suite
New scripts/benchmark.py for comprehensive performance testing.
QAT/Pruning/Sparsity Config
Model config support for quantization-aware training, pruning, and sparsity patterns.
Optimized BF16 GEMV
98-101% of peak memory bandwidth for typical LLM dimensions.
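That figure can be sanity-checked with simple arithmetic: a BF16 GEMV is memory-bound and streams the weight matrix once, so utilization is roughly bytes moved divided by kernel time, over peak bandwidth. A sketch with made-up numbers (the shape, timing, and peak below are illustrative, not measurements from this release):

```python
# Rough utilization estimate for a memory-bound BF16 GEMV.
M, N = 4096, 14336         # illustrative weight shape
bytes_per_elem = 2         # BF16
kernel_time_s = 118e-6     # illustrative measured kernel time
peak_bw = 1.008e12         # illustrative device peak memory bandwidth, bytes/s

bytes_moved = M * N * bytes_per_elem       # weight traffic dominates a GEMV
achieved = bytes_moved / kernel_time_s     # achieved bytes/s
print(f"utilization: {achieved / peak_bw:.1%}")   # ~98.7% with these numbers
```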
W8A16 GEMM Fix
Fixed MMA A-fragment register mapping. MoE models now produce correct output.
Full Changelog: v0.2.17...v0.2.18
v0.2.17
What's Changed
Full Changelog: v0.2.16...v0.2.17
v0.2.16 - MoE, Thinking Model, SM120 Kernels
What's New
Added
- MoE (Mixture of Experts): Full Mixtral support with TopK routing, grouped GEMM (routing sketch after this list)
- Thinking Model: Qwen3 `<think>...</think>` block parsing
- GEMV Kernels (SM120): FP8/FP8 (W8A8), NVF4/NVF4 (W4A4), Int4
- GEMM Kernels (SM120): W8A16, Int8 native (dp4a), Int4 via Int8, Grouped GEMM v2
- Claude Code Skills: Build, benchmark, lint, test automation
- Subagents: kernel-reviewer, perf-analyzer, api-designer
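TopK routing, noted in the MoE bullet above, selects the k highest-scoring experts per token from the router logits and renormalizes their gate weights; the grouped GEMM then runs the chosen experts in one batched launch. A NumPy sketch of the routing step (Mixtral-style, k = 2; the general technique, not PyGPUkit's kernel signature):

```python
import numpy as np

def topk_route(router_logits, k=2):
    """Return (expert_ids, gate_weights) per token, Mixtral-style."""
    # Indices of the k largest logits per token.
    expert_ids = np.argpartition(router_logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(router_logits, expert_ids, axis=-1)
    # Softmax over only the selected experts so the gates sum to 1.
    z = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return expert_ids, gates

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 8))        # 4 tokens, 8 experts
ids, gates = topk_route(logits)
print(ids.shape, gates.sum(axis=-1))        # (4, 2), all ones
```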
Changed
- Kernel directory restructure: `{gemm|gemv}/{input}/{output}/{arch}/`
- Removed redundant slow kernels (FP8 GEMV basic, Int8 via FP8)
Performance Highlights (RTX 5090)
| Kernel | Performance |
|---|---|
| W8A8 GEMV | 6-18x faster than BF16 |
| W8A16 GEMM | SM120 CUTLASS support |
| Grouped GEMM v2 | Per-row expert IDs for MoE |
See CHANGELOG.md for full details.
v0.2.15
What's Changed
Full Changelog: v0.2.14...v0.2.15
v0.2.14
Full Changelog: v0.2.13...v0.2.14
v0.2.13
Full Changelog: v0.2.12...v0.2.13
v0.2.12
What's Changed
Full Changelog: v0.2.11...v0.2.12
v0.2.11
v0.2.10: Dynamic cuBLASLt + CUDA Graph Optimizations
What's New in v0.2.10
Dynamic cuBLASLt Loading
- Driver-only deployment: cuBLASLt is now loaded dynamically at runtime
- No CUDA Toolkit installation required for end users
- Automatic fallback to CUTLASS kernels when cuBLASLt is unavailable
Performance Optimizations
- GEMM descriptor caching: 2.67x faster matmul operations (395ms → 148ms for 224 matmuls; caching sketch after this list)
- Zero-allocation decode path: Eliminated GPU memory allocations during token generation
- CUDA Graph integration: Reduced kernel launch overhead
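Descriptor caching amortizes the setup cost of shapes that repeat every decode step: key the prepared matmul plan by its problem signature and reuse it on later calls. A hedged Python sketch of the pattern (the plan object is a placeholder, not the actual cuBLASLt descriptor types):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def get_matmul_plan(m, n, k, dtype):
    """Build-once plan keyed by the problem signature.

    The real path would create cuBLASLt operation/layout descriptors and a
    heuristic algorithm choice; here the plan is just a placeholder tuple.
    """
    return ("plan", m, n, k, dtype)

# Repeated decode-step shapes hit the cache instead of rebuilding descriptors.
for _ in range(224):
    get_matmul_plan(1, 4096, 4096, "bf16")
print(get_matmul_plan.cache_info())  # 1 miss, 223 hits
```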
LLM Inference Improvements
- Flash Attention for prefill phase
- Flash Decoding for decode phase
- GPU-based top-k sampling with temperature scaling (reference sketch after this list)
- Chat template support (Qwen, LLaMA, Mistral, ChatML)
- Streaming generation with `generate_stream()`
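Top-k sampling with temperature, mentioned above, divides the logits by the temperature, keeps only the k most likely tokens, and samples from the renormalized distribution. A NumPy reference of the procedure (the GPU path differs in mechanics, not in result):

```python
import numpy as np

def sample_top_k(logits, k=50, temperature=0.8, rng=None):
    """Sample one token id using temperature scaling + top-k filtering."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)
    kth = np.partition(scaled, -k)[-k]                 # k-th largest logit
    masked = np.where(scaled >= kth, scaled, -np.inf)  # drop everything below it
    z = masked - masked.max()                          # stable softmax
    probs = np.exp(z)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

vocab_logits = np.random.default_rng(0).standard_normal(32000)
print(sample_top_k(vocab_logits, k=50, temperature=0.8))
```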
New Features
- `GPUArray.narrow()` for zero-copy tensor slicing (reference semantics sketched after this list)
- Fused QKV projection infrastructure
- GPU position buffer for CUDA Graph replay without recapture
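narrow() in the zero-copy sense returns a view over a contiguous range along one dimension without copying, in the spirit of torch.Tensor.narrow. A NumPy sketch of those semantics (argument names are assumed, not taken from PyGPUkit's signature):

```python
import numpy as np

def narrow(arr, dim, start, length):
    """View of `length` elements along `dim` starting at `start` (no copy)."""
    index = [slice(None)] * arr.ndim
    index[dim] = slice(start, start + length)
    return arr[tuple(index)]

x = np.arange(24.0).reshape(4, 6)
view = narrow(x, dim=1, start=2, length=3)    # columns 2..4, shape (4, 3)
view[:] = 0.0                                 # writes through: it is a view of x
print(np.shares_memory(x, view))              # True
```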
Build System
- SM100/SM120 (Blackwell) CUTLASS kernel infrastructure
- CUDA 13.x compatibility improvements