CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.
Updated Apr 2, 2026 - Python
FlashAttention-style CUDA implementation with shared-memory tiling, online softmax fusion, IO-aware optimization, and GPU benchmarking.
A 110M-parameter Llama-style transformer trained from scratch on the TinyStories dataset, optimized for high-throughput training on 4GB VRAM consumer GPUs. The project features a custom asynchronous CUDA-stream prefetcher and KV-cache inference, achieving 10k+ TPS on an RTX 3050.
DeepSeek-R1 7B INT4 at 69.3 tok/s on a $300 RTX 3060. Faster than llama.cpp, vLLM, and NVIDIA TensorRT-LLM. Is one developer + AI really better than the entire industry?
🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.
High-performance matrix engine for Unit-Domain Flow (UDF). Eliminates Mantissa Friction with 0.00 MSE integrity.
An update on how binary search works with CUDA, along with the cluster tree sort algorithm for minimal memory use and compute cycles.
Hardened RAG pipeline with Llama 3.2 (3B) & Arize Phoenix. Features 4-bit Unsloth optimization, OpenTelemetry auditing, and a KV-cache stability patch for T4 GPUs. P99 Latency: 19.2s.