cuda-optimization

Here are 8 public repositories matching this topic...

Cre4T3Tiv3 / jetson-orin-matmul-analysis

CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.

Updated Apr 2, 2026
Python

shrvan30 / flash-attention-cuda

FlashAttention-style CUDA implementation with shared-memory tiling, online softmax fusion, IO-aware optimization, and GPU benchmarking.

machine-learning hpc gpu parallel-computing cuda transformer attention cuda-kernels shared-memory gpu-programming flashattention cuda-optimization flashattention2

Updated May 29, 2026
Cuda

KrishChordiya / nano-llama

A 110M-parameter Llama-style transformer trained from scratch on the TinyStories dataset, optimized for high-throughput training on consumer GPUs with 4GB of VRAM. The project features a custom asynchronous CUDA-stream prefetcher and KV-cache inference, achieving 10k+ tokens per second (TPS) on an RTX 3050.

nlp deep-learning transformers pytorch llama efficient-training tinystories cuda-optimization llama-from-scratch

Updated Apr 8, 2026
Python

Wierzbowski-Alien / qwen-coder-w4a16-demo

DeepSeek-R1 7B INT4 at 69.3 tok/s on a $300 RTX 3060. Faster than llama.cpp, vLLM, and NVIDIA TensorRT-LLM. Is one developer + AI really better than the entire industry?

inference-engine cachyos local-llm speculative-decoding deepseek-r1 cuda-optimization rtx-3060 w4a16

Updated May 19, 2026
Python

ZrobMiloudaa / jetson-orin-matmul-analysis

🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.

machine-learning robotics cuda cublas matrix-multiplication high-performance-computing gpu-computing performance-optimization autonomous-systems edge-computing nvidia-jetson embeded-systems tensor-cores ml-deployment jetson-orin-nano gpu-benchmarking power-efficiency-benchmark cuda-optimization

Updated Jun 9, 2026
Python

torajharsh / aether-scale

High-performance matrix engine for Unit-Domain Flow (UDF). Eliminates Mantissa Friction with 0.00 MSE integrity.

Updated Feb 17, 2026
Python

sighthough / cuda-binary-search-with-tree-sort-algo-optimization

An update on how binary search works with CUDA, along with the cluster tree sort algorithm for minimal memory use and compute cycles.

tree ai optimization cluster cuda gemini binary-search optimization-algorithms cuda-optimization

Updated Jun 9, 2026
HTML

ShettyShreyasR / rag-observability-pro

Hardened RAG pipeline with Llama 3.2 (3B) & Arize Phoenix. Features 4-bit Unsloth optimization, OpenTelemetry auditing, and a KV-cache stability patch for T4 GPUs. P99 Latency: 19.2s.

ai-safety opentelemetry unsloth llama-3-2 cuda-optimization rag-observability

Updated Mar 31, 2026
Python
