Kernel Forge

Drop-in GPU kernel optimizer for PyTorch models.

Backends: CUDA and Triton, with more coming soon.



Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models with no kernel programming expertise required. It profiles your model at the operator level, uses an LLM to write a correct kernel, then searches for performance improvements using Monte Carlo Tree Search until the kernel beats PyTorch's baseline.


Who is this for?

  • ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
  • AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
  • Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
  • Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
  • Teams packaging models for deployment who want a self-contained inference artifact with kernels baked in and no runtime dependency on KernelForge.

Features

  • Automated kernel generation via LLM with compile-error feedback loop
  • MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
  • CUDA and Triton backends (NVIDIA and AMD ROCm)
  • Remote execution over SSH - no local GPU required
  • Multi-LLM support: Anthropic, OpenAI, Google
  • Web dashboard with live progress, speed charts, and MCTS tree inspector
  • Portable .anvil snapshots and self-contained .cast inference packages
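The MCTS-driven search can be pictured as a tree in which each node is a candidate kernel and each edge is one optimization (tiling, unrolling, vectorized access) applied on top of it. Below is a minimal, self-contained sketch of that loop; the transform names and the `benchmark` stand-in are hypothetical illustrations, not KernelForge's actual API, and real runs would time a compiled kernel on the GPU instead.

```python
import math
import random

# Hypothetical transforms the search might explore (not KernelForge's real set).
TRANSFORMS = ["tile_16", "tile_32", "unroll_2", "unroll_4", "vectorize"]

def benchmark(applied):
    # Stand-in for compiling and timing a kernel variant on the GPU.
    # Returns a pseudo-speedup so this sketch runs anywhere.
    rng = random.Random(hash(tuple(applied)) % (2 ** 32))
    return 1.0 + rng.random() * len(set(applied))

class Node:
    def __init__(self, applied=(), parent=None):
        self.applied = applied      # transforms applied so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total = 0.0            # accumulated speedup reward

    def ucb1(self, parent_visits, c=1.4):
        # Unvisited children are explored first; otherwise balance
        # average reward (exploitation) against uncertainty (exploration).
        if self.visits == 0:
            return float("inf")
        return self.total / self.visits + c * math.sqrt(
            math.log(parent_visits) / self.visits)

def mcts(iterations=200, max_depth=3):
    root = Node()
    best = (1.0, ())  # baseline: 1.0x speedup with no transforms
    for _ in range(iterations):
        # Selection: descend by UCB1 until a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.ucb1(node.visits))
        # Expansion: add one child per candidate transform.
        if len(node.applied) < max_depth:
            node.children = [Node(node.applied + (t,), node) for t in TRANSFORMS]
            node = random.choice(node.children)
        # Simulation: "run" the candidate and record the best seen so far.
        speedup = benchmark(node.applied)
        if speedup > best[0]:
            best = (speedup, node.applied)
        # Backpropagation: credit the reward along the path to the root.
        while node is not None:
            node.visits += 1
            node.total += speedup
            node = node.parent
    return best
```

The returned pair is the best observed speedup and the transform sequence that produced it; in the real system the reward would come from measured kernel latency versus the PyTorch baseline.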

Full feature details


Quick start

See system requirements before installing.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd frontend
jac install

Configure your LLM key in the settings panel after starting, or set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_API_KEY before launch.
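For example, to configure an Anthropic key for the current shell session before launching (the key value below is a placeholder):

```shell
# Export one provider key before `jac start`; the dashboard settings
# panel can also store a key after startup.
export ANTHROPIC_API_KEY="your-key-here"
```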

jac start main.jac

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.


CLI

For headless or scripted runs, see docs/cli.md.


