Kernel Forge automatically generates and optimizes GPU kernels for PyTorch models; no kernel programming expertise is required. It profiles your model at the operator level, uses an LLM to write a correct kernel, then searches for performance improvements with Monte Carlo Tree Search (MCTS) until the kernel beats PyTorch's baseline.
- ML engineers running models in production who want lower inference latency on specific hardware without writing CUDA or Triton by hand.
- AI infrastructure teams targeting specific GPU hardware (NVIDIA CUDA or AMD ROCm) who need kernels tuned to that exact device.
- Teams with remote GPU access who run optimization on a separate GPU server while managing projects locally.
- Researchers benchmarking operator-level speedups across different LLM backends or optimization strategies.
- Teams packaging models for deployment who want a self-contained inference artifact with kernels baked in and no runtime dependency on KernelForge.
- Automated kernel generation via LLM with compile-error feedback loop
- MCTS-driven optimization - explores tiling, loop unrolling, vectorized memory access, and more
- CUDA and Triton backends (NVIDIA and AMD ROCm)
- Remote execution over SSH - no local GPU required
- Multi-LLM support: Anthropic, OpenAI, Google
- Web dashboard with live progress, speed charts, and MCTS tree inspector
- Portable `.anvil` snapshots and self-contained `.cast` inference packages
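To give a feel for the MCTS-driven search, here is a toy sketch in plain Python. The "transformations" (tile sizes and unroll factors) and the `benchmark` cost model are illustrative placeholders, not KernelForge's actual search space or timing harness:

```python
import math
import random

# Toy search space: pick a tile size, then an unroll factor (illustrative only).
TILES = [16, 32, 64, 128]
UNROLLS = [1, 2, 4]

def benchmark(tile, unroll):
    """Stand-in for timing a generated kernel: lower is better.
    Pretends (tile=64, unroll=4) is the sweet spot, plus noise."""
    return abs(tile - 64) / 64 + abs(unroll - 4) / 4 + random.random() * 0.1

class Node:
    def __init__(self, choices):
        self.choices = choices   # untried child actions
        self.children = {}       # action -> Node
        self.visits = 0
        self.value = 0.0         # running mean reward

def uct_select(node, c=1.4):
    # Standard UCT: mean reward plus an exploration bonus.
    return max(node.children.items(),
               key=lambda kv: kv[1].value
               + c * math.sqrt(math.log(node.visits) / kv[1].visits))

def search(iterations=200):
    root = Node(list(TILES))
    for _ in range(iterations):
        # Selection/expansion: walk two levels (tile, then unroll).
        path, node, actions = [root], root, []
        for level in (TILES, UNROLLS):
            if node.choices:                       # expand an untried action
                a = node.choices.pop()
                child = Node(list(UNROLLS) if level is TILES else [])
                node.children[a] = child
            else:                                  # select among children via UCT
                a, child = uct_select(node)
            actions.append(a)
            path.append(child)
            node = child
        # Simulation: reward is negative cost, so higher is better.
        reward = -benchmark(*actions)
        # Backpropagation: update running mean reward along the path.
        for n in path:
            n.visits += 1
            n.value += (reward - n.value) / n.visits
    best_tile, tile_node = max(root.children.items(), key=lambda kv: kv[1].visits)
    best_unroll, _ = max(tile_node.children.items(), key=lambda kv: kv[1].visits)
    return best_tile, best_unroll
```

With enough iterations the search typically concentrates visits on the lowest-cost configuration; the real system applies the same select/expand/simulate/backpropagate loop to code transformations measured on actual hardware.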
See system requirements before installing.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
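Before going further, you may want to sanity-check which GPU stack is visible on the machine. This is an illustrative helper (not part of KernelForge) that only checks for the vendor CLI tools on `PATH`:

```python
import shutil

def detect_gpu_toolchain():
    """Report which GPU stack appears available on this machine.

    Illustrative only: looks for nvidia-smi (CUDA) or rocm-smi (ROCm)
    on PATH rather than probing the driver directly.
    """
    if shutil.which("nvidia-smi"):
        return "cuda"
    if shutil.which("rocm-smi"):
        return "rocm"
    return "none"

print(detect_gpu_toolchain())
```

Printing `none` is fine if you plan to run optimization on a remote GPU server over SSH rather than locally.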
```shell
cd frontend
jac install
```

Configure your LLM key in the settings panel after starting, or set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `GOOGLE_API_KEY` before launch.
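For example, to launch with a key already in the environment (the value below is a placeholder):

```shell
export ANTHROPIC_API_KEY="your-key-here"   # or OPENAI_API_KEY / GOOGLE_API_KEY
```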
```shell
jac start main.jac
```

Open http://localhost:8000. Create a project, upload your model weights, and click Start Forge.
For headless or scripted runs, see docs/cli.md.