Version: 1.0
Formats: .anvil (project archive) · .cast (inference package)
KernelForge defines two file formats for packaging and distributing optimized models:
| Format | Purpose | Audience |
|---|---|---|
| `.anvil` | Full project snapshot for backup, transfer, and continued optimization inside KernelForge | KernelForge users |
| `.cast` | Self-contained inference package with baked-in optimized kernels | End users / deployment |
Both formats are ZIP archives with a specific internal layout and a mandatory header file as the first entry in the archive. This mirrors the convention used by PyTorch's .pt files — the extension signals semantics, the container is standard ZIP.
- The first entry in every archive MUST be `HEADER.json`. ZIP readers can locate and extract this single entry efficiently via the ZIP central directory, without decompressing the full archive.
- All paths inside the archive are relative and forward-slash separated, regardless of host OS.
- Binary data (weights, compiled objects) is stored as-is — no additional encoding.
- All JSON files use UTF-8 encoding with no BOM.
- Timestamps are ISO 8601 strings in UTC (e.g., `2026-02-20T14:30:00Z`).
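The header convention can be exercised with the standard library alone. A minimal sketch, using a hypothetical in-memory archive (contents are illustrative, not a real export):

```python
import io
import json
import zipfile

# Build a tiny in-memory archive with the mandated layout.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("HEADER.json", json.dumps({
        "format_version": "1.0",
        "file_type": "kernelforge_project",
    }))
    zf.writestr("model.py", "# model source ...")

# Reading a single entry: zipfile locates it via the central directory,
# so only HEADER.json is decompressed, not the whole archive.
with zipfile.ZipFile(buf) as zf:
    header = json.loads(zf.read("HEADER.json"))

print(header["file_type"])  # kernelforge_project
```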
Every archive contains a `checksums.sha256` file at the root. It lists the SHA-256 hash of every other file in the archive, one per line, in the format:

```
<hex_hash>  <relative/path/to/file>
```

The `HEADER.json` field `archive_checksum` holds the SHA-256 of `checksums.sha256` itself, allowing a fast top-level integrity check without reading every file.
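A sketch of the top-level check, assuming the hex digest from `HEADER.json` is already in hand (the listing contents here are made up):

```python
import hashlib

def verify_archive_checksum(checksums_bytes: bytes, expected_hex: str) -> bool:
    """Top-level integrity check: hash the raw bytes of checksums.sha256
    and compare against HEADER.json's archive_checksum."""
    return hashlib.sha256(checksums_bytes).hexdigest() == expected_hex

# Hypothetical listing in the documented "<hex_hash>  <relative/path>" format.
listing = (
    "0" * 64 + "  model.py\n" +
    "1" * 64 + "  config.json\n"
).encode("utf-8")

archive_checksum = hashlib.sha256(listing).hexdigest()
assert verify_archive_checksum(listing, archive_checksum)
assert not verify_archive_checksum(listing + b"tampered", archive_checksum)
```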
Both formats carry a `format_version` field in `HEADER.json`. Version numbers follow `MAJOR.MINOR`:

- MAJOR bump: breaking change; old readers cannot load the file.
- MINOR bump: backwards-compatible addition; old readers should warn but can load.
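A reader's compatibility gate might look like this (function name and warning text are illustrative, not part of the spec):

```python
def check_format_version(found: str, reader_major: int, reader_minor: int) -> None:
    """Reject on MAJOR mismatch; warn (but proceed) on a newer MINOR.
    Version strings are "MAJOR.MINOR" per the spec."""
    major, minor = (int(part) for part in found.split("."))
    if major != reader_major:
        raise ValueError(f"cannot load format_version {found}")
    if minor > reader_minor:
        print(f"warning: format_version {found} is newer than "
              f"{reader_major}.{reader_minor}; unknown fields will be ignored")

check_format_version("1.0", 1, 0)  # loads silently
check_format_version("1.2", 1, 0)  # warns, still loads
```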
An .anvil file is a complete, portable snapshot of a KernelForge project. It captures everything needed to restore the project into KernelForge on any machine — or to continue kernel optimization exactly where it left off.
.anvil files are produced and consumed exclusively by KernelForge. They are not intended for direct inference use.
```
project.anvil (ZIP)
│
├── HEADER.json              ← MUST be first entry
├── manifest.json
├── checksums.sha256
│
├── model.py                 ← original PyTorch model source
├── config.json              ← sanitized project config (no absolute paths)
│
├── artifacts/
│   └── weights/
│       ├── <sha256_hex>.pt  ← weight files, SHA-256 named
│       └── meta.json        ← name/hash/size registry
│
├── profiling/
│   └── <op_name>/
│       ├── entry_<n>.pt     ← captured I/O tensor pairs
│       └── summary.json     ← op profiling summary
│
├── kernels/
│   └── <op_name>/
│       ├── kernel.py        ← best known Python dispatch wrapper
│       └── kernel.cu        ← best known CUDA source
│
├── benchmarks/
│   └── op_benchmarks.json
│
└── trees/                   ← present only in "full" export mode
    └── <op_name>/
        └── nodes.db         ← MCTS tree (SQLite)
```
```json
{
  "format_version": "1.0",
  "file_type": "kernelforge_project",
  "export_mode": "snapshot | full",
  "project_name": "<string>",
  "created_at": "<ISO 8601>",
  "exported_at": "<ISO 8601>",
  "kernelforge_version": "<semver>",
  "target_device": "cuda | mps | cpu",
  "contents": {
    "has_weights": true,
    "weight_count": 1,
    "has_profiling_data": true,
    "profiled_op_count": 12,
    "has_kernels": true,
    "kernel_count": 6,
    "has_mcts_trees": false,
    "uncompressed_size_bytes": 52428800
  },
  "archive_checksum": "<sha256 of checksums.sha256>"
}
```

| Value | Description |
|---|---|
| `snapshot` | Best kernels and results only. Cannot resume MCTS optimization. |
| `full` | Includes `trees/` with all MCTS nodes. Full optimization state is preserved. |
```json
{
  "project_name": "<string>",
  "created_at": "<ISO 8601>",
  "exported_at": "<ISO 8601>",
  "llm_info": {
    "provider": "anthropic | openai | google | ollama",
    "model": "<model name>"
  },
  "target_device": "cuda | mps | cpu",
  "ops": [
    {
      "name": "<torch op name>",
      "has_profiling_data": true,
      "profiling_entries": 3,
      "has_kernel": true,
      "benchmark_speedup": 2.41
    }
  ],
  "weights": [
    {
      "id": "<sha256[:16]>",
      "name": "<original filename>",
      "sha256": "<full hex>",
      "size_bytes": 524288000
    }
  ]
}
```

The project `config.json` is included verbatim except:

- `validation_dir` is set to `null` (absolute host path, not portable).
- Any other path-valued fields are likewise stripped to `null`.
- All other fields (generator settings, profile filters, LLM info) are preserved as-is.
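A minimal sketch of the sanitization pass, assuming the exporter knows which keys are path-valued (`PATH_FIELDS` is a hypothetical name):

```python
import json

# Assumed: the set of path-valued keys is known to the exporter.
PATH_FIELDS = {"validation_dir"}

def sanitize_config(config: dict) -> dict:
    """Return a copy with every path-valued field nulled; everything else
    (generator settings, profile filters, LLM info) passes through."""
    return {key: (None if key in PATH_FIELDS else value)
            for key, value in config.items()}

cfg = {"validation_dir": "/home/alice/runs", "llm_info": {"provider": "ollama"}}
print(json.dumps(sanitize_config(cfg)))
# {"validation_dir": null, "llm_info": {"provider": "ollama"}}
```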
1. Open the ZIP and read `HEADER.json` only. Validate `format_version` and `file_type == "kernelforge_project"`.
2. Display a project summary to the user (name, op count, export mode, size). Wait for confirmation.
3. Verify `checksums.sha256` against `archive_checksum`. Reject on mismatch.
4. Extract to a temporary directory.
5. Create a new project directory at `$KERNELFORGE_DATA_DIR/projects/<project_name>/`. If a project with that name already exists, prompt the user to rename or overwrite.
6. Copy `model.py`.
7. Re-register weights: copy `artifacts/weights/*.pt` into the new project and rebuild `config.json` artifact entries with correct local `relpath` values, using `meta.json` as the source of truth.
8. Copy `profiling/` → `io/individual_ops/`.
9. Copy `kernels/` → `kernels/generated/individual_op_kernels/`.
10. If `has_mcts_trees`: copy `trees/` → `trees/`.
11. Write `config.json` from the sanitized version, setting `validation_dir` back to `null`.
12. Write a fresh `state.json` with all jobs initialized to `idle`. Never restore old job state — PIDs, log paths, and progress are machine-specific.
13. Delete the temporary directory.
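The weight re-registration step could be sketched as follows. The `meta.json` schema used here (a `weights` list with `name`/`sha256` keys) is an assumption based on the registry description above, not normative:

```python
import json
import pathlib
import shutil

def reregister_weights(tmp: pathlib.Path, project: pathlib.Path) -> list:
    """Copy SHA-256-named weight files into the new project and rebuild
    artifact entries from meta.json (schema assumed, see lead-in)."""
    src_dir = tmp / "artifacts" / "weights"
    meta = json.loads((src_dir / "meta.json").read_text())
    dest = project / "artifacts" / "weights"
    dest.mkdir(parents=True, exist_ok=True)
    entries = []
    for w in meta["weights"]:
        fname = f"{w['sha256']}.pt"
        shutil.copy2(src_dir / fname, dest / fname)
        entries.append({
            "name": w["name"],
            "sha256": w["sha256"],
            "relpath": f"artifacts/weights/{fname}",  # local, forward-slash
        })
    return entries
```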
| Excluded | Reason |
|---|---|
| `state.json` | Job state (PIDs, progress) is machine-specific and meaningless on import |
| `.uploads/` | Temporary base64 staging files, never relevant post-creation |
| `exports/` | Old export zips, redundant |
| `logs/` | Not functional data; large; excluded by default |
| Absolute paths | All path fields scrubbed to `null` before packaging |
A .cast file is a self-contained, deployable inference artifact. It bundles a trained model's weights with the KernelForge-optimized CUDA kernels that replace its hot-path operators. Anyone with the cast Python runtime can load a .cast file and run inference immediately — no KernelForge installation required.
.cast files are produced by KernelForge and consumed by the cast runtime package (pip-installable, lightweight, depends only on torch).
```
my_model.cast (ZIP)
│
├── HEADER.json              ← MUST be first entry
├── manifest.json
├── checksums.sha256
│
├── model.py                 ← model class definition (for module reconstruction)
│
├── weights/
│   └── <sha256_hex>.pt      ← model weights (standard state_dict format)
│
├── kernels/
│   └── <op_name>/
│       ├── kernel.cu        ← CUDA source (portable, JIT-compiled on target)
│       └── wrapper.py       ← Python dispatch glue for torch.library registration
│
├── compiled/                ← optional; pre-compiled binaries per SM version
│   ├── sm_80/
│   │   └── <op_name>.so
│   ├── sm_86/
│   │   └── <op_name>.so
│   └── sm_89/
│       └── <op_name>.so
│
└── loader.py                ← vendored cast runtime (fallback, no pip required)
```
```json
{
  "format_version": "1.0",
  "file_type": "kernelforge_inference",
  "project_name": "<string>",
  "exported_at": "<ISO 8601>",
  "kernelforge_version": "<semver>",
  "runtime": {
    "min_cast_version": "0.1",
    "min_torch_version": "2.1.0",
    "min_cuda_version": "12.0",
    "target_sm_versions": ["sm_80", "sm_86", "sm_89"]
  },
  "contents": {
    "optimized_op_count": 6,
    "total_op_count": 12,
    "has_precompiled": true,
    "precompiled_sm_versions": ["sm_86", "sm_89"],
    "weight_size_bytes": 524288000
  },
  "archive_checksum": "<sha256 of checksums.sha256>"
}
```

```json
{
  "project_name": "<string>",
  "exported_at": "<ISO 8601>",
  "model_class": "<ClassName>",
  "model_init_args": {},
  "weight_file": "weights/<sha256_hex>.pt",
  "ops": [
    {
      "name": "<torch op name>",
      "kernel_dir": "kernels/<op_name>/",
      "wrapper": "kernels/<op_name>/wrapper.py",
      "cuda_source": "kernels/<op_name>/kernel.cu",
      "precompiled": {
        "sm_86": "compiled/sm_86/<op_name>.so",
        "sm_89": "compiled/sm_89/<op_name>.so"
      },
      "benchmark_speedup": 2.41,
      "torch_op": "torch.nn.functional.linear"
    }
  ]
}
```

If the model class `__init__` requires arguments beyond the defaults, they are stored in `model_init_args` as a JSON-serializable dict. On load, the model is instantiated as `ModelClass(**model_init_args)` before `load_state_dict` is called. If the model can be reconstructed from the state dict alone (e.g., all architecture info is implied by weight shapes), this field is an empty object.
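For instance, with a hypothetical model class and stored init args (no torch needed for the sketch; `load_state_dict` would follow in the real loader):

```python
# TinyModel stands in for the class defined in model.py; the init args
# mirror what manifest.json would store (values are hypothetical).
class TinyModel:
    def __init__(self, hidden_dim=64, num_layers=2):
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

model_init_args = {"hidden_dim": 128, "num_layers": 4}
model = TinyModel(**model_init_args)   # load_state_dict(...) would follow
print(model.hidden_dim, model.num_layers)  # 128 4
```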
Every `.cast` file embeds a copy of the cast runtime as `loader.py`. This allows loading without a pip installation using Python's `zipimport`:

```python
import zipimport

importer = zipimport.zipimporter("my_model.cast")
cast = importer.load_module("loader")

model = cast.load("my_model.cast")
output = model(input_tensor)
```

The vendored `loader.py` is functionally identical to the pip-installed `cast` package at the embedded version. The pip package is preferred when available; `loader.py` is a zero-dependency fallback.
```
cast.load("my_model.cast")
│
├── 1. Read HEADER.json
│       Validate format_version and file_type == "kernelforge_inference"
│       Check min_torch_version against torch.__version__
│       Check min_cuda_version against torch.version.cuda
│       Warn if current GPU SM version not in target_sm_versions
│
├── 2. Verify checksums.sha256 against archive_checksum
│       Raise CastCorruptedError on mismatch
│
├── 3. Compute cache key: SHA-256 of the archive itself
│       Cache dir: ~/.cache/cast/<cache_key>/
│       If cache dir exists and is valid → skip extraction (step 4)
│
├── 4. Extract archive to cache dir
│
├── 5. Detect current GPU SM version (e.g., sm_86)
│
├── 6. For each op in manifest["ops"]:
│       a. If precompiled .so exists for current SM → dlopen it
│       b. Else → JIT compile kernel.cu via torch.utils.cpp_extension.load()
│          Compilation artifacts cached in cache dir
│       c. Register op with torch.library under namespace "cast"
│
├── 7. Import model class from model.py (exec into isolated namespace)
│
├── 8. Instantiate model: ModelClass(**model_init_args)
│
├── 9. torch.load(weight_file) → model.load_state_dict(state_dict)
│
├── 10. Wrap model.forward() to route registered ops through cast:: dispatch
│
└── 11. Return model — ready for inference
```
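Step 3 hinges on a content-addressed cache. A minimal sketch of the cache-key derivation (stdlib only; the `~/.cache/cast/` layout follows the sequence above):

```python
import hashlib
import pathlib

def cast_cache_dir(cast_path: str) -> pathlib.Path:
    """Derive the extraction cache directory from the SHA-256 of the
    archive bytes, hashing in chunks to avoid loading large files whole."""
    digest = hashlib.sha256()
    with open(cast_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return pathlib.Path.home() / ".cache" / "cast" / digest.hexdigest()
```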
Optimized kernels are registered with PyTorch's custom operator system under the `cast` namespace:

```
cast::<op_name>(Tensor self, ...) -> Tensor
```

The dispatch wrapper in `wrapper.py` replaces the original `torch.nn.functional.*` call at the Python level using a monkey-patch applied during model wrapping (step 10). This requires no changes to `model.py` or the calling code.
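Since the real wrapper needs torch, here is a torch-free sketch of the monkey-patch mechanics, with `types.SimpleNamespace` standing in for `torch.nn.functional` (names and fallback behavior are illustrative):

```python
import types

# Stand-in for torch.nn.functional; the real wrapper patches
# torch.nn.functional.linear and dispatches to the registered cast:: op.
F = types.SimpleNamespace(linear=lambda x, w: ("native", x, w))

def patch_op(module, name, optimized):
    """Swap module.<name> for the optimized path, keeping the original
    implementation around as a fallback."""
    original = getattr(module, name)
    def dispatch(*args, **kwargs):
        try:
            return optimized(*args, **kwargs)
        except RuntimeError:
            return original(*args, **kwargs)  # fall back to native
    setattr(module, name, dispatch)
    return original

patch_op(F, "linear", lambda x, w: ("cast", x, w))
print(F.linear(1, 2))  # ('cast', 1, 2)
```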
| Priority | Condition | Action |
|---|---|---|
| 1 | Pre-compiled `.so` for exact SM version exists | Load directly, no compilation |
| 2 | Pre-compiled `.so` for compatible SM version exists (lower SM, same major) | Load with warning |
| 3 | `kernel.cu` source present | JIT compile via `torch.utils.cpp_extension.load()` |
| 4 | None of the above | Skip kernel for this op, fall back to native PyTorch |
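The ladder can be sketched as a pure-stdlib selection function; the SM-compatibility rule here is a simplification of "lower SM, same major", and the return tags are illustrative:

```python
import pathlib

def select_kernel(op_dir: pathlib.Path, compiled_dir: pathlib.Path,
                  op: str, sm: str):
    """Walk the fallback ladder for one op on the current GPU (sm like
    "sm_86"). Returns a (kind, path) pair."""
    exact = compiled_dir / sm / f"{op}.so"
    if exact.exists():
        return ("precompiled", exact)            # priority 1
    major = sm.split("_")[1][0]
    candidates = sorted(compiled_dir.glob(f"sm_{major}*/{op}.so"), reverse=True)
    for so in candidates:
        if so.parent.name < sm:
            return ("precompiled-compat", so)    # priority 2, load with warning
    cu = op_dir / "kernel.cu"
    if cu.exists():
        return ("jit", cu)                       # priority 3
    return ("native", None)                      # priority 4
```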
| Excluded | Reason |
|---|---|
| Profiling I/O tensors (`entry_*.pt`) | Not needed for inference; required only for kernel validation in KernelForge |
| MCTS trees (`nodes.db`) | Optimization history, irrelevant at inference time |
| `state.json` / job logs | KernelForge internal state, not meaningful outside KernelForge |
| Benchmark data | Informational only; not needed to run the model |
| Generation attempt history | Intermediate artifacts, not needed for final kernel |
A .cast file can be ingested by KernelForge using the standard project import flow. KernelForge inspects HEADER.json and branches on file_type:
| `file_type` | Import Path |
|---|---|
| `kernelforge_project` | Full `.anvil` import flow — all data restored |
| `kernelforge_inference` | Partial import — weights and best kernels recovered; profiling data absent |
When importing a `.cast`, KernelForge will:

- Create a new project from the embedded `model.py` and weights.
- Install the optimized kernels as if they were generated kernels.
- Mark all profiling jobs as *requires re-run* — profiling I/O tensors are absent, so kernel correctness cannot be re-verified until profiling completes.
- Mark all generate/optimize jobs as *completed (imported)* — kernels exist and can be immediately benchmarked or further optimized.
| Property | `.anvil` | `.cast` |
|---|---|---|
| Container | ZIP | ZIP |
| First entry | `HEADER.json` | `HEADER.json` |
| `file_type` | `kernelforge_project` | `kernelforge_inference` |
| Requires KernelForge to use | Yes (import) | No |
| Runnable for inference | No | Yes |
| Includes profiling tensors | Yes | No |
| Includes MCTS trees | Optional (`full` mode) | No |
| Includes pre-compiled `.so` | No | Optional |
| Embeds runtime loader | No | Yes (`loader.py`) |
| Primary audience | KernelForge users | Deployment / end users |
| Re-importable into KernelForge | Yes (full restore) | Yes (partial restore) |