This folder is the canonical, per-model reference for every architecture that TensorSharp can run. Each card is a self-contained brief: it walks an engineer or researcher from "I have never heard of this model" all the way to "I can explain the forward graph and reproduce the inference path in TensorSharp." If you only need a top-level pointer, use the table below; otherwise jump into the individual cards.
Each card follows the same shape so you can diff architectures cleanly:
- Origin and intent — who designed the model, what the GGUF arch keys are, and which capabilities (modalities, thinking, tools) it exposes.
- Model architecture — the high-level block diagram, layer counts, and any per-layer heterogeneity.
- Forward graph — the exact ordered list of ops a single token (decode) and a multi-token sequence (prefill) flow through, including residuals and normalizations.
- Components — every sub-block (attention, FFN/SSM, routing, normalization, RoPE flavor, vision/audio encoder) explained in detail with the math that governs it.
- Parameters and settings — the GGUF metadata keys, weight tensor naming convention, and dtype expectations.
- TensorSharp implementation — pointers to the C# source files, the instantiation order, the cache layout, and the way the model plugs into ModelBase / Ops / native GGML kernels.
- Prefill optimization — chunking, fused per-layer kernels, parallelization, cross-layer caches.
- Decode optimization — fused single-call kernels, pre-resolved weight pointers, batched MoE, in-place kernels, cache reuse.
- Memory and KV cache strategy — circular vs. linear caches, mmap-backed weights, pre-allocated decode buffers.
- Multimodal pipeline — how images / audio / video are processed, encoded, and injected into the language model.
- Output / chat template — protocol parser, stop tokens, thinking / tool formats.
- Optimization opportunities — work that has not been done yet but that we know would unlock more performance or capability.
| Architecture | Card | Source class | GGUF keys | Modalities | Reasoning | Tools | Batched / paged forward | Notable acceleration |
|---|---|---|---|---|---|---|---|---|
| Gemma 3 | gemma3.md | Gemma3Model | gemma3 | Text, image | No | No | No (legacy per-seq) | Alternating SWA / global attention, GeGLU FFN, QK-norm, V-norm |
| Gemma 4 | gemma4.md | Gemma4Model | gemma4 | Text, image, video, audio | Yes | Yes | Default (toggle off with TS_GEMMA4_BATCHED=0) | Single-graph fused decode (all layers in one GGML dispatch), fused per-layer prefill, chunked prefill, circular SWA cache, PLE, KV sharing, MoE variants. Batched path matches legacy logits within FP noise (Gemma4BatchedForwardTests); reaches ~1.5× legacy at batch=8 and ~1.6× at 4×800-token prompts. |
| Qwen 3 | qwen3.md | Qwen3Model | qwen3 | Text | Yes | Yes | Reference port (Qwen3Model.BatchedForward.cs) — exercised by Qwen3BatchedForwardTests when a base-Qwen3 GGUF is provided | Native whole-model decode with pre-resolved weight pointers |
| Qwen 3.5 / 3.6 family | qwen35.md | Qwen35Model | qwen35, qwen35moe, qwen3next | Text, image | Yes | Yes | Default (toggle off with TS_QWEN35_BATCHED=0 or --no-continuous-batching). Per-slot recurrent-state pool + optional native GatedDeltaNet kernel (TS_QWEN35_BATCHED_GDN_NATIVE=1) | Hybrid FullAttention + GatedDeltaNet recurrent, fused attention layer decode, fused prefill attention, fused output-projection + FFN, fused output-projection + norm + router, batched MoE (routed + shared + residual in a single kernel), fused vision encoder blocks |
| GPT OSS | gptoss.md | GptOssModel | gptoss, gpt-oss | Text | Yes (always) | Yes | Default (toggle off with TS_GPTOSS_BATCHED=0). Per-head attention sinks via TSGgml_PagedAttentionForwardWithSinks (or TS_GPTOSS_PAGED_ATTN_MANAGED=1 for the C# fallback). 100% greedy match vs legacy in GptOssBatchedCorrectnessTests. | Stacked MoE prefill kernel (mul_mat_id + add_id + swiglu_oai), attention sinks, MXFP4 expert weights |
| Nemotron-H | nemotron.md | NemotronModel | nemotron_h, nemotron_h_moe | Text, image (Omni-class) | Yes | Yes | Default (toggle off with TS_NEMOTRON_BATCHED=0). Per-slot Mamba2 conv + SSM state pool; optional native batched Mamba2 step (TS_NEMOTRON_MAMBA2_BATCHED_NATIVE=1). 100% greedy match vs legacy; up to 3.95× tps at batch=3 on Apple M4 Pro. | Mamba2 + attention + MoE FFN hybrid stack, batched GPU MoE, optional Parakeet audio frontend, RADIO/v2_vl image encoder |
| Mistral 3 | mistral3.md | Mistral3Model | mistral3 | Text, image | No | No | Default — reference IBatchedPagedModel implementation. End-to-end validated on Ministral-3-14B; native paged-attention kernel is ~21% faster than the legacy per-seq path on long context. | YaRN-corrected RoPE with position-dependent Q scaling, fused QKV / gate_up, Pixtral vision encoder |
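
The "Default (toggle off with ...=0)" and "optional (...=1)" entries in the matrix suggest two environment-variable conventions: default-on switches that are disabled by setting the variable to 0, and opt-in switches that are enabled by setting it to 1. The sketch below shows one way such toggles could be read. `EnvToggle`, `IsEnabledByDefault`, and `IsOptedIn` are hypothetical names chosen for illustration, and the exact parsing rules TensorSharp applies may differ.

```csharp
using System;

// Hypothetical helper, not part of TensorSharp. It only illustrates the two
// toggle conventions that appear in the matrix above.
public static class EnvToggle
{
    // Default-on switch (for example TS_GEMMA4_BATCHED): enabled unless the
    // variable is explicitly set to "0".
    public static bool IsEnabledByDefault(string variableName)
    {
        var raw = Environment.GetEnvironmentVariable(variableName);
        return raw == null || raw.Trim() != "0";
    }

    // Opt-in switch (for example TS_QWEN35_BATCHED_GDN_NATIVE): enabled only
    // when the variable is explicitly set to "1".
    public static bool IsOptedIn(string variableName)
    {
        var raw = Environment.GetEnvironmentVariable(variableName);
        return raw != null && raw.Trim() == "1";
    }
}
```

With this reading, an unset `TS_GEMMA4_BATCHED` keeps the batched path active, while an unset `TS_QWEN35_BATCHED_GDN_NATIVE` leaves the native kernel off.
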
Model code is intentionally backend-agnostic. ModelBase selects tensor
storage through BackendType and the registered execution plan, then delegates
the actual ops to the backend that owns those allocators:
| Backend type | Package | Notes |
|---|---|---|
| Cpu | TensorSharp.Core | Pure managed tensors with SIMD/managed quantized fast paths (RMSNorm, RoPE, softmax, fused activations, GEMM, dequant). |
| Cuda | TensorSharp.Backends.Cuda | Direct CUDA Driver-API allocator and storage, cuBLAS GEMM, PTX kernels for hot ops (RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat, activation fusions), native quantized matmul / get_rows for supported quant types, CPU fallback for ops that are not yet implemented. |
| Mlx | TensorSharp.Backends.MLX | Apple Silicon mlx-c bridge with quantized / fused / compiled kernels, async worker dispatch, MoE expert offload, and a CPU fallback layer. Requires libmlxc. |
| GgmlCpu / GgmlMetal / GgmlCuda | TensorSharp.Backends.GGML + TensorSharp.GGML.Native | Native ggml bridge with quantized graph dispatch and platform backends. mmap-backed quantized weights are bound zero-copy through host-pointer buffers. Includes the paged-attention kernel (TSGgml_PagedAttentionForward, plus the GPT OSS sinks variant) that powers the batched / paged execution path. |
When a card mentions a fused GGML kernel (for example Qwen35AttentionLayerDecode,
Gemma4LayerPrefill, or MoEExpertsSwiGLUResidual), the kernel is compiled from
TensorSharp.GGML.Native/ggml_ops_*.cpp and exposed through
TensorSharp.Backends.GGML/GgmlBasicOps.cs. The native bridge is the place to
look when a fused path engages on GGML CPU / Metal / CUDA but not on the pure
managed CPU or direct CUDA backends.
All architectures listed above also run through the shared
InferenceEngine + ContinuousBatchScheduler + BatchExecutor stack documented
in docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md.
Models that implement IBatchedPagedModel.ForwardBatch execute one batched
forward per scheduler step (with slotMapping-based K/V scatter into a
shared paged buffer and per-sequence attention via the native paged kernel);
the others run through the per-sequence KV-swap fallback inside the same engine.
The opt-in env vars are summarised in the matrix above and in the project root
README.
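
The split between the batched path and the per-sequence fallback can be pictured as a type check on the model. The sketch below is a simplified illustration under stated assumptions: `ISketchModel`, `ISketchBatchedModel`, `SketchExecutor`, and the two toy models are stand-ins invented for this example. The real `IBatchedPagedModel.ForwardBatch` takes richer arguments (such as the slot mapping), and the real fallback also swaps each sequence's KV cache in and out, which is omitted here.

```csharp
using System;
using System.Collections.Generic;

// Stand-in types for illustration only; they are not the TensorSharp interfaces.
public interface ISketchModel
{
    float[] Forward(int[] tokens); // per-sequence path
}

public interface ISketchBatchedModel : ISketchModel
{
    // One call covers every sequence scheduled in this step.
    float[][] ForwardBatch(IReadOnlyList<int[]> tokensPerSequence);
}

public static class SketchExecutor
{
    public static float[][] Step(ISketchModel model, IReadOnlyList<int[]> batch)
    {
        // Batched models get a single forward per scheduler step.
        if (model is ISketchBatchedModel batched)
            return batched.ForwardBatch(batch);

        // Fallback: one forward per sequence (KV swapping omitted).
        var results = new float[batch.Count][];
        for (int i = 0; i < batch.Count; i++)
            results[i] = model.Forward(batch[i]);
        return results;
    }
}

// Toy models that return the sequence length as a single fake "logit".
public sealed class PerSequenceToy : ISketchModel
{
    public int Calls;
    public float[] Forward(int[] tokens) { Calls++; return new[] { (float)tokens.Length }; }
}

public sealed class BatchedToy : ISketchBatchedModel
{
    public int BatchCalls;
    public float[] Forward(int[] tokens) => new[] { (float)tokens.Length };

    public float[][] ForwardBatch(IReadOnlyList<int[]> tokensPerSequence)
    {
        BatchCalls++;
        var results = new float[tokensPerSequence.Count][];
        for (int i = 0; i < tokensPerSequence.Count; i++)
            results[i] = Forward(tokensPerSequence[i]);
        return results;
    }
}
```

Calling `SketchExecutor.Step` with a two-sequence batch invokes `PerSequenceToy.Forward` twice but `BatchedToy.ForwardBatch` only once, which is the behavioral difference the paragraph above describes.
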
| Feature | Gemma 3 | Gemma 4 | Qwen 3 | Qwen 3.5 / 3.6 family | GPT OSS | Nemotron-H | Mistral 3 |
|---|---|---|---|---|---|---|---|
| Layer type | Dense | Dense / MoE | Dense | Hybrid (Attn + Recurrent) ± MoE | MoE | Hybrid (Mamba2 + Attn + MoE FFN) | Dense |
| Attention | SWA + Global | SWA + Global | Full GQA | Full GQA + Sigmoid Gate | Full + Sinks | Full GQA (no RoPE) | Full GQA |
| FFN activation | GeGLU | GeGLU | SwiGLU | SwiGLU | SiLUAlphaLimit (clamped GLU) | ReLU² | SwiGLU |
| RoPE variant | NeoX (dual base) | NeoX + proportional / partial | NeoX | NeoX / MRoPE | NeoX + YaRN | None | GPT-J + YaRN |
| QK-norm | Yes | Yes | Yes | Yes | No | No | No |
| V-norm | No | Yes (unweighted) | No | No | No | No | No |
| Bias in projections | No | No | No | No | Yes (all linear) | No | No |
| Per-layer scaling | No | Yes | No | No | No | No | No |
| Per-Layer Embedding (PLE) | No | Yes | No | No | No | No | No |
| KV sharing | No | Yes (tail layers) | No | No | No | No | No |
| Attention sinks | No | No | No | No | Yes | No | No |
| Circular KV cache | No | Yes (SWA layers) | No | No | No | No | No |
| SSM / recurrent layers | No | No | No | Yes (GatedDeltaNet) | No | Yes (Mamba2) | No |
| Shared experts | No | No | No | Yes (qwen35moe / qwen3next) | No | Yes (optional) | No |
| Latent bottleneck FFN | No | No | No | No | No | Yes (optional) | No |
| Position-dependent Q scaling | No | No | No | No | No | No | Yes (with YaRN) |
| Vision | Yes | Yes | No | Yes | No | Yes (Omni) | Yes (Pixtral) |
| Audio | No | Yes | No | No | No | Yes (Parakeet, when mmproj present) | No |
| Video | No | Yes | No | No | No | No | No |
| Thinking | No | Yes | Yes | Yes | Yes (always) | Yes | No |
| Tool calling | No | Yes | Yes | Yes | Yes | Yes | No |
| Fused QKV | No | Yes | Yes | Mixed (full attention layers split, recurrent layers fuse a 5-way pack) | Yes | Yes | Yes |
| Fused single-graph decode | No | Yes (Gemma4ModelDecode) | Yes (TransformerModelDecode, native loop) | Per-layer fused (Qwen35AttentionLayerDecode, FusedOutProjFFN, FusedOutProjNormRouter) | Per-layer | Per-layer / batched MoE | No |
| Fused single-graph prefill | No | Yes (Gemma4LayerPrefill, dense layers) | No | Yes (FusedPrefillAttention, FusedOutProjFFN, MoE prefill) | Yes (MoE prefill via mul_mat_id) | No | No |
| Batched GPU MoE | n/a | Pending | n/a | Yes (routed + shared + residual fused) | Yes (stacked weight slabs) | Yes | n/a |
| Fused vision encoder | n/a | Standard | n/a | Yes (FusedVisionAttention + FusedVisionMLP) | n/a | Standard (RADIO ViT) | Standard (Pixtral) |
| Output parser | PassthroughOutputParser | Gemma4OutputParser | Qwen3OutputParser | Qwen35OutputParser | HarmonyOutputParser (always required) | Qwen3OutputParser | PassthroughOutputParser |
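
For readers unfamiliar with the names in the "FFN activation" row, the following scalar sketch writes out the textbook definitions of SwiGLU, GeGLU, and ReLU². It is an illustration, not the TensorSharp kernels: `FfnActivationSketch` is a hypothetical name, GELU is shown in its tanh approximation as an assumption (a given model may use the erf form), and the clamped GLU variant used by GPT OSS is not shown.

```csharp
using System;

// Illustrative scalar definitions only; real kernels operate on whole tensors.
public static class FfnActivationSketch
{
    private static float Sigmoid(float x) => 1f / (1f + MathF.Exp(-x));

    // SiLU (swish): x * sigmoid(x).
    public static float Silu(float x) => x * Sigmoid(x);

    // GELU, tanh approximation (assumed here; some models use the erf form).
    public static float GeluTanh(float x)
    {
        const float sqrtTwoOverPi = 0.7978845608f;
        return 0.5f * x * (1f + MathF.Tanh(sqrtTwoOverPi * (x + 0.044715f * x * x * x)));
    }

    // Gated units: the activated "gate" projection scales the "up" projection.
    public static float SwiGlu(float gate, float up) => Silu(gate) * up;
    public static float GeGlu(float gate, float up) => GeluTanh(gate) * up;

    // Squared ReLU: max(0, x)^2.
    public static float ReluSquared(float x)
    {
        float r = MathF.Max(0f, x);
        return r * r;
    }
}
```

In a gated FFN the two inputs come from separate (or fused gate_up) projections of the same hidden state, and the result is fed to the down projection.
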
When you add a new model:
- Create TensorSharp.Models/Models/<Name>/<Name>Model.cs inheriting ModelBase.
- In the constructor: read GGUF metadata via _gguf.GetXxx(), call ParseBaseConfig() and ParseTokenizer(), call LoadWeights(), fuse weights, then initialize caches.
- Implement Forward(int[] tokens) → float[]: embedding → optional multimodal injection → transformer blocks → final norm → LM head → logit copy.
- Implement ResetKVCache() and Dispose(). Implement TruncateKVCache() when KV-cache reuse is supported.
- Register in ModelBase.Create() switch expression in TensorSharp.Models/ModelBase.cs.
- Add an IOutputParser implementation in TensorSharp.Runtime/OutputParser.cs if the model uses a non-standard output format and register it in OutputParserFactory.Create().
- Add chat template support in TensorSharp.Runtime/ChatTemplate.cs / Jinja2Template.cs if the model uses a novel template format.
- Add a card under docs/models/<name>.md (and <name>_zh-cn.md if you want bilingual coverage), update this README's matrix, and link the card from the project root README.
- Update TensorSharp.Server/testdata/ capability gates if the model exposes new modalities, thinking, or tool capabilities.
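
The first five steps can be sketched as a skeleton. Everything below is a stand-in written for illustration: `SketchModelBase` only mimics the shape of `ModelBase` described in the list, its method signatures are assumptions, `ToyModel` and the `"toy"` architecture key are invented, and the forward pass is a placeholder that skips real tensor work.

```csharp
using System;

// Stand-in for ModelBase. The real class has different members and signatures.
public abstract class SketchModelBase : IDisposable
{
    protected void ParseBaseConfig() { /* would read shared GGUF hyperparameters */ }
    protected void ParseTokenizer() { /* would build the tokenizer from GGUF */ }
    protected void LoadWeights() { /* would map weight tensors */ }

    public abstract float[] Forward(int[] tokens);
    public abstract void ResetKVCache();
    public virtual void Dispose() { }

    // Mirrors the "register in the Create() switch expression" step.
    public static SketchModelBase Create(string architecture) => architecture switch
    {
        "toy" => new ToyModel(vocabSize: 8),
        _ => throw new NotSupportedException($"Unknown architecture '{architecture}'."),
    };
}

public sealed class ToyModel : SketchModelBase
{
    private readonly int _vocabSize;
    private int _cachedTokens; // stands in for the KV cache length

    public ToyModel(int vocabSize)
    {
        _vocabSize = vocabSize; // a real model would read this from GGUF metadata
        ParseBaseConfig();
        ParseTokenizer();
        LoadWeights();
        // A real model would fuse weights here, then allocate its caches.
        _cachedTokens = 0;
    }

    public int CachedTokens => _cachedTokens;

    public override float[] Forward(int[] tokens)
    {
        if (tokens.Length == 0)
            throw new ArgumentException("At least one token is required.", nameof(tokens));

        // Placeholder for: embedding, blocks, final norm, LM head, logit copy.
        _cachedTokens += tokens.Length;
        var logits = new float[_vocabSize];
        logits[tokens[tokens.Length - 1] % _vocabSize] = 1f;
        return logits;
    }

    public override void ResetKVCache() => _cachedTokens = 0;
}
```

The constructor order follows the list: metadata, base config and tokenizer, weights, fusion, then caches. Keeping that order in a new model means cache sizes can depend on values parsed earlier.
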