Model Architecture Cards

English | 中文

This folder is the canonical, per-model reference for every architecture that TensorSharp can run. Each card is a self-contained brief: it walks an engineer or researcher from "I have never heard of this model" all the way to "I can explain the forward graph and reproduce the inference path in TensorSharp." If you only need a top-level pointer, use the table below; otherwise jump into the individual cards.

What every card contains

Each card follows the same shape so you can diff architectures cleanly:

Origin and intent — who designed the model, what the GGUF arch keys are, and which capabilities (modalities, thinking, tools) it exposes.
Model architecture — the high-level block diagram, layer counts, and any per-layer heterogeneity.
Forward graph — the exact ordered list of ops a single token (decode) and a multi-token sequence (prefill) flow through, including residuals and normalizations.
Components — every sub-block (attention, FFN/SSM, routing, normalization, RoPE flavor, vision/audio encoder) explained in detail with the math that governs it.
Parameters and settings — the GGUF metadata keys, weight tensor naming convention, and dtype expectations.
TensorSharp implementation — pointers to the C# source files, the instantiation order, the cache layout, and the way the model plugs into ModelBase / Ops / native GGML kernels.
Prefill optimization — chunking, fused per-layer kernels, parallelization, cross-layer caches.
Decode optimization — fused single-call kernels, pre-resolved weight pointers, batched MoE, in-place kernels, cache reuse.
Memory and KV cache strategy — circular vs. linear caches, mmap-backed weights, pre-allocated decode buffers.
Multimodal pipeline — how images / audio / video are processed, encoded, and injected into the language model.
Output / chat template — protocol parser, stop tokens, thinking / tool formats.
Optimization opportunities — work that has not been done yet but that we know would unlock more performance or capability.
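
The "circular vs. linear caches" distinction in the memory item above comes down to how a token position maps to a cache row. The following is a minimal, language-neutral sketch in Python rather than TensorSharp's C# code. The function names and the `window` parameter are hypothetical, and the modulo layout is the usual ring-buffer scheme, assumed here rather than taken from the source.

```python
def linear_slot(pos: int) -> int:
    """Linear cache: the row index equals the absolute token position."""
    return pos


def circular_slot(pos: int, window: int) -> int:
    """Circular cache: positions wrap around a fixed number of rows (assumed layout)."""
    return pos % window


def resident_positions(pos: int, window: int) -> list[int]:
    """Absolute positions still held by a circular cache right after writing `pos`."""
    start = max(0, pos - window + 1)
    return list(range(start, pos + 1))


if __name__ == "__main__":
    window = 4  # hypothetical sliding-window size
    for pos in range(6):
        print(pos, linear_slot(pos), circular_slot(pos, window), resident_positions(pos, window))
```

A linear cache needs one row per token of context, while the circular variant never holds more than `window` rows, which is why it suits sliding-window attention layers.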

Implementation matrix

Architecture	Card	Source class	GGUF keys	Modalities	Reasoning	Tools	Batched / paged forward	Notable acceleration
Gemma 3	gemma3.md	`Gemma3Model`	`gemma3`	Text, image	No	No	No (legacy per-seq)	Alternating SWA / global attention, GeGLU FFN, QK-norm
Gemma 4	gemma4.md	`Gemma4Model`	`gemma4`	Text, image, video, audio	Yes	Yes	Default (toggle off with `TS_GEMMA4_BATCHED=0`)	Single-graph fused decode (all layers in one GGML dispatch), fused per-layer prefill, chunked prefill, circular SWA cache, PLE, KV sharing, MoE variants. Batched path matches legacy logits within FP noise (`Gemma4BatchedForwardTests`); reaches ~1.5× legacy at batch=8 and ~1.6× at 4×800-token prompts.
Qwen 3	qwen3.md	`Qwen3Model`	`qwen3`	Text	Yes	Yes	Reference port (`Qwen3Model.BatchedForward.cs`) — exercised by `Qwen3BatchedForwardTests` when a base-Qwen3 GGUF is provided	Native whole-model decode with pre-resolved weight pointers
Qwen 3.5 / 3.6 family	qwen35.md	`Qwen35Model`	`qwen35`, `qwen35moe`, `qwen3next`	Text, image	Yes	Yes	Default (toggle off with `TS_QWEN35_BATCHED=0` or `--no-continuous-batching`). Per-slot recurrent-state pool + optional native GatedDeltaNet kernel (`TS_QWEN35_BATCHED_GDN_NATIVE=1`)	Hybrid FullAttention + GatedDeltaNet recurrent, fused attention layer decode, fused prefill attention, fused output-projection + FFN, fused output-projection + norm + router, batched MoE (routed + shared + residual in a single kernel), fused vision encoder blocks
GPT OSS	gptoss.md	`GptOssModel`	`gptoss`, `gpt-oss`	Text	Yes (always)	Yes	Default (toggle off with `TS_GPTOSS_BATCHED=0`). Per-head attention sinks via `TSGgml_PagedAttentionForwardWithSinks` (or `TS_GPTOSS_PAGED_ATTN_MANAGED=1` for the C# fallback). 100% greedy match vs legacy in `GptOssBatchedCorrectnessTests`.	Stacked MoE prefill kernel (mul_mat_id + add_id + swiglu_oai), attention sinks, MXFP4 expert weights
Nemotron-H	nemotron.md	`NemotronModel`	`nemotron_h`, `nemotron_h_moe`	Text, image, audio (Omni-class; audio when mmproj present)	Yes	Yes	Default (toggle off with `TS_NEMOTRON_BATCHED=0`). Per-slot Mamba2 conv + SSM state pool; optional native batched Mamba2 step (`TS_NEMOTRON_MAMBA2_BATCHED_NATIVE=1`). 100% greedy match vs legacy; up to 3.95× tps at batch=3 on Apple M4 Pro.	Mamba2 + attention + MoE FFN hybrid stack, batched GPU MoE, optional Parakeet audio frontend, RADIO/v2_vl image encoder
Mistral 3	mistral3.md	`Mistral3Model`	`mistral3`	Text, image	No	No	Default — reference IBatchedPagedModel implementation. End-to-end validated on Ministral-3-14B; native paged-attention kernel is ~21% faster than the legacy per-seq path on long context.	YaRN-corrected RoPE with position-dependent Q scaling, fused QKV / gate_up, Pixtral vision encoder

Backend notes

Model code is intentionally backend-agnostic. ModelBase selects tensor storage through BackendType and the registered execution plan, then delegates the actual ops to the backend that owns those allocators:

Backend type	Package	Notes
`Cpu`	`TensorSharp.Core`	Pure managed tensors with SIMD/managed quantized fast paths (RMSNorm, RoPE, softmax, fused activations, GEMM, dequant).
`Cuda`	`TensorSharp.Backends.Cuda`	Direct CUDA Driver-API allocator and storage, cuBLAS GEMM, PTX kernels for hot ops (RMSNorm, softmax, RoPE/RoPEEx, SDPA, GQA prefill/decode, causal mask, gather/concat, activation fusions), native quantized matmul / get_rows for supported quant types, and a CPU fallback for ops that are not yet implemented.
`Mlx`	`TensorSharp.Backends.MLX`	Apple Silicon `mlx-c` bridge with quantized / fused / compiled kernels, async worker dispatch, MoE expert offload, and a CPU fallback layer. Requires `libmlxc`.
`GgmlCpu` / `GgmlMetal` / `GgmlCuda`	`TensorSharp.Backends.GGML` + `TensorSharp.GGML.Native`	Native ggml bridge with quantized graph dispatch and platform backends. mmap-backed quantized weights are bound zero-copy through host-pointer buffers. Includes the paged-attention kernel (`TSGgml_PagedAttentionForward`, plus the GPT OSS sinks variant) that powers the batched / paged execution path.

When a card mentions a fused GGML kernel (for example Qwen35AttentionLayerDecode, Gemma4LayerPrefill, or MoEExpertsSwiGLUResidual), the kernel is compiled from TensorSharp.GGML.Native/ggml_ops_*.cpp and exposed through TensorSharp.Backends.GGML/GgmlBasicOps.cs. The native bridge is the place to look when a fused path engages on GGML CPU / Metal / CUDA but not on the pure managed CPU or direct CUDA backends.

Continuous batching & paged KV cache

All architectures listed above also run through the shared InferenceEngine + ContinuousBatchScheduler + BatchExecutor stack documented in docs/PAGED_ATTENTION_AND_CONTINUOUS_BATCHING.md. Models that implement IBatchedPagedModel.ForwardBatch execute one batched forward per scheduler step (with slotMapping-based K/V scatter into a shared paged buffer and per-sequence attention via the native paged kernel); the others run through the per-sequence KV-swap fallback inside the same engine. The environment-variable toggles (opt-out switches for the batched path, opt-in switches for the native kernels) are summarised in the matrix above and in the project root README.
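
To make the slot-mapping idea concrete, here is a minimal sketch of scattering K/V rows from several sequences into one shared paged buffer. It is written in Python for brevity and is not the TensorSharp implementation: `BLOCK_SIZE`, the block-table layout, and all function names are hypothetical, chosen to match the common paged-KV scheme in which a per-sequence table maps logical blocks to physical ones.

```python
BLOCK_SIZE = 4  # hypothetical number of tokens per KV block


def slot_mapping(block_table: list[int], positions: list[int]) -> list[int]:
    """Map token positions of one sequence to flat slots in the shared pool."""
    return [block_table[p // BLOCK_SIZE] * BLOCK_SIZE + p % BLOCK_SIZE for p in positions]


def scatter_kv(pool: list, slots: list[int], rows: list) -> None:
    """Write one K (or V) row per token into its slot of the shared buffer."""
    for slot, row in zip(slots, rows):
        pool[slot] = row


def gather_kv(pool: list, block_table: list[int], length: int) -> list:
    """Read back the first `length` rows of one sequence in position order."""
    return [pool[s] for s in slot_mapping(block_table, list(range(length)))]


if __name__ == "__main__":
    pool = [None] * (4 * BLOCK_SIZE)           # four physical blocks shared by all sequences
    seq_a_blocks, seq_b_blocks = [2, 0], [3]   # logical -> physical block tables

    # One scheduler step: sequence A prefills 5 tokens, sequence B writes its first token.
    slots = slot_mapping(seq_a_blocks, [0, 1, 2, 3, 4]) + slot_mapping(seq_b_blocks, [0])
    scatter_kv(pool, slots, ["a0", "a1", "a2", "a3", "a4", "b0"])

    print(slots)
    print(gather_kv(pool, seq_a_blocks, 5))
```

Because every token carries its own slot, a single batched forward can mix prefill and decode tokens from different sequences while attention still reads each sequence's rows through its own block table.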

Architecture comparison

Feature	Gemma 3	Gemma 4	Qwen 3	Qwen 3.5 / 3.6 family	GPT OSS	Nemotron-H	Mistral 3
Layer type	Dense	Dense / MoE	Dense	Hybrid (Attn + Recurrent) ± MoE	MoE	Hybrid (Mamba2 + Attn + MoE FFN)	Dense
Attention	SWA + Global	SWA + Global	Full GQA	Full GQA + Sigmoid Gate	Full + Sinks	Full GQA (no RoPE)	Full GQA
FFN activation	GeGLU	GeGLU	SwiGLU	SwiGLU	SiLUAlphaLimit (clamped GLU)	ReLU²	SwiGLU
RoPE variant	NeoX (dual base)	NeoX + proportional / partial	NeoX	NeoX / MRoPE	NeoX + YaRN	None	GPT-J + YaRN
QK-norm	Yes	Yes	Yes	Yes	No	No	No
V-norm	No	Yes (unweighted)	No	No	No	No	No
Bias in projections	No	No	No	No	Yes (all linear)	No	No
Per-layer scaling	No	Yes	No	No	No	No	No
Per-Layer Embedding (PLE)	No	Yes	No	No	No	No	No
KV sharing	No	Yes (tail layers)	No	No	No	No	No
Attention sinks	No	No	No	No	Yes	No	No
Circular KV cache	No	Yes (SWA layers)	No	No	No	No	No
SSM / recurrent layers	No	No	No	Yes (GatedDeltaNet)	No	Yes (Mamba2)	No
Shared experts	No	No	No	Yes (qwen35moe / qwen3next)	No	Yes (optional)	No
Latent bottleneck FFN	No	No	No	No	No	Yes (optional)	No
Position-dependent Q scaling	No	No	No	No	No	No	Yes (with YaRN)
Vision	Yes	Yes	No	Yes	No	Yes (Omni)	Yes (Pixtral)
Audio	No	Yes	No	No	No	Yes (Parakeet, when mmproj present)	No
Video	No	Yes	No	No	No	No	No
Thinking	No	Yes	Yes	Yes	Yes (always)	Yes	No
Tool calling	No	Yes	Yes	Yes	Yes	Yes	No
Fused QKV	No	Yes	Yes	Mixed (full attention layers split, recurrent layers fuse a 5-way pack)	Yes	Yes	Yes
Fused single-graph decode	No	Yes (Gemma4ModelDecode)	Yes (TransformerModelDecode, native loop)	Per-layer fused (Qwen35AttentionLayerDecode, FusedOutProjFFN, FusedOutProjNormRouter)	Per-layer	Per-layer / batched MoE	No
Fused single-graph prefill	No	Yes (Gemma4LayerPrefill, dense layers)	No	Yes (FusedPrefillAttention, FusedOutProjFFN, MoE prefill)	Yes (MoE prefill via mul_mat_id)	No	No
Batched GPU MoE	n/a	Pending	n/a	Yes (routed + shared + residual fused)	Yes (stacked weight slabs)	Yes	n/a
Fused vision encoder	n/a	Standard	n/a	Yes (FusedVisionAttention + FusedVisionMLP)	n/a	Standard (RADIO ViT)	Standard (Pixtral)
Output parser	`PassthroughOutputParser`	`Gemma4OutputParser`	`Qwen3OutputParser`	`Qwen35OutputParser`	`HarmonyOutputParser` (always required)	`Qwen3OutputParser`	`PassthroughOutputParser`
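
For readers unfamiliar with the names in the FFN activation row, the sketch below spells out the standard definitions of SwiGLU, GeGLU, and ReLU² on plain Python lists. It is an illustration only, not TensorSharp code. The exact (erf-based) GELU is an assumption, since implementations often use a tanh approximation, and the clamped GLU used by GPT OSS is left out because the table does not give its constants.

```python
import math


def silu(x: float) -> float:
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """Exact GELU via erf (assumed variant; a tanh approximation is also common)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """SwiGLU: silu(gate) * up, element-wise."""
    return [silu(g) * u for g, u in zip(gate, up)]


def geglu(gate: list[float], up: list[float]) -> list[float]:
    """GeGLU: gelu(gate) * up, element-wise."""
    return [gelu(g) * u for g, u in zip(gate, up)]


def relu_squared(x: list[float]) -> list[float]:
    """ReLU^2: max(0, x)^2, applied to a single projection (no gate)."""
    return [max(0.0, v) ** 2 for v in x]


if __name__ == "__main__":
    gate, up = [1.0, -2.0], [2.0, 3.0]
    print(swiglu(gate, up))
    print(geglu(gate, up))
    print(relu_squared(gate))
```

In the gated variants, `gate` and `up` stand for the outputs of the two input projections of the FFN; the result is then fed to the down projection.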

Adding a new architecture

When you add a new model:

Create TensorSharp.Models/Models/<Name>/<Name>Model.cs inheriting ModelBase.
In the constructor: read GGUF metadata via _gguf.GetXxx(), call ParseBaseConfig() and ParseTokenizer(), call LoadWeights(), fuse weights, then initialize caches.
Implement Forward(int[] tokens) → float[]: embedding → optional multimodal injection → transformer blocks → final norm → LM head → logit copy.
Implement ResetKVCache() and Dispose(). Implement TruncateKVCache() when KV-cache reuse is supported.
Register the model in the ModelBase.Create() switch expression in TensorSharp.Models/ModelBase.cs.
If the model uses a non-standard output format, add an IOutputParser implementation in TensorSharp.Runtime/OutputParser.cs and register it in OutputParserFactory.Create().
Add chat template support in TensorSharp.Runtime/ChatTemplate.cs / Jinja2Template.cs if the model uses a novel template format.
Add a card under docs/models/<name>.md (and <name>_zh-cn.md if you want bilingual coverage), update this README's matrix, and link the card from the project root README.
Update TensorSharp.Server/testdata/ capability gates if the model exposes new modalities, thinking, or tool capabilities.
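
The registration step in the list above amounts to adding one more arm to an architecture dispatch. The sketch below models that dispatch as a Python dictionary, using the GGUF keys and source class names from the implementation matrix. It assumes that ModelBase.Create() switches on the GGUF architecture string; `ARCH_TO_MODEL` and `resolve_model` are hypothetical names used only for illustration.

```python
# GGUF architecture key -> model class name, copied from the implementation matrix.
ARCH_TO_MODEL = {
    "gemma3": "Gemma3Model",
    "gemma4": "Gemma4Model",
    "qwen3": "Qwen3Model",
    "qwen35": "Qwen35Model",
    "qwen35moe": "Qwen35Model",
    "qwen3next": "Qwen35Model",
    "gptoss": "GptOssModel",
    "gpt-oss": "GptOssModel",
    "nemotron_h": "NemotronModel",
    "nemotron_h_moe": "NemotronModel",
    "mistral3": "Mistral3Model",
}


def resolve_model(arch: str) -> str:
    """Return the class name that handles a GGUF architecture key."""
    try:
        return ARCH_TO_MODEL[arch]
    except KeyError:
        # An unregistered key is the symptom of a skipped registration step.
        raise ValueError(f"unsupported GGUF architecture: {arch!r}") from None


if __name__ == "__main__":
    print(resolve_model("qwen3next"))
    print(resolve_model("gpt-oss"))
```

Note that several keys can resolve to the same class, as with the Qwen 3.5 / 3.6 family, so a new variant of an existing architecture may only need a new key rather than a new model class.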
