ATOM (AiTer Optimized Model) is a lightweight vLLM-like implementation focused on integration and optimization on top of AITER.
- ROCm Optimized: Built on AMD's ROCm platform with AITER kernels (ASM, CK, Triton)
- OpenAI-Compatible API: Drop-in server with `/v1/chat/completions` and `/v1/completions` endpoints
- Piecewise torch.compile: 4 compilation levels with CUDA graph capture for low-latency decode
- Multi-GPU Parallelism: Tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP) with MORI all-to-all
- Quantization: FP8, MXFP4, INT8, INT4 with auto-detection from HuggingFace configs
- Speculative Decoding: Multi-Token Prediction (MTP) with EAGLE proposer
- Prefix Caching: xxhash64-based KV cache block sharing across sequences
| Model Family | HF Architecture | Dense/MoE | Notes |
|---|---|---|---|
| Llama | `LlamaForCausalLM` | Dense | Llama 2, Llama 3, Llama 3.1 |
| Qwen3 | `Qwen3ForCausalLM` | Dense | |
| Qwen3-MoE | `Qwen3MoeForCausalLM` | MoE | 128 experts, top-8 routing |
| DeepSeek V2/V3 | `DeepseekV3ForCausalLM` | MoE | MLA attention, MTP speculative decoding |
| Mixtral | `MixtralForCausalLM` | MoE | 8 experts, top-2 routing |
| GLM-4-MoE | `Glm4MoeForCausalLM` | MoE | |
| GLM-5 | `GlmMoeDsaForCausalLM` | MoE | MLA attention, similar to DeepSeek v3.2. See recipe |
| GPT-OSS | `GptOssForCausalLM` | MoE | Sliding window + attention sinks |
| Kimi-K2 | via `--trust-remote-code` | MoE | See recipe |
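For checkpoints that ship custom modeling code (such as Kimi-K2), pass `--trust-remote-code` when launching the server. A minimal sketch only; the model id and parallelism below are assumptions, so follow the Kimi-K2 recipe for the tested flags:

```bash
# Illustrative launch for a trust-remote-code checkpoint.
# Model id and -tp value are assumptions, not taken from this README.
python -m atom.entrypoints.openai_server \
  --model moonshotai/Kimi-K2-Thinking \
  --trust-remote-code \
  -tp 4
```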
- AMD GPU with ROCm support
- Docker
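A quick way to confirm the host meets these prerequisites before launching a container (assumes the ROCm host utilities are already installed):

```bash
# List AMD GPUs visible to the ROCm stack
rocm-smi

# Confirm Docker is available
docker --version
```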
Pre-built image with AITER + ATOM ready to use:
```bash
docker pull rocm/atom-dev:latest

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/atom-dev:latest
```

Alternatively, start from the ROCm PyTorch base image and install AITER and ATOM yourself:

```bash
docker pull rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0

docker run -it --network=host \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -v $HOME:/home/$USER \
  -v /mnt:/mnt \
  -v /data:/data \
  --shm-size=16G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0
```

Then, inside the container, install AITER and ATOM:

```bash
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git && pip install ./ATOM
```
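As a quick sanity check that the install succeeded (the import names below are assumptions; this README does not spell them out):

```bash
# Both imports should succeed if AITER and ATOM installed correctly
python -c "import aiter; import atom; print('ok')"
```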
Full documentation: rocm.github.io/ATOM

| Topic | Description | Guide |
|---|---|---|
| Architecture | System overview, request lifecycle, component design | Architecture Guide |
| Configuration | Config classes, CLI arguments, environment variables | Configuration Guide |
| Model Support | Supported models, weight loading, adding new architectures | Model Support Guide |
| Model Operations | AITER kernel integration, linear/attention/MoE/norm wrappers | Model Ops Guide |
| Scheduling & KV Cache | Batch scheduling, block allocation, prefix caching | Scheduling Guide |
| Compilation | torch.compile levels, CUDA graphs, piecewise compilation | Compilation Guide |
| Distributed | Tensor/data/expert parallelism, multi-GPU deployment | Distributed Guide |
| Serving & Benchmarks | OpenAI API server, benchmarking, profiling, speculative decoding | Serving Guide |
Deployment Recipes:
- Qwen3-235B-A22B -- TP8 + EP with FP8 KV cache
- Kimi-K2-Thinking -- MXFP4 MoE on 4 GPUs
- GLM-5 -- FP8 MoE with MLA on 8 GPUs
The default optimization level is 3 (piecewise torch.compile with CUDA graphs).
```bash
python -m atom.examples.simple_inference --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8
```

Note: First-time execution may take approximately 10 minutes for model compilation.
Start an OpenAI-compatible server:
```bash
# Single GPU
python -m atom.entrypoints.openai_server --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8

# Multi-GPU with tensor parallelism
python -m atom.entrypoints.openai_server --model deepseek-ai/DeepSeek-R1 --kv_cache_dtype fp8 -tp 8
```
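Once the server is up it speaks the OpenAI protocol, so any OpenAI-compatible client works. A minimal request sketch (the port and unauthenticated setup are assumptions; adjust to your deployment):

```bash
# Send a chat completion request to the running server
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```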
Profile offline inference:

```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8
```

With custom input/output lengths:
```bash
python -m atom.examples.profile_offline --model Qwen/Qwen3-0.6B --kv_cache_dtype fp8 \
  --random-input --input-length 1024 --output-length 32
```

Profile a running server:
```bash
curl -s -S -X POST http://127.0.0.1:8000/start_profile
# ... run your workload ...
curl -s -S -X POST http://127.0.0.1:8000/stop_profile
```

Run an online throughput benchmark against a running server:
```bash
MODEL=deepseek-ai/DeepSeek-R1
ISL=1024
OSL=1024
CONC=128
PORT=8000
RESULT_FILENAME=Deepseek-R1-result

python -m atom.benchmarks.benchmark_serving \
  --model=$MODEL --backend=vllm --base-url=http://localhost:$PORT \
  --dataset-name=random \
  --random-input-len=$ISL --random-output-len=$OSL \
  --random-range-ratio 0.8 \
  --num-prompts=$(( $CONC * 10 )) \
  --max-concurrency=$CONC \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el" \
  --result-dir=./ --result-filename=$RESULT_FILENAME.json
```

ATOM supports automatic trace collection and analysis, which breaks down GPU kernel durations per module for both prefill and decode phases and exports the results to Excel (.xlsx) files.
Launch the server with `--torch-profiler-dir` to enable the PyTorch profiler and `--mark-trace` to insert per-module annotations into the trace. Set `TORCHINDUCTOR_COMPILE_THREADS=1` to ensure deterministic compilation order.
```bash
TORCHINDUCTOR_COMPILE_THREADS=1 python -m atom.entrypoints.openai_server \
  --model deepseek-ai/DeepSeek-R1 \
  --kv_cache_dtype fp8 -tp 8 \
  --torch-profiler-dir ./trace \
  --mark-trace
```

After the server processes requests and shuts down, two `*.json.gz` trace files will be generated in the `--torch-profiler-dir` directory.
Run `parse_trace.py` on the collected trace (use the trace file whose name starts with the model name):

```bash
python ATOM/tools/parse_trace.py ./trace/model_name_ts_*.json.gz
```

This produces two Excel files in the current directory:
| Output File | Description |
|---|---|
| `prefill_breakdown.xlsx` | Per-kernel duration breakdown for one prefill layer |
| `decode_breakdown.xlsx` | Per-kernel duration breakdown for one decode layer |
Each file contains the columns `cpu_module`, `gpu_kernel`, and `duration_us`, a per-module sum, and values averaged across layers.
Options:

| Flag | Default | Description |
|---|---|---|
| `--layer N` | 3 | Target transformer layer index to analyze (0-indexed) |
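For example, to break down a layer other than the default (the layer index below is arbitrary and only for illustration):

```bash
# Analyze transformer layer 5 instead of the default layer 3
python ATOM/tools/parse_trace.py ./trace/model_name_ts_*.json.gz --layer 5
```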
For more information, visit InferenceMAX.
Install lm-eval to test model accuracy:
```bash
pip install lm-eval[api]
```

Start a server, then run the evaluation:
```bash
python -m atom.entrypoints.openai_server --model meta-llama/Meta-Llama-3-8B --kv_cache_dtype fp8

lm_eval --model local-completions \
  --model_args model=meta-llama/Meta-Llama-3-8B,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k \
  --num_fewshot 5
```

This project was adapted from nano-vllm.
We welcome issues and contributions! Please use the GitHub Issues page to report bugs or request features: https://github.com/ROCm/ATOM/issues

