amd-npu-stable-diffusion-linux

Stable Diffusion image generation on AMD Ryzen AI XDNA 2 NPUs for Linux.

Loads AMD's pre-compiled ONNX models from the Ryzen AI SDK and runs the diffusion pipeline (text encoder, UNet, VAE decoder) with ~98% of UNet operations offloaded to the NPU.

Supported models

Preset	Model	Steps	Time	Resolution	Notes
`sd15`	Stable Diffusion 1.5	20	~5 s	512×512	Classifier-free guidance (batch = 2 UNet)
`sd-turbo`	SD-Turbo	1	~2 s	512×512	Single-step distilled, no guidance
`sdxl-base`	SDXL-Base	30	~35 s	1024×1024	Two text encoders + micro-conditioning; CFG mandatory
`sdxl-turbo`	SDXL-Turbo	1	~2 s	512×512	Two text encoders, single-step, no CFG
`vega`	Segmind Vega	20	~12 s	1024×1024	Distilled SDXL, two text encoders, CFG mandatory

Times measured on a Ryzen AI MAX+ 395 (Strix Halo) with 128 GB RAM, Fedora 43, kernel 6.19.

Hardware & OS

Tested on:

AMD Ryzen AI MAX+ 395 (Strix Halo) / 128 GB RAM
Fedora 43, kernel 6.19
amdxdna DKMS module, firmware 1.1.2.65

It should also work on other Ryzen AI 300-series chips (Strix Point etc.), though only the configuration above has been tested. You need a working /dev/accel/accel0 NPU device and the Ryzen AI SDK's ONNX Runtime. If you bring it up on another SKU, please open an issue or PR with the results.

One-time prerequisites

Do these once per machine. None of them are managed by setup.sh.

1. Install the amdxdna driver + XRT

Build and install the driver from the amd/xdna-driver repository, following its dkms_driver.sh flow. Do not rely on the in-tree amdxdna module shipped with recent kernels: it aborts DD (DynamicDispatch) xclbins.

2. Bind the NPU device

# Check if /dev/accel/accel0 already exists
ls /dev/accel/accel0

# If not, discover the NPU's PCI address
lspci -d 1022: | grep -i 'signal\|npu'
# e.g. "c4:00.1 Signal processing controller: AMD NPU ..."

# Bind it (substitute the BDF you just found; domain 0000 is fine for a single-NPU box)
echo "0000:c4:00.1" | sudo tee /sys/bus/pci/drivers/amdxdna/bind
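
The discovery step can also be scripted. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and it assumes lspci prints the short bus:device.function form shown in the example comment above (no domain prefix).

```python
import re
from pathlib import Path
from typing import Optional

# Matches a short PCI address such as "c4:00.1" followed by the device description.
BDF_RE = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+(.*)$")


def find_npu_bdf(lspci_output: str, domain: str = "0000") -> Optional[str]:
    """Return 'domain:bus:dev.fn' for the first line mentioning 'signal' or 'npu'.

    Mirrors the grep filter used above, so it is just as permissive.
    """
    for line in lspci_output.splitlines():
        match = BDF_RE.match(line.strip())
        if match and re.search(r"signal|npu", match.group(2), re.IGNORECASE):
            return f"{domain}:{match.group(1)}"
    return None


def npu_device_present(node: str = "/dev/accel/accel0") -> bool:
    """True if the accel node already exists, in which case no bind is needed."""
    return Path(node).exists()


if __name__ == "__main__":
    sample = "c4:00.1 Signal processing controller: AMD NPU ..."
    print(find_npu_bdf(sample))
```

The returned string is the value to write to /sys/bus/pci/drivers/amdxdna/bind. On a machine with several PCI domains, pass the correct domain explicitly rather than relying on the 0000 default.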

3. Raise the NPU command-watchdog timeout to 30 s

The amdxdna driver defaults to a 2-second per-command watchdog. SDXL (~1 s/step in a single fused 1742-op kernel) reliably exceeds it, and SD1.5 intermittently overshoots on cold start too. Make the fix persistent:

echo 'options amdxdna timeout_in_sec=30' | sudo tee /etc/modprobe.d/amdxdna.conf
sudo rmmod amdxdna && sudo modprobe amdxdna
cat /sys/module/amdxdna/parameters/timeout_in_sec   # should print 30

Important: the firmware latches the timeout at driver init, so writing to /sys/module/amdxdna/parameters/timeout_in_sec at runtime does not take effect. The sysfs value can read 30 while the firmware is still on 2 s. If you hit ERT_CMD_STATE_TIMEOUT on any model (even SD1.5), reload the module regardless of what sysfs says.
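
A small pre-flight check can read the module parameter before a run. This is a sketch under the assumptions stated in this section; the helper names are hypothetical, and because sysfs can disagree with the firmware, a value of 30 is only a hint, not proof.

```python
from pathlib import Path

PARAM_PATH = "/sys/module/amdxdna/parameters/timeout_in_sec"


def read_timeout(param_path: str = PARAM_PATH):
    """Return the timeout reported by sysfs, or None if it cannot be read."""
    try:
        return int(Path(param_path).read_text().strip())
    except (OSError, ValueError):
        return None


def timeout_status(param_path: str = PARAM_PATH, wanted: int = 30) -> str:
    """Summarize what the sysfs value does and does not tell us."""
    value = read_timeout(param_path)
    if value is None:
        return "amdxdna parameter not readable; is the module loaded?"
    if value < wanted:
        return (f"timeout_in_sec={value}; set {wanted} in "
                "/etc/modprobe.d/amdxdna.conf and reload the module")
    # sysfs alone cannot confirm what the firmware latched at driver init.
    return (f"sysfs reports {value} s; reload the module anyway if "
            "ERT_CMD_STATE_TIMEOUT persists")


if __name__ == "__main__":
    print(timeout_status())
```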

4. Install the Ryzen AI SDK 1.7.1

Download the SDK from ryzenai.docs.amd.com and extract it anywhere on the system. setup.sh auto-detects ~/ryzen_ai-1.7.1 and /opt/ryzen_ai*; for any other location, set RYZEN_AI_PATH=/path/to/sdk before running setup.sh.

You'll also need whichever Python the SDK's bundled ONNX Runtime was built for (Python 3.12 for 1.7.1). setup.sh will tell you if it's missing.
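
The lookup described above can be pictured roughly as follows. This is an illustrative sketch only: setup.sh's actual logic is not reproduced here, the function name is hypothetical, and the precedence (explicit RYZEN_AI_PATH first, then the home directory, then /opt) is an assumption.

```python
import glob
import os
from typing import Optional


def find_sdk(env=None, home=None, opt_root: str = "/opt") -> Optional[str]:
    """Return the first existing SDK directory among the documented locations."""
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home

    candidates = []
    if env.get("RYZEN_AI_PATH"):
        candidates.append(env["RYZEN_AI_PATH"])  # explicit override (assumed to win)
    candidates.append(os.path.join(home, "ryzen_ai-1.7.1"))
    candidates.extend(sorted(glob.glob(os.path.join(glob.escape(opt_root), "ryzen_ai*"))))

    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


if __name__ == "__main__":
    print(find_sdk() or "SDK not found; set RYZEN_AI_PATH")
```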

5. Request Hugging Face access for the models you want

The models are pre-compiled and redistributed by AMD. Four of the five presets are gated (access must be requested on the model page); Segmind Vega is public.

Preset	Hugging Face repo	Access
`sd15`	amd/stable-diffusion-1.5-amdnpu	Gated
`sd-turbo`	amd/sd-turbo-amdnpu	Gated
`sdxl-base`	amd/sdxl-base-amdnpu	Gated
`sdxl-turbo`	amd/sdxl-turbo-amdnpu	Gated
`vega`	amd/segmind-vega-amdnpu	Public

Click "Request access" on each gated model's page, then authenticate locally:

hf auth login
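
To see which repos need an access request for a given set of presets, the table above can be expressed as data. This is a hypothetical helper for illustration; it is not how download_model.py is known to store the mapping.

```python
# Preset -> (Hugging Face repo, gated?) as listed in the table above.
PRESET_REPOS = {
    "sd15": ("amd/stable-diffusion-1.5-amdnpu", True),
    "sd-turbo": ("amd/sd-turbo-amdnpu", True),
    "sdxl-base": ("amd/sdxl-base-amdnpu", True),
    "sdxl-turbo": ("amd/sdxl-turbo-amdnpu", True),
    "vega": ("amd/segmind-vega-amdnpu", False),
}


def gated_repos(presets):
    """Return the repos among `presets` that require an approved access request."""
    return [PRESET_REPOS[name][0] for name in presets if PRESET_REPOS[name][1]]


if __name__ == "__main__":
    for repo in gated_repos(["sd15", "vega", "sdxl-turbo"]):
        print(f"request access: {repo}")
```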

Quick start

# Install everything and download the default SD1.5 model (takes ~10 min; model is ~950 MB)
./setup.sh

# Generate an image
./run.sh "a corgi sitting in a field of wildflowers" --seed 42

Using other models

Each preset's model is downloaded separately. You can either pick one at setup time, or run the downloader afterwards for additional models.

# Install + download Segmind Vega in one shot
./setup.sh --model vega

# Or, after initial setup, add more models at any time
python download_model.py --model sdxl-turbo
python download_model.py --model sdxl-base

# Then generate
./run.sh --model vega "a corgi in a field of wildflowers" --seed 42
./run.sh --model sdxl-turbo "a cat on a windowsill" --seed 42
./run.sh --model sdxl-base "an astronaut riding a horse" --seed 42

Options

./run.sh [--model PRESET] PROMPT [flags...]

  --model {sd15,sd-turbo,sdxl-base,sdxl-turbo,vega}   Model preset (default: sd15)
  --negative-prompt TEXT                              Negative prompt
  --steps N                                           Denoising steps (default: per model)
  --guidance-scale F                                  CFG scale (default: per model)
  --seed N                                            Random seed
  --output PATH                                       Output image (default: output.png)
  --model-dir PATH                                    Override resolved model directory
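
For readers who want to drive the pipeline from their own script, the flags above map naturally onto argparse. The parser below is a sketch that mirrors the documented interface; it is an assumption about shape, not a copy of the parser in run_npu.py, and leaving the per-model defaults as None is likewise an assumption.

```python
import argparse

PRESETS = ["sd15", "sd-turbo", "sdxl-base", "sdxl-turbo", "vega"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented run.sh flags."""
    parser = argparse.ArgumentParser(prog="run.sh")
    parser.add_argument("prompt", metavar="PROMPT", help="Text prompt")
    parser.add_argument("--model", choices=PRESETS, default="sd15",
                        help="Model preset")
    parser.add_argument("--negative-prompt", default=None, help="Negative prompt")
    # None means "use the preset's own default" (resolved later, per model).
    parser.add_argument("--steps", type=int, default=None, help="Denoising steps")
    parser.add_argument("--guidance-scale", type=float, default=None, help="CFG scale")
    parser.add_argument("--seed", type=int, default=None, help="Random seed")
    parser.add_argument("--output", default="output.png", help="Output image")
    parser.add_argument("--model-dir", default=None,
                        help="Override resolved model directory")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

As in the examples earlier, flags may appear before or after the prompt.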

Environment variables

Variable	Purpose
`RYZEN_AI_PATH`	Ryzen AI SDK install directory (read by `setup.sh`; only needed if the SDK is not in a default location)
`NPUTOP_PATH`	Local nputop checkout to install editable instead of pulling from git
`NPU_DEVICE`	NPU device node (default `/dev/accel/accel0`)
`XDNA_MODELS_DIR`	Parent directory for downloaded models (default `./models/`)
`XDNA_CACHE_DIR`	Root for per-run telemetry CSVs (default `~/.cache/xdna2-npu-diffusion`)
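
The defaults in the table can be resolved in a few lines. This sketch uses a hypothetical function name and assumes that an unset variable falls back to the documented default; the telemetry subdirectory follows the path given in the Troubleshooting section.

```python
import os
from pathlib import Path


def resolve_paths(env=None) -> dict:
    """Resolve device and directory settings from the environment."""
    env = os.environ if env is None else env
    cache_root = Path(env.get("XDNA_CACHE_DIR", "~/.cache/xdna2-npu-diffusion")).expanduser()
    return {
        "npu_device": env.get("NPU_DEVICE", "/dev/accel/accel0"),
        "models_dir": Path(env.get("XDNA_MODELS_DIR", "./models/")),
        "cache_dir": cache_root,
        # Per-run CSVs live under <cache root>/telemetry/<run_id>.csv.
        "telemetry_dir": cache_root / "telemetry",
    }


if __name__ == "__main__":
    for key, value in resolve_paths().items():
        print(f"{key}: {value}")
```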

Performance

Measured on AMD Ryzen AI MAX+ 395 (Strix Halo), 128 GB RAM, Fedora 43:

SD1.5 (20 steps, guidance = 7.5)

Stage	Time
Model loading	1.0 s
Text encoding (CPU)	0.1 s
UNet denoising (20 steps, NPU)	3.6 s (5.5 it/s)
VAE decode (NPU)	0.2 s
Total	~5 s

SD-Turbo (1 step, no guidance)

Stage	Time
Model loading	1.3 s
Text encoding (CPU)	0.2 s
UNet denoising (1 step, NPU)	0.1 s
VAE decode (NPU)	0.2 s
Total	~2 s
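
The totals and the it/s figure follow directly from the stage times. The short check below recomputes them from the table values; since those values are already rounded, the recomputed throughput (about 5.6 it/s) differs slightly from the reported 5.5 it/s.

```python
# Stage times in seconds, copied from the two tables above.
SD15_STAGES = {
    "Model loading": 1.0,
    "Text encoding (CPU)": 0.1,
    "UNet denoising (NPU)": 3.6,
    "VAE decode (NPU)": 0.2,
}
SD_TURBO_STAGES = {
    "Model loading": 1.3,
    "Text encoding (CPU)": 0.2,
    "UNet denoising (NPU)": 0.1,
    "VAE decode (NPU)": 0.2,
}


def total_seconds(stages: dict) -> float:
    """Sum the stage times, rounded to one decimal place."""
    return round(sum(stages.values()), 1)


def iterations_per_second(steps: int, seconds: float) -> float:
    """Denoising throughput: steps divided by UNet wall time."""
    return steps / seconds


if __name__ == "__main__":
    print("SD1.5 total:", total_seconds(SD15_STAGES), "s")
    print("SD-Turbo total:", total_seconds(SD_TURBO_STAGES), "s")
    print("SD1.5 UNet:", round(iterations_per_second(20, 3.6), 2), "it/s")
```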

Troubleshooting

Black / all-zero image. compile_fusion_rt=1 was set in the session options at some point, which overwrote AMD's pre-compiled metastate files with broken versions. Delete the affected models/<preset>/ directory and run python download_model.py --model <preset> again. Never add compile_fusion_rt to the session config.

ERT_CMD_STATE_TIMEOUT from the driver. The firmware watchdog fired. Confirm the driver was reloaded after setting timeout_in_sec=30 (prerequisite step 3). Sysfs can read 30 while the firmware is still on 2 s; only reloading the module (rmmod, then modprobe) applies the new value.

unordered_map::at during UNet step 1. The amdxdna module loaded but the NPU wasn't actually bound. Repeat the lspci lookup and bind from prerequisite step 2.

401 Unauthorized from Hugging Face. You haven't been granted access to the gated model, or haven't run hf auth login. Visit the model page, click "Request access", wait for approval, then authenticate.

"ONNX Runtime not found" during setup. The SDK path doesn't contain lib/python3.X/site-packages/onnxruntime. Confirm RYZEN_AI_PATH points at a full SDK install (not just the deployment/ subdirectory).
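
To check an SDK directory by hand, look for the onnxruntime package under any python3.X directory. The helper below is a hypothetical sketch of that check, based only on the path pattern named above.

```python
import glob
import os


def find_sdk_onnxruntime(sdk_path: str) -> list:
    """Return onnxruntime package directories found under the SDK's lib/python3.X trees."""
    pattern = os.path.join(glob.escape(sdk_path), "lib", "python3.*",
                           "site-packages", "onnxruntime")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    sdk = os.environ.get("RYZEN_AI_PATH", os.path.expanduser("~/ryzen_ai-1.7.1"))
    hits = find_sdk_onnxruntime(sdk)
    print(hits if hits else f"no onnxruntime under {sdk}; is this a full SDK install?")
```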

No NPU telemetry in diagnostics output. nputop either isn't installed or can't open /dev/accel/accel0. Check that your user has read/write access to the device node. Telemetry is optional; the pipeline still works without it.
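
A quick way to tell a missing device node from a permissions problem is sketched below. The function name is hypothetical, and the check only covers the access bits; it does not verify that nputop itself is installed.

```python
import os


def device_access(node: str = "/dev/accel/accel0") -> str:
    """Classify the device node as 'missing', 'ok', or 'no-permission'."""
    if not os.path.exists(node):
        return "missing"
    # The current user needs both read and write access to the node.
    if os.access(node, os.R_OK | os.W_OK):
        return "ok"
    return "no-permission"


if __name__ == "__main__":
    print(device_access(os.environ.get("NPU_DEVICE", "/dev/accel/accel0")))
```

A result of "missing" points back to prerequisite step 2, while "no-permission" means the node exists but your user cannot open it.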

Where are the telemetry CSVs? ~/.cache/xdna2-npu-diffusion/telemetry/<run_id>.csv by default. Override with XDNA_CACHE_DIR.

Project structure

xdna2-npu-diffusion/
├── setup.sh                # One-time bootstrap (venv, libs, model, const extraction)
├── run.sh                  # Wrapper that sets env and runs the pipeline
├── run_npu.py              # Unified text-to-image pipeline (all 5 presets)
├── diagnostics.py          # NPU health probes, telemetry CSV, retry wrapper
├── download_model.py       # Pull a preset from Hugging Face and extract .const files
├── setup_const_files.py    # .fconst → .const byte-slice extractor
├── requirements.txt        # Runtime Python deps (excluding ORT and PyTorch-CPU)
├── tests/
│   └── test_dd_smoke.py    # DD custom ops smoke test on random UNet inputs
├── lib/                    # [generated by setup.sh] Deployment libraries from the SDK
└── models/                 # [generated by setup.sh] Downloaded pre-compiled ONNX models

Important notes

Never set compile_fusion_rt=1 as a session config option. It overwrites AMD's pre-compiled metastate files with broken versions that produce black images.
The NPU device must be bound before running. If the amdxdna module loads but doesn't bind the device, all DD operations fail with a cryptic unordered_map::at error.

Acknowledgments

AMD for publishing the pre-compiled DynamicDispatch ONNX models on Hugging Face and the Ryzen AI SDK runtime.
