amdxdna-strix-fix

Patch for the AMD XDNA NPU driver on Ryzen AI 300 (Strix Point/Halo) that fixes the SMU init failure preventing the NPU from loading on Linux.

If your dmesg shows this:

amdxdna 0000:c6:00.1: [drm] *ERROR* aie2_smu_init: Access power failed, ret -22
amdxdna 0000:c6:00.1: [drm] *ERROR* aie2_hw_start: failed to init smu, ret -22

This patch fixes it. With the patch applied, the NPU goes from failing to initialize to 40-46 tok/s prefill on Llama 3.2 1B at ~2W.
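
To check a live or saved kernel log for this exact failure, a small shell helper can match on the message text. This is a minimal sketch: the function name has_smu_init_failure is invented for illustration and is not part of this repo, and it matches only the message string shown above, since the PCI address (0000:c6:00.1) will differ between machines.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): reads kernel log text on stdin
# and succeeds if it contains the SMU init failure quoted above.
has_smu_init_failure() {
    grep -q 'aie2_smu_init: Access power failed'
}

# Usage (dmesg may need root on some systems):
#   sudo dmesg | has_smu_init_failure && echo "this machine hits the SMU init bug"
```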


The problem

The in-tree amdxdna driver (Linux 6.14+) fails to initialize the NPU on Strix Point/Halo because of an init-order bug: it tries to talk to the SMU (System Management Unit) before the PSP (Platform Security Processor) has loaded the NPU firmware. The SMU doesn't respond because it's part of the NPU firmware — which hasn't been loaded yet.

Driver init order (broken):
  1. SMU init → FAILS (SMU not running, firmware not loaded)
  2. PSP firmware load → never reached
  3. Everything else → never reached

The SMU responds with 0xFF (not initialized) or times out entirely. The driver gives up and the NPU stays dead.

The fix

Treat a failed SMU init as non-fatal: log it, skip the SMU, and continue to the PSP firmware load. The NPU then operates without power management (the SMU handles DPM and clock gating) and runs at BIOS-default clock speeds, which is fine for inference workloads.

// In aie2_pci.c, function aie2_hw_start():

// Original: SMU → PSP → firmware alive check
// Patched:  SMU → if fail → skip, PSP → firmware alive check (no power mgmt)

ret = aie2_smu_init(ndev);
if (ret) {
    XDNA_INFO(xdna, "SMU init failed (%d), bypassing — no power management", ret);
    // Don't return error — continue without SMU
}

ret = aie2_psp_start(ndev);
if (ret) {
    XDNA_ERR(xdna, "PSP start failed");
    goto cleanup;
}
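
The difference between the original and patched control flow can be sketched as a toy model in shell. This is not driver code: every function and variable name below is invented for illustration, and the rule that the SMU only answers once the firmware is loaded is this README's diagnosis, encoded here as an assumption.

```shell
#!/bin/sh
# Toy model of the two init flows; NOT driver code. All names are invented.
fw_loaded=0

# Assumption from this README: the SMU only answers once firmware is loaded.
smu_init() { [ "$fw_loaded" -eq 1 ]; }
# PSP start is what loads the firmware.
psp_start() { fw_loaded=1; }

# Original flow: a failed SMU init aborts bring-up before PSP ever runs.
hw_start_original() {
    fw_loaded=0
    smu_init || { echo "smu init failed, abort"; return 1; }
    psp_start
    echo "npu up"
}

# Patched flow: log the SMU failure, then continue to the PSP step.
hw_start_patched() {
    fw_loaded=0
    pm="on"
    if ! smu_init; then
        echo "smu init failed, bypassing"
        pm="off"
    fi
    psp_start || { echo "psp start failed"; return 1; }
    echo "npu up, power management $pm"
}
```

In this model hw_start_original stops at the first step with a non-zero status, while hw_start_patched reaches the PSP step and finishes with power management off.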

Affected hardware

| Field | Value |
|---|---|
| CPU | AMD Ryzen AI 9 HX 370 (Strix Point) |
| NPU PCI ID | 1022:17F0 rev 0x10 |
| NPU Type | RyzenAI-npu4 (XDNA 2, AIE2P architecture) |
| Capability | 50 TOPS INT8 / BFP16 |
| Firmware | npu.sbin.1.1.2.64.zst at /lib/firmware/amdnpu/17f0_10/ |

Likely also affects:

  • Ryzen AI 9 HX 375
  • Ryzen AI 7 PRO 360
  • Other Strix Point/Halo APUs with XDNA 2 NPU (PCI ID 1022:17F0)
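
To see whether a machine has an NPU with the PCI ID listed above, the output of lspci -nn can be filtered for the vendor:device pair. A minimal sketch, with a helper name (has_npu_17f0) made up for illustration; it only confirms the ID is present and says nothing about whether the driver loads.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -nn` output on stdin and succeeds if a
# device with vendor:device ID 1022:17f0 is listed (case-insensitive match).
has_npu_17f0() {
    grep -qi '\[1022:17f0\]'
}

# Usage:
#   lspci -nn | has_npu_17f0 && echo "NPU with PCI ID 1022:17F0 found"
```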

Quick start

Option A: Build with the scripts in this repo (fastest)

# Clone this repo
git clone https://github.com/Peterc3-dev/amdxdna-strix-fix.git
cd amdxdna-strix-fix

# Build the patched module for your kernel
./build.sh

# Test it
sudo ./test-driver.sh

# If it works, install the auto-loader
sudo ./install-service.sh

Option B: Patch manually

  1. Clone the xdna-driver repo:

    git clone https://github.com/amd/xdna-driver.git
    cd xdna-driver
  2. Apply the patch:

    # In drivers/accel/amdxdna/aie2_pci.c, function aie2_hw_start():
    # Change the SMU init from a fatal error to a warning.
    # See patch/0001-amdxdna-bypass-smu-init-strix.patch in this repo.
  3. Build the module:

    cd xrt/build/arch
    makepkg -s --noconfirm -p PKGBUILD-xrt-plugin
    sudo pacman -U xrt-plugin-amdxdna-*.pkg.tar.zst
  4. Load it:

    sudo rmmod amdxdna 2>/dev/null
    sudo insmod /path/to/patched/amdxdna.ko

Auto-load on boot (systemd)

The install-service.sh script creates a systemd service that:

  1. Unloads the in-tree (broken) amdxdna driver
  2. Loads the patched module
  3. Runs at boot before multi-user.target

sudo ./install-service.sh
# Service: npu-loader.service
# Check: systemctl status npu-loader.service
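
For readers who want to see what such a service could look like, here is a hypothetical sketch that prints a oneshot unit matching the three points above. The unit actually written by install-service.sh is not shown in this README and may differ; the function name, the unit description, and the /usr/bin paths (as found on Arch-based systems) are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: prints a oneshot unit that swaps in the patched module.
# The unit written by install-service.sh may differ from this.
# $1 = absolute path to the patched amdxdna.ko
print_npu_loader_unit() {
    cat <<EOF
[Unit]
Description=Load patched amdxdna module
Before=multi-user.target

[Service]
Type=oneshot
RemainAfterExit=yes
# A leading "-" tells systemd to ignore a failure here (for example when the
# in-tree module is not loaded). Adjust /usr/bin to your distribution.
ExecStartPre=-/usr/bin/rmmod amdxdna
ExecStart=/usr/bin/insmod $1

[Install]
WantedBy=multi-user.target
EOF
}

# Usage: print_npu_loader_unit /path/to/patched/amdxdna.ko
```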

Verify it works

# Check device exists
ls /dev/accel/accel0

# Check driver loaded
lsmod | grep amdxdna

# Check kernel log
dmesg | grep -i "amdxdna\|npu\|smu"
# Should show: "SMU init failed (-22), bypassing" followed by a successful probe

# Run inference (requires FastFlowLM)
pip install lemonade-server
flm validate
# Expected: ready: true, 8 columns, FW version 1.1.2.64

NPU benchmarks (after fix)

Tested with FastFlowLM (Llama 3.2 1B):

| Metric | Value |
|---|---|
| Prefill speed | 40-46 tok/s |
| Decode speed | 14-24 tok/s |
| Power draw | ~2W |
| Columns available | 8 |
| Firmware | 1.1.2.64 |

Root cause analysis

Why the SMU doesn't respond

The SMU (System Management Unit) on Strix Point/Halo is embedded in the NPU firmware package rather than being a standalone hardware block pre-initialized by the BIOS. The driver assumes the SMU is ready at PCI probe time, which holds on older hardware but not on Strix Point/Halo.

Register evidence

SMU BAR 5 registers:
  SMU_CMD  [0x900]: 0x00000004  ← POWER_OFF command (stale, never processed)
  SMU_RESP [0x9F4]: 0x00000000  ← No response (SMU not running)

PSP BAR 4 registers:
  PSP_STATUS [0xAEC]: 0x80000000  ← READY bit set (PSP is alive)

The PSP is ready to load firmware. The SMU is not running. The fix: let the PSP do its job first.
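
The two readings above can be interpreted with a couple of shell arithmetic checks. This is only a sketch of the interpretation: the helper names are invented, bit 31 (0x80000000) is treated as READY because that is how the PSP_STATUS line is annotated, and the helpers take a value you have already read (they do not read the BARs themselves).

```shell
#!/bin/sh
# Hypothetical helpers for interpreting the values above; not driver code.

# Bit 31 (0x80000000) is treated as READY, as annotated on PSP_STATUS.
# $1 = register value, e.g. 0x80000000
psp_is_ready() {
    [ $(( ($1 >> 31) & 1 )) -eq 1 ]
}

# A response register that still reads zero means nothing has answered yet.
# $1 = register value, e.g. 0x00000000
smu_has_responded() {
    [ $(( $1 )) -ne 0 ]
}

# Usage:
#   psp_is_ready 0x80000000 && echo "PSP ready"
#   smu_has_responded 0x00000000 || echo "SMU silent"
```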

What you lose

The SMU handles power management: Dynamic Power Management (DPM), clock gating, and thermal throttling. Without the SMU:

  • NPU runs at BIOS-default clock speed (not downclocked when idle)
  • No dynamic frequency scaling
  • Slightly higher idle power draw
  • Inference performance is unaffected

Proper fix (for AMD)

The driver should detect Strix Point/Halo hardware and reorder init:

  1. PSP firmware load first
  2. Wait for firmware alive signal
  3. SMU init after firmware is loaded
  4. If SMU still fails, operate in degraded mode (no DPM)

This should be reported upstream as a bug against drivers/accel/amdxdna/ in the Linux kernel.
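
The four-step order proposed above can be modelled in shell to make the control flow concrete. It is a toy model, not driver code: all names are invented, the SMU_BROKEN switch exists only to exercise the degraded path, and whether the SMU really comes up after the PSP firmware load is this README's hypothesis rather than something the model proves.

```shell
#!/bin/sh
# Toy model of the proposed init order; NOT driver code. All names are invented.
fw_loaded=0

psp_start() { fw_loaded=1; }                # step 1: PSP loads the firmware
fw_alive()  { [ "$fw_loaded" -eq 1 ]; }     # step 2: firmware alive check
# step 3: SMU answers only if firmware is loaded (set SMU_BROKEN=1 to force a failure)
smu_init()  { [ "$fw_loaded" -eq 1 ] && [ "${SMU_BROKEN:-0}" -eq 0 ]; }

hw_start_proposed() {
    fw_loaded=0
    psp_start || { echo "psp start failed"; return 1; }
    fw_alive  || { echo "firmware not alive"; return 1; }
    if smu_init; then
        echo "npu up, DPM enabled"
    else
        echo "npu up, degraded (no DPM)"    # step 4: keep running without DPM
    fi
}
```

With the default settings the model ends in "npu up, DPM enabled"; with SMU_BROKEN=1 it still brings the NPU up, but in degraded mode.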

Caveats

  • Rebuild after kernel updates. The .ko is built for a specific kernel version. Whenever your distribution (CachyOS on the test machine) updates the kernel, rebuild with ./build.sh.
  • No power management. The NPU runs at fixed clocks. Fine for inference, not ideal for battery life when idle.
  • Tested on one machine. GPD Pocket 4 with Ryzen AI 9 HX 370, CachyOS kernel 6.19.10. Should work on other Strix Point systems but untested.
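
For the first caveat, one way to tell whether a rebuild is due is to compare the module's vermagic string with the running kernel release. A minimal sketch, assuming the usual vermagic layout where the kernel release is the first space-separated field; the helper name is invented and the module path is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a module's vermagic string starts with the
# given kernel release. vermagic typically looks like
# "<release> SMP preempt mod_unload".
# $1 = vermagic string, $2 = kernel release (e.g. from `uname -r`)
module_matches_kernel() {
    built_for=${1%% *}    # keep only the first space-separated field
    [ "$built_for" = "$2" ]
}

# Usage (module path is a placeholder):
#   module_matches_kernel "$(modinfo -F vermagic /path/to/patched/amdxdna.ko)" "$(uname -r)" \
#       || echo "module was built for another kernel, rebuild with ./build.sh"
```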

Related projects

  • R.A.G-Race-Router — Three-processor inference runtime (CPU + GPU + NPU) that uses this fix
  • pytorch-gfx1150 — PyTorch built from source for Radeon 890M (RDNA 3.5)
  • unified-ml — Vulkan + HIP kernel benchmarks for AMD APU unified memory
  • xdna-driver — AMD's official XDNA driver repo

Author

Peter Clemente (@Peterc3-dev)

License

MIT
