Summary
I am an AMD Ryzen AI MAX+ 395 AI PC customer trying to build a local Document AI pipeline on Ryzen AI. The layout detection stage uses PaddleOCR's PP-DocLayoutV3 model. I followed the intended Windows/Ryzen AI route:
FP32 ONNX -> VitisAIExecutionProvider -> BF16 compile -> NPU + CPU hybrid partition -> cached compiled model
However, PP-DocLayoutV3 currently falls back entirely to CPU on my machine. The same environment successfully compiles and runs a ResNet50 FP32 ONNX control model on the NPU via VitisAI EP BF16 compile, so this does not appear to be a local environment setup problem.
This is a blocking issue for Document AI application development on Ryzen AI MAX+ 395. Layout detection is a core stage for OCR/VLM document pipelines, and without NPU execution the end-to-end latency and value proposition of the AI PC platform are substantially reduced.
Hardware / OS environment
- Machine: ASUS ProArt PX13 HN7306EA
- CPU/APU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
- CPU cores / threads: 16 cores / 32 logical processors
- Integrated GPU: AMD Radeon(TM) 8060S Graphics
- GPU driver: 32.0.22032.6002
- NPU device: NPU Compute Accelerator Device
- NPU PCI ID: PCI\VEN_1022&DEV_17F0&SUBSYS_20CF1043&REV_...
- OS: Windows 11 Home Chinese, 64-bit
- OS version/build: 10.0.26200, build 26200
Software environment
- Python: 3.12.13
- ONNX Runtime: 1.24.5
- ONNX: 1.17.0
- WinML Python package: windowsml from local venv
- AMD WinML GPU EP package: MicrosoftCorporationII.WinML.AMD.GPU.EP.1.8_1.8.55.0_x64__8wekyb3d8bbwe
  - EP library: migraphx-ep.dll
- AMD WinML NPU EP package: MicrosoftCorporationII.WinML.AMD.NPU.EP.1.8_1.8.59.0_x64__8wekyb3d8bbwe
  - EP library: onnxruntime_vitisai_ep.dll
- Vitis AI / AI Engine compiler observed during the ResNet50 control run:
  AI Engine Compiler Version 2026.1 (windows64-bit)
  SW Build 4ccfd0b on 2026-03-16-20:43:50
ort.get_ep_devices() discovers:
CPUExecutionProvider
DmlExecutionProvider
MIGraphXExecutionProvider
VitisAIExecutionProvider
Model under test: PP-DocLayoutV3 ONNX
The model is the official PaddleOCR/PaddleX PP-DocLayoutV3 layout detection model, exported to ONNX outside Windows with a compatible Paddle2ONNX toolchain, then copied back to the Ryzen AI Windows machine.
ONNX model properties:
file: inference.onnx
size: 131,731,131 bytes
IR version: 8
opset: ai.onnx 17
node count: 6440
initializer count: 0
Inputs:
im_shape tensor(float) [DynamicDimension.0, 2]
image tensor(float) [DynamicDimension.1, 3, 800, 800]
scale_factor tensor(float) [DynamicDimension.2, 2]
Outputs:
fetch_name_0 tensor(float) [DynamicDimension.3, 7]
fetch_name_1 tensor(int64) [DynamicDimension.4]
fetch_name_2 tensor(int64) [DynamicDimension.5, 200, 200]
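Because the graph reports 6440 nodes and no initializers, a per-op-type histogram may help triage which operators dominate the model. The following is a minimal sketch, not part of my original runs: the helper names `op_histogram` and `load_op_types` are hypothetical, and it assumes the `onnx` Python package is installed on the machine that loads the file.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def op_histogram(op_types: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (op_type, count) pairs, most frequent first."""
    return Counter(op_types).most_common()


def load_op_types(model_path: str) -> List[str]:
    """Read the op_type of every node in the top-level ONNX graph."""
    import onnx  # imported lazily so op_histogram works without onnx installed

    model = onnx.load(model_path)
    return [node.op_type for node in model.graph.node]


# Example usage (requires the exported model file):
# for op_type, count in op_histogram(load_op_types("inference.onnx")):
#     print(f"{op_type:24s} {count}")
```

Only top-level graph nodes are counted here; nodes inside control-flow subgraphs (If/Loop bodies), if any, would need a recursive walk.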
VitisAI EP BF16 config used
I used an explicit VAIML/BF16 compile route. The most important variant used this config:
{
  "passes": [
    {
      "name": "init",
      "plugin": "vaip-pass_init"
    },
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {
        "optimize_level": 2,
        "preferred_data_storage": "auto",
        "enable_f32_to_bf16_conversion": true
      }
    }
  ],
  "target": "VAIML",
  "targets": [
    {
      "name": "VAIML",
      "pass": ["init", "vaiml_partition"]
    }
  ]
}
Provider options:
provider_options = {
    "config_file": "<path-to-vai_ep_config.json>",
    "cache_dir": "<cache-dir>",
    "cache_key": "bf16_auto_o2_legacy_force",
    "enable_cache_file_io_in_mem": "0",
    "ai_analyzer_visualization": "true",
    "ai_analyzer_profiling": "true",
}
I create the ONNX Runtime session through WinML EP devices, i.e. SessionOptions.add_provider_for_devices() with VitisAIExecutionProvider first and CPUExecutionProvider as fallback.
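To make the config and provider options easier to reproduce across variants, they can be generated from two parameters. This is a sketch under stated assumptions: the helper names are hypothetical, and the idea that a variant is fully described by `optimize_level` and `preferred_data_storage` is my simplification, not something the VitisAI EP documents. The JSON structure and option keys are copied from the config shown above.

```python
import json
from pathlib import Path


def make_vaiml_config(optimize_level: int, preferred_data_storage: str) -> dict:
    """Build the VAIML/BF16 config dict with the same structure as above."""
    return {
        "passes": [
            {"name": "init", "plugin": "vaip-pass_init"},
            {
                "name": "vaiml_partition",
                "plugin": "vaip-pass_vaiml_partition",
                "vaiml_config": {
                    "optimize_level": optimize_level,
                    "preferred_data_storage": preferred_data_storage,
                    "enable_f32_to_bf16_conversion": True,
                },
            },
        ],
        "target": "VAIML",
        "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
    }


def make_provider_options(config_file: str, cache_dir: str, cache_key: str) -> dict:
    """Provider options with the same keys and string values as above."""
    return {
        "config_file": config_file,
        "cache_dir": cache_dir,
        "cache_key": cache_key,
        "enable_cache_file_io_in_mem": "0",
        "ai_analyzer_visualization": "true",
        "ai_analyzer_profiling": "true",
    }


def write_config(path: str, config: dict) -> str:
    """Write the config as JSON and return the path as a string."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return str(target)
```

Using a distinct `cache_key` per variant keeps one variant's compiled artifacts from being picked up by another.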
Reproduction steps
- Install the Windows AMD WinML EP packages so that VitisAIExecutionProvider is visible through ort.get_ep_devices().
- Prepare/export PP-DocLayoutV3 to FP32 ONNX. In my case the valid ONNX file is inference.onnx, 131,731,131 bytes.
- Register Windows ML EP libraries from Python:
import onnxruntime as ort
from windowsml import EpCatalog

with EpCatalog() as catalog:
    for provider in catalog.find_all_providers():
        provider.ensure_ready()
        if provider.name in ("VitisAIExecutionProvider", "MIGraphXExecutionProvider"):
            ort.register_execution_provider_library(provider.name, str(provider.library_path))
- Build VitisAI EP session options:
ep_devices = list(ort.get_ep_devices())
vitis = [d for d in ep_devices if d.ep_name == "VitisAIExecutionProvider"]
cpu = [d for d in ep_devices if d.ep_name == "CPUExecutionProvider"]
so = ort.SessionOptions()
so.enable_profiling = True
so.profile_file_prefix = "ort_profile.json"
so.add_provider_for_devices(vitis, provider_options)
so.add_provider_for_devices(cpu, {})
sess = ort.InferenceSession("inference.onnx", sess_options=so)
- Run a dummy inference with valid PP-DocLayoutV3 input shapes:
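import numpy as np

# Dummy inputs matching the model's declared input shapes (batch size 1).
feeds = {
    "image": np.random.randn(1, 3, 800, 800).astype(np.float32),
    "im_shape": np.array([[800, 800]], dtype=np.float32),
    "scale_factor": np.array([[1.0, 1.0]], dtype=np.float32),
}
outputs = sess.run(None, feeds)
profile_file = sess.end_profiling()  # returns the path of the written profile JSON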
- Inspect the ORT profile and the VitisAI cache artifacts.
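The profile inspection in the last step can be sketched as follows. This is a minimal illustration rather than the exact script I ran: the function names are hypothetical, and it assumes the ORT profile is a JSON array of trace events in which kernel events have `"cat": "Node"` and carry the provider name in `args["provider"]`. Counts produced this way may differ from the figures in this report if events are filtered differently.

```python
import json
from collections import defaultdict


def summarize_events(events):
    """Aggregate node-level profile events per execution provider."""
    summary = defaultdict(lambda: {"count": 0, "dur_us": 0})
    for event in events:
        args = event.get("args") or {}
        provider = args.get("provider")
        # Assumption: only node events that name a provider are kernel runs.
        if event.get("cat") != "Node" or not provider:
            continue
        summary[provider]["count"] += 1
        summary[provider]["dur_us"] += int(event.get("dur", 0))
    return dict(summary)


def summarize_profile(path):
    """Load an ORT profile JSON file and summarize it per provider."""
    with open(path, "r", encoding="utf-8") as f:
        return summarize_events(json.load(f))


# Example usage with the file returned by sess.end_profiling():
# print(summarize_profile(profile_file))
```

If VitisAIExecutionProvider does not appear as a key in the result, no node-level events were attributed to it in that run.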
Actual PP-DocLayoutV3 result
The session is created with both providers:
session_providers = ["VitisAIExecutionProvider", "CPUExecutionProvider"]
But the runtime profile shows all nodes on CPU:
profile_provider_events = {
"CPUExecutionProvider": {
"count": 1346,
"dur_us": ~340000
}
}
There are no VitisAIExecutionProvider kernel events for PP-DocLayoutV3.
The VAIML partition result indicates that no usable NPU overlay/subgraph was formed:
unsupported_ops_count = 1262
ops_per_overlay = {}
aie_required_clusters = null
I tried several BF16/VAIML variants and got the same result:
| Variant | Session providers | Runtime provider events | Unsupported ops | ops_per_overlay |
| --- | --- | --- | --- | --- |
| bf16_auto_o1 | VitisAI + CPU | CPU only, 1346 events | 1262 | {} |
| bf16_vectorized_o1 | VitisAI + CPU | CPU only, 1346 events | 1262 | {} |
| bf16_unvectorized_o1 | VitisAI + CPU | CPU only, 1346 events | 1262 | {} |
| bf16_auto_o2_legacy_force | VitisAI + CPU | CPU only, 1346 events | 1262 | {} |
Example unsupported op names include many Add.* nodes, e.g. Add.0, Add.10, Add.102, Add.104, Add.106, etc.
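To see whether the unsupported list is dominated by a few operator types, the names can be grouped by prefix. This is a small sketch with a hypothetical helper; it assumes the text before the first "." in each name is the ONNX op type, which matches the Add.* examples above but may not hold for every node. How the names are read out of the analyzer report is left out, since that depends on the report layout.

```python
from collections import Counter


def group_by_op_prefix(node_names):
    """Count node names by the prefix before the first '.' (assumed op type)."""
    return Counter(name.split(".", 1)[0] for name in node_names)


# The example names quoted in this report:
names = ["Add.0", "Add.10", "Add.102", "Add.104", "Add.106"]
for op_type, count in group_by_op_prefix(names).most_common():
    print(op_type, count)
```

A histogram over all 1262 entries would show whether the partitioner rejects a specific operator family or essentially the whole graph.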
Control experiment: ResNet50 proves the local NPU/BF16 path works
To rule out a local setup issue, I generated a standard ResNet50-style FP32 ONNX control model and ran it through the same Windows/ORT/VitisAI EP path, with the same BF16 compile approach.
ResNet50 control model:
FP32 ONNX
input: tensor(float) [1, 3, 224, 224]
model size: 102,030,656 bytes
nodes: 122
initializers: 55
Result:
AI Engine Compiler: Compilation Complete
session_providers = ["VitisAIExecutionProvider", "CPUExecutionProvider"]
profile_provider_counts = {"VitisAIExecutionProvider": 3}
profile_provider_dur_us = {"VitisAIExecutionProvider": 19098}
run_times_s = [0.0082, 0.0056, 0.0059]
The compiled model cache was also created successfully:
.rai
subgraphs_cache.json
partition_io_shapes.json
vaiml_par_0/0/ctrl.elf
vaiml_par_0/0/unified-4x4.xclbin
vaiml_par_0/0/wts32.bin
Cache hit test:
first session create: 503.555 s
cached session create: 0.413 s
cached profile_provider_counts = {"VitisAIExecutionProvider": 2}
This confirms that the local VitisAI EP, BF16 compilation, NPU runtime, and cache path are operational on this Ryzen AI MAX+ 395 system.
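The cold versus cached session-creation times above can be measured with a small timing helper. This is a sketch with hypothetical helper names; the commented usage assumes the `so` session options from the reproduction steps.

```python
import time


def time_call(fn):
    """Call fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def speedup(cold_s, warm_s):
    """Ratio of cold (compile) time to warm (cache hit) time."""
    return cold_s / warm_s


# Example usage, assuming `ort` and `so` are set up as in the reproduction steps:
# sess, cold_s = time_call(lambda: ort.InferenceSession("resnet50.onnx", sess_options=so))
# sess, warm_s = time_call(lambda: ort.InferenceSession("resnet50.onnx", sess_options=so))

print(f"cache speedup: {speedup(503.555, 0.413):.0f}x")
```

With the numbers reported above, the cached session is created roughly 1,200 times faster than the first compile.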
Expected behavior
For PP-DocLayoutV3 FP32 ONNX, I expected VitisAI EP to do at least one of the following:
- Compile the supported parts of the graph into one or more NPU subgraphs, with unsupported ops falling back to CPU;
- Produce actionable diagnostics explaining why no NPU partition is possible;
- Provide a documented export/conversion workaround for this model class so Document AI layout detection can run on Ryzen AI NPUs.
Actual behavior
PP-DocLayoutV3 creates a VitisAI EP session but runs entirely on CPU. The analyzer reports 1262 unsupported ops and an empty ops_per_overlay, so no NPU subgraph is formed.
Why this matters
I bought and am actively developing on an AMD Ryzen AI MAX+ 395 AI PC specifically to run local AI workloads. Document AI is one of the most important practical use cases for an AI PC: OCR, layout detection, table/figure/text block detection, and VLM-based document understanding.
PP-DocLayoutV3 is the layout model used by PaddleOCR/PaddleOCR-VL style pipelines. If this model cannot use the NPU, Document AI applications on Ryzen AI lose a major performance and efficiency advantage. The ResNet50 control result shows the platform can execute BF16-compiled ONNX workloads on the NPU, but this real Document AI model currently cannot benefit from it.
Could AMD please investigate whether this is due to unsupported ONNX ops, graph patterns emitted by Paddle2ONNX, dynamic shapes, missing BF16 lowering coverage, or a VitisAI EP partitioning limitation? A model-specific recommendation or compiler/runtime fix would be very valuable to customers building Document AI applications on Ryzen AI MAX+ 395 systems.
I am happy to provide the full profile JSON files, cache directories, and the exact ONNX model metadata if that helps triage.