PP-DocLayoutV3 FP32 ONNX falls back entirely to CPU with VitisAI EP BF16 compile on Ryzen AI MAX+ 395

### Summary

I am an AMD Ryzen AI MAX+ 395 AI PC customer trying to build a local Document AI pipeline on Ryzen AI. The layout detection stage uses PaddleOCR's [PP-DocLayoutV3](https://huggingface.co/alex-dinh/PP-DocLayoutV3-ONNX) model. I followed the intended Windows/Ryzen AI route:

`FP32 ONNX -> VitisAIExecutionProvider -> BF16 compile -> NPU + CPU hybrid partition -> cached compiled model`

However, `PP-DocLayoutV3` currently falls back entirely to CPU on my machine. The same environment successfully compiles and runs a ResNet50 FP32 ONNX control model on the NPU via VitisAI EP BF16 compile, so this does not look like a general local environment setup problem.

This is a blocking issue for Document AI application development on Ryzen AI MAX+ 395. Layout detection is a core stage for OCR/VLM document pipelines, and without NPU execution the end-to-end latency and value proposition of the AI PC platform are substantially reduced.

### Hardware / OS environment

- Machine: ASUS ProArt PX13 HN7306EA
- CPU/APU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
- CPU cores / threads: 16 cores / 32 logical processors
- Integrated GPU: AMD Radeon(TM) 8060S Graphics
- GPU driver: `32.0.22032.6002`
- NPU device: `NPU Compute Accelerator Device`
- NPU PCI ID: `PCI\VEN_1022&DEV_17F0&SUBSYS_20CF1043&REV_...`
- OS: Windows 11 Home Chinese, 64-bit
- OS version/build: `10.0.26200`, build `26200`

### Software environment

- Python: `3.12.13`
- ONNX Runtime: `1.24.5`
- ONNX: `1.17.0`
- WinML Python package: `windowsml` from local venv
- AMD WinML GPU EP package:
  - `MicrosoftCorporationII.WinML.AMD.GPU.EP.1.8_1.8.55.0_x64__8wekyb3d8bbwe`
  - EP library: `migraphx-ep.dll`
- AMD WinML NPU EP package:
  - `MicrosoftCorporationII.WinML.AMD.NPU.EP.1.8_1.8.59.0_x64__8wekyb3d8bbwe`
  - EP library: `onnxruntime_vitisai_ep.dll`
- Vitis AI / AI Engine compiler observed during the ResNet50 control run:
  - `AI Engine Compiler Version 2026.1 (windows64-bit)`
  - `SW Build 4ccfd0b on 2026-03-16-20:43:50`

`ort.get_ep_devices()` discovers:

```text
CPUExecutionProvider
DmlExecutionProvider
MIGraphXExecutionProvider
VitisAIExecutionProvider
```

### Model under test: PP-DocLayoutV3 ONNX

The model is the official PaddleOCR/PaddleX `PP-DocLayoutV3` layout detection model, exported to ONNX outside Windows with a compatible Paddle2ONNX toolchain, then copied back to the Ryzen AI Windows machine.

ONNX model properties:

```text
file: inference.onnx
size: 131,731,131 bytes
IR version: 8
opset: ai.onnx 17
node count: 6440
initializer count: 0
```

Inputs:

```text
im_shape      tensor(float) [DynamicDimension.0, 2]
image         tensor(float) [DynamicDimension.1, 3, 800, 800]
scale_factor  tensor(float) [DynamicDimension.2, 2]
```

Outputs:

```text
fetch_name_0  tensor(float) [DynamicDimension.3, 7]
fetch_name_1  tensor(int64) [DynamicDimension.4]
fetch_name_2  tensor(int64) [DynamicDimension.5, 200, 200]
```

### VitisAI EP BF16 config used

I used an explicit VAIML/BF16 compile route. The most important variant used this config:

```json
{
  "passes": [
    {
      "name": "init",
      "plugin": "vaip-pass_init"
    },
    {
      "name": "vaiml_partition",
      "plugin": "vaip-pass_vaiml_partition",
      "vaiml_config": {
        "optimize_level": 2,
        "preferred_data_storage": "auto",
        "enable_f32_to_bf16_conversion": true
      }
    }
  ],
  "target": "VAIML",
  "targets": [
    {
      "name": "VAIML",
      "pass": ["init", "vaiml_partition"]
    }
  ]
}
```
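For anyone reproducing this, here is a minimal sketch of how a config with this structure could be generated from Python so that only `optimize_level` and `preferred_data_storage` change between runs. The helper names `build_vaiml_config` and `write_vaiml_config` are hypothetical, and the issue does not record which exact values each variant used, so the arguments are assumptions to be filled in by the person running it.

```python
import json
from pathlib import Path


def build_vaiml_config(optimize_level=2, preferred_data_storage="auto"):
    """Return a dict with the same structure as the JSON config above."""
    return {
        "passes": [
            {"name": "init", "plugin": "vaip-pass_init"},
            {
                "name": "vaiml_partition",
                "plugin": "vaip-pass_vaiml_partition",
                "vaiml_config": {
                    # Assumption: only these two values differ between runs.
                    "optimize_level": optimize_level,
                    "preferred_data_storage": preferred_data_storage,
                    "enable_f32_to_bf16_conversion": True,
                },
            },
        ],
        "target": "VAIML",
        "targets": [{"name": "VAIML", "pass": ["init", "vaiml_partition"]}],
    }


def write_vaiml_config(path, **kwargs):
    """Serialize the config to `path` and return the path as a string."""
    path = Path(path)
    text = json.dumps(build_vaiml_config(**kwargs), indent=2)
    path.write_text(text, encoding="utf-8")
    return str(path)
```

The returned path can then be passed as the `config_file` provider option.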

Provider options:

```python
provider_options = {
    "config_file": "<path-to-vai_ep_config.json>",
    "cache_dir": "<cache-dir>",
    "cache_key": "bf16_auto_o2_legacy_force",
    "enable_cache_file_io_in_mem": "0",
    "ai_analyzer_visualization": "true",
    "ai_analyzer_profiling": "true",
}
```

I create the ONNX Runtime session through WinML EP devices, i.e. `SessionOptions.add_provider_for_devices()` with `VitisAIExecutionProvider` first and `CPUExecutionProvider` as fallback.

### Reproduction steps

1. Install the Windows AMD WinML EP packages so that `VitisAIExecutionProvider` is visible through `ort.get_ep_devices()`.
2. Prepare/export `PP-DocLayoutV3` to FP32 ONNX. In my case the valid ONNX file is `inference.onnx`, `131,731,131` bytes.
3. Register Windows ML EP libraries from Python:

```python
import onnxruntime as ort
from windowsml import EpCatalog

with EpCatalog() as catalog:
    for provider in catalog.find_all_providers():
        provider.ensure_ready()
        if provider.name in ("VitisAIExecutionProvider", "MIGraphXExecutionProvider"):
            ort.register_execution_provider_library(provider.name, str(provider.library_path))
```

4. Build VitisAI EP session options:

```python
ep_devices = list(ort.get_ep_devices())
vitis = [d for d in ep_devices if d.ep_name == "VitisAIExecutionProvider"]
cpu = [d for d in ep_devices if d.ep_name == "CPUExecutionProvider"]

so = ort.SessionOptions()
so.enable_profiling = True
so.profile_file_prefix = "ort_profile.json"
so.add_provider_for_devices(vitis, provider_options)
so.add_provider_for_devices(cpu, {})

sess = ort.InferenceSession("inference.onnx", sess_options=so)
```

5. Run a dummy inference with valid PP-DocLayoutV3 input shapes:

```python
import numpy as np

feeds = {
    "image": np.random.randn(1, 3, 800, 800).astype(np.float32),
    "im_shape": np.array([[800, 800]], dtype=np.float32),
    "scale_factor": np.array([[1.0, 1.0]], dtype=np.float32),
}
outputs = sess.run(None, feeds)
profile_file = sess.end_profiling()
```

6. Inspect the ORT profile and the VitisAI cache artifacts.
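For step 6, a minimal sketch of how per-provider events could be tallied from the ORT profile. It assumes the profile is a Chrome-trace-style JSON array in which per-node kernel events carry `cat == "Node"`, a `dur` in microseconds, and an `args.provider` field. `summarize_provider_events` and `summarize_profile_file` are hypothetical helper names, not part of ONNX Runtime.

```python
import json
from collections import defaultdict


def summarize_provider_events(events):
    """Tally node events and total duration (us) per execution provider."""
    summary = defaultdict(lambda: {"count": 0, "dur_us": 0})
    for event in events:
        # Assumption: kernel events are tagged cat == "Node" and name their
        # provider under args.provider; everything else is skipped.
        if event.get("cat") != "Node":
            continue
        provider = event.get("args", {}).get("provider")
        if not provider:
            continue
        summary[provider]["count"] += 1
        summary[provider]["dur_us"] += int(event.get("dur", 0))
    return dict(summary)


def summarize_profile_file(path):
    """Load an ORT profile JSON file and summarize it per provider."""
    with open(path, "r", encoding="utf-8") as handle:
        return summarize_provider_events(json.load(handle))
```

Calling `summarize_profile_file(profile_file)` with the path returned by `sess.end_profiling()` gives a dict keyed by provider name.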

### Actual PP-DocLayoutV3 result

The session is created with both providers:

```text
session_providers = ["VitisAIExecutionProvider", "CPUExecutionProvider"]
```

But the runtime profile shows all nodes on CPU:

```text
profile_provider_events = {
  "CPUExecutionProvider": {
    "count": 1346,
    "dur_us": ~340000
  }
}
```

There are no `VitisAIExecutionProvider` kernel events for PP-DocLayoutV3.

The VAIML partition result indicates that no usable NPU overlay/subgraph was formed:

```text
unsupported_ops_count = 1262
ops_per_overlay = {}
aie_required_clusters = null
```

I tried several BF16/VAIML variants and got the same result:

| Variant | Session providers | Runtime provider events | Unsupported ops | ops_per_overlay |
| --- | --- | --- | --- | --- |
| `bf16_auto_o1` | VitisAI + CPU | CPU only, 1346 events | 1262 | `{}` |
| `bf16_vectorized_o1` | VitisAI + CPU | CPU only, 1346 events | 1262 | `{}` |
| `bf16_unvectorized_o1` | VitisAI + CPU | CPU only, 1346 events | 1262 | `{}` |
| `bf16_auto_o2_legacy_force` | VitisAI + CPU | CPU only, 1346 events | 1262 | `{}` |

Example unsupported op names include many `Add.*` nodes, e.g. `Add.0`, `Add.10`, `Add.102`, `Add.104`, `Add.106`, etc.
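To make the unsupported list easier to triage, the names could be grouped by op type. This is a sketch only: it assumes the names follow the `<OpType>.<index>` pattern seen above, and `count_unsupported_by_op_type` is a hypothetical helper that takes a plain list of names however they were extracted from the analyzer output.

```python
import re
from collections import Counter


def count_unsupported_by_op_type(node_names):
    """Count node names by op-type prefix, e.g. 'Add.102' -> 'Add'."""
    counts = Counter()
    for name in node_names:
        # Assumption: names look like '<OpType>.<index>'; anything that does
        # not match is counted under its full name.
        match = re.fullmatch(r"(.+)\.(\d+)", name)
        counts[match.group(1) if match else name] += 1
    return counts
```

`count_unsupported_by_op_type(names).most_common(10)` would then show which op types dominate the 1262 unsupported entries.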

### Control experiment: ResNet50 proves the local NPU/BF16 path works

To rule out a local setup issue, I generated a standard ResNet50-style FP32 ONNX control model and ran it through the same Windows/ORT/VitisAI EP path, with the same BF16 compile approach.

ResNet50 control model:

```text
FP32 ONNX
input: tensor(float) [1, 3, 224, 224]
model size: 102,030,656 bytes
nodes: 122
initializers: 55
```

Result:

```text
AI Engine Compiler: Compilation Complete
session_providers = ["VitisAIExecutionProvider", "CPUExecutionProvider"]
profile_provider_counts = {"VitisAIExecutionProvider": 3}
profile_provider_dur_us = {"VitisAIExecutionProvider": 19098}
run_times_s = [0.0082, 0.0056, 0.0059]
```

The compiled model cache was also created successfully:

```text
.rai
subgraphs_cache.json
partition_io_shapes.json
vaiml_par_0/0/ctrl.elf
vaiml_par_0/0/unified-4x4.xclbin
vaiml_par_0/0/wts32.bin
```
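A small sketch for checking whether a given cache directory contains the same artifacts. The relative paths are copied from the listing above. Treating `.rai` as a literal file name, and passing the directory that directly contains these entries, are both assumptions; `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path

# Relative paths copied from the ResNet50 cache listing above.
EXPECTED_ARTIFACTS = [
    ".rai",
    "subgraphs_cache.json",
    "partition_io_shapes.json",
    "vaiml_par_0/0/ctrl.elf",
    "vaiml_par_0/0/unified-4x4.xclbin",
    "vaiml_par_0/0/wts32.bin",
]


def missing_artifacts(cache_root, expected=EXPECTED_ARTIFACTS):
    """Return the expected relative paths that do not exist under cache_root."""
    root = Path(cache_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list means every listed artifact is present. Running the same check on the PP-DocLayoutV3 cache directory shows which compile outputs were never produced.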

Cache hit test:

```text
first session create: 503.555 s
cached session create: 0.413 s
cached profile_provider_counts = {"VitisAIExecutionProvider": 2}
```
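The two session-create timings can be reproduced with a generic wall-clock wrapper such as the sketch below. `timed` is a hypothetical helper; it times whatever callable it is given and makes no assumptions about ONNX Runtime itself.

```python
import time


def timed(factory):
    """Call factory() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = factory()
    return result, time.perf_counter() - start
```

For example, `sess, seconds = timed(lambda: ort.InferenceSession(model_path, sess_options=so))`, run once with an empty cache and once more with the same `cache_dir` and `cache_key`, where `model_path` is a placeholder for the ONNX file being tested.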

This confirms that the local VitisAI EP, BF16 compilation, NPU runtime, and cache path are operational on this Ryzen AI MAX+ 395 system.

### Expected behavior

For `PP-DocLayoutV3` FP32 ONNX, I expected VitisAI EP to do at least one of the following:

1. Compile the supported parts of the graph into one or more NPU subgraphs, with unsupported ops falling back to CPU;
2. Produce actionable diagnostics explaining why no NPU partition is possible;
3. Provide a documented export/conversion workaround for this model class so Document AI layout detection can run on Ryzen AI NPUs.

### Actual behavior

`PP-DocLayoutV3` creates a VitisAI EP session but runs entirely on CPU. The analyzer reports 1262 unsupported ops and an empty `ops_per_overlay`, so no NPU subgraph is formed.

### Why this matters

I bought and am actively developing on an AMD Ryzen AI MAX+ 395 AI PC specifically to run local AI workloads. Document AI is one of the most important practical use cases for an AI PC: OCR, layout detection, table/figure/text block detection, and VLM-based document understanding.

`PP-DocLayoutV3` is the layout model used by PaddleOCR/PaddleOCR-VL style pipelines. If this model cannot use the NPU, Document AI applications on Ryzen AI lose a major performance and efficiency advantage. The ResNet50 control result shows the platform can execute BF16-compiled ONNX workloads on the NPU, but this real Document AI model currently cannot benefit from it.

Could AMD please investigate whether this is due to unsupported ONNX ops, graph patterns emitted by Paddle2ONNX, dynamic shapes, missing BF16 lowering coverage, or a VitisAI EP partitioning limitation? A model-specific recommendation or compiler/runtime fix would be very valuable to customers building Document AI applications on Ryzen AI MAX+ 395 systems.

I am happy to provide the full profile JSON files, cache directories, and the exact ONNX model metadata if that helps triage.

