
Sync with Microsoft ONNX Runtime - 27042026#1066

Open
ai-fw-intg wants to merge 11 commits into ovep-develop from sync_msft_27042026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

titaiwangms and others added 11 commits April 28, 2026 09:30
…nd (microsoft#28083)

## Summary

Fixes a critical security vulnerability in the ONNX Runtime Python
backend where user-controlled `kwargs` were applied to `SessionOptions`
and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file
overwrites.

## Vulnerability

The `prepare()` method in `onnxruntime/python/backend/backend.py`
iterated over user-controlled `kwargs` and used `setattr()` to apply
them directly to a `SessionOptions` instance. The `hasattr()` check was
not a security guard — it returned `True` for all exposed properties
including dangerous ones like `optimized_model_filepath`.

**Attack vector:**
```python
onnxruntime.backend.prepare(
    model_path,
    optimized_model_filepath="/etc/passwd",  # overwrites any file with protobuf binary
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
)
```

The same pattern existed in `backend_rep.py` for `RunOptions`.

## Fix

Replaced the unrestricted `hasattr/setattr` loop in both files with
strict allowlists:

- **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py`
- **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py`

**Both `SessionOptions` and `RunOptions` use identical validation
logic** with three outcomes for each kwarg key:

- **Allowlisted** — Applied via `setattr()` (e.g.
`graph_optimization_level`, `log_severity_level`)
- **Known-but-blocked** (real attribute on the object, but not on
allowlist) — Raises `RuntimeError` (e.g. `optimized_model_filepath`,
`terminate`)
- **Completely unknown** (not a property on the object at all) —
Silently ignored for forward compatibility (e.g.
`nonexistent_option_xyz`)

**Blocked dangerous attributes:**
- `optimized_model_filepath` — triggers `Model::Save()`, overwrites
arbitrary files with protobuf binary
- `profile_file_prefix` — writes profiling JSON to arbitrary path
- `enable_profiling` — causes uncontrolled file writes to cwd
- `terminate` (RunOptions) — denies the current inference call
- `training_mode` (RunOptions) — silently switches inference behavior in
training builds

## Tests

Added `TestBackendKwargsAllowlist` with 13 new test methods covering all
exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted
attrs (accepted), unknown attrs (silently ignored), and end-to-end
`run_model()` paths for both session and run options. All 15 tests pass
(13 new + 2 pre-existing in `TestBackend`), no regressions.

## Files Changed

- `onnxruntime/python/backend/backend.py`
- `onnxruntime/python/backend/backend_rep.py`
- `onnxruntime/test/python/onnxruntime_test_python_backend.py`
- `.agents/skills/python-kwargs-setattr-security/SKILL.md`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ttention (microsoft#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA
PagedAttention op, enabling the operator on **sm<80 (Turing / Volta /
Pascal) with fp16** for the first time. On sm>=80 the default
FlashAttention path is unchanged; MEA is reachable via
`ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider
option for debugging and perf comparison.


| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80
support" — the op errors out immediately whenever FlashAttention isn't
available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review,
@tianleiwu flagged that the interface was too FlashAttention-specific
(*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado
agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the exact pattern
established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for
sm<80 GQA"). The same CUTLASS memory-efficient attention backend that
covers GQA's sm<80 path now covers PagedAttention.

Related work:
- microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback)
- microsoft#24595 — original PagedAttention PR
- microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering
- microsoft#27880 — ONNX Attention CUDA fallback coverage gaps
- microsoft#27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred;
fall back to MemoryEfficientAttention via
`has_memory_efficient_attention(sm, is_half, is_bf16, head_size,
head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own
helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 ==
0`. This keeps us forward-compatible with any future expansion of MEA's
supported range.
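The gating rules the helper enforces can be restated as a small predicate; this Python restatement is for illustration only (the real helper lives in the CUDA EP sources):

```python
def mea_supported(sm: int, is_half: bool, is_bf16: bool, head_size: int) -> bool:
    """Restates the bounds the text attributes to MEA's own helper."""
    if is_bf16 and sm < 80:   # bf16 requires sm>=80
        return False
    if is_half and sm < 53:   # fp16 requires sm>=53
        return False
    # head_size must be <= 1024 and a multiple of 8
    return head_size <= 1024 and head_size % 8 == 0
```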

**MEA path** (`UnfusedAttention<T>`):
1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV`
(hoisted to `paged_attention.cc` so both FA and MEA paths consume a
pre-populated buffer — single-producer refactor), rotary, packed-QKV
unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to
gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads,
head_size]` buffer, folding in GQA head expansion (so downstream MEA
sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via
`seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr =
cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No
padding required; layout matches the kernel's expected `[total_tokens,
num_heads, head_size]` with BSNH strides.
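The gather-and-expand step can be sketched in NumPy; the shapes, argument names, and cache layout below are assumptions for illustration, not the CUDA kernel's actual interface:

```python
import numpy as np

def gather_and_expand(kv_cache, block_table, seqlens_kv, block_size, group_size):
    """kv_cache: [num_blocks, block_size, num_kv_heads, head_size].
    Returns a packed-varlen [total_kv_tokens, num_heads, head_size] buffer,
    folding in GQA head expansion (num_heads = num_kv_heads * group_size)."""
    out = []
    for b, seqlen in enumerate(seqlens_kv):
        for tok in range(seqlen):
            block = block_table[b][tok // block_size]  # walk the block table
            entry = kv_cache[block, tok % block_size]  # [num_kv_heads, head_size]
            out.append(np.repeat(entry, group_size, axis=0))
    return np.stack(out)
```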

**Scratch allocation**: the MEA path D->H syncs
`cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain
`total_kv_tokens` on the host for tight `gathered_key` /
`gathered_value` / `fmha_buffer` allocation. This adds one
`cudaStreamSynchronize` per forward call, acceptable for a
compatibility fallback (FA remains the hot path on supported hardware).
Over-allocation (the no-sync alternative) would consume `B ×
max_num_blocks_per_seq × block_size × num_heads × head_size × 2 ×
sizeof(T)`, which reaches GB-scale for realistic GQA models and was
rejected.
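A quick back-of-envelope check of the rejected over-allocation formula, with illustrative GQA-scale numbers (the concrete values below are assumptions, not taken from the PR):

```python
B, max_num_blocks_per_seq, block_size = 32, 512, 16  # assumed workload
num_heads, head_size = 32, 128                        # assumed model shape
sizeof_T = 2                                          # fp16

# B x max_num_blocks_per_seq x block_size x num_heads x head_size x 2 x sizeof(T)
total_bytes = (B * max_num_blocks_per_seq * block_size
               * num_heads * head_size * 2 * sizeof_T)
print(f"{total_bytes / 2**30:.0f} GiB")  # GB-scale, as the text notes
```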

`fmha_buffer` is sized with `sizeof(float)` (matching the GQA
EfficientAttention pattern at `group_query_attention.cc:482`) because
MEA's output accumulator is fp32 regardless of input dtype.

### Testing

New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs
the existing parity matrix (rotary on/off, rotary_interleaved on/off,
packed-QKV on/off, local window on/off, softcap 0/50, varied head
sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider
option set to `EFFICIENT_ATTENTION` (=2, from `AttentionBackend` enum).
Using a per-session provider option instead of an env var means both FA
and MEA test classes coexist in the same pytest process — each
InferenceSession creates its own CUDA EP with its own
`attention_kernel_options_`.

The existing `TestPagedAttention` class is skipped wholesale on sm<80 by
its `has_flash_attention()` gate, so without the new MEA class the
fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:       24/24 passed (~60s)   # FA baseline — no regression
TestPagedAttentionMEA:    24/24 passed (~59s)   # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by
the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal
hardware, so real-SM coverage relies on MS CI. The code path exercised
on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on
sm<80; only the underlying CUTLASS kernel
(`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and
those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20
define was needed because the initial configure resolved
`CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval`
usage fails to compile. Unrelated to this change.
…pkg set (microsoft#28254)

### Description

Remove `react-native` package from set of packages required for
RC/release publishing.
We will need to revisit this and decide whether to remove it entirely or
properly fix it.

### Motivation and Context

The React Native package is having build issues and we don't need it for
the next few immediate releases.
### Description

Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to
the CoreML Execution Provider's MLProgram path. The builder decomposes
QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the
op's own schema function-body in `contrib_defs.cc:605-631` and the
approach the QNN EP already uses in
`qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram
path is implemented; NeuralNetwork is deprecated on Apple Silicon.
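As a reference for the decomposition, QuickGelu in NumPy (a sketch of the math only, not CoreML MIL builder code):

```python
import numpy as np

def quick_gelu(x, alpha=1.0):
    scaled = alpha * x                     # first mul
    gate = 1.0 / (1.0 + np.exp(-scaled))   # sigmoid
    return x * gate                        # second mul
```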

Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single
`com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies
the entire graph is claimed by the CoreML EP via
`ExpectedEPNodeAssignment::All`. Verified with a negative test:
temporarily removing the `CreateQuickGeluOpBuilder` registration causes
the new test to fail with a `VerifyEPNodeAssignment` fatal failure,
proving it genuinely exercises the CoreML path.

Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28183.

QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass
(`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at
`ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the
default session optimization level. So any model that contains the `x *
sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the
DWPose pose estimator) gets silently mutated by ORT into a graph with
`QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported
primitives into 1 unsupported op, making the fusion strictly harmful for
CoreML.

On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and
`ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the
result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks are other ops — see "Related gaps" below. A ~4×
speedup at EXTENDED level from this patch alone.

Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where
partial shape inference inhibits most fusions), 3 `QuickGelu` nodes
still get produced — so this patch helps any CoreML user who hasn't
explicitly downgraded to `ORT_ENABLE_BASIC`.

### Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the
EXTENDED-optimized DWPose pose model are:

- **`com.microsoft:FusedConv`** (×4) — produced by
`ConvActivationFusion`. Fuses `Conv + activation` into one node. Same
failure mode as QuickGelu: `Conv` and the activations (`Relu`,
`Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but
the fused form isn't. Decomposition is straightforward — emit the
underlying `conv` MIL op, then the corresponding activation.
- **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) —
`MatMul * alpha` with an optional transpose. Decomposition: `matmul` +
scalar `mul`.
- **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to
fusion. CoreML MIL has a native `split` op; this one is a straight
op-builder omission.

Happy to send follow-up PRs for any of these after this one lands,
following the same pattern. Flagging here so they're on the EP coverage
roadmap.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tilities (microsoft#28227)

This pull request significantly improves the safety and robustness of
sparse tensor handling in ONNX Runtime. The main focus is on adding
thorough bounds checking and using safe integer arithmetic to prevent
overflows and invalid memory accesses when working with sparse tensor
indices. Additionally, the Python bindings for sparse tensors are
refactored to ensure correct object lifetimes and memory management when
exposing data to NumPy.

**Sparse Tensor Index Validation and Safety**

* Added comprehensive bounds checks for COO and CSR sparse tensor
indices in both the C API (`onnxruntime_c_api.cc`) and core conversion
utilities, ensuring indices are within valid ranges and, for CSR, that
outer indices are non-decreasing and within bounds.
* Replaced direct arithmetic with `SafeInt` for all index and size
calculations to prevent integer overflows, especially when converting
between types or computing dense tensor offsets.
* Improved error messages for invalid indices, making debugging easier
by providing more context about the specific error.
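The CSR-side rules above amount to a handful of checks; this Python sketch uses illustrative names (the real validation is the C++ in `onnxruntime_c_api.cc` and the conversion utilities):

```python
def validate_csr_indices(outer, inner, rows, cols, nnz):
    """Outer indices must span [0, nnz], be non-decreasing, and have
    rows + 1 entries; inner indices must be valid column indices."""
    if len(outer) != rows + 1 or outer[0] != 0 or outer[-1] != nnz:
        raise ValueError("invalid CSR outer index structure")
    for i in range(rows):
        if outer[i] > outer[i + 1]:               # must be non-decreasing
            raise ValueError(f"outer index decreases at row {i}")
    for j in inner:
        if not 0 <= j < cols:                     # column bounds check
            raise ValueError(f"inner index {j} outside [0, {cols})")
```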

**Python Bindings Improvements**

* Refactored the pybind11 bindings for sparse tensor views so that NumPy
arrays referencing sparse tensor memory correctly keep the parent Python
object alive, preventing potential memory issues when the sparse tensor
is on the GPU or managed by Python.

**General Code Quality**

* Added missing header include for `safeint.h` to ensure `SafeInt` is
available where needed.
* Minor cleanups and improved assertions to clarify intent and ensure
correctness.

These changes collectively make sparse tensor support in ONNX Runtime
safer, more reliable, and easier to use from both C++ and Python.
### Description
The `vector_per_class` dimension was not verified, which could lead to
illegal memory access.



### Motivation and Context
Security issue.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…icrosoft#28248)

### Description

- Correct misleading 'SemVer 1.0.0' label; the universal version regex
actually validates SemVer 2.0.0 syntax without build metadata, which is
what Azure Universal Packages requires.

- Prefix the dev short SHA with 'commit-' in universal_version so the
pre-release identifier always contains a non-digit, avoiding spurious
validation failures for all-numeric SHAs with leading zeros.
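Why the prefix matters, per SemVer 2.0.0's pre-release rules; this checker is a simplified stand-in for illustration, not the pipeline's actual regex:

```python
import re

def valid_prerelease_identifier(ident: str) -> bool:
    # SemVer 2.0.0: a numeric identifier must not have leading zeros;
    # an alphanumeric identifier (contains a non-digit) may start with 0.
    if ident.isdigit():
        return ident == "0" or not ident.startswith("0")
    return re.fullmatch(r"[0-9A-Za-z-]+", ident) is not None
```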

### Motivation and Context

Fix invalid version when we have an all-numeric commit SHA starting with
0.
…ls (microsoft#28214)

This PR adds position_ids bounds checking to WebGPU and JS
RotaryEmbedding implementations, completing the security fix started in
PR microsoft#27597 (commit 056bab3) which covered CPU and CUDA.

## Problem
The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row
indices into cos_cache/sin_cache without bounds validation. While PR
microsoft#27597 fixed CPU and CUDA paths, WebGPU and JS implementations were
still missing bounds checks, which could produce silently wrong results
(WebGPU hardware clamps OOB reads).

## Changes
- **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation
(ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side
validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript
validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting
WebGPU EP

## Security
Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids,
CVSS 7.5-9.1) for WebGPU/JS execution providers.

## Testing
- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when
WebGPU EP unavailable
- JS/TS error tests not feasible with current JSONC test format
(documented)
- Build environment lacks C++20/emsdk for full compilation verification;
validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081)

### Description

Remove reinterpret_cast of OrtKernelInfo* to internal OpKernelInfo* that
breaks ABI across DLL boundaries (vtable mismatch between plugin EP and
ORT core).

- KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting
to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp()

- GetAllocator: use C API KernelInfoGetAllocator +
IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo*

- Remove #include core/framework/op_kernel_info.h (no longer needed)

- Add IAllocatorWrappingOrtAllocator adapter

### Motivation and Context

Address crash observed when testing WebGPU plugin EP with older ORT
1.24.4 binary where the number of `onnxruntime::IExecutionProvider`
virtual functions had changed between the two builds.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This patch adds support for Split-K with batch size > 1 by
encoding both the batch index and the Split-K index in dispatch_z and
decomposing them in the shader via:
  batch = logical_global_id.z / num_k_splits
  split_index = logical_global_id.z % num_k_splits

This patch also adds batch size to the criteria for using Split-K,
since increasing the batch size also increases parallelism,
reducing the effectiveness of Split-K.

This patch also replaces `consteval` with `constexpr` in
`ort_version_check.h` to work around a compilation error
with VS2022.
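The encoding round-trips cleanly; a Python restatement of the shader-side decomposition (names follow the description above):

```python
def encode_dispatch_z(batch: int, split_index: int, num_k_splits: int) -> int:
    # Pack batch and Split-K index into one dispatch dimension.
    return batch * num_k_splits + split_index

def decode_dispatch_z(z: int, num_k_splits: int):
    # Shader-side decomposition, as in the description above.
    batch = z // num_k_splits
    split_index = z % num_k_splits
    return batch, split_index
```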

### Motivation and Context
With this patch we improve the performance of
`sam-vit-b-decoder-static-fp16-demo` by 7.5% on Intel PTL.