Sync with Microsoft ONNX Runtime - 27042026 #1066
Open
ai-fw-intg wants to merge 11 commits into ovep-develop from
Conversation
…nd (microsoft#28083)

## Summary

Fixes a critical security vulnerability in the ONNX Runtime Python backend where user-controlled `kwargs` were applied to `SessionOptions` and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file overwrites.

## Vulnerability

The `prepare()` method in `onnxruntime/python/backend/backend.py` iterated over user-controlled `kwargs` and used `setattr()` to apply them directly to a `SessionOptions` instance. The `hasattr()` check was not a security guard — it returned `True` for all exposed properties, including dangerous ones like `optimized_model_filepath`.

**Attack vector:**

```python
onnxruntime.backend.prepare(
    model_path,
    optimized_model_filepath="/etc/passwd",  # overwrites any file with protobuf binary
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL,
)
```

The same pattern existed in `backend_rep.py` for `RunOptions`.

## Fix

Replaced the unrestricted `hasattr`/`setattr` loop in both files with strict allowlists:

- **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py`
- **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py`

**Both `SessionOptions` and `RunOptions` use identical validation logic** with three outcomes for each kwarg key:

- **Allowlisted** — applied via `setattr()` (e.g. `graph_optimization_level`, `log_severity_level`)
- **Known but blocked** (a real attribute on the object, but not on the allowlist) — raises `RuntimeError` (e.g. `optimized_model_filepath`, `terminate`)
- **Completely unknown** (not a property on the object at all) — silently ignored for forward compatibility (e.g. `nonexistent_option_xyz`)

**Blocked dangerous attributes:**

- `optimized_model_filepath` — triggers `Model::Save()`, overwrites arbitrary files with protobuf binary
- `profile_file_prefix` — writes profiling JSON to an arbitrary path
- `enable_profiling` — causes uncontrolled file writes to the cwd
- `terminate` (RunOptions) — denies the current inference call
- `training_mode` (RunOptions) — silently switches inference behavior in training builds

## Tests

Added `TestBackendKwargsAllowlist` with 13 new test methods covering all exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted attrs (accepted), unknown attrs (silently ignored), and end-to-end `run_model()` paths for both session and run options. All 15 tests pass (13 new + 2 pre-existing in `TestBackend`), no regressions.

## Files Changed

- `onnxruntime/python/backend/backend.py`
- `onnxruntime/python/backend/backend_rep.py`
- `onnxruntime/test/python/onnxruntime_test_python_backend.py`
- `.agents/skills/python-kwargs-setattr-security/SKILL.md`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
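As a companion sketch (not the actual patch code): the three-outcome validation described above could look like the following, with an illustrative subset of the allowlist rather than the full 13 entries from `backend.py`.

```python
# Illustrative subset of the allowlist; the real _ALLOWED_SESSION_OPTIONS
# in backend.py has 13 entries.
_ALLOWED_SESSION_OPTIONS = frozenset({
    "graph_optimization_level",
    "log_severity_level",
    "intra_op_num_threads",
})

def _apply_session_kwargs(options, kwargs):
    for key, value in kwargs.items():
        if key in _ALLOWED_SESSION_OPTIONS:
            setattr(options, key, value)  # allowlisted: applied
        elif hasattr(options, key):
            # Known but blocked: a real SessionOptions attribute that is
            # deliberately not settable via kwargs.
            raise RuntimeError(f"SessionOptions attribute {key!r} is not allowed")
        # Completely unknown keys fall through: silently ignored for
        # forward compatibility.
```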
…ttention (microsoft#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA PagedAttention op, enabling the operator on **sm<80 (Turing / Volta / Pascal) with fp16** for the first time. On sm>=80 the default FlashAttention path is unchanged; MEA is reachable via `ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider option for debugging and perf comparison.

| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80 support" — the op errors out immediately whenever FlashAttention isn't available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review, @tianleiwu flagged that the interface was too FlashAttention-specific (*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the exact pattern established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for sm<80 GQA"). The same CUTLASS memory-efficient attention backend that covers GQA's sm<80 path now covers PagedAttention.

Related work:

- microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback)
- microsoft#24595 — original PagedAttention PR
- microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering
- microsoft#27880 — ONNX Attention CUDA fallback coverage gaps
- microsoft#27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred; fall back to MemoryEfficientAttention via `has_memory_efficient_attention(sm, is_half, is_bf16, head_size, head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 == 0`. This keeps us forward-compatible with any future expansion of MEA's supported range.

**MEA path** (`UnfusedAttention<T>`):

1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV` (hoisted to `paged_attention.cc` so both FA and MEA paths consume a pre-populated buffer — single-producer refactor), rotary, packed-QKV unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads, head_size]` buffer, folding in GQA head expansion (so downstream MEA sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via `seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr = cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No padding required; layout matches the kernel's expected `[total_tokens, num_heads, head_size]` with BSNH strides.

**Scratch allocation**: the MEA path D->H syncs `cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain `total_kv_tokens` on the host for tight `gathered_key` / `gathered_value` / `fmha_buffer` allocation. This adds a forward-per-call `cudaStreamSynchronize` — acceptable for a compatibility fallback (FA remains the hot path on supported hardware). Over-allocation (the no-sync alternative) would consume `B × max_num_blocks_per_seq × block_size × num_heads × head_size × 2 × sizeof(T)`, which reaches GB-scale for realistic GQA models and was rejected. `fmha_buffer` is sized with `sizeof(float)` (matching the GQA EfficientAttention pattern at `group_query_attention.cc:482`) because MEA's output accumulator is fp32 regardless of input dtype.

### Testing

New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs the existing parity matrix (rotary on/off, rotary_interleaved on/off, packed-QKV on/off, local window on/off, softcap 0/50, varied head sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider option set to `EFFICIENT_ATTENTION` (=2, from the `AttentionBackend` enum). Using a per-session provider option instead of an env var means both FA and MEA test classes coexist in the same pytest process — each InferenceSession creates its own CUDA EP with its own `attention_kernel_options_`. The existing `TestPagedAttention` class is skipped wholesale on sm<80 by its `has_flash_attention()` gate, so without the new MEA class the fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:    24/24 passed (~60s)  # FA baseline — no regression
TestPagedAttentionMEA: 24/24 passed (~59s)  # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal hardware, so real-SM coverage relies on MS CI. The code path exercised on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on sm<80; only the underlying CUTLASS kernel (`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20 define was needed because the initial configure resolved `CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval` usage fails to compile. Unrelated to this change.
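For readers wanting to reproduce the per-session backend selection the testing section describes, a hedged sketch (the `sdpa_kernel` option name and the value 2 = `EFFICIENT_ATTENTION` come from the PR text; the model path is a placeholder):

```python
import onnxruntime as ort

# Select the memory-efficient attention backend per session rather than via
# the ORT_DISABLE_FLASH_ATTENTION env var, so FA and MEA sessions can
# coexist in one process. "model.onnx" is a placeholder path.
session = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", {"sdpa_kernel": "2"})],
)
```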
…pkg set (microsoft#28254)

### Description

Remove the `react-native` package from the set of packages required for RC/release publishing. We will need to revisit this and decide whether to remove it entirely or properly fix it.

### Motivation and Context

The React Native package is having build issues and we don't need it for the next few immediate releases.
### Description

Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to the CoreML Execution Provider's MLProgram path. The builder decomposes QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the op's own schema function-body in `contrib_defs.cc:605-631` and the approach the QNN EP already uses in `qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram path is implemented; NeuralNetwork is deprecated on Apple Silicon.

Adds `CoreMLExecutionProviderTest.QuickGeluTest`, which builds a single `com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies the entire graph is claimed by the CoreML EP via `ExpectedEPNodeAssignment::All`. Verified with a negative test: temporarily removing the `CreateQuickGeluOpBuilder` registration causes the new test to fail with a `VerifyEPNodeAssignment` fatal failure, proving it genuinely exercises the CoreML path. Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28183. QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass (`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at `ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the default session optimization level. So any model that contains the `x * sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the DWPose pose estimator) gets silently mutated by ORT into a graph with `QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported primitives into 1 unsupported op, making the fusion strictly harmful for CoreML.

On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and `ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks are other ops — see "Related gaps" below. A ~4× speedup at EXTENDED level from this patch alone. Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where partial shape inference inhibits most fusions), 3 `QuickGelu` nodes still get produced — so this patch helps any CoreML user who hasn't explicitly downgraded to `ORT_ENABLE_BASIC`.

### Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the EXTENDED-optimized DWPose pose model are:

- **`com.microsoft:FusedConv`** (×4) — produced by `ConvActivationFusion`. Fuses `Conv + activation` into one node. Same failure mode as QuickGelu: `Conv` and the activations (`Relu`, `Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but the fused form isn't. Decomposition is straightforward — emit the underlying `conv` MIL op, then the corresponding activation.
- **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) — `MatMul * alpha` with an optional transpose. Decomposition: `matmul` + scalar `mul`.
- **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to fusion. CoreML MIL has a native `split` op; this one is a straight op-builder omission.

Happy to send follow-up PRs for any of these after this one lands, following the same pattern. Flagging here so they're on the EP coverage roadmap.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
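A quick numeric sketch of the decomposition (plain NumPy, mirroring the three MIL ops; `alpha=1.5` matches the new test's non-default value):

```python
import numpy as np

def quick_gelu(x, alpha=1.5):
    # The three-op decomposition the builder emits:
    scaled = alpha * x                    # mul
    gate = 1.0 / (1.0 + np.exp(-scaled))  # sigmoid
    return x * gate                       # mul

x = np.linspace(-3.0, 3.0, 7, dtype=np.float32)
print(quick_gelu(x))  # QuickGelu(x) = x * Sigmoid(alpha * x)
```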
…tilities (microsoft#28227)

This pull request significantly improves the safety and robustness of sparse tensor handling in ONNX Runtime. The main focus is on adding thorough bounds checking and using safe integer arithmetic to prevent overflows and invalid memory accesses when working with sparse tensor indices. Additionally, the Python bindings for sparse tensors are refactored to ensure correct object lifetimes and memory management when exposing data to NumPy.

**Sparse Tensor Index Validation and Safety**

* Added comprehensive bounds checks for COO and CSR sparse tensor indices in both the C API (`onnxruntime_c_api.cc`) and core conversion utilities, ensuring indices are within valid ranges and, for CSR, that outer indices are non-decreasing and within bounds.
* Replaced direct arithmetic with `SafeInt` for all index and size calculations to prevent integer overflows, especially when converting between types or computing dense tensor offsets.
* Improved error messages for invalid indices, making debugging easier by providing more context about the specific error.

**Python Bindings Improvements**

* Refactored the pybind11 bindings for sparse tensor views so that NumPy arrays referencing sparse tensor memory correctly keep the parent Python object alive, preventing potential memory issues when the sparse tensor is on the GPU or managed by Python.

**General Code Quality**

* Added the missing header include for `safeint.h` to ensure `SafeInt` is available where needed.
* Minor cleanups and improved assertions to clarify intent and ensure correctness.

These changes collectively make sparse tensor support in ONNX Runtime safer, more reliable, and easier to use from both C++ and Python.
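A minimal Python sketch of the validation rules listed above, assuming flat (1-D) COO indices; the actual checks live in C++ with `SafeInt` arithmetic, and the function names here are illustrative:

```python
import numpy as np

def validate_coo_flat_indices(indices, dense_shape):
    # Flat (1-D) COO indices address the dense tensor as a linear buffer,
    # so every index must lie in [0, total_elements).
    idx = np.asarray(indices, dtype=np.int64)
    total = int(np.prod(dense_shape))
    if idx.size and (idx.min() < 0 or idx.max() >= total):
        raise ValueError("COO index out of range for the dense shape")

def validate_csr_outer_indices(outer, nnz):
    # CSR outer (row-offset) indices must be non-decreasing, and each must
    # lie within [0, nnz].
    o = np.asarray(outer, dtype=np.int64)
    if np.any(np.diff(o) < 0):
        raise ValueError("CSR outer indices must be non-decreasing")
    if o.size and (o.min() < 0 or o.max() > nnz):
        raise ValueError("CSR outer index out of bounds")
```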
### Description

The `vector_per_class` dimension was not verified, which could lead to illegal memory access.

### Motivation and Context

Security issue.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
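The summary gives no kernel details, so purely as a hypothetical illustration of the class of check involved (names and data layout are assumptions, not the actual fix):

```python
def check_vector_per_class(vector_per_class, support_vectors, n_features):
    # Hypothetical pre-check: the per-class vector counts must account for
    # exactly the support-vector rows present; a mismatch would let later
    # indexing read past the end of the buffer.
    expected_rows = sum(vector_per_class)
    actual_rows = len(support_vectors) // n_features
    if expected_rows != actual_rows:
        raise ValueError("vector_per_class is inconsistent with support vectors")
```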
…icrosoft#28248)

### Description

- Correct the misleading 'SemVer 1.0.0' label; the universal version regex actually validates SemVer 2.0.0 syntax without build metadata, which is what Azure Universal Packages requires.
- Prefix the dev short SHA with 'commit-' in universal_version so the pre-release identifier always contains a non-digit, avoiding spurious validation failures for all-numeric SHAs with leading zeros.

### Motivation and Context

Fix an invalid version when we have an all-numeric commit SHA starting with 0.
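The leading-zero failure mode follows from SemVer 2.0.0's rule that numeric pre-release identifiers must not have leading zeroes, while alphanumeric identifiers are exempt. A small self-contained demo (simplified identifier patterns, not the pipeline's actual regex):

```python
import re

# Simplified SemVer 2.0.0 pre-release identifier rules: numeric identifiers
# must not have leading zeroes; alphanumeric identifiers may contain them.
NUMERIC_ID = re.compile(r"^(0|[1-9]\d*)$")
ALNUM_ID = re.compile(r"^[0-9A-Za-z-]*[A-Za-z-][0-9A-Za-z-]*$")

def prerelease_id_ok(ident: str) -> bool:
    return bool(NUMERIC_ID.match(ident) or ALNUM_ID.match(ident))

assert not prerelease_id_ok("0123456")     # all-numeric SHA, leading zero: rejected
assert prerelease_id_ok("commit-0123456")  # prefixed: alphanumeric, accepted
```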
…ls (microsoft#28214)

This PR adds position_ids bounds checking to the WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR microsoft#27597 (commit 056bab3), which covered CPU and CUDA.

## Problem

The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR microsoft#27597 fixed the CPU and CUDA paths, the WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).

## Changes

- **contrib_ops/webgpu/bert/rotary_embedding.cc**: host-side validation (ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: host-side validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting the WebGPU EP

## Security

Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids, CVSS 7.5-9.1) for the WebGPU/JS execution providers.

## Testing

- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when the WebGPU EP is unavailable
- JS/TS error tests are not feasible with the current JSONC test format (documented)
- The build environment lacks C++20/emsdk for full compilation verification; validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
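A host-side sketch of the bounds rule being enforced (Python for brevity; the real checks live in the C++/TypeScript files listed above, and the names here are illustrative):

```python
import numpy as np

def validate_position_ids(position_ids, num_cache_rows):
    # Every position id is used as a row index into cos_cache/sin_cache,
    # so each must lie in [0, num_cache_rows).
    pos = np.asarray(position_ids, dtype=np.int64)
    if pos.size and (pos.min() < 0 or pos.max() >= num_cache_rows):
        raise ValueError(
            f"position_ids out of range [0, {num_cache_rows}): "
            f"min={pos.min()}, max={pos.max()}"
        )
```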
…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081)

### Description

Remove the reinterpret_cast of OrtKernelInfo* to the internal OpKernelInfo*, which breaks ABI across DLL boundaries (vtable mismatch between the plugin EP and ORT core).

- KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp()
- GetAllocator: use the C API KernelInfoGetAllocator + IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo*
- Remove #include core/framework/op_kernel_info.h (no longer needed)
- Add the IAllocatorWrappingOrtAllocator adapter

### Motivation and Context

Addresses a crash observed when testing the WebGPU plugin EP against an older ORT 1.24.4 binary, where the number of `onnxruntime::IExecutionProvider` virtual functions had changed between the two builds.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

This patch adds support for Split-K with batch size > 1 by encoding both the batch index and the Split-K index in dispatch_z and decomposing them in the shader via:

    batch = logical_global_id.z / num_k_splits
    split_index = logical_global_id.z % num_k_splits

This patch also adds batch size to the criteria for using Split-K, since a larger batch size already increases parallelism, reducing the effectiveness of Split-K.

This patch also replaces `consteval` with `constexpr` in `ort_version_check.h` to work around a compilation error with VS2022.

### Motivation and Context

With this patch we improve the performance of `sam-vit-b-decoder-static-fp16-demo` by 7.5% on Intel PTL.
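The encode/decode pair is easy to sanity-check outside the shader; a round-trip sketch of the scheme described above (function names are illustrative):

```python
def encode_dispatch_z(batch, split_index, num_k_splits):
    # Host side: pack the batch and Split-K indices into one dispatch dim.
    return batch * num_k_splits + split_index

def decode_dispatch_z(z, num_k_splits):
    # Shader side (as in the patch): recover both indices.
    return z // num_k_splits, z % num_k_splits

# Round-trip check over a small grid.
for b in range(4):
    for s in range(3):
        assert decode_dispatch_z(encode_dispatch_z(b, s, 3), 3) == (b, s)
```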
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.