Backmerging PR #1068

Closed
jatinwadhwa921 wants to merge 23 commits into ovep-develop from sync_30_4_2026

Conversation

@jatinwadhwa921

No description provided.

titaiwangms and others added 23 commits April 28, 2026 09:30
…nd (microsoft#28083)

## Summary

Fixes a critical security vulnerability in the ONNX Runtime Python
backend where user-controlled `kwargs` were applied to `SessionOptions`
and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file
overwrites.

## Vulnerability

The `prepare()` method in `onnxruntime/python/backend/backend.py`
iterated over user-controlled `kwargs` and used `setattr()` to apply
them directly to a `SessionOptions` instance. The `hasattr()` check was
not a security guard — it returned `True` for all exposed properties
including dangerous ones like `optimized_model_filepath`.

**Attack vector:**
```python
onnxruntime.backend.prepare(
    model_path,
    optimized_model_filepath="/etc/passwd",  # overwrites any file with protobuf binary
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
)
```

The same pattern existed in `backend_rep.py` for `RunOptions`.

## Fix

Replaced the unrestricted `hasattr/setattr` loop in both files with
strict allowlists:

- **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py`
- **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py`

**Both `SessionOptions` and `RunOptions` use identical validation
logic** with three outcomes for each kwarg key:

- **Allowlisted** — Applied via `setattr()` (e.g.
`graph_optimization_level`, `log_severity_level`)
- **Known-but-blocked** (real attribute on the object, but not on
allowlist) — Raises `RuntimeError` (e.g. `optimized_model_filepath`,
`terminate`)
- **Completely unknown** (not a property on the object at all) —
Silently ignored for forward compatibility (e.g.
`nonexistent_option_xyz`)

**Blocked dangerous attributes:**
- `optimized_model_filepath` — triggers `Model::Save()`, overwrites
arbitrary files with protobuf binary
- `profile_file_prefix` — writes profiling JSON to arbitrary path
- `enable_profiling` — causes uncontrolled file writes to cwd
- `terminate` (RunOptions) — denies the current inference call
- `training_mode` (RunOptions) — silently switches inference behavior in
training builds
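
The three-outcome validation can be sketched in Python. The allowlist contents below are illustrative (the PR's real lists hold 13 and 4 entries), and `apply_session_kwargs` is a hypothetical helper name, not the actual function in `backend.py`:

```python
# Illustrative subset of the allowlist; the real _ALLOWED_SESSION_OPTIONS
# in backend.py contains 13 safe attributes.
_ALLOWED_SESSION_OPTIONS = frozenset({
    "graph_optimization_level",
    "log_severity_level",
    "intra_op_num_threads",
})

def apply_session_kwargs(session_options, kwargs):
    """Apply user kwargs with three outcomes per key:
    allowlisted -> setattr, known-but-blocked -> RuntimeError,
    completely unknown -> silently ignored (forward compatibility)."""
    for key, value in kwargs.items():
        if key in _ALLOWED_SESSION_OPTIONS:
            setattr(session_options, key, value)
        elif hasattr(session_options, key):
            # Real attribute, but not allowlisted (e.g. optimized_model_filepath)
            raise RuntimeError(
                f"Session option '{key}' is not allowed for security reasons")
        # else: unknown option, ignored for forward compatibility
```

With this in place the attack vector above raises instead of writing to disk, while unknown keys like `nonexistent_option_xyz` are skipped without error.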

## Tests

Added `TestBackendKwargsAllowlist` with 13 new test methods covering all
exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted
attrs (accepted), unknown attrs (silently ignored), and end-to-end
`run_model()` paths for both session and run options. All 15 tests pass
(13 new + 2 pre-existing in `TestBackend`), no regressions.

## Files Changed

- `onnxruntime/python/backend/backend.py`
- `onnxruntime/python/backend/backend_rep.py`
- `onnxruntime/test/python/onnxruntime_test_python_backend.py`
- `.agents/skills/python-kwargs-setattr-security/SKILL.md`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ttention (microsoft#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA
PagedAttention op, enabling the operator on **sm<80 (Turing / Volta /
Pascal) with fp16** for the first time. On sm>=80 the default
FlashAttention path is unchanged; MEA is reachable via
`ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider
option for debugging and perf comparison.


| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80
support" — the op errors out immediately whenever FlashAttention isn't
available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review,
@tianleiwu flagged that the interface was too FlashAttention-specific
(*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado
agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the exact pattern
established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for
sm<80 GQA"). The same CUTLASS memory-efficient attention backend that
covers GQA's sm<80 path now covers PagedAttention.

Related work:
- microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback)
- microsoft#24595 — original PagedAttention PR
- microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering
- microsoft#27880 — ONNX Attention CUDA fallback coverage gaps
- microsoft#27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred;
fall back to MemoryEfficientAttention via
`has_memory_efficient_attention(sm, is_half, is_bf16, head_size,
head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own
helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 ==
0`. This keeps us forward-compatible with any future expansion of MEA's
supported range.
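
The gating conditions quoted above can be sketched in Python. The helper itself is C++ (`has_memory_efficient_attention`); `mea_supported` below is a hypothetical name restating the documented bounds:

```python
def mea_supported(sm, is_fp16, is_bf16, head_size):
    """Sketch of MEA's own gating: fp16 needs sm >= 53, bf16 needs
    sm >= 80, and head_size must be <= 1024 and a multiple of 8."""
    if head_size > 1024 or head_size % 8 != 0:
        return False
    if is_bf16:
        return sm >= 80
    if is_fp16:
        return sm >= 53
    return False
```

This matches the before/after table: Turing fp16 (sm75) passes, Turing bf16 does not.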

**MEA path** (`UnfusedAttention<T>`):
1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV`
(hoisted to `paged_attention.cc` so both FA and MEA paths consume a
pre-populated buffer — single-producer refactor), rotary, packed-QKV
unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to
gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads,
head_size]` buffer, folding in GQA head expansion (so downstream MEA
sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via
`seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr =
cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No
padding required; layout matches the kernel's expected `[total_tokens,
num_heads, head_size]` with BSNH strides.
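
The gather-and-expand step (point 2 above) can be mimicked in NumPy. Shapes and the helper name are assumptions for illustration; the real kernel is CUDA:

```python
import numpy as np

def gather_and_expand_paged_kv(kv_cache, block_table, seq_lens, num_heads):
    """NumPy sketch of GatherAndExpandPagedKVCache (assumed layout):
    kv_cache:    [num_blocks, block_size, num_kv_heads, head_size]
    block_table: [batch, max_blocks_per_seq] of block ids
    seq_lens:    [batch] valid token counts per sequence
    Returns a packed varlen buffer [total_kv_tokens, num_heads, head_size]
    with GQA heads expanded so downstream MEA sees num_heads uniformly."""
    _, block_size, num_kv_heads, head_size = kv_cache.shape
    group = num_heads // num_kv_heads
    out = []
    for b, n_tok in enumerate(seq_lens):
        n_blocks = -(-n_tok // block_size)           # ceil division
        toks = kv_cache[block_table[b, :n_blocks]]   # gather via block_table
        toks = toks.reshape(-1, num_kv_heads, head_size)[:n_tok]
        out.append(np.repeat(toks, group, axis=1))   # GQA head expansion
    return np.concatenate(out, axis=0)
```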

**Scratch allocation**: the MEA path D->H syncs
`cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain
`total_kv_tokens` on the host for tight `gathered_key` /
`gathered_value` / `fmha_buffer` allocation. This adds a
forward-per-call `cudaStreamSynchronize` — acceptable for a
compatibility fallback (FA remains the hot path on supported hardware).
Over-allocation (the no-sync alternative) would consume `B ×
max_num_blocks_per_seq × block_size × num_heads × head_size × 2 ×
sizeof(T)`, which reaches GB-scale for realistic GQA models and was
rejected.

`fmha_buffer` is sized with `sizeof(float)` (matching the GQA
EfficientAttention pattern at `group_query_attention.cc:482`) because
MEA's output accumulator is fp32 regardless of input dtype.
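
To make the rejected over-allocation concrete, a back-of-envelope calculation with illustrative (not from-the-PR) model parameters:

```python
# Hypothetical numbers for a GQA model with fp16 K/V caches.
B = 32                        # batch size
max_num_blocks_per_seq = 512
block_size = 16               # tokens per block -> 8192-token max context
num_heads = 32                # heads after GQA expansion
head_size = 128
sizeof_T = 2                  # fp16

bytes_needed = (B * max_num_blocks_per_seq * block_size
                * num_heads * head_size * 2 * sizeof_T)  # x2 for K and V
print(f"{bytes_needed / 2**30:.1f} GiB")
```

Under these assumptions the no-sync alternative would reserve 4 GiB of scratch per call, which is why the D->H sync was preferred.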

### Testing

New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs
the existing parity matrix (rotary on/off, rotary_interleaved on/off,
packed-QKV on/off, local window on/off, softcap 0/50, varied head
sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider
option set to `EFFICIENT_ATTENTION` (=2, from `AttentionBackend` enum).
Using a per-session provider option instead of an env var means both FA
and MEA test classes coexist in the same pytest process — each
InferenceSession creates its own CUDA EP with its own
`attention_kernel_options_`.

The existing `TestPagedAttention` class is skipped wholesale on sm<80 by
its `has_flash_attention()` gate, so without the new MEA class the
fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:       24/24 passed (~60s)   # FA baseline — no regression
TestPagedAttentionMEA:    24/24 passed (~59s)   # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by
the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal
hardware, so real-SM coverage relies on MS CI. The code path exercised
on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on
sm<80; only the underlying CUTLASS kernel
(`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and
those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20
define was needed because the initial configure resolved
`CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval`
usage fails to compile. Unrelated to this change.
…pkg set (microsoft#28254)

### Description

Remove `react-native` package from set of packages required for
RC/release publishing.
We will need to revisit this and decide whether to remove it entirely or
properly fix it.

### Motivation and Context

The React Native package is having build issues and we don't need it for
the next few immediate releases.
### Description

Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to
the CoreML Execution Provider's MLProgram path. The builder decomposes
QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the
op's own schema function-body in `contrib_defs.cc:605-631` and the
approach the QNN EP already uses in
`qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram
path is implemented; NeuralNetwork is deprecated on Apple Silicon.
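
The decomposition can be checked numerically against the closed form, here as a NumPy sketch of the three MIL ops:

```python
import numpy as np

def quick_gelu_decomposed(x, alpha=1.0):
    """mul / sigmoid / mul decomposition used by the MLProgram builder:
    QuickGelu(x) = x * Sigmoid(alpha * x)."""
    t = alpha * x                      # mul
    s = 1.0 / (1.0 + np.exp(-t))       # sigmoid
    return x * s                       # mul

x = np.linspace(-4, 4, 9)
ref = x * (1.0 / (1.0 + np.exp(-1.5 * x)))
assert np.allclose(quick_gelu_decomposed(x, alpha=1.5), ref)
```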

Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single
`com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies
the entire graph is claimed by the CoreML EP via
`ExpectedEPNodeAssignment::All`. Verified with a negative test:
temporarily removing the `CreateQuickGeluOpBuilder` registration causes
the new test to fail with a `VerifyEPNodeAssignment` fatal failure,
proving it genuinely exercises the CoreML path.

Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28183.

QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass
(`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at
`ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the
default session optimization level. So any model that contains the `x *
sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the
DWPose pose estimator) gets silently mutated by ORT into a graph with
`QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported
primitives into 1 unsupported op, making the fusion strictly harmful for
CoreML.

On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and
`ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the
result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks are other ops — see "Related gaps" below. A ~4×
speedup at EXTENDED level from this patch alone.

Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where
partial shape inference inhibits most fusions), 3 `QuickGelu` nodes
still get produced — so this patch helps any CoreML user who hasn't
explicitly downgraded to `ORT_ENABLE_BASIC`.

### Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the
EXTENDED-optimized DWPose pose model are:

- **`com.microsoft:FusedConv`** (×4) — produced by
`ConvActivationFusion`. Fuses `Conv + activation` into one node. Same
failure mode as QuickGelu: `Conv` and the activations (`Relu`,
`Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but
the fused form isn't. Decomposition is straightforward — emit the
underlying `conv` MIL op, then the corresponding activation.
- **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) —
`MatMul * alpha` with an optional transpose. Decomposition: `matmul` +
scalar `mul`.
- **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to
fusion. CoreML MIL has a native `split` op; this one is a straight
op-builder omission.

Happy to send follow-up PRs for any of these after this one lands,
following the same pattern. Flagging here so they're on the EP coverage
roadmap.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tilities (microsoft#28227)

This pull request significantly improves the safety and robustness of
sparse tensor handling in ONNX Runtime. The main focus is on adding
thorough bounds checking and using safe integer arithmetic to prevent
overflows and invalid memory accesses when working with sparse tensor
indices. Additionally, the Python bindings for sparse tensors are
refactored to ensure correct object lifetimes and memory management when
exposing data to NumPy.

**Sparse Tensor Index Validation and Safety**

* Added comprehensive bounds checks for COO and CSR sparse tensor
indices in both the C API (`onnxruntime_c_api.cc`) and core conversion
utilities, ensuring indices are within valid ranges and, for CSR, that
outer indices are non-decreasing and within bounds.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)
* Replaced direct arithmetic with `SafeInt` for all index and size
calculations to prevent integer overflows, especially when converting
between types or computing dense tensor offsets.
[[1]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2077-R2077)
[[3]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2091-R2091)
[[4]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2110-R2110)
[[5]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2291-R2298)
* Improved error messages for invalid indices, making debugging easier
by providing more context about the specific error.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)
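
The invariants being enforced can be sketched in Python (the C++ uses `SafeInt` for overflow-safe arithmetic; plain comparisons illustrate the checks, and the helper names are hypothetical):

```python
def validate_coo_indices(indices, dense_size):
    """Each linear COO index must lie in [0, dense_size)."""
    for i, idx in enumerate(indices):
        if not 0 <= idx < dense_size:
            raise ValueError(f"COO index {idx} at position {i} "
                             f"out of range [0, {dense_size})")

def validate_csr_indices(inner, outer, num_rows, num_cols, nnz):
    """Inner indices in [0, num_cols); outer indices non-decreasing
    and in [0, nnz], with len(outer) == num_rows + 1."""
    if len(outer) != num_rows + 1:
        raise ValueError("outer index length must be num_rows + 1")
    prev = 0
    for o in outer:
        if not prev <= o <= nnz:
            raise ValueError(f"outer index {o} decreasing or > nnz={nnz}")
        prev = o
    for j in inner:
        if not 0 <= j < num_cols:
            raise ValueError(f"inner index {j} out of range [0, {num_cols})")
```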

**Python Bindings Improvements**

* Refactored the pybind11 bindings for sparse tensor views so that NumPy
arrays referencing sparse tensor memory correctly keep the parent Python
object alive, preventing potential memory issues when the sparse tensor
is on the GPU or managed by Python.
[[1]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL98-R120)
[[2]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL299-R304)
[[3]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL314-R319)

**General Code Quality**

* Added missing header include for `safeint.h` to ensure `SafeInt` is
available where needed.
* Minor cleanups and improved assertions to clarify intent and ensure
correctness.

These changes collectively make sparse tensor support in ONNX Runtime
safer, more reliable, and easier to use from both C++ and Python.
### Description
vector_per_class dimension was not verified, it could lead to illegal
memory access



### Motivation and Context
security issue

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…icrosoft#28248)

### Description
<!-- Describe your changes. -->

- Correct misleading 'SemVer 1.0.0' label; the universal version regex
actually validates SemVer 2.0.0 syntax without build metadata, which is
what Azure Universal Packages requires.

- Prefix the dev short SHA with 'commit-' in universal_version so the
pre-release identifier always contains a non-digit, avoiding spurious
validation failures for all-numeric SHAs with leading zeros.
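
Why the prefix helps: SemVer 2.0.0 forbids leading zeros in *numeric* pre-release identifiers, but an identifier containing any non-digit is *alphanumeric* and allowed. A small check using the single-identifier rule from the spec:

```python
import re

# SemVer 2.0.0 rule for one pre-release identifier: a numeric identifier
# with no leading zeros, or an alphanumeric identifier (has a non-digit).
PRERELEASE_ID = re.compile(r"^(0|[1-9]\d*|[0-9A-Za-z-]*[A-Za-z-][0-9A-Za-z-]*)$")

def valid_prerelease_id(ident):
    return PRERELEASE_ID.match(ident) is not None

sha = "0123456"                               # all-numeric SHA, leading zero
assert not valid_prerelease_id(sha)           # rejected as numeric with leading 0
assert valid_prerelease_id("commit-" + sha)   # alphanumeric: always accepted
```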

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix invalid version when we have an all-numeric commit SHA starting with
0.
…ls (microsoft#28214)

This PR adds position_ids bounds checking to WebGPU and JS
RotaryEmbedding implementations, completing the security fix started in
PR microsoft#27597 (commit 056bab3) which covered CPU and CUDA.

## Problem
The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row
indices into cos_cache/sin_cache without bounds validation. While PR
microsoft#27597 fixed CPU and CUDA paths, WebGPU and JS implementations were
still missing bounds checks, which could produce silently wrong results
(WebGPU hardware clamps OOB reads).

## Changes
- **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation
(ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side
validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript
validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting
WebGPU EP
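
The host-side check amounts to the following (NumPy sketch with a hypothetical helper name; the real validation lives in the C++/TypeScript files listed above):

```python
import numpy as np

def check_position_ids(position_ids, cos_cache_rows):
    """Every position id must be a valid row index into
    cos_cache/sin_cache; otherwise an OOB read (or a silently clamped
    wrong result on WebGPU hardware) would follow."""
    pos = np.asarray(position_ids)
    if pos.size and (pos.min() < 0 or pos.max() >= cos_cache_rows):
        raise ValueError(
            f"position_ids must be in [0, {cos_cache_rows}), "
            f"got range [{pos.min()}, {pos.max()}]")
```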

## Security
Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids,
CVSS 7.5-9.1) for WebGPU/JS execution providers.

## Testing
- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when
WebGPU EP unavailable
- JS/TS error tests not feasible with current JSONC test format
(documented)
- Build environment lacks C++20/emsdk for full compilation verification;
validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081)

### Description
<!-- Describe your changes. -->

Remove reinterpret_cast of OrtKernelInfo* to internal OpKernelInfo* that
breaks ABI across DLL boundaries (vtable mismatch between plugin EP and
ORT core).

- KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting
to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp()

- GetAllocator: use C API KernelInfoGetAllocator +
IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo*

- Remove #include core/framework/op_kernel_info.h (no longer needed)

- Add IAllocatorWrappingOrtAllocator adapter

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Address crash observed when testing WebGPU plugin EP with older ORT
1.24.4 binary where the number of `onnxruntime::IExecutionProvider`
virtual functions had changed between the two builds.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This patch adds the support of Split-K with batch size > 1 by
encoding both batch index and Split-K index in dispatch_z and
decompose them in the shader via:
  batch = logical_global_id.z / num_k_splits
  split_index = logical_global_id.z % num_k_splits

This patch also adds batch size to the criteria for using Split-K,
since increasing the batch size also increases the available
parallelism, reducing the effectiveness of Split-K.
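
The encode/decode scheme above can be verified with a quick roundtrip (hypothetical helper names mirroring the shader arithmetic):

```python
def encode_dispatch_z(batch, split_index, num_k_splits):
    """Pack batch index and Split-K index into a single dispatch_z."""
    return batch * num_k_splits + split_index

def decode_dispatch_z(z, num_k_splits):
    """Inverse, as in the shader: batch = z // num_k_splits,
    split_index = z % num_k_splits."""
    return z // num_k_splits, z % num_k_splits

num_k_splits = 4
for batch in range(3):
    for split in range(num_k_splits):
        z = encode_dispatch_z(batch, split, num_k_splits)
        assert decode_dispatch_z(z, num_k_splits) == (batch, split)
```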

This patch also replaces `consteval` with `constexpr` in
`ort_version_check.h` to work around a compilation error
with VS2022.

### Motivation and Context
With this patch we can improve the performance of
`sam-vit-b-decoder-static-fp16-demo` (7.5%) on Intel PTL.
…oft#27760)

### Description

Adds aarch64 Linux wheel builds to the CUDA GPU packaging pipeline,
mirroring the existing x86_64 configuration.

- **`stages/py-linux-gpu-stage.yml`**: Add `hostArchitecture: Arm64` to
pool config when `arch == 'aarch64'` (matches pattern in `py-linux.yml`)
- **`stages/py-gpu-packaging-stage.yml`**: Add
`docker_base_image_aarch64` and `AArch64LinuxPythonConfigurations`
parameters (defaults to `[]` so CUDA 12 pipeline is unaffected), aarch64
build stages, and merge artifact dependencies/downloads
- **`py-cuda13-packaging-pipeline.yml`**: Pass aarch64 base image and
Python configs for all supported versions (3.11–3.14, including
free-threaded)
- **`aarch64/python/cuda/Dockerfile`** +
**`scripts/install_centos.sh`**: New Docker build context for aarch64
CUDA builds. It differs from the x86_64 variant in that aarch64
installs TensorRT from a tarball.

### Motivation and Context

`onnxruntime-gpu` only ships x86_64 and Windows wheels. Installing on
`manylinux_2_39_aarch64` (e.g. `ubuntu-24.04-arm` runners) fails with no
compatible wheel available.

- Fixes microsoft#27005

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
When M is small and batchA is large, each tile contains some invalid
elements; merging batchA into the M dimension reduces the workgroup
count.
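
The idea can be checked with NumPy: when every batch of A multiplies the same right-hand matrix, A can be viewed as one tall (batchA*M, K) operand and dispatched as a single larger matmul (shapes below are hypothetical):

```python
import numpy as np

# Hypothetical shapes: M is small, batchA is large, B is shared.
batch_a, M, K, N = 64, 2, 8, 16
A = np.random.rand(batch_a, M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

batched = A @ B                              # [batch_a, M, N], many tiny tiles
merged = A.reshape(batch_a * M, K) @ B       # one [batch_a*M, K] matmul
assert np.allclose(batched.reshape(batch_a * M, N), merged, atol=1e-5)
```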

---------

Co-authored-by: wp <webgraphics@intel.com>
…crosoft#28172)

Keep original roundingType name for a period of time to ensure backward
compatibility.

Spec change: webmachinelearning/webnn#770

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
### Description
Update OpenVINO version for OVEP.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
### Description

Fixes ONNX Runtime startup on Linux ARM64 environments where
`/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/present`
are unavailable, such as AWS Lambda ARM64/Graviton and restricted build
sandboxes.

There are two related failure modes:

1. `PosixEnv` may be constructed before ORT's default logger is
registered. If `cpuinfo_initialize()` fails during that early
construction path, the existing `LOGS_DEFAULT(INFO)` call can terminate
with `Attempt to use DefaultLogger but none has been registered`.
2. The bundled `pytorch/cpuinfo` code treats missing Linux CPU
`possible`/`present` sysfs cpulists as fatal on ARM Linux. The max-count
helpers return `UINT32_MAX`, which wraps to `0` after `1 + UINT32_MAX`
in ARM Linux initialization and prevents cpuinfo from reaching the later
`/proc/cpuinfo` and `getauxval()` based detection paths.

### Root Cause

The immediate import crash is caused by unsafe early logging in
`onnxruntime/core/platform/posix/env.cc`. Python bindings can reference
`Env::Default()` during module load before logging is initialized, so a
cpuinfo initialization failure must not use `LOGS_DEFAULT()` unless a
default logger exists.

The cpuinfo initialization failure is more subtle. A count-only fallback
is not enough: after cpuinfo computes max possible/present CPU counts,
it calls `cpuinfo_linux_detect_possible_processors()` and
`cpuinfo_linux_detect_present_processors()` to set
`CPUINFO_LINUX_FLAG_POSSIBLE` and `CPUINFO_LINUX_FLAG_PRESENT` on each
processor. ARM Linux initialization later marks processors valid only if
those flags are set. If only the count fallback is provided,
`valid_processors` can remain zero and cpuinfo can proceed into an
invalid partial initialization state.

### Fix

- Make `PosixEnv` logging safe when cpuinfo initialization fails before
a default logger exists:
- use `logging::LoggingManager::HasDefaultLogger()` before
`LOGS_DEFAULT()`
  - fall back to `std::cerr` when no logger is registered
- Add a cpuinfo patch for Linux missing sysfs CPU cpulists:
- fallback max possible/present processor detection to
`sysconf(_SC_NPROCESSORS_ONLN) - 1`
- fallback present/possible processor flag detection by marking CPUs
`0..nproc-1`
- preserve existing sysfs parsing behavior when the cpulist files are
available
- Wire the cpuinfo patch into the existing cpuinfo FetchContent flow for
Linux and existing ARM64/ARM64EC patch path.
- Add a simulation test that validates:
  - safe early logging without a registered default logger
- `sysconf(_SC_NPROCESSORS_ONLN)` count and present/possible flag
fallback behavior
  - hiding `/sys/devices/system/cpu/{possible,present}` via `LD_PRELOAD`
- optional ORT import with hidden sysfs when a built ORT package is
importable
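
The count fallback amounts to the following, shown as a simplified Python sketch (the real patch is C in cpuinfo's `src/linux/processors.c`, and the cpulist parsing here is reduced to the common cases):

```python
import os

def possible_cpu_count():
    """Prefer the sysfs cpulist; fall back to
    sysconf(_SC_NPROCESSORS_ONLN) when the file is unavailable
    (e.g. AWS Lambda ARM64 or a restricted sandbox)."""
    try:
        with open("/sys/devices/system/cpu/possible") as f:
            # cpulist like "0-7" or "0"; take the highest id + 1
            last = f.read().strip().split(",")[-1]
            return int(last.split("-")[-1]) + 1
    except OSError:
        return os.sysconf("SC_NPROCESSORS_ONLN")
```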

### Testing

Ran from a clean branch/worktree:

```bash
python onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py
```

Result:

- safe logging simulation: PASS
- sysconf count + flag fallback simulation: PASS
- LD_PRELOAD sysfs-hiding simulation: PASS
- ORT import integration: SKIP (`onnxruntime.capi` not built/importable
in this workspace)

Also validated the cpuinfo patch directly:

```bash
cd build/cu128/Release/_deps/pytorch_cpuinfo-src
patch --dry-run -p1 < /path/to/cmake/patches/cpuinfo/fix_missing_sysfs_fallback.patch
```

And syntax-checked patched `src/linux/processors.c` in a temporary tree
with cpuinfo headers.

### Related Issue

Fixes microsoft#10038.
…opy (microsoft#28256)

### Description

Adds an `OrtValue` overload to `update_inplace` so GPU-resident data can
be copied directly to another `OrtValue` without roundtripping through
CPU.

- **C++ pybind** (`onnxruntime_pybind_ortvalue.cc`): New
`update_inplace(const OrtValue*)` overload. Uses
`CreateDataTransferMemCpy` for plugin EPs, with fallback to built-in
copy functions for CUDA (including GPU↔GPU via `GetGPUDataTransfer()`),
MIGraphX, DML, and CANN.
- **Python wrapper** (`onnxruntime_inference_collection.py`):
`update_inplace` now accepts either a numpy array or an `OrtValue`,
dispatching to the appropriate C++ overload.
- **Tests** (`onnxruntime_test_python_cudagraph.py`): Covers CPU→CPU,
GPU→GPU, CPU→GPU, and GPU→CPU OrtValue copy paths.

```python
# Before: requires numpy (CPU) source, even when data is already on GPU
ortvalue_gpu.update_inplace(np_array)

# After: accepts OrtValue directly for device-to-device copy
ortvalue_gpu_src = onnxrt.OrtValue.ortvalue_from_numpy(data, "cuda", 0)
ortvalue_gpu_dst.update_inplace(ortvalue_gpu_src)  # GPU-to-GPU, no CPU roundtrip
```

### Motivation and Context

CUDA graph replay requires inputs at fixed memory addresses. When source
data (e.g., encoder output) is already on GPU, the only option was to
use external libraries like `cuda-python` for device-to-device memcpy.
This change makes that workflow native to ORT, per the approach
suggested in the issue discussion: accept an `OrtValue` in
`update_inplace` to leverage ORT's existing data transfer
infrastructure.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: GitHub Copilot <copilot@example.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Copilot <copilot@github.com>
### Description

Implements the GridSample operator (opset 16–19) for the WebGPU EP.

### Motivation and Context

GridSample was missing from the WebGPU EP and all other major execution
providers already support it. The GridSample tests were extended to
cover the WebGPU EP and seem to pass successfully.

I haven't tested that the `onnxruntime-web` build picks up this new
operator implementation because building it locally is difficult, but
it should just work.

Closes microsoft#27085
Prepare for cuda performance optimizations using new cutlass features.
…tion patch to fix autolinking (microsoft#28266)

## Summary

This PR fixes the long-standing issue where `onnxruntime-react-native`
requires
manual native setup after installation, causing the `TypeError: Cannot
read property
'install' of null` runtime crash.

**Root cause:** The package shipped without a `react-native.config.js`,
so the RN
community CLI autolinking did not register `OnnxruntimePackage` on
Android. For Expo
users on the New Architecture, the existing `app.plugin.js` patched
Gradle and the
Podfile but never registered the package class in `MainApplication`.

**Changes:**

- **`react-native.config.js`** (new) — enables RN community autolinking
for Android.
With RN 0.74+ this covers both old and new architecture via the
generated
`PackageList.java`. No manual `settings.gradle` or `MainApplication`
edits needed
  for bare RN.

- **`app.plugin.js`** — adds a `withMainApplication` mod that
idempotently inserts
the `OnnxruntimePackage` import and registration into
`MainApplication.kt`/`.java`
during `expo prebuild`. Uses the same `mergeContents` tag pattern as the
existing
  Gradle/Podfile mods, so it is safe to run multiple times.

- **`package.json`** — includes `react-native.config.js` in the npm
`files` array so
  it ships in the tarball.

- **`README.md`** — documents that autolinking handles registration
automatically and
  adds the Expo plugin usage snippet.

## Testing

- Bare RN: `npx react-native config | jq
'.dependencies["onnxruntime-react-native"]'`
should show `packageImportPath` populated; `PackageList.java` should
include
  `OnnxruntimePackage` after a Gradle sync.
- Expo: `npx expo prebuild --clean` should produce a
`MainApplication.kt` containing
`import ai.onnxruntime.reactnative.OnnxruntimePackage` and
`add(OnnxruntimePackage())`.

Fixes microsoft#19510
See also microsoft#17773
### Description

Fixes:

https://portal.microsofticm.com/imp/v5/incidents/details/31000000586963
https://portal.microsofticm.com/imp/v5/incidents/details/31000000586944

### Motivation and Context
Fix ICM issues

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#28261)

## Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Close microsoft#17466 and microsoft#24596 

MLAS already provides architecture-specific optimized kernels for
multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE
on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x.
However, riscv64 has not had comparable RVV-optimized coverage for the
operators in this PR and has mainly fallen back to scalar code.

This PR introduces **RISC-V Vector (RVV)** extension support to the ONNX
Runtime CPU Execution Provider.


This PR focuses on two operators: SGEMM and Softmax.
We have already completed optimizations for several other operators.
Following the acceptance of this PR, I will work with @qiurui144 to
upstream the remaining optimized kernels in a series of subsequent PRs.


## Benchmark Results

### SGEMM

| Case | pack_b | RVV pack ms | RVV compute ms | Scalar pack ms | Scalar compute ms | Compute speedup | End-to-end speedup |
|---|---:|---:|---:|---:|---:|---:|---:|
| 128x3072x768 | 1 | 63.21 | 114.52 | 66.71 | 414.44 | 3.62x | 2.71x |
| 64x1024x1024 | 1 | 22.07 | 27.66 | 23.14 | 96.64 | 3.49x | 2.41x |
| 32x4096x1024 | 1 | 119.04 | 56.82 | 118.86 | 188.34 | 3.31x | 1.75x |


### Softmax

| Case | Scalar ms | RVV ms | Speedup |
|---|---:|---:|---:|
| 4096x128 | 1955.25 | 611.65 | 3.20x |
| 1024x1024 | 717.26 | 236.73 | 3.03x |