Backmerging PR #1068

Closed
jatinwadhwa921 wants to merge 23 commits into ovep-develop from sync_30_4_2026

Conversation

@jatinwadhwa921

No description provided.

titaiwangms and others added 23 commits April 28, 2026 09:30
…nd (microsoft#28083)

## Summary

Fixes a critical security vulnerability in the ONNX Runtime Python
backend where user-controlled `kwargs` were applied to `SessionOptions`
and `RunOptions` via unrestricted `setattr()`, allowing arbitrary file
overwrites.

## Vulnerability

The `prepare()` method in `onnxruntime/python/backend/backend.py`
iterated over user-controlled `kwargs` and used `setattr()` to apply
them directly to a `SessionOptions` instance. The `hasattr()` check was
not a security guard — it returned `True` for all exposed properties
including dangerous ones like `optimized_model_filepath`.

**Attack vector:**
```python
onnxruntime.backend.prepare(
    model_path,
    optimized_model_filepath="/etc/passwd",  # overwrites any file with protobuf binary
    graph_optimization_level=onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
)
```

The same pattern existed in `backend_rep.py` for `RunOptions`.

## Fix

Replaced the unrestricted `hasattr/setattr` loop in both files with
strict allowlists:

- **`_ALLOWED_SESSION_OPTIONS`** (13 safe attrs) in `backend.py`
- **`_ALLOWED_RUN_OPTIONS`** (4 safe attrs) in `backend_rep.py`

**Both `SessionOptions` and `RunOptions` use identical validation
logic** with three outcomes for each kwarg key:

- **Allowlisted** — Applied via `setattr()` (e.g.
`graph_optimization_level`, `log_severity_level`)
- **Known-but-blocked** (real attribute on the object, but not on
allowlist) — Raises `RuntimeError` (e.g. `optimized_model_filepath`,
`terminate`)
- **Completely unknown** (not a property on the object at all) —
Silently ignored for forward compatibility (e.g.
`nonexistent_option_xyz`)

**Blocked dangerous attributes:**
- `optimized_model_filepath` — triggers `Model::Save()`, overwrites
arbitrary files with protobuf binary
- `profile_file_prefix` — writes profiling JSON to arbitrary path
- `enable_profiling` — causes uncontrolled file writes to cwd
- `terminate` (RunOptions) — denies the current inference call
- `training_mode` (RunOptions) — silently switches inference behavior in
training builds
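
The three-outcome validation can be sketched in Python. The allowlist contents below are illustrative (the PR's real lists hold 13 and 4 entries), and `apply_session_kwargs` is a hypothetical helper name, not the actual function in `backend.py`:

```python
# Illustrative subset of the allowlist; the real _ALLOWED_SESSION_OPTIONS
# in backend.py contains 13 safe attributes.
_ALLOWED_SESSION_OPTIONS = frozenset({
    "graph_optimization_level",
    "log_severity_level",
    "intra_op_num_threads",
})

def apply_session_kwargs(session_options, kwargs):
    """Apply user kwargs with three outcomes per key:
    allowlisted -> setattr, known-but-blocked -> RuntimeError,
    completely unknown -> silently ignored (forward compatibility)."""
    for key, value in kwargs.items():
        if key in _ALLOWED_SESSION_OPTIONS:
            setattr(session_options, key, value)
        elif hasattr(session_options, key):
            # Real attribute, but not allowlisted (e.g. optimized_model_filepath)
            raise RuntimeError(
                f"Session option '{key}' is not allowed for security reasons")
        # else: unknown option, ignored for forward compatibility
```

With this in place the attack vector above raises instead of writing to disk, while unknown keys like `nonexistent_option_xyz` are skipped without error.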

## Tests

Added `TestBackendKwargsAllowlist` with 13 new test methods covering all
exploit vectors (blocked attrs raise `RuntimeError`), safe allowlisted
attrs (accepted), unknown attrs (silently ignored), and end-to-end
`run_model()` paths for both session and run options. All 15 tests pass
(13 new + 2 pre-existing in `TestBackend`), no regressions.

## Files Changed

- `onnxruntime/python/backend/backend.py`
- `onnxruntime/python/backend/backend_rep.py`
- `onnxruntime/test/python/onnxruntime_test_python_backend.py`
- `.agents/skills/python-kwargs-setattr-security/SKILL.md`

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ttention (microsoft#28200)

### Description

Adds a CUTLASS memory-efficient attention (MEA) fallback to the CUDA
PagedAttention op, enabling the operator on **sm<80 (Turing / Volta /
Pascal) with fp16** for the first time. On sm>=80 the default
FlashAttention path is unchanged; MEA is reachable via
`ORT_DISABLE_FLASH_ATTENTION=1` or the `sdpa_kernel` CUDA provider
option for debugging and perf comparison.


| Environment | Before | After |
|---|:---:|:---:|
| sm<80 + fp16 | ❌ error | ✅ MEA |
| sm<80 + bf16 | ❌ error | ❌ error (MEA requires sm>=80 for bf16) |
| sm>=80 + fp16/bf16 (default) | ✅ FA | ✅ FA (unchanged) |
| sm>=80 + `ORT_DISABLE_FLASH_ATTENTION=1` / `sdpa_kernel=EFFICIENT_ATTENTION` | ❌ error | ✅ MEA |

### Motivation and Context

The original PagedAttention PR (microsoft#24595) landed with the title "CUDA SM80
support" — the op errors out immediately whenever FlashAttention isn't
available (sm<80 or `USE_FLASH_ATTENTION=0` builds). During that review,
@tianleiwu flagged that the interface was too FlashAttention-specific
(*"not good for other EP like WebGPU, CPU etc."*) and @aciddelgado
agreed the FA-specific dependencies could be lifted at the kernel level.

This PR closes that gap for sm<80 fp16 by mirroring the exact pattern
established in microsoft#20012 ("Packed QKV and Rotary Embedding Support for
sm<80 GQA"). The same CUTLASS memory-efficient attention backend that
covers GQA's sm<80 path now covers PagedAttention.

Related work:
- microsoft#20012 — direct pattern template (sm<80 GQA MEA fallback)
- microsoft#24595 — original PagedAttention PR
- microsoft#27516 — MS canonical FA → MEA → Unfused cascade ordering
- microsoft#27880 — ONNX Attention CUDA fallback coverage gaps
- microsoft#27992 — MEA decode + unfused softcap work (same flavor)

### Implementation

**Dispatch cascade** in `paged_attention.cc`: FlashAttention preferred;
fall back to MemoryEfficientAttention via
`has_memory_efficient_attention(sm, is_half, is_bf16, head_size,
head_size)`. No custom head-size or dtype bounds hardcoded — MEA's own
helper gates fp16 sm>=53 / bf16 sm>=80 / head_size <= 1024 and `% 8 ==
0`. This keeps us forward-compatible with any future expansion of MEA's
supported range.
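
The gating conditions quoted above can be sketched in Python. The helper itself is C++ (`has_memory_efficient_attention`); `mea_supported` below is a hypothetical name restating the documented bounds:

```python
def mea_supported(sm, is_fp16, is_bf16, head_size):
    """Sketch of MEA's own gating: fp16 needs sm >= 53, bf16 needs
    sm >= 80, and head_size must be <= 1024 and a multiple of 8."""
    if head_size > 1024 or head_size % 8 != 0:
        return False
    if is_bf16:
        return sm >= 80
    if is_fp16:
        return sm >= 53
    return False
```

This matches the before/after table: Turing fp16 (sm75) passes, Turing bf16 does not.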

**MEA path** (`UnfusedAttention<T>`):
1. Reuses existing preprocessing: `LaunchGetCumulativeSeqlensKV`
(hoisted to `paged_attention.cc` so both FA and MEA paths consume a
pre-populated buffer — single-producer refactor), rotary, packed-QKV
unpack, `ReshapeAndCache`.
2. New `GatherAndExpandPagedKVCache` CUDA kernel walks `block_table` to
gather paged K/V into a packed-varlen `[total_kv_tokens, num_heads,
head_size]` buffer, folding in GQA head expansion (so downstream MEA
sees `num_heads` uniformly).
3. Dispatches to `run_memory_efficient_attention` in **varlen mode** via
`seqstart_q_ptr = cumulative_seqlens_q` + `seqstart_k_ptr =
cumulative_seqlens_kv` (and `has_custom_right_padding = false`). No
padding required; layout matches the kernel's expected `[total_tokens,
num_heads, head_size]` with BSNH strides.
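
The gather-and-expand step (point 2 above) can be mimicked in NumPy. Shapes and the helper name are assumptions for illustration; the real kernel is CUDA:

```python
import numpy as np

def gather_and_expand_paged_kv(kv_cache, block_table, seq_lens, num_heads):
    """NumPy sketch of GatherAndExpandPagedKVCache (assumed layout):
    kv_cache:    [num_blocks, block_size, num_kv_heads, head_size]
    block_table: [batch, max_blocks_per_seq] of block ids
    seq_lens:    [batch] valid token counts per sequence
    Returns a packed varlen buffer [total_kv_tokens, num_heads, head_size]
    with GQA heads expanded so downstream MEA sees num_heads uniformly."""
    _, block_size, num_kv_heads, head_size = kv_cache.shape
    group = num_heads // num_kv_heads
    out = []
    for b, n_tok in enumerate(seq_lens):
        n_blocks = -(-n_tok // block_size)           # ceil division
        toks = kv_cache[block_table[b, :n_blocks]]   # gather via block_table
        toks = toks.reshape(-1, num_kv_heads, head_size)[:n_tok]
        out.append(np.repeat(toks, group, axis=1))   # GQA head expansion
    return np.concatenate(out, axis=0)
```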

**Scratch allocation**: the MEA path D->H syncs
`cumulative_seqlens_kv[batch_size]` via a pinned buffer to obtain
`total_kv_tokens` on the host for tight `gathered_key` /
`gathered_value` / `fmha_buffer` allocation. This adds a
forward-per-call `cudaStreamSynchronize` — acceptable for a
compatibility fallback (FA remains the hot path on supported hardware).
Over-allocation (the no-sync alternative) would consume `B ×
max_num_blocks_per_seq × block_size × num_heads × head_size × 2 ×
sizeof(T)`, which reaches GB-scale for realistic GQA models and was
rejected.

`fmha_buffer` is sized with `sizeof(float)` (matching the GQA
EfficientAttention pattern at `group_query_attention.cc:482`) because
MEA's output accumulator is fp32 regardless of input dtype.
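
To make the rejected over-allocation concrete, a back-of-envelope calculation with illustrative (not from-the-PR) model parameters:

```python
# Hypothetical numbers for a GQA model with fp16 K/V caches.
B = 32                        # batch size
max_num_blocks_per_seq = 512
block_size = 16               # tokens per block -> 8192-token max context
num_heads = 32                # heads after GQA expansion
head_size = 128
sizeof_T = 2                  # fp16

bytes_needed = (B * max_num_blocks_per_seq * block_size
                * num_heads * head_size * 2 * sizeof_T)  # x2 for K and V
print(f"{bytes_needed / 2**30:.1f} GiB")
```

Under these assumptions the no-sync alternative would reserve 4 GiB of scratch per call, which is why the D->H sync was preferred.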

### Testing

New `TestPagedAttentionMEA` class in `test_paged_attention_cuda.py` runs
the existing parity matrix (rotary on/off, rotary_interleaved on/off,
packed-QKV on/off, local window on/off, softcap 0/50, varied head
sizes/shapes) against the MEA path via the `sdpa_kernel` CUDA provider
option set to `EFFICIENT_ATTENTION` (=2, from `AttentionBackend` enum).
Using a per-session provider option instead of an env var means both FA
and MEA test classes coexist in the same pytest process — each
InferenceSession creates its own CUDA EP with its own
`attention_kernel_options_`.

The existing `TestPagedAttention` class is skipped wholesale on sm<80 by
its `has_flash_attention()` gate, so without the new MEA class the
fallback path would have no CI coverage.

**Local verification** (NVIDIA A100 80GB, CUDA 12.8, GCC 13.3):

```
TestPagedAttention:       24/24 passed (~60s)   # FA baseline — no regression
TestPagedAttentionMEA:    24/24 passed (~59s)   # new MEA path
```

Tolerance: `rtol = atol = 5e-3` against the same torch reference used by
the FA parity test. All combinations match.

**sm<80 hardware coverage**: I don't have local Turing / Volta / Pascal
hardware, so real-SM coverage relies on MS CI. The code path exercised
on A100 via `sdpa_kernel=EFFICIENT_ATTENTION` is the same one taken on
sm<80; only the underlying CUTLASS kernel
(`run_memory_efficient_attention_sm50/70/75/80`) differs per SM, and
those are upstream and unmodified by this change.

**Build note**: built with `--cmake_extra_defines
CMAKE_CUDA_ARCHITECTURES=80 CMAKE_CXX_STANDARD=20`. The explicit C++20
define was needed because the initial configure resolved
`CMAKE_CXX_STANDARD=17`, under which `ort_version_check.h`'s `consteval`
usage fails to compile. Unrelated to this change.
…pkg set (microsoft#28254)

### Description

Remove `react-native` package from set of packages required for
RC/release publishing.
We will need to revisit this and decide whether to remove it entirely or
properly fix it.

### Motivation and Context

The React Native package is having build issues and we don't need it for
the next few immediate releases.
### Description

Adds support for `com.microsoft:QuickGelu` (`x * Sigmoid(alpha * x)`) to
the CoreML Execution Provider's MLProgram path. The builder decomposes
QuickGelu into three MIL ops (`mul` / `sigmoid` / `mul`), matching the
op's own schema function-body in `contrib_defs.cc:605-631` and the
approach the QNN EP already uses in
`qnn/builder/opbuilder/quick_gelu_op_builder.cc`. Only the MLProgram
path is implemented; NeuralNetwork is deprecated on Apple Silicon.
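
The decomposition can be checked numerically against the closed form, here as a NumPy sketch of the three MIL ops:

```python
import numpy as np

def quick_gelu_decomposed(x, alpha=1.0):
    """mul / sigmoid / mul decomposition used by the MLProgram builder:
    QuickGelu(x) = x * Sigmoid(alpha * x)."""
    t = alpha * x                      # mul
    s = 1.0 / (1.0 + np.exp(-t))       # sigmoid
    return x * s                       # mul

x = np.linspace(-4, 4, 9)
ref = x * (1.0 / (1.0 + np.exp(-1.5 * x)))
assert np.allclose(quick_gelu_decomposed(x, alpha=1.5), ref)
```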

Adds `CoreMLExecutionProviderTest.QuickGeluTest` which builds a single
`com.microsoft:QuickGelu` node with non-default `alpha=1.5` and verifies
the entire graph is claimed by the CoreML EP via
`ExpectedEPNodeAssignment::All`. Verified with a negative test:
temporarily removing the `CreateQuickGeluOpBuilder` registration causes
the new test to fail with a `VerifyEPNodeAssignment` fatal failure,
proving it genuinely exercises the CoreML path.

Also updates `coreml_supported_mlprogram_ops.md`.

### Motivation and Context

Fixes microsoft#28183.

QuickGelu is produced by ORT's own `QuickGeluFusion` optimizer pass
(`onnxruntime/core/optimizer/quick_gelu_fusion.cc`), which runs at
`ORT_ENABLE_EXTENDED` — and therefore also at `ORT_ENABLE_ALL`, the
default session optimization level. So any model that contains the `x *
sigmoid(alpha * x)` pattern (CLIP, several mobile transformers, the
DWPose pose estimator) gets silently mutated by ORT into a graph with
`QuickGelu` nodes that the CoreML EP then rejects — turning 3 supported
primitives into 1 unsupported op, making the fusion strictly harmful for
CoreML.

On the DWPose `dw-ll_ucoco_384.onnx` model with batch=1 and
`ORT_ENABLE_EXTENDED`, 76 `QuickGelu` nodes get produced. Running the
result on the CoreML EP:

| ORT build | CoreML subgraphs | Inference (ms) |
| --- | --- | --- |
| main (QuickGelu rejected) | ~80 (each QuickGelu is a graph break) | 54.77 |
| this PR (QuickGelu supported) | 10 | 13.91 |

The remaining breaks are other ops — see "Related gaps" below. A ~4×
speedup at EXTENDED level from this patch alone.

Even at the default `ORT_ENABLE_ALL` with a symbolic batch dim (where
partial shape inference inhibits most fusions), 3 `QuickGelu` nodes
still get produced — so this patch helps any CoreML user who hasn't
explicitly downgraded to `ORT_ENABLE_BASIC`.

### Related CoreML EP gaps observed (out of scope for this PR)

With QuickGelu fixed, the remaining 9 CPU-fallback nodes on the
EXTENDED-optimized DWPose pose model are:

- **`com.microsoft:FusedConv`** (×4) — produced by
`ConvActivationFusion`. Fuses `Conv + activation` into one node. Same
failure mode as QuickGelu: `Conv` and the activations (`Relu`,
`Sigmoid`, `HardSigmoid`, etc.) are individually CoreML-supported, but
the fused form isn't. Decomposition is straightforward — emit the
underlying `conv` MIL op, then the corresponding activation.
- **`com.microsoft:FusedMatMul`** (×2, from `MatMulScaleFusion`) —
`MatMul * alpha` with an optional transpose. Decomposition: `matmul` +
scalar `mul`.
- **`ai.onnx:Split`** (×2) — pre-existing CoreML EP gap unrelated to
fusion. CoreML MIL has a native `split` op; this one is a straight
op-builder omission.

Happy to send follow-up PRs for any of these after this one lands,
following the same pattern. Flagging here so they're on the EP coverage
roadmap.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tilities (microsoft#28227)

This pull request significantly improves the safety and robustness of
sparse tensor handling in ONNX Runtime. The main focus is on adding
thorough bounds checking and using safe integer arithmetic to prevent
overflows and invalid memory accesses when working with sparse tensor
indices. Additionally, the Python bindings for sparse tensors are
refactored to ensure correct object lifetimes and memory management when
exposing data to NumPy.

**Sparse Tensor Index Validation and Safety**

* Added comprehensive bounds checks for COO and CSR sparse tensor
indices in both the C API (`onnxruntime_c_api.cc`) and core conversion
utilities, ensuring indices are within valid ranges and, for CSR, that
outer indices are non-decreasing and within bounds.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)
* Replaced direct arithmetic with `SafeInt` for all index and size
calculations to prevent integer overflows, especially when converting
between types or computing dense tensor offsets.
[[1]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[2]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2077-R2077)
[[3]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2091-R2091)
[[4]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2110-R2110)
[[5]](diffhunk://#diff-d31e9fbe0f5334fcd949833e035f2b25d5ae810dcd505c545f6b372b546b1406L2291-R2298)
* Improved error messages for invalid indices, making debugging easier
by providing more context about the specific error.
[[1]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R449-R485)
[[2]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R521-R547)
[[3]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R659-R696)
[[4]](diffhunk://#diff-cff364b6b1ab4ef507d87a661a97b873405f569797fcaf91af29491f223555a8R721-R747)
[[5]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L267-R273)
[[6]](diffhunk://#diff-620fd022510c5134fc9bd3c8d01bc5772cc78a82043b0da5e44cf2482038dc37L359-R376)
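
The invariants being enforced can be sketched in Python (the C++ uses `SafeInt` for overflow-safe arithmetic; plain comparisons illustrate the checks, and the helper names are hypothetical):

```python
def validate_coo_indices(indices, dense_size):
    """Each linear COO index must lie in [0, dense_size)."""
    for i, idx in enumerate(indices):
        if not 0 <= idx < dense_size:
            raise ValueError(f"COO index {idx} at position {i} "
                             f"out of range [0, {dense_size})")

def validate_csr_indices(inner, outer, num_rows, num_cols, nnz):
    """Inner indices in [0, num_cols); outer indices non-decreasing
    and in [0, nnz], with len(outer) == num_rows + 1."""
    if len(outer) != num_rows + 1:
        raise ValueError("outer index length must be num_rows + 1")
    prev = 0
    for o in outer:
        if not prev <= o <= nnz:
            raise ValueError(f"outer index {o} decreasing or > nnz={nnz}")
        prev = o
    for j in inner:
        if not 0 <= j < num_cols:
            raise ValueError(f"inner index {j} out of range [0, {num_cols})")
```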

**Python Bindings Improvements**

* Refactored the pybind11 bindings for sparse tensor views so that NumPy
arrays referencing sparse tensor memory correctly keep the parent Python
object alive, preventing potential memory issues when the sparse tensor
is on the GPU or managed by Python.
[[1]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL98-R120)
[[2]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL299-R304)
[[3]](diffhunk://#diff-3c1b21fe3d5903c277b4d3888f5a4c57ff8f8f6f593183a3f4865825c5ab8e0cL314-R319)

**General Code Quality**

* Added missing header include for `safeint.h` to ensure `SafeInt` is
available where needed.
* Minor cleanups and improved assertions to clarify intent and ensure
correctness.

These changes collectively make sparse tensor support in ONNX Runtime
safer, more reliable, and easier to use from both C++ and Python.
### Description
vector_per_class dimension was not verified, it could lead to illegal
memory access



### Motivation and Context
security issue

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: xadupre <22452781+xadupre@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…icrosoft#28248)

### Description
<!-- Describe your changes. -->

- Correct misleading 'SemVer 1.0.0' label; the universal version regex
actually validates SemVer 2.0.0 syntax without build metadata, which is
what Azure Universal Packages requires.

- Prefix the dev short SHA with 'commit-' in universal_version so the
pre-release identifier always contains a non-digit, avoiding spurious
validation failures for all-numeric SHAs with leading zeros.
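
Why the prefix helps: SemVer 2.0.0 forbids leading zeros in *numeric* pre-release identifiers, but an identifier containing any non-digit is *alphanumeric* and allowed. A small check using the single-identifier rule from the spec:

```python
import re

# SemVer 2.0.0 rule for one pre-release identifier: a numeric identifier
# with no leading zeros, or an alphanumeric identifier (has a non-digit).
PRERELEASE_ID = re.compile(r"^(0|[1-9]\d*|[0-9A-Za-z-]*[A-Za-z-][0-9A-Za-z-]*)$")

def valid_prerelease_id(ident):
    return PRERELEASE_ID.match(ident) is not None

sha = "0123456"                               # all-numeric SHA, leading zero
assert not valid_prerelease_id(sha)           # rejected as numeric with leading 0
assert valid_prerelease_id("commit-" + sha)   # alphanumeric: always accepted
```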

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Fix invalid version when we have an all-numeric commit SHA starting with
0.
…ls (microsoft#28214)

This PR adds position_ids bounds checking to WebGPU and JS
RotaryEmbedding implementations, completing the security fix started in
PR microsoft#27597 (commit 056bab3) which covered CPU and CUDA.

## Problem
The `com.microsoft::RotaryEmbedding` kernel uses position_ids as row
indices into cos_cache/sin_cache without bounds validation. While PR
microsoft#27597 fixed CPU and CUDA paths, WebGPU and JS implementations were
still missing bounds checks, which could produce silently wrong results
(WebGPU hardware clamps OOB reads).

## Changes
- **contrib_ops/webgpu/bert/rotary_embedding.cc**: Host-side validation
(ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
- **core/providers/webgpu/llm/rotary_embedding.cc**: Host-side
validation with format-0 awareness
- **js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts**: TypeScript
validation using getBigInt64Array
- **7 new C++ OOB test cases** across contrib and ONNX domains targeting
WebGPU EP
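
The host-side check amounts to the following (NumPy sketch with a hypothetical helper name; the real validation lives in the C++/TypeScript files listed above):

```python
import numpy as np

def check_position_ids(position_ids, cos_cache_rows):
    """Every position id must be a valid row index into
    cos_cache/sin_cache; otherwise an OOB read (or a silently clamped
    wrong result on WebGPU hardware) would follow."""
    pos = np.asarray(position_ids)
    if pos.size and (pos.min() < 0 or pos.max() >= cos_cache_rows):
        raise ValueError(
            f"position_ids must be in [0, {cos_cache_rows}), "
            f"got range [{pos.min()}, {pos.max()}]")
```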

## Security
Addresses the same vulnerability as microsoft#27597 (OOB read via position_ids,
CVSS 7.5-9.1) for WebGPU/JS execution providers.

## Testing
- 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when
WebGPU EP unavailable
- JS/TS error tests not feasible with current JSONC test format
(documented)
- Build environment lacks C++20/emsdk for full compilation verification;
validated structurally

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntime/ep/adapter/op_kernel_info.h` (microsoft#28081)

### Description
<!-- Describe your changes. -->

Remove reinterpret_cast of OrtKernelInfo* to internal OpKernelInfo* that
breaks ABI across DLL boundaries (vtable mismatch between plugin EP and
ORT core).

- KernelInfoCache: use Ort::ConstKernelInfo::GetEp() instead of casting
to OpKernelInfo* and calling GetExecutionProvider()->GetOrtEp()

- GetAllocator: use C API KernelInfoGetAllocator +
IAllocatorWrappingOrtAllocator instead of casting to OpKernelInfo*

- Remove #include core/framework/op_kernel_info.h (no longer needed)

- Add IAllocatorWrappingOrtAllocator adapter

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

Address crash observed when testing WebGPU plugin EP with older ORT
1.24.4 binary where the number of `onnxruntime::IExecutionProvider`
virtual functions had changed between the two builds.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This patch adds the support of Split-K with batch size > 1 by
encoding both batch index and Split-K index in dispatch_z and
decompose them in the shader via:
  batch = logical_global_id.z / num_k_splits
  split_index = logical_global_id.z % num_k_splits

This patch also adds batch size to the criteria for using Split-K,
since increasing the batch size also increases the available
parallelism, reducing the effectiveness of Split-K.
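
The encode/decode scheme above can be verified with a quick roundtrip (hypothetical helper names mirroring the shader arithmetic):

```python
def encode_dispatch_z(batch, split_index, num_k_splits):
    """Pack batch index and Split-K index into a single dispatch_z."""
    return batch * num_k_splits + split_index

def decode_dispatch_z(z, num_k_splits):
    """Inverse, as in the shader: batch = z // num_k_splits,
    split_index = z % num_k_splits."""
    return z // num_k_splits, z % num_k_splits

num_k_splits = 4
for batch in range(3):
    for split in range(num_k_splits):
        z = encode_dispatch_z(batch, split, num_k_splits)
        assert decode_dispatch_z(z, num_k_splits) == (batch, split)
```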

This patch also replaces `consteval` with `constexpr` in
`ort_version_check.h` to work around a compilation error
with VS2022.

### Motivation and Context
With this patch we can improve the performance of
`sam-vit-b-decoder-static-fp16-demo` (7.5%) on Intel PTL.
…oft#27760)

### Description

Adds aarch64 Linux wheel builds to the CUDA GPU packaging pipeline,
mirroring the existing x86_64 configuration.

- **`stages/py-linux-gpu-stage.yml`**: Add `hostArchitecture: Arm64` to
pool config when `arch == 'aarch64'` (matches pattern in `py-linux.yml`)
- **`stages/py-gpu-packaging-stage.yml`**: Add
`docker_base_image_aarch64` and `AArch64LinuxPythonConfigurations`
parameters (defaults to `[]` so CUDA 12 pipeline is unaffected), aarch64
build stages, and merge artifact dependencies/downloads
- **`py-cuda13-packaging-pipeline.yml`**: Pass aarch64 base image and
Python configs for all supported versions (3.11–3.14, including
free-threaded)
- **`aarch64/python/cuda/Dockerfile`** +
**`scripts/install_centos.sh`**: New Docker build context for aarch64
CUDA builds. It differs from the x86_64 variant in that aarch64
installs TensorRT from a tarball.

### Motivation and Context

`onnxruntime-gpu` only ships x86_64 and Windows wheels. Installing on
`manylinux_2_39_aarch64` (e.g. `ubuntu-24.04-arm` runners) fails with no
compatible wheel available.

- Fixes microsoft#27005

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
When M is small and batchA is large, each tile contains some invalid
elements; merging batchA into the M dimension reduces the workgroup
count.
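
The idea can be checked with NumPy: when every batch of A multiplies the same right-hand matrix, A can be viewed as one tall (batchA*M, K) operand and dispatched as a single larger matmul (shapes below are hypothetical):

```python
import numpy as np

# Hypothetical shapes: M is small, batchA is large, B is shared.
batch_a, M, K, N = 64, 2, 8, 16
A = np.random.rand(batch_a, M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)

batched = A @ B                              # [batch_a, M, N], many tiny tiles
merged = A.reshape(batch_a * M, K) @ B       # one [batch_a*M, K] matmul
assert np.allclose(batched.reshape(batch_a * M, N), merged, atol=1e-5)
```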

---------

Co-authored-by: wp <webgraphics@intel.com>
…crosoft#28172)

Keep original roundingType name for a period of time to ensure backward
compatibility.

Spec change: webmachinelearning/webnn#770

---------

Co-authored-by: Dwayne Robinson <fdwr@hotmail.com>
### Description
Update OpenVINO version for OVEP.

---------

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
### Description

Fixes ONNX Runtime startup on Linux ARM64 environments where
`/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/present`
are unavailable, such as AWS Lambda ARM64/Graviton and restricted build
sandboxes.

There are two related failure modes:

1. `PosixEnv` may be constructed before ORT's default logger is
registered. If `cpuinfo_initialize()` fails during that early
construction path, the existing `LOGS_DEFAULT(INFO)` call can terminate
with `Attempt to use DefaultLogger but none has been registered`.
2. The bundled `pytorch/cpuinfo` code treats missing Linux CPU
`possible`/`present` sysfs cpulists as fatal on ARM Linux. The max-count
helpers return `UINT32_MAX`, which wraps to `0` after `1 + UINT32_MAX`
in ARM Linux initialization and prevents cpuinfo from reaching the later
`/proc/cpuinfo` and `getauxval()` based detection paths.

### Root Cause

The immediate import crash is caused by unsafe early logging in
`onnxruntime/core/platform/posix/env.cc`. Python bindings can reference
`Env::Default()` during module load before logging is initialized, so a
cpuinfo initialization failure must not use `LOGS_DEFAULT()` unless a
default logger exists.

The cpuinfo initialization failure is more subtle. A count-only fallback
is not enough: after cpuinfo computes max possible/present CPU counts,
it calls `cpuinfo_linux_detect_possible_processors()` and
`cpuinfo_linux_detect_present_processors()` to set
`CPUINFO_LINUX_FLAG_POSSIBLE` and `CPUINFO_LINUX_FLAG_PRESENT` on each
processor. ARM Linux initialization later marks processors valid only if
those flags are set. If only the count fallback is provided,
`valid_processors` can remain zero and cpuinfo can proceed into an
invalid partial initialization state.

### Fix

- Make `PosixEnv` logging safe when cpuinfo initialization fails before
a default logger exists:
- use `logging::LoggingManager::HasDefaultLogger()` before
`LOGS_DEFAULT()`
  - fall back to `std::cerr` when no logger is registered
- Add a cpuinfo patch for Linux missing sysfs CPU cpulists:
- fallback max possible/present processor detection to
`sysconf(_SC_NPROCESSORS_ONLN) - 1`
- fallback present/possible processor flag detection by marking CPUs
`0..nproc-1`
- preserve existing sysfs parsing behavior when the cpulist files are
available
- Wire the cpuinfo patch into the existing cpuinfo FetchContent flow for
Linux and existing ARM64/ARM64EC patch path.
- Add a simulation test that validates:
  - safe early logging without a registered default logger
- `sysconf(_SC_NPROCESSORS_ONLN)` count and present/possible flag
fallback behavior
  - hiding `/sys/devices/system/cpu/{possible,present}` via `LD_PRELOAD`
- optional ORT import with hidden sysfs when a built ORT package is
importable
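
The count fallback amounts to the following, shown as a simplified Python sketch (the real patch is C in cpuinfo's `src/linux/processors.c`, and the cpulist parsing here is reduced to the common cases):

```python
import os

def possible_cpu_count():
    """Prefer the sysfs cpulist; fall back to
    sysconf(_SC_NPROCESSORS_ONLN) when the file is unavailable
    (e.g. AWS Lambda ARM64 or a restricted sandbox)."""
    try:
        with open("/sys/devices/system/cpu/possible") as f:
            # cpulist like "0-7" or "0"; take the highest id + 1
            last = f.read().strip().split(",")[-1]
            return int(last.split("-")[-1]) + 1
    except OSError:
        return os.sysconf("SC_NPROCESSORS_ONLN")
```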

### Testing

Ran from a clean branch/worktree:

```bash
python onnxruntime/test/common/test_cpuinfo_sysfs_fallback.py
```

Result:

- safe logging simulation: PASS
- sysconf count + flag fallback simulation: PASS
- LD_PRELOAD sysfs-hiding simulation: PASS
- ORT import integration: SKIP (`onnxruntime.capi` not built/importable
in this workspace)

Also validated the cpuinfo patch directly:

```bash
cd build/cu128/Release/_deps/pytorch_cpuinfo-src
patch --dry-run -p1 < /path/to/cmake/patches/cpuinfo/fix_missing_sysfs_fallback.patch
```

And syntax-checked patched `src/linux/processors.c` in a temporary tree
with cpuinfo headers.

### Related Issue

Fixes microsoft#10038.
…opy (microsoft#28256)

### Description

Adds an `OrtValue` overload to `update_inplace` so GPU-resident data can
be copied directly to another `OrtValue` without roundtripping through
CPU.

- **C++ pybind** (`onnxruntime_pybind_ortvalue.cc`): New
`update_inplace(const OrtValue*)` overload. Uses
`CreateDataTransferMemCpy` for plugin EPs, with fallback to built-in
copy functions for CUDA (including GPU↔GPU via `GetGPUDataTransfer()`),
MIGraphX, DML, and CANN.
- **Python wrapper** (`onnxruntime_inference_collection.py`):
`update_inplace` now accepts either a numpy array or an `OrtValue`,
dispatching to the appropriate C++ overload.
- **Tests** (`onnxruntime_test_python_cudagraph.py`): Covers CPU→CPU,
GPU→GPU, CPU→GPU, and GPU→CPU OrtValue copy paths.

```python
# Before: requires numpy (CPU) source, even when data is already on GPU
ortvalue_gpu.update_inplace(np_array)

# After: accepts OrtValue directly for device-to-device copy
ortvalue_gpu_src = onnxrt.OrtValue.ortvalue_from_numpy(data, "cuda", 0)
ortvalue_gpu_dst.update_inplace(ortvalue_gpu_src)  # GPU-to-GPU, no CPU roundtrip
```

### Motivation and Context

CUDA graph replay requires inputs at fixed memory addresses. When source
data (e.g., encoder output) is already on GPU, the only option was to
use external libraries like `cuda-python` for device-to-device memcpy.
This change makes that workflow native to ORT, per the approach
suggested in the issue discussion: accept an `OrtValue` in
`update_inplace` to leverage ORT's existing data transfer
infrastructure.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: GitHub Copilot <copilot@example.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Copilot <copilot@github.com>
### Description

Implements the GridSample operator (opset 16–19) for the WebGPU EP.

### Motivation and Context

GridSample was missing from the WebGPU EP and all other major execution
providers already support it. The GridSample tests were extended to
cover the WebGPU EP and seem to pass successfully.

I haven't tested that the `onnxruntime-web` build picks up this new
operator implementation because building it locally is difficult, but
it should just work.

Closes microsoft#27085
Prepare for cuda performance optimizations using new cutlass features.
…tion patch to fix autolinking (microsoft#28266)

## Summary

This PR fixes the long-standing issue where `onnxruntime-react-native`
requires
manual native setup after installation, causing the `TypeError: Cannot
read property
'install' of null` runtime crash.

**Root cause:** The package shipped without a `react-native.config.js`,
so the RN
community CLI autolinking did not register `OnnxruntimePackage` on
Android. For Expo
users on the New Architecture, the existing `app.plugin.js` patched
Gradle and the
Podfile but never registered the package class in `MainApplication`.

**Changes:**

- **`react-native.config.js`** (new) — enables RN community autolinking
for Android.
With RN 0.74+ this covers both old and new architecture via the
generated
`PackageList.java`. No manual `settings.gradle` or `MainApplication`
edits needed
  for bare RN.

- **`app.plugin.js`** — adds a `withMainApplication` mod that
idempotently inserts
the `OnnxruntimePackage` import and registration into
`MainApplication.kt`/`.java`
during `expo prebuild`. Uses the same `mergeContents` tag pattern as the
existing
  Gradle/Podfile mods, so it is safe to run multiple times.

- **`package.json`** — includes `react-native.config.js` in the npm
`files` array so
  it ships in the tarball.

- **`README.md`** — documents that autolinking handles registration
automatically and
  adds the Expo plugin usage snippet.

## Testing

- Bare RN: `npx react-native config | jq
'.dependencies["onnxruntime-react-native"]'`
should show `packageImportPath` populated; `PackageList.java` should
include
  `OnnxruntimePackage` after a Gradle sync.
- Expo: `npx expo prebuild --clean` should produce a
`MainApplication.kt` containing
`import ai.onnxruntime.reactnative.OnnxruntimePackage` and
`add(OnnxruntimePackage())`.

Fixes microsoft#19510
See also microsoft#17773
### Description

Fixes:

https://portal.microsofticm.com/imp/v5/incidents/details/31000000586963
https://portal.microsofticm.com/imp/v5/incidents/details/31000000586944

### Motivation and Context
Fix ICM issues

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…#28261)

## Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Close microsoft#17466 and microsoft#24596 

MLAS already provides architecture-specific optimized kernels for
multiple vector ISAs, such as SSE/AVX/AVX2/AVX512 on x86/x64, NEON/SVE
on Arm, VSX on POWER, LSX/LASX on LoongArch, and zvector on s390x.
However, riscv64 has not had comparable RVV-optimized coverage for the
operators in this PR and has mainly fallen back to scalar code.

This PR introduces **RISC-V Vector (RVV)** extension support to the ONNX
Runtime CPU Execution Provider.


This PR focuses on two operators: SGEMM and Softmax.
We have already completed optimizations for several other operators.
Following the acceptance of this PR, I will work with @qiurui144 to
upstream the remaining optimized kernels in a series of subsequent PRs.


## Benchmark Results

### SGEMM

| Case | pack_b | RVV pack ms | RVV compute ms | Scalar pack ms | Scalar compute ms | Compute speedup | End-to-end speedup |
|---|---:|---:|---:|---:|---:|---:|---:|
| 128x3072x768 | 1 | 63.21 | 114.52 | 66.71 | 414.44 | 3.62x | 2.71x |
| 64x1024x1024 | 1 | 22.07 | 27.66 | 23.14 | 96.64 | 3.49x | 2.41x |
| 32x4096x1024 | 1 | 119.04 | 56.82 | 118.86 | 188.34 | 3.31x | 1.75x |


### Softmax

| Case | Scalar ms | RVV ms | Speedup |
|---|---:|---:|---:|
| 4096x128 | 1955.25 | 611.65 | 3.20x |
| 1024x1024 | 717.26 | 236.73 | 3.03x |