Skip to content

Fix: a2a3 finalize calls rtDeviceReset on the rt-only path (cascade fix)#882

Open
hw-native-sys-bot wants to merge 3 commits into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/a2a3-finalize-device-reset
Open

Fix: a2a3 finalize calls rtDeviceReset on the rt-only path (cascade fix)#882
hw-native-sys-bot wants to merge 3 commits into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/a2a3-finalize-device-reset

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

Fixes the chronic `st-onboard-a2a3` cascade we've been seeing across many recent PRs.

Symptom

CI shows the same shape repeatedly: one test errors out (varies — `test_dedup_shared_so_independent_unregister`, `spmd_paged_attention`, etc.), then every subsequent test on the same device fails its `simpler_init` with `rtStreamCreate failed: 507899`. Reproduced locally with the full ST suite (`pytest examples tests/st --platform a2a3 --device 0-7 -n 2 --pto-isa-commit $PTO_ISA_COMMIT`).

Root cause

a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice` on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only the HCCL/comm test path. Non-HCCL tests (most of the suite) skipped any form of device reset on teardown. When a non-HCCL test left the device in error state (AICPU exception, stuck kernel, anything), the next `ChipWorker` on the same device tripped `rtStreamCreate`.

a5 had been doing the right thing all along: its `finalize` calls `rtDeviceReset(device_id_)` unconditionally. This change mirrors that on a2a3 for the non-ACL path. The asymmetry is exactly why a5 onboard has been ~100% reliable in CI while a2a3 onboard has chronically cascaded.

The fix

`finalize()` now always resets the device:

  • `acl_ready_=true` (HCCL/comm path) — unchanged: `aclrtResetDevice` + `aclFinalize` (tears down both device + the ACL bring-up).
  • `acl_ready_=false` (rt-only, the common case) — NEW: `rtDeviceReset(device_id_)`. Previously skipped entirely.

Diff: 1 file, +34/-14.

Local validation (a2a3 onboard, device 1)

  • `vector_example` with pinned PTO-ISA: PASS.

(A full ST-suite repro of the cascade requires xdist parallelism + device pool exhaustion timing that CI provides naturally. CI is the real validation gate for this fix.)

Expected effect in CI

  • The chronic cascade on `st-onboard-a2a3` should stop. A test that genuinely fails (e.g. PTO-ISA HEAD AICPU exception on spmd_paged_attention) still fails that specific test, but the next test recovers cleanly — no 507899 chain.
  • If the cascade does stop, this is also the answer to the long-standing "what's the state-teardown bug?" investigation that's been open all session.

Test plan

  • CI green on st-onboard-a2a3 (the real signal).
  • No regression on st-onboard-a5 (this PR doesn't touch a5).

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 28, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 720d1132-015f-4cc4-a418-3bdbca2a333a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

DeviceRunner::finalize() is modified to reset devices via two conditional paths after freeing device memory: ACL-ready paths call aclrtResetDevice() and aclFinalize() with error logging, while non-HCCL rt-only paths call rtDeviceReset() instead, ensuring both runtime types have equivalent cleanup.

Changes

Device Finalization Reset

Layer / File(s) Summary
Dual-path device reset in finalization
src/a2a3/platform/onboard/host/device_runner.cpp
DeviceRunner::finalize() conditionally resets devices based on acl_ready_ state: ACL devices use aclrtResetDevice/aclFinalize with error logging, rt-only devices use rtDeviceReset. Both paths now log failures and propagate errors via rc.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Devices cleaned up in two fine ways,
ACL and rt-only both get their days,
Two reset paths, one finalization call,
Runtime cleanup handles it all! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding rtDeviceReset call on the rt-only path in a2a3's finalize, which directly addresses the chronic cascade issue detailed in the PR.
Description check ✅ Passed The description is comprehensively related to the changeset, explaining the symptom, root cause, the fix, and validation—all directly relevant to the code modification.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the DeviceRunner::finalize() method in device_runner.cpp to ensure the device is reset even when acl_ready_ is false (the pure runtime path). When acl_ready_ is true, it continues to perform aclrtResetDevice and aclFinalize. When acl_ready_ is false, it now calls rtDeviceReset to clear any remaining per-device runtime state, preventing subsequent device initialization failures. There are no review comments to address.

hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 28, 2026
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic
"first test fails → subsequent tests cascade 507899" pattern. The
trigger varies (test_dedup_shared_so_independent_unregister, spmd
paged_attention, etc. — whichever happens to hit an AICPU exception
or stuck kernel first), but the cascade shape is identical:
rtStreamCreate failed: 507899 on every subsequent test's simpler_init.

Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice`
on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only
HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any
form of device reset on teardown. When a non-HCCL test left the device
in error state, the next ChipWorker on the same device tripped
rtStreamCreate.

a5 has been doing the right thing all along: its finalize calls
`rtDeviceReset(device_id_)` unconditionally. This change mirrors that
on a2a3 for the non-ACL path.

After the fix, a2a3 `DeviceRunner::finalize` always resets the device:

- `acl_ready_=true` (HCCL/comm path):  aclrtResetDevice + aclFinalize
  (unchanged — tears down both device + the ACL bring-up).
- `acl_ready_=false` (rt-only, the common case):  rtDeviceReset
  (NEW — was previously skipped).

Local validation (a2a3 onboard, device 1):
- vector_example with pinned PTO-ISA: PASS.

CI validation: st-onboard-a2a3 went green first-attempt for the first
time this session (PR hw-native-sys#882 CI run, before runner-side libgtest drift
broke ut-a5 separately — see below).

---

Also includes: force FetchContent for GoogleTest when
SIMPLER_ENABLE_HARDWARE_TESTS=ON. A drifting a5 self-hosted runner
image now has libgtest installed in /usr/local/lib64 built with the
default (CXX11=1) ABI; our `hierarchical_objs` test objects use
`_GLIBCXX_USE_CXX11_ABI=0` to match the CANN-linking ABI. Mismatch
produces `undefined reference to testing::internal::EqFailure`
(mangled with std::string) at link time on ut-a5. Forcing FetchContent
when hardware UT is enabled ensures the fetched gtest is built with
the matching ABI, regardless of runner image state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the fix/a2a3-finalize-device-reset branch from 9275882 to f780e7f Compare May 28, 2026 08:26
hw-native-sys-bot pushed a commit to hw-native-sys-bot/simpler that referenced this pull request May 28, 2026
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic
"first test fails → subsequent tests cascade 507899" pattern. The
trigger varies (test_dedup_shared_so_independent_unregister, spmd
paged_attention, etc. — whichever happens to hit an AICPU exception
or stuck kernel first), but the cascade shape is identical:
rtStreamCreate failed: 507899 on every subsequent test's simpler_init.

Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice`
on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only
HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any
form of device reset on teardown. When a non-HCCL test left the device
in error state, the next ChipWorker on the same device tripped
rtStreamCreate.

a5 has been doing the right thing all along: its finalize calls
`rtDeviceReset(device_id_)` unconditionally. This change mirrors that
on a2a3 for the non-ACL path.

After the fix, a2a3 `DeviceRunner::finalize` always resets the device:

- `acl_ready_=true` (HCCL/comm path):  aclrtResetDevice + aclFinalize
  (unchanged — tears down both device + the ACL bring-up).
- `acl_ready_=false` (rt-only, the common case):  rtDeviceReset
  (NEW — was previously skipped).

Local validation (a2a3 onboard, device 1):
- vector_example with pinned PTO-ISA: PASS.

CI validation: st-onboard-a2a3 went green first-attempt for the first
time this session (PR hw-native-sys#882 CI run, before runner-side libgtest drift
broke ut-a5 separately — see below).

---

Also includes: force FetchContent for GoogleTest when
SIMPLER_ENABLE_HARDWARE_TESTS=ON. A drifting a5 self-hosted runner
image now has libgtest installed in /usr/local/lib64 built with the
default (CXX11=1) ABI; our `hierarchical_objs` test objects use
`_GLIBCXX_USE_CXX11_ABI=0` to match the CANN-linking ABI. Mismatch
produces `undefined reference to testing::internal::EqFailure`
(mangled with std::string) at link time on ut-a5. Forcing FetchContent
when hardware UT is enabled ensures the fetched gtest is built with
the matching ABI, regardless of runner image state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hw-native-sys-bot hw-native-sys-bot force-pushed the fix/a2a3-finalize-device-reset branch 2 times, most recently from e7f8587 to 63e56d7 Compare May 28, 2026 08:44
ChaoWao and others added 3 commits May 28, 2026 16:44
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic
"first test fails → subsequent tests cascade 507899" pattern. The
trigger varies (test_dedup_shared_so_independent_unregister, spmd
paged_attention, etc. — whichever happens to hit an AICPU exception
or stuck kernel first), but the cascade shape is identical:
rtStreamCreate failed: 507899 on every subsequent test's simpler_init.

Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice`
on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only
HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any
form of device reset on teardown. When a non-HCCL test left the device
in error state, the next ChipWorker on the same device tripped
rtStreamCreate.

a5 has been doing the right thing all along: its finalize calls
`rtDeviceReset(device_id_)` unconditionally. This change mirrors that
on a2a3 for the non-ACL path.

After the fix, a2a3 `DeviceRunner::finalize` always resets the device:

- `acl_ready_=true` (HCCL/comm path):  aclrtResetDevice + aclFinalize
  (unchanged — tears down both device + the ACL bring-up).
- `acl_ready_=false` (rt-only, the common case):  rtDeviceReset
  (NEW — was previously skipped).

Local validation (a2a3 onboard, device 1):
- vector_example with pinned PTO-ISA: PASS.

CI validation: st-onboard-a2a3 went green first-attempt for the first
time this session, on this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hierarchical tests compile with -D_GLIBCXX_USE_CXX11_ABI=0 but
find_library(gtest) finds whatever the host package supplies. The a5
self-hosted runner ships a gtest built with the default cxx11 ABI, so
test_tensormap link-fails with `undefined reference to
testing::internal::EqFailure(..., std::string const&, ...)`. The error
has been chronic across unrelated PRs (hw-native-sys#868, hardware-docs, etc.).

Add a try_compile probe right after find_library. If linking under
_GLIBCXX_USE_CXX11_ABI=0 fails, unset the cache vars so the existing
FetchContent fallback rebuilds gtest with the matching ABI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The a2a3 self-hosted runner intermittently hits
"OpenSSL SSL_connect: SSL_ERROR_SYSCALL" while pulling the googletest
tarball from github.com during cmake configure. CMake's FetchContent
has no native retry, so the failure is fatal and we see chronic
ut-a2a3 "Each download failed!" errors on PRs that have nothing to
do with the build.

Pre-fetch the tarball with file(DOWNLOAD) + 3 attempts and 3s
backoff. Hand FetchContent the local file, which already verifies
the SHA256. No effect on hosts where the first download succeeds
(GH-hosted runners path stays sub-second).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants