Fix: a2a3 finalize calls rtDeviceReset on the rt-only path (cascade fix)#882
Fix: a2a3 finalize calls rtDeviceReset on the rt-only path (cascade fix)#882hw-native-sys-bot wants to merge 3 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthrough
ChangesDevice Finalization Reset
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the DeviceRunner::finalize() method in device_runner.cpp to ensure the device is reset even when acl_ready_ is false (the pure runtime path). When acl_ready_ is true, it continues to perform aclrtResetDevice and aclFinalize. When acl_ready_ is false, it now calls rtDeviceReset to clear any remaining per-device runtime state, preventing subsequent device initialization failures. There are no review comments to address.
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic "first test fails → subsequent tests cascade 507899" pattern. The trigger varies (test_dedup_shared_so_independent_unregister, spmd paged_attention, etc. — whichever happens to hit an AICPU exception or stuck kernel first), but the cascade shape is identical: rtStreamCreate failed: 507899 on every subsequent test's simpler_init. Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice` on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any form of device reset on teardown. When a non-HCCL test left the device in error state, the next ChipWorker on the same device tripped rtStreamCreate. a5 has been doing the right thing all along: its finalize calls `rtDeviceReset(device_id_)` unconditionally. This change mirrors that on a2a3 for the non-ACL path. After the fix, a2a3 `DeviceRunner::finalize` always resets the device: - `acl_ready_=true` (HCCL/comm path): aclrtResetDevice + aclFinalize (unchanged — tears down both device + the ACL bring-up). - `acl_ready_=false` (rt-only, the common case): rtDeviceReset (NEW — was previously skipped). Local validation (a2a3 onboard, device 1): - vector_example with pinned PTO-ISA: PASS. CI validation: st-onboard-a2a3 went green first-attempt for the first time this session (PR hw-native-sys#882 CI run, before runner-side libgtest drift broke ut-a5 separately — see below). --- Also includes: force FetchContent for GoogleTest when SIMPLER_ENABLE_HARDWARE_TESTS=ON. A drifting a5 self-hosted runner image now has libgtest installed in /usr/local/lib64 built with the default (CXX11=1) ABI; our `hierarchical_objs` test objects use `_GLIBCXX_USE_CXX11_ABI=0` to match the CANN-linking ABI. Mismatch produces `undefined reference to testing::internal::EqFailure` (mangled with std::string) at link time on ut-a5. Forcing FetchContent when hardware UT is enabled ensures the fetched gtest is built with the matching ABI, regardless of runner image state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9275882 to
f780e7f
Compare
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic "first test fails → subsequent tests cascade 507899" pattern. The trigger varies (test_dedup_shared_so_independent_unregister, spmd paged_attention, etc. — whichever happens to hit an AICPU exception or stuck kernel first), but the cascade shape is identical: rtStreamCreate failed: 507899 on every subsequent test's simpler_init. Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice` on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any form of device reset on teardown. When a non-HCCL test left the device in error state, the next ChipWorker on the same device tripped rtStreamCreate. a5 has been doing the right thing all along: its finalize calls `rtDeviceReset(device_id_)` unconditionally. This change mirrors that on a2a3 for the non-ACL path. After the fix, a2a3 `DeviceRunner::finalize` always resets the device: - `acl_ready_=true` (HCCL/comm path): aclrtResetDevice + aclFinalize (unchanged — tears down both device + the ACL bring-up). - `acl_ready_=false` (rt-only, the common case): rtDeviceReset (NEW — was previously skipped). Local validation (a2a3 onboard, device 1): - vector_example with pinned PTO-ISA: PASS. CI validation: st-onboard-a2a3 went green first-attempt for the first time this session (PR hw-native-sys#882 CI run, before runner-side libgtest drift broke ut-a5 separately — see below). --- Also includes: force FetchContent for GoogleTest when SIMPLER_ENABLE_HARDWARE_TESTS=ON. A drifting a5 self-hosted runner image now has libgtest installed in /usr/local/lib64 built with the default (CXX11=1) ABI; our `hierarchical_objs` test objects use `_GLIBCXX_USE_CXX11_ABI=0` to match the CANN-linking ABI. Mismatch produces `undefined reference to testing::internal::EqFailure` (mangled with std::string) at link time on ut-a5. Forcing FetchContent when hardware UT is enabled ensures the fetched gtest is built with the matching ABI, regardless of runner image state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e7f8587 to
63e56d7
Compare
Symptom: across recent CI runs, st-onboard-a2a3 has had a chronic "first test fails → subsequent tests cascade 507899" pattern. The trigger varies (test_dedup_shared_so_independent_unregister, spmd paged_attention, etc. — whichever happens to hit an AICPU exception or stuck kernel first), but the cascade shape is identical: rtStreamCreate failed: 507899 on every subsequent test's simpler_init. Root cause: a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice` on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only HCCL / comm tests. Non-HCCL tests (most of the suite) skipped any form of device reset on teardown. When a non-HCCL test left the device in error state, the next ChipWorker on the same device tripped rtStreamCreate. a5 has been doing the right thing all along: its finalize calls `rtDeviceReset(device_id_)` unconditionally. This change mirrors that on a2a3 for the non-ACL path. After the fix, a2a3 `DeviceRunner::finalize` always resets the device: - `acl_ready_=true` (HCCL/comm path): aclrtResetDevice + aclFinalize (unchanged — tears down both device + the ACL bring-up). - `acl_ready_=false` (rt-only, the common case): rtDeviceReset (NEW — was previously skipped). Local validation (a2a3 onboard, device 1): - vector_example with pinned PTO-ISA: PASS. CI validation: st-onboard-a2a3 went green first-attempt for the first time this session, on this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hierarchical tests compile with -D_GLIBCXX_USE_CXX11_ABI=0 but find_library(gtest) finds whatever the host package supplies. The a5 self-hosted runner ships a gtest built with the default cxx11 ABI, so test_tensormap link-fails with `undefined reference to testing::internal::EqFailure(..., std::string const&, ...)`. The error has been chronic across unrelated PRs (hw-native-sys#868, hardware-docs, etc.). Add a try_compile probe right after find_library. If linking under _GLIBCXX_USE_CXX11_ABI=0 fails, unset the cache vars so the existing FetchContent fallback rebuilds gtest with the matching ABI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The a2a3 self-hosted runner intermittently hits "OpenSSL SSL_connect: SSL_ERROR_SYSCALL" while pulling the googletest tarball from github.com during cmake configure. CMake's FetchContent has no native retry, so the failure is fatal and we see chronic ut-a2a3 "Each download failed!" errors on PRs that have nothing to do with the build. Pre-fetch the tarball with file(DOWNLOAD) + 3 attempts and 3s backoff. Hand FetchContent the local file, which already verifies the SHA256. No effect on hosts where the first download succeeds (GH-hosted runners path stays sub-second). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes the chronic `st-onboard-a2a3` cascade we've been seeing across many recent PRs.
Symptom
CI shows the same shape repeatedly: one test errors out (varies — `test_dedup_shared_so_independent_unregister`, `spmd_paged_attention`, etc.), then every subsequent test on the same device fails its `simpler_init` with `rtStreamCreate failed: 507899`. Reproduced locally with the full ST suite (`pytest examples tests/st --platform a2a3 --device 0-7 -n 2 --pto-isa-commit $PTO_ISA_COMMIT`).
Root cause
a2a3's `DeviceRunner::finalize()` gated `aclrtResetDevice` on `acl_ready_`, which is only set by `ensure_acl_ready()` — i.e. only the HCCL/comm test path. Non-HCCL tests (most of the suite) skipped any form of device reset on teardown. When a non-HCCL test left the device in error state (AICPU exception, stuck kernel, anything), the next `ChipWorker` on the same device tripped `rtStreamCreate`.
a5 had been doing the right thing all along: its `finalize` calls `rtDeviceReset(device_id_)` unconditionally. This change mirrors that on a2a3 for the non-ACL path. The asymmetry is exactly why a5 onboard has been ~100% reliable in CI while a2a3 onboard has chronically cascaded.
The fix
`finalize()` now always resets the device:
Diff: 1 file, +34/-14.
Local validation (a2a3 onboard, device 1)
(A full ST-suite repro of the cascade requires xdist parallelism + device pool exhaustion timing that CI provides naturally. CI is the real validation gate for this fix.)
Expected effect in CI
Test plan