ci: riscv: partition tests and add timeouts for QEMU stability #315
Open
allnes wants to merge 1 commit into
Collaborator
Author
@aobolensk please review
aobolensk reviewed on Jun 12, 2026 and left a comment:
I don't think this is the right place to add such a change. Let's submit it to https://github.com/uxlfoundation/oneDNN first and then cherry-pick it.
Motivation
The RISC-V CI runs the test suite under QEMU user-mode emulation, which is ~10-50x slower than native. The `vlen256` job occasionally hangs: in one recent run it ran ~58 min and the `Run oneDNN tests` step produced no output at all before the job died, giving zero diagnostic signal. Normal runs finish in ~17 min (181 tests), so this is a transient QEMU/runner stall rather than a real test failure. Still, the silent, potentially multi-hour hang and the lack of a per-test cap make it hard to diagnose and slow to fail.

This is a RISC-V CI robustness change only; no library code is touched.
Changes
`.github/workflows/ci-riscv.yml`
- `fail-fast: false` on the `test` matrix so a flaky `vlen128`/`vlen256` run (or one partition) no longer cancels the siblings.
- `timeout-minutes: 40` on the `test` job, so a hung QEMU run is now killed promptly instead of stalling toward the 6 h default.
- Each `vlen` config is split across 4 parallel partitions via a `part`/`parts` matrix dimension (2 vlen × 4 = 8 jobs). Wall-clock time per partition drops roughly linearly.

`.github/automation/riscv/test.sh` (SMOKE branch)
- Stride partitioning with `ctest -I ${start},,${stride}`, driven by `ONEDNN_TEST_PART`/`ONEDNN_TEST_STRIDE`. This mirrors the partitioning the `CI` testset branch already uses; stride needs no cost table and covers every test exactly once (verified: N=181 → 46/45/45/45, no gaps, no overlaps).
- `--timeout ${ONEDNN_TEST_TIMEOUT:-600}` so a hung test (e.g. an RVV codepath that loops forever) fails fast with its name instead of stalling the whole job. SMOKE tests run in seconds, so 600 s is a generous cap. The cap is scoped to the SMOKE branch only and does not affect the weekly `CI` testset (which has multi-hour tests).

Notes
`start=1`, `stride=1` → all tests, single partition.
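
The stride partitioning described under Changes can be sketched in a few lines of shell. This is only an illustration, not the contents of `test.sh`: it assumes `ONEDNN_TEST_PART` is a 1-based partition index used directly as the `ctest -I` start and that `ONEDNN_TEST_STRIDE` is the partition count, and `partition_size` is a hypothetical helper introduced here to check the arithmetic. The `ctest` command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch only: the real test.sh may map these variables differently.
set -euo pipefail

# Assumed fallbacks: start=1, stride=1 selects every test in one partition.
start="${ONEDNN_TEST_PART:-1}"
stride="${ONEDNN_TEST_STRIDE:-1}"
timeout="${ONEDNN_TEST_TIMEOUT:-600}"

# The invocation shape described above (printed, not run, in this sketch).
echo "ctest -I ${start},,${stride} --timeout ${timeout}"

# Hypothetical helper: how many of N tests does a stride partition select?
# Selected indices are first, first+step, first+2*step, ... up to N.
partition_size() {
    local n="$1" first="$2" step="$3"
    if (( first > n )); then
        echo 0
        return
    fi
    echo $(( (n - first) / step + 1 ))
}

# Reproduce the N=181, 4-partition split quoted in the description.
sizes=""
total=0
for part in 1 2 3 4; do
    size="$(partition_size 181 "$part" 4)"
    sizes="${sizes:+$sizes/}${size}"
    total=$(( total + size ))
done
echo "sizes=${sizes} total=${total}"
```

The last line prints `sizes=46/45/45/45 total=181`, matching the split quoted above. The sizes summing to N is consistent with every test being covered exactly once, since each index belongs to exactly one residue class modulo the stride.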