ci: riscv: partition tests and add timeouts for QEMU stability #315

Open
allnes wants to merge 1 commit into openvinotoolkit:v3.10_for_ie_master from allnes:anesterov/riscv-ci-stability
allnes commented Jun 12, 2026

Collaborator

Motivation

The RISC-V CI runs the test suite under QEMU user-mode emulation, which is ~10-50x slower than native. The vlen256 job occasionally hangs: in one recent run it ran for ~58 min, and the Run oneDNN tests step produced no output at all before the job died, leaving no diagnostic signal. Normal runs finish in ~17 min (181 tests), so this is a transient QEMU/runner stall rather than a real test failure. Even so, a silent hang that can stall for hours, combined with the lack of a per-test cap, makes these failures hard to diagnose and slow to fail.

This is a RISC-V CI robustness change only — no library code is touched.

Changes

.github/workflows/ci-riscv.yml

  • fail-fast: false on the test matrix so a flaky vlen128/vlen256 run (or one partition) no longer cancels the siblings.
  • timeout-minutes: 40 on the test job, so a hung QEMU run is killed promptly instead of stalling toward the 6 h default.
  • Split each vlen config across 4 parallel partitions via a part/parts matrix dimension (2 vlen × 4 = 8 jobs). Wall-clock time per partition drops roughly linearly with the number of partitions.

.github/automation/riscv/test.sh (SMOKE branch)

  • Apply stride partitioning via ctest -I ${start},,${stride}, driven by ONEDNN_TEST_PART/ONEDNN_TEST_STRIDE. This mirrors the partitioning the CI testset branch already uses. Stride partitioning needs no cost table and covers every test exactly once (verified: N=181 → 46/45/45/45, no gaps, no overlaps).
  • Add --timeout ${ONEDNN_TEST_TIMEOUT:-600} so a hung test (e.g. an RVV codepath that loops forever) fails fast with its name instead of stalling the whole job. SMOKE tests run in seconds, so 600 s is a generous cap. The cap is scoped to the SMOKE branch only and does not affect the weekly CI testset (which has multi-hour tests).
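
The coverage claim for stride partitioning can be checked without a build tree. The sketch below is not part of test.sh: it assumes ctest numbers tests from 1 and that -I start,,stride selects start, start+stride, ... up to the last test, and it emulates that selection with seq. The helper names partition_tests, partition_size and covers_exactly_once are hypothetical.

```shell
#!/bin/sh
# List the 1-based test numbers that `ctest -I start,,stride` would select
# out of n tests: start, start+stride, ... up to n (emulated with seq).
partition_tests() (
    n=$1; start=$2; stride=$3
    seq "$start" "$stride" "$n"
)

# Number of tests in one partition.
partition_size() (
    partition_tests "$1" "$2" "$3" | wc -l | tr -d ' '
)

# Succeeds only if partitions 1..stride together list every test number
# from 1 to n exactly once (no gaps, no overlaps).
covers_exactly_once() (
    n=$1; stride=$2; part=1
    selected=$(
        while [ "$part" -le "$stride" ]; do
            partition_tests "$n" "$part" "$stride"
            part=$((part + 1))
        done | sort -n
    )
    [ "$selected" = "$(seq 1 "$n")" ]
)

for part in 1 2 3 4; do
    echo "part ${part}: $(partition_size 181 "$part" 4) tests"
done
covers_exactly_once 181 4 && echo "181 tests covered exactly once"
```

For N=181 and a stride of 4 this reports 46, 45, 45 and 45 tests, matching the split quoted above.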

Notes

  • Defaults preserve old behavior when the env vars are unset (start=1, stride=1 → all tests, single partition).
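
A minimal sketch of how the defaults could fall out of shell parameter expansion. It assumes ONEDNN_TEST_PART supplies the start index and ONEDNN_TEST_STRIDE the stride; the actual variable handling in test.sh may differ, and smoke_ctest_args is a hypothetical helper used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the ctest selection and timeout arguments
# for the SMOKE branch from the environment.
# Assumption: ONEDNN_TEST_PART is the 1-based start index and
# ONEDNN_TEST_STRIDE is the stride; both fall back to 1 when unset.
smoke_ctest_args() (
    start=${ONEDNN_TEST_PART:-1}
    stride=${ONEDNN_TEST_STRIDE:-1}
    timeout=${ONEDNN_TEST_TIMEOUT:-600}
    echo "-I ${start},,${stride} --timeout ${timeout}"
)

# Print the command instead of running it, so the sketch needs no build tree.
echo "ctest $(smoke_ctest_args)"
```

With none of the variables set, the selection is -I 1,,1, which selects every test in a single partition, so only the 600 s per-test cap is new.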

allnes commented Jun 12, 2026

Collaborator Author

@aobolensk please review

@aobolensk aobolensk left a comment


I don't think this is the right place to add such a change. Let's submit it to https://github.com/uxlfoundation/oneDNN first and then cherry-pick it.
