Updated test logging and timeouts by Micky774 · Pull Request #608 · ROCm/TransformerEngine

Micky774 · 2026-06-02T21:03:57Z

Description

This PR improves CI test logging by:

Adding a fixed timeout to every pytest run so that CI doesn't hang
Enabling the partially implemented JUNITXML_PREFIX logging
Adding a parser script to digest the (newly enabled) XML test report
Adding an always-run step (it runs even if previous steps fail) that digests the XML via the parser script
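
For readers unfamiliar with JUnit XML digestion, here is a minimal sketch of what such a parser step might do. It is not the PR's ci/junit_report.py; the function name `summarize_junit` and the output shape are assumptions made for illustration. It relies only on the `tests`, `failures`, `errors`, and `skipped` attributes that pytest writes on each `<testsuite>` element, and it assumes suites are not nested.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_junit(results_dir):
    """Sum test counts over every *.xml report found under results_dir.

    Hypothetical helper for illustration only; assumes non-nested suites,
    which is what pytest's --junitxml output produces.
    """
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    failed = []
    for xml_path in sorted(Path(results_dir).rglob("*.xml")):
        root = ET.parse(xml_path).getroot()
        # iter() also yields the root itself, so this handles both a
        # <testsuites> wrapper and a bare <testsuite> root.
        for suite in root.iter("testsuite"):
            for key in totals:
                totals[key] += int(suite.get(key, 0))
            for case in suite.iter("testcase"):
                if case.find("failure") is not None or case.find("error") is not None:
                    failed.append(f"{case.get('classname', '')}::{case.get('name', '')}")
    return totals, failed
```

Calling `summarize_junit("test-results")` would return the aggregate counts plus the identifiers of failing test cases, which a CI step could then print regardless of whether earlier steps failed.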

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-06-03T15:59:35Z

+#   traceback instead of the run silently timing out hours later.
+# All are overridable from the environment.
+export PYTHONFAULTHANDLER=1
+: ${PYTEST_TIMEOUT:=1200}          # per-test (per-parametrization) timeout, seconds


I would say 20 minutes for an individual test is overkill.

Updated to 5 min.
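
As an illustration of the overridable default discussed here, the following sketch resolves a per-test timeout from the environment and turns it into pytest-timeout flags. The helper name `pytest_timeout_args` is hypothetical, and the 300-second default simply reflects the 5 min value mentioned in this thread; how the PR's scripts actually pass the value to pytest is not shown in the excerpt.

```python
import os


def pytest_timeout_args(env=None, default_seconds=300):
    """Build pytest-timeout flags, letting PYTEST_TIMEOUT override the default.

    Hypothetical helper; the default of 300 s is an assumption based on the
    "5 min" mentioned in the review thread.
    """
    env = os.environ if env is None else env
    # Mirrors the shell idiom `: ${PYTEST_TIMEOUT:=300}`: an unset or empty
    # variable falls back to the default.
    seconds = int(env.get("PYTEST_TIMEOUT") or default_seconds)
    return [f"--timeout={seconds}", "--timeout-method=thread"]
```

For example, `pytest_timeout_args({"PYTEST_TIMEOUT": "1200"})` yields the original 20-minute limit, while an empty environment yields the 5-minute one.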

ipanfilo · 2026-06-04T19:22:05Z

+        if: always()
+        run: |
+          command -v python3 >/dev/null 2>&1 || { echo "python3 not available; skipping report"; exit 0; }
+          python3 ci/junit_report.py test-results \


Can separate reports be generated for PyTorch/JAX/Core then? It will require passing JUNITXML_PREFIX=${JUNITXML_PREFIX}/[torch|jax|core], but it will probably be much more user-friendly than having all tests in one report.

The sGPU job ran pytorch.sh/jax.sh/core.sh in parallel into a shared test-results/ dir and merged them into one report. Give each suite its own subdir via a per-subprocess JUNITXML_PREFIX override and emit one junit_report.py report per suite.

This also fixes a latent collision: test_sanity_import.py runs in both the torch and jax suites at level 1/auto, so both wrote test-results/test_sanity_import.auto.xml in parallel and clobbered each other. Per-suite subdirs resolve it.

mGPU is unchanged (already per-framework via the build matrix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
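
The collision described above can be sketched as follows. The `junit_path` helper and its `<stem>.<level>.xml` naming are assumptions inferred from the single file name quoted in the commit message, not the scripts' actual logic.

```python
from pathlib import PurePosixPath


def junit_path(prefix, test_file, level):
    """Return the report path for one pytest invocation.

    Hypothetical naming scheme, inferred from the file name
    test-results/test_sanity_import.auto.xml quoted in the commit message.
    """
    stem = PurePosixPath(test_file).stem
    return str(PurePosixPath(prefix) / f"{stem}.{level}.xml")


# Shared prefix: the torch and jax suites resolve to the same file.
shared = {
    suite: junit_path("test-results", "test_sanity_import.py", "auto")
    for suite in ("torch", "jax")
}

# Per-suite prefix: each suite gets its own subdirectory, so paths differ.
per_suite = {
    suite: junit_path(f"test-results/{suite}", "test_sanity_import.py", "auto")
    for suite in ("torch", "jax")
}
```

With the shared prefix both suites map to `test-results/test_sanity_import.auto.xml`, so parallel runs overwrite each other; with per-suite prefixes the two paths are distinct.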

Micky774 · 2026-06-05T20:14:50Z

Example of test results:

ipanfilo · 2026-06-09T03:11:51Z

+          # Per-suite JUNITXML_PREFIX overrides the docker-exec default so each
+          # suite writes into its own subdir (inline VAR=val applies only to that
+          # backgrounded subprocess; the sourced _utils.sh reads it from the env).
+          HIP_VISIBLE_DEVICES=0 JUNITXML_PREFIX=/workspace/test-results/torch/ ci/pytorch.sh > /workspace/torch.log 2>&1 &


Since JUNITXML_PREFIX is already passed, it makes sense to use JUNITXML_PREFIX=${JUNITXML_PREFIX}/torch. Or, if that causes parsing problems because it is embedded in the script string, there is no need to pass JUNITXML_PREFIX as a docker exec env variable.

ipanfilo · 2026-06-09T03:15:06Z

        id: run-tests
+        # Below the job's timeout-minutes so an overrun kills only this step;
+        # the `if: always()` report + upload steps still run (artifacts survive).
+        timeout-minutes: 330


5.5 hrs? Isn't that too much? Currently, the longest execution is torch sGPU at ~2.5 hrs.

ipanfilo · 2026-06-09T03:29:50Z

 #   traceback instead of the run silently timing out hours later.
+# Note: the 'thread' method bounds only the pytest process itself. Tests that
+# launch torchrun/mpirun children (tests/pytorch/distributed) are reaped
+# separately by tests/pytorch/distributed/conftest.py, which reads PYTEST_TIMEOUT


There is no such file

Updated (I missed adding this file initially)
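
To illustrate why launcher children need separate handling, here is a minimal POSIX-only sketch of killing a whole process group on timeout. It is not the contents of tests/pytorch/distributed/conftest.py, which is not shown in this thread; the function name `run_with_group_timeout` is hypothetical.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_seconds):
    """Run cmd in its own process group; kill the whole group on timeout.

    Hypothetical helper (POSIX only). Returns the exit code, or None if the
    command was killed after exceeding timeout_seconds.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            # The child leads its own session, so its pid is also the pgid;
            # killpg reaches grandchildren that a plain proc.kill() would miss.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # The group exited between the timeout and the kill.
        proc.wait()
        return None
```

A test that spawns a launcher such as torchrun could wrap the launch in a helper like this so that worker processes do not outlive the timed-out test.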

Updated test logging and timeouts

33d8fe2

Micky774 requested review from ipanfilo, wangye805 and wenchenvincent as code owners June 2, 2026 21:03

Micky774 added the ci-level 3 label (CI test level 3) Jun 3, 2026

ipanfilo reviewed Jun 4, 2026

Comment thread ci/_utils.sh

Micky774 and others added 5 commits June 5, 2026 15:41

Updated timeouts

5e21c6c

Merge branch 'dev' into zain/ci-test-logging

aa587fb

Lower timeout

f39b0f3

Updated ci/README.md

ba88e02

Micky774 requested a review from ipanfilo June 8, 2026 19:23

ipanfilo reviewed Jun 9, 2026

Introduced missing conftest, adjusted rocm-ci.yml based on feedback

9f3342a

Micky774 requested a review from ipanfilo June 11, 2026 16:59

Micky774 wants to merge 7 commits into dev from zain/ci-test-logging
