Updated test logging and timeouts #608
| # traceback instead of the run silently timing out hours later. | ||
| # All are overridable from the environment. | ||
| export PYTHONFAULTHANDLER=1 | ||
| : ${PYTEST_TIMEOUT:=1200} # per-test (per-parametrization) timeout, seconds |
I would say 20 minutes for an individual test is overkill.
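For reference, a minimal sketch of how the `: ${VAR:=default}` idiom in the diff behaves, so a lower limit can be tried from the environment without editing the script. The value 300 is purely illustrative and not a recommendation from this PR.

```shell
#!/usr/bin/env bash
# ':' is a no-op builtin; the ${VAR:=default} expansion assigns the default
# only when VAR is unset or empty.

unset PYTEST_TIMEOUT
: "${PYTEST_TIMEOUT:=1200}"
default_timeout=$PYTEST_TIMEOUT        # nothing set by the caller: default applies

PYTEST_TIMEOUT=300                     # illustrative override (hypothetical value)
: "${PYTEST_TIMEOUT:=1200}"
overridden_timeout=$PYTEST_TIMEOUT     # caller's value is kept

echo "default=${default_timeout}s overridden=${overridden_timeout}s"
```

In CI the same override would come from the job environment, for example by exporting `PYTEST_TIMEOUT` before the script is sourced.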
| if: always() | ||
| run: | | ||
| command -v python3 >/dev/null 2>&1 || { echo "python3 not available; skipping report"; exit 0; } | ||
| python3 ci/junit_report.py test-results \ |
Can separate reports be generated for PyTorch/JAX/Core then? It will require passing JUNITXML_PREFIX=${JUNITXML_PREFIX}/[torch|jax|core], but it will probably be much more user-friendly than having all tests in one report.
The sGPU job ran pytorch.sh/jax.sh/core.sh in parallel into a shared test-results/ dir and merged them into one report. Give each suite its own subdir via a per-subprocess JUNITXML_PREFIX override and emit one junit_report.py report per suite.

This also fixes a latent collision: test_sanity_import.py runs in both the torch and jax suites at level 1/auto, so both wrote test-results/test_sanity_import.auto.xml in parallel and clobbered each other. Per-suite subdirs resolve it.

mGPU is unchanged (already per-framework via the build matrix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
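The collision and its fix can be reproduced in isolation. The sketch below is an assumption-laden stand-in: the `suite` one-liner is a hypothetical replacement for pytorch.sh/jax.sh, and a temporary directory replaces /workspace/test-results. It only shows that an inline `VAR=val` prefix applies to the one backgrounded subprocess, so two suites writing the same file name no longer overwrite each other.

```shell
#!/usr/bin/env bash
root=$(mktemp -d)
export JUNITXML_PREFIX="$root/"   # stand-in for the shared docker-exec default

# Hypothetical stand-in for a suite script: writes one report under its prefix.
suite='mkdir -p "$JUNITXML_PREFIX" && echo "$1" > "${JUNITXML_PREFIX}test_sanity_import.auto.xml"'

# Inline VAR=val affects only the subprocess it prefixes, not the parent shell.
JUNITXML_PREFIX="$root/torch/" bash -c "$suite" _ torch &
JUNITXML_PREFIX="$root/jax/"   bash -c "$suite" _ jax &
wait

torch_report=$(cat "$root/torch/test_sanity_import.auto.xml")
jax_report=$(cat "$root/jax/test_sanity_import.auto.xml")
parent_prefix=$JUNITXML_PREFIX    # still the shared default

echo "torch=$torch_report jax=$jax_report parent=$parent_prefix"
```

With a shared prefix, both subprocesses would target the same path and the surviving content would depend on scheduling.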
| # Per-suite JUNITXML_PREFIX overrides the docker-exec default so each | ||
| # suite writes into its own subdir (inline VAR=val applies only to that | ||
| # backgrounded subprocess; the sourced _utils.sh reads it from the env). | ||
| HIP_VISIBLE_DEVICES=0 JUNITXML_PREFIX=/workspace/test-results/torch/ ci/pytorch.sh > /workspace/torch.log 2>&1 & |
Since JUNITXML_PREFIX is already passed, it makes sense to use JUNITXML_PREFIX=${JUNITXML_PREFIX}/torch. Or, if that causes parsing problems because it is embedded in the script string, there is no need to pass JUNITXML_PREFIX as a docker exec env variable.
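One way to derive the per-suite value from the existing one is sketched below. `suite_prefix` is a hypothetical helper, not something in this PR; it strips a trailing slash from the base first so that both `/workspace/test-results` and `/workspace/test-results/` yield the same result, and it keeps the trailing slash that the diff's per-suite values use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join a base prefix and a suite name into "<base>/<suite>/".
suite_prefix() {
  local base=${1%/}            # drop one trailing slash, if present
  printf '%s/%s/\n' "$base" "$2"
}

JUNITXML_PREFIX=/workspace/test-results/
torch_prefix=$(suite_prefix "$JUNITXML_PREFIX" torch)
jax_prefix=$(suite_prefix "/workspace/test-results" jax)

echo "$torch_prefix"
echo "$jax_prefix"
```

If the expansion lives inside a double-quoted script string handed to `docker exec`, `${JUNITXML_PREFIX}` would be expanded by the outer shell unless the `$` is escaped, which is the parsing concern raised above.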
| id: run-tests | ||
| # Below the job's timeout-minutes so an overrun kills only this step; | ||
| # the `if: always()` report + upload steps still run (artifacts survive). | ||
| timeout-minutes: 330 |
5.5 hours? Isn't that too much? Currently, the longest run is torch sGPU at ~2.5 hours.
| # traceback instead of the run silently timing out hours later. | ||
| # Note: the 'thread' method bounds only the pytest process itself. Tests that | ||
| # launch torchrun/mpirun children (tests/pytorch/distributed) are reaped | ||
| # separately by tests/pytorch/distributed/conftest.py, which reads PYTEST_TIMEOUT |
Updated (I missed adding this file initially)
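To illustrate why launched children need their own bound, here is a shell analogue using GNU coreutils `timeout`. This is not what tests/pytorch/distributed/conftest.py does (that file is Python and is not shown here); `run_bounded` is a hypothetical wrapper, and `sleep 5` stands in for a hung torchrun/mpirun child.

```shell
#!/usr/bin/env bash
: "${PYTEST_TIMEOUT:=1200}"    # same default as in the diff above

# Hypothetical wrapper: send TERM after $1 seconds, then KILL 5 seconds later
# if the command is still alive. Assumes GNU coreutils `timeout` is installed.
run_bounded() {
  local limit=$1; shift
  timeout --kill-after=5 "$limit" "$@"
}

# Stand-in for a hung child; a real call might pass "$PYTEST_TIMEOUT" as the limit.
run_bounded 1 sleep 5
rc=$?                          # GNU timeout exits with 124 when the limit is hit

echo "exit status: $rc"
```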

Description
This PR improves CI test logging by:
JUNITXML_PREFIX logging
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: