Skip to content

Fix headless premature completion detection#123

Merged
mikehorgan-dell merged 5 commits into
mainfrom
fix/headless-premature-test-completion
May 30, 2026
Merged

Fix headless premature completion detection#123
mikehorgan-dell merged 5 commits into
mainfrom
fix/headless-premature-test-completion

Conversation

@mikehorgan-dell
Copy link
Copy Markdown
Member

@mikehorgan-dell mikehorgan-dell commented May 30, 2026

Summary

  • keep engine metrics in Running state during transient zero-concurrency gaps for time- and op-count-bounded runs
  • gate headless completion on elapsed duration or op-count completion instead of trusting a single Completed sample
  • propagate aggregate StepTime and preserve aggregate limit/progress fields used by completion detection

The engine reports test_state=Completed whenever operations have started
but the instantaneous concurrency is 0. The asynchronous Netty S3 driver
hits that gap between operations at low thread counts, so the headless
multi-host coordinator (metricsPollingLoop) aborted duration-bounded
runs within seconds of starting while the interactive TUI, which does not
gate on that signal, ran to completion.

Extract shouldSignalCompletion() and, for time-bounded runs, only honor a
Completed sample once the configured duration has actually elapsed (using
the engine-reported elapsed_time_seconds carried in StepTime). Op-count
and unbounded runs keep prior behavior and are covered by the companion
engine-side fix. Adds table tests that prove the premature-completion
fault and guard the fix.
calculateTestState() reported Completed whenever operations had started
but the instantaneous concurrency was 0. The asynchronous Netty S3 driver
releases its concurrency permit between operations, so at low thread
counts concurrency momentarily hits 0 mid-run; a headless coordinator
polling /metrics then saw test_state=2 and tore the distributed run down
within seconds of starting.

For a duration-bounded run, keep reporting Running until the configured
time limit has actually elapsed (derived from the load-step metadata and
the snapshot elapsed time); op-count and unbounded runs are unchanged.

Adds tests proving the transient-gap fault and the genuine post-duration
completion path.
…etion

The headless completion check shouldSignalCompletion relies on StepTime and
the limit fields (HasLimit/LimitType/LimitTimeSec/CompletionPercent) of the
aggregate metric, but the real aggregation paths dropped them:

- aggregateNodeMetrics (via sumWorkerMetrics) never set StepTime and rebuilt
  CompletionPercent from summed work, leaving the per-step elapsed time at 0.
- AggregateByOpType only copied StepTime/HasLimit/LimitType/LimitTimeSec on the
  MIXED path and took CompletionPercent from the first metric, dropping
  LimitOpCount entirely.

As a result a genuinely-finished, duration-bounded headless run reported
StepTime=0 (and often HasLimit=false) through the aggregate, so completion was
never signalled and the run hung.

Add applyProgressAndLimitFields, called from both aggregation paths, which sets
StepTime to the MAX across nodes (most-progressed elapsed), CompletionPercent
to the MIN across nodes (slowest node = true fleet completion), and copies the
limit metadata from any node reporting a limit (Unbounded if any node is).

Also extend shouldSignalCompletion with an op-count branch: for op-count-bounded
runs, only honor Completed once CompletionPercent reaches 100, which is
normalization-safe across N nodes (comparing summed SuccessCount to a per-node
LimitOpCount would be wrong). Time-bounded runs keep the StepTime>=duration
guard; unbounded/none keep trusting the engine state.

Add completion_detection_test.go covering both findings through a real
MetricsAggregator.Aggregate: a genuine time-bounded completed run signals
completion (regression for the hang), a transient one does not, and op-count
runs gate on fleet CompletionPercent.
… runs

calculateTestState already guards the time limit: when concurrency drops to 0
with operations already issued (a transient gap with the asynchronous Netty S3
driver), a duration-bounded run stays Running until the configured time elapses.

Op-count-bounded runs hit the same transient concurrency==0 gap but were not
guarded, so they reported Completed (2) early and headless coordinators aborted
the run before the operation count was reached.

Add a matching op-count guard: when concurrency==0 with ops issued, if the
metadata op-count limit is set and the per-context totalOps has not reached it,
return Running (1). countLimit is per-node/per-context here, which matches the
per-context totalOps that calculateTestState receives.

Add clusterMetricsStayRunningWhenConcurrencyTransientlyZeroBeforeOpCountLimit
mirroring the existing time-limit regression test (80 of 1000 ops, concurrency
last()==0) asserting test_state == 1.
@mikehorgan-dell mikehorgan-dell merged commit a6fa203 into main May 30, 2026
6 checks passed
@mikehorgan-dell mikehorgan-dell deleted the fix/headless-premature-test-completion branch May 30, 2026 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants