Fix headless premature completion detection#123
Merged
Conversation
The engine reports test_state=Completed whenever operations have started but the instantaneous concurrency is 0. The asynchronous Netty S3 driver hits that gap between operations at low thread counts, so the headless multi-host coordinator (metricsPollingLoop) aborted duration-bounded runs within seconds of starting while the interactive TUI, which does not gate on that signal, ran to completion. Extract shouldSignalCompletion() and, for time-bounded runs, only honor a Completed sample once the configured duration has actually elapsed (using the engine-reported elapsed_time_seconds carried in StepTime). Op-count and unbounded runs keep prior behavior and are covered by the companion engine-side fix. Adds table tests that prove the premature-completion fault and guard the fix.
calculateTestState() reported Completed whenever operations had started but the instantaneous concurrency was 0. The asynchronous Netty S3 driver releases its concurrency permit between operations, so at low thread counts concurrency momentarily hits 0 mid-run; a headless coordinator polling /metrics then saw test_state=2 and tore the distributed run down within seconds of starting. For a duration-bounded run, keep reporting Running until the configured time limit has actually elapsed (derived from the load-step metadata and the snapshot elapsed time); op-count and unbounded runs are unchanged. Adds tests proving the transient-gap fault and the genuine post-duration completion path.
…etion The headless completion check shouldSignalCompletion relies on StepTime and the limit fields (HasLimit/LimitType/LimitTimeSec/CompletionPercent) of the aggregate metric, but the real aggregation paths dropped them: - aggregateNodeMetrics (via sumWorkerMetrics) never set StepTime and rebuilt CompletionPercent from summed work, leaving the per-step elapsed time at 0. - AggregateByOpType only copied StepTime/HasLimit/LimitType/LimitTimeSec on the MIXED path and took CompletionPercent from the first metric, dropping LimitOpCount entirely. As a result a genuinely-finished, duration-bounded headless run reported StepTime=0 (and often HasLimit=false) through the aggregate, so completion was never signalled and the run hung. Add applyProgressAndLimitFields, called from both aggregation paths, which sets StepTime to the MAX across nodes (most-progressed elapsed), CompletionPercent to the MIN across nodes (slowest node = true fleet completion), and copies the limit metadata from any node reporting a limit (Unbounded if any node is). Also extend shouldSignalCompletion with an op-count branch: for op-count-bounded runs, only honor Completed once CompletionPercent reaches 100, which is normalization-safe across N nodes (comparing summed SuccessCount to a per-node LimitOpCount would be wrong). Time-bounded runs keep the StepTime>=duration guard; unbounded/none keep trusting the engine state. Add completion_detection_test.go covering both findings through a real MetricsAggregator.Aggregate: a genuine time-bounded completed run signals completion (regression for the hang), a transient one does not, and op-count runs gate on fleet CompletionPercent.
… runs calculateTestState already guards the time limit: when concurrency drops to 0 with operations already issued (a transient gap with the asynchronous Netty S3 driver), a duration-bounded run stays Running until the configured time elapses. Op-count-bounded runs hit the same transient concurrency==0 gap but were not guarded, so they reported Completed (2) early and headless coordinators aborted the run before the operation count was reached. Add a matching op-count guard: when concurrency==0 with ops issued, if the metadata op-count limit is set and the per-context totalOps has not reached it, return Running (1). countLimit is per-node/per-context here, which matches the per-context totalOps that calculateTestState receives. Add clusterMetricsStayRunningWhenConcurrencyTransientlyZeroBeforeOpCountLimit mirroring the existing time-limit regression test (80 of 1000 ops, concurrency last()==0) asserting test_state == 1.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary