Fix headless premature completion detection by mikehorgan-dell · Pull Request #123 · dell/storage-performance-tool

mikehorgan-dell · 2026-05-30T17:19:27Z

Summary

keep engine metrics in Running state during transient zero-concurrency gaps for time- and op-count-bounded runs
gate headless completion on elapsed duration or op-count completion instead of trusting a single Completed sample
propagate aggregate StepTime and preserve aggregate limit/progress fields used by completion detection

The engine reports test_state=Completed whenever operations have started but the instantaneous concurrency is 0. The asynchronous Netty S3 driver hits that gap between operations at low thread counts, so the headless multi-host coordinator (metricsPollingLoop) aborted duration-bounded runs within seconds of starting while the interactive TUI, which does not gate on that signal, ran to completion. Extract shouldSignalCompletion() and, for time-bounded runs, only honor a Completed sample once the configured duration has actually elapsed (using the engine-reported elapsed_time_seconds carried in StepTime). Op-count and unbounded runs keep prior behavior and are covered by the companion engine-side fix. Adds table tests that prove the premature-completion fault and guard the fix.

calculateTestState() reported Completed whenever operations had started but the instantaneous concurrency was 0. The asynchronous Netty S3 driver releases its concurrency permit between operations, so at low thread counts concurrency momentarily hits 0 mid-run; a headless coordinator polling /metrics then saw test_state=2 and tore the distributed run down within seconds of starting. For a duration-bounded run, keep reporting Running until the configured time limit has actually elapsed (derived from the load-step metadata and the snapshot elapsed time); op-count and unbounded runs are unchanged. Adds tests proving the transient-gap fault and the genuine post-duration completion path.

…etion The headless completion check shouldSignalCompletion relies on StepTime and the limit fields (HasLimit/LimitType/LimitTimeSec/CompletionPercent) of the aggregate metric, but the real aggregation paths dropped them: - aggregateNodeMetrics (via sumWorkerMetrics) never set StepTime and rebuilt CompletionPercent from summed work, leaving the per-step elapsed time at 0. - AggregateByOpType only copied StepTime/HasLimit/LimitType/LimitTimeSec on the MIXED path and took CompletionPercent from the first metric, dropping LimitOpCount entirely. As a result a genuinely-finished, duration-bounded headless run reported StepTime=0 (and often HasLimit=false) through the aggregate, so completion was never signalled and the run hung. Add applyProgressAndLimitFields, called from both aggregation paths, which sets StepTime to the MAX across nodes (most-progressed elapsed), CompletionPercent to the MIN across nodes (slowest node = true fleet completion), and copies the limit metadata from any node reporting a limit (Unbounded if any node is). Also extend shouldSignalCompletion with an op-count branch: for op-count-bounded runs, only honor Completed once CompletionPercent reaches 100, which is normalization-safe across N nodes (comparing summed SuccessCount to a per-node LimitOpCount would be wrong). Time-bounded runs keep the StepTime>=duration guard; unbounded/none keep trusting the engine state. Add completion_detection_test.go covering both findings through a real MetricsAggregator.Aggregate: a genuine time-bounded completed run signals completion (regression for the hang), a transient one does not, and op-count runs gate on fleet CompletionPercent.

… runs calculateTestState already guards the time limit: when concurrency drops to 0 with operations already issued (a transient gap with the asynchronous Netty S3 driver), a duration-bounded run stays Running until the configured time elapses. Op-count-bounded runs hit the same transient concurrency==0 gap but were not guarded, so they reported Completed (2) early and headless coordinators aborted the run before the operation count was reached. Add a matching op-count guard: when concurrency==0 with ops issued, if the metadata op-count limit is set and the per-context totalOps has not reached it, return Running (1). countLimit is per-node/per-context here, which matches the per-context totalOps that calculateTestState receives. Add clusterMetricsStayRunningWhenConcurrencyTransientlyZeroBeforeOpCountLimit mirroring the existing time-limit regression test (80 of 1000 ops, concurrency last()==0) asserting test_state == 1.

mikehorgan added 5 commits May 29, 2026 20:48

Fix headless completion tests and aggregate limits

d253315

mikehorgan-dell merged commit a6fa203 into main May 30, 2026
6 checks passed

mikehorgan-dell deleted the fix/headless-premature-test-completion branch May 30, 2026 17:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix headless premature completion detection#123

Fix headless premature completion detection#123
mikehorgan-dell merged 5 commits into
mainfrom
fix/headless-premature-test-completion

mikehorgan-dell commented May 30, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mikehorgan-dell commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mikehorgan-dell commented May 30, 2026 •

edited

Loading