
Workload canceled due to experiment timeout without explicit telemetry #696

@nchapagain001

Description

Situation

Some workloads (for example, long‑running benchmarks like SPECjbb) have expected runtimes that are close to or exceed common experiment timeout values.
When the experiment timeout is set too low (e.g., 2 hours), Virtual Client correctly enforces the timeout and terminates the workload before it completes.
This behavior is by design and VC is functioning correctly.

What makes this confusing

Although the workload is canceled intentionally:

  • The experiment and step may still appear as Succeeded
  • No workload metrics or logs are emitted
  • There is no explicit signal explaining that the workload was terminated due to timeout

From a user and downstream analytics perspective, this appears as:

“Successful experiment with missing data”

This leads to confusion, debugging churn, and false assumptions about data loss or ingestion bugs.

Why this is not user error

While users must configure timeouts longer than the expected workload runtime, there is currently no system feedback indicating that:

  • The workload was cut short
  • The reason was an enforced experiment timeout
  • The resulting data is incomplete by design

Without this signal, users cannot reliably distinguish:

  • Misconfiguration (timeout too short)
  • Real workload failures
  • Telemetry ingestion issues

Proposed fix (observability improvement)
Virtual Client should emit explicit telemetry when a workload is terminated due to experiment timeout, for example:

“Workload execution was canceled because Virtual Client enforced the experiment timeout.”

This should be surfaced as:

  • A clear log/trace event
  • A structured telemetry field (e.g., terminationReason = ExperimentTimeout)
  • Optionally, a reflection in step status or annotations
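The shape of the proposed signal can be sketched as follows. This is an illustrative Python sketch, not Virtual Client code: all names here (`run_with_timeout`, `emit_telemetry`, `TerminationReason`) are hypothetical stand-ins showing how a timeout cancellation could be caught and turned into a structured event rather than a silent exit.

```python
# Hypothetical sketch: emit structured telemetry when a workload is
# cancelled by an enforced experiment timeout. All names are illustrative,
# not actual Virtual Client APIs.
import enum
import json
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class TerminationReason(enum.Enum):
    COMPLETED = "Completed"
    EXPERIMENT_TIMEOUT = "ExperimentTimeout"


def emit_telemetry(event: str, **fields) -> str:
    """Stand-in for a structured trace sink; returns the JSON it would log."""
    return json.dumps({"event": event, **fields})


def run_with_timeout(workload, timeout_seconds: float) -> str:
    """Run the workload; on timeout, emit an explicit cancellation event."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(workload)
        try:
            future.result(timeout=timeout_seconds)
            return emit_telemetry(
                "WorkloadCompleted",
                terminationReason=TerminationReason.COMPLETED.value)
        except FutureTimeout:
            future.cancel()
            # The key change: an explicit, queryable reason instead of
            # a silent cancellation with no metrics.
            return emit_telemetry(
                "WorkloadCancelled",
                terminationReason=TerminationReason.EXPERIMENT_TIMEOUT.value,
                message="Workload execution was canceled because the "
                        "experiment timeout was enforced.")


# Example: a workload that outlives a 0.1-second "experiment timeout".
print(run_with_timeout(lambda: time.sleep(0.5), timeout_seconds=0.1))
```

With a field like `terminationReason` in place, dashboards and queries can filter on it directly instead of inferring timeouts from missing metrics.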

Key point

This is not a correctness bug in workload execution.
It is an observability and diagnosability gap.
The fix is not changing VC behavior, but making the reason for cancellation explicit and visible so users and dashboards can interpret results correctly.

===
As shown in the screenshot below, SPECjbb has a 2-hour timeout configured by the user, and the scenario runs for close to that long, but no metrics are produced. We need logs indicating that VC killed the process.
According to the [SPECjbb requirement](https://www.spec.org/jbb2015/docs/userguide.pdf), a run takes about 2 hours, so we are hitting exactly this edge case.

[screenshot]

```kusto
Traces
| where ExperimentId == "a0d3fd60-9144-4fde-aa3b-4526bd754aa0"
| where ProfileName == "PERF-SPECJBB.json"
| where * contains "error" or SeverityLevel > 2
| count // returns 0: no error or warning traces were emitted at all
```
