Supervisor keeps reporting stale heartbeats for orphaned worker directories

If a local worker directory is ever left behind (not removed by `Container.cleanUpForRestart()`), the supervisor keeps reporting its heartbeat to Nimbus forever — for a topology that no longer exists.

`ReportWorkerHeartbeats` builds the batch from `SupervisorUtils.readWorkerHeartbeats()`, which just lists every directory under the worker root (`SupervisorUtils.supervisorWorkerIds()`) and reads each `LSWorkerHeartbeat`. There's no filtering by current assignment or by staleness, so a leftover directory is reported as if it were a live worker.

Orphaned worker directories are only cleaned up once, in the `ReadClusterState` constructor (supervisor startup). The periodic sync loop (`run()`) never re-checks for detached workers. So once a directory is orphaned at runtime, it survives — and keeps being reported — until the next supervisor restart.

On the Nimbus side this shows up as repeated `getTopologyHeartbeatTimeoutSecs` -> `tryReadTopoConf` -> `NotAliveException` for the dead topology (logged as `Exception when getting heartbeat timeout` before STORM-4022). It's also a slow disk leak, since the directory is never reclaimed.

We hit this in production: a worker that shut down cleanly left its `<worker-id>/heartbeats` localstate behind, and the supervisor kept reporting it long after the topology was gone. STORM-4022 only silences the Nimbus log — it doesn't stop the bad heartbeats or reclaim the directory.

### Suggested fix
- Run the detached-worker cleanup (currently only in the `ReadClusterState` constructor) periodically from the sync loop, so orphaned directories are reclaimed at runtime; and/or
- In `ReportWorkerHeartbeats`, skip heartbeats whose topology isn't currently assigned to this supervisor, or whose `time_secs` is older than the timeout.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Supervisor keeps reporting stale heartbeats for orphaned worker directories #8786

Suggested fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Supervisor keeps reporting stale heartbeats for orphaned worker directories #8786

Description

Suggested fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions