If a local worker directory is ever left behind (not removed by Container.cleanUpForRestart()), the supervisor keeps reporting its heartbeat to Nimbus forever — for a topology that no longer exists.
ReportWorkerHeartbeats builds the batch from SupervisorUtils.readWorkerHeartbeats(), which just lists every directory under the worker root (SupervisorUtils.supervisorWorkerIds()) and reads each LSWorkerHeartbeat. There's no filtering by current assignment or by staleness, so a leftover directory is reported as if it were a live worker.
Orphaned worker directories are only cleaned up once, in the ReadClusterState constructor (supervisor startup). The periodic sync loop (run()) never re-checks for detached workers. So once a directory is orphaned at runtime, it survives — and keeps being reported — until the next supervisor restart.
On the Nimbus side this shows up as repeated getTopologyHeartbeatTimeoutSecs -> tryReadTopoConf -> NotAliveException for the dead topology (logged as Exception when getting heartbeat timeout before STORM-4022). It's also a slow disk leak, since the directory is never reclaimed.
We hit this in production: a worker that shut down cleanly left its <worker-id>/heartbeats localstate behind, and the supervisor kept reporting it long after the topology was gone. STORM-4022 only silences the Nimbus log — it doesn't stop the bad heartbeats or reclaim the directory.
Suggested fix
- Run the detached-worker cleanup (currently only in the
ReadClusterState constructor) periodically from the sync loop, so orphaned directories are reclaimed at runtime; and/or
- In
ReportWorkerHeartbeats, skip heartbeats whose topology isn't currently assigned to this supervisor, or whose time_secs is older than the timeout.
If a local worker directory is ever left behind (not removed by
Container.cleanUpForRestart()), the supervisor keeps reporting its heartbeat to Nimbus forever — for a topology that no longer exists.ReportWorkerHeartbeatsbuilds the batch fromSupervisorUtils.readWorkerHeartbeats(), which just lists every directory under the worker root (SupervisorUtils.supervisorWorkerIds()) and reads eachLSWorkerHeartbeat. There's no filtering by current assignment or by staleness, so a leftover directory is reported as if it were a live worker.Orphaned worker directories are only cleaned up once, in the
ReadClusterStateconstructor (supervisor startup). The periodic sync loop (run()) never re-checks for detached workers. So once a directory is orphaned at runtime, it survives — and keeps being reported — until the next supervisor restart.On the Nimbus side this shows up as repeated
getTopologyHeartbeatTimeoutSecs->tryReadTopoConf->NotAliveExceptionfor the dead topology (logged asException when getting heartbeat timeoutbefore STORM-4022). It's also a slow disk leak, since the directory is never reclaimed.We hit this in production: a worker that shut down cleanly left its
<worker-id>/heartbeatslocalstate behind, and the supervisor kept reporting it long after the topology was gone. STORM-4022 only silences the Nimbus log — it doesn't stop the bad heartbeats or reclaim the directory.Suggested fix
ReadClusterStateconstructor) periodically from the sync loop, so orphaned directories are reclaimed at runtime; and/orReportWorkerHeartbeats, skip heartbeats whose topology isn't currently assigned to this supervisor, or whosetime_secsis older than the timeout.