Supervisor keeps reporting stale heartbeats for orphaned worker directories #8786

@mwkang

Description

If a local worker directory is ever left behind (not removed by Container.cleanUpForRestart()), the supervisor keeps reporting its heartbeat to Nimbus forever — for a topology that no longer exists.

ReportWorkerHeartbeats builds the batch from SupervisorUtils.readWorkerHeartbeats(), which just lists every directory under the worker root (SupervisorUtils.supervisorWorkerIds()) and reads each LSWorkerHeartbeat. There's no filtering by current assignment or by staleness, so a leftover directory is reported as if it were a live worker.

Orphaned worker directories are only cleaned up once, in the ReadClusterState constructor (supervisor startup). The periodic sync loop (run()) never re-checks for detached workers. So once a directory is orphaned at runtime, it survives — and keeps being reported — until the next supervisor restart.
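To make the gap concrete, here is a minimal sketch of what a runtime check for detached workers could look like. It is not Storm's code: `OrphanedWorkerDirs`, `findOrphans`, and the `liveWorkerIds` parameter are hypothetical names, and it assumes each worker owns one directory under the worker root, named after its worker ID.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Storm): lists worker directories that no
// live worker claims, so a periodic task could hand them to the existing cleanup.
final class OrphanedWorkerDirs {
    private OrphanedWorkerDirs() {}

    /**
     * Returns the names of directories under workerRoot that are not in
     * liveWorkerIds, sorted for deterministic output.
     */
    static List<String> findOrphans(Path workerRoot, Set<String> liveWorkerIds)
            throws IOException {
        List<String> orphans = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(workerRoot)) {
            for (Path entry : entries) {
                if (!Files.isDirectory(entry)) {
                    continue; // only worker directories are of interest
                }
                String workerId = entry.getFileName().toString();
                if (!liveWorkerIds.contains(workerId)) {
                    orphans.add(workerId);
                }
            }
        }
        Collections.sort(orphans);
        return orphans;
    }
}
```

The sketch only identifies candidates. Deleting them, and deciding what counts as a live worker, would presumably stay with the supervisor's existing detached-worker cleanup.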

On the Nimbus side this shows up as repeated getTopologyHeartbeatTimeoutSecs -> tryReadTopoConf -> NotAliveException for the dead topology (logged as "Exception when getting heartbeat timeout" before STORM-4022). It's also a slow disk leak, since the directory is never reclaimed.

We hit this in production: a worker that shut down cleanly left its <worker-id>/heartbeats localstate behind, and the supervisor kept reporting it long after the topology was gone. STORM-4022 only silences the Nimbus log — it doesn't stop the bad heartbeats or reclaim the directory.

Suggested fix

  • Run the detached-worker cleanup (currently only in the ReadClusterState constructor) periodically from the sync loop, so orphaned directories are reclaimed at runtime; and/or
  • In ReportWorkerHeartbeats, skip heartbeats whose topology isn't currently assigned to this supervisor, or whose time_secs is older than the timeout.
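The second option could be sketched roughly as follows. This is an illustration under assumptions, not the actual ReportWorkerHeartbeats code: `LocalHeartbeat` is a hypothetical stand-in for the heartbeat fields that matter here, and `HeartbeatFilter.reportable`, `assignedTopologyIds`, `nowSecs`, and `timeoutSecs` are invented names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the heartbeat fields relevant to filtering.
final class LocalHeartbeat {
    final String workerId;
    final String topologyId;
    final int timeSecs; // time of the last heartbeat, in seconds

    LocalHeartbeat(String workerId, String topologyId, int timeSecs) {
        this.workerId = workerId;
        this.topologyId = topologyId;
        this.timeSecs = timeSecs;
    }
}

// Hypothetical filter (not part of Storm) applied before building the batch.
final class HeartbeatFilter {
    private HeartbeatFilter() {}

    /**
     * Keeps only heartbeats whose topology is currently assigned to this
     * supervisor and whose age does not exceed timeoutSecs.
     */
    static List<LocalHeartbeat> reportable(List<LocalHeartbeat> all,
                                           Set<String> assignedTopologyIds,
                                           int nowSecs,
                                           int timeoutSecs) {
        List<LocalHeartbeat> kept = new ArrayList<>();
        for (LocalHeartbeat hb : all) {
            boolean assigned = assignedTopologyIds.contains(hb.topologyId);
            boolean fresh = nowSecs - hb.timeSecs <= timeoutSecs;
            if (assigned && fresh) {
                kept.add(hb);
            }
        }
        return kept;
    }
}
```

Filtering alone would stop the bad reports but would not reclaim the directory, so it complements the periodic cleanup rather than replacing it.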
