Skip to content

Replace fixed 10-second waits during distributed-cluster startup and shutdown with real readiness checks #398

@wshlavacek

Description

@wshlavacek

Summary

When PyBNF runs across multiple machines on a computing cluster, it uses the Dask library to start a scheduler and a set of worker processes. At two points, PyBNF waits a hardcoded 10 seconds instead of checking whether the cluster is actually ready. These fixed waits should be replaced with checks that confirm the cluster has finished starting up (and, separately, finished shutting down). Making this change safely requires testing against a real multi-machine cluster, which is not currently available to us.

Where in the code

  • pybnf/cluster.py, in Cluster.setup_cluster: after launching the dask-ssh process that starts the scheduler and workers, the code runs time.sleep(10) and then assumes the cluster is up and connects to it.
  • pybnf/pybnf.py, in _teardown_cluster: after asking the cluster to shut down, the code runs time.sleep(10) before continuing.

Why this is a problem

A fixed 10-second wait is a guess, not a guarantee:

  • On a slow or heavily loaded cluster, 10 seconds may not be enough time for all the workers to connect. PyBNF then proceeds anyway and can fail, or run with fewer workers than expected — often with a confusing downstream error instead of a clear "the cluster isn't ready yet" message.
  • On a fast cluster, 10 seconds is usually longer than necessary, so every run pays an avoidable delay at startup and again at shutdown.

What should be done

Replace each fixed wait with an explicit check:

  • Startup: wait until the scheduler is reachable and the expected number of workers have connected, up to a configurable time limit. If the limit is reached, stop with a clear error explaining that the cluster did not start.
  • Shutdown: wait until the cluster has actually stopped (or a time limit elapses), rather than sleeping a fixed amount.

Why this is still open

This code runs only when starting and stopping a real distributed cluster. It cannot be meaningfully verified by the automated tests that run in continuous integration: those tests use stand-in processes and never launch a real cluster. Confirming that the new readiness checks behave correctly — including the slow-startup and failed-startup cases — requires access to an actual multi-machine cluster (for example, a SLURM job allocation). Until that testing is possible, the fixed waits remain in place as a safe but crude default.

Acceptance criteria

  • The two time.sleep(10) calls are replaced with startup/shutdown checks that have a configurable time limit.
  • A failed or slow cluster startup produces a clear, actionable error message.
  • The new behavior is verified on a real multi-machine Dask cluster, not only with mocked unit tests.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions