Summary
When PyBNF runs across multiple machines on a computing cluster, it uses the Dask library to start a scheduler and a set of worker processes. At two points, PyBNF waits a hardcoded 10 seconds instead of checking whether the cluster is actually ready. These fixed waits should be replaced with checks that confirm the cluster has finished starting up (and, separately, finished shutting down). Making this change safely requires testing against a real multi-machine cluster, which is not currently available to us.
Where in the code
pybnf/cluster.py, in Cluster.setup_cluster: after launching the dask-ssh process that starts the scheduler and workers, the code runs time.sleep(10) and then assumes the cluster is up and connects to it.
pybnf/pybnf.py, in _teardown_cluster: after asking the cluster to shut down, the code runs time.sleep(10) before continuing.
Why this is a problem
A fixed 10-second wait is a guess, not a guarantee:
- On a slow or heavily loaded cluster, 10 seconds may not be enough time for all the workers to connect. PyBNF then proceeds anyway and can fail, or run with fewer workers than expected — often with a confusing downstream error instead of a clear "the cluster isn't ready yet" message.
- On a fast cluster, 10 seconds is usually longer than necessary, so every run pays an avoidable delay at startup and again at shutdown.
What should be done
Replace each fixed wait with an explicit check:
- Startup: wait until the scheduler is reachable and the expected number of workers have connected, up to a configurable time limit. If the limit is reached, stop with a clear error explaining that the cluster did not start.
- Shutdown: wait until the cluster has actually stopped (or a time limit elapses), rather than sleeping a fixed amount.
Why this is still open
This code runs only when starting and stopping a real distributed cluster. It cannot be meaningfully verified by the automated tests that run in continuous integration: those tests use stand-in processes and never launch a real cluster. Confirming that the new readiness checks behave correctly — including the slow-startup and failed-startup cases — requires access to an actual multi-machine cluster (for example, a SLURM job allocation). Until that testing is possible, the fixed waits remain in place as a safe but crude default.
Acceptance criteria
- The two
time.sleep(10) calls are replaced with startup/shutdown checks that have a configurable time limit.
- A failed or slow cluster startup produces a clear, actionable error message.
- The new behavior is verified on a real multi-machine Dask cluster, not only with mocked unit tests.
Summary
When PyBNF runs across multiple machines on a computing cluster, it uses the Dask library to start a scheduler and a set of worker processes. At two points, PyBNF waits a hardcoded 10 seconds instead of checking whether the cluster is actually ready. These fixed waits should be replaced with checks that confirm the cluster has finished starting up (and, separately, finished shutting down). Making this change safely requires testing against a real multi-machine cluster, which is not currently available to us.
Where in the code
pybnf/cluster.py, inCluster.setup_cluster: after launching thedask-sshprocess that starts the scheduler and workers, the code runstime.sleep(10)and then assumes the cluster is up and connects to it.pybnf/pybnf.py, in_teardown_cluster: after asking the cluster to shut down, the code runstime.sleep(10)before continuing.Why this is a problem
A fixed 10-second wait is a guess, not a guarantee:
What should be done
Replace each fixed wait with an explicit check:
Why this is still open
This code runs only when starting and stopping a real distributed cluster. It cannot be meaningfully verified by the automated tests that run in continuous integration: those tests use stand-in processes and never launch a real cluster. Confirming that the new readiness checks behave correctly — including the slow-startup and failed-startup cases — requires access to an actual multi-machine cluster (for example, a SLURM job allocation). Until that testing is possible, the fixed waits remain in place as a safe but crude default.
Acceptance criteria
time.sleep(10)calls are replaced with startup/shutdown checks that have a configurable time limit.