Replace fixed 10-second waits during distributed-cluster startup and shutdown with real readiness checks

## Summary
When PyBNF runs across multiple machines on a computing cluster, it uses the Dask library to start a scheduler and a set of worker processes. At two points, PyBNF waits a hardcoded 10 seconds instead of checking whether the cluster is actually ready. These fixed waits should be replaced with checks that confirm the cluster has finished starting up (and, separately, finished shutting down). Making this change safely requires testing against a real multi-machine cluster, which is not currently available to us.

## Where in the code
- `pybnf/cluster.py`, in `Cluster.setup_cluster`: after launching the `dask-ssh` process that starts the scheduler and workers, the code runs `time.sleep(10)` and then assumes the cluster is up and connects to it.
- `pybnf/pybnf.py`, in `_teardown_cluster`: after asking the cluster to shut down, the code runs `time.sleep(10)` before continuing.

## Why this is a problem
A fixed 10-second wait is a guess, not a guarantee:
- On a slow or heavily loaded cluster, 10 seconds may not be enough time for all the workers to connect. PyBNF then proceeds anyway and can fail, or run with fewer workers than expected — often with a confusing downstream error instead of a clear "the cluster isn't ready yet" message.
- On a fast cluster, 10 seconds is usually longer than necessary, so every run pays an avoidable delay at startup and again at shutdown.

## What should be done
Replace each fixed wait with an explicit check:
- **Startup:** wait until the scheduler is reachable and the expected number of workers have connected, up to a configurable time limit. If the limit is reached, stop with a clear error explaining that the cluster did not start.
- **Shutdown:** wait until the cluster has actually stopped (or a time limit elapses), rather than sleeping a fixed amount.

## Why this is still open
This code runs only when starting and stopping a real distributed cluster. It cannot be meaningfully verified by the automated tests that run in continuous integration: those tests use stand-in processes and never launch a real cluster. Confirming that the new readiness checks behave correctly — including the slow-startup and failed-startup cases — requires access to an actual multi-machine cluster (for example, a SLURM job allocation). Until that testing is possible, the fixed waits remain in place as a safe but crude default.

## Acceptance criteria
- The two `time.sleep(10)` calls are replaced with startup/shutdown checks that have a configurable time limit.
- A failed or slow cluster startup produces a clear, actionable error message.
- The new behavior is verified on a real multi-machine Dask cluster, not only with mocked unit tests.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace fixed 10-second waits during distributed-cluster startup and shutdown with real readiness checks #398

Summary

Where in the code

Why this is a problem

What should be done

Why this is still open

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Replace fixed 10-second waits during distributed-cluster startup and shutdown with real readiness checks #398

Description

Summary

Where in the code

Why this is a problem

What should be done

Why this is still open

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions