What exactly should happen when the health check script fails? Should the task just be rescheduled somewhere else, or should the whole worker be terminated?
Hello,
would it be possible (and of interest) to implement support for a user-defined worker health check?
High level goal:
`--max-fails` budget)
Proposed usage:
where
`health-check_script.sh` is an executable/script expected to return 0 on a healthy node.
`health-check_script.sh` would be executed by the hq worker before the start of every(?) job.
It is up to the user to decide what to check. I believe [availability of filesystems, IB interfaces, `/dev/nvidia*`, Kerberos tickets, memory availability ...] could be subject to checks.
Motivation:
Recently, we faced a Kerberos-related glitch at Metacentrum, and despite having healthy nodes at another cluster, the job's `--max-fails` amount was exhausted on faulty nodes (= faulty nodes ate all our jobs 😢).
To be discussed:
Does it make sense at all, or is this already up to the underlying scheduler/manager?
When to call the health check: at job start, periodically, or let the user decide?
Instead of a health check, wouldn't it be better to create a more generic job prolog?
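To make the idea of a user-defined check more concrete, here is a minimal sketch of what such a script could look like, written in Python rather than shell. Every function name, path, and threshold below is a hypothetical chosen for illustration; nothing here is part of HyperQueue. The only contract taken from the proposal is "return 0 on a healthy node".

```python
import glob
import os
import shutil
import subprocess
import sys


def filesystem_ok(path):
    """True if `path` is an existing, writable directory."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def gpus_ok(pattern="/dev/nvidia*", expected=1):
    """True if at least `expected` device nodes match `pattern`."""
    return len(glob.glob(pattern)) >= expected


def kerberos_ok():
    """True if `klist -s` reports a readable, unexpired ticket cache."""
    if shutil.which("klist") is None:
        return False
    return subprocess.run(["klist", "-s"]).returncode == 0


def memory_ok(min_available_kib, meminfo="/proc/meminfo"):
    """True if MemAvailable (Linux only) is at least `min_available_kib`."""
    with open(meminfo) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) >= min_available_kib
    return False


def run_checks(checks):
    """Run named checks; return 0 if all pass, 1 otherwise."""
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"health check failed: {name}", file=sys.stderr)
    return 1 if failed else 0


def main():
    # Paths and limits are assumptions; adjust them to the site.
    return run_checks({
        "scratch": lambda: filesystem_ok("/scratch"),
        "gpu": lambda: gpus_ok(expected=1),
        "kerberos": kerberos_ok,
        "memory": lambda: memory_ok(4 * 1024 * 1024),
    })
```

To use this as a standalone script, the file would end with `raise SystemExit(main())` so that the process exit code carries the result.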
Inspiration
I'm coming from a Slurm environment, so indeed I had `HealthCheckProgram` in mind. [1]
[1] https://slurm.schedmd.com/SUG14/node_health_check.pdf
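As for what a failed check could do in the "before the start of every job" variant, one possible behaviour is sketched below: the worker runs the check command, and on a non-zero exit code it hands the task back instead of starting it, so the failure would not be charged to the task. This is only an illustration of the idea. `node_is_healthy`, `start_or_requeue`, and the `run_task`/`requeue` callbacks are hypothetical names, not HyperQueue APIs.

```python
import subprocess


def node_is_healthy(command, timeout_s=30.0):
    """Run the user-supplied check; only exit code 0 counts as healthy.

    A missing or non-executable script, or a check that hangs past
    `timeout_s`, is treated as unhealthy.
    """
    try:
        result = subprocess.run(command, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def start_or_requeue(task, command, run_task, requeue):
    """Gate a task on the health check.

    On an unhealthy node the task is passed to `requeue` without being
    run, so it never gets the chance to crash on this node.
    """
    if node_is_healthy(command):
        run_task(task)
        return "started"
    requeue(task)
    return "requeued"
```

With this shape, `command` would be something like `["./health-check_script.sh"]`. Whether the worker should also stop accepting tasks after a failed check is a separate policy decision that the sketch leaves open.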