SCHED-864 Create sansible docker image for handling the jail state#2108
Merged
asteny merged 5 commits intosoperator-release-2.0from Feb 3, 2026
Merged
SCHED-864 Create sansible docker image for handling the jail state#2108asteny merged 5 commits intosoperator-release-2.0from
asteny merged 5 commits intosoperator-release-2.0from
Conversation
4b00ea4 to
007fad1
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request introduces a new "sansible" Docker image to consolidate Ansible role management for jail state handling in the Slurm cluster. The PR updates multiple Docker base images from ml-containers repository to version 20260202/20260203, refactors the Ansible role structure to leverage shared roles from ml-containers, and updates configuration files to use full CUDA version strings instead of major versions.
Changes:
- Created new sansible Docker image that combines base Ansible roles from ml-containers with soperator-specific playbooks
- Updated Docker base image references across all dockerfiles to use ml-containers PR #53 versions
- Refactored CUDA version handling from major version (e.g., "12") to full version (e.g., "12.9.0") across Helm charts and build workflows
Reviewed changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| images/sansible/sansible.dockerfile | New Dockerfile for sansible image based on ml-containers ansible_roles |
| images/sansible/sansible_entrypoint.sh | Entrypoint script that bind-mounts jail filesystem resources before executing ansible playbooks |
| ansible/run.yml | Added environment variable configuration for CUDA versions and ansible connection settings; consolidated role list |
| ansible/roles/*/tasks/main.yml | Removed task files for roles now provided by ml-containers base image (cuda, nccl-tests, etc.) |
| helm/soperator-activechecks/values.yaml | Added sansible image reference, updated to full CUDA version, configured jail management checks to use sansible |
| helm/soperator-activechecks/templates/_helpers.tpl | Added logic to inject CUDA_VERSION and NCCL_TESTS_VERSION environment variables for ansible workloads |
| helm/slurm-cluster/values.yaml | Changed from cudaMajorVersion to cudaVersion with full version string |
| helm/slurm-cluster/templates/slurm-cluster-cr.yaml | Simplified populate-jail image resolution using full CUDA version |
| images/worker/slurmd.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| images/slurm_check_job/slurm_check_job.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| images/restd/slurmrestd.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| images/munge/munge.dockerfile | Updated base image to ml-containers/neubuntu:20260202145316 |
| images/login/sshd.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| images/k8s_check_job/k8s_check_job.dockerfile | Updated base image to ml-containers/neubuntu:20260202145316, removed ansible playbook copy |
| images/jail/jail.dockerfile | Updated base image to ml-containers/slurm_training_diag:20260202154154 |
| images/controller/slurmctld.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| images/accounting/slurmdbd.dockerfile | Updated base image to ml-containers/slurm:20260202145952 |
| helm/slurm-cluster-storage/templates/jail-mount-daemonset.yaml | Updated neubuntu image to 20260202145316 |
| Makefile | Added sansible image sync, removed separate CUDA version variables |
| .github/workflows/one_job.yml | Added sansible image build step, simplified CUDA version array |
| .github/workflows/ansible_lint.yml | Changed to lint entire ansible directory instead of individual files |
| .config/ansible-lint.yml | Excluded ansible/run.yml from linting |
| README.md | Updated Slurm version documentation from 25.05.4 to 25.05.5 |
| ansible/nc-health-checker.yml | Updated hosts configuration to include 'all' in addition to 'localhost' |
Comments suppressed due to low confidence (1)
ansible/run.yml:33
- The role 'dcgmi' is listed twice in the roles list (lines 28 and 33). This duplication is unnecessary and should be removed to avoid redundant execution of the same role.
- dcgmi
- perftest
- nvtop
- docker-cli
- openmpi
- dcgmi
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
theyoprst
approved these changes
Feb 3, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
We need all Ansible role in a one place for handling the jail state
Solution
Build sansible image with roles from ml-containers (base roles) + soperator repo
Testing
Release Notes