Skip to content

SCHED-864 Create sansible docker image for handling the jail state#2108

Merged
asteny merged 5 commits intosoperator-release-2.0from
SCHED864
Feb 3, 2026
Merged

SCHED-864 Create sansible docker image for handling the jail state#2108
asteny merged 5 commits intosoperator-release-2.0from
SCHED864

Conversation

@asteny
Copy link
Collaborator

@asteny asteny commented Jan 29, 2026

Problem

We need all Ansible role in a one place for handling the jail state

Solution

Build sansible image with roles from ml-containers (base roles) + soperator repo

Testing

Release Notes

@asteny asteny added the feature label Jan 29, 2026
@asteny asteny force-pushed the SCHED864 branch 2 times, most recently from 4b00ea4 to 007fad1 Compare January 30, 2026 16:16
@asteny asteny marked this pull request as ready for review February 3, 2026 11:46
@Uburro Uburro requested a review from Copilot February 3, 2026 12:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a new "sansible" Docker image to consolidate Ansible role management for jail state handling in the Slurm cluster. The PR updates multiple Docker base images from ml-containers repository to version 20260202/20260203, refactors the Ansible role structure to leverage shared roles from ml-containers, and updates configuration files to use full CUDA version strings instead of major versions.

Changes:

  • Created new sansible Docker image that combines base Ansible roles from ml-containers with soperator-specific playbooks
  • Updated Docker base image references across all dockerfiles to use ml-containers PR #53 versions
  • Refactored CUDA version handling from major version (e.g., "12") to full version (e.g., "12.9.0") across Helm charts and build workflows

Reviewed changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
images/sansible/sansible.dockerfile New Dockerfile for sansible image based on ml-containers ansible_roles
images/sansible/sansible_entrypoint.sh Entrypoint script that bind-mounts jail filesystem resources before executing ansible playbooks
ansible/run.yml Added environment variable configuration for CUDA versions and ansible connection settings; consolidated role list
ansible/roles/*/tasks/main.yml Removed task files for roles now provided by ml-containers base image (cuda, nccl-tests, etc.)
helm/soperator-activechecks/values.yaml Added sansible image reference, updated to full CUDA version, configured jail management checks to use sansible
helm/soperator-activechecks/templates/_helpers.tpl Added logic to inject CUDA_VERSION and NCCL_TESTS_VERSION environment variables for ansible workloads
helm/slurm-cluster/values.yaml Changed from cudaMajorVersion to cudaVersion with full version string
helm/slurm-cluster/templates/slurm-cluster-cr.yaml Simplified populate-jail image resolution using full CUDA version
images/worker/slurmd.dockerfile Updated base image to ml-containers/slurm:20260202145952
images/slurm_check_job/slurm_check_job.dockerfile Updated base image to ml-containers/slurm:20260202145952
images/restd/slurmrestd.dockerfile Updated base image to ml-containers/slurm:20260202145952
images/munge/munge.dockerfile Updated base image to ml-containers/neubuntu:20260202145316
images/login/sshd.dockerfile Updated base image to ml-containers/slurm:20260202145952
images/k8s_check_job/k8s_check_job.dockerfile Updated base image to ml-containers/neubuntu:20260202145316, removed ansible playbook copy
images/jail/jail.dockerfile Updated base image to ml-containers/slurm_training_diag:20260202154154
images/controller/slurmctld.dockerfile Updated base image to ml-containers/slurm:20260202145952
images/accounting/slurmdbd.dockerfile Updated base image to ml-containers/slurm:20260202145952
helm/slurm-cluster-storage/templates/jail-mount-daemonset.yaml Updated neubuntu image to 20260202145316
Makefile Added sansible image sync, removed separate CUDA version variables
.github/workflows/one_job.yml Added sansible image build step, simplified CUDA version array
.github/workflows/ansible_lint.yml Changed to lint entire ansible directory instead of individual files
.config/ansible-lint.yml Excluded ansible/run.yml from linting
README.md Updated Slurm version documentation from 25.05.4 to 25.05.5
ansible/nc-health-checker.yml Updated hosts configuration to include 'all' in addition to 'localhost'
Comments suppressed due to low confidence (1)

ansible/run.yml:33

  • The role 'dcgmi' is listed twice in the roles list (lines 28 and 33). This duplication is unnecessary and should be removed to avoid redundant execution of the same role.
    - dcgmi
    - perftest
    - nvtop
    - docker-cli
    - openmpi
    - dcgmi

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@asteny asteny merged commit 1e9ff3f into soperator-release-2.0 Feb 3, 2026
10 checks passed
@asteny asteny deleted the SCHED864 branch February 3, 2026 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants