ci: Enable GB200 runners #2017
base: main
    @@ -58,6 +58,13 @@ inputs:
         description: "Whether this is a pull request from a fork"
         required: false
         default: "false"
    +  registry:
    +    description: "Registry to use for test"
    +    required: false
    +  test_data_path:
    +    description: "Test data path"
    +    required: false
    +    default: "/mnt/datadrive/TestData"
       image-tag:
         description: "Override container image tag. If set, infers FAST=1 and prefetches venvs + regenerates fingerprint at startup."
         required: false
    @@ -72,73 +79,12 @@ runs:
           run: |
             curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
     
    -    - name: Azure Login
    -      if: ${{ inputs.has-azure-credentials == 'true' }}
    -      uses: azure/login@v2
    -      with:
    -        client-id: ${{ inputs.azure-client-id }}
    -        tenant-id: ${{ inputs.azure-tenant-id }}
    -        subscription-id: ${{ inputs.azure-subscription-id }}
    -
    -    - name: Azure ACR Login
    -      if: ${{ inputs.has-azure-credentials == 'true' }}
    -      shell: bash
    -      run: |
    -        az acr login --name nemoci
    -
    -    - name: Azure Fileshare
    -      if: ${{ inputs.has-azure-credentials == 'true' && inputs.is_unit_test == 'false' && inputs.is_doc_test == 'false' }}
    -      shell: bash
    -      id: azure-fileshare
    +    - name: Install uuidgen
    +      shell: bash -x -e -u -o pipefail {0}
    +      if: ${{ contains(inputs.runner, 'gcp') }}
           run: |
    -        sudo apt update
    -        sudo apt install -y cifs-utils
    -
    -        RESOURCE_GROUP_NAME="azure-gpu-vm-runner_group"
    -        STORAGE_ACCOUNT_NAME="nemocistorageaccount2"
    -        FILE_SHARE_NAME="fileshare"
    -
    -        MNT_ROOT="/media"
    -        MNT_PATH="$MNT_ROOT/$STORAGE_ACCOUNT_NAME/$FILE_SHARE_NAME"
    -
    -        echo "MNT_PATH=$MNT_PATH" | tee -a "$GITHUB_OUTPUT"
    -
    -        sudo mkdir -p $MNT_PATH
    -
    -        # Create a folder to store the credentials for this storage account and
    -        # any other that you might set up.
    -        CREDENTIAL_ROOT="/etc/smbcredentials"
    -        sudo mkdir -p "/etc/smbcredentials"
    -
    -        # Get the storage account key for the indicated storage account.
    -        # You must be logged in with az login and your user identity must have
    -        # permissions to list the storage account keys for this command to work.
    -        STORAGE_ACCOUNT_KEY=$(az storage account keys list \
    -          --resource-group $RESOURCE_GROUP_NAME \
    -          --account-name $STORAGE_ACCOUNT_NAME \
    -          --query "[0].value" --output tsv | tr -d '"')
    -
    -        # Create the credential file for this individual storage account
    -        SMB_CREDENTIAL_FILE="$CREDENTIAL_ROOT/$STORAGE_ACCOUNT_NAME.cred"
    -        if [ ! -f $SMB_CREDENTIAL_FILE ]; then
    -          echo "username=$STORAGE_ACCOUNT_NAME" | sudo tee $SMB_CREDENTIAL_FILE > /dev/null
    -          echo "password=$STORAGE_ACCOUNT_KEY" | sudo tee -a $SMB_CREDENTIAL_FILE > /dev/null
    -        else
    -          echo "The credential file $SMB_CREDENTIAL_FILE already exists, and was not modified."
    -        fi
    -
    -        # Change permissions on the credential file so only root can read or modify the password file.
    -        sudo chmod 600 $SMB_CREDENTIAL_FILE
    -
    -        # This command assumes you have logged in with az login
    -        HTTP_ENDPOINT=$(az storage account show --resource-group $RESOURCE_GROUP_NAME --name $STORAGE_ACCOUNT_NAME --query "primaryEndpoints.file" --output tsv | tr -d '"')
    -        SMB_PATH=$(echo $HTTP_ENDPOINT | cut -c7-${#HTTP_ENDPOINT})$FILE_SHARE_NAME
    -
    -        STORAGE_ACCOUNT_KEY=$(az storage account keys list --resource-group $RESOURCE_GROUP_NAME --account-name $STORAGE_ACCOUNT_NAME --query "[0].value" --output tsv | tr -d '"')
    -
    -        sudo mount -t cifs $SMB_PATH $MNT_PATH -o credentials=$SMB_CREDENTIAL_FILE,serverino,nosharesock,actimeo=30,mfsymlinks
    -
    -        ls -al $MNT_PATH/TestData
    +        apt-get update
    +        apt-get install -y uuid-runtime

Comment on lines +86 to +87

Contributor

Use elevated apt commands for runner compatibility.

Line 86 and Line 87 use apt-get without sudo.

Suggested change:

    - apt-get update
    - apt-get install -y uuid-runtime
    + sudo apt-get update
    + sudo apt-get install -y uuid-runtime

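
The suggestion above hardcodes `sudo`. A possible alternative, sketched here under the assumption that some runners execute steps as root and may not ship `sudo` at all, is to elevate only when the current user is not uid 0. The helper names `elevation_prefix` and `install_commands` are hypothetical, and the sketch prints the commands instead of running them.

```shell
set -eu

# Print the prefix needed for a privileged command: empty for root (uid 0),
# "sudo" otherwise. The uid defaults to the current user's.
elevation_prefix() {
  uid="${1:-$(id -u)}"
  if [ "$uid" -eq 0 ]; then
    printf ''
  else
    printf 'sudo'
  fi
}

# Print (rather than run) the install commands so the sketch is safe to execute.
install_commands() {
  prefix="$(elevation_prefix "${1:-}")"
  printf '%s\n' \
    "${prefix:+$prefix }apt-get update" \
    "${prefix:+$prefix }apt-get install -y uuid-runtime"
}

install_commands
```

Passing an explicit uid (for example `install_commands 0`) shows what a root runner would execute without changing the real environment.
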
     
         - name: Docker system cleanup
           shell: bash
    @@ -148,7 +94,7 @@ runs:
         - name: Docker pull image
           shell: bash
           run: |
    -        docker pull nemoci.azurecr.io/${{ inputs.image }}:${{ inputs.image-tag || github.run_id }}
    +        docker pull ${{ inputs.registry }}/${{ inputs.image }}:${{ inputs.image-tag || github.run_id }}
     
         - name: Create UUID
           id: uuid
    @@ -183,11 +129,11 @@ runs:
               ${{ inputs.image-tag != '' && '--env FAST=1' || '' }} \
               --volume $(pwd)/${{ github.run_id }}/${{steps.uuid.outputs.id }}/nemo-rl:/opt/nemo-rl \
               --volume $GITHUB_ACTION_DIR:$GITHUB_ACTION_DIR \
    -          --volume /mnt/datadrive/TestData/nemo-rl/datasets:/opt/nemo-rl/datasets:ro \
    -          --volume /mnt/datadrive/TestData/nemo-rl/checkpoints:/home/TestData/nemo-rl/checkpoints:ro \
    -          --volume /mnt/datadrive/TestData/nemo-rl/hf_home/hub:/home/TestData/nemo-rl/hf_home/hub \
    -          --volume /mnt/datadrive/TestData/nemo-rl/hf_datasets_cache:/home/TestData/nemo-rl/hf_datasets_cache \
    -          nemoci.azurecr.io/${{ inputs.image }}:${{ inputs.image-tag || github.run_id }} bash -eux -o pipefail -c '\
    +          --volume ${{ inputs.test_data_path }}/nemo-rl/datasets:/opt/nemo-rl/datasets:ro \
    +          --volume ${{ inputs.test_data_path }}/nemo-rl/checkpoints:/home/TestData/nemo-rl/checkpoints:ro \
    +          --volume ${{ inputs.test_data_path }}/nemo-rl/hf_home/hub:/home/TestData/nemo-rl/hf_home/hub \
    +          --volume ${{ inputs.test_data_path }}/nemo-rl/hf_datasets_cache:/home/TestData/nemo-rl/hf_datasets_cache \
    +          ${{ inputs.registry }}/${{ inputs.image }}:${{ inputs.image-tag || github.run_id }} bash -eux -o pipefail -c '\
               git config --global --add safe.directory /opt/nemo-rl
               # This is needed since we create virtualenvs in the workspace, so this allows it to be cleaned up if necessary
               umask 000
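
To make the effect of the new `test_data_path` input concrete, the sketch below expands the four data mounts from the hunk above for a given root. It is only an illustration: the function name `volume_args` and the `/data/ci` root are hypothetical, and falling back to `/mnt/datadrive/TestData` on an empty argument is a choice made in this sketch, not confirmed behavior of the action.

```shell
set -eu

# Print one --volume argument per line for a given test data root.
# An empty or missing argument falls back to the default declared in the action inputs.
volume_args() {
  root="${1:-/mnt/datadrive/TestData}"
  printf '%s\n' \
    "--volume ${root}/nemo-rl/datasets:/opt/nemo-rl/datasets:ro" \
    "--volume ${root}/nemo-rl/checkpoints:/home/TestData/nemo-rl/checkpoints:ro" \
    "--volume ${root}/nemo-rl/hf_home/hub:/home/TestData/nemo-rl/hf_home/hub" \
    "--volume ${root}/nemo-rl/hf_datasets_cache:/home/TestData/nemo-rl/hf_datasets_cache"
}

# Hypothetical non-default root, as a different runner pool might supply.
volume_args "/data/ci"
```

Only the host side of each mount changes; the container-side paths stay fixed, so test scripts inside the container should not need to know which runner pool they landed on.
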
    @@ -14,13 +14,10 @@
     name: "CICD NeMo RL"
     
     on:
    -  pull_request:
    +  push:
         branches:
    -      - "main"
    -      - "r**"
    -    types: [labeled, opened, synchronize, reopened]
    -  merge_group:
    -    types: [checks_requested]
    +      - main
    +      - "pull-request/[0-9]+"

Contributor

Analysis chain

Script executed: find . -name "cicd-main.yml" -type f
Repository: NVIDIA-NeMo/RL. Length of output: 91

Script executed: cat -n .github/workflows/cicd-main.yml | head -30
Repository: NVIDIA-NeMo/RL. Length of output: 1155

Web query result, example:

    on:
      push:
        branches:
          - "releases/**"
          - "!releases/**-alpha"

Source: GitHub Docs, Workflow syntax for GitHub Actions

Fix push branch pattern: glob patterns don't interpret [...]. Line 20 uses "pull-request/[0-9]+".

Suggested change:

    - - "pull-request/[0-9]+"
    + - "pull-request/**"

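
Before swapping the pattern, it may help to compare how broadly each candidate matches. The sketch below approximates both filters with POSIX extended regular expressions, assuming `[0-9]+` is read as "one or more digits" and `**` as "any characters". It does not invoke GitHub's own matcher, so the result should be checked against the filter pattern cheat sheet in the workflow syntax documentation, which lists `+` as matching one or more of the preceding character.

```shell
set -eu

# ERE approximations of the two candidate branch filters
# (an assumption for illustration, not GitHub's matcher).
DIGITS_ONLY='^pull-request/[0-9]+$'
ANY_SUFFIX='^pull-request/.*$'

# usage: matches <regex> <branch>; prints yes or no
matches() {
  if printf '%s\n' "$2" | grep -Eq "$1"; then echo yes; else echo no; fi
}

for branch in pull-request/2017 pull-request/feature/x main; do
  printf '%-26s digits-only=%s any-suffix=%s\n' "$branch" \
    "$(matches "$DIGITS_ONLY" "$branch")" "$(matches "$ANY_SUFFIX" "$branch")"
done
```

Under these assumptions the `**` form is strictly broader: it would also trigger on branches such as `pull-request/feature/x`, which may or may not be intended.
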
       schedule:
         - cron: "0 9 * * *"
       workflow_dispatch:
    @@ -128,6 +125,18 @@ jobs:
               fi
               echo "image_tag=$IMAGE_TAG" | tee -a "$GITHUB_OUTPUT"
     
    +  org-member-pre-flight:
    +    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@fd82c6b23b5987d226f00d0719560f6e91210021
    +    with:
    +      default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }}
    +      non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
    +      default_test_data_path: ${{ vars.DEFAULT_TEST_DATA_PATH }}
    +      non_nvidia_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }}
    +      default_registry: ${{ vars.DEFAULT_CONTAINER_REGISTRY }}
    +      non_nvidia_registry: ${{ vars.NON_NVIDIA_CONTAINER_REGISTRY }}
    +    secrets:
    +      NVIDIA_MANAGEMENT_ORG_PAT: ${{ secrets.NVIDIA_MANAGEMENT_ORG_PAT }}
    +

Comment on lines +128 to +139

Contributor

Include org-member-pre-flight in the CI_QA_Gate dependencies.

This new job now provides core runner/registry/data-path orchestration, but CI_QA_Gate neither lists it under needs nor checks its result (see the suggested change).

Suggested change:

      CI_QA_Gate:
    @@
        needs:
          - pre-flight
    +     - org-member-pre-flight
          - pr-branch-up-to-date-check
          - lint-check
    @@
          ALL_SUCCESS: >-
            ${{
    +         needs.org-member-pre-flight.result == 'success' &&
              needs.lint-check.result == 'success' &&
              (needs.pr-branch-up-to-date-check.result == 'success' || needs.pr-branch-up-to-date-check.result == 'skipped') &&

       pr-branch-up-to-date-check:
         name: Check if PR branch is up to date
         needs: [pre-flight]
    @@ -227,14 +236,16 @@ jobs:
     
       build-container:
         if: ${{ needs.pre-flight.outputs.test_level != 'none' && needs.pre-flight.outputs.image_tag == '' }}
    -    needs: [pre-flight]
    -    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_container.yml@v0.52.0
    +    needs: [pre-flight, org-member-pre-flight]
    +    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_build_container.yml@44284233576b11eb867ae55ac41fb291debc414d
         with:
           build-ref: ${{ github.sha }}
    -      image-name: nemo_rl_container
    +      image-name: ${{ vars.CI_CONTAINER_NAME }}
           dockerfile: docker/Dockerfile
    -      image-label: nemo-rl
    +      runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
    +      image-label: ${{ vars.CI_CONTAINER_NAME }}
           target: release
    +      registry: ${{ needs.org-member-pre-flight.outputs.registry }}
           build-contexts: |
             nemo-rl=${{ github.run_id }}/
           build-args: |
    @@ -247,8 +258,8 @@ jobs:
           matrix:
             include:
               - script: Docs_Tests
    -            runner: self-hosted-azure
    -    needs: [pre-flight, build-container]
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
    +    needs: [pre-flight, build-container, org-member-pre-flight]
         if: ${{ contains('docs L0 L1 L2', needs.pre-flight.outputs.test_level) }}
         runs-on: ${{ matrix.runner }}
         name: ${{ matrix.is_optional && 'PLEASEFIXME_' || '' }}${{ matrix.script }}
    @@ -260,6 +271,9 @@ jobs:
             uses: ./.github/actions/test-template
             with:
               runner: ${{ runner.name }}
    +          registry: ${{ needs.org-member-pre-flight.outputs.registry }}
    +          image: ${{ vars.CI_CONTAINER_NAME }}
    +          test_data_path: ${{ needs.org-member-pre-flight.outputs.test_data_path }}
               script: ${{ matrix.script }}
               is_doc_test: "true"
               is_fork_pr: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name }}
    @@ -270,12 +284,12 @@ jobs:
           matrix:
             include:
               - script: L0_Unit_Tests_Generation
    -            runner: self-hosted-azure
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
               - script: L0_Unit_Tests_Policy
    -            runner: self-hosted-azure
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
               - script: L0_Unit_Tests_Other
    -            runner: self-hosted-azure
    -    needs: [pre-flight, build-container, cicd-doc-tests]
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
    +    needs: [pre-flight, build-container, cicd-doc-tests, org-member-pre-flight]
         if: >-
           ${{
             (
    @@ -298,6 +312,9 @@ jobs:
             with:
               runner: ${{ runner.name }}
               script: ${{ matrix.script }}
    +          registry: ${{ needs.org-member-pre-flight.outputs.registry }}
    +          test_data_path: ${{ needs.org-member-pre-flight.outputs.test_data_path }}
    +          image: ${{ vars.CI_CONTAINER_NAME }}
               image-tag: ${{ needs.pre-flight.outputs.image_tag }}
               is_unit_test: "true"
               cpu-only: ${{ matrix.cpu-only || false }}
    @@ -309,8 +326,8 @@ jobs:
           matrix:
             include:
               - script: L1_Functional_Tests_GPU
    -            runner: self-hosted-azure
    -    needs: [pre-flight, build-container, cicd-unit-tests]
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
    +    needs: [pre-flight, build-container, cicd-unit-tests, org-member-pre-flight]
         runs-on: ${{ matrix.runner }}
         if: ${{ contains('L1 L2', needs.pre-flight.outputs.test_level) }}
         name: ${{ matrix.is_optional && 'PLEASEFIXME_' || '' }}${{ matrix.script }}
    @@ -324,6 +341,9 @@ jobs:
               HF_TOKEN: ${{ secrets.HF_TOKEN }}
             with:
               runner: ${{ runner.name }}
    +          registry: ${{ needs.org-member-pre-flight.outputs.registry }}
    +          image: ${{ vars.CI_CONTAINER_NAME }}
    +          test_data_path: ${{ needs.org-member-pre-flight.outputs.test_data_path }}
               script: ${{ matrix.script }}
               is_fork_pr: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name }}
     
    @@ -333,8 +353,8 @@ jobs:
           matrix:
             include:
               - script: L1_Functional_Tests_GPU
    -            runner: self-hosted-azure
    -    needs: [pre-flight]
    +            runner: ${{ needs.org-member-pre-flight.outputs.runner_prefix }}-gpu-x2
    +    needs: [pre-flight, build-container, org-member-pre-flight]
         if: ${{ needs.pre-flight.outputs.test_level == 'Lfast' }}
         runs-on: ${{ matrix.runner }}
         name: fast_${{ matrix.script }}
    @@ -350,6 +370,9 @@ jobs:
               runner: ${{ runner.name }}
               script: ${{ matrix.script }}
               image-tag: ${{ needs.pre-flight.outputs.image_tag }}
    +          registry: ${{ needs.org-member-pre-flight.outputs.registry }}
    +          image: ${{ vars.CI_CONTAINER_NAME }}
    +          test_data_path: ${{ needs.org-member-pre-flight.outputs.test_data_path }}
               is_fork_pr: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name != github.event.pull_request.base.repo.full_name }}
     
       CI_QA_Gate:
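
Every `runner:` entry in the workflow diff above is now built as the pre-flight's `runner_prefix` output plus the fixed suffix `-gpu-x2`. The sketch below illustrates that composition. The selection rule is an assumption: the input names suggest the pre-flight picks the default values for org members and the non-NVIDIA values otherwise, but the reusable workflow's logic is not part of this diff. The prefix values and the function name `runner_label` are placeholders.

```shell
set -eu

# Hypothetical stand-ins for the repository variables referenced in the workflow.
DEFAULT_RUNNER_PREFIX="nvidia-ci"
NON_NVIDIA_RUNNER_PREFIX="community-ci"

# Assumption: org members get the default prefix, everyone else the non-NVIDIA one.
# The real selection logic lives in the reusable pre-flight workflow and is not shown here.
runner_label() {
  if [ "$1" = "true" ]; then
    prefix="$DEFAULT_RUNNER_PREFIX"
  else
    prefix="$NON_NVIDIA_RUNNER_PREFIX"
  fi
  echo "${prefix}-gpu-x2"
}

runner_label true
runner_label false
```

If the prefix output were ever empty, the label would collapse to `-gpu-x2`, which no runner is likely to carry, so jobs would queue rather than fail fast.
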

`registry` is effectively required but declared optional.

Line 97 and Line 136 always build image refs from `inputs.registry`, so an empty value produces invalid image names. Make the input required (or provide a safe default).

Suggested change:

      registry:
        description: "Registry to use for test"
    -   required: false
    +   required: true

Also applies to: 97-98, 136-136
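
Besides marking the input as required, the action could guard against an empty registry at the point where the image reference is built. A minimal sketch of such a guard follows; the function name `image_ref` and the registry host are hypothetical, and this is one possible approach, not what the PR implements.

```shell
set -eu

# Build "<registry>/<image>:<tag>", refusing an empty registry so the
# result can never start with "/".
image_ref() {
  registry="$1"; image="$2"; tag="$3"
  if [ -z "$registry" ]; then
    echo "error: registry input is empty" >&2
    return 1
  fi
  echo "${registry}/${image}:${tag}"
}

# Hypothetical values for illustration.
image_ref "registry.example.com" "nemo_rl_container" "12345"
image_ref "" "nemo_rl_container" "12345" || echo "rejected empty registry"
```

Failing early with a clear message is easier to diagnose than the error `docker pull` would report for a reference such as `/nemo_rl_container:12345`.
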