Add e2e test for NVIDIA device plugin DaemonSet deployment with Renovate auto-update #7984
Conversation
**PR Title Lint Failed ❌**

Your PR title doesn't follow the expected format. Please update it to match the Conventional Commits format. Once updated, the lint check will run again automatically.
Pull request overview
Adds coverage in the e2e suite for the “upstream” NVIDIA device plugin deployment model (as a Kubernetes DaemonSet) and configures Renovate to keep the test’s container image tag up to date.
Changes:
- Introduces a new Ubuntu 22.04 GPU e2e scenario that deploys the NVIDIA device plugin as a DaemonSet and validates GPU advertisement + scheduling.
- Adds a Renovate custom regex manager to update docker image tags referenced in e2e Go tests via `// renovate:` comments.
- Extends the existing Renovate grouping for `nvidia-device-plugin` to also include the MCR upstream device-plugin image.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| e2e/scenario_gpu_daemonset_test.go | New e2e GPU scenario that validates DaemonSet-based NVIDIA device plugin deployment (not systemd). |
| .github/renovate.json | Adds a custom manager to update image tags in e2e Go tests and groups the MCR image with existing NVIDIA device plugin updates. |
```go
// Check if the service exists and is inactive
// Using "is-active" which returns non-zero if not active
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
output := strings.TrimSpace(result.stdout)

// The service should either not exist or be inactive
if output == "active" {
	s.T.Fatalf("nvidia-device-plugin.service is unexpectedly running - this test requires the systemd service to be disabled")
```
The systemctl is-active ... || echo 'not-running' pattern discards both the real service state (e.g., "inactive", "failed", "activating") and the exit code, so the test can incorrectly pass even if the service is transitioning/running (e.g., "activating"). Capture the actual systemctl is-active output/exit status and fail on any running-like state (active/activating), or use execScriptOnVMForScenarioValidateExitCode with the expected non-active exit codes.
Suggested change:

```go
// Check the current service state using "is-active".
// This will return "active", "inactive", "failed", "activating", "unknown", etc.
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null")
output := strings.TrimSpace(result.stdout)
// The service should either not exist or be in a non-running state.
// Treat both "active" and "activating" as failures, since the service
// must not be running when validating the DaemonSet-based deployment.
if output == "active" || output == "activating" {
	s.T.Fatalf("nvidia-device-plugin.service is unexpectedly %s - this test requires the systemd service to be disabled", output)
```
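One way to capture the reviewer's point is to classify the raw `is-active` output explicitly, so the set of "running-like" states lives in one place. A minimal sketch (`isRunningLike` is a hypothetical helper, not part of the repo; the real test would feed it the captured stdout):

```go
package main

import "fmt"

// isRunningLike reports whether a `systemctl is-active` state string means
// the unit is running or about to run. "reloading" is included because the
// unit is up while reloading its configuration.
func isRunningLike(state string) bool {
	switch state {
	case "active", "activating", "reloading":
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []string{"active", "activating", "inactive", "failed", "unknown"} {
		fmt.Printf("%s -> running-like: %v\n", s, isRunningLike(s))
	}
}
```

This keeps the failure condition table-driven rather than relying on a single string comparison that "activating" would slip past.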
```go
maxLen := 63
name := prefix + nodeName
if len(name) > maxLen {
	name = name[:maxLen]
}
```
nvidiaDevicePluginDaemonsetName truncates the name to 63 chars but doesn’t ensure the truncated result still conforms to DNS label rules (must end with an alphanumeric). If truncation cuts on a '-', Kubernetes can reject the object/label value. Mirror the existing truncatePodName behavior by trimming trailing '-' after truncation (and apply any needed sanitization).
```go
_ = s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
	deleteCtx,
	ds.Name,
	metav1.DeleteOptions{},
)
```
The test deletes an existing DaemonSet and then recreates it without waiting for deletion to finish. If a prior run left it terminating, the create can fail with AlreadyExists / "object is being deleted" and make the test flaky. Consider polling until a GET returns NotFound (or using an update-based approach) before creating.
```diff
 "matchPackageNames": [
-  "nvidia-device-plugin"
+  "nvidia-device-plugin",
+  "oss/v2/nvidia/k8s-device-plugin"
 ],
```
The PR description says a new Renovate custom manager was added to auto-update container image versions in e2e Go test files (via // renovate: comments), but this renovate.json change only updates package grouping. There is still no customManagers entry that targets e2e/**/*.go or parses // renovate: lines, so Renovate will not update Go test image strings as described. Either add the intended custom manager configuration or update the PR description to match the actual approach (version coming from components.json).
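For reference, a custom manager of the kind the description implies might look like the sketch below. The `fileMatch` pattern and regex are assumptions about the intended setup, not the repo's actual config:

```json
{
  "customManagers": [
    {
      "customType": "regex",
      "fileMatch": ["^e2e/.*\\.go$"],
      "matchStrings": [
        "// renovate: datasource=(?<datasource>[a-z-]+) depName=(?<depName>\\S+)\\n.*?\"[^\"]+:(?<currentValue>[^\"]+)\""
      ]
    }
  ]
}
```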
```go
jsonBytes, err := os.ReadFile(componentsPath)
if err != nil {
	return ""
}
```
GetE2EContainerImage returns an empty string if reading/parsing components.json fails or the image entry is missing. In practice this causes later Kubernetes errors like “invalid image name” and hides the real root cause. Consider returning (string, error) (or failing fast) and have callers require.NoError/require.NotEmpty so test failures point to the missing/misconfigured components.json entry.
```go
Name:  "nvidia-device-plugin-ctr",
Image: getNvidiaDevicePluginImage(),
```
The DaemonSet uses Image: getNvidiaDevicePluginImage() without asserting it’s non-empty/valid. If components.json lookup fails, the pod will be created with an empty image and the test will fail with a confusing scheduling/pull error. Add an early require.NotEmpty (or propagate an error from GetE2EContainerImage) so failures clearly indicate a missing/misconfigured E2EContainerImages entry.
Suggested change:

```go
Name: "nvidia-device-plugin-ctr",
Image: func() string {
	img := getNvidiaDevicePluginImage()
	require.NotEmpty(t, img, "nvidia device plugin image must be configured in E2EContainerImages (components.json entry %q)", nvidiaDevicePluginImageName)
	return img
}(),
```
```json
"version": {
  "renovateTag": "registry=https://mcr.microsoft.com, name=oss/v2/nvidia/k8s-device-plugin",
  "latestVersion": "v0.18.2-1"
}
```
The PR description says the test uses mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2, but components.json pins latestVersion to v0.18.2-1. Please reconcile the version/tag format (either update the description or adjust latestVersion) to avoid confusion and to ensure the referenced tag is the one actually being tested.
`e2e/scenario_gpu_daemonset_test.go` (outdated)
```go
// NVIDIA device plugin deployment.
func Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset(t *testing.T) {
	RunScenario(t, &Scenario{
		Description: "Tests that NVIDIA device plugin works when deployed as a DaemonSet (not systemd service)",
```
Should we just add the device plugin DaemonSet to all the existing unmanaged scenarios instead of adding a new one?
…ate tracking

- Add E2EContainerImages section to components.json for container images used only in e2e tests (not cached on VHD)
- Add nvidia k8s-device-plugin container image to E2EContainerImages
- Add GetGPUContainerImage and GetE2EContainerImage helpers in e2e/components
- Update scenario_gpu_daemonset_test.go to read the version from components.json instead of hardcoding it
- Move DaemonSet validation inline into Test_Ubuntu2204_GPUNC and Test_Ubuntu2204_GPUA10 to avoid spinning up an additional GPU VM
- Add schema definition for E2EContainerImages in components.cue
- Add Renovate package rule for oss/v2/nvidia/k8s-device-plugin
Force-pushed from a365ad1 to 24402e6.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
```go
func GetGPUContainerImage(downloadURLPattern string) string {
	// Get the project root dynamically
	_, filename, _, _ := runtime.Caller(0)
	projectRoot := filepath.Dir(filepath.Dir(filepath.Dir(filename))) // Go up 3 levels from e2e/components/
	componentsPath := filepath.Join(projectRoot, "parts", "common", "components.json")

	jsonBytes, err := os.ReadFile(componentsPath)
	if err != nil {
		return ""
	}

	gpuImages := gjson.GetBytes(jsonBytes, "GPUContainerImages")
	for _, gpuImage := range gpuImages.Array() {
		downloadURL := gpuImage.Get("downloadURL").String()
		if strings.EqualFold(downloadURL, downloadURLPattern) {
			version := gpuImage.Get("gpuVersion.latestVersion").String()
			if version != "" {
				return strings.Replace(downloadURL, "*", version, 1)
			}
		}
	}
	return ""
}

// GetE2EContainerImage returns the full container image URL for an e2e test container image
// by looking up the name and version.latestVersion from components.json E2EContainerImages section.
// The downloadURL pattern contains a wildcard (*) that gets replaced with the version.
func GetE2EContainerImage(name string) string {
	// Get the project root dynamically
	_, filename, _, _ := runtime.Caller(0)
	projectRoot := filepath.Dir(filepath.Dir(filepath.Dir(filename))) // Go up 3 levels from e2e/components/
	componentsPath := filepath.Join(projectRoot, "parts", "common", "components.json")

	jsonBytes, err := os.ReadFile(componentsPath)
	if err != nil {
		return ""
	}

	e2eImages := gjson.GetBytes(jsonBytes, "E2EContainerImages")
	for _, e2eImage := range e2eImages.Array() {
		imageName := e2eImage.Get("name").String()
		if strings.EqualFold(imageName, name) {
			downloadURL := e2eImage.Get("downloadURL").String()
			version := e2eImage.Get("version.latestVersion").String()
			if version != "" {
				return strings.Replace(downloadURL, "*", version, 1)
			}
		}
	}
	return ""
}
```
GetE2EContainerImage (and GetGPUContainerImage) silently return an empty string if parts/common/components.json can’t be read or the image isn’t found. Callers don’t validate the return value, so failures will surface later as confusing Kubernetes errors (e.g., DaemonSet created with an empty image). Consider returning (string, error) (or accepting a testing.TB to fail fast) and having call sites assert non-empty results.
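One shape for that fix, sketched here with `encoding/json` instead of the repo's gjson dependency so it stands alone. The field names mirror the gjson paths above; the function name and signature are assumptions, not the repo's actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// e2eImage mirrors an E2EContainerImages entry in components.json.
type e2eImage struct {
	Name        string `json:"name"`
	DownloadURL string `json:"downloadURL"`
	Version     struct {
		LatestVersion string `json:"latestVersion"`
	} `json:"version"`
}

// getE2EContainerImage resolves name to a full image URL, or returns an
// error that names exactly what was missing so test failures point at the
// components.json entry rather than a later Kubernetes symptom.
func getE2EContainerImage(componentsJSON []byte, name string) (string, error) {
	var doc struct {
		E2EContainerImages []e2eImage `json:"E2EContainerImages"`
	}
	if err := json.Unmarshal(componentsJSON, &doc); err != nil {
		return "", fmt.Errorf("parsing components.json: %w", err)
	}
	for _, img := range doc.E2EContainerImages {
		if strings.EqualFold(img.Name, name) {
			if img.Version.LatestVersion == "" {
				return "", fmt.Errorf("E2EContainerImages entry %q has no version.latestVersion", name)
			}
			return strings.Replace(img.DownloadURL, "*", img.Version.LatestVersion, 1), nil
		}
	}
	return "", fmt.Errorf("no E2EContainerImages entry named %q", name)
}

func main() {
	sample := []byte(`{"E2EContainerImages":[{"name":"nvidia-device-plugin","downloadURL":"mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:*","version":{"latestVersion":"v0.18.2-1"}}]}`)
	fmt.Println(getE2EContainerImage(sample, "nvidia-device-plugin"))
}
```

Call sites can then `require.NoError` / `require.NotEmpty` so a missing entry fails fast with a descriptive message.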
```go
// getNvidiaDevicePluginImage returns the full container image URL for the NVIDIA device plugin
// by reading the version from components.json E2EContainerImages section.
func getNvidiaDevicePluginImage() string {
	return components.GetE2EContainerImage(nvidiaDevicePluginImageName)
```
getNvidiaDevicePluginImage() directly returns the lookup result without checking for an empty string. If the components.json entry is missing/misparsed, the DaemonSet will be created with an invalid image and the failure will be indirect. Add a fast-fail check (e.g., validate non-empty before constructing the DaemonSet or when deploying it).
Suggested change:

```go
image := components.GetE2EContainerImage(nvidiaDevicePluginImageName)
if strings.TrimSpace(image) == "" {
	panic("nvidia device plugin image is not configured in components.json (empty image returned)")
}
return image
```
```diff
 func Test_Ubuntu2204_GPUNC(t *testing.T) {
-	runScenarioUbuntu2204GPU(t, "Standard_NC6s_v3")
+	RunScenario(t, &Scenario{
+		Description: "Tests that a GPU-enabled node with Standard_NC6s_v3 can bootstrap and run NVIDIA device plugin as a DaemonSet",
+		Tags: Tags{
```
PR description mentions a new dedicated test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset, but that test doesn’t exist in the codebase anymore (the DaemonSet validation is now embedded into Test_Ubuntu2204_GPUNC / Test_Ubuntu2204_GPUA10). Update the PR description (or reintroduce a dedicated test) so the change is discoverable and future readers know where this coverage lives.
Summary

Adds a new e2e test (`Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset`) that validates the NVIDIA device plugin works when deployed as a Kubernetes DaemonSet instead of the systemd service.

Changes

- New e2e test (`e2e/scenario_gpu_daemonset_test.go`): deploys `mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2` as a DaemonSet
- Renovate auto-update (`.github/renovate.json`): keeps the image tag current via `// renovate: datasource=docker depName=...` comments

Test plan

- Run the `Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset` e2e test