Skip to content

feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback#8064

Open
surajssd wants to merge 9 commits intomainfrom
suraj/optimize-extensio-version-lookup
Open

feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback#8064
surajssd wants to merge 9 commits intomainfrom
suraj/optimize-extensio-version-lookup

Conversation

@surajssd
Copy link
Member

@surajssd surajssd commented Mar 10, 2026

What this PR does / why we need it:

Replaces the hardcoded VM extension version "1.406" in createVMExtensionLinuxAKSNode with a dynamic Azure API lookup that fetches the latest version at runtime, cached to avoid redundant API calls across parallel tests.

Dynamic version discovery

  • Add CachedGetLatestVMExtensionImageVersion in e2e/cache.go with a GetLatestExtensionVersionRequest struct key supporting Location, ExtType, and Publisher
  • Update createVMExtensionLinuxAKSNode in e2e/test_helpers.go to accept a context.Context, resolve location from config.Config.DefaultLocation, and call the cached version lookup
  • Update all 7 callers in e2e/scenario_gpu_managed_experience_test.go and e2e/scenario_test.go to pass t.Context()

Robustness

  • Add a 30s context.WithTimeout in createVMExtensionLinuxAKSNode to prevent ListVersions from blocking indefinitely and consuming the global 90-minute test timeout
  • Fall back to the known-good hardcoded version "1.406" with a warning log when the Azure API call fails (timeout, throttling, network error), instead of failing the test
  • Fix nil-pointer panic in toolkit.Logf: add missing return after the nil guard's log.Printf fallback, matching the sibling Log function's pattern — without this, any toolkit.Logf call from a context without testing.TB (e.g. t.Context()) dereferences a nil t

Refactoring

  • Rename VMExtenstionVersionvmExtensionVersion (unexported, fixing the typo)
  • Replace sort.Slice + last-element with slices.MaxFunc for O(n) version lookup
  • Rename lesscmp using cmp.Compare from stdlib, returning -1/0/1
  • Add nil-name guard in parseVersion as an early return, and on the slices.MaxFunc result in GetLatestVMExtensionImageVersion
  • Extract testable getLatestVMExtensionImageVersion function behind a vmExtensionImageVersionLister interface
  • Fix %w%v in fmt.Sprintf panic call in e2e/config/config.go (%w is only valid in fmt.Errorf)

Tests

  • e2e/config/azure_vmext_test.go — 26 table-driven unit tests:
    • Test_parseVersion: 10 cases (three-part, two-part, single-part, nil name, non-numeric, partially numeric, empty string, extra parts, large numbers, leading zeros)
    • Test_vmExtensionVersion_cmp: 9 cases (equal, major/minor/patch comparisons, zero values)
    • Test_getLatestVMExtensionImageVersion: 7 cases (multiple versions, single version, two-part versions, API errors, empty list, nil names, mixed valid/malformed)
  • e2e/cache_test.go — 5 unit tests for cachedFunc: consistent results, warm-vs-cold performance, per-key isolation, error caching, struct key support
  • e2e/scenario_gpu_managed_experience_test.goTest_CreateVMExtensionLinuxAKSNode_Timing integration test with testing.Short() skip guard and deterministic cache consistency assertions

Testing Done

cd e2e
go test -run Test_CreateVMExtensionLinuxAKSNode_Timing -v -count=1
go test -run "Test_cachedFunc*" -v

cd config
go test -run Test_parseVersion -v
go test -run Test_vmExtensionVersion_cmp -v
go test -run Test_getLatestVMExtensionImageVersion -v

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the e2e VMSS model generation to dynamically resolve the latest Compute.AKS.Linux.AKSNode VM extension version via the Azure API and memoize the result to avoid repeated lookups.

Changes:

  • Add an e2e cache key + cached wrapper for GetLatestVMExtensionImageVersion.
  • Update createVMExtensionLinuxAKSNode to accept a context.Context and use the cached Azure lookup instead of a hardcoded version.
  • Optimize GetLatestVMExtensionImageVersion to request only the top result using Top + Orderby.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
e2e/test_helpers.go Switch AKSNode extension creation to use cached latest-version lookup; update one call site for new signature.
e2e/scenario_test.go Update scenario to pass a context into createVMExtensionLinuxAKSNode.
e2e/scenario_gpu_managed_experience_test.go Update multiple GPU scenarios to pass a context into createVMExtensionLinuxAKSNode.
e2e/config/azure.go Change extension version lookup to a server-side “top 1, order by name desc” query.
e2e/cache.go Add cached wrapper + cache-key struct for VM extension latest-version lookups.

@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from 8982d50 to a9e9293 Compare March 10, 2026 22:29
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from a9e9293 to 7400daf Compare March 11, 2026 05:22
Copilot AI review requested due to automatic review settings March 11, 2026 05:22
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

@surajssd surajssd changed the title test(e2e): add timing test for createVMExtensionLinuxAKSNode feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback Mar 11, 2026
Copilot AI review requested due to automatic review settings March 11, 2026 05:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from b7537c1 to fb410cb Compare March 11, 2026 18:49
Replace hardcoded VM extension version "1.406" with a cached Azure API
call that fetches the latest version at runtime. Add a caching layer
via CachedGetLatestVMExtensionImageVersion to avoid redundant API calls
across tests. Update all callers to pass context for proper propagation.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ookup

Replace `sort.Slice` + last-element pattern with `slices.MaxFunc` in
`GetLatestVMExtensionImageVersion` for `O(n)` instead of `O(n log n)`.
Rename the less method to `cmp` using `cmp.Compare` from stdlib, and
add a nil name guard before accessing the result.

Move the nil name check in `parseVersion` to an early return before
splitting.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace the observation-only Test_CreateVMExtensionLinuxAKSNode_Timing
from the GPU scenario file with proper unit tests in cache_test.go that
assert caching correctness, warm call performance, per-key isolation,
error caching, and struct key support.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Prevent the ListVersions Azure API call from blocking indefinitely
when the API hangs, which would cause the global 90-minute test
timeout to panic and kill all parallel tests.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ilure

Instead of failing the test when the Azure API call to discover the
latest VM extension version fails, fall back to a known-good hardcoded
version (1.406) and log a warning. This prevents transient API issues
like network errors or throttling from causing test failures.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ison

Extract GetLatestVMExtensionImageVersion logic into a testable unexported
function accepting a vmExtensionImageVersionLister interface. Add 26
table-driven tests covering parseVersion, vmExtensionVersion.cmp, and
getLatestVMExtensionImageVersion including edge cases for nil names,
malformed versions, API errors, and empty responses.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
fmt.Sprintf does not support the error-wrapping directive %w, which
caused a build failure. The %w verb is only valid in fmt.Errorf.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace fragile wall-clock timing assertions with deterministic checks
for the createVMExtensionLinuxAKSNode cache test. The previous 10s/1s
thresholds tested CI/network performance rather than caching logic.

- Add testing.Short() skip guard for integration test
- Add nil/empty checks on returned extensions and properties
- Replace assert with require for fail-fast behavior
- Keep cache consistency check and duration logging for diagnostics
- Remove unused assert import

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The nil guard in Logf fell through to t.Helper() and t.Logf() when
the context had no testing.TB, causing a nil-pointer dereference panic.
Add the missing return to match the sibling Log function's pattern.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from fb410cb to 7872d35 Compare March 11, 2026 18:50
@surajssd
Copy link
Member Author

cd e2e
➜  go test -run Test_CreateVMExtensionLinuxAKSNode_Timing -v -count=1
...
=== RUN   Test_CreateVMExtensionLinuxAKSNode_Timing
    scenario_gpu_managed_experience_test.go:619: First call duration: 6.699633333s
    scenario_gpu_managed_experience_test.go:627: Second call duration: 17.875µs
--- PASS: Test_CreateVMExtensionLinuxAKSNode_Timing (6.70s)
PASS
ok  	github.com/Azure/agentbaker/e2e	7.547s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants