Add Chainsaw conformance tests for the Kubernetes engine #263

@resker

Problem

docs/engines/k8s.md currently ends with:

Validation and Testing

TBD

Topograph has Go unit tests and integration fixtures (tests/integration/payloads/*.json), but no end-to-end cluster-level suite that asserts "given a deployed chart, do the right labels and annotations actually land on nodes?" This gap is felt most acutely when changing the Helm chart, Node Observer, or engine output — there's no quick way to verify the full deployment-to-labels flow without manual cluster testing.

Proposed approach

Adopt Chainsaw (Kyverno team, Apache 2.0) — the declarative Kubernetes E2E test framework used widely across Kubernetes OSS. Tests are YAML that run apply → wait → assert → cleanup against a real cluster (kind in CI). Assertions use JMESPath-style expressions to verify cluster state, which fits Topograph's output model precisely: node labels, annotations, and ConfigMaps are exactly what Chainsaw is designed to assert.
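As a rough illustration of that flow, a label-application test might look like the sketch below. The apiVersion, kind, and step operations follow Chainsaw's Test resource. The helper script name, the node name (kind's default worker name), and the label key are placeholders assumed for illustration, not names confirmed by the Topograph chart, and the chart is assumed to be installed before the test runs.

```yaml
# Sketch only: the script name, node name, and label key are assumptions.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: label-application
spec:
  steps:
  - name: request-topology
    try:
    # Hypothetical helper script that sends POST /v1/generate to the
    # Topograph API server deployed by the chart.
    - script:
        timeout: 30s
        content: ./request-generate.sh
  - name: assert-node-labels
    try:
    # Chainsaw retries the assertion until it matches or the assert
    # timeout expires, which provides the "wait" part of the flow.
    - assert:
        resource:
          apiVersion: v1
          kind: Node
          metadata:
            name: kind-worker
            # A parenthesized key is evaluated as a JMESPath expression
            # and its result is compared with the value on the right.
            (contains(keys(labels), 'network.topology.nvidia.com/leaf')): true
```

Resources created through Chainsaw's apply or create operations are removed automatically when the test ends, which covers the cleanup stage without extra steps.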

Why Chainsaw specifically:

  1. Topograph's Kubernetes engine writes cluster state — Chainsaw is purpose-built for asserting cluster state
  2. The test provider + toposim models give deterministic inputs (no real hardware, no cloud credentials) — tests stay reproducible
  3. Node Observer reactivity is testable — Chainsaw can add/remove nodes and assert the re-labeling occurs within the aggregation delay
  4. Strong CNCF ecosystem alignment — Chainsaw is the common pattern for declarative Kubernetes conformance testing
  5. Precedent in NVIDIA's own OSS — NVIDIA/aicr uses Chainsaw for AI Conformance testing (tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml etc.), which gives us a known-good structural template
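Point 3 could be exercised roughly as follows. This is a sketch under several assumptions: the node name, label key, label value, and timeout are invented for illustration, and it assumes Node Observer reacts to a bare Node object that has no kubelet behind it and that the toposim model maps that node name to a leaf.

```yaml
# Sketch only: node name, label key/value, and timeout are assumptions.
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: node-observer
spec:
  timeouts:
    # Should comfortably exceed the aggregation delay
    # configured in the chart under test.
    assert: 2m
  steps:
  - name: add-node-and-wait-for-labels
    try:
    # Registers a placeholder Node object in the cluster.
    - apply:
        resource:
          apiVersion: v1
          kind: Node
          metadata:
            name: e2e-extra-node
    # Partial match: passes once the node carries this label, and is
    # retried until the assert timeout above expires.
    - assert:
        resource:
          apiVersion: v1
          kind: Node
          metadata:
            name: e2e-extra-node
            labels:
              network.topology.nvidia.com/leaf: leaf-1
```

Because the Node is created by an apply operation, Chainsaw deletes it during cleanup, so the same test also exercises the node-removal path.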

Self-contained local development

Chainsaw + the test provider + toposim together enable a self-contained local development loop — contributors can clone the repo, spin up kind, install the chart, and run the full E2E suite on their own workstation without NVIDIA hardware, cloud provider credentials, or access to a shared cluster. This covers the majority of engine-facing code paths:

| Code path | Self-contained with Chainsaw? |
| --- | --- |
| API server, aggregation, validation | ✅ |
| k8s engine output (labels) | ✅ |
| Slinky engine output (ConfigMap) | ✅ |
| Node Observer reactivity to node add/remove events | ✅ |
| Node Data Broker annotation lifecycle | ✅ |
| FNV-64a label truncation, canonical tree construction | ✅ |
| DRA provider (with nvidia.com/gpu.clique pre-populated on kind nodes) | ✅ |
| IB provider (requires mocking ibnetdiscover output) | ❌ (separate concern) |
| NetQ provider (requires mock NetQ API server) | ❌ (separate concern) |
| Cloud CSP providers (AWS / GCP / OCI / Nebius / Lambda AI / CW) | ❌ (covered by *Sim loaders elsewhere) |

Why this matters:

  • Lowers the contribution bar. A new contributor can verify their change without privileged access to anything.
  • Strong OSS maturity signal. "Can a new contributor reproduce tests locally without privileged access?" is a well-established OSS health check; Chainsaw + kind + test provider is the canonical answer.
  • Pairs with the planned how-to tutorial. The tutorial scenario (kind + toposim + Helm install + demo workload) is the same setup; one investment yields tutorial, demo, and E2E test fixtures.

Proposed layout

Mirror AICR's tests/chainsaw/ structure:

tests/chainsaw/
  chainsaw-config.yaml           # Global timeouts, namespace cleanup strategy
  README.md                      # How to run locally
  fixtures/
    toposim-2leaf-1spine.yaml    # 4-node fabric: 2 leaves under 1 spine, 1 NVLink domain
    helm-values-test.yaml        # provider=test + engine=k8s + toposim model reference
  suites/
    label-application/           # Deploy → POST /v1/generate → assert labels on nodes
    node-observer/               # Add node → assert regeneration + new node labeled
    data-broker-annotations/     # DaemonSet runs → assert topograph.nvidia.com/* annotations
    fnv-truncation/              # Long switch ID → assert x-prefixed hex value
    slinky-configmap/            # engine=slinky → assert ConfigMap contents + annotations
    topology-change/             # Swap toposim model → assert labels update, no stale leak
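For orientation, chainsaw-config.yaml in the layout above could start from something like the following. The durations and execution settings are placeholder values, not tuned for Topograph, and the metadata name is hypothetical.

```yaml
# Sketch only: durations and settings are placeholder values.
apiVersion: chainsaw.kyverno.io/v1alpha2
kind: Configuration
metadata:
  name: topograph-e2e
spec:
  timeouts:
    apply: 30s
    assert: 2m
    cleanup: 1m
    delete: 1m
    error: 30s
    exec: 1m
  cleanup:
    # Delete resources created by each test once it finishes.
    skipDelete: false
  execution:
    # Suites mutate shared Node objects, so run them one at a time.
    parallel: 1
    failFast: true
```

Chainsaw looks for a file named .chainsaw.yaml by default, so a file with this name would need to be passed explicitly, for example with chainsaw test --config tests/chainsaw/chainsaw-config.yaml.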

Deliverables

  • tests/chainsaw/ directory + suites above
  • Makefile: make e2e target that runs chainsaw test against an assumed-running kind cluster; optional make e2e-local that wraps kind create cluster + e2e + kind delete cluster
  • .github/workflows/e2e.yml — triggered on PRs touching charts/**, pkg/engines/k8s/**, pkg/node_observer/**, pkg/server/**, or tests/chainsaw/**. Uses helm/kind-action + kyverno/action-install-chainsaw
  • AGENTS.md + .claude/CLAUDE.md updates (same PR):
    • Repository map adds tests/chainsaw/
    • Commands section adds make e2e and make e2e-local
    • Testing and Deployment Workflows section describes the suite and self-contained local run instructions
    • Pre-push checklist mentions make e2e when a change touches the k8s engine, chart, server, or observer paths
    • PR guidelines note that chart/engine/observer changes should extend Chainsaw coverage
  • Update docs/engines/k8s.md — replace "Validation and Testing: TBD" with a pointer to the Chainsaw suite and the local-run instructions
  • tests/chainsaw/README.md — local run instructions with kind cluster bring-up, Chainsaw install, make e2e-local quickstart
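A minimal sketch of the proposed .github/workflows/e2e.yml is shown below. The path filters and the two actions come from the deliverables list; the action version pins, job name, and runner are assumptions, and building the Topograph image and loading it into kind is omitted.

```yaml
# Sketch only: action versions and job layout are assumptions.
name: e2e
on:
  pull_request:
    paths:
      - "charts/**"
      - "pkg/engines/k8s/**"
      - "pkg/node_observer/**"
      - "pkg/server/**"
      - "tests/chainsaw/**"
jobs:
  chainsaw:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Creates a kind cluster and points kubectl at it.
      - uses: helm/kind-action@v1
      # Installs the chainsaw binary on the runner.
      - uses: kyverno/action-install-chainsaw@v0.2.12
      # Proposed Makefile target that runs chainsaw test
      # against the already-running cluster.
      - run: make e2e
```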

Out of scope for v1

  • InfiniBand provider tests — require mocking ibnetdiscover output across exec calls; defer to a follow-up
  • NetQ provider tests — require a mock NetQ API server; defer
  • Multi-region tests — toposim supports it but the chart's DaemonSet+Observer wiring is single-region-assumed today
  • Performance / scale tests — separate concern; Chainsaw is not the right tool for benchmarking

Dependencies

  • Shares fixtures with the upcoming "how to" tutorial — a tutorial reader follows the same toposim model the Chainsaw test uses. "Tutorial works on your laptop" and "E2E test passes in CI" become the same claim.
  • Pairs with #256 ("build: add make qualify pre-push aggregator"), which adds the make qualify target. A later expansion of that target to include e2e follows the AICR pattern.

Reference

  • AICR's Chainsaw setup: tests/chainsaw/ai-conformance/ in NVIDIA/aicr. The cleanest JMESPath examples are in common/assert-kai-scheduler.yaml.
  • Chainsaw docs: https://kyverno.github.io/chainsaw/

Priority

Medium. Not blocking any currently open PR. Natural companion to the how-to-tutorial work and a strong OSS maturity signal. Best landed as a single PR covering the first two suites (label-application + node-observer) + the Makefile/CI workflow + AGENTS.md updates, with additional suites added incrementally.
