feat: add long-haul test design and Phase 1a skeleton (#220)#348
feat: add long-haul test design and Phase 1a skeleton (#220)#348WentingWu666666 wants to merge 2 commits intodocumentdb:mainfrom
Conversation
ca9522c to
d1b3462
Compare
Add design document for the canary-based long haul test infrastructure per issue documentdb#220. The design covers: - 4-component harness (data plane workload, control plane operations, health monitor, event journal) - Run-until-failure canary model on persistent AKS cluster - Data integrity oracle with per-writer sequence tracking - Per-operation outage policies and failure tiers - Phased implementation plan Based on research of Strimzi, CloudNative-PG, CockroachDB, and Vitess long haul/soak test patterns. Closes documentdb#220 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
d1b3462 to
6bd4274
Compare
There was a problem hiding this comment.
Pull request overview
Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.
Changes:
- Introduces a long-haul Ginkgo suite with a
BeforeSuiteenablement gate (LONGHAUL_ENABLED) and a placeholder canary spec. - Adds a
test/longhaul/configpackage for env-based configuration loading/validation with unit tests. - Documents the long-haul test architecture and usage via a new design doc and package README.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| operator/src/test/longhaul/suite_test.go | Adds the Ginkgo suite entry point for long-haul tests. |
| operator/src/test/longhaul/longhaul_test.go | Adds gated BeforeSuite config loading/validation and a placeholder canary spec. |
| operator/src/test/longhaul/config/suite_test.go | Adds the Ginkgo suite entry point for config unit tests. |
| operator/src/test/longhaul/config/config_test.go | Adds unit tests for config defaults, env parsing, validation, and enablement gating. |
| operator/src/test/longhaul/config/config.go | Implements long-haul test configuration (env vars, defaults, validation, enablement). |
| operator/src/test/longhaul/README.md | Adds usage docs, configuration reference, and CI-safety notes for long-haul tests. |
| docs/designs/long-haul-test-design.md | Adds the long-haul canary design document and phased implementation plan. |
| // Config holds all configuration for a long haul test run. | ||
| type Config struct { | ||
| // MaxDuration is the maximum test duration. Zero means run until failure. | ||
| // Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs. |
There was a problem hiding this comment.
This comment says infinite runs require LONGHAUL_MAX_DURATION=0, but time.ParseDuration requires a unit (and the README/tests use 0s). Update the comment to match the actual accepted value to avoid confusing users.
| // Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs. | |
| // Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs. |
xgerman
left a comment
There was a problem hiding this comment.
Review
Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.
Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅
🟠 Major
M1. Module placement will pollute operator/src/go.mod — operator/src/test/longhaul/ (entire tree)
The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.
For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:
- Phase 1b adds
go.mongodb.org/mongo-driver(design line 75) - Phase 1c adds a file-backed journal
- Phase 1d adds OTel / Prometheus client libs
- Phase 2f adds k8s Job manifests + likely controller-runtime client
…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.
M2. Design does not engage with the existing e2e suite — docs/designs/long-haul-test-design.md throughout
Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:
- Ginkgo v2 harness with
SynchronizedBeforeSuitefor cluster bootstrapping (test/e2e/pkg/testenv) - Label taxonomy (
test/e2e/labels.go) - Run-ID tagging for artifact separation (
test/e2e/runid.go) - envsubst-based manifest rendering
- A mongo-driver v2 wire-protocol client (same driver the design proposes)
.github/actions/setup-test-environmentcomposite action for cluster provisioning
The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.
M3. Four design gaps that will cost rework, not captured in §Open Questions — long-haul-test-design.md:443-445
- Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
- Workload resumption idempotency. Writers use a unique index on
(writer_id, seq)(line 80). If a writer or the Job pod restarts, where doesseqrestart from? Reset to 0 → writes collide. Bootstrap frommax(seq)perwriter_id→ the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described. - Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
- Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.
These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.
M4. MaxDuration=30m default contradicts "run until failure" — config.go:594, long-haul-test-design.md:37,188
The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.
🟡 Minor (cheap in-PR fixes)
- m1
IsEnabled()—config.go:637-640— doesn't trim whitespace.LONGHAUL_ENABLED=" true "→ false (silent skip). Addstrings.TrimSpace. Silent-failure mode; worth fixing in-PR. - m2
Validate()—config.go:625-633— accepts negativeMaxDuration.time.ParseDuration("-5m")succeeds. Addif c.MaxDuration < 0 { return fmt.Errorf(...) }. - m3 Error strings capitalized —
config.go:627,630—ST1005/ Go convention: lowercase. Matching tests useContainSubstring("Namespace")/"ClusterName"and will need adjustment. - m4 Path inconsistency —
long-haul-test-design.md:340sayskubectl apply -f test/longhaul/deploy/job.yamlbut the directory structure roots atoperator/src/test/longhaul/. Fix path, or resolve M1 first. - m5 README overstates CI safety —
README.md:543-547— phrase as "no CI workflow currently setsLONGHAUL_ENABLED; do not set it in any PR-gated workflow" rather than as a property of the code. - m6
configis a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Considerlonghaulcfgorcfg. Not blocking. - m7 Missing
IsEnabledtest cases —config_test.go:727-757— given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case ("True","YES"),"0"/"no"/"false"→ false.
🟢 Nit
longhaul_test.go:799— package-level mutabletestConfig. Fine for a single-file suite; revisit if Phase 1b splits files.DefaultConfig()by value vs.Validate()pointer receiver — slight asymmetry. Leave it.- README line 509's
-timeout 0guidance is correct and rarely documented — nice touch.
Positive feedback
- Skip gate is correctly placed in
BeforeSuite, notBeforeEach— single skipped line instead of N per spec. Exactly right. LoadFromEnvcleanly separates parsing from validation, soValidate()can be called on a programmatically-builtConfigin Phase 1b unit tests without fighting env vars.- The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
- Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
- Test coverage on the config package is thorough for the surface area.
Questions for the author
- (M1) Any reason longhaul can't live at
test/longhaul/as a sibling module totest/e2e/? If there is one, please capture it in the design. - (M2) Have you looked at
test/e2e/pkg/for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation? - (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
- Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
- Should
CONTRIBUTING.mdorAGENTS.mdget a one-liner pointing atoperator/src/test/longhaul/README.mdso contributors aren't left wondering "what is this tree?"
Reviewed with the GitHub Copilot code-review agent.
1985c93 to
f3ffbef
Compare
Add the project skeleton for long-haul (canary) tests: - config/config.go: Config struct with env var loading, defaults, and validation - config/config_test.go: Comprehensive tests for all config options (15 specs) - config/suite_test.go: Ginkgo suite entry for config unit tests - suite_test.go: Ginkgo suite entry point for the canary - longhaul_test.go: BeforeSuite with LONGHAUL_ENABLED skip gate + placeholder - README.md: Usage guide for running locally and in CI Config is in a sub-package so config unit tests run independently of the long-running canary. Tests are gated behind LONGHAUL_ENABLED=true env var, so go test ./... safely skips them. Part of documentdb#220 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
f3ffbef to
8459cae
Compare
Response to @xgerman's ReviewThank you for the thorough review! I've addressed all items in the latest force-push. Here's a summary: Major ItemsM1 (Module placement): Agreed this is a real concern. After researching CNPG's approach (they keep all tests in the same go.mod their ests/ dir with e2e specs, utils, labels, etc. is in the root module), I'm keeping longhaul under operator/src/ for Phase 1a. Added a concrete extraction trigger to the design:
Phase 1a adds zero new deps (only Ginkgo/Gomega, already in go.mod). M2 (E2E engagement): The repo currently lacks reusable e2e primitives est/e2e/ does not exist yet (the Makefile references it but the directory hasn't been created). When shared test utilities are introduced (following CNPG's ests/utils/ pattern), long-haul will reuse them via a est/utils/ package. Added this as Open Question #4 with a note to create a follow-up issue for coordination. M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc:
These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs. M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets LONGHAUL_MAX_DURATION=0s. The 30m default remains as a safety net for local development. Minor Items All Fixed
Questions
Test results: 23 config specs + 1 canary spec (skipped) = all passing. go vet clean. |
Summary
Adds a long-haul (canary) test framework for the DocumentDB Kubernetes Operator, addressing #220. This PR includes the design document and the Phase 1a skeleton implementation.
Design Document (docs/designs/long-haul-test-design.md)
Architecture for continuous, run-until-failure testing inspired by Strimzi, CloudNative-PG, CockroachDB, and Vitess patterns. Four-component harness: data plane workload, control plane operations, health monitor, and event journal.
Phase 1a Implementation (operator/src/test/longhaul/)
Skeleton test infrastructure with CI-safe gating:
Configstruct with env var loading, validation, andIsEnabled()gate. 15 unit tests.LONGHAUL_ENABLED=true), placeholder canary spec.CI Safety
All long-haul tests are gated behind
LONGHAUL_ENABLEDenv var. When not set, tests skip immediately with zero overhead. Safe forgo test ./...in existing CI pipelines.How to Test
Closes #220 (design phase)