Skip to content

feat: add long-haul test design and Phase 1a skeleton (#220)#348

Open
WentingWu666666 wants to merge 2 commits intodocumentdb:mainfrom
WentingWu666666:developer/wentingwu/long-haul-tests
Open

feat: add long-haul test design and Phase 1a skeleton (#220)#348
WentingWu666666 wants to merge 2 commits intodocumentdb:mainfrom
WentingWu666666:developer/wentingwu/long-haul-tests

Conversation

@WentingWu666666
Copy link
Copy Markdown
Collaborator

@WentingWu666666 WentingWu666666 commented Apr 20, 2026

Summary

Adds a long-haul (canary) test framework for the DocumentDB Kubernetes Operator, addressing #220. This PR includes the design document and the Phase 1a skeleton implementation.

Design Document (docs/designs/long-haul-test-design.md)

Architecture for continuous, run-until-failure testing inspired by Strimzi, CloudNative-PG, CockroachDB, and Vitess patterns. Four-component harness: data plane workload, control plane operations, health monitor, and event journal.

Phase 1a Implementation (operator/src/test/longhaul/)

Skeleton test infrastructure with CI-safe gating:

  • Config sub-package (config/): Config struct with env var loading, validation, and IsEnabled() gate. 15 unit tests.
  • Ginkgo test suite: BeforeSuite skip gate (requires LONGHAUL_ENABLED=true), placeholder canary spec.
  • README: Usage guide, configuration reference, CI safety explanation, standalone binary instructions.

CI Safety

All long-haul tests are gated behind LONGHAUL_ENABLED env var. When not set, tests skip immediately with zero overhead. Safe for go test ./... in existing CI pipelines.

How to Test

cd operator/src

# Config unit tests (fast, no cluster)
go test ./test/longhaul/config/ -v

# Canary (skips without LONGHAUL_ENABLED)
go test ./test/longhaul/ -v

Closes #220 (design phase)

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 8 times, most recently from ca9522c to d1b3462 Compare April 22, 2026 16:38
@WentingWu666666 WentingWu666666 changed the title docs: add long haul test design document feat: add long-haul test design and Phase 1a skeleton (#220) Apr 22, 2026
Add design document for the canary-based long haul test infrastructure
per issue documentdb#220. The design covers:

- 4-component harness (data plane workload, control plane operations,
  health monitor, event journal)
- Run-until-failure canary model on persistent AKS cluster
- Data integrity oracle with per-writer sequence tracking
- Per-operation outage policies and failure tiers
- Phased implementation plan

Based on research of Strimzi, CloudNative-PG, CockroachDB, and Vitess
long haul/soak test patterns.

Closes documentdb#220

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from d1b3462 to 6bd4274 Compare April 22, 2026 16:41
@WentingWu666666 WentingWu666666 marked this pull request as ready for review April 22, 2026 16:50
Copilot AI review requested due to automatic review settings April 22, 2026 16:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a long-haul (run-until-failure canary) testing framework to the DocumentDB Kubernetes Operator repo, including a design document and an initial Phase 1a skeleton that is CI-safe by default.

Changes:

  • Introduces a long-haul Ginkgo suite with a BeforeSuite enablement gate (LONGHAUL_ENABLED) and a placeholder canary spec.
  • Adds a test/longhaul/config package for env-based configuration loading/validation with unit tests.
  • Documents the long-haul test architecture and usage via a new design doc and package README.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
operator/src/test/longhaul/suite_test.go Adds the Ginkgo suite entry point for long-haul tests.
operator/src/test/longhaul/longhaul_test.go Adds gated BeforeSuite config loading/validation and a placeholder canary spec.
operator/src/test/longhaul/config/suite_test.go Adds the Ginkgo suite entry point for config unit tests.
operator/src/test/longhaul/config/config_test.go Adds unit tests for config defaults, env parsing, validation, and enablement gating.
operator/src/test/longhaul/config/config.go Implements long-haul test configuration (env vars, defaults, validation, enablement).
operator/src/test/longhaul/README.md Adds usage docs, configuration reference, and CI-safety notes for long-haul tests.
docs/designs/long-haul-test-design.md Adds the long-haul canary design document and phased implementation plan.

Comment thread operator/src/test/longhaul/README.md Outdated
Comment thread operator/src/test/longhaul/README.md Outdated
// Config holds all configuration for a long haul test run.
type Config struct {
// MaxDuration is the maximum test duration. Zero means run until failure.
// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment says infinite runs require LONGHAUL_MAX_DURATION=0, but time.ParseDuration requires a unit (and the README/tests use 0s). Update the comment to match the actual accepted value to avoid confusing users.

Suggested change
// Requires explicit LONGHAUL_MAX_DURATION=0 to enable infinite runs.
// Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs.

Copilot uses AI. Check for mistakes.
Comment thread operator/src/test/longhaul/config/config_test.go
Comment thread docs/designs/long-haul-test-design.md Outdated
Comment thread docs/designs/long-haul-test-design.md
Comment thread docs/designs/long-haul-test-design.md
Copy link
Copy Markdown
Collaborator

@xgerman xgerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Verdict: request-changes — lightly. No blocking issues for the skeleton itself; the skip gate is airtight and the config tests are appropriate. But there are four structural items the Phase 1a framing bakes in that will be expensive to unwind in Phase 1b/1c if not settled now.

Verified first: test-unit.yml uses operator/src as working-dir and runs go list ./..., which enumerates the new packages. With LONGHAUL_ENABLED unset the BeforeSuite calls Skip() and the suite registers as a single "skipped" line. Zero CI overhead. ✅


🟠 Major

M1. Module placement will pollute operator/src/go.modoperator/src/test/longhaul/ (entire tree)

The existing test/e2e/ is a separate Go module at test/e2e/go.mod — a deliberate isolation so mongo-driver/v2, CNPG test utilities, and other test-only deps never enter the operator's dependency graph. This PR puts longhaul inside the operator module.

For Phase 1a (stdlib + Ginkgo/Gomega, already in the operator go.mod) this is harmless. For the phases this PR's design describes:

  • Phase 1b adds go.mongodb.org/mongo-driver (design line 75)
  • Phase 1c adds a file-backed journal
  • Phase 1d adds OTel / Prometheus client libs
  • Phase 2f adds k8s Job manifests + likely controller-runtime client

…all would land in operator/src/go.mod and be forced onto every operator build. Resolve before the first non-stdlib dep is introduced. Preferred: move to test/longhaul/ with its own go.mod (mirrors e2e). Alternative: nest a go.mod under operator/src/test/longhaul/ and document the convention in CONTRIBUTING.md.

M2. Design does not engage with the existing e2e suitedocs/designs/long-haul-test-design.md throughout

Neither the PR description nor the design doc reference docs/designs/e2e-test-suite.md or test/e2e/. That suite already ships primitives long-haul will need:

  • Ginkgo v2 harness with SynchronizedBeforeSuite for cluster bootstrapping (test/e2e/pkg/testenv)
  • Label taxonomy (test/e2e/labels.go)
  • Run-ID tagging for artifact separation (test/e2e/runid.go)
  • envsubst-based manifest rendering
  • A mongo-driver v2 wire-protocol client (same driver the design proposes)
  • .github/actions/setup-test-environment composite action for cluster provisioning

The "Directory Structure" section (lines 213–244) proposes reinventing monitor/, config/, suite bootstrap, and mongo client from scratch. The design should commit to either reuse — most naturally by making longhaul a sibling module that depends on test/e2e/pkg/ — or explicitly enumerate what it forks and why. Otherwise, six months from now we'll have two drifting implementations of "connect to a DocumentDB CR and open a mongo client" with different flag names, different retry policies, and different label conventions.

M3. Four design gaps that will cost rework, not captured in §Open Questionslong-haul-test-design.md:443-445

  1. Event journal durability across process/pod restarts. Line 239 says "In-memory + file-backed" but never says where the file lives. If the canary runs as a Kubernetes Job (line 305), the pod's ephemeral filesystem disappears on restart — and Phase 2e explicitly plans for operator restart / pod eviction chaos. The journal is the only post-mortem artifact and it will be destroyed by the very events it is supposed to record. Commit to a PVC mount or an external sink (object storage, Loki) in the design.
  2. Workload resumption idempotency. Writers use a unique index on (writer_id, seq) (line 80). If a writer or the Job pod restarts, where does seq restart from? Reset to 0 → writes collide. Bootstrap from max(seq) per writer_id → the oracle must tolerate a gap between crash and resume without flagging data loss. Neither path is described.
  3. Teardown on abort. "On failure: collect artifacts, preserve cluster state" (line 188) handles the fatal path. But Ctrl-C, CI cancellation, or Job preemption mid-run leaves writer data in the target DB, possibly an in-progress restore cluster (Phase 2a), possibly a half-scaled topology (Phase 1e). No cleanup hooks, resumable state files, or leftover-run detector.
  4. Latency-regression baseline policy. Failure tiers list "latency regression" as a warning (line 197), but no baseline/threshold policy is defined. List this as an Open Question at minimum so Phase 1b doesn't invent a threshold policy in a PR review thread.

These shape the Phase 1b/1c interfaces and should be addressed (or explicitly deferred in §Open Questions) before the design doc merges.

M4. MaxDuration=30m default contradicts "run until failure"config.go:594, long-haul-test-design.md:37,188

The design repeatedly commits to the Strimzi run-until-failure canary model. The Phase 1a default is 30m; the env doc (line 259) says 0s opts into infinite. Fine for local development — exactly wrong for the Phase 2f canary Job. Pick one: either (a) document that the canary Job manifest will set LONGHAUL_MAX_DURATION=0s explicitly and mention this in the README "CI Safety" section, or (b) flip the default to 0 and require local-dev to pass LONGHAUL_MAX_DURATION=10m.


🟡 Minor (cheap in-PR fixes)

  • m1 IsEnabled()config.go:637-640 — doesn't trim whitespace. LONGHAUL_ENABLED=" true " → false (silent skip). Add strings.TrimSpace. Silent-failure mode; worth fixing in-PR.
  • m2 Validate()config.go:625-633 — accepts negative MaxDuration. time.ParseDuration("-5m") succeeds. Add if c.MaxDuration < 0 { return fmt.Errorf(...) }.
  • m3 Error strings capitalized — config.go:627,630ST1005 / Go convention: lowercase. Matching tests use ContainSubstring("Namespace") / "ClusterName" and will need adjustment.
  • m4 Path inconsistency — long-haul-test-design.md:340 says kubectl apply -f test/longhaul/deploy/job.yaml but the directory structure roots at operator/src/test/longhaul/. Fix path, or resolve M1 first.
  • m5 README overstates CI safety — README.md:543-547 — phrase as "no CI workflow currently sets LONGHAUL_ENABLED; do not set it in any PR-gated workflow" rather than as a property of the code.
  • m6 config is a very generic package name. If Phase 2 ever imports the operator's own config, it will shadow. Consider longhaulcfg or cfg. Not blocking.
  • m7 Missing IsEnabled test cases — config_test.go:727-757 — given the silent-skip failure mode, lock it down: leading/trailing whitespace, mixed case ("True", "YES"), "0" / "no" / "false" → false.

🟢 Nit

  • longhaul_test.go:799 — package-level mutable testConfig. Fine for a single-file suite; revisit if Phase 1b splits files.
  • DefaultConfig() by value vs. Validate() pointer receiver — slight asymmetry. Leave it.
  • README line 509's -timeout 0 guidance is correct and rarely documented — nice touch.

Positive feedback

  • Skip gate is correctly placed in BeforeSuite, not BeforeEach — single skipped line instead of N per spec. Exactly right.
  • LoadFromEnv cleanly separates parsing from validation, so Validate() can be called on a programmatically-built Config in Phase 1b unit tests without fighting env vars.
  • The design doc's survey of Strimzi / CloudNative-PG / CockroachDB / Vitess and the "adopt / skip" table (lines 354–361) is unusually high-quality framing for an OSS design doc.
  • Phased breakdown (Phase 1a → 3) is appropriately granular — each phase is demoable and reviewable.
  • Test coverage on the config package is thorough for the surface area.

Questions for the author

  1. (M1) Any reason longhaul can't live at test/longhaul/ as a sibling module to test/e2e/? If there is one, please capture it in the design.
  2. (M2) Have you looked at test/e2e/pkg/ for reusable primitives (typed clients, envsubst, runid, labels)? What's the plan for avoiding a second mongo-client implementation?
  3. (M3.1) For the journal: PVC, external object store, or stdout-scraped-by-Loki?
  4. Issue #220 text isn't quoted in the PR — are there SLO targets, acceptance criteria, or a required timeframe (e.g., 72h)? Restating them in the design's Problem Statement would make future reviews easier.
  5. Should CONTRIBUTING.md or AGENTS.md get a one-liner pointing at operator/src/test/longhaul/README.md so contributors aren't left wondering "what is this tree?"

Reviewed with the GitHub Copilot code-review agent.

@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch 2 times, most recently from 1985c93 to f3ffbef Compare April 22, 2026 18:21
Add the project skeleton for long-haul (canary) tests:

- config/config.go: Config struct with env var loading, defaults, and validation
- config/config_test.go: Comprehensive tests for all config options (15 specs)
- config/suite_test.go: Ginkgo suite entry for config unit tests
- suite_test.go: Ginkgo suite entry point for the canary
- longhaul_test.go: BeforeSuite with LONGHAUL_ENABLED skip gate + placeholder
- README.md: Usage guide for running locally and in CI

Config is in a sub-package so config unit tests run independently of the
long-running canary. Tests are gated behind LONGHAUL_ENABLED=true env var,
so go test ./... safely skips them.

Part of documentdb#220

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>
@WentingWu666666 WentingWu666666 force-pushed the developer/wentingwu/long-haul-tests branch from f3ffbef to 8459cae Compare April 22, 2026 18:47
@WentingWu666666
Copy link
Copy Markdown
Collaborator Author

Response to @xgerman's Review

Thank you for the thorough review! I've addressed all items in the latest force-push. Here's a summary:


Major Items

M1 (Module placement): Agreed this is a real concern. After researching CNPG's approach (they keep all tests in the same go.mod their ests/ dir with e2e specs, utils, labels, etc. is in the root module), I'm keeping longhaul under operator/src/ for Phase 1a. Added a concrete extraction trigger to the design:

"Any non-stdlib dependency beyond current operator deps triggers extraction to a nested est/longhaul/go.mod before that Phase merges."

Phase 1a adds zero new deps (only Ginkgo/Gomega, already in go.mod).

M2 (E2E engagement): The repo currently lacks reusable e2e primitives est/e2e/ does not exist yet (the Makefile references it but the directory hasn't been created). When shared test utilities are introduced (following CNPG's ests/utils/ pattern), long-haul will reuse them via a est/utils/ package. Added this as Open Question #4 with a note to create a follow-up issue for coordination.

M3 (Design gaps): Upgraded from Open Questions to Provisional Design Decisions in the design doc:

  1. Journal durability: PVC-backed file at /data/journal/, structured JSON lines, on-startup reconstruction from existing file.
  2. Writer seq resumption: Bootstrap from max(seq) per writer_id in DB. Oracle tolerates crash-recovery gaps (logged, not flagged as data loss).
  3. Teardown on abort: Signal handler for SIGTERM/SIGINT cancel contexts flush journal write final status to ConfigMap exit. Leftover-run detector on startup.
  4. Latency baseline: Kept as Open Question establish P99 baseline during first 30min warmup, alert on >2 sustained regression.

These lock the approach for Phase 1b/1c interfaces while leaving implementation details to those PRs.

M4 (MaxDuration default): Added notes in both the design doc (Configuration section) and README (CI Safety section) that the persistent canary Job manifest explicitly sets LONGHAUL_MAX_DURATION=0s. The 30m default remains as a safety net for local development.


Minor Items All Fixed

  • m1: Added strings.TrimSpace to IsEnabled()
  • m2: Added negative MaxDuration validation in Validate()
  • m3: Lowercased error strings (Go convention ST1005)
  • m4: Fixed path: est/longhaul/deploy/job.yaml operator/src/test/longhaul/deploy/job.yaml
  • m5: Rephrased CI safety: "No CI workflow currently sets this variable do not add it to any PR-gated workflow"
  • m6: Valid concern. Keeping config for Phase 1a (no shadowing risk yet). Will revisit if the operator's own config needs importing in later phases.
  • m7: Added 8 new test cases: whitespace (" true ", " yes "), whitespace-only (" "), mixed case ("True", "YES"), explicit false values ("0", "no", "false"). Total: 23 specs (was 15).

Questions

  1. (M1) Answered above CNPG precedent + concrete extraction trigger.
  2. (M2) Answered above will reuse shared primitives when they exist.
  3. (M3.1) PVC-backed file journal documented as provisional design decision.
  4. (Issue Set up long haul test cluster #220) The design's Problem Statement already quotes the three requirements from Set up long haul test cluster #220. SLO targets remain as Open Question Adding Microsoft SECURITY.MD #2 they need team input.
  5. (CONTRIBUTING.md/AGENTS.md) Good idea. Will add a one-liner in a follow-up PR to avoid scope creep here.

Test results: 23 config specs + 1 canary spec (skipped) = all passing. go vet clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Set up long haul test cluster

4 participants