diff --git a/docs/designs/long-haul-test-design.md b/docs/designs/long-haul-test-design.md new file mode 100644 index 00000000..7e916549 --- /dev/null +++ b/docs/designs/long-haul-test-design.md @@ -0,0 +1,633 @@ +# Long Haul Test Design — DocumentDB Kubernetes Operator + +**Issue:** [#220](https://github.com/documentdb/documentdb-kubernetes-operator/issues/220) +**Status:** In progress (Phase 1a complete) + +## Terminology + +This document refers to two kinds of cluster: + +- **DocumentDB cluster** — the database cluster managed by the operator (the `DocumentDB` Custom Resource and its pods). +- **Kubernetes cluster** (or **AKS cluster**, **Kind cluster**) — the infrastructure cluster where the operator and DocumentDB run. + +When unqualified, "cluster" in the context of operations, health, and state refers to the **DocumentDB cluster**. Infrastructure clusters are always qualified (AKS, Kind, etc.). + +## Problem Statement + +The operator lacks continuous, long-running test coverage. Issue #220 requires: +1. Constant writes/reads — ensure no data is lost +2. Constant management operations (add/remove region, HA toggle, scale, backup/restore) +3. Operator and DocumentDB cluster updates under load + +## Why Long Haul Testing? 
+ +Problems that only surface over extended continuous operation: +- **Memory/resource leaks** — need hours of reconciliation loops to see growth trends +- **WAL accumulation / disk fill** — cleanup bugs take time to manifest +- **Connection pool exhaustion** — gradual leak over many connect/disconnect cycles +- **Reconciliation drift** — operator state slowly diverges after many operations +- **Certificate rotation** — certs don't expire during 60-min CI runs +- **Backup retention cleanup** — need to exceed retention period to verify pruning +- **Pod restart cascades** — subtle race conditions under repeated scale/failover cycles +- **Upgrade correctness under load** — data corruption from rolling restarts + +Existing 60-min E2E tests verify correctness of individual operations. Long haul tests verify **sustained reliability** — that the operator doesn't degrade over time. + +## Design Overview + +The design is based on research of Strimzi, CloudNative-PG, CockroachDB (roachtest), and Vitess soak test patterns. The common architecture across all projects: **separate workload generation from disruption injection, run them concurrently, verify correctness post-hoc**. + +We adopt the **run-until-failure (canary)** model inspired by Strimzi: the DocumentDB cluster runs indefinitely with continuous workload and operations. When something breaks — data loss, unrecoverable state, resource exhaustion — the test captures the failure, collects artifacts, and alerts the team. 
This answers the real question: **"what breaks first, and after how long?"** + +--- + +## Architecture: 4 Components + +``` +┌─────────────────────────────────────────────────────────┐ +│ Long Haul Test (Go/Ginkgo) │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌───────────────┐ │ +│ │ Data Plane │ │ Control Plane│ │ Health Monitor │ │ +│ │ Workload │ │ Operations │ │ & Metrics │ │ +│ │ │ │ │ │ │ │ +│ │ • Writers │ │ • Scale │ │ • Pod status │ │ +│ │ • Readers │ │ • Replication│ │ • CR conditions│ │ +│ │ • Verifiers │ │ • Backup │ │ • OTel metrics │ │ +│ │ │ │ • Upgrade │ │ • Leak detect │ │ +│ └──────┬───────┘ └──────┬───────┘ └───────┬───────┘ │ +│ │ │ │ │ +│ └─────────┬───────┴───────────────────┘ │ +│ ▼ │ +│ ┌────────────────┐ │ +│ │ Event Journal │ │ +│ │ │ │ +│ │ • Op start/end │ │ +│ │ • State changes│ │ +│ │ • Error windows│ │ +│ │ • Disruption │ │ +│ │ budgets │ │ +│ └────────────────┘ │ +└─────────────────────────────────────────────────────────┘ +``` + +### Component 1: Data Plane Workload + +**Purpose:** Continuous read/write traffic to detect data loss, corruption, and availability gaps. + +**Implementation:** Go with the official MongoDB driver (`go.mongodb.org/mongo-driver`), NOT shelling out to mongosh. This gives better cancellation/retry/context control over 24h+ runs. 
+ +**Writer Model (Durability Oracle):** +- Multiple writer goroutines, each with a unique `writer_id` +- Each write: `{writer_id, seq, payload, checksum(payload), timestamp}` +- Unique index on `(writer_id, seq)` to detect duplicates +- Track three states per write: **attempted**, **acknowledged**, **verified** +- Use `writeConcern: majority` for durability claims +- Small percentage of **upserts/updates** (not just inserts) for broader coverage + +**Reader/Verifier Model:** +- Periodic full-scan verification: no gaps in acknowledged sequences per writer +- Checksum validation on read-back +- Separate counters for: missing acknowledged writes, duplicates, stale reads, checksum mismatches +- Use `readConcern: majority` to avoid false negatives from replica lag +- Lag-aware: don't flag replication delay as data loss + +**Metrics Emitted:** +- `longhaul_writes_attempted`, `longhaul_writes_acknowledged`, `longhaul_writes_failed` +- `longhaul_reads_total`, `longhaul_reads_stale`, `longhaul_verification_failures` +- `longhaul_write_latency_ms`, `longhaul_read_latency_ms` + +### Component 2: Control Plane Operations + +**Purpose:** Exercise management operations under continuous load. 
+ +**Operation Categories:** + +| Operation | Type | Expected Disruption | Validation | +|-----------|------|-------------------|------------| +| Scale up (nodeCount++) | Topology | None | New pods ready, data accessible | +| Scale down (nodeCount--) | Topology | Brief write pause | Remaining pods healthy, no data loss | +| Enable replication | Replication | None | Replicas created, WAL streaming | +| Disable replication | Replication | Brief | Standalone healthy | +| Add region | Multi-region | None | New region catches up, data synced | +| Remove region | Multi-region | Brief | Remaining regions healthy | +| Toggle HA (localHA) | HA | Brief failover | Primary switches, writes resume | +| On-demand backup | Backup | None | Backup CR reaches Completed | +| Restore to new DocumentDB cluster | Backup | N/A (new cluster) | Restored data matches backup watermark | +| Scheduled backup verify | Backup | None | Backups created on schedule | +| Operator upgrade | Update | None (DB pods should NOT restart) | Operator pod rolls, DocumentDB cluster unaffected | +| DocumentDB binary upgrade | Update | Rolling restart | Pods restart one-by-one, workload continues | +| Schema upgrade | Update | Varies | Pre-backup, post-upgrade reads/writes OK | +| Operator restart/leader failover | Chaos | Brief reconcile gap | Reconciliation resumes | +| Pod eviction (simulating node drain) | Chaos | Brief | Pod rescheduled, workload resumes | + +**Sequencing Rules:** +- Operations are NOT fully random — use **preconditions and cooldowns** +- Cannot remove region if only 1 region exists +- Cannot scale below minimum node count +- Cooldown between disruptive ops (configurable, default 5 min) +- Must reach steady state before next operation +- Backup/restore is a **separate flow** (restore creates a NEW DocumentDB cluster, verifies, then cleans up) + +**Per-Operation Outage Policy:** +```go +type OutagePolicy struct { + AllowedDowntime time.Duration // e.g., 60s for failover + 
AllowedWriteFailures int // tolerated write errors during window + MustRecoverWithin time.Duration // e.g., 5min to return to steady state +} +``` + +### Component 3: Health Monitor & Metrics + +**Purpose:** Continuous DocumentDB cluster health observation + resource leak detection. + +**What to Monitor:** +- **Kubernetes layer:** Pod readiness, restart counts, OOMKills, events +- **CR layer:** DocumentDB status conditions, backup phase transitions +- **Operator layer:** Operator logs/errors, reconciliation count, reconcile duration +- **Database layer:** Connection count, WAL lag, replication status +- **Resource layer:** Memory/CPU usage trends (via OTel/cAdvisor), PVC usage + +**Leak Detection:** +- Sample memory/CPU at fixed intervals +- Linear regression over last N samples +- Alert if slope exceeds threshold (configurable) +- 48-72h runs recommended for reliable leak detection + +**Steady State Definition:** +``` +- All pods in Ready state +- DocumentDB CR conditions: all True +- Replication lag < threshold (if replicated) +- No new pod restarts in last 5 min +- Workload success rate > 99.9% +- No unresolved backup failures +``` + +### Component 4: Event Journal + +**Purpose:** Central log correlating operations, disruptions, and errors for post-mortem analysis. + +**Every entry records:** +- Timestamp +- Event type (op_start, op_end, disruption_window_open, disruption_window_close, health_change, workload_error, verification_failure) +- Operation ID +- Cluster state snapshot (DocumentDB topology, pod count, primary node) +- Associated errors (if any) + +**Key use case:** When a write failure occurs, the journal shows whether it happened during an expected disruption window (tolerable) or during steady state (bug). + +--- + +## Canary Model + +The long-haul test is a **single persistent canary** running on a dedicated AKS cluster. 
Existing Kind-based integration tests (45-60 min, PR-gated) already cover short-lived validation — there is no need for a separate smoke mode. + +**Canary Configuration:** +- 5 writers, 2 verifiers +- Full operation cycle (scale, HA, replication, backup/restore, upgrades, chaos) +- Runs indefinitely until a fatal failure occurs +- On failure: collect artifacts, preserve DocumentDB cluster state for investigation +- Key output: **MTTF** (mean time to failure) and failure classification +- During development: test locally with `--max-duration=30m` against Kind + +### Failure Tiers + +| Tier | Example | Action | +|------|---------|--------| +| **Fatal** (stop test) | Acknowledged write lost, checksum mismatch, DocumentDB cluster unrecoverable >10min | Artifact dump + preserve cluster + exit non-zero | +| **Degraded** (log + continue) | Operator pod restarted, brief write timeout during expected disruption | Log to journal, continue if recovery within budget | +| **Warning** (monitor) | Memory trending up, reconcile latency increasing | Log warning, no stop | + +### Auto-Recovery Before Fatal Declaration +- Operator crash → wait for K8s restart → continue if healthy within 5 min +- Pod eviction → wait for reschedule → continue +- Data loss or corruption → **immediate stop**, preserve DocumentDB cluster state for investigation + +### Future: Multi-Region Canary +- Add/remove region operations, cross-region replication verification +- AKS Fleet integration +- Separate canary AKS cluster or extension of single-cluster canary + +--- + +## Directory Structure + +The test infrastructure follows a **three-directory layout** at the repo root: + +``` +test/ +├── utils/ # Shared test utilities (used by BOTH e2e and longhaul) +│ ├── go.mod # Separate module: github.com/.../test/utils +│ ├── mongo/ # Mongo client, Seed, Count, Ping, Handle +│ ├── assertions/ # Gomega-compatible checkers (DocumentDBReady, InstanceCount, …) +│ ├── documentdb/ # DocumentDB CR CRUD (Create, 
WaitHealthy, Delete, PatchSpec, …) +│ ├── operatorhealth/ # Operator-churn gate (pod UID/restart tracking) +│ ├── portforward/ # Gateway port-forward (wraps CNPG forwardconnection) +│ ├── fixtures/ # Namespace/secret/label helpers, teardown-by-label +│ ├── timeouts/ # Centralised Eventually durations (reuses CNPG timeouts) +│ ├── clusterprobe/ # Runtime capability checks (VolumeSnapshot CRD, StorageClass) +│ ├── seed/ # Deterministic datasets (SmallDataset, MediumDataset, …) +│ └── testenv/ # Shared environment config (kubeconfig, client setup) +│ +├── e2e/ # E2E test suite (PR #346) +│ ├── go.mod # Imports test/utils + operator API types +│ ├── tests/ +│ │ ├── lifecycle/ # Deploy, delete, image update, log level +│ │ ├── scale/ # Instance scaling +│ │ ├── data/ # CRUD, aggregation, sort/limit +│ │ ├── backup/ # Backup & restore +│ │ ├── tls/ # TLS certificate modes +│ │ ├── upgrade/ # Operator & binary upgrades +│ │ └── ... +│ └── README.md +│ +└── longhaul/ # Long-haul canary test suite + ├── go.mod # Imports test/utils + operator API types + ├── README.md # Usage guide (running locally, CI safety, configuration) + ├── suite_test.go # Ginkgo suite entry point for the canary + ├── longhaul_test.go # BeforeSuite (skip gate + config) + long-running test specs + ├── config/ + │ ├── config.go # Config struct, env var loading, validation, IsEnabled gate + │ ├── suite_test.go # Ginkgo suite entry for config unit tests + │ └── config_test.go # Config unit tests (23 specs, fast, no Kubernetes cluster needed) + ├── workload/ # (Phase 1b) + │ ├── writer.go # Multi-writer with durability tracking + │ ├── reader.go # Reader + verifier (reuses test/utils/mongo) + │ └── oracle.go # Data integrity oracle (acknowledged write tracking) + ├── operations/ # (Phase 1d-2d) + │ ├── scheduler.go # Operation sequencer with preconditions/cooldowns + │ ├── scale.go # Scale (reuses test/utils/documentdb.PatchInstances) + │ ├── replication.go # Replication enable/disable, add/remove region + 
│ ├── backup.go # Backup create + restore (reuses test/utils/clusterprobe) + │ ├── upgrade.go # Operator, DocumentDB binary, schema upgrades + │ └── chaos.go # Pod eviction, operator restart + ├── monitor/ # (Phase 1d) + │ ├── health.go # Reuses test/utils/assertions + test/utils/operatorhealth + │ ├── metrics.go # OTel/Prometheus metric collection + │ └── leakdetect.go # Resource trend analysis + ├── journal/ # (Phase 1c) + │ ├── journal.go # Event journal with disruption window tracking + │ └── policy.go # Per-operation outage policies + └── report/ # (Phase 1f) + ├── report.go # Summary report generation + └── templates/ # Report templates (markdown/HTML) +``` + +### Shared Utilities: `test/utils/` + +The `test/utils/` module provides reusable test infrastructure for **both** E2E and long-haul tests. This avoids duplicating ~2000 lines of proven utilities. The packages originate from PR #346's `test/e2e/pkg/e2eutils/` and are promoted to the shared location. + +**Key packages and how long-haul uses them:** + +| Package | What it provides | Long-haul use | +|---------|-----------------|---------------| +| `mongo/` | Client, Seed, Count, Ping, Handle, port-forward connect | Writers + Verifiers connect to DocumentDB gateway | +| `assertions/` | AssertDocumentDBReady, AssertInstanceCount, AssertPrimaryUnchanged | Health monitor polls cluster health continuously | +| `documentdb/` | Create, WaitHealthy, Delete, PatchInstances, PatchSpec | Operation executor (scale, upgrade, backup/restore) | +| `operatorhealth/` | Gate (pod UID/restart tracking), Check, MarkChurned | Health monitor detects operator churn under load | +| `portforward/` | OpenWithErr for gateway service | Writers open port-forward to DocumentDB gateway | +| `timeouts/` | For(op), PollInterval(op) — standardised wait durations | All waiters use consistent, CNPG-aligned timeouts | +| `fixtures/` | ensureNamespace, ensureCredentialSecret, ownershipLabels, teardownByLabels | Canary setup creates namespace 
+ credentials; teardown by label on abort | +| `clusterprobe/` | HasVolumeSnapshotCRD, StorageClassAllowsExpansion | Backup operations skip when CSI snapshots unavailable | +| `seed/` | SmallDataset, MediumDataset (deterministic bson.M generators) | Writer seed data for baseline verification | + +**Module structure:** + +``` +test/utils/go.mod → github.com/documentdb/documentdb-operator/test/utils +test/e2e/go.mod → github.com/documentdb/documentdb-operator/test/e2e +test/longhaul/go.mod → github.com/documentdb/documentdb-operator/test/longhaul +operator/src/go.mod → github.com/documentdb/documentdb-operator (unchanged) +``` + +Each test module uses a `replace` directive to point at the local operator source and `test/utils`: + +```go +// test/longhaul/go.mod +module github.com/documentdb/documentdb-operator/test/longhaul + +require ( + github.com/documentdb/documentdb-operator/test/utils v0.0.0 + github.com/documentdb/documentdb-operator v0.0.0 +) + +replace ( + github.com/documentdb/documentdb-operator/test/utils => ../utils + github.com/documentdb/documentdb-operator => ../../operator/src +) +``` + +> **Migration note:** PR #346 currently has utilities under `test/e2e/pkg/e2eutils/`. Extracting them to +> `test/utils/` is a follow-up task that should be coordinated with xgerman. Until extraction happens, +> long-haul tests can vendor the needed types locally and swap to imports once `test/utils/` exists. + +--- + +## Configuration + +All configuration is via environment variables. Tests are **gated** behind `LONGHAUL_ENABLED` — they are safely skipped in regular CI runs (`go test ./...`). + +**Current (Phase 1a):** + +| Variable | Required | Default | Description | +|----------|----------|---------|-------------| +| `LONGHAUL_ENABLED` | Yes | — | Must be `true`, `1`, or `yes` to run. Otherwise all tests skip. | +| `LONGHAUL_CLUSTER_NAME` | Yes | — | Name of the target DocumentDB cluster CR. 
| +| `LONGHAUL_NAMESPACE` | No | `default` | Kubernetes namespace of the target DocumentDB cluster. | +| `LONGHAUL_MAX_DURATION` | No | `30m` | Max test duration (`0s` = run until failure). | + +> **Note:** The default 30m timeout is a safety net for local development. The persistent canary +> Job manifest explicitly sets `LONGHAUL_MAX_DURATION=0s` to enable run-until-failure mode. + +**Planned (future phases):** + +| Variable | Default | Phase | Description | +|----------|---------|-------|-------------| +| `LONGHAUL_NUM_WRITERS` | `5` | 1b | Number of concurrent writer goroutines | +| `LONGHAUL_NUM_VERIFIERS` | `2` | 1b | Number of concurrent verifier goroutines | +| `LONGHAUL_OP_COOLDOWN` | `5m` | 1e | Min interval between disruptive operations | +| `LONGHAUL_OP_ENABLED` | all | 1e | Comma-separated list of enabled operations | +| `LONGHAUL_RECOVERY_TIMEOUT` | `5m` | 2e | Max time to wait for auto-recovery before fatal | + +--- + +## Deployment & Visibility + +### Approach + +The long haul test code is fully open source in the repository — anyone can run it. There is no requirement for a public-facing dashboard or scheduled CI workflow for the canary. This matches the pattern of most early-stage OSS projects; public dashboards (like Strimzi's Jenkins or CockroachDB's TeamCity) can be added later as the project matures. + +### Running the Canary + +**Local development (anyone):** +```bash +cd test/longhaul + +# Run config unit tests (fast, no Kubernetes cluster needed) +go test ./config/ -v + +# Run the canary against a local Kind cluster +LONGHAUL_ENABLED=true \ +LONGHAUL_CLUSTER_NAME=documentdb-sample \ +LONGHAUL_NAMESPACE=default \ +LONGHAUL_MAX_DURATION=10m \ +go test ./... -v -timeout 0 + +# Or build a standalone binary +go test -c -o longhaul.test ./ +LONGHAUL_ENABLED=true \ +LONGHAUL_CLUSTER_NAME=documentdb-sample \ +./longhaul.test -test.v -test.timeout 0 +``` +Runs against whatever Kubernetes cluster your kubeconfig points to (Kind, Minikube, etc.). 
+ +**Persistent canary (internal):** +- Dedicated AKS cluster provisioned once (manually or via IaC) +- Long haul test deployed as a Kubernetes Job on the same AKS cluster (separate `longhaul` namespace) +- On new operator release: re-deploy operator via Helm + restart longhaul Job +- Internal Grafana/OTel dashboard for monitoring (optional) +- DocumentDB cluster preserved on failure for investigation + +> **Note:** The canary runs on a team-managed AKS cluster. Contributors do not need cluster access — +> test results are made public via GitHub Issues (on failure) and an optional status badge in README. +> This is standard practice for open-source projects (CockroachDB, Strimzi, Kubernetes itself all +> run long-running tests on private infrastructure with public results). + +### Alerting + +The alerting system uses a **two-layer architecture** to avoid managing long-lived tokens on the AKS cluster: + +**Layer 1: AKS cluster (always running)** +- Long-haul canary runs as a Kubernetes Job — continuous workload +- Writes status to a well-known ConfigMap (`longhaul-status` in `longhaul` namespace) +- Updates include: current state (running/failed/passed), last heartbeat, failure details, journal excerpt +- No GitHub token needed on the AKS cluster + +**Layer 2: GitHub Actions (periodic health check)** +- Scheduled workflow runs every hour (`cron: '0 * * * *'`) +- Connects to AKS cluster via Azure federated identity (OIDC, same as auto-upgrade workflow) +- Checks canary health: pod status, status ConfigMap, recent pod logs +- If failure detected → creates a GitHub Issue with: + - Title: `[Long Haul Failure] {failure type} — {timestamp}` + - Body: DocumentDB cluster name, uptime, error details, journal excerpt, pod logs + - Labels: `long-haul-failure` +- Uses `GITHUB_TOKEN` (auto-managed by GitHub Actions, no expiry, no rotation) +- Maintainers receive email automatically via GitHub's issue notification system +- Deduplication: skips issue creation if an open 
`long-haul-failure` issue already exists + +```yaml +on: + schedule: + - cron: '0 * * * *' # every hour + workflow_dispatch: # manual trigger + +jobs: + check-canary: + runs-on: ubuntu-latest + permissions: + id-token: write # Azure OIDC + issues: write # create GitHub Issues + steps: + - uses: actions/checkout@v4 + - uses: azure/login@v2 + with: + client-id: ${{ secrets.AZURE_CLIENT_ID }} + tenant-id: ${{ secrets.AZURE_TENANT_ID }} + subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} + - run: az aks get-credentials --resource-group $RG --name $CLUSTER + - name: Check canary status + id: status + run: | + # Check pod health + POD_STATUS=$(kubectl get pods -l job-name=longhaul -n longhaul -o jsonpath='{.items[0].status.phase}') + # Read status ConfigMap + CANARY_STATUS=$(kubectl get configmap longhaul-status -n longhaul -o jsonpath='{.data.status}') + echo "pod_status=$POD_STATUS" >> $GITHUB_OUTPUT + echo "canary_status=$CANARY_STATUS" >> $GITHUB_OUTPUT + - name: Create issue on failure + if: steps.status.outputs.canary_status == 'failed' || steps.status.outputs.pod_status != 'Running' + uses: actions/github-script@v7 + with: + script: | + // Deduplicate: skip if open issue exists + const { data: issues } = await github.rest.issues.listForRepo({ + owner: context.repo.owner, repo: context.repo.repo, + labels: 'long-haul-failure', state: 'open' + }); + if (issues.length > 0) return; + await github.rest.issues.create({ + owner: context.repo.owner, repo: context.repo.repo, + title: `[Long Haul Failure] ${new Date().toISOString()}`, + body: `Canary status: ${{ steps.status.outputs.canary_status }}\nPod: ${{ steps.status.outputs.pod_status }}`, + labels: ['long-haul-failure'] + }); +``` + +**Benefits:** +- No long-lived GitHub tokens on the AKS cluster +- `GITHUB_TOKEN` in Actions is auto-managed — no expiry, no rotation +- Maintainers get email through GitHub's built-in notification system +- All failures are publicly visible as GitHub Issues — contributors can see 
and comment +- Easy to extend: add Slack webhook, Teams notification, or status badge in future + +### Auto-Upgrade + +A GitHub Actions workflow handles upgrading the canary AKS cluster automatically. It triggers on new releases and can also be triggered manually. + +```yaml +on: + workflow_dispatch: # manual trigger + release: + types: [published] # auto-trigger on new operator release + +jobs: + upgrade-canary: + runs-on: ubuntu-latest + permissions: + id-token: write # for Azure federated identity (OIDC) + steps: + - uses: actions/checkout@v4 + - uses: azure/login@v2 + with: + client-id: ${{ secrets.AZURE_CLIENT_ID }} + tenant-id: ${{ secrets.AZURE_TENANT_ID }} + subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} + - run: az aks get-credentials --resource-group $RG --name $CLUSTER + - run: helm upgrade documentdb-operator ./operator/documentdb-helm-chart + - run: | + kubectl delete job longhaul -n longhaul --ignore-not-found + kubectl apply -f operator/src/test/longhaul/deploy/job.yaml + kubectl wait --for=condition=ready pod -l job-name=longhaul -n longhaul --timeout=120s +``` + +**Key points:** +- **AKS auth**: Azure federated identity (OIDC) — no stored secrets, just a trust relationship between GitHub and Azure +- **Operator release** → workflow auto-triggers → Helm upgrade → restart longhaul Job +- **Test code change** → rebuild longhaul image, trigger workflow manually via `workflow_dispatch` +- **Audit trail**: Every upgrade is visible in GitHub Actions history + +--- + +## Learnings from Other Projects + +| Project | Key Pattern We Adopt | Key Pattern We Skip | +|---------|---------------------|-------------------| +| **Strimzi** | Run-until-failure loops; metrics collection; CI profiles | JUnit (we use Ginkgo) | +| **CloudNative-PG** | Ginkgo framework; failover via pod delete + SIGSTOP; LSN verification | Single-sequence failover (we need continuous concurrent workload) | +| **CockroachDB** | Chaos runner (periodic kill/restart); separate workload 
from disruption; roachstress repeated runs | Custom roachtest framework (too heavy for our needs) | +| **Vitess** | Background stress goroutine; per-query tracking; Go native driver | No fault injection (we need disruptive ops) | + +**Universal pattern adopted:** Separate workload generators from disruption injectors, run concurrently, verify correctness against an acknowledged-write oracle, use per-operation disruption budgets. Run-until-failure (Strimzi model) rather than time-bounded. + +--- + +## Implementation Phases + +Each phase is a self-contained, demoable increment (~1-2 PRs each). + +### Phase 1a: Project Skeleton + Config ✅ +- `test/longhaul/` directory with Ginkgo suite, BeforeSuite skip gate, placeholder test +- `test/longhaul/config/` sub-package with Config struct, env var loading, validation, IsEnabled +- Config unit tests (23 specs) in separate suite — fast, no Kubernetes cluster needed +- README with usage guide, config reference, CI safety explanation +- CI-safe: `LONGHAUL_ENABLED` gate skips tests in `go test ./...` + +### Phase 1b: Data Plane Workload +- Multi-writer goroutines with durability oracle +- Reader/verifier with gap, duplicate, and checksum detection +- Reuses `test/utils/mongo` for gateway connections and `test/utils/seed` patterns for data generation +- Metrics counters (writes attempted/acknowledged/failed, reads, verification failures) + +### Phase 1c: Event Journal +- Central event log (op_start, op_end, health_change, workload_error, etc.) 
+- Disruption window tracking (expected vs unexpected errors) +- In-memory + file-backed for post-mortem + +### Phase 1d: Health Monitor +- Pod readiness, restart counts, OOMKills +- DocumentDB CR status conditions +- Reuses `test/utils/assertions` (AssertDocumentDBReady) and `test/utils/operatorhealth` (Gate) +- Steady-state detection (all healthy, no recent restarts, workload success rate OK) + +### Phase 1e: Scale Operations +- Scale up/down with precondition checks (reuses `test/utils/documentdb.PatchInstances`) +- Per-operation outage policy enforcement +- First control plane operation — validates the operation scheduler pattern + +### Phase 1f: Summary Report +- Markdown report on exit (pass/fail, duration, stats, operation timeline) +- Event journal dump +- Testable locally: `cd test/longhaul && LONGHAUL_MAX_DURATION=30m go test ./... -v -timeout 0` against Kind + +### Phase 2a: Backup & Restore Operations +- On-demand backup creation + wait for completion +- Restore to new DocumentDB cluster + data verification against backup watermark +- Cleanup of restored DocumentDB cluster + +### Phase 2b: HA & Replication Operations +- Toggle HA (localHA) +- Enable/disable replication +- Precondition checks (e.g., cannot disable if already standalone) + +### Phase 2c: Upgrade Operations +- Operator upgrade (Helm) +- DocumentDB binary upgrade (documentDBVersion) +- Schema upgrade (schemaVersion) +- Each tested separately with outage policy + +### Phase 2d: Chaos Operations +- Pod eviction (simulating node drain) +- Operator restart / leader failover + +### Phase 2e: Failure Tiers + Auto-Recovery +- Fatal / degraded / warning classification +- Auto-recovery logic (wait for K8s restart before declaring fatal) +- DocumentDB cluster state preservation on fatal failure + +### Phase 2f: AKS Deployment +- Dockerfile for longhaul test image +- Kubernetes Job manifest, RBAC (ServiceAccount, ClusterRole, Binding) +- ConfigMap for tuning parameters +- Deploy script / instructions 
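As a sketch of the Phase 2f Job manifest, with placeholder image and RBAC names; the env values mirror the configuration table, and `backoffLimit: 0` with `restartPolicy: Never` leaves a failed pod in place for investigation:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: longhaul
  namespace: longhaul
spec:
  backoffLimit: 0                 # never retry; keep the failed pod for investigation
  template:
    spec:
      serviceAccountName: longhaul            # bound to the Phase 2f ClusterRole
      restartPolicy: Never
      containers:
        - name: longhaul
          image: example.azurecr.io/longhaul:latest   # placeholder image reference
          env:
            - name: LONGHAUL_ENABLED
              value: "true"
            - name: LONGHAUL_CLUSTER_NAME
              value: documentdb-sample
            - name: LONGHAUL_NAMESPACE
              value: default
            - name: LONGHAUL_MAX_DURATION
              value: "0s"                     # run until failure
```

Deleting and re-applying this Job (as the auto-upgrade workflow does) restarts the canary from a clean state.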
+
+### Phase 2g: Auto-Upgrade Workflow
+- GitHub Actions workflow (triggered on release + manual dispatch)
+- Azure OIDC auth, Helm upgrade, Job restart
+
+### Phase 2h: Alerting Workflow
+- GitHub Actions scheduled workflow (hourly cron)
+- Checks canary pod status + status ConfigMap
+- Creates GitHub Issue on failure (with deduplication)
+- Labels: `long-haul-failure`
+- Maintainers receive email via GitHub notification system
+
+### Phase 3: Multi-Region Canary
+- Add/remove region operations
+- Cross-region replication verification
+- AKS Fleet integration
+
+---
+
+## Open Questions
+1. What AKS cluster/subscription should be used for the dedicated canary cluster?
+2. Desired SLO targets (e.g., 99.9% write success during steady state)?
+3. **Module placement (decided):** Long-haul tests live in `test/longhaul/` as a separate Go module (`test/longhaul/go.mod`). Shared test infrastructure lives in `test/utils/` and is imported by both `test/e2e/` and `test/longhaul/` via `replace` directives. This keeps test dependencies (Ginkgo, mongo-driver, CNPG test utils) out of the operator's runtime `go.mod`.
+4. **Shared utility extraction (decided):** PR #346 currently places reusable utilities under `test/e2e/pkg/e2eutils/`. A follow-up task will extract them to `test/utils/` so long-haul tests can import without depending on the E2E module. Until extraction, long-haul can vendor needed helpers locally.
+
+## Design Decisions (Provisional)
+
+The following decisions shape future Phase interfaces. They are provisional: details will be refined when each phase begins, but the overall approach is settled.
+
+### Journal Durability (Phase 1c)
+The event journal will use a PVC-backed file for persistence across pod restarts. The journal appends structured JSON lines (`{timestamp, event_type, op_id, cluster_state, error}`). On startup, the journal reader scans the existing file to reconstruct in-memory state. The PVC is mounted at `/data/journal/` in the canary Job manifest.
+ +### Writer Sequence Resumption (Phase 1b) +On restart, each writer bootstraps its sequence number from `max(seq)` for its `writer_id` in the database. The oracle tolerates gaps between a crash and resume — gaps are logged as expected (crash-recovery gap) rather than flagged as data loss. The `(writer_id, seq)` unique index guarantees no duplicate sequence numbers. + +### Teardown on Abort (Phase 1b) +The harness registers a signal handler for SIGTERM and SIGINT. On signal: (1) cancel all writer/reader contexts, (2) flush journal to disk, (3) write final status to ConfigMap, (4) exit with appropriate code. On startup, the harness checks for a leftover run (stale ConfigMap with state=running but no matching pod) and logs a warning before proceeding. + +### Latency-Regression Baseline (Phase 1d) +During the first 30 minutes of a canary run, the monitor establishes P50/P99 write and read latency baselines. After warmup, sustained P99 regression >2× baseline for >5 minutes triggers a warning-level alert. The exact thresholds are configurable via environment variables (`LONGHAUL_LATENCY_P99_MULTIPLIER`, `LONGHAUL_LATENCY_WINDOW`). diff --git a/operator/src/test/longhaul/README.md b/operator/src/test/longhaul/README.md new file mode 100644 index 00000000..4c05b164 --- /dev/null +++ b/operator/src/test/longhaul/README.md @@ -0,0 +1,104 @@ +# Long Haul Tests + +Long haul tests validate that DocumentDB Kubernetes Operator clusters remain healthy under +continuous load over extended periods. They run a canary workload that writes and reads data, +performs management operations, and checks for data integrity. + +> **Status:** Phase 1a (skeleton). The canary workload and management operations will be added +> in subsequent phases. See [design document](../../../../docs/designs/long-haul-test-design.md) +> for the full plan. 
+ +## Project Structure + +``` +test/longhaul/ +├── README.md # This file +├── suite_test.go # Ginkgo suite entry point (the canary) +├── longhaul_test.go # BeforeSuite + long-running test specs +└── config/ + ├── config.go # Config struct, env var loading, validation + ├── suite_test.go # Config unit test suite entry + └── config_test.go # Config unit tests +``` + +- **`test/longhaul/`** — The actual long-running canary. Designed to run for hours/days. +- **`test/longhaul/config/`** — Config parsing and validation. Fast unit tests, safe for CI. + +## Quick Start + +### Prerequisites + +- A running Kubernetes cluster with DocumentDB deployed +- `kubectl` configured to access the cluster +- Go 1.25+ + +### Run the Config Unit Tests + +These are fast and require no cluster: + +```bash +cd operator/src +go test ./test/longhaul/config/ -v +``` + +### Run the Long Haul Canary Locally + +Against a local Kind cluster (see [development environment guide](../../../../docs/developer-guides/development-environment.md)): + +```bash +cd operator/src + +LONGHAUL_ENABLED=true \ +LONGHAUL_CLUSTER_NAME=documentdb-sample \ +LONGHAUL_NAMESPACE=default \ +LONGHAUL_MAX_DURATION=10m \ +go test ./test/longhaul/ -v -timeout 0 +``` + +> **Note:** Use `-timeout 0` to disable Go's default 10-minute test timeout for long runs. + +### Build a Standalone Binary + +For containerized deployment (Phase 2+): + +```bash +cd operator/src +go test -c -o longhaul.test ./test/longhaul/ + +# Run the compiled binary +LONGHAUL_ENABLED=true \ +LONGHAUL_CLUSTER_NAME=documentdb-sample \ +LONGHAUL_NAMESPACE=default \ +./longhaul.test -test.v -test.timeout 0 +``` + +## Configuration + +All configuration is via environment variables. Tests are **gated** behind `LONGHAUL_ENABLED` — +they are safely skipped in regular CI runs (`go test ./...`). + +| Variable | Required | Default | Description | +|----------|----------|---------|-------------| +| `LONGHAUL_ENABLED` | Yes | — | Must be `true`, `1`, or `yes` to run. 
Otherwise all tests skip. |
+| `LONGHAUL_CLUSTER_NAME` | Yes | — | Name of the target DocumentDB cluster CR. |
+| `LONGHAUL_NAMESPACE` | No | `default` | Kubernetes namespace of the target cluster. |
+| `LONGHAUL_MAX_DURATION` | No | `30m` | Max test duration. Use `0s` for run-until-failure. |
+
+> Additional configuration (writer count, operation cooldown, etc.) will be added in later phases
+> as the corresponding features are implemented.
+
+## CI Safety
+
+The long haul tests are gated behind `LONGHAUL_ENABLED`. No CI workflow currently sets this
+variable, and it must not be added to any PR-gated workflow. Three safeguards keep the canary
+out of regular CI runs:
+
+1. `LONGHAUL_ENABLED` is not set in any CI workflow
+2. The `BeforeSuite` calls `Skip()` when disabled
+3. CI output shows `Suite skipped in BeforeSuite -- 0 Passed | 0 Failed | 1 Skipped`
+
+> **Note:** For persistent canary deployment, the Job manifest explicitly sets
+> `LONGHAUL_MAX_DURATION=0s` to enable run-until-failure mode. The default 30m timeout
+> is only a safety net for local development.
+
+The config unit tests (`test/longhaul/config/`) run unconditionally and are included in normal
+CI test runs — they are fast (~0.002s) and require no cluster.
diff --git a/operator/src/test/longhaul/config/config.go b/operator/src/test/longhaul/config/config.go
new file mode 100644
index 00000000..70672548
--- /dev/null
+++ b/operator/src/test/longhaul/config/config.go
@@ -0,0 +1,87 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+package config
+
+import (
+	"fmt"
+	"os"
+	"strings"
+	"time"
+)
+
+const (
+	// Environment variable names for long haul test configuration.
+	EnvEnabled     = "LONGHAUL_ENABLED"
+	EnvMaxDuration = "LONGHAUL_MAX_DURATION"
+	EnvNamespace   = "LONGHAUL_NAMESPACE"
+	EnvClusterName = "LONGHAUL_CLUSTER_NAME"
+)
+
+// Config holds all configuration for a long haul test run.
+type Config struct {
+	// MaxDuration is the maximum test duration. Zero means run until failure.
+ // Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs. + // Default: 30m (safe for local development). + MaxDuration time.Duration + + // Namespace is the Kubernetes namespace of the target DocumentDB cluster. + Namespace string + + // ClusterName is the name of the target DocumentDB cluster CR. + ClusterName string +} + +// DefaultConfig returns a Config with safe defaults for local development. +func DefaultConfig() Config { + return Config{ + MaxDuration: 30 * time.Minute, + Namespace: "default", + ClusterName: "", + } +} + +// LoadFromEnv loads configuration from environment variables, +// falling back to defaults for any unset variable. +func LoadFromEnv() (Config, error) { + cfg := DefaultConfig() + + if v := os.Getenv(EnvMaxDuration); v != "" { + d, err := time.ParseDuration(v) + if err != nil { + return cfg, fmt.Errorf("invalid %s=%q: %w", EnvMaxDuration, v, err) + } + cfg.MaxDuration = d + } + + if v := os.Getenv(EnvNamespace); v != "" { + cfg.Namespace = v + } + + if v := os.Getenv(EnvClusterName); v != "" { + cfg.ClusterName = v + } + + return cfg, nil +} + +// Validate checks that the configuration is valid. +func (c *Config) Validate() error { + if c.MaxDuration < 0 { + return fmt.Errorf("max duration must not be negative, got %s", c.MaxDuration) + } + if c.Namespace == "" { + return fmt.Errorf("namespace must not be empty") + } + if c.ClusterName == "" { + return fmt.Errorf("cluster name must not be empty") + } + return nil +} + +// IsEnabled returns true if the long haul test is explicitly enabled +// via the LONGHAUL_ENABLED environment variable. 
+func IsEnabled() bool { + v := strings.TrimSpace(strings.ToLower(os.Getenv(EnvEnabled))) + return v == "true" || v == "1" || v == "yes" +} diff --git a/operator/src/test/longhaul/config/config_test.go b/operator/src/test/longhaul/config/config_test.go new file mode 100644 index 00000000..af07d63c --- /dev/null +++ b/operator/src/test/longhaul/config/config_test.go @@ -0,0 +1,157 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package config + +import ( + "time" + + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" +) + +var _ = Describe("Config", func() { + Describe("DefaultConfig", func() { + It("returns safe defaults", func() { + cfg := DefaultConfig() + Expect(cfg.MaxDuration).To(Equal(30 * time.Minute)) + Expect(cfg.Namespace).To(Equal("default")) + Expect(cfg.ClusterName).To(BeEmpty()) + }) + }) + + Describe("LoadFromEnv", func() { + It("uses defaults when no env vars set", func() { + GinkgoT().Setenv(EnvMaxDuration, "") + GinkgoT().Setenv(EnvNamespace, "") + GinkgoT().Setenv(EnvClusterName, "") + cfg, err := LoadFromEnv() + Expect(err).NotTo(HaveOccurred()) + Expect(cfg.MaxDuration).To(Equal(30 * time.Minute)) + }) + + It("parses MaxDuration from env", func() { + GinkgoT().Setenv(EnvMaxDuration, "1h") + cfg, err := LoadFromEnv() + Expect(err).NotTo(HaveOccurred()) + Expect(cfg.MaxDuration).To(Equal(1 * time.Hour)) + }) + + It("parses zero MaxDuration for infinite runs", func() { + GinkgoT().Setenv(EnvMaxDuration, "0s") + cfg, err := LoadFromEnv() + Expect(err).NotTo(HaveOccurred()) + Expect(cfg.MaxDuration).To(Equal(time.Duration(0))) + }) + + It("parses Namespace and ClusterName from env", func() { + GinkgoT().Setenv(EnvNamespace, "test-ns") + GinkgoT().Setenv(EnvClusterName, "my-cluster") + cfg, err := LoadFromEnv() + Expect(err).NotTo(HaveOccurred()) + Expect(cfg.Namespace).To(Equal("test-ns")) + Expect(cfg.ClusterName).To(Equal("my-cluster")) + }) + + It("returns error for invalid MaxDuration", func() { + 
GinkgoT().Setenv(EnvMaxDuration, "not-a-duration") + _, err := LoadFromEnv() + Expect(err).To(HaveOccurred()) + Expect(err.Error()).To(ContainSubstring(EnvMaxDuration)) + }) + }) + + Describe("Validate", func() { + It("passes for valid config", func() { + cfg := DefaultConfig() + cfg.ClusterName = "test-cluster" + Expect(cfg.Validate()).To(Succeed()) + }) + + It("fails when Namespace is empty", func() { + cfg := DefaultConfig() + cfg.ClusterName = "test" + cfg.Namespace = "" + Expect(cfg.Validate()).To(MatchError(ContainSubstring("namespace"))) + }) + + It("fails when ClusterName is empty", func() { + cfg := DefaultConfig() + Expect(cfg.Validate()).To(MatchError(ContainSubstring("cluster name"))) + }) + + It("fails when MaxDuration is negative", func() { + cfg := DefaultConfig() + cfg.ClusterName = "test" + cfg.MaxDuration = -1 * time.Second + Expect(cfg.Validate()).To(MatchError(ContainSubstring("max duration must not be negative"))) + }) + }) + + Describe("IsEnabled", func() { + It("returns false when env not set", func() { + GinkgoT().Setenv(EnvEnabled, "") + Expect(IsEnabled()).To(BeFalse()) + }) + + It("returns true for 'true'", func() { + GinkgoT().Setenv(EnvEnabled, "true") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true for '1'", func() { + GinkgoT().Setenv(EnvEnabled, "1") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true for 'yes'", func() { + GinkgoT().Setenv(EnvEnabled, "yes") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true case-insensitively", func() { + GinkgoT().Setenv(EnvEnabled, "TRUE") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true for mixed case 'True'", func() { + GinkgoT().Setenv(EnvEnabled, "True") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true for mixed case 'YES'", func() { + GinkgoT().Setenv(EnvEnabled, "YES") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true with surrounding whitespace", func() { + GinkgoT().Setenv(EnvEnabled, " true ") + 
Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns true for ' yes ' with whitespace", func() { + GinkgoT().Setenv(EnvEnabled, " yes ") + Expect(IsEnabled()).To(BeTrue()) + }) + + It("returns false for whitespace-only", func() { + GinkgoT().Setenv(EnvEnabled, " ") + Expect(IsEnabled()).To(BeFalse()) + }) + + It("returns false for 'false'", func() { + GinkgoT().Setenv(EnvEnabled, "false") + Expect(IsEnabled()).To(BeFalse()) + }) + + It("returns false for '0'", func() { + GinkgoT().Setenv(EnvEnabled, "0") + Expect(IsEnabled()).To(BeFalse()) + }) + + It("returns false for 'no'", func() { + GinkgoT().Setenv(EnvEnabled, "no") + Expect(IsEnabled()).To(BeFalse()) + }) + }) +}) diff --git a/operator/src/test/longhaul/config/suite_test.go b/operator/src/test/longhaul/config/suite_test.go new file mode 100644 index 00000000..c12c6a89 --- /dev/null +++ b/operator/src/test/longhaul/config/suite_test.go @@ -0,0 +1,16 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package config + +import ( + "testing" + + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" +) + +func TestConfig(t *testing.T) { + RegisterFailHandler(Fail) + RunSpecs(t, "Long Haul Config Suite") +} diff --git a/operator/src/test/longhaul/longhaul_test.go b/operator/src/test/longhaul/longhaul_test.go new file mode 100644 index 00000000..80553609 --- /dev/null +++ b/operator/src/test/longhaul/longhaul_test.go @@ -0,0 +1,40 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package longhaul + +import ( + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" + + "github.com/documentdb/documentdb-operator/test/longhaul/config" +) + +var testConfig config.Config + +var _ = BeforeSuite(func() { + if !config.IsEnabled() { + Skip("Long haul tests are disabled. 
Set LONGHAUL_ENABLED=true to run.") + } + + var err error + testConfig, err = config.LoadFromEnv() + Expect(err).NotTo(HaveOccurred(), "Failed to load long haul config from environment") + + err = testConfig.Validate() + Expect(err).NotTo(HaveOccurred(), "Invalid long haul config") + + GinkgoWriter.Printf("Long haul test config:\n") + GinkgoWriter.Printf(" MaxDuration: %s\n", testConfig.MaxDuration) + GinkgoWriter.Printf(" Namespace: %s\n", testConfig.Namespace) + GinkgoWriter.Printf(" ClusterName: %s\n", testConfig.ClusterName) +}) + +var _ = Describe("Long Haul Test", func() { + It("should run the long haul canary", func() { + // Phase 1b+ will implement the actual workload, operations, and monitoring. + // For now, verify the skeleton is wired up correctly. + GinkgoWriter.Println("Long haul test skeleton is running") + Expect(testConfig.ClusterName).NotTo(BeEmpty()) + }) +}) diff --git a/operator/src/test/longhaul/suite_test.go b/operator/src/test/longhaul/suite_test.go new file mode 100644 index 00000000..ca024859 --- /dev/null +++ b/operator/src/test/longhaul/suite_test.go @@ -0,0 +1,16 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package longhaul + +import ( + "testing" + + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" +) + +func TestLongHaul(t *testing.T) { + RegisterFailHandler(Fail) + RunSpecs(t, "Long Haul Suite") +}