545 changes: 545 additions & 0 deletions docs/designs/long-haul-test-design.md

Large diffs are not rendered by default.

104 changes: 104 additions & 0 deletions operator/src/test/longhaul/README.md
@@ -0,0 +1,104 @@
# Long Haul Tests

Long haul tests validate that DocumentDB Kubernetes Operator clusters remain healthy under
continuous load over extended periods. They run a canary workload that writes and reads data,
performs management operations, and checks for data integrity.

> **Status:** Phase 1a (skeleton). The canary workload and management operations will be added
> in subsequent phases. See [design document](../../../../docs/designs/long-haul-test-design.md)
> for the full plan.

## Project Structure

```
test/longhaul/
├── README.md          # This file
├── suite_test.go      # Ginkgo suite entry point (the canary)
├── longhaul_test.go   # BeforeSuite + long-running test specs
└── config/
    ├── config.go      # Config struct, env var loading, validation
    ├── suite_test.go  # Config unit test suite entry
    └── config_test.go # Config unit tests
```

- **`test/longhaul/`** — The actual long-running canary. Designed to run for hours/days.
- **`test/longhaul/config/`** — Config parsing and validation. Fast unit tests, safe for CI.

## Quick Start

### Prerequisites

- A running Kubernetes cluster with DocumentDB deployed
- `kubectl` configured to access the cluster
- Go 1.25+

### Run the Config Unit Tests

These are fast and require no cluster:

```bash
cd operator/src
go test ./test/longhaul/config/ -v
```

### Run the Long Haul Canary Locally

Against a local Kind cluster (see [development environment guide](../../../../docs/developer-guides/development-environment.md)):

```bash
cd operator/src

LONGHAUL_ENABLED=true \
LONGHAUL_CLUSTER_NAME=documentdb-sample \
LONGHAUL_NAMESPACE=default \
LONGHAUL_MAX_DURATION=10m \
go test ./test/longhaul/ -v -timeout 0
```

> **Note:** Use `-timeout 0` to disable Go's default 10-minute test timeout for long runs.

### Build a Standalone Binary

For containerized deployment (Phase 2+):

```bash
cd operator/src
go test -c -o longhaul.test ./test/longhaul/

# Run the compiled binary
LONGHAUL_ENABLED=true \
LONGHAUL_CLUSTER_NAME=documentdb-sample \
LONGHAUL_NAMESPACE=default \
./longhaul.test -test.v -test.timeout 0
```

## Configuration

All configuration is via environment variables. Tests are **gated** behind `LONGHAUL_ENABLED` —
they are safely skipped in regular CI runs (`go test ./...`).

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `LONGHAUL_ENABLED` | Yes | none | Must be `true`, `1`, or `yes` to run (case-insensitive, surrounding whitespace ignored). Otherwise all tests skip. |
| `LONGHAUL_CLUSTER_NAME` | Yes | none | Name of the target DocumentDB cluster CR. Validation fails if it is empty. |
| `LONGHAUL_NAMESPACE` | No | `default` | Kubernetes namespace of the target cluster. |
| `LONGHAUL_MAX_DURATION` | No | `30m` | Max test duration in Go duration syntax (for example `10m` or `1h`). Use `0s` for run-until-failure. Negative values are rejected. |

> Additional configuration (writer count, operation cooldown, etc.) will be added in later phases
> as the corresponding features are implemented.

## CI Safety

The long haul tests are gated behind `LONGHAUL_ENABLED`. Do not add this variable to any
PR-gated workflow. The gate works as follows:

1. `LONGHAUL_ENABLED` is not set in any CI workflow.
2. `BeforeSuite` calls `Skip()` when the tests are disabled.
3. CI output then shows `Suite skipped in BeforeSuite -- 0 Passed | 0 Failed | 1 Skipped`.
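
The gate described in the steps above can be sketched with the standard library alone. This is an assumption-laden illustration, not the suite's code: `gateDecision` and its skip message are hypothetical, and in the real suite the resulting decision would be handed to Ginkgo's `Skip()` inside `BeforeSuite`.

```go
package main

import (
	"fmt"
	"strings"
)

// gateDecision is a hypothetical helper. It applies the same rule the README
// documents for LONGHAUL_ENABLED (true, 1, or yes; case-insensitive; trimmed)
// and returns a human-readable reason when the suite should be skipped.
func gateDecision(raw string) (run bool, reason string) {
	v := strings.TrimSpace(strings.ToLower(raw))
	if v == "true" || v == "1" || v == "yes" {
		return true, ""
	}
	return false, fmt.Sprintf("LONGHAUL_ENABLED=%q does not enable the suite; skipping", raw)
}

func main() {
	for _, raw := range []string{"", "true", " YES ", "0"} {
		run, reason := gateDecision(raw)
		fmt.Printf("%q -> run=%v %s\n", raw, run, reason)
	}
}
```

Because the default outcome is "skip", a workflow has to opt in explicitly, which is why a plain `go test ./...` in CI never starts the canary.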

> **Note:** For persistent canary deployment, the Job manifest explicitly sets
> `LONGHAUL_MAX_DURATION=0s` to enable run-until-failure mode. The default 30m timeout
> is only a safety net for local development.

The config unit tests (`test/longhaul/config/`) run unconditionally and are included in normal
CI test runs — they are fast (~0.002s) and require no cluster.
87 changes: 87 additions & 0 deletions operator/src/test/longhaul/config/config.go
@@ -0,0 +1,87 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

package config

import (
	"fmt"
	"os"
	"strings"
	"time"
)

const (
	// Environment variable names for long haul test configuration.
	EnvEnabled     = "LONGHAUL_ENABLED"
	EnvMaxDuration = "LONGHAUL_MAX_DURATION"
	EnvNamespace   = "LONGHAUL_NAMESPACE"
	EnvClusterName = "LONGHAUL_CLUSTER_NAME"
)

// Config holds all configuration for a long haul test run.
type Config struct {
	// MaxDuration is the maximum test duration. Zero means run until failure.
	// Requires explicit LONGHAUL_MAX_DURATION=0s to enable infinite runs.
	// Default: 30m (safe for local development).
	MaxDuration time.Duration

	// Namespace is the Kubernetes namespace of the target DocumentDB cluster.
	Namespace string

	// ClusterName is the name of the target DocumentDB cluster CR.
	ClusterName string
}

// DefaultConfig returns a Config with safe defaults for local development.
func DefaultConfig() Config {
	return Config{
		MaxDuration: 30 * time.Minute,
		Namespace:   "default",
		ClusterName: "",
	}
}

// LoadFromEnv loads configuration from environment variables,
// falling back to defaults for any unset variable.
func LoadFromEnv() (Config, error) {
	cfg := DefaultConfig()

	if v := os.Getenv(EnvMaxDuration); v != "" {
		d, err := time.ParseDuration(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid %s=%q: %w", EnvMaxDuration, v, err)
		}
		cfg.MaxDuration = d
	}

	if v := os.Getenv(EnvNamespace); v != "" {
		cfg.Namespace = v
	}

	if v := os.Getenv(EnvClusterName); v != "" {
		cfg.ClusterName = v
	}

	return cfg, nil
}

// Validate checks that the configuration is valid.
func (c *Config) Validate() error {
	if c.MaxDuration < 0 {
		return fmt.Errorf("max duration must not be negative, got %s", c.MaxDuration)
	}
	if c.Namespace == "" {
		return fmt.Errorf("namespace must not be empty")
	}
	if c.ClusterName == "" {
		return fmt.Errorf("cluster name must not be empty")
	}
	return nil
}

// IsEnabled returns true if the long haul test is explicitly enabled
// via the LONGHAUL_ENABLED environment variable.
func IsEnabled() bool {
	v := strings.TrimSpace(strings.ToLower(os.Getenv(EnvEnabled)))
	return v == "true" || v == "1" || v == "yes"
}
157 changes: 157 additions & 0 deletions operator/src/test/longhaul/config/config_test.go
@@ -0,0 +1,157 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

package config

import (
	"time"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("Config", func() {
	Describe("DefaultConfig", func() {
		It("returns safe defaults", func() {
			cfg := DefaultConfig()
			Expect(cfg.MaxDuration).To(Equal(30 * time.Minute))
			Expect(cfg.Namespace).To(Equal("default"))
			Expect(cfg.ClusterName).To(BeEmpty())
		})
	})

	Describe("LoadFromEnv", func() {
		It("uses defaults when no env vars set", func() {
			// LoadFromEnv treats an empty value the same as an unset variable.
			GinkgoT().Setenv(EnvMaxDuration, "")
			GinkgoT().Setenv(EnvNamespace, "")
			GinkgoT().Setenv(EnvClusterName, "")
			cfg, err := LoadFromEnv()
			Expect(err).NotTo(HaveOccurred())
			Expect(cfg.MaxDuration).To(Equal(30 * time.Minute))
			Expect(cfg.Namespace).To(Equal("default"))
			Expect(cfg.ClusterName).To(BeEmpty())
		})

		It("parses MaxDuration from env", func() {
			GinkgoT().Setenv(EnvMaxDuration, "1h")
			cfg, err := LoadFromEnv()
			Expect(err).NotTo(HaveOccurred())
			Expect(cfg.MaxDuration).To(Equal(1 * time.Hour))
		})

		It("parses zero MaxDuration for infinite runs", func() {
			GinkgoT().Setenv(EnvMaxDuration, "0s")
			cfg, err := LoadFromEnv()
			Expect(err).NotTo(HaveOccurred())
			Expect(cfg.MaxDuration).To(Equal(time.Duration(0)))
		})

		It("parses Namespace and ClusterName from env", func() {
			GinkgoT().Setenv(EnvNamespace, "test-ns")
			GinkgoT().Setenv(EnvClusterName, "my-cluster")
			cfg, err := LoadFromEnv()
			Expect(err).NotTo(HaveOccurred())
			Expect(cfg.Namespace).To(Equal("test-ns"))
			Expect(cfg.ClusterName).To(Equal("my-cluster"))
		})

		It("returns error for invalid MaxDuration", func() {
			GinkgoT().Setenv(EnvMaxDuration, "not-a-duration")
			_, err := LoadFromEnv()
			Expect(err).To(HaveOccurred())
			Expect(err.Error()).To(ContainSubstring(EnvMaxDuration))
		})
	})

	Describe("Validate", func() {
		It("passes for valid config", func() {
			cfg := DefaultConfig()
			cfg.ClusterName = "test-cluster"
			Expect(cfg.Validate()).To(Succeed())
		})

		It("fails when Namespace is empty", func() {
			cfg := DefaultConfig()
			cfg.ClusterName = "test"
			cfg.Namespace = ""
			Expect(cfg.Validate()).To(MatchError(ContainSubstring("namespace")))
		})

		It("fails when ClusterName is empty", func() {
			cfg := DefaultConfig()
			Expect(cfg.Validate()).To(MatchError(ContainSubstring("cluster name")))
		})

		It("fails when MaxDuration is negative", func() {
			cfg := DefaultConfig()
			cfg.ClusterName = "test"
			cfg.MaxDuration = -1 * time.Second
			Expect(cfg.Validate()).To(MatchError(ContainSubstring("max duration must not be negative")))
		})
	})

	Describe("IsEnabled", func() {
		It("returns false when env not set", func() {
			GinkgoT().Setenv(EnvEnabled, "")
			Expect(IsEnabled()).To(BeFalse())
		})

		It("returns true for 'true'", func() {
			GinkgoT().Setenv(EnvEnabled, "true")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true for '1'", func() {
			GinkgoT().Setenv(EnvEnabled, "1")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true for 'yes'", func() {
			GinkgoT().Setenv(EnvEnabled, "yes")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true case-insensitively", func() {
			GinkgoT().Setenv(EnvEnabled, "TRUE")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true for mixed case 'True'", func() {
			GinkgoT().Setenv(EnvEnabled, "True")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true for uppercase 'YES'", func() {
			GinkgoT().Setenv(EnvEnabled, "YES")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true with surrounding whitespace", func() {
			GinkgoT().Setenv(EnvEnabled, " true ")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns true for ' yes ' with whitespace", func() {
			GinkgoT().Setenv(EnvEnabled, " yes ")
			Expect(IsEnabled()).To(BeTrue())
		})

		It("returns false for whitespace-only", func() {
			GinkgoT().Setenv(EnvEnabled, " ")
			Expect(IsEnabled()).To(BeFalse())
		})

		It("returns false for 'false'", func() {
			GinkgoT().Setenv(EnvEnabled, "false")
			Expect(IsEnabled()).To(BeFalse())
		})

		It("returns false for '0'", func() {
			GinkgoT().Setenv(EnvEnabled, "0")
			Expect(IsEnabled()).To(BeFalse())
		})

		It("returns false for 'no'", func() {
			GinkgoT().Setenv(EnvEnabled, "no")
			Expect(IsEnabled()).To(BeFalse())
		})
	})
})
16 changes: 16 additions & 0 deletions operator/src/test/longhaul/config/suite_test.go
@@ -0,0 +1,16 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

package config

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestConfig(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Long Haul Config Suite")
}