From 663aef1ea639a96161ef3377fc55696a8b57fcce Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:08:52 -0400 Subject: [PATCH 1/7] feat(contrib): scaffold telemetry-and-cost-optimized-eks variant Introduce documentdb-playground/contrib/telemetry-and-cost-optimized-eks/ as a self-contained fork of the base aws-setup/ scripts. This scaffold commit ships: - create-cluster.sh and delete-cluster.sh copied from aws-setup/ with the same simple options (NODE_TYPE/EKS_VERSION/USE_SPOT/CLUSTER_TAGS) that just landed in aws-setup/, so the contrib variant is usable standalone. - Default CLUSTER_NAME changed to documentdb-contrib-cluster to avoid collisions when running alongside the base setup. - Placeholder README that will be replaced with the full docs in a later commit on this branch. Subsequent commits will add 2-AZ deployment, CloudFormation diagnostics, S3 Gateway VPC endpoint, EKS control-plane logging, CloudWatch Container Insights, and CloudWatch-aware DocumentDB diagnostics. Signed-off-by: michaelraney Made-with: Cursor --- .../README.md | 24 + .../scripts/create-cluster.sh | 651 +++++++++++++ .../scripts/delete-cluster.sh | 877 ++++++++++++++++++ 3 files changed, 1552 insertions(+) create mode 100644 documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md create mode 100755 documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh create mode 100755 documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md new file mode 100644 index 00000000..433476ca --- /dev/null +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md @@ -0,0 +1,24 @@ +# Telemetry and Cost-Optimized EKS (contrib) + +> **Status:** community-contributed variant. Not actively maintained by the core DocumentDB team. 
+ +This folder contains a self-contained variant of the AWS EKS playground that layers additional cost-optimization and observability features on top of the base scripts in [`documentdb-playground/aws-setup/`](../../aws-setup/). + +At this scaffold commit it is functionally equivalent to the base `aws-setup/` scripts plus the four simple options (`--node-type`, `--eks-version`, `--spot`, `--tags`). Subsequent commits add: + +- 2-AZ deployment and CloudFormation event diagnostics +- S3 Gateway VPC endpoint (free — reduces NAT Gateway data-transfer costs) +- EKS control-plane logging + CloudWatch log-group retention +- Amazon CloudWatch Observability add-on (Container Insights) +- CloudWatch-aware DocumentDB diagnostics + +The default cluster name is `documentdb-contrib-cluster` so you can run this alongside the base setup without collisions. + +## Quick start + +```bash +./scripts/create-cluster.sh --deploy-instance +./scripts/delete-cluster.sh -y +``` + +See `./scripts/create-cluster.sh --help` for the full list of options. A richer README covering the full cost model and troubleshooting story lands in a later commit on this branch. 
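The `--tags` string is handed straight to eksctl and on to AWS, so a malformed `key=value` pair only fails partway through provisioning. A minimal pre-flight sketch you could run before invoking the script — the `validate_tags` helper below is illustrative and not part of these scripts:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: verify that a --tags argument is a
# comma-separated list of key=value pairs before handing it to
# create-cluster.sh (and from there to eksctl / AWS cost allocation).
validate_tags() {
    local pair
    IFS=',' read -r -a pairs <<< "$1"
    for pair in "${pairs[@]}"; do
        # Each pair needs a non-empty key followed by '='; the value may be anything.
        if [[ "$pair" != ?*=* ]]; then
            echo "malformed tag pair: '$pair'" >&2
            return 1
        fi
    done
}

validate_tags "project=documentdb-playground,environment=dev" && echo "tags ok"
validate_tags "project" 2>/dev/null || echo "rejected malformed tags"
```

Guarding the invocation this way (`validate_tags "$TAGS" && ./scripts/create-cluster.sh --tags "$TAGS"`) fails fast instead of mid-way through cluster creation.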
diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh new file mode 100755 index 00000000..d5d5f186 --- /dev/null +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -0,0 +1,651 @@ +#!/bin/bash + +# DocumentDB EKS Cluster Creation Script +# This script creates a complete EKS cluster with all dependencies for DocumentDB + +set -e # Exit on any error + +# Configuration +CLUSTER_NAME="${CLUSTER_NAME:-documentdb-contrib-cluster}" +REGION="us-west-2" +K8S_VERSION="${K8S_VERSION:-1.35}" +NODE_TYPE="${NODE_TYPE:-m7g.large}" +NODES=3 +NODES_MIN=1 +NODES_MAX=4 + +# Cost-optimization configuration +# USE_SPOT: when "true", eksctl provisions Spot-backed managed nodes (dev/test only). +# CLUSTER_TAGS: comma-separated key=value pairs passed to AWS for cost allocation in Cost Explorer. +USE_SPOT="${USE_SPOT:-false}" +CLUSTER_TAGS="${CLUSTER_TAGS:-project=documentdb-playground,environment=dev,managed-by=eksctl}" + +# DocumentDB Operator Configuration +# For production: use documentdb/documentdb-operator (official) +OPERATOR_GITHUB_ORG="documentdb" +OPERATOR_CHART_VERSION="0.1.0" + +# Feature flags - set to "true" to enable, "false" to skip +INSTALL_OPERATOR="${INSTALL_OPERATOR:-false}" +DEPLOY_INSTANCE="${DEPLOY_INSTANCE:-false}" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --skip-operator) + INSTALL_OPERATOR="false" + shift + ;; + --skip-instance) + DEPLOY_INSTANCE="false" + shift + ;; + --install-operator) + INSTALL_OPERATOR="true" + shift + ;; + --deploy-instance) + DEPLOY_INSTANCE="true" + INSTALL_OPERATOR="true" # Auto-enable operator when instance is requested + shift + ;; + --cluster-name) + CLUSTER_NAME="$2" + shift 2 + ;; + --region) + REGION="$2" + shift 2 + ;; + --github-username) + GITHUB_USERNAME="$2" + shift 2 + ;; + --github-token) + GITHUB_TOKEN="$2" + 
shift 2 + ;; + --node-type) + NODE_TYPE="$2" + shift 2 + ;; + --eks-version|--k8s-version) + K8S_VERSION="$2" + shift 2 + ;; + --spot) + USE_SPOT="true" + shift + ;; + --tags) + CLUSTER_TAGS="$2" + shift 2 + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Options:" + echo " --skip-operator Skip DocumentDB operator installation (default)" + echo " --skip-instance Skip DocumentDB instance deployment (default)" + echo " --install-operator Install DocumentDB operator" + echo " --deploy-instance Deploy DocumentDB instance" + echo " --cluster-name NAME EKS cluster name (default: documentdb-contrib-cluster)" + echo " --region REGION AWS region (default: us-west-2)" + echo " --github-username GitHub username for operator installation" + echo " --github-token GitHub token for operator installation" + echo "" + echo "Cost-optimization options:" + echo " --node-type TYPE EC2 instance type (default: m7g.large, Graviton/ARM)" + echo " --eks-version VER Kubernetes/EKS version (default: 1.35)" + echo " --spot Use Spot-backed managed nodes (DEV/TEST ONLY - can be terminated)" + echo " --tags TAGS Cost allocation tags as key=value pairs (comma-separated)" + echo " (default: project=documentdb-playground,environment=dev,managed-by=eksctl)" + echo "" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 # Create basic cluster only (no operator, no instance)" + echo " $0 --install-operator # Create cluster with operator, no instance" + echo " $0 --deploy-instance # Create cluster with instance (auto-enables operator)" + echo " $0 --github-username user --github-token ghp_xxx --install-operator # With GitHub auth" + echo " $0 --node-type m5.large # Use x86 instance type instead of Graviton" + echo " $0 --spot --tags \"project=myproj,team=platform\" # Spot dev cluster with custom tags" + exit 0 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' 
+BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging function +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" + exit 1 +} + +# Check prerequisites +check_prerequisites() { + log "Checking prerequisites..." + + # Check AWS CLI + if ! command -v aws &> /dev/null; then + error "AWS CLI not found. Please install AWS CLI first." + fi + + # Check eksctl + if ! command -v eksctl &> /dev/null; then + error "eksctl not found. Please install eksctl first." + fi + + # Check kubectl + if ! command -v kubectl &> /dev/null; then + error "kubectl not found. Please install kubectl first." + fi + + # Check Helm + if ! command -v helm &> /dev/null; then + error "Helm not found. Please install Helm first." + fi + + # Check AWS credentials + if ! aws sts get-caller-identity &> /dev/null; then + error "AWS credentials not configured. Please run 'aws configure' first." + fi + + success "All prerequisites met" +} + +# Create EKS cluster +create_cluster() { + log "Creating EKS cluster: $CLUSTER_NAME in region: $REGION" + + # Check if cluster already exists + if eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster $CLUSTER_NAME already exists. Skipping cluster creation." + return 0 + fi + + if [ "$USE_SPOT" == "true" ]; then + warn "============================================================" + warn "SPOT INSTANCES ENABLED - FOR DEV/TEST USE ONLY" + warn "AWS can terminate Spot instances at any time with 2 minutes" + warn "notice. This WILL interrupt your database and require recovery." + warn "Do NOT use Spot for production or long-running workloads." 
+ warn "============================================================"
+ fi
+
+ local EKSCTL_ARGS=(
+ --name "$CLUSTER_NAME"
+ --region "$REGION"
+ --version "$K8S_VERSION"
+ --nodes "$NODES"
+ --nodes-min "$NODES_MIN"
+ --nodes-max "$NODES_MAX"
+ --managed
+ --with-oidc
+ --tags "$CLUSTER_TAGS"
+ )
+
+ if [ "$USE_SPOT" == "true" ]; then
+ # Multiple instance types improve Spot availability; all Graviton to match the default.
+ EKSCTL_ARGS+=(--spot --instance-types "m7g.large,m6g.large,r7g.large,r6g.large,c7g.large,c6g.large")
+ else
+ EKSCTL_ARGS+=(--node-type "$NODE_TYPE")
+ fi
+
+ # Run the command in the if condition: with `set -e`, a bare command followed
+ # by an `if [ $? -eq 0 ]` check would never reach the failure branch.
+ if eksctl create cluster "${EKSCTL_ARGS[@]}"; then
+ success "EKS cluster created successfully"
+ else
+ error "Failed to create EKS cluster"
+ fi
+}
+
+# Install EBS CSI Driver
+install_ebs_csi() {
+ log "Installing EBS CSI Driver..."
+
+ # Create EBS CSI service account with IAM role
+ eksctl create iamserviceaccount \
+ --cluster $CLUSTER_NAME \
+ --namespace kube-system \
+ --name ebs-csi-controller-sa \
+ --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
+ --override-existing-serviceaccounts \
+ --approve \
+ --region $REGION
+
+ # Install EBS CSI driver addon
+ eksctl create addon \
+ --name aws-ebs-csi-driver \
+ --cluster $CLUSTER_NAME \
+ --region $REGION \
+ --force
+
+ # Wait for EBS CSI driver to be ready
+ log "Waiting for EBS CSI driver to be ready..."
+ sleep 30
+ kubectl wait --for=condition=ready pod -l app=ebs-csi-controller -n kube-system --timeout=300s || warn "EBS CSI driver pods may still be starting"
+
+ success "EBS CSI Driver installed"
+}
+
+# Install AWS Load Balancer Controller
+install_load_balancer_controller() {
+ log "Installing AWS Load Balancer Controller..."
+
+ # Check if already installed
+ if helm list -n kube-system | grep -q aws-load-balancer-controller; then
+ warn "AWS Load Balancer Controller already installed. Skipping installation." 
+ return 0 + fi + + # Get VPC ID for the cluster + VPC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text) + log "Using VPC ID: $VPC_ID" + + # Verify subnet tags for Load Balancer Controller + log "Verifying subnet tags for Load Balancer Controller..." + PUBLIC_SUBNETS=$(aws ec2 describe-subnets \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=map-public-ip-on-launch,Values=true" \ + --query 'Subnets[].SubnetId' --output text --region $REGION) + + PRIVATE_SUBNETS=$(aws ec2 describe-subnets \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=map-public-ip-on-launch,Values=false" \ + --query 'Subnets[].SubnetId' --output text --region $REGION) + + # Tag public subnets for internet-facing load balancers + if [ -n "$PUBLIC_SUBNETS" ]; then + log "Tagging public subnets for internet-facing load balancers..." + for subnet in $PUBLIC_SUBNETS; do + aws ec2 create-tags --resources "$subnet" --tags Key=kubernetes.io/role/elb,Value=1 --region $REGION 2>/dev/null || true + log "Tagged public subnet: $subnet" + done + fi + + # Tag private subnets for internal load balancers + if [ -n "$PRIVATE_SUBNETS" ]; then + log "Tagging private subnets for internal load balancers..." + for subnet in $PRIVATE_SUBNETS; do + aws ec2 create-tags --resources "$subnet" --tags Key=kubernetes.io/role/internal-elb,Value=1 --region $REGION 2>/dev/null || true + log "Tagged private subnet: $subnet" + done + fi + + # Download the official IAM policy (latest version) + log "Downloading AWS Load Balancer Controller IAM policy (latest version)..." 
+ curl -o /tmp/iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json + + # Get account ID + ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) + + # Check if policy exists and create/update as needed + if aws iam get-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy &>/dev/null; then + log "IAM policy already exists, updating to latest version..." + # Delete and recreate to ensure we have the latest version + aws iam delete-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy 2>/dev/null || true + sleep 5 # Wait for deletion to propagate + fi + + # Create IAM policy with latest permissions + log "Creating IAM policy with latest permissions..." + aws iam create-policy \ + --policy-name AWSLoadBalancerControllerIAMPolicy \ + --policy-document file:///tmp/iam_policy.json 2>/dev/null || \ + log "IAM policy already exists or was just created" + + # Wait a moment for policy to be available + sleep 5 + + # Create IAM service account with proper permissions using eksctl + log "Creating IAM service account with proper permissions..." 
+ eksctl create iamserviceaccount \ + --cluster=$CLUSTER_NAME \ + --namespace=kube-system \ + --name=aws-load-balancer-controller \ + --role-name "AmazonEKSLoadBalancerControllerRole-$CLUSTER_NAME" \ + --attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \ + --approve \ + --override-existing-serviceaccounts \ + --region=$REGION + + # Add EKS Helm repository + helm repo add eks https://aws.github.io/eks-charts + helm repo update eks + + # Install Load Balancer Controller using the existing service account + helm install aws-load-balancer-controller eks/aws-load-balancer-controller \ + -n kube-system \ + --set clusterName=$CLUSTER_NAME \ + --set serviceAccount.create=false \ + --set serviceAccount.name=aws-load-balancer-controller \ + --set region=$REGION \ + --set vpcId=$VPC_ID + + # Wait for Load Balancer Controller to be ready + log "Waiting for Load Balancer Controller to be ready..." + sleep 30 + kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=aws-load-balancer-controller -n kube-system --timeout=300s || warn "Load Balancer Controller pods may still be starting" + + # Clean up temp file + rm -f /tmp/iam_policy.json + + success "AWS Load Balancer Controller installed" +} + +# Install cert-manager +install_cert_manager() { + log "Installing cert-manager..." + + # Check if already installed + if helm list -n cert-manager | grep -q cert-manager; then + warn "cert-manager already installed. Skipping installation." + return 0 + fi + + # Add Jetstack Helm repository + helm repo add jetstack https://charts.jetstack.io + helm repo update + + # Install cert-manager + helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --version v1.13.2 \ + --set installCRDs=true \ + --set prometheus.enabled=false \ + --set webhook.timeoutSeconds=30 + + # Wait for cert-manager to be ready + log "Waiting for cert-manager to be ready..." 
+ sleep 30 + kubectl wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=300s || warn "cert-manager pods may still be starting" + + success "cert-manager installed" +} + +# Create optimized storage class +create_storage_class() { + log "Creating DocumentDB storage class..." + + # Check if storage class already exists + if kubectl get storageclass documentdb-storage &> /dev/null; then + warn "DocumentDB storage class already exists. Skipping creation." + return 0 + fi + + kubectl apply -f - </dev/null || true + helm repo update documentdb + + log "Installing DocumentDB operator from public Helm repository..." + if helm install documentdb-operator documentdb/documentdb-operator \ + --namespace documentdb-operator \ + --create-namespace \ + --wait \ + --timeout 10m 2>/dev/null; then + success "DocumentDB operator installed successfully from public Helm repository" + else + # Fallback to OCI registry with GitHub authentication + warn "Public Helm repository installation failed. Falling back to OCI registry..." + + # Check for GitHub authentication + if [ -z "$GITHUB_TOKEN" ] || [ -z "$GITHUB_USERNAME" ]; then + error "DocumentDB operator installation requires GitHub authentication as fallback. + +Please set the following environment variables: + export GITHUB_USERNAME='your-github-username' + export GITHUB_TOKEN='your-github-token' + +To create a GitHub token: +1. Go to https://github.com/settings/tokens +2. Generate a new token with 'read:packages' scope +3. Export the token as shown above + +Then run the script again with --install-operator" + fi + + # Authenticate with GitHub Container Registry + log "Authenticating with GitHub Container Registry..." + if ! echo "$GITHUB_TOKEN" | helm registry login ghcr.io --username "$GITHUB_USERNAME" --password-stdin; then + error "Failed to authenticate with GitHub Container Registry. Please verify your GITHUB_TOKEN and GITHUB_USERNAME." 
+ fi + + # Install DocumentDB operator from OCI registry + log "Pulling and installing DocumentDB operator from ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator..." + helm install documentdb-operator \ + oci://ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator \ + --version ${OPERATOR_CHART_VERSION} \ + --namespace documentdb-operator \ + --create-namespace \ + --wait \ + --timeout 10m + + if [ $? -eq 0 ]; then + success "DocumentDB operator installed successfully from OCI registry: ${OPERATOR_GITHUB_ORG}/documentdb-operator:${OPERATOR_CHART_VERSION}" + else + error "Failed to install DocumentDB operator. Please verify: +- Your GitHub token has 'read:packages' scope +- You have access to ${OPERATOR_GITHUB_ORG}/documentdb-operator repository +- The chart version ${OPERATOR_CHART_VERSION} exists" + fi + fi + + # Wait for operator to be ready + log "Waiting for DocumentDB operator to be ready..." + kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=documentdb-operator -n documentdb-operator --timeout=300s || warn "DocumentDB operator pods may still be starting" + + success "DocumentDB operator installed" +} + +# Deploy DocumentDB instance (optional) +deploy_documentdb_instance() { + if [ "$DEPLOY_INSTANCE" != "true" ]; then + warn "Skipping DocumentDB instance deployment (--skip-instance specified or not enabled)" + return 0 + fi + + log "Deploying DocumentDB instance..." + + # Check if operator is installed + if ! kubectl get deployment -n documentdb-operator documentdb-operator &> /dev/null; then + error "DocumentDB operator not found. Cannot deploy instance without operator." + fi + + # Create DocumentDB namespace + kubectl apply -f - <:10260/" + fi + echo "" + echo "⚠️ IMPORTANT: Run './delete-cluster.sh' when done to avoid AWS charges!" + echo "==================================================" +} + +# Main execution +main() { + log "Starting DocumentDB EKS cluster setup..." 
+ log "Configuration:" + log " Cluster: $CLUSTER_NAME" + log " Region: $REGION" + log " Kubernetes: $K8S_VERSION" + log " Node Type: $NODE_TYPE" + log " Spot Instances: $USE_SPOT" + log " Tags: $CLUSTER_TAGS" + log " Install Operator: $INSTALL_OPERATOR" + log " Deploy Instance: $DEPLOY_INSTANCE" + echo "" + + # Execute setup steps + check_prerequisites + create_cluster + install_ebs_csi + install_load_balancer_controller + install_cert_manager + create_storage_class + + # Optional components + install_documentdb_operator + deploy_documentdb_instance + + # Show summary + print_summary +} + +# Run main function +main "$@" diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh new file mode 100755 index 00000000..55ea412d --- /dev/null +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh @@ -0,0 +1,877 @@ +#!/bin/bash + +# DocumentDB EKS Cluster Deletion Script +# This script completely removes the EKS cluster and all AWS resources to avoid charges + +set -e # Exit on any error + +# Configuration +CLUSTER_NAME="${CLUSTER_NAME:-documentdb-contrib-cluster}" +REGION="us-west-2" + +# Feature flags - set to "true" to enable, "false" to skip +DELETE_CLUSTER="${DELETE_CLUSTER:-true}" +DELETE_OPERATOR="${DELETE_OPERATOR:-true}" +DELETE_INSTANCE="${DELETE_INSTANCE:-true}" +SKIP_CONFIRMATION="false" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --instance-only) + DELETE_INSTANCE="true" + DELETE_OPERATOR="false" + DELETE_CLUSTER="false" + shift + ;; + --instance-and-operator) + DELETE_INSTANCE="true" + DELETE_OPERATOR="true" + DELETE_CLUSTER="false" + shift + ;; + --cluster-name) + CLUSTER_NAME="$2" + shift 2 + ;; + --region) + REGION="$2" + shift 2 + ;; + -y|--yes) + SKIP_CONFIRMATION="true" + shift + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Options:" 
+ echo " --instance-only Delete only DocumentDB instances (keep operator and cluster)" + echo " --instance-and-operator Delete instances and operator (keep cluster)" + echo " --cluster-name NAME EKS cluster name (default: documentdb-contrib-cluster)" + echo " --region REGION AWS region (default: us-west-2)" + echo " -y, --yes Skip confirmation prompt" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 # Delete everything (default)" + echo " $0 --instance-only # Delete only DocumentDB instances" + echo " $0 --instance-and-operator # Delete instances and operator, keep cluster" + echo " $0 --yes # Delete everything without confirmation" + exit 0 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging function +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" +} + +# Confirmation prompt +confirm_deletion() { + if [ "$SKIP_CONFIRMATION" == "true" ]; then + log "Skipping confirmation (--yes flag provided)" + return 0 + fi + + echo "" + echo "=======================================" + echo " DELETION WARNING" + echo "=======================================" + echo "" + warn "This will DELETE the following resources:" + + if [ "$DELETE_INSTANCE" == "true" ]; then + echo " • All DocumentDB instances" + fi + + if [ "$DELETE_OPERATOR" == "true" ]; then + echo " • DocumentDB operator deployments" + echo " • Related namespaces and CRDs" + fi + + if [ "$DELETE_CLUSTER" == "true" ]; then + echo " • EKS Cluster: $CLUSTER_NAME" + echo " • All persistent volumes" + echo " • Load balancers and networking" + echo " • IAM roles and policies" + fi + + echo "" + 
warn "This action is IRREVERSIBLE!"
+ echo ""
+
+ read -p "Are you sure you want to proceed? (type 'yes' to confirm): " confirmation
+
+ if [ "$confirmation" != "yes" ]; then
+ log "Deletion cancelled by user"
+ exit 0
+ fi
+
+ log "Proceeding with deletion..."
+}
+
+# Delete DocumentDB instances
+delete_documentdb_instances() {
+ if [ "$DELETE_INSTANCE" != "true" ]; then
+ warn "Skipping DocumentDB instances deletion (DELETE_INSTANCE is not 'true')"
+ return 0
+ fi
+
+ log "Deleting DocumentDB instances..."
+
+ # Delete all DocumentDB instances (this will trigger LoadBalancer deletion)
+ kubectl delete documentdb --all --all-namespaces --timeout=300s || warn "No DocumentDB instances found or deletion failed"
+
+ # Wait for LoadBalancer services to be deleted (created by DocumentDB instances)
+ log "Waiting for DocumentDB LoadBalancer services to be deleted..."
+ for i in {1..12}; do # Wait up to 6 minutes
+ LB_SERVICES=$(kubectl get services --all-namespaces -o json 2>/dev/null | jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null || echo "")
+ if [ -z "$LB_SERVICES" ]; then
+ success "All LoadBalancer services deleted"
+ break
+ fi
+ log "Still waiting for LoadBalancer services to be deleted... (attempt $i/12)"
+ echo "$LB_SERVICES" | while read svc; do
+ if [ -n "$svc" ]; then
+ log " Remaining service: $svc"
+ fi
+ done
+ sleep 30
+ done
+
+ # Wait for AWS LoadBalancers to be cleaned up
+ log "Waiting for AWS LoadBalancers to be fully removed..." 
+ for i in {1..12}; do # Wait up to 6 minutes for AWS cleanup + # Check for both ELBv2 (ALB/NLB) and Classic ELB + AWS_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-')].LoadBalancerName" --output text 2>/dev/null || echo "") + AWS_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?contains(LoadBalancerName, 'k8s-')].LoadBalancerName" --output text 2>/dev/null || echo "") + + if ([ -z "$AWS_LBS_V2" ] || [ "$AWS_LBS_V2" = "None" ]) && ([ -z "$AWS_LBS_CLASSIC" ] || [ "$AWS_LBS_CLASSIC" = "None" ]); then + success "All AWS LoadBalancers cleaned up" + break + fi + log "Still waiting for AWS LoadBalancers to be removed... (attempt $i/12)" + if [ -n "$AWS_LBS_V2" ] && [ "$AWS_LBS_V2" != "None" ]; then + log " Remaining ELBv2: $AWS_LBS_V2" + fi + if [ -n "$AWS_LBS_CLASSIC" ] && [ "$AWS_LBS_CLASSIC" != "None" ]; then + log " Remaining Classic ELB: $AWS_LBS_CLASSIC" + fi + sleep 30 + done + + # Wait for PostgreSQL clusters to be deleted + log "Waiting for PostgreSQL clusters to be deleted..." + sleep 30 + + success "DocumentDB instances and related LoadBalancers deleted" +} + +# Delete Helm releases +delete_helm_releases() { + if [ "$DELETE_OPERATOR" != "true" ]; then + warn "Skipping DocumentDB operator deletion" + return 0 + fi + + log "Deleting DocumentDB operator and related resources..." + + # First, delete all LoadBalancer services to avoid dependency issues + log "Deleting LoadBalancer services..." 
+ kubectl get services --all-namespaces -o json 2>/dev/null | \ + jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace) \(.metadata.name)"' 2>/dev/null | \ + while read namespace service; do + if [ -n "$namespace" ] && [ -n "$service" ]; then + log "Deleting LoadBalancer service: $service in namespace $namespace" + kubectl delete service "$service" -n "$namespace" --timeout=300s || warn "Failed to delete service $service" + fi + done 2>/dev/null || warn "No LoadBalancer services found or jq not available" + + # Wait for LoadBalancers to be deleted from AWS + log "Waiting for AWS LoadBalancers to be cleaned up..." + sleep 30 + + log "Deleting DocumentDB operator Helm releases..." + + # Delete DocumentDB operator + helm uninstall documentdb-operator --namespace documentdb-operator 2>/dev/null || warn "DocumentDB operator not found in documentdb-operator namespace" + + # Only delete these if we're deleting the whole cluster + if [ "$DELETE_CLUSTER" == "true" ]; then + # Delete AWS Load Balancer Controller (after LoadBalancer services are gone) + helm uninstall aws-load-balancer-controller -n kube-system 2>/dev/null || warn "AWS Load Balancer Controller not found" + + # Delete cert-manager + helm uninstall cert-manager -n cert-manager 2>/dev/null || warn "cert-manager not found" + fi + + # Give more time for resources to be cleaned up + log "Waiting for Helm releases and AWS resources to be fully removed..." + sleep 30 + + success "DocumentDB operator and related resources deleted" +} + +# Delete namespaces +delete_namespaces() { + if [ "$DELETE_OPERATOR" != "true" ]; then + warn "Skipping DocumentDB namespaces deletion" + return 0 + fi + + log "Deleting DocumentDB namespaces..." 
+ + # Delete DocumentDB namespace + kubectl delete namespace documentdb-operator --timeout=300s || warn "documentdb-operator namespace not found" + + # Delete instance namespace if it exists + kubectl delete namespace documentdb-instance-ns --timeout=300s || warn "documentdb-instance-ns namespace not found" + + # Only delete these if we're deleting the whole cluster + if [ "$DELETE_CLUSTER" == "true" ]; then + kubectl delete namespace cert-manager --timeout=300s || warn "cert-manager namespace not found" + fi + + success "DocumentDB namespaces deleted" +} + +# Delete CRDs +delete_crds() { + if [ "$DELETE_OPERATOR" != "true" ]; then + warn "Skipping DocumentDB CRDs deletion" + return 0 + fi + + log "Deleting DocumentDB Custom Resource Definitions..." + + # Delete specific CRDs + kubectl delete crd backups.postgresql.cnpg.io \ + clusterimagecatalogs.postgresql.cnpg.io \ + clusters.postgresql.cnpg.io \ + databases.postgresql.cnpg.io \ + imagecatalogs.postgresql.cnpg.io \ + poolers.postgresql.cnpg.io \ + publications.postgresql.cnpg.io \ + scheduledbackups.postgresql.cnpg.io \ + subscriptions.postgresql.cnpg.io \ + dbs.documentdb.io 2>/dev/null || warn "Some CRDs not found or already deleted" + + # Only delete these if we're deleting the whole cluster + if [ "$DELETE_CLUSTER" == "true" ]; then + # Delete cert-manager CRDs + kubectl delete crd -l app.kubernetes.io/name=cert-manager 2>/dev/null || warn "cert-manager CRDs not found" + fi + + success "DocumentDB CRDs deleted" +} + +# Delete AWS resources +delete_aws_resources() { + log "Deleting AWS resources..." + + # Get account ID + ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text 2>/dev/null) || { + warn "Could not get AWS account ID. Skipping IAM policy deletion." 
+ return 0 + } + + # Delete IAM policies (only if they exist) + aws iam delete-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy 2>/dev/null || warn "IAM policy AWSLoadBalancerControllerIAMPolicy not found" + + # Delete any remaining load balancers + log "Checking for remaining load balancers..." + local remaining_lbs=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s')].LoadBalancerArn" --output text 2>/dev/null || echo "") + if [ -n "$remaining_lbs" ]; then + warn "Found remaining load balancers. They may take a few minutes to delete automatically." + fi + + # Delete any remaining volumes + log "Checking for remaining EBS volumes..." + local remaining_volumes=$(aws ec2 describe-volumes --region $REGION --filters "Name=tag:kubernetes.io/cluster/$CLUSTER_NAME,Values=owned" --query "Volumes[?State=='available'].VolumeId" --output text 2>/dev/null || echo "") + if [ -n "$remaining_volumes" ]; then + warn "Found remaining EBS volumes. Attempting to delete them..." + for volume in $remaining_volumes; do + aws ec2 delete-volume --volume-id $volume --region $REGION 2>/dev/null || warn "Could not delete volume $volume" + done + fi + + success "AWS resources cleanup attempted" +} + +# Clean up any remaining infrastructure LoadBalancers (not DocumentDB app LoadBalancers) +cleanup_infrastructure_loadbalancers() { + if [ "$DELETE_CLUSTER" != "true" ]; then + return 0 + fi + + log "Checking for remaining infrastructure LoadBalancers..." 
+ + # Only look for LoadBalancers that might be created by cluster infrastructure + # DocumentDB LoadBalancers should already be deleted by delete_documentdb_instances + LB_ARNS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-elb') || contains(LoadBalancerName, 'k8s-nlb') || contains(LoadBalancerName, '$CLUSTER_NAME')].LoadBalancerArn" --output text 2>/dev/null || echo "") + + if [ -n "$LB_ARNS" ] && [ "$LB_ARNS" != "None" ]; then + log "Found infrastructure LoadBalancers to clean up:" + echo "$LB_ARNS" | tr '\t' '\n' | while read lb_arn; do + if [ -n "$lb_arn" ]; then + LB_NAME=$(aws elbv2 describe-load-balancers --load-balancer-arns "$lb_arn" --region $REGION --query 'LoadBalancers[0].LoadBalancerName' --output text 2>/dev/null || echo "unknown") + log " Deleting infrastructure LoadBalancer: $LB_NAME" + aws elbv2 delete-load-balancer --load-balancer-arn "$lb_arn" --region $REGION 2>/dev/null || warn "Failed to delete LoadBalancer $LB_NAME" + fi + done + + # Wait for infrastructure LoadBalancer deletion to complete + log "Waiting for infrastructure LoadBalancer deletion to complete..." + for i in {1..6}; do # Wait up to 3 minutes + REMAINING_LBS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-elb') || contains(LoadBalancerName, 'k8s-nlb') || contains(LoadBalancerName, '$CLUSTER_NAME')].LoadBalancerArn" --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_LBS" ] || [ "$REMAINING_LBS" = "None" ]; then + success "All infrastructure LoadBalancers deleted" + break + fi + log "Still waiting for infrastructure LoadBalancers to be deleted... (attempt $i/6)" + sleep 30 + done + else + log "No infrastructure LoadBalancers found to clean up." 
+ fi +} + +# Clean up VPC dependencies that can block CloudFormation deletion with proper waiting +cleanup_vpc_dependencies() { + if [ "$DELETE_CLUSTER" != "true" ]; then + return 0 + fi + + log "Cleaning up VPC dependencies..." + + # Get the VPC ID for our cluster + VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=tag:Name,Values=eksctl-$CLUSTER_NAME-cluster/VPC" --query 'Vpcs[0].VpcId' --output text 2>/dev/null || echo "") + + if [ -z "$VPC_ID" ] || [ "$VPC_ID" = "None" ] || [ "$VPC_ID" = "null" ]; then + log "No VPC found for cluster $CLUSTER_NAME, checking for any remaining k8s security groups..." + # Fallback: look for any k8s-related security groups + SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=group-name,Values=k8s-*" --query 'SecurityGroups[].GroupId' --output text 2>/dev/null || echo "") + else + log "Found VPC: $VPC_ID" + + # COMPREHENSIVE LOADBALANCER CLEANUP - Check for any remaining LoadBalancers in this VPC + log "Performing comprehensive LoadBalancer cleanup in VPC $VPC_ID..." + + # Check for ELBv2 LoadBalancers (ALB/NLB) in this VPC + VPC_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].{Arn:LoadBalancerArn,Name:LoadBalancerName}" --output text 2>/dev/null || echo "") + + if [ -n "$VPC_LBS_V2" ] && [ "$VPC_LBS_V2" != "None" ]; then + log "Found ELBv2 LoadBalancers in VPC, deleting them..." + echo "$VPC_LBS_V2" | while read lb_arn lb_name; do + if [ -n "$lb_arn" ] && [ "$lb_arn" != "None" ]; then + log " Deleting LoadBalancer: $lb_name ($lb_arn)" + aws elbv2 delete-load-balancer --region $REGION --load-balancer-arn "$lb_arn" || warn "Failed to delete LoadBalancer $lb_name" + fi + done + + # Wait for LoadBalancers to be deleted + log "Waiting for ELBv2 LoadBalancers to be deleted..." 
+ for i in {1..12}; do # Wait up to 6 minutes + REMAINING_LBS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_LBS" ] || [ "$REMAINING_LBS" = "None" ]; then + success "All ELBv2 LoadBalancers deleted from VPC" + break + fi + log "Still waiting for ELBv2 LoadBalancers to be deleted... (attempt $i/12)" + sleep 30 + done + else + log "No ELBv2 LoadBalancers found in VPC" + fi + + # Check for Classic LoadBalancers in this VPC + VPC_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "") + + if [ -n "$VPC_LBS_CLASSIC" ] && [ "$VPC_LBS_CLASSIC" != "None" ]; then + log "Found Classic LoadBalancers in VPC, deleting them..." + echo "$VPC_LBS_CLASSIC" | tr '\t' '\n' | while read lb_name; do + if [ -n "$lb_name" ] && [ "$lb_name" != "None" ]; then + log " Deleting Classic LoadBalancer: $lb_name" + aws elb delete-load-balancer --region $REGION --load-balancer-name "$lb_name" || warn "Failed to delete Classic LoadBalancer $lb_name" + fi + done + + # Wait for Classic LoadBalancers to be deleted + log "Waiting for Classic LoadBalancers to be deleted..." + for i in {1..12}; do # Wait up to 6 minutes + REMAINING_CLASSIC_LBS=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_CLASSIC_LBS" ] || [ "$REMAINING_CLASSIC_LBS" = "None" ]; then + success "All Classic LoadBalancers deleted from VPC" + break + fi + log "Still waiting for Classic LoadBalancers to be deleted... (attempt $i/12)" + sleep 30 + done + else + log "No Classic LoadBalancers found in VPC" + fi + + # Check for network interfaces that might still be attached to LoadBalancers + log "Checking for LoadBalancer network interfaces in VPC subnets..." 
+ VPC_SUBNETS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'Subnets[].SubnetId' --output text 2>/dev/null || echo "") + + if [ -n "$VPC_SUBNETS" ] && [ "$VPC_SUBNETS" != "None" ]; then + for subnet_id in $VPC_SUBNETS; do + LB_ENIS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=subnet-id,Values=$subnet_id" --query 'NetworkInterfaces[?contains(Description, `ELB`) && Status==`in-use`].{Id:NetworkInterfaceId,Description:Description}' --output text 2>/dev/null || echo "") + + if [ -n "$LB_ENIS" ] && [ "$LB_ENIS" != "None" ]; then + log "Found LoadBalancer network interfaces in subnet $subnet_id:" + echo "$LB_ENIS" | while read eni_id description; do + if [ -n "$eni_id" ] && [ "$eni_id" != "None" ]; then + log " ENI $eni_id: $description" + # Extract LoadBalancer name from description for targeted deletion + LB_FROM_ENI=$(echo "$description" | grep -o 'k8s-[^/]*' | head -1 || echo "") + if [ -n "$LB_FROM_ENI" ]; then + log " Attempting to delete LoadBalancer: $LB_FROM_ENI" + # Try to find and delete the LoadBalancer by name pattern + LB_ARN=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?LoadBalancerName=='$LB_FROM_ENI'].LoadBalancerArn" --output text 2>/dev/null || echo "") + if [ -n "$LB_ARN" ] && [ "$LB_ARN" != "None" ]; then + log " Found ELBv2 LoadBalancer, deleting: $LB_ARN" + aws elbv2 delete-load-balancer --region $REGION --load-balancer-arn "$LB_ARN" || warn "Failed to delete ELBv2 LoadBalancer $LB_FROM_ENI" + else + # Try Classic ELB + aws elb delete-load-balancer --region $REGION --load-balancer-name "$LB_FROM_ENI" 2>/dev/null || warn "Could not delete LoadBalancer $LB_FROM_ENI" + fi + fi + fi + done + fi + done + + # Final wait for all network interfaces to be released + log "Waiting for LoadBalancer network interfaces to be released..." 
+ for i in {1..8}; do # Wait up to 4 minutes + REMAINING_LB_ENIS="" + for subnet_id in $VPC_SUBNETS; do + SUBNET_LB_ENIS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=subnet-id,Values=$subnet_id" --query 'NetworkInterfaces[?contains(Description, `ELB`) && Status==`in-use`].NetworkInterfaceId' --output text 2>/dev/null || echo "") + if [ -n "$SUBNET_LB_ENIS" ] && [ "$SUBNET_LB_ENIS" != "None" ]; then + REMAINING_LB_ENIS="$REMAINING_LB_ENIS $SUBNET_LB_ENIS" + fi + done + + if [ -z "$REMAINING_LB_ENIS" ] || [ "$REMAINING_LB_ENIS" = " " ]; then + success "All LoadBalancer network interfaces released" + break + fi + log "Still waiting for LoadBalancer network interfaces to be released... (attempt $i/8)" + sleep 30 + done + fi + + success "Comprehensive LoadBalancer cleanup completed" + + # ENHANCED SECURITY GROUP CLEANUP - Run after LoadBalancer cleanup is complete + log "Performing enhanced security group cleanup..." + + # Wait a bit more for AWS to propagate LoadBalancer deletions + sleep 30 + + # Get all non-default security groups in the VPC with retry logic + for retry in {1..3}; do + log "Attempting security group cleanup (attempt $retry/3)..." 
+ + SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "") + + if [ -n "$SECURITY_GROUPS" ] && [ "$SECURITY_GROUPS" != "None" ]; then + log "Found security groups to delete: $SECURITY_GROUPS" + + # Delete security groups one by one with detailed error handling + for sg_id in $SECURITY_GROUPS; do + if [ -n "$sg_id" ] && [ "$sg_id" != "None" ]; then + SG_NAME=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].GroupName' --output text 2>/dev/null || echo "unknown") + SG_DESC=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].Description' --output text 2>/dev/null || echo "unknown") + + log " Attempting to delete security group: $SG_NAME ($sg_id) - $SG_DESC" + + # Try to delete the security group + if aws ec2 delete-security-group --group-id "$sg_id" --region $REGION 2>/dev/null; then + success " Successfully deleted security group: $SG_NAME" + else + warn " Failed to delete security group: $SG_NAME - may have dependencies" + + # Check what's still using this security group + SG_DEPS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=group-id,Values=$sg_id" --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status,Description:Description}' --output text 2>/dev/null || echo "") + if [ -n "$SG_DEPS" ] && [ "$SG_DEPS" != "None" ]; then + log " Security group $SG_NAME is still used by network interfaces:" + echo "$SG_DEPS" | while read eni_id status desc; do + log " ENI: $eni_id ($status) - $desc" + done + fi + fi + fi + done + + # Wait for security group deletions to propagate + if [ $retry -lt 3 ]; then + log "Waiting 60 seconds for security group deletions to propagate..." 
+ sleep 60 + fi + else + log "No non-default security groups found" + break + fi + done + + # Final verification of security group cleanup + REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].{GroupId:GroupId,GroupName:GroupName}' --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_SG" ] || [ "$REMAINING_SG" = "None" ]; then + success "All non-default security groups cleaned up successfully" + else + warn "Some security groups remain in VPC:" + echo "$REMAINING_SG" | while read sg_id sg_name; do + warn " Remaining: $sg_name ($sg_id)" + done + fi + + # Clean up security groups in this VPC (except default) + log "Finding security groups in VPC $VPC_ID..." + SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "") + fi + + if [ -n "$SECURITY_GROUPS" ] && [ "$SECURITY_GROUPS" != "None" ]; then + log "Deleting security groups..." + echo "$SECURITY_GROUPS" | tr '\t' '\n' | while read sg_id; do + if [ -n "$sg_id" ]; then + SG_NAME=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].GroupName' --output text 2>/dev/null || echo "unknown") + log " Deleting security group: $SG_NAME ($sg_id)" + aws ec2 delete-security-group --group-id "$sg_id" --region $REGION 2>/dev/null || warn "Failed to delete security group $sg_id" + fi + done + + # Wait and verify security groups are deleted + log "Waiting for security groups to be deleted..." 
+ for i in {1..6}; do # Wait up to 3 minutes + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "") + else + REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=group-name,Values=k8s-*" --query 'SecurityGroups[].GroupId' --output text 2>/dev/null || echo "") + fi + + if [ -z "$REMAINING_SG" ] || [ "$REMAINING_SG" = "None" ]; then + success "All security groups deleted successfully" + break + fi + log "Still waiting for security groups to be deleted... (attempt $i/6)" + sleep 30 + done + else + log "No non-default security groups found to clean up." + fi + + # Clean up any remaining network interfaces in the VPC + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + log "Checking for remaining network interfaces in VPC $VPC_ID..." + NETWORK_INTERFACES=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'NetworkInterfaces[?Status!=`in-use`].NetworkInterfaceId' --output text 2>/dev/null || echo "") + + if [ -n "$NETWORK_INTERFACES" ] && [ "$NETWORK_INTERFACES" != "None" ]; then + log "Deleting unused network interfaces..." + echo "$NETWORK_INTERFACES" | tr '\t' '\n' | while read eni_id; do + if [ -n "$eni_id" ]; then + log " Deleting network interface: $eni_id" + aws ec2 delete-network-interface --network-interface-id "$eni_id" --region $REGION 2>/dev/null || warn "Failed to delete network interface $eni_id" + fi + done + + # Wait for network interfaces to be deleted + log "Waiting for network interfaces to be deleted..." + sleep 30 + else + log "No unused network interfaces found." + fi + fi + + success "VPC dependencies cleanup completed." 
+} + +# Delete EKS cluster +delete_cluster() { + if [ "$DELETE_CLUSTER" != "true" ]; then + warn "Skipping EKS cluster deletion (--keep-cluster specified)" + return 0 + fi + + log "Deleting EKS cluster..." + + # Check if cluster exists + if ! eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster $CLUSTER_NAME not found. Skipping cluster deletion." + return 0 + fi + + # Final check: Make sure all LoadBalancers are really gone + log "Final verification: ensuring all LoadBalancers are deleted..." + local retry_count=0 + local max_retries=5 + + while [ $retry_count -lt $max_retries ]; do + # Get VPC ID for targeted cleanup + VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=tag:Name,Values=eksctl-$CLUSTER_NAME-cluster/VPC" --query 'Vpcs[0].VpcId' --output text 2>/dev/null || echo "") + + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + # Check for LoadBalancers in this VPC + VPC_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].LoadBalancerArn" --output text 2>/dev/null || echo "") + VPC_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "") + + if ([ -z "$VPC_LBS_V2" ] || [ "$VPC_LBS_V2" = "None" ]) && ([ -z "$VPC_LBS_CLASSIC" ] || [ "$VPC_LBS_CLASSIC" = "None" ]); then + success "No LoadBalancers found in cluster VPC" + break + fi + + log "Found LoadBalancers still in VPC $VPC_ID, waiting... 
(attempt $((retry_count + 1))/$max_retries)" + if [ -n "$VPC_LBS_V2" ] && [ "$VPC_LBS_V2" != "None" ]; then + log " ELBv2 LoadBalancers: $VPC_LBS_V2" + fi + if [ -n "$VPC_LBS_CLASSIC" ] && [ "$VPC_LBS_CLASSIC" != "None" ]; then + log " Classic LoadBalancers: $VPC_LBS_CLASSIC" + fi + + sleep 60 + retry_count=$((retry_count + 1)) + else + log "VPC not found or already deleted" + break + fi + done + + # Delete the cluster + eksctl delete cluster --name $CLUSTER_NAME --region $REGION --wait + + if [ $? -eq 0 ]; then + success "EKS cluster deleted successfully" + else + error "Failed to delete EKS cluster" + fi +} + +# Clean up local kubectl context +cleanup_kubectl_context() { + log "Cleaning up kubectl context..." + + # Remove kubectl context (handle both possible context names) + kubectl config delete-context "$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl context $CLUSTER_NAME.$REGION.eksctl.io not found" + kubectl config delete-cluster "$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl cluster $CLUSTER_NAME.$REGION.eksctl.io not found" + kubectl config delete-user "documentdb-admin@$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl user not found" + + # Also try the default user pattern + kubectl config delete-user "$CLUSTER_NAME@$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl user (alternate pattern) not found" + + success "kubectl context cleaned up" +} + +# Verify deletion +verify_deletion() { + log "Verifying deletion..." + + echo "" + echo "=== Checking for remaining resources ===" + + # Check if cluster exists + if eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster still exists!" + else + success "Cluster deleted" + fi + + # Check for remaining CloudFormation stacks + echo "" + log "Checking for remaining CloudFormation stacks..." 
+ aws cloudformation list-stacks --region $REGION --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE --query "StackSummaries[?contains(StackName, 'eksctl-$CLUSTER_NAME')].{Name:StackName,Status:StackStatus}" --output table || true + + # Check for remaining EBS volumes + echo "" + log "Checking for remaining EBS volumes..." + aws ec2 describe-volumes --region $REGION --filters "Name=tag:kubernetes.io/cluster/$CLUSTER_NAME,Values=owned" --query "Volumes[].{VolumeId:VolumeId,State:State,Size:Size}" --output table 2>/dev/null || log "No volumes found with cluster tag" + + # Check for remaining load balancers + echo "" + log "Checking for remaining load balancers..." + aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s')].[LoadBalancerName,State.Code]" --output table 2>/dev/null || log "No load balancers found" + + echo "" + success "Deletion verification complete!" +} + +# Manual cleanup instructions +show_manual_cleanup() { + echo "" + echo "=======================================" + echo " MANUAL CLEANUP (if needed)" + echo "=======================================" + echo "" + echo "If any resources remain, you can manually clean them up:" + echo "" + echo "1. CloudFormation Stacks:" + echo " aws cloudformation delete-stack --stack-name STACK_NAME --region $REGION" + echo "" + echo "2. EBS Volumes:" + echo " aws ec2 delete-volume --volume-id VOLUME_ID --region $REGION" + echo "" + echo "3. Load Balancers:" + echo " aws elbv2 delete-load-balancer --load-balancer-arn LOAD_BALANCER_ARN" + echo "" + echo "4. IAM Roles and Policies:" + echo " Check AWS Console -> IAM for any remaining eksctl-created resources" + echo "" +} + +# Clean up failed CloudFormation stacks with proper waiting +cleanup_failed_cloudformation_stacks() { + if [ "$DELETE_CLUSTER" != "true" ]; then + return 0 + fi + + log "Checking for failed CloudFormation stacks..." 
+ + # Look for stacks related to our cluster that are in failed states + FAILED_STACKS=$(aws cloudformation list-stacks --region $REGION --query "StackSummaries[?contains(StackName, '$CLUSTER_NAME') && (StackStatus=='DELETE_FAILED' || StackStatus=='CREATE_FAILED' || StackStatus=='UPDATE_FAILED')].StackName" --output text 2>/dev/null || echo "") + + if [ -n "$FAILED_STACKS" ] && [ "$FAILED_STACKS" != "None" ]; then + log "Found failed CloudFormation stacks, attempting to delete:" + echo "$FAILED_STACKS" | tr '\t' '\n' | while read stack_name; do + if [ -n "$stack_name" ]; then + log " Deleting failed stack: $stack_name" + aws cloudformation delete-stack --stack-name "$stack_name" --region $REGION 2>/dev/null || warn "Failed to delete stack $stack_name" + fi + done + + # Wait for stack deletion to complete with verification + log "Waiting for CloudFormation stack deletion to complete..." + for i in {1..20}; do # Wait up to 10 minutes + REMAINING_STACKS=$(aws cloudformation list-stacks --region $REGION --query "StackSummaries[?contains(StackName, '$CLUSTER_NAME') && StackStatus!='DELETE_COMPLETE'].StackName" --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_STACKS" ] || [ "$REMAINING_STACKS" = "None" ]; then + success "All CloudFormation stacks deleted successfully" + break + fi + log "Still waiting for CloudFormation stacks to be deleted... (attempt $i/20)" + sleep 30 + done + else + log "No failed CloudFormation stacks found." + fi +} + +# Main execution +main() { + echo "=======================================" + echo " DocumentDB EKS Cluster Deletion" + echo "=======================================" + echo "" + log "Target Configuration:" + log " Cluster: $CLUSTER_NAME" + log " Region: $REGION" + log " Delete Instance: $DELETE_INSTANCE" + log " Delete Operator: $DELETE_OPERATOR" + log " Delete Cluster: $DELETE_CLUSTER" + echo "" + + confirm_deletion + + log "Starting cluster deletion process..." + + # Check if cluster exists before proceeding + if ! 
eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster '$CLUSTER_NAME' not found in region '$REGION'" + log "This may have been already deleted, or the name/region is incorrect." + log "Proceeding with cleanup of any remaining AWS resources..." + + # Even if cluster is gone, clean up any remaining AWS resources + if [ "$DELETE_CLUSTER" == "true" ]; then + cleanup_infrastructure_loadbalancers + cleanup_vpc_dependencies + cleanup_failed_cloudformation_stacks + cleanup_kubectl_context + fi + return 0 + fi + + # Step 1: Delete Kubernetes resources first + delete_documentdb_instances + delete_helm_releases + delete_namespaces + delete_crds + + # Step 2: Clean up AWS resources in proper order (only if deleting cluster) + if [ "$DELETE_CLUSTER" == "true" ]; then + log "Proceeding with AWS resource cleanup..." + + # Step 2a: Clean up any remaining infrastructure LoadBalancers (not DocumentDB app LBs) + cleanup_infrastructure_loadbalancers + + # Step 2b: Clean up VPC dependencies (security groups, network interfaces) + cleanup_vpc_dependencies + + # Step 2c: Clean up any failed CloudFormation stacks + cleanup_failed_cloudformation_stacks + + # Step 2d: Delete remaining AWS resources (IAM roles, policies) + delete_aws_resources + + # Step 2e: Finally delete the cluster itself + delete_cluster + + # Step 2f: Clean up local kubectl context + cleanup_kubectl_context + fi + + verify_deletion + + echo "" + echo "=======================================" + success "🗑️ Deletion completed!" 
+ echo "=======================================" + echo "" + echo "Summary:" + if [ "$DELETE_INSTANCE" == "true" ]; then + echo " • DocumentDB instances removed" + fi + if [ "$DELETE_OPERATOR" == "true" ]; then + echo " • DocumentDB operator removed" + fi + if [ "$DELETE_CLUSTER" == "true" ]; then + echo " • EKS cluster '$CLUSTER_NAME' deleted from $REGION" + echo " • All AWS resources cleaned up" + echo " • kubectl context removed" + echo "" + success "No more AWS charges should be incurred from this cluster!" + else + echo " • EKS cluster '$CLUSTER_NAME' preserved" + echo "" + success "Cluster preserved - you can reinstall DocumentDB components as needed!" + fi + echo "" + + show_manual_cleanup +} + +# Run main function +main "$@" \ No newline at end of file From 568d989bd1157c4a80822f71fa7b6461996d6331 Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:09:58 -0400 Subject: [PATCH 2/7] feat(contrib): add 2-AZ deployment, CloudFormation diagnostics, mongosh prereq - Deploy to 2 AZs (minimum EKS supports) instead of the eksctl default of 3 to reduce per-node cross-AZ data transfer for dev/test clusters. - Print CloudFormation stack events after eksctl create succeeds/fails, so users can see the underlying AWS resource status (helps when a create is failing partway through). - Warn (non-fatal) when mongosh is missing, matching the DocumentDB troubleshooting guidance of verifying client tooling before blaming the server on connection issues. 
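
As a sketch of the zone selection this commit hardcodes: the `--zones` value is derived by appending literal `a`/`b` suffixes to the region name, which assumes both of those AZ names exist in the target account (AZ-letter-to-AZ-ID mappings vary per account). The `derive_zones` helper and the region below are illustrative only, not part of this patch:

```shell
# Illustrative only: how the --zones argument handed to eksctl is built.
# Assumption: AZs "<region>a" and "<region>b" exist in the target account.
derive_zones() {
  region="$1"
  printf '%sa,%sb\n' "$region" "$region"
}

derive_zones "us-west-2"   # prints: us-west-2a,us-west-2b
```

In an account where one of those AZ names is unavailable, eksctl should fail fast, and the CloudFormation event dump added in this commit is there to surface the underlying reason.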
Signed-off-by: michaelraney Made-with: Cursor --- .../scripts/create-cluster.sh | 70 ++++++++++++++++--- 1 file changed, 62 insertions(+), 8 deletions(-) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh index d5d5f186..7ec6f9fb 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -144,38 +144,87 @@ error() { exit 1 } +error_no_exit() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" +} + # Check prerequisites check_prerequisites() { log "Checking prerequisites..." - + # Check AWS CLI if ! command -v aws &> /dev/null; then error "AWS CLI not found. Please install AWS CLI first." fi - + # Check eksctl if ! command -v eksctl &> /dev/null; then error "eksctl not found. Please install eksctl first." fi - + # Check kubectl if ! command -v kubectl &> /dev/null; then error "kubectl not found. Please install kubectl first." fi - + # Check Helm if ! command -v helm &> /dev/null; then error "Helm not found. Please install Helm first." fi - + # Check AWS credentials if ! aws sts get-caller-identity &> /dev/null; then error "AWS credentials not configured. Please run 'aws configure' first." fi - + + # Optional but recommended: mongosh for endpoint validation. Mirrors the DocumentDB + # troubleshooting best practice of verifying local client tooling before blaming the server. + if ! command -v mongosh &> /dev/null; then + warn "mongosh not found. Install with: brew install mongosh (macOS) or see https://www.mongodb.com/docs/mongodb-shell/install/" + warn "Local connection validation (kubectl port-forward + mongosh) won't work until mongosh is installed." 
+ fi + success "All prerequisites met" } +# Show CloudFormation stack events for eksctl-managed stacks +show_cloudformation_events() { + local status_filter="$1" # "CREATE_FAILED" or "CREATE_COMPLETE" + local stacks + stacks=$(aws cloudformation list-stacks --region "$REGION" \ + --stack-status-filter CREATE_COMPLETE CREATE_FAILED ROLLBACK_COMPLETE ROLLBACK_IN_PROGRESS \ + --query "StackSummaries[?starts_with(StackName, 'eksctl-${CLUSTER_NAME}')].StackName" \ + --output text 2>/dev/null) + + if [ -z "$stacks" ]; then + warn "No CloudFormation stacks found for cluster $CLUSTER_NAME" + return + fi + + for stack in $stacks; do + log "CloudFormation stack: $stack" + if [ "$status_filter" == "CREATE_FAILED" ]; then + local failures + failures=$(aws cloudformation describe-stack-events --region "$REGION" \ + --stack-name "$stack" \ + --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[Timestamp,LogicalResourceId,ResourceStatusReason]" \ + --output table 2>/dev/null) + if [ -n "$failures" ] && ! echo "$failures" | grep -q "^$"; then + error_no_exit "Failed resources in $stack:" + echo "$failures" + else + success "No failures in $stack" + fi + else + local stack_status + stack_status=$(aws cloudformation describe-stacks --region "$REGION" \ + --stack-name "$stack" \ + --query "Stacks[0].StackStatus" --output text 2>/dev/null) + log " Status: $stack_status" + fi + done +} + # Create EKS cluster create_cluster() { log "Creating EKS cluster: $CLUSTER_NAME in region: $REGION" @@ -195,6 +244,7 @@ create_cluster() { warn "============================================================" fi + # 2 AZs is the minimum EKS supports, reduced from eksctl default of 3 for cost reasons. local EKSCTL_ARGS=( --name "$CLUSTER_NAME" --region "$REGION" @@ -214,11 +264,15 @@ create_cluster() { EKSCTL_ARGS+=(--node-type "$NODE_TYPE") fi - eksctl create cluster "${EKSCTL_ARGS[@]}" + eksctl create cluster "${EKSCTL_ARGS[@]}" --zones "${REGION}a,${REGION}b" + local exit_code=$? - if [ $? 
-eq 0 ]; then + log "Retrieving CloudFormation stack events..." + if [ $exit_code -eq 0 ]; then + show_cloudformation_events "CREATE_COMPLETE" success "EKS cluster created successfully" else + show_cloudformation_events "CREATE_FAILED" error "Failed to create EKS cluster" fi } From 5222cffadd9fccf731c0ee9912bdd15bf0b34e8b Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:10:58 -0400 Subject: [PATCH 3/7] feat(contrib): add S3 Gateway VPC endpoint (create + teardown) Provision a free S3 Gateway VPC endpoint after the cluster is up and attach it to every route table in the cluster VPC. S3 traffic from the cluster then bypasses the NAT Gateway, eliminating NAT data-transfer charges for S3 (useful for pulls from S3, CloudWatch exports, backups, etc.). Teardown in delete-cluster.sh enumerates and deletes every VPC endpoint in the cluster VPC before the rest of the VPC dependency cleanup runs, so the VPC can be fully destroyed by eksctl/CloudFormation. Summary output in create-cluster.sh is updated to list the endpoint. Signed-off-by: michaelraney Made-with: Cursor --- .../scripts/create-cluster.sh | 32 +++++++++++++++++++ .../scripts/delete-cluster.sh | 20 +++++++++++- 2 files changed, 51 insertions(+), 1 deletion(-) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh index 7ec6f9fb..bafdd2f8 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -277,6 +277,36 @@ create_cluster() { fi } +# Create VPC endpoints for cost optimization (S3 Gateway endpoint is free) +create_vpc_endpoints() { + log "Creating VPC endpoints for cost optimization..." 
+ + VPC_ID=$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$REGION" \ + --query 'cluster.resourcesVpcConfig.vpcId' --output text) + + if [ -z "$VPC_ID" ] || [ "$VPC_ID" = "None" ]; then + warn "Could not determine VPC ID. Skipping VPC endpoint creation." + return 0 + fi + + ROUTE_TABLE_IDS=$(aws ec2 describe-route-tables --region "$REGION" \ + --filters "Name=vpc-id,Values=$VPC_ID" \ + --query 'RouteTables[].RouteTableId' --output text) + + # S3 Gateway Endpoint (free - reduces NAT Gateway data transfer costs) + if aws ec2 describe-vpc-endpoints --region "$REGION" \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" \ + --query 'VpcEndpoints[0].VpcEndpointId' --output text 2>/dev/null | grep -q "vpce-"; then + warn "S3 VPC endpoint already exists. Skipping creation." + else + aws ec2 create-vpc-endpoint \ + --vpc-id "$VPC_ID" \ + --service-name "com.amazonaws.$REGION.s3" \ + --route-table-ids $ROUTE_TABLE_IDS \ + --region "$REGION" 2>/dev/null && success "S3 Gateway VPC endpoint created (free)" || warn "Could not create S3 VPC endpoint" + fi +} + # Install EBS CSI Driver install_ebs_csi() { log "Installing EBS CSI Driver..." 
@@ -643,6 +673,7 @@ print_summary() { echo "" echo "✅ Components installed:" echo " - EKS cluster with managed nodes ($NODE_TYPE)" + echo " - S3 Gateway VPC endpoint (cost optimization)" echo " - EBS CSI driver" echo " - AWS Load Balancer Controller" echo " - cert-manager" @@ -688,6 +719,7 @@ main() { # Execute setup steps check_prerequisites create_cluster + create_vpc_endpoints install_ebs_csi install_load_balancer_controller install_cert_manager diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh index 55ea412d..f3dbdc57 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh @@ -371,7 +371,25 @@ cleanup_vpc_dependencies() { # Get the VPC ID for our cluster VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=tag:Name,Values=eksctl-$CLUSTER_NAME-cluster/VPC" --query 'Vpcs[0].VpcId' --output text 2>/dev/null || echo "") - + + # Clean up VPC endpoints created for cost optimization (e.g. S3 Gateway endpoint). + # Must run before CloudFormation deletion so the VPC can be torn down cleanly. + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + log "Cleaning up VPC endpoints..." 
+ VPC_ENDPOINTS=$(aws ec2 describe-vpc-endpoints --region "$REGION" \ + --filters "Name=vpc-id,Values=$VPC_ID" \ + --query 'VpcEndpoints[].VpcEndpointId' --output text 2>/dev/null || echo "") + if [ -n "$VPC_ENDPOINTS" ] && [ "$VPC_ENDPOINTS" != "None" ]; then + for endpoint_id in $VPC_ENDPOINTS; do + log " Deleting VPC endpoint: $endpoint_id" + aws ec2 delete-vpc-endpoints --vpc-endpoint-ids "$endpoint_id" --region "$REGION" 2>/dev/null || warn "Failed to delete VPC endpoint $endpoint_id" + done + success "VPC endpoints cleaned up" + else + log "No VPC endpoints found to clean up." + fi + fi + if [ -z "$VPC_ID" ] || [ "$VPC_ID" = "None" ] || [ "$VPC_ID" = "null" ]; then log "No VPC found for cluster $CLUSTER_NAME, checking for any remaining k8s security groups..." # Fallback: look for any k8s-related security groups From d09a3860b5ab457aee8073e9ef31fe7b977dd771 Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:13:13 -0400 Subject: [PATCH 4/7] feat(contrib): add EKS control-plane logging and log-group retention - LOG_RETENTION_DAYS (default: 3) and CONTROL_PLANE_LOG_TYPES (default: api,authenticator) variables, both overridable via env or the new --log-retention / --control-plane-log-types flags. - enable_control_plane_logging() calls eksctl utils update-cluster-logging to ship the selected control-plane log streams to CloudWatch. - set_log_retention() applies the retention policy to the cluster's log groups (control plane + container insights) with lazy-create and retry, since the container-insights groups only appear after the collector emits its first log. - delete-cluster.sh gets delete_cloudwatch_logs() and wires it into main() (both the "cluster still exists" and "cluster already gone" paths) so teardown leaves no lingering log groups. - Help text, configuration banner, and post-install summary updated to reflect the new options. 
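
CloudWatch Logs only accepts retention from a fixed set of day values (the set listed in the new help text), so a bad `--log-retention` would otherwise surface only as an API error during the retry loop. A hypothetical pre-flight validator — the `valid_retention` helper below is illustrative, not part of this patch — could look like:

```shell
# Illustrative pre-flight check, not part of this patch: CloudWatch Logs
# rejects retention values outside this fixed set of allowed day counts.
valid_retention() {
  case " 1 3 5 7 14 30 60 90 120 150 180 365 400 545 731 1827 3653 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

valid_retention "${LOG_RETENTION_DAYS:-3}" \
  || { echo "Invalid LOG_RETENTION_DAYS: ${LOG_RETENTION_DAYS}" >&2; exit 1; }
```

Running such a check from check_prerequisites would fail fast on a typo'd value instead of letting set_log_retention retry for minutes against an input the API will never accept.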
This keeps the slim aws-setup/ scripts unchanged; the advanced logging behavior only applies when using the contrib variant. Signed-off-by: michaelraney Made-with: Cursor --- .../scripts/create-cluster.sh | 81 ++++++++++++++++++- .../scripts/delete-cluster.sh | 31 ++++++- 2 files changed, 108 insertions(+), 4 deletions(-) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh index bafdd2f8..bc572860 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -29,6 +29,13 @@ OPERATOR_CHART_VERSION="0.1.0" INSTALL_OPERATOR="${INSTALL_OPERATOR:-false}" DEPLOY_INSTANCE="${DEPLOY_INSTANCE:-false}" +# Logging and observability configuration +# LOG_RETENTION_DAYS: CloudWatch log group retention (allowed values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, ...) +# CONTROL_PLANE_LOG_TYPES: comma-separated EKS control plane log types to enable +# (valid: api, audit, authenticator, controllerManager, scheduler). Keep small to control cost. 
+LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-3}" +CONTROL_PLANE_LOG_TYPES="${CONTROL_PLANE_LOG_TYPES:-api,authenticator}" + # Parse command line arguments while [[ $# -gt 0 ]]; do case $1 in @@ -81,6 +88,14 @@ while [[ $# -gt 0 ]]; do CLUSTER_TAGS="$2" shift 2 ;; + --log-retention) + LOG_RETENTION_DAYS="$2" + shift 2 + ;; + --control-plane-log-types) + CONTROL_PLANE_LOG_TYPES="$2" + shift 2 + ;; -h|--help) echo "Usage: $0 [OPTIONS]" echo "" @@ -101,6 +116,12 @@ while [[ $# -gt 0 ]]; do echo " --tags TAGS Cost allocation tags as key=value pairs (comma-separated)" echo " (default: project=documentdb-playground,environment=dev,managed-by=eksctl)" echo "" + echo "Logging / observability options:" + echo " --log-retention DAYS CloudWatch retention in days (default: 3)" + echo " Valid: 1,3,5,7,14,30,60,90,120,150,180,365,400,545,731,1827,3653" + echo " --control-plane-log-types LST Comma-separated EKS control-plane log types (default: api,authenticator)" + echo " Valid: api,audit,authenticator,controllerManager,scheduler" + echo "" echo " -h, --help Show this help message" echo "" echo "Examples:" @@ -277,6 +298,52 @@ create_cluster() { fi } +# Enable EKS control plane logging to CloudWatch +# https://docs.aws.amazon.com/prescriptive-guidance/latest/amazon-eks-observability-best-practices/logging-best-practices.html +enable_control_plane_logging() { + log "Enabling EKS control plane logging: $CONTROL_PLANE_LOG_TYPES" + eksctl utils update-cluster-logging \ + --region "$REGION" \ + --cluster "$CLUSTER_NAME" \ + --enable-types "$CONTROL_PLANE_LOG_TYPES" \ + --approve \ + && success "Control plane logging enabled" \ + || warn "Failed to enable control plane logging (continuing)" +} + +# Set CloudWatch log group retention for cost control. +# Log groups may not exist yet (collector creates them lazily on first log); +# we retry for up to ~3 minutes per group. +set_log_retention() { + log "Setting CloudWatch log retention to $LOG_RETENTION_DAYS days..." 
+ local groups=( + "/aws/eks/${CLUSTER_NAME}/cluster" + "/aws/containerinsights/${CLUSTER_NAME}/application" + "/aws/containerinsights/${CLUSTER_NAME}/dataplane" + "/aws/containerinsights/${CLUSTER_NAME}/host" + "/aws/containerinsights/${CLUSTER_NAME}/performance" + ) + for group in "${groups[@]}"; do + local set=false + for i in {1..6}; do + # Add-on collectors can create groups lazily; proactively create if missing. + aws logs create-log-group --log-group-name "$group" --region "$REGION" >/dev/null 2>&1 || true + if aws logs put-retention-policy \ + --log-group-name "$group" \ + --retention-in-days "$LOG_RETENTION_DAYS" \ + --region "$REGION" 2>/dev/null; then + success "Retention set on $group: ${LOG_RETENTION_DAYS}d" + set=true + break + fi + sleep 30 + done + if [ "$set" = false ]; then + warn "Log group $group not created yet; will need manual retention: aws logs put-retention-policy --log-group-name $group --retention-in-days $LOG_RETENTION_DAYS --region $REGION" + fi + done +} + # Create VPC endpoints for cost optimization (S3 Gateway endpoint is free) create_vpc_endpoints() { log "Creating VPC endpoints for cost optimization..." 
@@ -670,6 +737,8 @@ print_summary() { echo "Tags: $CLUSTER_TAGS" echo "Operator Installed: $INSTALL_OPERATOR" echo "Instance Deployed: $DEPLOY_INSTANCE" + echo "Log Retention: $LOG_RETENTION_DAYS days" + echo "Control Plane Log Types: $CONTROL_PLANE_LOG_TYPES" echo "" echo "✅ Components installed:" echo " - EKS cluster with managed nodes ($NODE_TYPE)" @@ -678,6 +747,8 @@ print_summary() { echo " - AWS Load Balancer Controller" echo " - cert-manager" echo " - DocumentDB storage class" + echo " - EKS control plane logging ($CONTROL_PLANE_LOG_TYPES) -> CloudWatch" + echo " - CloudWatch log retention: $LOG_RETENTION_DAYS days" if [ "$INSTALL_OPERATOR" == "true" ]; then echo " - DocumentDB operator" fi @@ -714,21 +785,27 @@ main() { log " Tags: $CLUSTER_TAGS" log " Install Operator: $INSTALL_OPERATOR" log " Deploy Instance: $DEPLOY_INSTANCE" + log " Log Retention (days): $LOG_RETENTION_DAYS" + log " Control Plane Log Types: $CONTROL_PLANE_LOG_TYPES" echo "" # Execute setup steps check_prerequisites create_cluster + enable_control_plane_logging create_vpc_endpoints install_ebs_csi install_load_balancer_controller install_cert_manager create_storage_class - + + # Logging/observability pipeline + set_log_retention + # Optional components install_documentdb_operator deploy_documentdb_instance - + # Show summary print_summary } diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh index f3dbdc57..b9c6dae8 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh @@ -290,6 +290,29 @@ delete_crds() { success "DocumentDB CRDs deleted" } +# Delete CloudWatch log groups created for this cluster (control plane + container insights) +delete_cloudwatch_logs() { + if [ "$DELETE_CLUSTER" != "true" ]; then + 
return 0 + fi + + log "Deleting CloudWatch log groups for cluster $CLUSTER_NAME..." + local groups=( + "/aws/eks/${CLUSTER_NAME}/cluster" + "/aws/containerinsights/${CLUSTER_NAME}/application" + "/aws/containerinsights/${CLUSTER_NAME}/dataplane" + "/aws/containerinsights/${CLUSTER_NAME}/host" + "/aws/containerinsights/${CLUSTER_NAME}/performance" + ) + for group in "${groups[@]}"; do + if aws logs delete-log-group --log-group-name "$group" --region "$REGION" 2>/dev/null; then + success "Deleted log group: $group" + else + warn "Log group $group not found (may have already been deleted or never created)" + fi + done +} + # Delete AWS resources delete_aws_resources() { log "Deleting AWS resources..." @@ -827,6 +850,7 @@ main() { cleanup_infrastructure_loadbalancers cleanup_vpc_dependencies cleanup_failed_cloudformation_stacks + delete_cloudwatch_logs cleanup_kubectl_context fi return 0 @@ -856,8 +880,11 @@ main() { # Step 2e: Finally delete the cluster itself delete_cluster - - # Step 2f: Clean up local kubectl context + + # Step 2f: Delete CloudWatch log groups (after cluster deletion so control plane stops writing) + delete_cloudwatch_logs + + # Step 2g: Clean up local kubectl context cleanup_kubectl_context fi From 76d3c192f71731b4b1791dfb1c5596a882314e07 Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:14:43 -0400 Subject: [PATCH 5/7] feat(contrib): add CloudWatch Observability add-on (Container Insights) - install_cloudwatch_observability_addon() in create-cluster.sh installs the eks-pod-identity-agent add-on (if needed), wires up the amazon-cloudwatch/cloudwatch-agent pod identity association with the CloudWatchAgentServerPolicy permission, then installs (or updates) the amazon-cloudwatch-observability EKS add-on and waits for it to become ACTIVE. This is the managed replacement for hand-rolled Fluent Bit manifests and enables Container Insights log shipping. 
- Wired into main() after install_cert_manager, before set_log_retention, so retention policies apply to the container-insights log groups the add-on creates. - delete-cluster.sh deletes the add-on (waiting for addon-deleted) before uninstalling the AWS Load Balancer Controller and cert-manager, so the collector stops writing logs before the cluster is torn down. - Summary output lists the add-on alongside control-plane logging. Signed-off-by: michaelraney Made-with: Cursor --- .../scripts/create-cluster.sh | 96 +++++++++++++++++++ .../scripts/delete-cluster.sh | 18 +++- 2 files changed, 113 insertions(+), 1 deletion(-) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh index bc572860..fcd24b52 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -311,6 +311,100 @@ enable_control_plane_logging() { || warn "Failed to enable control plane logging (continuing)" } +# Install Amazon CloudWatch Observability EKS add-on (managed collector for Container Insights) +install_cloudwatch_observability_addon() { + log "Installing Amazon CloudWatch Observability EKS add-on..." + + # Pod identity agent is required to create pod identity associations. + local POD_IDENTITY_STATUS + POD_IDENTITY_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "eks-pod-identity-agent" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || true) + if [ -z "$POD_IDENTITY_STATUS" ] || [ "$POD_IDENTITY_STATUS" = "None" ]; then + log "Installing eks-pod-identity-agent add-on (required for CloudWatch agent IAM)..." 
+ eksctl create addon \ + --cluster "$CLUSTER_NAME" \ + --region "$REGION" \ + --name eks-pod-identity-agent >/dev/null 2>&1 \ + || warn "Failed to install eks-pod-identity-agent add-on (continuing)" + fi + + # Grant CloudWatch agent permission through pod identity (recommended least-privilege path). + local CW_ASSOC_COUNT + CW_ASSOC_COUNT=$(aws eks list-pod-identity-associations \ + --cluster-name "$CLUSTER_NAME" \ + --region "$REGION" \ + --query "length(associations[?namespace=='amazon-cloudwatch' && serviceAccount=='cloudwatch-agent'])" \ + --output text 2>/dev/null || echo "0") + if [ "$CW_ASSOC_COUNT" = "0" ]; then + log "Creating pod identity association for amazon-cloudwatch/cloudwatch-agent..." + eksctl create podidentityassociation \ + --cluster "$CLUSTER_NAME" \ + --region "$REGION" \ + --namespace amazon-cloudwatch \ + --service-account-name cloudwatch-agent \ + --create-service-account \ + --permission-policy-arns arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy >/dev/null 2>&1 \ + || warn "Failed to create pod identity association for cloudwatch-agent" + fi + + local ADDON_NAME="amazon-cloudwatch-observability" + local ADDON_STATUS + ADDON_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || true) + + if [ -n "$ADDON_STATUS" ] && [ "$ADDON_STATUS" != "None" ]; then + log "CloudWatch Observability add-on already exists (status=$ADDON_STATUS); updating to latest compatible version" + aws eks update-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --resolve-conflicts OVERWRITE >/dev/null 2>&1 \ + || warn "Failed to update add-on (continuing with existing installation)" + else + aws eks create-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --resolve-conflicts OVERWRITE >/dev/null 2>&1 \ + || warn "Failed to create add-on (it 
may already exist or require IAM setup)" + fi + + log "Waiting for CloudWatch Observability add-on to become ACTIVE..." + if aws eks wait addon-active \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" 2>/dev/null; then + success "CloudWatch Observability add-on is ACTIVE" + else + warn "Add-on did not become ACTIVE in time; checking current status" + fi + + local FINAL_STATUS + FINAL_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || echo "UNKNOWN") + log "Add-on status: $FINAL_STATUS" + + # The add-on manages collector deployment details internally. + # Namespace/components can vary by add-on/EKS version, so this is best-effort visibility. + if kubectl get ns amazon-cloudwatch >/dev/null 2>&1; then + kubectl get pods -n amazon-cloudwatch || true + else + warn "amazon-cloudwatch namespace not found yet (collector may still be reconciling)" + fi +} + # Set CloudWatch log group retention for cost control. # Log groups may not exist yet (collector creates them lazily on first log); # we retry for up to ~3 minutes per group. 
@@ -748,6 +842,7 @@ print_summary() { echo " - cert-manager" echo " - DocumentDB storage class" echo " - EKS control plane logging ($CONTROL_PLANE_LOG_TYPES) -> CloudWatch" + echo " - Amazon CloudWatch Observability add-on -> CloudWatch" echo " - CloudWatch log retention: $LOG_RETENTION_DAYS days" if [ "$INSTALL_OPERATOR" == "true" ]; then echo " - DocumentDB operator" @@ -800,6 +895,7 @@ main() { create_storage_class # Logging/observability pipeline + install_cloudwatch_observability_addon set_log_retention # Optional components diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh index b9c6dae8..039373ec 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh @@ -223,9 +223,25 @@ delete_helm_releases() { # Only delete these if we're deleting the whole cluster if [ "$DELETE_CLUSTER" == "true" ]; then + # Delete CloudWatch Observability add-on (managed collector for Container Insights) + log "Removing CloudWatch Observability add-on..." 
+ if aws eks delete-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name amazon-cloudwatch-observability \ + --region "$REGION" >/dev/null 2>&1; then + success "CloudWatch Observability add-on deletion requested" + aws eks wait addon-deleted \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name amazon-cloudwatch-observability \ + --region "$REGION" >/dev/null 2>&1 \ + || warn "Timed out waiting for add-on deletion; cluster deletion will continue" + else + warn "CloudWatch Observability add-on not found or could not be deleted" + fi + # Delete AWS Load Balancer Controller (after LoadBalancer services are gone) helm uninstall aws-load-balancer-controller -n kube-system 2>/dev/null || warn "AWS Load Balancer Controller not found" - + # Delete cert-manager helm uninstall cert-manager -n cert-manager 2>/dev/null || warn "cert-manager not found" fi From 535bf109b8841beb4c63a17bbb96b6f9b126f357 Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:15:59 -0400 Subject: [PATCH 6/7] feat(contrib): add DocumentDB CloudWatch-aware diagnostics - diagnose_documentdb() runs after deploy_documentdb_instance() (no-op when --skip-instance). It mirrors the DocumentDB troubleshooting playbook: check pods, tail recent operator + instance logs, verify the service, and run an in-cluster mongosh ping via a throwaway pod. Uses kubectl logs (not CloudWatch) intentionally, since the add-on may not have flushed the last few lines yet. - Extend print_summary with a CloudWatch log-group listing and a full troubleshooting block that points users at: - aws logs tail with filter-pattern examples for operator and instance logs (primary path via Container Insights), - kubectl logs -f fallback, - add-on health checks, - local tooling sanity checks, - port-forward + mongosh endpoint validation, - TLS / self-signed certificate guidance. 
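The pass/fail decision of the in-cluster ping reduces to grepping the mongosh reply for `ok.*1`. That predicate can be exercised on its own; the sample replies below are illustrative, not captured from a real run:

```shell
# diagnose_documentdb() pipes the mongosh ping output through
# grep -q "ok.*1" -- the same predicate, extracted for illustration.
ping_ok() { printf '%s\n' "$1" | grep -q "ok.*1"; }

good_reply='{ ok: 1 }'
bad_reply='MongoNetworkError: connect ECONNREFUSED'

ping_ok "$good_reply" && echo "ping succeeded"
ping_ok "$bad_reply"  || echo "ping failed"
# prints: ping succeeded
# prints: ping failed
```

The pattern is deliberately loose (any `ok` later followed by a `1` matches), which is acceptable for a best-effort diagnostic but not for strict assertions.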
Signed-off-by: michaelraney Made-with: Cursor --- .../scripts/create-cluster.sh | 81 +++++++++++++++++++ 1 file changed, 81 insertions(+) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh index fcd24b52..c82eb4f1 100755 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -817,6 +817,43 @@ EOF log "📝 AWS LoadBalancer annotations are automatically applied by the operator based on environment: eks" } +# Run DocumentDB post-deploy diagnostics (adapted from DocumentDB troubleshooting best practices). +# Uses kubectl logs intentionally since CloudWatch may not have flushed recent lines yet. +diagnose_documentdb() { + if [ "$DEPLOY_INSTANCE" != "true" ]; then + return 0 + fi + + log "Running DocumentDB diagnostics..." + + # 1. Verify pods are running (k8s equivalent of 'docker ps' / 'systemctl status') + log "Checking DocumentDB pods..." + kubectl get pods -n documentdb-instance-ns -o wide || true + + # 2. Recent operator logs (k8s equivalent of 'docker logs' / 'journalctl') + log "Recent operator logs (tail 20):" + kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator --tail=20 2>/dev/null || warn "Operator logs unavailable" + + # 3. Recent instance pod logs + log "Recent DocumentDB instance logs (tail 20):" + kubectl logs -n documentdb-instance-ns sample-documentdb-1 --tail=20 2>/dev/null || warn "Instance pod not found or not ready" + + # 4. Verify service endpoint + log "Checking DocumentDB service..." + kubectl get svc -n documentdb-instance-ns documentdb-service-sample-documentdb 2>/dev/null || warn "Service not found" + + # 5. 
Ping test via in-cluster mongosh (validate endpoint independently of app code) + log "Testing DocumentDB connectivity via in-cluster mongosh..." + if kubectl run mongosh-diag-$RANDOM --rm -i --restart=Never --quiet \ + --image=mongo:7 -n documentdb-instance-ns -- \ + mongosh "mongodb://docdbadmin:SecurePassword123!@documentdb-service-sample-documentdb:10260/?directConnection=true&authMechanism=SCRAM-SHA-256&tls=true&tlsAllowInvalidCertificates=true" \ + --quiet --eval "db.runCommand({ping: 1})" 2>/dev/null | grep -q "ok.*1"; then + success "DocumentDB ping succeeded" + else + warn "DocumentDB ping failed -- see troubleshooting guide in summary" + fi +} + # Print summary print_summary() { echo "" @@ -851,9 +888,17 @@ print_summary() { echo " - DocumentDB instance (sample-documentdb)" fi echo "" + echo "📊 CloudWatch Log Groups (retention: ${LOG_RETENTION_DAYS}d):" + echo " - /aws/eks/$CLUSTER_NAME/cluster (control plane)" + echo " - /aws/containerinsights/$CLUSTER_NAME/application (pod stdout/stderr)" + echo " - /aws/containerinsights/$CLUSTER_NAME/dataplane (system pods, kubelet)" + echo " - /aws/containerinsights/$CLUSTER_NAME/host (node OS logs)" + echo " - /aws/containerinsights/$CLUSTER_NAME/performance (container insights performance)" + echo "" echo "💡 Next steps:" echo " - Verify cluster: kubectl get nodes" echo " - Check all pods: kubectl get pods --all-namespaces" + echo " - Verify add-on: aws eks describe-addon --cluster-name $CLUSTER_NAME --region $REGION --addon-name amazon-cloudwatch-observability" if [ "$INSTALL_OPERATOR" == "true" ]; then echo " - Check operator: kubectl get pods -n documentdb-operator" fi @@ -862,6 +907,39 @@ print_summary() { echo " - Check service status: kubectl get svc -n documentdb-instance-ns" echo " - Wait for LoadBalancer IP: kubectl get svc documentdb-service-sample-documentdb -n documentdb-instance-ns -w" echo " - Once IP is assigned, connect: mongodb://docdbadmin:SecurePassword123!@:10260/" + echo "" + echo "🔎 
Troubleshooting (adapted from DocumentDB troubleshooting best practices):" + echo "" + echo " 1. Verify instance is running (k8s equivalent of 'docker ps' / 'systemctl status'):" + echo " kubectl get pods -n documentdb-instance-ns" + echo "" + echo " 2. Check logs (PRIMARY path via CloudWatch; observability add-on ships pod logs here):" + echo " aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION --since 1h --follow" + echo " aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \\" + echo " --filter-pattern '{ \$.kubernetes.namespace_name = \"documentdb-operator\" }' --since 1h" + echo " aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \\" + echo " --filter-pattern '{ \$.kubernetes.pod_name = \"sample-documentdb-1\" }' --since 1h" + echo " aws logs tail /aws/eks/$CLUSTER_NAME/cluster --region $REGION --since 1h # control plane" + echo "" + echo " 3. Check logs (FALLBACK via kubectl -- real-time streaming or if CloudWatch broken):" + echo " kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator -f" + echo " kubectl logs -n documentdb-instance-ns sample-documentdb-1 -f" + echo "" + echo " 4. Verify CloudWatch Observability add-on health (if CloudWatch logs are missing):" + echo " aws eks describe-addon --cluster-name $CLUSTER_NAME --region $REGION --addon-name amazon-cloudwatch-observability" + echo " kubectl get pods -n amazon-cloudwatch" + echo "" + echo " 5. Verify local client tooling works (k8s equivalent of 'python -c import pymongo'):" + echo " which mongosh && mongosh --version" + echo " which kubectl && kubectl version --client" + echo "" + echo " 6. 
Validate the endpoint independently of your application code:" + echo " kubectl port-forward -n documentdb-instance-ns svc/documentdb-service-sample-documentdb 10260:10260 &" + echo " mongosh \"mongodb://docdbadmin:SecurePassword123!@localhost:10260/?directConnection=true&authMechanism=SCRAM-SHA-256&tls=true&tlsAllowInvalidCertificates=true\"" + echo "" + echo " 7. TLS / certificate errors:" + echo " - For self-signed certs (default): keep tlsAllowInvalidCertificates=true in the connection string" + echo " - For trusted certs: export CA and pass tlsCAFile=/path/to/ca.pem instead" fi echo "" echo "⚠️ IMPORTANT: Run './delete-cluster.sh' when done to avoid AWS charges!" @@ -902,6 +980,9 @@ main() { install_documentdb_operator deploy_documentdb_instance + # Post-deploy diagnostics (no-op if --skip-instance) + diagnose_documentdb + # Show summary print_summary } From 2ea53eeb2c06f9b0392fe11a1d3b9ed9df5f45a9 Mon Sep 17 00:00:00 2001 From: michaelraney Date: Wed, 22 Apr 2026 10:17:02 -0400 Subject: [PATCH 7/7] docs(contrib): expand telemetry-and-cost-optimized-eks README Replace the scaffold README with documentation covering: - What this variant adds vs base aws-setup/ (comparison table). - Simple options inherited from aws-setup/ and contrib-only logging options (--log-retention, --control-plane-log-types). - Logging model table (which log group carries what) plus ready-to-use aws logs tail filter-pattern examples for operator/instance/control plane. - Cost-optimization summary (Graviton default, Spot, 2-AZ, S3 Gateway, retention) with a rough dev/test cost ballpark. - Troubleshooting quickstart that mirrors the block printed by create-cluster.sh so users see consistent guidance. - Teardown summary covering the add-on, VPC endpoints, log groups, and the cluster. 
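The retention values documented in the README form a fixed allow-list; a pre-flight guard like this hypothetical helper (not part of the scripts, and mirroring only the values the help text lists -- CloudWatch may accept additional newer values) would reject input before `put-retention-policy` refuses it:

```shell
# Hypothetical pre-flight check against the retention values documented
# in the README; CloudWatch only accepts a discrete set of day counts.
VALID_RETENTION_DAYS="1 3 5 7 14 30 60 90 120 150 180 365 400 545 731 1827 3653"

validate_retention() {
  local v
  for v in $VALID_RETENTION_DAYS; do
    [ "$v" = "$1" ] && return 0
  done
  return 1
}

validate_retention 3  && echo "3 is valid"
validate_retention 10 || echo "10 would be rejected by CloudWatch"
```

Failing fast here is cheaper than discovering the bad value after the cluster build has already started.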
Signed-off-by: michaelraney Made-with: Cursor --- .../README.md | 122 ++++++++++++++++-- 1 file changed, 113 insertions(+), 9 deletions(-) diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md index 433476ca..e9b1e79e 100644 --- a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md @@ -1,24 +1,128 @@ # Telemetry and Cost-Optimized EKS (contrib) > **Status:** community-contributed variant. Not actively maintained by the core DocumentDB team. +> If the base [`documentdb-playground/aws-setup/`](../../aws-setup/) scripts cover your needs, use those instead. -This folder contains a self-contained variant of the AWS EKS playground that layers additional cost-optimization and observability features on top of the base scripts in [`documentdb-playground/aws-setup/`](../../aws-setup/). +A self-contained variant of the AWS EKS playground that layers CloudWatch-based observability and additional cost-optimization features on top of the base scripts. -At this scaffold commit it is functionally equivalent to the base `aws-setup/` scripts plus the four simple options (`--node-type`, `--eks-version`, `--spot`, `--tags`). 
Subsequent commits add: +## What this variant adds over `aws-setup/` -- 2-AZ deployment and CloudFormation event diagnostics -- S3 Gateway VPC endpoint (free — reduces NAT Gateway data-transfer costs) -- EKS control-plane logging + CloudWatch log-group retention -- Amazon CloudWatch Observability add-on (Container Insights) -- CloudWatch-aware DocumentDB diagnostics +| Capability | Base `aws-setup/` | This contrib variant | +| --- | --- | --- | +| `--node-type`, `--eks-version`, `--spot`, `--tags` | ✅ | ✅ | +| 2-AZ deployment (cost-reduced) | ❌ (eksctl default: 3) | ✅ | +| CloudFormation stack event diagnostics | ❌ | ✅ | +| `mongosh` prerequisite warning | ❌ | ✅ | +| S3 Gateway VPC endpoint (free) | ❌ | ✅ | +| EKS control-plane logging → CloudWatch | ❌ | ✅ (`--control-plane-log-types`) | +| CloudWatch log-group retention | ❌ | ✅ (`--log-retention`) | +| Amazon CloudWatch Observability add-on (Container Insights) | ❌ | ✅ | +| CloudWatch log-group teardown | ❌ | ✅ | +| CloudWatch-aware post-deploy diagnostics | ❌ | ✅ | -The default cluster name is `documentdb-contrib-cluster` so you can run this alongside the base setup without collisions. +The default cluster name is `documentdb-contrib-cluster` so this variant can run alongside the base setup. + +## Prerequisites + +Same as [`aws-setup/`](../../aws-setup/) — AWS CLI, `eksctl`, `kubectl`, `helm`, `jq` — plus: + +- `mongosh` (warned, not required) for local endpoint validation. +- IAM permissions to manage EKS add-ons, CloudWatch log groups, VPC endpoints, and pod identity associations. ## Quick start ```bash ./scripts/create-cluster.sh --deploy-instance +# ...wait for cluster + add-on to become ACTIVE... ./scripts/delete-cluster.sh -y ``` -See `./scripts/create-cluster.sh --help` for the full list of options. A richer README covering the full cost model and troubleshooting story lands in a later commit on this branch. +See `./scripts/create-cluster.sh --help` for the full list of options. 
+
+## Script options
+
+### Simple (same as `aws-setup/`)
+
+- `--node-type TYPE` — EC2 instance type (default: `m7g.large` Graviton/ARM)
+- `--eks-version VER` — Kubernetes/EKS version (default: `1.35`)
+- `--spot` — Spot-backed managed nodes (dev/test only; see warning below)
+- `--tags TAGS` — comma-separated `key=value` pairs for AWS cost allocation
+
+### Contrib-only (logging / observability)
+
+- `--log-retention DAYS` — CloudWatch retention in days (default: `3`). Valid: `1,3,5,7,14,30,60,90,120,150,180,365,400,545,731,1827,3653`.
+- `--control-plane-log-types LIST` — comma-separated EKS control-plane log types (default: `api,authenticator`). Valid: `api,audit,authenticator,controllerManager,scheduler`. Keep this list small to control cost.
+
+### Spot Instance Warning
+
+When using `--spot`, AWS can terminate instances at any time with two minutes' notice. This **will interrupt your database** and require recovery. Only use Spot for dev/test.
+
+## Logging model
+
+All pod stdout/stderr, EKS control-plane events, and host/cluster telemetry flow into **Amazon CloudWatch Logs** via the managed Amazon CloudWatch Observability EKS add-on. No hand-rolled Fluent Bit / DaemonSet manifests are maintained in this variant — the add-on owns the collector.
+
+Log groups created for the cluster (retention set by `--log-retention`, default 3 days):
+
+| Log group | Contents |
+| --- | --- |
+| `/aws/eks/<cluster-name>/cluster` | EKS control-plane logs (types selected by `--control-plane-log-types`) |
+| `/aws/containerinsights/<cluster-name>/application` | Pod stdout/stderr (operator, DocumentDB instance, everything else) |
+| `/aws/containerinsights/<cluster-name>/dataplane` | System pods, kubelet, kube-proxy |
+| `/aws/containerinsights/<cluster-name>/host` | Node OS logs |
+| `/aws/containerinsights/<cluster-name>/performance` | Container Insights performance metrics |
+
+Example queries:
+
+```bash
+# Live tail all pod stdout/stderr
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION --since 1h --follow
+
+# Just the operator namespace
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \
+  --filter-pattern '{ $.kubernetes.namespace_name = "documentdb-operator" }' --since 1h
+
+# Just one DocumentDB instance pod
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \
+  --filter-pattern '{ $.kubernetes.pod_name = "sample-documentdb-1" }' --since 1h
+
+# EKS control-plane logs (api and authenticator by default)
+aws logs tail /aws/eks/$CLUSTER_NAME/cluster --region $REGION --since 1h
+```
+
+## Cost optimization
+
+| Area | Optimization |
+| --- | --- |
+| Compute | `m7g.large` Graviton default (~20% cheaper than equivalent x86) |
+| Compute (dev/test) | `--spot` for ~70% savings |
+| Networking | 2-AZ deployment (the minimum EKS supports) reduces cross-AZ data transfer |
+| Networking | S3 Gateway VPC endpoint is free and eliminates NAT Gateway data-transfer cost for S3 |
+| Storage | gp3 storage class (already in `aws-setup/`, inherited here) |
+| Logging | `--log-retention 3` (3-day default) and narrow `--control-plane-log-types api,authenticator` keep CloudWatch bills bounded |
+| Attribution | `--tags`/`CLUSTER_TAGS` for Cost Explorer breakdown |
+
+**Rough estimate:** a dev/test cluster with `--spot`, `--deploy-instance`, and default
logging lands in the low tens of dollars per month. Always run `delete-cluster.sh` when done.
+
+## Troubleshooting
+
+See the troubleshooting block printed by `create-cluster.sh --deploy-instance` (it includes `aws logs tail` examples and port-forward + mongosh validation steps). The main entry points are:
+
+1. `kubectl get pods -n documentdb-instance-ns` — are the pods Running?
+2. `aws logs tail /aws/containerinsights/<cluster-name>/application --region <region> --since 1h --follow` — what do the pods say?
+3. `aws eks describe-addon --cluster-name <cluster-name> --addon-name amazon-cloudwatch-observability --region <region>` — is the collector healthy?
+4. `kubectl port-forward -n documentdb-instance-ns svc/documentdb-service-sample-documentdb 10260:10260` + `mongosh` — does the endpoint work independently of the app?
+
+## Teardown
+
+`./scripts/delete-cluster.sh -y` removes:
+
+- DocumentDB instances, operator, and related Helm releases
+- The CloudWatch Observability add-on (waits for `addon-deleted`)
+- VPC endpoints (so the VPC can be destroyed)
+- All CloudWatch log groups for the cluster
+- The EKS cluster itself (all CloudFormation stacks)
+
+## Related
+
+- Base scripts: [`documentdb-playground/aws-setup/`](../../aws-setup/)
+- Simple options ship in documentdb#349: `NODE_TYPE`, `EKS_VERSION`, `CLUSTER_TAGS`, `USE_SPOT`.
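The teardown's "delete what exists, warn on the rest" handling of log groups can be sketched with a stubbed `aws` that only knows about two of the five groups. The stub and the counters are hypothetical; the group names mirror the scripts:

```shell
# Stub: pretend only the control-plane and application groups were ever created.
EXISTING="/aws/eks/demo/cluster /aws/containerinsights/demo/application"

aws() {  # stands in for: aws logs delete-log-group --log-group-name <name>
  local name="$4" g
  for g in $EXISTING; do
    [ "$g" = "$name" ] && return 0
  done
  return 1
}

deleted=0 skipped=0
for group in "/aws/eks/demo/cluster" \
             "/aws/containerinsights/demo/application" \
             "/aws/containerinsights/demo/dataplane" \
             "/aws/containerinsights/demo/host" \
             "/aws/containerinsights/demo/performance"; do
  if aws logs delete-log-group --log-group-name "$group"; then
    deleted=$((deleted + 1))
  else
    skipped=$((skipped + 1))   # real script warns and continues
  fi
done
echo "deleted=$deleted skipped=$skipped"
# prints: deleted=2 skipped=3
```

Tolerating missing groups matters because the container-insights groups are created lazily and may never have existed on a short-lived cluster.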