diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md new file mode 100644 index 00000000..e9b1e79e --- /dev/null +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/README.md @@ -0,0 +1,128 @@ +# Telemetry and Cost-Optimized EKS (contrib) + +> **Status:** community-contributed variant. Not actively maintained by the core DocumentDB team. +> If the base [`documentdb-playground/aws-setup/`](../../aws-setup/) scripts cover your needs, use those instead. + +A self-contained variant of the AWS EKS playground that layers CloudWatch-based observability and additional cost-optimization features on top of the base scripts. + +## What this variant adds over `aws-setup/` + +| Capability | Base `aws-setup/` | This contrib variant | +| --- | --- | --- | +| `--node-type`, `--eks-version`, `--spot`, `--tags` | ✅ | ✅ | +| 2-AZ deployment (cost-reduced) | ❌ (eksctl default: 3) | ✅ | +| CloudFormation stack event diagnostics | ❌ | ✅ | +| `mongosh` prerequisite warning | ❌ | ✅ | +| S3 Gateway VPC endpoint (free) | ❌ | ✅ | +| EKS control-plane logging → CloudWatch | ❌ | ✅ (`--control-plane-log-types`) | +| CloudWatch log-group retention | ❌ | ✅ (`--log-retention`) | +| Amazon CloudWatch Observability add-on (Container Insights) | ❌ | ✅ | +| CloudWatch log-group teardown | ❌ | ✅ | +| CloudWatch-aware post-deploy diagnostics | ❌ | ✅ | + +The default cluster name is `documentdb-contrib-cluster` so this variant can run alongside the base setup. + +## Prerequisites + +Same as [`aws-setup/`](../../aws-setup/) — AWS CLI, `eksctl`, `kubectl`, `helm`, `jq` — plus: + +- `mongosh` (warned, not required) for local endpoint validation. +- IAM permissions to manage EKS add-ons, CloudWatch log groups, VPC endpoints, and pod identity associations. + +## Quick start + +```bash +./scripts/create-cluster.sh --deploy-instance +# ...wait for cluster + add-on to become ACTIVE... 
+./scripts/delete-cluster.sh -y +``` + +See `./scripts/create-cluster.sh --help` for the full list of options. + +## Script options + +### Simple (same as `aws-setup/`) + +- `--node-type TYPE` — EC2 instance type (default: `m7g.large` Graviton/ARM) +- `--eks-version VER` — Kubernetes/EKS version (default: `1.35`) +- `--spot` — Spot-backed managed nodes (dev/test only; see warning below) +- `--tags TAGS` — comma-separated `key=value` pairs for AWS cost allocation + +### Contrib-only (logging / observability) + +- `--log-retention DAYS` — CloudWatch retention in days (default: `3`). Valid: `1,3,5,7,14,30,60,90,120,150,180,365,400,545,731,1827,3653`. +- `--control-plane-log-types LIST` — comma-separated EKS control-plane log types (default: `api,authenticator`). Valid: `api,audit,authenticator,controllerManager,scheduler`. Keep this list small to control cost. + +### Spot Instance Warning + +When using `--spot`, AWS can terminate instances at any time with 2 minutes notice. This **will interrupt your database** and require recovery. Only use Spot for dev/test. + +## Logging model + +All pod stdout/stderr, EKS control-plane events, and host/cluster telemetry flow into **Amazon CloudWatch Logs** via the managed Amazon CloudWatch Observability EKS add-on. No hand-rolled Fluent Bit / DaemonSet manifests are maintained in this variant — the add-on owns the collector. 
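CloudWatch accepts only a fixed set of retention values, so `aws logs put-retention-policy` fails outright for anything else (e.g. `--log-retention 10`). A minimal pre-flight check, as a sketch; the `valid_retention` helper is illustrative and not part of the scripts:

```bash
#!/usr/bin/env bash
# Illustrative helper (not in the scripts): check a --log-retention value
# against the fixed set CloudWatch accepts, before calling put-retention-policy.
valid_retention() {
  local days="$1" v
  for v in 1 3 5 7 14 30 60 90 120 150 180 365 400 545 731 1827 3653; do
    # Redirect stderr in case a non-numeric value is passed in.
    [ "$days" -eq "$v" ] 2>/dev/null && return 0
  done
  return 1
}

valid_retention 3 && echo "3 is valid"        # the default used by this variant
valid_retention 10 || echo "10 is not valid"  # would be rejected by CloudWatch
```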
+
+Log groups created for the cluster (retention set by `--log-retention`, default 3 days):
+
+| Log group | Contents |
+| --- | --- |
+| `/aws/eks/<cluster-name>/cluster` | EKS control-plane logs (types selected by `--control-plane-log-types`) |
+| `/aws/containerinsights/<cluster-name>/application` | Pod stdout/stderr (operator, DocumentDB instance, everything else) |
+| `/aws/containerinsights/<cluster-name>/dataplane` | System pods, kubelet, kube-proxy |
+| `/aws/containerinsights/<cluster-name>/host` | Node OS logs |
+| `/aws/containerinsights/<cluster-name>/performance` | Container Insights performance metrics |
+
+Example queries:
+
+```bash
+# Live tail all pod stdout/stderr
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION --since 1h --follow
+
+# Just the operator namespace
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \
+  --filter-pattern '{ $.kubernetes.namespace_name = "documentdb-operator" }' --since 1h
+
+# Just one DocumentDB instance pod
+aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \
+  --filter-pattern '{ $.kubernetes.pod_name = "sample-documentdb-1" }' --since 1h
+
+# EKS control-plane logs (api and authenticator by default; audit only if enabled
+# via --control-plane-log-types)
+aws logs tail /aws/eks/$CLUSTER_NAME/cluster --region $REGION --since 1h
+```
+
+## Cost optimization
+
+| Area | Optimization |
+| --- | --- |
+| Compute | `m7g.large` Graviton default (~20% cheaper than equivalent x86) |
+| Compute (dev/test) | `--spot` for ~70% savings |
+| Networking | 2-AZ deployment (the minimum EKS supports) reduces cross-AZ data transfer |
+| Networking | S3 Gateway VPC endpoint is free and eliminates NAT Gateway data-transfer cost for S3 |
+| Storage | gp3 storage class (already in `aws-setup/`, inherited here) |
+| Logging | `--log-retention 3` (3-day default) and the narrow `--control-plane-log-types api,authenticator` default keep CloudWatch bills bounded |
+| Attribution | `--tags`/`CLUSTER_TAGS` for Cost Explorer breakdown |
+
+**Rough estimate:** dev/test cluster with `--spot`, `--deploy-instance`, and default
logging lands in the low tens of dollars per month. Always run `delete-cluster.sh` when done.
+
+## Troubleshooting
+
+See the troubleshooting block printed by `create-cluster.sh --deploy-instance` (it includes `aws logs tail` examples and port-forward + mongosh validation steps). The main entry points are:
+
+1. `kubectl get pods -n documentdb-instance-ns` — are the pods Running?
+2. `aws logs tail /aws/containerinsights/<cluster-name>/application --region <region> --since 1h --follow` — what do the pods say?
+3. `aws eks describe-addon --cluster-name <cluster-name> --addon-name amazon-cloudwatch-observability --region <region>` — is the collector healthy?
+4. `kubectl port-forward -n documentdb-instance-ns svc/documentdb-service-sample-documentdb 10260:10260` + `mongosh` — does the endpoint work independently of the app?
+
+## Teardown
+
+`./scripts/delete-cluster.sh -y` removes:
+
+- DocumentDB instances, operator, and related Helm releases
+- The CloudWatch Observability add-on (waits for `addon-deleted`)
+- VPC endpoints (so the VPC can be destroyed)
+- All CloudWatch log groups for the cluster
+- The EKS cluster itself (all CloudFormation stacks)
+
+## Related
+
+- Base scripts: [`documentdb-playground/aws-setup/`](../../aws-setup/)
+- Simple options ship in documentdb#349: `NODE_TYPE`, `EKS_VERSION`, `CLUSTER_TAGS`, `USE_SPOT`.
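## Log group naming

All five log groups in this variant follow a fixed naming scheme, which makes it easy to script extra checks (for example, verifying that teardown really removed them). A small sketch; the `log_groups_for_cluster` helper is illustrative, not part of the scripts:

```bash
#!/usr/bin/env bash
# Illustrative helper (not in the scripts): print the CloudWatch log-group
# names this variant creates for a given cluster name.
log_groups_for_cluster() {
  local cluster="$1" suffix
  echo "/aws/eks/${cluster}/cluster"                       # control-plane logs
  for suffix in application dataplane host performance; do # Container Insights groups
    echo "/aws/containerinsights/${cluster}/${suffix}"
  done
}

log_groups_for_cluster documentdb-contrib-cluster
```

After `./scripts/delete-cluster.sh -y`, checking each printed name with `aws logs describe-log-groups --log-group-name-prefix <group>` should return no groups.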
diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh new file mode 100755 index 00000000..c82eb4f1 --- /dev/null +++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh @@ -0,0 +1,991 @@ +#!/bin/bash + +# DocumentDB EKS Cluster Creation Script +# This script creates a complete EKS cluster with all dependencies for DocumentDB + +set -e # Exit on any error + +# Configuration +CLUSTER_NAME="${CLUSTER_NAME:-documentdb-contrib-cluster}" +REGION="us-west-2" +K8S_VERSION="${K8S_VERSION:-1.35}" +NODE_TYPE="${NODE_TYPE:-m7g.large}" +NODES=3 +NODES_MIN=1 +NODES_MAX=4 + +# Cost-optimization configuration +# USE_SPOT: when "true", eksctl provisions Spot-backed managed nodes (dev/test only). +# CLUSTER_TAGS: comma-separated key=value pairs passed to AWS for cost allocation in Cost Explorer. +USE_SPOT="${USE_SPOT:-false}" +CLUSTER_TAGS="${CLUSTER_TAGS:-project=documentdb-playground,environment=dev,managed-by=eksctl}" + +# DocumentDB Operator Configuration +# For production: use documentdb/documentdb-operator (official) +OPERATOR_GITHUB_ORG="documentdb" +OPERATOR_CHART_VERSION="0.1.0" + +# Feature flags - set to "true" to enable, "false" to skip +INSTALL_OPERATOR="${INSTALL_OPERATOR:-false}" +DEPLOY_INSTANCE="${DEPLOY_INSTANCE:-false}" + +# Logging and observability configuration +# LOG_RETENTION_DAYS: CloudWatch log group retention (allowed values: 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, ...) +# CONTROL_PLANE_LOG_TYPES: comma-separated EKS control plane log types to enable +# (valid: api, audit, authenticator, controllerManager, scheduler). Keep small to control cost. 
+LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-3}" +CONTROL_PLANE_LOG_TYPES="${CONTROL_PLANE_LOG_TYPES:-api,authenticator}" + +# Parse command line arguments +while [[ $# -gt 0 ]]; do + case $1 in + --skip-operator) + INSTALL_OPERATOR="false" + shift + ;; + --skip-instance) + DEPLOY_INSTANCE="false" + shift + ;; + --install-operator) + INSTALL_OPERATOR="true" + shift + ;; + --deploy-instance) + DEPLOY_INSTANCE="true" + INSTALL_OPERATOR="true" # Auto-enable operator when instance is requested + shift + ;; + --cluster-name) + CLUSTER_NAME="$2" + shift 2 + ;; + --region) + REGION="$2" + shift 2 + ;; + --github-username) + GITHUB_USERNAME="$2" + shift 2 + ;; + --github-token) + GITHUB_TOKEN="$2" + shift 2 + ;; + --node-type) + NODE_TYPE="$2" + shift 2 + ;; + --eks-version|--k8s-version) + K8S_VERSION="$2" + shift 2 + ;; + --spot) + USE_SPOT="true" + shift + ;; + --tags) + CLUSTER_TAGS="$2" + shift 2 + ;; + --log-retention) + LOG_RETENTION_DAYS="$2" + shift 2 + ;; + --control-plane-log-types) + CONTROL_PLANE_LOG_TYPES="$2" + shift 2 + ;; + -h|--help) + echo "Usage: $0 [OPTIONS]" + echo "" + echo "Options:" + echo " --skip-operator Skip DocumentDB operator installation (default)" + echo " --skip-instance Skip DocumentDB instance deployment (default)" + echo " --install-operator Install DocumentDB operator" + echo " --deploy-instance Deploy DocumentDB instance" + echo " --cluster-name NAME EKS cluster name (default: documentdb-contrib-cluster)" + echo " --region REGION AWS region (default: us-west-2)" + echo " --github-username GitHub username for operator installation" + echo " --github-token GitHub token for operator installation" + echo "" + echo "Cost-optimization options:" + echo " --node-type TYPE EC2 instance type (default: m7g.large, Graviton/ARM)" + echo " --eks-version VER Kubernetes/EKS version (default: 1.35)" + echo " --spot Use Spot-backed managed nodes (DEV/TEST ONLY - can be terminated)" + echo " --tags TAGS Cost allocation tags as key=value pairs 
(comma-separated)" + echo " (default: project=documentdb-playground,environment=dev,managed-by=eksctl)" + echo "" + echo "Logging / observability options:" + echo " --log-retention DAYS CloudWatch retention in days (default: 3)" + echo " Valid: 1,3,5,7,14,30,60,90,120,150,180,365,400,545,731,1827,3653" + echo " --control-plane-log-types LST Comma-separated EKS control-plane log types (default: api,authenticator)" + echo " Valid: api,audit,authenticator,controllerManager,scheduler" + echo "" + echo " -h, --help Show this help message" + echo "" + echo "Examples:" + echo " $0 # Create basic cluster only (no operator, no instance)" + echo " $0 --install-operator # Create cluster with operator, no instance" + echo " $0 --deploy-instance # Create cluster with instance (auto-enables operator)" + echo " $0 --github-username user --github-token ghp_xxx --install-operator # With GitHub auth" + echo " $0 --node-type m5.large # Use x86 instance type instead of Graviton" + echo " $0 --spot --tags \"project=myproj,team=platform\" # Spot dev cluster with custom tags" + exit 0 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +# Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging function +log() { + echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1" +} + +success() { + echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}" +} + +warn() { + echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}" +} + +error() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" + exit 1 +} + +error_no_exit() { + echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}" +} + +# Check prerequisites +check_prerequisites() { + log "Checking prerequisites..." + + # Check AWS CLI + if ! command -v aws &> /dev/null; then + error "AWS CLI not found. Please install AWS CLI first." + fi + + # Check eksctl + if ! command -v eksctl &> /dev/null; then + error "eksctl not found. 
Please install eksctl first." + fi + + # Check kubectl + if ! command -v kubectl &> /dev/null; then + error "kubectl not found. Please install kubectl first." + fi + + # Check Helm + if ! command -v helm &> /dev/null; then + error "Helm not found. Please install Helm first." + fi + + # Check AWS credentials + if ! aws sts get-caller-identity &> /dev/null; then + error "AWS credentials not configured. Please run 'aws configure' first." + fi + + # Optional but recommended: mongosh for endpoint validation. Mirrors the DocumentDB + # troubleshooting best practice of verifying local client tooling before blaming the server. + if ! command -v mongosh &> /dev/null; then + warn "mongosh not found. Install with: brew install mongosh (macOS) or see https://www.mongodb.com/docs/mongodb-shell/install/" + warn "Local connection validation (kubectl port-forward + mongosh) won't work until mongosh is installed." + fi + + success "All prerequisites met" +} + +# Show CloudFormation stack events for eksctl-managed stacks +show_cloudformation_events() { + local status_filter="$1" # "CREATE_FAILED" or "CREATE_COMPLETE" + local stacks + stacks=$(aws cloudformation list-stacks --region "$REGION" \ + --stack-status-filter CREATE_COMPLETE CREATE_FAILED ROLLBACK_COMPLETE ROLLBACK_IN_PROGRESS \ + --query "StackSummaries[?starts_with(StackName, 'eksctl-${CLUSTER_NAME}')].StackName" \ + --output text 2>/dev/null) + + if [ -z "$stacks" ]; then + warn "No CloudFormation stacks found for cluster $CLUSTER_NAME" + return + fi + + for stack in $stacks; do + log "CloudFormation stack: $stack" + if [ "$status_filter" == "CREATE_FAILED" ]; then + local failures + failures=$(aws cloudformation describe-stack-events --region "$REGION" \ + --stack-name "$stack" \ + --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[Timestamp,LogicalResourceId,ResourceStatusReason]" \ + --output table 2>/dev/null) + if [ -n "$failures" ] && ! 
echo "$failures" | grep -q "^$"; then + error_no_exit "Failed resources in $stack:" + echo "$failures" + else + success "No failures in $stack" + fi + else + local stack_status + stack_status=$(aws cloudformation describe-stacks --region "$REGION" \ + --stack-name "$stack" \ + --query "Stacks[0].StackStatus" --output text 2>/dev/null) + log " Status: $stack_status" + fi + done +} + +# Create EKS cluster +create_cluster() { + log "Creating EKS cluster: $CLUSTER_NAME in region: $REGION" + + # Check if cluster already exists + if eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster $CLUSTER_NAME already exists. Skipping cluster creation." + return 0 + fi + + if [ "$USE_SPOT" == "true" ]; then + warn "============================================================" + warn "SPOT INSTANCES ENABLED - FOR DEV/TEST USE ONLY" + warn "AWS can terminate Spot instances at any time with 2 minutes" + warn "notice. This WILL interrupt your database and require recovery." + warn "Do NOT use Spot for production or long-running workloads." + warn "============================================================" + fi + + # 2 AZs is the minimum EKS supports, reduced from eksctl default of 3 for cost reasons. + local EKSCTL_ARGS=( + --name "$CLUSTER_NAME" + --region "$REGION" + --version "$K8S_VERSION" + --nodes "$NODES" + --nodes-min "$NODES_MIN" + --nodes-max "$NODES_MAX" + --managed + --with-oidc + --tags "$CLUSTER_TAGS" + ) + + if [ "$USE_SPOT" == "true" ]; then + # Multiple instance types improve Spot availability; all Graviton to match the default. + EKSCTL_ARGS+=(--spot --instance-types "m7g.large,m6g.large,r7g.large,r6g.large,c7g.large,c6g.large") + else + EKSCTL_ARGS+=(--node-type "$NODE_TYPE") + fi + + eksctl create cluster "${EKSCTL_ARGS[@]}" --zones "${REGION}a,${REGION}b" + local exit_code=$? + + log "Retrieving CloudFormation stack events..." 
+ if [ $exit_code -eq 0 ]; then + show_cloudformation_events "CREATE_COMPLETE" + success "EKS cluster created successfully" + else + show_cloudformation_events "CREATE_FAILED" + error "Failed to create EKS cluster" + fi +} + +# Enable EKS control plane logging to CloudWatch +# https://docs.aws.amazon.com/prescriptive-guidance/latest/amazon-eks-observability-best-practices/logging-best-practices.html +enable_control_plane_logging() { + log "Enabling EKS control plane logging: $CONTROL_PLANE_LOG_TYPES" + eksctl utils update-cluster-logging \ + --region "$REGION" \ + --cluster "$CLUSTER_NAME" \ + --enable-types "$CONTROL_PLANE_LOG_TYPES" \ + --approve \ + && success "Control plane logging enabled" \ + || warn "Failed to enable control plane logging (continuing)" +} + +# Install Amazon CloudWatch Observability EKS add-on (managed collector for Container Insights) +install_cloudwatch_observability_addon() { + log "Installing Amazon CloudWatch Observability EKS add-on..." + + # Pod identity agent is required to create pod identity associations. + local POD_IDENTITY_STATUS + POD_IDENTITY_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "eks-pod-identity-agent" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || true) + if [ -z "$POD_IDENTITY_STATUS" ] || [ "$POD_IDENTITY_STATUS" = "None" ]; then + log "Installing eks-pod-identity-agent add-on (required for CloudWatch agent IAM)..." + eksctl create addon \ + --cluster "$CLUSTER_NAME" \ + --region "$REGION" \ + --name eks-pod-identity-agent >/dev/null 2>&1 \ + || warn "Failed to install eks-pod-identity-agent add-on (continuing)" + fi + + # Grant CloudWatch agent permission through pod identity (recommended least-privilege path). 
+ local CW_ASSOC_COUNT + CW_ASSOC_COUNT=$(aws eks list-pod-identity-associations \ + --cluster-name "$CLUSTER_NAME" \ + --region "$REGION" \ + --query "length(associations[?namespace=='amazon-cloudwatch' && serviceAccount=='cloudwatch-agent'])" \ + --output text 2>/dev/null || echo "0") + if [ "$CW_ASSOC_COUNT" = "0" ]; then + log "Creating pod identity association for amazon-cloudwatch/cloudwatch-agent..." + eksctl create podidentityassociation \ + --cluster "$CLUSTER_NAME" \ + --region "$REGION" \ + --namespace amazon-cloudwatch \ + --service-account-name cloudwatch-agent \ + --create-service-account \ + --permission-policy-arns arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy >/dev/null 2>&1 \ + || warn "Failed to create pod identity association for cloudwatch-agent" + fi + + local ADDON_NAME="amazon-cloudwatch-observability" + local ADDON_STATUS + ADDON_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || true) + + if [ -n "$ADDON_STATUS" ] && [ "$ADDON_STATUS" != "None" ]; then + log "CloudWatch Observability add-on already exists (status=$ADDON_STATUS); updating to latest compatible version" + aws eks update-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --resolve-conflicts OVERWRITE >/dev/null 2>&1 \ + || warn "Failed to update add-on (continuing with existing installation)" + else + aws eks create-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --resolve-conflicts OVERWRITE >/dev/null 2>&1 \ + || warn "Failed to create add-on (it may already exist or require IAM setup)" + fi + + log "Waiting for CloudWatch Observability add-on to become ACTIVE..." 
+ if aws eks wait addon-active \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" 2>/dev/null; then + success "CloudWatch Observability add-on is ACTIVE" + else + warn "Add-on did not become ACTIVE in time; checking current status" + fi + + local FINAL_STATUS + FINAL_STATUS=$(aws eks describe-addon \ + --cluster-name "$CLUSTER_NAME" \ + --addon-name "$ADDON_NAME" \ + --region "$REGION" \ + --query 'addon.status' \ + --output text 2>/dev/null || echo "UNKNOWN") + log "Add-on status: $FINAL_STATUS" + + # The add-on manages collector deployment details internally. + # Namespace/components can vary by add-on/EKS version, so this is best-effort visibility. + if kubectl get ns amazon-cloudwatch >/dev/null 2>&1; then + kubectl get pods -n amazon-cloudwatch || true + else + warn "amazon-cloudwatch namespace not found yet (collector may still be reconciling)" + fi +} + +# Set CloudWatch log group retention for cost control. +# Log groups may not exist yet (collector creates them lazily on first log); +# we retry for up to ~3 minutes per group. +set_log_retention() { + log "Setting CloudWatch log retention to $LOG_RETENTION_DAYS days..." + local groups=( + "/aws/eks/${CLUSTER_NAME}/cluster" + "/aws/containerinsights/${CLUSTER_NAME}/application" + "/aws/containerinsights/${CLUSTER_NAME}/dataplane" + "/aws/containerinsights/${CLUSTER_NAME}/host" + "/aws/containerinsights/${CLUSTER_NAME}/performance" + ) + for group in "${groups[@]}"; do + local set=false + for i in {1..6}; do + # Add-on collectors can create groups lazily; proactively create if missing. 
+ aws logs create-log-group --log-group-name "$group" --region "$REGION" >/dev/null 2>&1 || true + if aws logs put-retention-policy \ + --log-group-name "$group" \ + --retention-in-days "$LOG_RETENTION_DAYS" \ + --region "$REGION" 2>/dev/null; then + success "Retention set on $group: ${LOG_RETENTION_DAYS}d" + set=true + break + fi + sleep 30 + done + if [ "$set" = false ]; then + warn "Log group $group not created yet; will need manual retention: aws logs put-retention-policy --log-group-name $group --retention-in-days $LOG_RETENTION_DAYS --region $REGION" + fi + done +} + +# Create VPC endpoints for cost optimization (S3 Gateway endpoint is free) +create_vpc_endpoints() { + log "Creating VPC endpoints for cost optimization..." + + VPC_ID=$(aws eks describe-cluster --name "$CLUSTER_NAME" --region "$REGION" \ + --query 'cluster.resourcesVpcConfig.vpcId' --output text) + + if [ -z "$VPC_ID" ] || [ "$VPC_ID" = "None" ]; then + warn "Could not determine VPC ID. Skipping VPC endpoint creation." + return 0 + fi + + ROUTE_TABLE_IDS=$(aws ec2 describe-route-tables --region "$REGION" \ + --filters "Name=vpc-id,Values=$VPC_ID" \ + --query 'RouteTables[].RouteTableId' --output text) + + # S3 Gateway Endpoint (free - reduces NAT Gateway data transfer costs) + if aws ec2 describe-vpc-endpoints --region "$REGION" \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" \ + --query 'VpcEndpoints[0].VpcEndpointId' --output text 2>/dev/null | grep -q "vpce-"; then + warn "S3 VPC endpoint already exists. Skipping creation." + else + aws ec2 create-vpc-endpoint \ + --vpc-id "$VPC_ID" \ + --service-name "com.amazonaws.$REGION.s3" \ + --route-table-ids $ROUTE_TABLE_IDS \ + --region "$REGION" 2>/dev/null && success "S3 Gateway VPC endpoint created (free)" || warn "Could not create S3 VPC endpoint" + fi +} + +# Install EBS CSI Driver +install_ebs_csi() { + log "Installing EBS CSI Driver..." 
+ + # Create EBS CSI service account with IAM role + eksctl create iamserviceaccount \ + --cluster $CLUSTER_NAME \ + --namespace kube-system \ + --name ebs-csi-controller-sa \ + --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \ + --override-existing-serviceaccounts \ + --approve \ + --region $REGION + + # Install EBS CSI driver addon + eksctl create addon \ + --name aws-ebs-csi-driver \ + --cluster $CLUSTER_NAME \ + --region $REGION \ + --force + + # Wait for EBS CSI driver to be ready + log "Waiting for EBS CSI driver to be ready..." + sleep 30 + kubectl wait --for=condition=ready pod -l app=ebs-csi-controller -n kube-system --timeout=300s || warn "EBS CSI driver pods may still be starting" + + success "EBS CSI Driver installed" +} + +# Install AWS Load Balancer Controller +install_load_balancer_controller() { + log "Installing AWS Load Balancer Controller..." + + # Check if already installed + if helm list -n kube-system | grep -q aws-load-balancer-controller; then + warn "AWS Load Balancer Controller already installed. Skipping installation." + return 0 + fi + + # Get VPC ID for the cluster + VPC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text) + log "Using VPC ID: $VPC_ID" + + # Verify subnet tags for Load Balancer Controller + log "Verifying subnet tags for Load Balancer Controller..." + PUBLIC_SUBNETS=$(aws ec2 describe-subnets \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=map-public-ip-on-launch,Values=true" \ + --query 'Subnets[].SubnetId' --output text --region $REGION) + + PRIVATE_SUBNETS=$(aws ec2 describe-subnets \ + --filters "Name=vpc-id,Values=$VPC_ID" "Name=map-public-ip-on-launch,Values=false" \ + --query 'Subnets[].SubnetId' --output text --region $REGION) + + # Tag public subnets for internet-facing load balancers + if [ -n "$PUBLIC_SUBNETS" ]; then + log "Tagging public subnets for internet-facing load balancers..." 
+ for subnet in $PUBLIC_SUBNETS; do + aws ec2 create-tags --resources "$subnet" --tags Key=kubernetes.io/role/elb,Value=1 --region $REGION 2>/dev/null || true + log "Tagged public subnet: $subnet" + done + fi + + # Tag private subnets for internal load balancers + if [ -n "$PRIVATE_SUBNETS" ]; then + log "Tagging private subnets for internal load balancers..." + for subnet in $PRIVATE_SUBNETS; do + aws ec2 create-tags --resources "$subnet" --tags Key=kubernetes.io/role/internal-elb,Value=1 --region $REGION 2>/dev/null || true + log "Tagged private subnet: $subnet" + done + fi + + # Download the official IAM policy (latest version) + log "Downloading AWS Load Balancer Controller IAM policy (latest version)..." + curl -o /tmp/iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json + + # Get account ID + ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text) + + # Check if policy exists and create/update as needed + if aws iam get-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy &>/dev/null; then + log "IAM policy already exists, updating to latest version..." + # Delete and recreate to ensure we have the latest version + aws iam delete-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy 2>/dev/null || true + sleep 5 # Wait for deletion to propagate + fi + + # Create IAM policy with latest permissions + log "Creating IAM policy with latest permissions..." + aws iam create-policy \ + --policy-name AWSLoadBalancerControllerIAMPolicy \ + --policy-document file:///tmp/iam_policy.json 2>/dev/null || \ + log "IAM policy already exists or was just created" + + # Wait a moment for policy to be available + sleep 5 + + # Create IAM service account with proper permissions using eksctl + log "Creating IAM service account with proper permissions..." 
+ eksctl create iamserviceaccount \ + --cluster=$CLUSTER_NAME \ + --namespace=kube-system \ + --name=aws-load-balancer-controller \ + --role-name "AmazonEKSLoadBalancerControllerRole-$CLUSTER_NAME" \ + --attach-policy-arn=arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \ + --approve \ + --override-existing-serviceaccounts \ + --region=$REGION + + # Add EKS Helm repository + helm repo add eks https://aws.github.io/eks-charts + helm repo update eks + + # Install Load Balancer Controller using the existing service account + helm install aws-load-balancer-controller eks/aws-load-balancer-controller \ + -n kube-system \ + --set clusterName=$CLUSTER_NAME \ + --set serviceAccount.create=false \ + --set serviceAccount.name=aws-load-balancer-controller \ + --set region=$REGION \ + --set vpcId=$VPC_ID + + # Wait for Load Balancer Controller to be ready + log "Waiting for Load Balancer Controller to be ready..." + sleep 30 + kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=aws-load-balancer-controller -n kube-system --timeout=300s || warn "Load Balancer Controller pods may still be starting" + + # Clean up temp file + rm -f /tmp/iam_policy.json + + success "AWS Load Balancer Controller installed" +} + +# Install cert-manager +install_cert_manager() { + log "Installing cert-manager..." + + # Check if already installed + if helm list -n cert-manager | grep -q cert-manager; then + warn "cert-manager already installed. Skipping installation." + return 0 + fi + + # Add Jetstack Helm repository + helm repo add jetstack https://charts.jetstack.io + helm repo update + + # Install cert-manager + helm install cert-manager jetstack/cert-manager \ + --namespace cert-manager \ + --create-namespace \ + --version v1.13.2 \ + --set installCRDs=true \ + --set prometheus.enabled=false \ + --set webhook.timeoutSeconds=30 + + # Wait for cert-manager to be ready + log "Waiting for cert-manager to be ready..." 
+ sleep 30 + kubectl wait --for=condition=ready pod -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=300s || warn "cert-manager pods may still be starting" + + success "cert-manager installed" +} + +# Create optimized storage class +create_storage_class() { + log "Creating DocumentDB storage class..." + + # Check if storage class already exists + if kubectl get storageclass documentdb-storage &> /dev/null; then + warn "DocumentDB storage class already exists. Skipping creation." + return 0 + fi + + kubectl apply -f - </dev/null || true + helm repo update documentdb + + log "Installing DocumentDB operator from public Helm repository..." + if helm install documentdb-operator documentdb/documentdb-operator \ + --namespace documentdb-operator \ + --create-namespace \ + --wait \ + --timeout 10m 2>/dev/null; then + success "DocumentDB operator installed successfully from public Helm repository" + else + # Fallback to OCI registry with GitHub authentication + warn "Public Helm repository installation failed. Falling back to OCI registry..." + + # Check for GitHub authentication + if [ -z "$GITHUB_TOKEN" ] || [ -z "$GITHUB_USERNAME" ]; then + error "DocumentDB operator installation requires GitHub authentication as fallback. + +Please set the following environment variables: + export GITHUB_USERNAME='your-github-username' + export GITHUB_TOKEN='your-github-token' + +To create a GitHub token: +1. Go to https://github.com/settings/tokens +2. Generate a new token with 'read:packages' scope +3. Export the token as shown above + +Then run the script again with --install-operator" + fi + + # Authenticate with GitHub Container Registry + log "Authenticating with GitHub Container Registry..." + if ! echo "$GITHUB_TOKEN" | helm registry login ghcr.io --username "$GITHUB_USERNAME" --password-stdin; then + error "Failed to authenticate with GitHub Container Registry. Please verify your GITHUB_TOKEN and GITHUB_USERNAME." 
+        fi
+
+        # Install DocumentDB operator from OCI registry
+        log "Pulling and installing DocumentDB operator from ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator..."
+        # Run helm inside the if condition so 'set -e' cannot abort before the hint below
+        if helm install documentdb-operator \
+            oci://ghcr.io/${OPERATOR_GITHUB_ORG}/documentdb-operator \
+            --version ${OPERATOR_CHART_VERSION} \
+            --namespace documentdb-operator \
+            --create-namespace \
+            --wait \
+            --timeout 10m; then
+            success "DocumentDB operator installed successfully from OCI registry: ${OPERATOR_GITHUB_ORG}/documentdb-operator:${OPERATOR_CHART_VERSION}"
+        else
+            error "Failed to install DocumentDB operator. Please verify:
+- Your GitHub token has 'read:packages' scope
+- You have access to ${OPERATOR_GITHUB_ORG}/documentdb-operator repository
+- The chart version ${OPERATOR_CHART_VERSION} exists"
+        fi
+    fi
+
+    # Wait for operator to be ready
+    log "Waiting for DocumentDB operator to be ready..."
+    kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=documentdb-operator -n documentdb-operator --timeout=300s || warn "DocumentDB operator pods may still be starting"
+
+    success "DocumentDB operator installed"
+}
+
+# Deploy DocumentDB instance (optional)
+deploy_documentdb_instance() {
+    if [ "$DEPLOY_INSTANCE" != "true" ]; then
+        warn "Skipping DocumentDB instance deployment (--skip-instance specified or not enabled)"
+        return 0
+    fi
+
+    log "Deploying DocumentDB instance..."
+
+    # Check if operator is installed
+    if ! kubectl get deployment -n documentdb-operator documentdb-operator &> /dev/null; then
+        error "DocumentDB operator not found. Cannot deploy instance without operator."
+    fi
+
+    # Create DocumentDB namespace
+    kubectl apply -f - </dev/null || warn "Operator logs unavailable"
+
+    # 3. Recent instance pod logs
+    log "Recent DocumentDB instance logs (tail 20):"
+    kubectl logs -n documentdb-instance-ns sample-documentdb-1 --tail=20 2>/dev/null || warn "Instance pod not found or not ready"
+
+    # 4. Verify service endpoint
+    log "Checking DocumentDB service..."
+    kubectl get svc -n documentdb-instance-ns documentdb-service-sample-documentdb 2>/dev/null || warn "Service not found"
+
+    # 5. Ping test via in-cluster mongosh (validate endpoint independently of app code)
+    log "Testing DocumentDB connectivity via in-cluster mongosh..."
+    if kubectl run mongosh-diag-$RANDOM --rm -i --restart=Never --quiet \
+        --image=mongo:7 -n documentdb-instance-ns -- \
+        mongosh "mongodb://docdbadmin:SecurePassword123!@documentdb-service-sample-documentdb:10260/?directConnection=true&authMechanism=SCRAM-SHA-256&tls=true&tlsAllowInvalidCertificates=true" \
+        --quiet --eval "db.runCommand({ping: 1})" 2>/dev/null | grep -q "ok.*1"; then
+        success "DocumentDB ping succeeded"
+    else
+        warn "DocumentDB ping failed -- see troubleshooting guide in summary"
+    fi
+}
+
+# Print summary
+print_summary() {
+    echo ""
+    echo "=================================================="
+    echo "🎉 CLUSTER SETUP COMPLETE!"
+    echo "=================================================="
+    echo "Cluster Name: $CLUSTER_NAME"
+    echo "Region: $REGION"
+    echo "Kubernetes: $K8S_VERSION"
+    echo "Node Type: $NODE_TYPE"
+    echo "Spot Instances: $USE_SPOT"
+    echo "Tags: $CLUSTER_TAGS"
+    echo "Operator Installed: $INSTALL_OPERATOR"
+    echo "Instance Deployed: $DEPLOY_INSTANCE"
+    echo "Log Retention: $LOG_RETENTION_DAYS days"
+    echo "Control Plane Log Types: $CONTROL_PLANE_LOG_TYPES"
+    echo ""
+    echo "✅ Components installed:"
+    echo "  - EKS cluster with managed nodes ($NODE_TYPE)"
+    echo "  - S3 Gateway VPC endpoint (cost optimization)"
+    echo "  - EBS CSI driver"
+    echo "  - AWS Load Balancer Controller"
+    echo "  - cert-manager"
+    echo "  - DocumentDB storage class"
+    echo "  - EKS control plane logging ($CONTROL_PLANE_LOG_TYPES) -> CloudWatch"
+    echo "  - Amazon CloudWatch Observability add-on -> CloudWatch"
+    echo "  - CloudWatch log retention: $LOG_RETENTION_DAYS days"
+    if [ "$INSTALL_OPERATOR" == "true" ]; then
+        echo "  - DocumentDB operator"
+    fi
+    if [ "$DEPLOY_INSTANCE" == "true" ]; then
+        echo "  - DocumentDB instance (sample-documentdb)"
+    fi
+    echo ""
+    echo "📊 CloudWatch Log Groups (retention: ${LOG_RETENTION_DAYS}d):"
+    echo "  - /aws/eks/$CLUSTER_NAME/cluster (control plane)"
+    echo "  - /aws/containerinsights/$CLUSTER_NAME/application (pod stdout/stderr)"
+    echo "  - /aws/containerinsights/$CLUSTER_NAME/dataplane (system pods, kubelet)"
+    echo "  - /aws/containerinsights/$CLUSTER_NAME/host (node OS logs)"
+    echo "  - /aws/containerinsights/$CLUSTER_NAME/performance (container insights performance)"
+    echo ""
+    echo "💡 Next steps:"
+    echo "  - Verify cluster: kubectl get nodes"
+    echo "  - Check all pods: kubectl get pods --all-namespaces"
+    echo "  - Verify add-on: aws eks describe-addon --cluster-name $CLUSTER_NAME --region $REGION --addon-name amazon-cloudwatch-observability"
+    if [ "$INSTALL_OPERATOR" == "true" ]; then
+        echo "  - Check operator: kubectl get pods -n documentdb-operator"
+    fi
+    if [ "$DEPLOY_INSTANCE" == "true" ]; then
+        echo "  - Check DocumentDB: kubectl get documentdb -n documentdb-instance-ns"
+        echo "  - Check service status: kubectl get svc -n documentdb-instance-ns"
+        echo "  - Wait for LoadBalancer IP: kubectl get svc documentdb-service-sample-documentdb -n documentdb-instance-ns -w"
+        echo "  - Once IP is assigned, connect: mongodb://docdbadmin:SecurePassword123!@<EXTERNAL-IP>:10260/"
+        echo ""
+        echo "🔎 Troubleshooting (adapted from DocumentDB troubleshooting best practices):"
+        echo ""
+        echo " 1. Verify instance is running (k8s equivalent of 'docker ps' / 'systemctl status'):"
+        echo "    kubectl get pods -n documentdb-instance-ns"
+        echo ""
+        echo " 2. Check logs (PRIMARY path via CloudWatch; observability add-on ships pod logs here):"
+        echo "    aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION --since 1h --follow"
+        echo "    aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \\"
+        echo "      --filter-pattern '{ \$.kubernetes.namespace_name = \"documentdb-operator\" }' --since 1h"
+        echo "    aws logs tail /aws/containerinsights/$CLUSTER_NAME/application --region $REGION \\"
+        echo "      --filter-pattern '{ \$.kubernetes.pod_name = \"sample-documentdb-1\" }' --since 1h"
+        echo "    aws logs tail /aws/eks/$CLUSTER_NAME/cluster --region $REGION --since 1h  # control plane"
+        echo ""
+        echo " 3. Check logs (FALLBACK via kubectl -- real-time streaming or if CloudWatch broken):"
+        echo "    kubectl logs -n documentdb-operator -l app.kubernetes.io/name=documentdb-operator -f"
+        echo "    kubectl logs -n documentdb-instance-ns sample-documentdb-1 -f"
+        echo ""
+        echo " 4. Verify CloudWatch Observability add-on health (if CloudWatch logs are missing):"
+        echo "    aws eks describe-addon --cluster-name $CLUSTER_NAME --region $REGION --addon-name amazon-cloudwatch-observability"
+        echo "    kubectl get pods -n amazon-cloudwatch"
+        echo ""
+        echo " 5. Verify local client tooling works (k8s equivalent of 'python -c import pymongo'):"
+        echo "    which mongosh && mongosh --version"
+        echo "    which kubectl && kubectl version --client"
+        echo ""
+        echo " 6. Validate the endpoint independently of your application code:"
+        echo "    kubectl port-forward -n documentdb-instance-ns svc/documentdb-service-sample-documentdb 10260:10260 &"
+        echo "    mongosh \"mongodb://docdbadmin:SecurePassword123!@localhost:10260/?directConnection=true&authMechanism=SCRAM-SHA-256&tls=true&tlsAllowInvalidCertificates=true\""
+        echo ""
+        echo " 7. TLS / certificate errors:"
+        echo "    - For self-signed certs (default): keep tlsAllowInvalidCertificates=true in the connection string"
+        echo "    - For trusted certs: export CA and pass tlsCAFile=/path/to/ca.pem instead"
+    fi
+    echo ""
+    echo "⚠️ IMPORTANT: Run './delete-cluster.sh' when done to avoid AWS charges!"
+    echo "=================================================="
+}
+
+# Main execution
+main() {
+    log "Starting DocumentDB EKS cluster setup..."
+    log "Configuration:"
+    log "  Cluster: $CLUSTER_NAME"
+    log "  Region: $REGION"
+    log "  Kubernetes: $K8S_VERSION"
+    log "  Node Type: $NODE_TYPE"
+    log "  Spot Instances: $USE_SPOT"
+    log "  Tags: $CLUSTER_TAGS"
+    log "  Install Operator: $INSTALL_OPERATOR"
+    log "  Deploy Instance: $DEPLOY_INSTANCE"
+    log "  Log Retention (days): $LOG_RETENTION_DAYS"
+    log "  Control Plane Log Types: $CONTROL_PLANE_LOG_TYPES"
+    echo ""
+
+    # Execute setup steps
+    check_prerequisites
+    create_cluster
+    enable_control_plane_logging
+    create_vpc_endpoints
+    install_ebs_csi
+    install_load_balancer_controller
+    install_cert_manager
+    create_storage_class
+
+    # Logging/observability pipeline
+    install_cloudwatch_observability_addon
+    set_log_retention
+
+    # Optional components
+    install_documentdb_operator
+    deploy_documentdb_instance
+
+    # Post-deploy diagnostics (no-op if --skip-instance)
+    diagnose_documentdb
+
+    # Show summary
+    print_summary
+}
+
+# Run main function
+main "$@"
diff --git a/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh
new file mode 100755
index 00000000..039373ec
--- /dev/null
+++ b/documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh
@@ -0,0 +1,938 @@
+#!/bin/bash
+
+# DocumentDB EKS Cluster Deletion Script
+# This script completely removes the EKS cluster and all AWS resources to avoid charges
+
+set -e  # Exit on any error
+
+# Configuration
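An aside on the `--log-retention` option summarized above: CloudWatch Logs accepts only a fixed menu of retention periods, so an off-menu value (say `10`) would make `aws logs put-retention-policy` fail at runtime. A guard along the following lines could reject bad values up front — a sketch only, where `valid_retention_days` is an illustrative helper that these scripts do not actually define:

```shell
#!/bin/bash
# Sketch: fail fast when --log-retention is not a value CloudWatch accepts.
# valid_retention_days is hypothetical; the contrib scripts do not define it.
valid_retention_days() {
    case "$1" in
        1|3|5|7|14|30|60|90|120|150|180|365|400|545|731|1827|3653) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: validate before any AWS call is made.
LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-3}"
if ! valid_retention_days "$LOG_RETENTION_DAYS"; then
    echo "Invalid --log-retention value: $LOG_RETENTION_DAYS" >&2
    exit 1
fi
```

Checking the value locally keeps the failure mode a one-line error instead of a half-configured cluster.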
+CLUSTER_NAME="${CLUSTER_NAME:-documentdb-contrib-cluster}"
+REGION="${REGION:-us-west-2}"
+
+# Feature flags - set to "true" to enable, "false" to skip
+DELETE_CLUSTER="${DELETE_CLUSTER:-true}"
+DELETE_OPERATOR="${DELETE_OPERATOR:-true}"
+DELETE_INSTANCE="${DELETE_INSTANCE:-true}"
+SKIP_CONFIRMATION="false"
+
+# Parse command line arguments
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --instance-only)
+            DELETE_INSTANCE="true"
+            DELETE_OPERATOR="false"
+            DELETE_CLUSTER="false"
+            shift
+            ;;
+        --instance-and-operator)
+            DELETE_INSTANCE="true"
+            DELETE_OPERATOR="true"
+            DELETE_CLUSTER="false"
+            shift
+            ;;
+        --cluster-name)
+            CLUSTER_NAME="$2"
+            shift 2
+            ;;
+        --region)
+            REGION="$2"
+            shift 2
+            ;;
+        -y|--yes)
+            SKIP_CONFIRMATION="true"
+            shift
+            ;;
+        -h|--help)
+            echo "Usage: $0 [OPTIONS]"
+            echo ""
+            echo "Options:"
+            echo "  --instance-only          Delete only DocumentDB instances (keep operator and cluster)"
+            echo "  --instance-and-operator  Delete instances and operator (keep cluster)"
+            echo "  --cluster-name NAME      EKS cluster name (default: documentdb-contrib-cluster)"
+            echo "  --region REGION          AWS region (default: us-west-2)"
+            echo "  -y, --yes                Skip confirmation prompt"
+            echo "  -h, --help               Show this help message"
+            echo ""
+            echo "Examples:"
+            echo "  $0                          # Delete everything (default)"
+            echo "  $0 --instance-only          # Delete only DocumentDB instances"
+            echo "  $0 --instance-and-operator  # Delete instances and operator, keep cluster"
+            echo "  $0 --yes                    # Delete everything without confirmation"
+            exit 0
+            ;;
+        *)
+            echo "Unknown option: $1"
+            exit 1
+            ;;
+    esac
+done
+
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+
+# Logging function
+log() {
+    echo -e "${BLUE}[$(date +'%Y-%m-%d %H:%M:%S')]${NC} $1"
+}
+
+success() {
+    echo -e "${GREEN}[$(date +'%Y-%m-%d %H:%M:%S')] ✅ $1${NC}"
+}
+
+warn() {
+    echo -e "${YELLOW}[$(date +'%Y-%m-%d %H:%M:%S')] ⚠️ $1${NC}"
+}
+
+error() {
+    echo -e "${RED}[$(date +'%Y-%m-%d %H:%M:%S')] ❌ $1${NC}"
+}
+
+# Confirmation prompt
+confirm_deletion() {
+    if [ "$SKIP_CONFIRMATION" == "true" ]; then
+        log "Skipping confirmation (--yes flag provided)"
+        return 0
+    fi
+
+    echo ""
+    echo "======================================="
+    echo "  DELETION WARNING"
+    echo "======================================="
+    echo ""
+    warn "This will DELETE the following resources:"
+
+    if [ "$DELETE_INSTANCE" == "true" ]; then
+        echo "  • All DocumentDB instances"
+    fi
+
+    if [ "$DELETE_OPERATOR" == "true" ]; then
+        echo "  • DocumentDB operator deployments"
+        echo "  • Related namespaces and CRDs"
+    fi
+
+    if [ "$DELETE_CLUSTER" == "true" ]; then
+        echo "  • EKS Cluster: $CLUSTER_NAME"
+        echo "  • All persistent volumes"
+        echo "  • Load balancers and networking"
+        echo "  • IAM roles and policies"
+    fi
+
+    echo ""
+    warn "This action is IRREVERSIBLE!"
+    echo ""
+
+    read -p "Are you sure you want to proceed? (type 'yes' to confirm): " confirmation
+
+    if [ "$confirmation" != "yes" ]; then
+        log "Deletion cancelled by user"
+        exit 0
+    fi
+
+    log "Proceeding with deletion..."
+}
+
+# Delete DocumentDB instances
+delete_documentdb_instances() {
+    if [ "$DELETE_INSTANCE" != "true" ]; then
+        warn "Skipping DocumentDB instance deletion (DELETE_INSTANCE=false)"
+        return 0
+    fi
+
+    log "Deleting DocumentDB instances..."
+
+    # Delete all DocumentDB instances (this will trigger LoadBalancer deletion)
+    kubectl delete documentdb --all --all-namespaces --timeout=300s || warn "No DocumentDB instances found or deletion failed"
+
+    # Wait for LoadBalancer services to be deleted (created by DocumentDB instances)
+    log "Waiting for DocumentDB LoadBalancer services to be deleted..."
+    for i in {1..12}; do  # Wait up to 6 minutes
+        LB_SERVICES=$(kubectl get services --all-namespaces -o json 2>/dev/null | jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace)/\(.metadata.name)"' 2>/dev/null || echo "")
+        if [ -z "$LB_SERVICES" ]; then
+            success "All LoadBalancer services deleted"
+            break
+        fi
+        log "Still waiting for LoadBalancer services to be deleted... (attempt $i/12)"
+        echo "$LB_SERVICES" | while read svc; do
+            if [ -n "$svc" ]; then
+                log "  Remaining service: $svc"
+            fi
+        done
+        sleep 30
+    done
+
+    # Wait for AWS LoadBalancers to be cleaned up
+    log "Waiting for AWS LoadBalancers to be fully removed..."
+    for i in {1..12}; do  # Wait up to 6 minutes for AWS cleanup
+        # Check for both ELBv2 (ALB/NLB) and Classic ELB
+        AWS_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-')].LoadBalancerName" --output text 2>/dev/null || echo "")
+        AWS_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?contains(LoadBalancerName, 'k8s-')].LoadBalancerName" --output text 2>/dev/null || echo "")
+
+        if ([ -z "$AWS_LBS_V2" ] || [ "$AWS_LBS_V2" = "None" ]) && ([ -z "$AWS_LBS_CLASSIC" ] || [ "$AWS_LBS_CLASSIC" = "None" ]); then
+            success "All AWS LoadBalancers cleaned up"
+            break
+        fi
+        log "Still waiting for AWS LoadBalancers to be removed... (attempt $i/12)"
+        if [ -n "$AWS_LBS_V2" ] && [ "$AWS_LBS_V2" != "None" ]; then
+            log "  Remaining ELBv2: $AWS_LBS_V2"
+        fi
+        if [ -n "$AWS_LBS_CLASSIC" ] && [ "$AWS_LBS_CLASSIC" != "None" ]; then
+            log "  Remaining Classic ELB: $AWS_LBS_CLASSIC"
+        fi
+        sleep 30
+    done
+
+    # Wait for PostgreSQL clusters to be deleted
+    log "Waiting for PostgreSQL clusters to be deleted..."
+    sleep 30
+
+    success "DocumentDB instances and related LoadBalancers deleted"
+}
+
+# Delete Helm releases
+delete_helm_releases() {
+    if [ "$DELETE_OPERATOR" != "true" ]; then
+        warn "Skipping DocumentDB operator deletion"
+        return 0
+    fi
+
+    log "Deleting DocumentDB operator and related resources..."
+
+    # First, delete all LoadBalancer services to avoid dependency issues
+    log "Deleting LoadBalancer services..."
+    kubectl get services --all-namespaces -o json 2>/dev/null | \
+        jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace) \(.metadata.name)"' 2>/dev/null | \
+        while read namespace service; do
+            if [ -n "$namespace" ] && [ -n "$service" ]; then
+                log "Deleting LoadBalancer service: $service in namespace $namespace"
+                kubectl delete service "$service" -n "$namespace" --timeout=300s || warn "Failed to delete service $service"
+            fi
+        done 2>/dev/null || warn "No LoadBalancer services found or jq not available"
+
+    # Wait for LoadBalancers to be deleted from AWS
+    log "Waiting for AWS LoadBalancers to be cleaned up..."
+    sleep 30
+
+    log "Deleting DocumentDB operator Helm releases..."
+
+    # Delete DocumentDB operator
+    helm uninstall documentdb-operator --namespace documentdb-operator 2>/dev/null || warn "DocumentDB operator not found in documentdb-operator namespace"
+
+    # Only delete these if we're deleting the whole cluster
+    if [ "$DELETE_CLUSTER" == "true" ]; then
+        # Delete CloudWatch Observability add-on (managed collector for Container Insights)
+        log "Removing CloudWatch Observability add-on..."
+        if aws eks delete-addon \
+            --cluster-name "$CLUSTER_NAME" \
+            --addon-name amazon-cloudwatch-observability \
+            --region "$REGION" >/dev/null 2>&1; then
+            success "CloudWatch Observability add-on deletion requested"
+            aws eks wait addon-deleted \
+                --cluster-name "$CLUSTER_NAME" \
+                --addon-name amazon-cloudwatch-observability \
+                --region "$REGION" >/dev/null 2>&1 \
+                || warn "Timed out waiting for add-on deletion; cluster deletion will continue"
+        else
+            warn "CloudWatch Observability add-on not found or could not be deleted"
+        fi
+
+        # Delete AWS Load Balancer Controller (after LoadBalancer services are gone)
+        helm uninstall aws-load-balancer-controller -n kube-system 2>/dev/null || warn "AWS Load Balancer Controller not found"
+
+        # Delete cert-manager
+        helm uninstall cert-manager -n cert-manager 2>/dev/null || warn "cert-manager not found"
+    fi
+
+    # Give more time for resources to be cleaned up
+    log "Waiting for Helm releases and AWS resources to be fully removed..."
+    sleep 30
+
+    success "DocumentDB operator and related resources deleted"
+}
+
+# Delete namespaces
+delete_namespaces() {
+    if [ "$DELETE_OPERATOR" != "true" ]; then
+        warn "Skipping DocumentDB namespaces deletion"
+        return 0
+    fi
+
+    log "Deleting DocumentDB namespaces..."
+
+    # Delete DocumentDB namespace
+    kubectl delete namespace documentdb-operator --timeout=300s || warn "documentdb-operator namespace not found"
+
+    # Delete instance namespace if it exists
+    kubectl delete namespace documentdb-instance-ns --timeout=300s || warn "documentdb-instance-ns namespace not found"
+
+    # Only delete these if we're deleting the whole cluster
+    if [ "$DELETE_CLUSTER" == "true" ]; then
+        kubectl delete namespace cert-manager --timeout=300s || warn "cert-manager namespace not found"
+    fi
+
+    success "DocumentDB namespaces deleted"
+}
+
+# Delete CRDs
+delete_crds() {
+    if [ "$DELETE_OPERATOR" != "true" ]; then
+        warn "Skipping DocumentDB CRDs deletion"
+        return 0
+    fi
+
+    log "Deleting DocumentDB Custom Resource Definitions..."
+
+    # Delete specific CRDs
+    kubectl delete crd backups.postgresql.cnpg.io \
+        clusterimagecatalogs.postgresql.cnpg.io \
+        clusters.postgresql.cnpg.io \
+        databases.postgresql.cnpg.io \
+        imagecatalogs.postgresql.cnpg.io \
+        poolers.postgresql.cnpg.io \
+        publications.postgresql.cnpg.io \
+        scheduledbackups.postgresql.cnpg.io \
+        subscriptions.postgresql.cnpg.io \
+        dbs.documentdb.io 2>/dev/null || warn "Some CRDs not found or already deleted"
+
+    # Only delete these if we're deleting the whole cluster
+    if [ "$DELETE_CLUSTER" == "true" ]; then
+        # Delete cert-manager CRDs
+        kubectl delete crd -l app.kubernetes.io/name=cert-manager 2>/dev/null || warn "cert-manager CRDs not found"
+    fi
+
+    success "DocumentDB CRDs deleted"
+}
+
+# Delete CloudWatch log groups created for this cluster (control plane + container insights)
+delete_cloudwatch_logs() {
+    if [ "$DELETE_CLUSTER" != "true" ]; then
+        return 0
+    fi
+
+    log "Deleting CloudWatch log groups for cluster $CLUSTER_NAME..."
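All five log groups this variant creates follow the standard EKS control-plane and Container Insights naming conventions, so the whole deletion list can be derived from the cluster name alone. As a self-contained illustration of that convention (`cluster_log_groups` is a hypothetical helper mirroring the hard-coded list, not part of this script):

```shell
#!/bin/bash
# Sketch: derive the CloudWatch log-group names this variant creates for a cluster.
# cluster_log_groups is hypothetical; delete_cloudwatch_logs hard-codes the same names.
cluster_log_groups() {
    local cluster="$1"
    printf '%s\n' \
        "/aws/eks/${cluster}/cluster" \
        "/aws/containerinsights/${cluster}/application" \
        "/aws/containerinsights/${cluster}/dataplane" \
        "/aws/containerinsights/${cluster}/host" \
        "/aws/containerinsights/${cluster}/performance"
}

cluster_log_groups "documentdb-contrib-cluster"
```

Piping this list into `aws logs delete-log-group` one name at a time would be equivalent to the script's explicit array-and-loop approach.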
+    local groups=(
+        "/aws/eks/${CLUSTER_NAME}/cluster"
+        "/aws/containerinsights/${CLUSTER_NAME}/application"
+        "/aws/containerinsights/${CLUSTER_NAME}/dataplane"
+        "/aws/containerinsights/${CLUSTER_NAME}/host"
+        "/aws/containerinsights/${CLUSTER_NAME}/performance"
+    )
+    for group in "${groups[@]}"; do
+        if aws logs delete-log-group --log-group-name "$group" --region "$REGION" 2>/dev/null; then
+            success "Deleted log group: $group"
+        else
+            warn "Log group $group not found (may have already been deleted or never created)"
+        fi
+    done
+}
+
+# Delete AWS resources
+delete_aws_resources() {
+    log "Deleting AWS resources..."
+
+    # Get account ID
+    ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text 2>/dev/null) || {
+        warn "Could not get AWS account ID. Skipping IAM policy deletion."
+        return 0
+    }
+
+    # Delete IAM policies (only if they exist)
+    aws iam delete-policy --policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy 2>/dev/null || warn "IAM policy AWSLoadBalancerControllerIAMPolicy not found"
+
+    # Delete any remaining load balancers
+    log "Checking for remaining load balancers..."
+    local remaining_lbs=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s')].LoadBalancerArn" --output text 2>/dev/null || echo "")
+    if [ -n "$remaining_lbs" ]; then
+        warn "Found remaining load balancers. They may take a few minutes to delete automatically."
+    fi
+
+    # Delete any remaining volumes
+    log "Checking for remaining EBS volumes..."
+    local remaining_volumes=$(aws ec2 describe-volumes --region $REGION --filters "Name=tag:kubernetes.io/cluster/$CLUSTER_NAME,Values=owned" --query "Volumes[?State=='available'].VolumeId" --output text 2>/dev/null || echo "")
+    if [ -n "$remaining_volumes" ]; then
+        warn "Found remaining EBS volumes. Attempting to delete them..."
+        for volume in $remaining_volumes; do
+            aws ec2 delete-volume --volume-id $volume --region $REGION 2>/dev/null || warn "Could not delete volume $volume"
+        done
+    fi
+
+    success "AWS resources cleanup attempted"
+}
+
+# Clean up any remaining infrastructure LoadBalancers (not DocumentDB app LoadBalancers)
+cleanup_infrastructure_loadbalancers() {
+    if [ "$DELETE_CLUSTER" != "true" ]; then
+        return 0
+    fi
+
+    log "Checking for remaining infrastructure LoadBalancers..."
+
+    # Only look for LoadBalancers that might be created by cluster infrastructure
+    # DocumentDB LoadBalancers should already be deleted by delete_documentdb_instances
+    LB_ARNS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-elb') || contains(LoadBalancerName, 'k8s-nlb') || contains(LoadBalancerName, '$CLUSTER_NAME')].LoadBalancerArn" --output text 2>/dev/null || echo "")
+
+    if [ -n "$LB_ARNS" ] && [ "$LB_ARNS" != "None" ]; then
+        log "Found infrastructure LoadBalancers to clean up:"
+        echo "$LB_ARNS" | tr '\t' '\n' | while read lb_arn; do
+            if [ -n "$lb_arn" ]; then
+                LB_NAME=$(aws elbv2 describe-load-balancers --load-balancer-arns "$lb_arn" --region $REGION --query 'LoadBalancers[0].LoadBalancerName' --output text 2>/dev/null || echo "unknown")
+                log "  Deleting infrastructure LoadBalancer: $LB_NAME"
+                aws elbv2 delete-load-balancer --load-balancer-arn "$lb_arn" --region $REGION 2>/dev/null || warn "Failed to delete LoadBalancer $LB_NAME"
+            fi
+        done
+
+        # Wait for infrastructure LoadBalancer deletion to complete
+        log "Waiting for infrastructure LoadBalancer deletion to complete..."
+        for i in {1..6}; do  # Wait up to 3 minutes
+            REMAINING_LBS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s-elb') || contains(LoadBalancerName, 'k8s-nlb') || contains(LoadBalancerName, '$CLUSTER_NAME')].LoadBalancerArn" --output text 2>/dev/null || echo "")
+            if [ -z "$REMAINING_LBS" ] || [ "$REMAINING_LBS" = "None" ]; then
+                success "All infrastructure LoadBalancers deleted"
+                break
+            fi
+            log "Still waiting for infrastructure LoadBalancers to be deleted... (attempt $i/6)"
+            sleep 30
+        done
+    else
+        log "No infrastructure LoadBalancers found to clean up."
+    fi
+}
+
+# Clean up VPC dependencies that can block CloudFormation deletion with proper waiting
+cleanup_vpc_dependencies() {
+    if [ "$DELETE_CLUSTER" != "true" ]; then
+        return 0
+    fi
+
+    log "Cleaning up VPC dependencies..."
+
+    # Get the VPC ID for our cluster
+    VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=tag:Name,Values=eksctl-$CLUSTER_NAME-cluster/VPC" --query 'Vpcs[0].VpcId' --output text 2>/dev/null || echo "")
+
+    # Clean up VPC endpoints created for cost optimization (e.g. S3 Gateway endpoint).
+    # Must run before CloudFormation deletion so the VPC can be torn down cleanly.
+    if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then
+        log "Cleaning up VPC endpoints..."
+        VPC_ENDPOINTS=$(aws ec2 describe-vpc-endpoints --region "$REGION" \
+            --filters "Name=vpc-id,Values=$VPC_ID" \
+            --query 'VpcEndpoints[].VpcEndpointId' --output text 2>/dev/null || echo "")
+        if [ -n "$VPC_ENDPOINTS" ] && [ "$VPC_ENDPOINTS" != "None" ]; then
+            for endpoint_id in $VPC_ENDPOINTS; do
+                log "  Deleting VPC endpoint: $endpoint_id"
+                aws ec2 delete-vpc-endpoints --vpc-endpoint-ids "$endpoint_id" --region "$REGION" 2>/dev/null || warn "Failed to delete VPC endpoint $endpoint_id"
+            done
+            success "VPC endpoints cleaned up"
+        else
+            log "No VPC endpoints found to clean up."
+        fi
+    fi
+
+    if [ -z "$VPC_ID" ] || [ "$VPC_ID" = "None" ] || [ "$VPC_ID" = "null" ]; then
+        log "No VPC found for cluster $CLUSTER_NAME, checking for any remaining k8s security groups..."
+        # Fallback: look for any k8s-related security groups
+        SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=group-name,Values=k8s-*" --query 'SecurityGroups[].GroupId' --output text 2>/dev/null || echo "")
+    else
+        log "Found VPC: $VPC_ID"
+
+        # COMPREHENSIVE LOADBALANCER CLEANUP - Check for any remaining LoadBalancers in this VPC
+        log "Performing comprehensive LoadBalancer cleanup in VPC $VPC_ID..."
+
+        # Check for ELBv2 LoadBalancers (ALB/NLB) in this VPC
+        VPC_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].{Arn:LoadBalancerArn,Name:LoadBalancerName}" --output text 2>/dev/null || echo "")
+
+        if [ -n "$VPC_LBS_V2" ] && [ "$VPC_LBS_V2" != "None" ]; then
+            log "Found ELBv2 LoadBalancers in VPC, deleting them..."
+            echo "$VPC_LBS_V2" | while read lb_arn lb_name; do
+                if [ -n "$lb_arn" ] && [ "$lb_arn" != "None" ]; then
+                    log "  Deleting LoadBalancer: $lb_name ($lb_arn)"
+                    aws elbv2 delete-load-balancer --region $REGION --load-balancer-arn "$lb_arn" || warn "Failed to delete LoadBalancer $lb_name"
+                fi
+            done
+
+            # Wait for LoadBalancers to be deleted
+            log "Waiting for ELBv2 LoadBalancers to be deleted..."
+            for i in {1..12}; do  # Wait up to 6 minutes
+                REMAINING_LBS=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "")
+                if [ -z "$REMAINING_LBS" ] || [ "$REMAINING_LBS" = "None" ]; then
+                    success "All ELBv2 LoadBalancers deleted from VPC"
+                    break
+                fi
+                log "Still waiting for ELBv2 LoadBalancers to be deleted... (attempt $i/12)"
+                sleep 30
+            done
+        else
+            log "No ELBv2 LoadBalancers found in VPC"
+        fi
+
+        # Check for Classic LoadBalancers in this VPC
+        VPC_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "")
+
+        if [ -n "$VPC_LBS_CLASSIC" ] && [ "$VPC_LBS_CLASSIC" != "None" ]; then
+            log "Found Classic LoadBalancers in VPC, deleting them..."
+            echo "$VPC_LBS_CLASSIC" | tr '\t' '\n' | while read lb_name; do
+                if [ -n "$lb_name" ] && [ "$lb_name" != "None" ]; then
+                    log "  Deleting Classic LoadBalancer: $lb_name"
+                    aws elb delete-load-balancer --region $REGION --load-balancer-name "$lb_name" || warn "Failed to delete Classic LoadBalancer $lb_name"
+                fi
+            done
+
+            # Wait for Classic LoadBalancers to be deleted
+            log "Waiting for Classic LoadBalancers to be deleted..."
+            for i in {1..12}; do  # Wait up to 6 minutes
+                REMAINING_CLASSIC_LBS=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "")
+                if [ -z "$REMAINING_CLASSIC_LBS" ] || [ "$REMAINING_CLASSIC_LBS" = "None" ]; then
+                    success "All Classic LoadBalancers deleted from VPC"
+                    break
+                fi
+                log "Still waiting for Classic LoadBalancers to be deleted... (attempt $i/12)"
+                sleep 30
+            done
+        else
+            log "No Classic LoadBalancers found in VPC"
+        fi
+
+        # Check for network interfaces that might still be attached to LoadBalancers
+        log "Checking for LoadBalancer network interfaces in VPC subnets..."
+        VPC_SUBNETS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'Subnets[].SubnetId' --output text 2>/dev/null || echo "")
+
+        if [ -n "$VPC_SUBNETS" ] && [ "$VPC_SUBNETS" != "None" ]; then
+            for subnet_id in $VPC_SUBNETS; do
+                LB_ENIS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=subnet-id,Values=$subnet_id" --query 'NetworkInterfaces[?contains(Description, `ELB`) && Status==`in-use`].{Id:NetworkInterfaceId,Description:Description}' --output text 2>/dev/null || echo "")
+
+                if [ -n "$LB_ENIS" ] && [ "$LB_ENIS" != "None" ]; then
+                    log "Found LoadBalancer network interfaces in subnet $subnet_id:"
+                    echo "$LB_ENIS" | while read eni_id description; do
+                        if [ -n "$eni_id" ] && [ "$eni_id" != "None" ]; then
+                            log "  ENI $eni_id: $description"
+                            # Extract LoadBalancer name from description for targeted deletion
+                            LB_FROM_ENI=$(echo "$description" | grep -o 'k8s-[^/]*' | head -1 || echo "")
+                            if [ -n "$LB_FROM_ENI" ]; then
+                                log "  Attempting to delete LoadBalancer: $LB_FROM_ENI"
+                                # Try to find and delete the LoadBalancer by name pattern
+                                LB_ARN=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?LoadBalancerName=='$LB_FROM_ENI'].LoadBalancerArn" --output text 2>/dev/null || echo "")
+                                if [ -n "$LB_ARN" ] && [ "$LB_ARN" != "None" ]; then
+                                    log "  Found ELBv2 LoadBalancer, deleting: $LB_ARN"
+                                    aws elbv2 delete-load-balancer --region $REGION --load-balancer-arn "$LB_ARN" || warn "Failed to delete ELBv2 LoadBalancer $LB_FROM_ENI"
+                                else
+                                    # Try Classic ELB
+                                    aws elb delete-load-balancer --region $REGION --load-balancer-name "$LB_FROM_ENI" 2>/dev/null || warn "Could not delete LoadBalancer $LB_FROM_ENI"
+                                fi
+                            fi
+                        fi
+                    done
+                fi
+            done
+
+            # Final wait for all network interfaces to be released
+            log "Waiting for LoadBalancer network interfaces to be released..."
+            for i in {1..8}; do  # Wait up to 4 minutes
+                REMAINING_LB_ENIS=""
+                for subnet_id in $VPC_SUBNETS; do
+                    SUBNET_LB_ENIS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=subnet-id,Values=$subnet_id" --query 'NetworkInterfaces[?contains(Description, `ELB`) && Status==`in-use`].NetworkInterfaceId' --output text 2>/dev/null || echo "")
+                    if [ -n "$SUBNET_LB_ENIS" ] && [ "$SUBNET_LB_ENIS" != "None" ]; then
+                        REMAINING_LB_ENIS="$REMAINING_LB_ENIS $SUBNET_LB_ENIS"
+                    fi
+                done
+
+                if [ -z "$REMAINING_LB_ENIS" ] || [ "$REMAINING_LB_ENIS" = " " ]; then
+                    success "All LoadBalancer network interfaces released"
+                    break
+                fi
+                log "Still waiting for LoadBalancer network interfaces to be released... (attempt $i/8)"
+                sleep 30
+            done
+        fi
+
+        success "Comprehensive LoadBalancer cleanup completed"
+
+        # ENHANCED SECURITY GROUP CLEANUP - Run after LoadBalancer cleanup is complete
+        log "Performing enhanced security group cleanup..."
+
+        # Wait a bit more for AWS to propagate LoadBalancer deletions
+        sleep 30
+
+        # Get all non-default security groups in the VPC with retry logic
+        for retry in {1..3}; do
+            log "Attempting security group cleanup (attempt $retry/3)..."
+
+            SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "")
+
+            if [ -n "$SECURITY_GROUPS" ] && [ "$SECURITY_GROUPS" != "None" ]; then
+                log "Found security groups to delete: $SECURITY_GROUPS"
+
+                # Delete security groups one by one with detailed error handling
+                for sg_id in $SECURITY_GROUPS; do
+                    if [ -n "$sg_id" ] && [ "$sg_id" != "None" ]; then
+                        SG_NAME=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].GroupName' --output text 2>/dev/null || echo "unknown")
+                        SG_DESC=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].Description' --output text 2>/dev/null || echo "unknown")
+
+                        log "  Attempting to delete security group: $SG_NAME ($sg_id) - $SG_DESC"
+
+                        # Try to delete the security group
+                        if aws ec2 delete-security-group --group-id "$sg_id" --region $REGION 2>/dev/null; then
+                            success "  Successfully deleted security group: $SG_NAME"
+                        else
+                            warn "  Failed to delete security group: $SG_NAME - may have dependencies"
+
+                            # Check what's still using this security group
+                            SG_DEPS=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=group-id,Values=$sg_id" --query 'NetworkInterfaces[].{Id:NetworkInterfaceId,Status:Status,Description:Description}' --output text 2>/dev/null || echo "")
+                            if [ -n "$SG_DEPS" ] && [ "$SG_DEPS" != "None" ]; then
+                                log "  Security group $SG_NAME is still used by network interfaces:"
+                                echo "$SG_DEPS" | while read eni_id status desc; do
+                                    log "    ENI: $eni_id ($status) - $desc"
+                                done
+                            fi
+                        fi
+                    fi
+                done
+
+                # Wait for security group deletions to propagate
+                if [ $retry -lt 3 ]; then
+                    log "Waiting 60 seconds for security group deletions to propagate..."
+                    sleep 60
+                fi
+            else
+                log "No non-default security groups found"
+                break
+            fi
+        done
+
+        # Final verification of security group cleanup
+        REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].{GroupId:GroupId,GroupName:GroupName}' --output text 2>/dev/null || echo "")
+        if [ -z "$REMAINING_SG" ] || [ "$REMAINING_SG" = "None" ]; then
+            success "All non-default security groups cleaned up successfully"
+        else
+            warn "Some security groups remain in VPC:"
+            echo "$REMAINING_SG" | while read sg_id sg_name; do
+                warn "  Remaining: $sg_name ($sg_id)"
+            done
+        fi
+
+        # Clean up security groups in this VPC (except default)
+        log "Finding security groups in VPC $VPC_ID..."
+        SECURITY_GROUPS=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "")
+    fi
+
+    if [ -n "$SECURITY_GROUPS" ] && [ "$SECURITY_GROUPS" != "None" ]; then
+        log "Deleting security groups..."
+        echo "$SECURITY_GROUPS" | tr '\t' '\n' | while read sg_id; do
+            if [ -n "$sg_id" ]; then
+                SG_NAME=$(aws ec2 describe-security-groups --group-ids "$sg_id" --region $REGION --query 'SecurityGroups[0].GroupName' --output text 2>/dev/null || echo "unknown")
+                log "  Deleting security group: $SG_NAME ($sg_id)"
+                aws ec2 delete-security-group --group-id "$sg_id" --region $REGION 2>/dev/null || warn "Failed to delete security group $sg_id"
+            fi
+        done
+
+        # Wait and verify security groups are deleted
+        log "Waiting for security groups to be deleted..."
+ for i in {1..6}; do # Wait up to 3 minutes + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'SecurityGroups[?GroupName!=`default`].GroupId' --output text 2>/dev/null || echo "") + else + REMAINING_SG=$(aws ec2 describe-security-groups --region $REGION --filters "Name=group-name,Values=k8s-*" --query 'SecurityGroups[].GroupId' --output text 2>/dev/null || echo "") + fi + + if [ -z "$REMAINING_SG" ] || [ "$REMAINING_SG" = "None" ]; then + success "All security groups deleted successfully" + break + fi + log "Still waiting for security groups to be deleted... (attempt $i/6)" + sleep 30 + done + else + log "No non-default security groups found to clean up." + fi + + # Clean up any remaining network interfaces in the VPC + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + log "Checking for remaining network interfaces in VPC $VPC_ID..." + NETWORK_INTERFACES=$(aws ec2 describe-network-interfaces --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'NetworkInterfaces[?Status!=`in-use`].NetworkInterfaceId' --output text 2>/dev/null || echo "") + + if [ -n "$NETWORK_INTERFACES" ] && [ "$NETWORK_INTERFACES" != "None" ]; then + log "Deleting unused network interfaces..." + echo "$NETWORK_INTERFACES" | tr '\t' '\n' | while read eni_id; do + if [ -n "$eni_id" ]; then + log " Deleting network interface: $eni_id" + aws ec2 delete-network-interface --network-interface-id "$eni_id" --region $REGION 2>/dev/null || warn "Failed to delete network interface $eni_id" + fi + done + + # Wait for network interfaces to be deleted + log "Waiting for network interfaces to be deleted..." + sleep 30 + else + log "No unused network interfaces found." + fi + fi + + success "VPC dependencies cleanup completed." 
+} + +# Delete EKS cluster +delete_cluster() { + if [ "$DELETE_CLUSTER" != "true" ]; then + warn "Skipping EKS cluster deletion (--keep-cluster specified)" + return 0 + fi + + log "Deleting EKS cluster..." + + # Check if cluster exists + if ! eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster $CLUSTER_NAME not found. Skipping cluster deletion." + return 0 + fi + + # Final check: Make sure all LoadBalancers are really gone + log "Final verification: ensuring all LoadBalancers are deleted..." + local retry_count=0 + local max_retries=5 + + while [ $retry_count -lt $max_retries ]; do + # Get VPC ID for targeted cleanup + VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=tag:Name,Values=eksctl-$CLUSTER_NAME-cluster/VPC" --query 'Vpcs[0].VpcId' --output text 2>/dev/null || echo "") + + if [ -n "$VPC_ID" ] && [ "$VPC_ID" != "None" ] && [ "$VPC_ID" != "null" ]; then + # Check for LoadBalancers in this VPC + VPC_LBS_V2=$(aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?VpcId=='$VPC_ID'].LoadBalancerArn" --output text 2>/dev/null || echo "") + VPC_LBS_CLASSIC=$(aws elb describe-load-balancers --region $REGION --query "LoadBalancerDescriptions[?VPCId=='$VPC_ID'].LoadBalancerName" --output text 2>/dev/null || echo "") + + if ([ -z "$VPC_LBS_V2" ] || [ "$VPC_LBS_V2" = "None" ]) && ([ -z "$VPC_LBS_CLASSIC" ] || [ "$VPC_LBS_CLASSIC" = "None" ]); then + success "No LoadBalancers found in cluster VPC" + break + fi + + log "Found LoadBalancers still in VPC $VPC_ID, waiting... 
(attempt $((retry_count + 1))/$max_retries)" + if [ -n "$VPC_LBS_V2" ] && [ "$VPC_LBS_V2" != "None" ]; then + log " ELBv2 LoadBalancers: $VPC_LBS_V2" + fi + if [ -n "$VPC_LBS_CLASSIC" ] && [ "$VPC_LBS_CLASSIC" != "None" ]; then + log " Classic LoadBalancers: $VPC_LBS_CLASSIC" + fi + + sleep 60 + retry_count=$((retry_count + 1)) + else + log "VPC not found or already deleted" + break + fi + done + + # Delete the cluster + eksctl delete cluster --name $CLUSTER_NAME --region $REGION --wait + + if [ $? -eq 0 ]; then + success "EKS cluster deleted successfully" + else + error "Failed to delete EKS cluster" + fi +} + +# Clean up local kubectl context +cleanup_kubectl_context() { + log "Cleaning up kubectl context..." + + # Remove kubectl context (handle both possible context names) + kubectl config delete-context "$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl context $CLUSTER_NAME.$REGION.eksctl.io not found" + kubectl config delete-cluster "$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl cluster $CLUSTER_NAME.$REGION.eksctl.io not found" + kubectl config delete-user "documentdb-admin@$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl user not found" + + # Also try the default user pattern + kubectl config delete-user "$CLUSTER_NAME@$CLUSTER_NAME.$REGION.eksctl.io" 2>/dev/null || warn "kubectl user (alternate pattern) not found" + + success "kubectl context cleaned up" +} + +# Verify deletion +verify_deletion() { + log "Verifying deletion..." + + echo "" + echo "=== Checking for remaining resources ===" + + # Check if cluster exists + if eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster still exists!" + else + success "Cluster deleted" + fi + + # Check for remaining CloudFormation stacks + echo "" + log "Checking for remaining CloudFormation stacks..." 
+    aws cloudformation list-stacks --region $REGION --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE DELETE_IN_PROGRESS DELETE_FAILED --query "StackSummaries[?contains(StackName, 'eksctl-$CLUSTER_NAME')].{Name:StackName,Status:StackStatus}" --output table || true
+
+    # Check for remaining EBS volumes
+    echo ""
+    log "Checking for remaining EBS volumes..."
+    aws ec2 describe-volumes --region $REGION --filters "Name=tag:kubernetes.io/cluster/$CLUSTER_NAME,Values=owned" --query "Volumes[].{VolumeId:VolumeId,State:State,Size:Size}" --output table 2>/dev/null || log "No volumes found with cluster tag"
+
+    # Check for remaining load balancers
+    echo ""
+    log "Checking for remaining load balancers..."
+    aws elbv2 describe-load-balancers --region $REGION --query "LoadBalancers[?contains(LoadBalancerName, 'k8s')].[LoadBalancerName,State.Code]" --output table 2>/dev/null || log "No load balancers found"
+
+    echo ""
+    success "Deletion verification complete!"
+}
+
+# Manual cleanup instructions
+show_manual_cleanup() {
+    echo ""
+    echo "======================================="
+    echo "  MANUAL CLEANUP (if needed)"
+    echo "======================================="
+    echo ""
+    echo "If any resources remain, you can manually clean them up:"
+    echo ""
+    echo "1. CloudFormation Stacks:"
+    echo "   aws cloudformation delete-stack --stack-name STACK_NAME --region $REGION"
+    echo ""
+    echo "2. EBS Volumes:"
+    echo "   aws ec2 delete-volume --volume-id VOLUME_ID --region $REGION"
+    echo ""
+    echo "3. Load Balancers:"
+    echo "   aws elbv2 delete-load-balancer --load-balancer-arn LOAD_BALANCER_ARN --region $REGION"
+    echo ""
+    echo "4. CloudWatch Log Groups:"
+    echo "   aws logs delete-log-group --log-group-name LOG_GROUP_NAME --region $REGION"
+    echo ""
+    echo "5. IAM Roles and Policies:"
+    echo "   Check AWS Console -> IAM for any remaining eksctl-created resources"
+    echo ""
+}
+
+# Clean up failed CloudFormation stacks with proper waiting
+cleanup_failed_cloudformation_stacks() {
+    if [ "$DELETE_CLUSTER" != "true" ]; then
+        return 0
+    fi
+
+    log "Checking for failed CloudFormation stacks..."
+ + # Look for stacks related to our cluster that are in failed states + FAILED_STACKS=$(aws cloudformation list-stacks --region $REGION --query "StackSummaries[?contains(StackName, '$CLUSTER_NAME') && (StackStatus=='DELETE_FAILED' || StackStatus=='CREATE_FAILED' || StackStatus=='UPDATE_FAILED')].StackName" --output text 2>/dev/null || echo "") + + if [ -n "$FAILED_STACKS" ] && [ "$FAILED_STACKS" != "None" ]; then + log "Found failed CloudFormation stacks, attempting to delete:" + echo "$FAILED_STACKS" | tr '\t' '\n' | while read stack_name; do + if [ -n "$stack_name" ]; then + log " Deleting failed stack: $stack_name" + aws cloudformation delete-stack --stack-name "$stack_name" --region $REGION 2>/dev/null || warn "Failed to delete stack $stack_name" + fi + done + + # Wait for stack deletion to complete with verification + log "Waiting for CloudFormation stack deletion to complete..." + for i in {1..20}; do # Wait up to 10 minutes + REMAINING_STACKS=$(aws cloudformation list-stacks --region $REGION --query "StackSummaries[?contains(StackName, '$CLUSTER_NAME') && StackStatus!='DELETE_COMPLETE'].StackName" --output text 2>/dev/null || echo "") + if [ -z "$REMAINING_STACKS" ] || [ "$REMAINING_STACKS" = "None" ]; then + success "All CloudFormation stacks deleted successfully" + break + fi + log "Still waiting for CloudFormation stacks to be deleted... (attempt $i/20)" + sleep 30 + done + else + log "No failed CloudFormation stacks found." + fi +} + +# Main execution +main() { + echo "=======================================" + echo " DocumentDB EKS Cluster Deletion" + echo "=======================================" + echo "" + log "Target Configuration:" + log " Cluster: $CLUSTER_NAME" + log " Region: $REGION" + log " Delete Instance: $DELETE_INSTANCE" + log " Delete Operator: $DELETE_OPERATOR" + log " Delete Cluster: $DELETE_CLUSTER" + echo "" + + confirm_deletion + + log "Starting cluster deletion process..." + + # Check if cluster exists before proceeding + if ! 
eksctl get cluster --name $CLUSTER_NAME --region $REGION &> /dev/null; then + warn "Cluster '$CLUSTER_NAME' not found in region '$REGION'" + log "This may have been already deleted, or the name/region is incorrect." + log "Proceeding with cleanup of any remaining AWS resources..." + + # Even if cluster is gone, clean up any remaining AWS resources + if [ "$DELETE_CLUSTER" == "true" ]; then + cleanup_infrastructure_loadbalancers + cleanup_vpc_dependencies + cleanup_failed_cloudformation_stacks + delete_cloudwatch_logs + cleanup_kubectl_context + fi + return 0 + fi + + # Step 1: Delete Kubernetes resources first + delete_documentdb_instances + delete_helm_releases + delete_namespaces + delete_crds + + # Step 2: Clean up AWS resources in proper order (only if deleting cluster) + if [ "$DELETE_CLUSTER" == "true" ]; then + log "Proceeding with AWS resource cleanup..." + + # Step 2a: Clean up any remaining infrastructure LoadBalancers (not DocumentDB app LBs) + cleanup_infrastructure_loadbalancers + + # Step 2b: Clean up VPC dependencies (security groups, network interfaces) + cleanup_vpc_dependencies + + # Step 2c: Clean up any failed CloudFormation stacks + cleanup_failed_cloudformation_stacks + + # Step 2d: Delete remaining AWS resources (IAM roles, policies) + delete_aws_resources + + # Step 2e: Finally delete the cluster itself + delete_cluster + + # Step 2f: Delete CloudWatch log groups (after cluster deletion so control plane stops writing) + delete_cloudwatch_logs + + # Step 2g: Clean up local kubectl context + cleanup_kubectl_context + fi + + verify_deletion + + echo "" + echo "=======================================" + success "🗑️ Deletion completed!" 
+ echo "=======================================" + echo "" + echo "Summary:" + if [ "$DELETE_INSTANCE" == "true" ]; then + echo " • DocumentDB instances removed" + fi + if [ "$DELETE_OPERATOR" == "true" ]; then + echo " • DocumentDB operator removed" + fi + if [ "$DELETE_CLUSTER" == "true" ]; then + echo " • EKS cluster '$CLUSTER_NAME' deleted from $REGION" + echo " • All AWS resources cleaned up" + echo " • kubectl context removed" + echo "" + success "No more AWS charges should be incurred from this cluster!" + else + echo " • EKS cluster '$CLUSTER_NAME' preserved" + echo "" + success "Cluster preserved - you can reinstall DocumentDB components as needed!" + fi + echo "" + + show_manual_cleanup +} + +# Run main function +main "$@" \ No newline at end of file