feat(contrib): add telemetry-and-cost-optimized-eks variant #352

Open
michaelraney wants to merge 7 commits into documentdb:main from michaelraney:feat/contrib-telemetry-cost-optimized-eks

@michaelraney

Summary

Follow-up to #349 as requested by @hossain-rayhan.

Adds a new self-contained variant of the AWS EKS playground under documentdb-playground/contrib/telemetry-and-cost-optimized-eks/ that layers CloudWatch-based observability and additional cost-optimization features on top of the base scripts. The base documentdb-playground/aws-setup/ scripts stay simple and DocumentDB-focused; users who want the full telemetry + cost-optimized variant opt in to this contrib folder.

The variant is delivered as 7 logical commits:

  1. scaffold: self-contained fork of aws-setup/ with the simple options from #349 (feat: add NODE_TYPE, EKS_VERSION, CLUSTER_TAGS, USE_SPOT options to AWS EKS setup) and a distinct CLUSTER_NAME default.
  2. cost / QoL: 2-AZ deployment, CloudFormation stack event diagnostics, mongosh prerequisite warning.
  3. VPC endpoint: free S3 Gateway VPC endpoint (reduces NAT Gateway data-transfer cost) + teardown.
  4. control-plane logging + retention: --control-plane-log-types, --log-retention, log-group creation / cleanup.
  5. Container Insights: eks-pod-identity-agent, pod identity association, Amazon CloudWatch Observability EKS add-on, teardown (addon-deleted wait).
  6. DocumentDB diagnostics: post-deploy diagnose_documentdb + CloudWatch-aware troubleshooting block in print_summary.
  7. README: fork explanation, comparison table vs aws-setup/, logging model, cost model, troubleshooting quickstart.

Why

Keeps the core aws-setup/ scripts easy to maintain (only the four simple options from #349) while still making the advanced CloudWatch + cost-optimization work available to users who want it. The default CLUSTER_NAME is documentdb-contrib-cluster, so this variant can run alongside the base setup without name collisions.

Test plan

  • ./documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh --deploy-instance — cluster created on m7g.large with default tags, S3 Gateway VPC endpoint attached, amazon-cloudwatch-observability add-on reports ACTIVE.
  • CloudWatch log groups created with 3-day retention: /aws/eks/<CLUSTER>/cluster, /aws/containerinsights/<CLUSTER>/application, .../dataplane, .../host, .../performance.
  • aws logs tail /aws/containerinsights/<CLUSTER>/application --region <REGION> --filter-pattern '{ $.kubernetes.namespace_name = "documentdb-operator" }' --since 5m returns operator logs.
  • kubectl get pods -n amazon-cloudwatch shows collector pods running.
  • ./documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh -y — add-on deleted, VPC endpoints removed, all log groups removed, CloudFormation stacks torn down, no orphan resources.
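
The add-on and retention checks in the test plan could be scripted roughly as follows. This is a minimal sketch and not part of the PR: the function names `verify_addon_active` and `list_log_retention` are hypothetical, and it assumes the AWS CLI is already configured for the target account.

```shell
# Hypothetical helpers for the test plan; not taken from the PR's scripts.

# Succeeds only when the CloudWatch Observability add-on reports ACTIVE.
verify_addon_active() {
  local cluster="$1" region="$2" status
  status=$(aws eks describe-addon --region "$region" --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability \
    --query 'addon.status' --output text)
  if [ "$status" != "ACTIVE" ]; then
    echo "add-on status is '$status', expected ACTIVE" >&2
    return 1
  fi
}

# Prints each Container Insights log group with its retention in days.
list_log_retention() {
  local cluster="$1" region="$2"
  aws logs describe-log-groups --region "$region" \
    --log-group-name-prefix "/aws/containerinsights/$cluster/" \
    --query 'logGroups[].[logGroupName,retentionInDays]' --output text
}
```

Both functions take the cluster name and region as positional arguments, for example `verify_addon_active documentdb-contrib-cluster <REGION>`.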

Related

Introduce documentdb-playground/contrib/telemetry-and-cost-optimized-eks/
as a self-contained fork of the base aws-setup/ scripts.

This scaffold commit ships:
- create-cluster.sh and delete-cluster.sh copied from aws-setup/ with the
  same simple options (NODE_TYPE/EKS_VERSION/USE_SPOT/CLUSTER_TAGS) that
  just landed in aws-setup/, so the contrib variant is usable standalone.
- Default CLUSTER_NAME changed to documentdb-contrib-cluster to avoid
  collisions when running alongside the base setup.
- Placeholder README that will be replaced with the full docs in a later
  commit on this branch.
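
The env-overridable defaults described above follow the usual shell pattern, sketched here. The CLUSTER_NAME default and the m7g.large node type come from this PR's description; the `USE_SPOT=false` default and the `print_config` helper are assumptions made for illustration.

```shell
# Each variable keeps a value from the environment and otherwise falls back
# to a default. USE_SPOT=false is an assumed default, not confirmed by the PR.
CLUSTER_NAME="${CLUSTER_NAME:-documentdb-contrib-cluster}"
NODE_TYPE="${NODE_TYPE:-m7g.large}"
USE_SPOT="${USE_SPOT:-false}"

# Hypothetical helper: print the effective configuration so a name collision
# with the base setup is easy to spot before anything is created.
print_config() {
  printf 'cluster=%s node_type=%s spot=%s\n' "$CLUSTER_NAME" "$NODE_TYPE" "$USE_SPOT"
}
```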

Subsequent commits will add 2-AZ deployment, CloudFormation diagnostics,
S3 Gateway VPC endpoint, EKS control-plane logging, CloudWatch Container
Insights, and CloudWatch-aware DocumentDB diagnostics.

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

…sh prereq

- Deploy to 2 AZs (the minimum EKS supports) instead of the eksctl default
  of 3 to reduce cross-AZ data transfer for dev/test clusters.
- Print CloudFormation stack events after eksctl create succeeds or fails,
  so users can see the underlying AWS resource status (helps when a create
  fails partway through).
- Warn (non-fatal) when mongosh is missing, matching the DocumentDB
  troubleshooting guidance of verifying client tooling before blaming the
  server on connection issues.
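
The three changes in this commit can be sketched as small shell functions. This is an illustration under assumptions, not the PR's code: the function names are hypothetical, and the stack name `eksctl-<cluster>-cluster` is the naming convention eksctl uses for the cluster stack.

```shell
# Hypothetical helpers; the PR's create-cluster.sh may implement these differently.

# Pick the first two available AZs in a region, comma-separated.
first_two_azs() {
  aws ec2 describe-availability-zones --region "$1" \
    --query "AvailabilityZones[?State=='available'].ZoneName" --output text |
    tr '\t' '\n' | head -n 2 | paste -sd, -
}

# Show recent events for the eksctl-managed cluster stack.
show_stack_events() {
  local cluster="$1" region="$2"
  aws cloudformation describe-stack-events --region "$region" \
    --stack-name "eksctl-$cluster-cluster" \
    --query 'StackEvents[0:15].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
    --output table
}

# Non-fatal prerequisite check: warn, but keep going.
warn_if_no_mongosh() {
  command -v mongosh >/dev/null 2>&1 ||
    echo "WARNING: mongosh not found; install it to test connections" >&2
}
```

The AZ list would then be passed to eksctl as `eksctl create cluster --zones "$(first_two_azs "$REGION")" ...`, alongside the script's other flags.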

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

Provision a free S3 Gateway VPC endpoint after the cluster is up and
attach it to every route table in the cluster VPC. S3 traffic from the
cluster then bypasses the NAT Gateway, eliminating NAT data-transfer
charges for S3 (useful for pulls from S3, CloudWatch exports, backups,
etc.).
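
A minimal sketch of the provisioning step, assuming the VPC ID has already been looked up. The function name `create_s3_gateway_endpoint` is hypothetical and the PR's implementation may differ.

```shell
# Hypothetical helper. The VPC ID can be read from the cluster with:
#   aws eks describe-cluster --name "$CLUSTER_NAME" --region "$REGION" \
#     --query 'cluster.resourcesVpcConfig.vpcId' --output text

# Create an S3 Gateway endpoint and attach it to every route table in the VPC.
create_s3_gateway_endpoint() {
  local vpc_id="$1" region="$2" rtb_ids
  rtb_ids=$(aws ec2 describe-route-tables --region "$region" \
    --filters "Name=vpc-id,Values=$vpc_id" \
    --query 'RouteTables[].RouteTableId' --output text)
  if [ -z "$rtb_ids" ]; then
    echo "no route tables found in $vpc_id" >&2
    return 1
  fi
  # $rtb_ids is left unquoted on purpose so each ID becomes its own argument.
  # shellcheck disable=SC2086
  aws ec2 create-vpc-endpoint --region "$region" \
    --vpc-id "$vpc_id" \
    --vpc-endpoint-type Gateway \
    --service-name "com.amazonaws.$region.s3" \
    --route-table-ids $rtb_ids
}
```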

Teardown in delete-cluster.sh enumerates and deletes every VPC endpoint
in the cluster VPC before the rest of the VPC dependency cleanup runs,
so the VPC can be fully destroyed by eksctl/CloudFormation.
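
The teardown could look roughly like this. It is a sketch with a hypothetical function name, and it deletes every endpoint in the VPC as the commit describes, so it should only be pointed at a VPC that the cluster owns.

```shell
# Hypothetical helper; the PR's delete-cluster.sh may differ.

# Delete every VPC endpoint in the given VPC so the VPC itself can be removed.
delete_vpc_endpoints() {
  local vpc_id="$1" region="$2" endpoint_ids
  endpoint_ids=$(aws ec2 describe-vpc-endpoints --region "$region" \
    --filters "Name=vpc-id,Values=$vpc_id" \
    --query 'VpcEndpoints[].VpcEndpointId' --output text)
  if [ -z "$endpoint_ids" ]; then
    echo "no VPC endpoints to delete in $vpc_id"
    return 0
  fi
  # Unquoted on purpose: one argument per endpoint ID.
  # shellcheck disable=SC2086
  aws ec2 delete-vpc-endpoints --region "$region" --vpc-endpoint-ids $endpoint_ids
}
```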

Summary output in create-cluster.sh is updated to list the endpoint.

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

- LOG_RETENTION_DAYS (default: 3) and CONTROL_PLANE_LOG_TYPES
  (default: api,authenticator) variables, both overridable via env or
  the new --log-retention / --control-plane-log-types flags.
- enable_control_plane_logging() calls eksctl utils update-cluster-logging
  to ship the selected control-plane log streams to CloudWatch.
- set_log_retention() applies the retention policy to the cluster's log
  groups (control plane + container insights) with lazy-create and retry,
  since the container-insights groups only appear after the collector
  emits its first log.
- delete-cluster.sh gets delete_cloudwatch_logs() and wires it into main()
  (both the "cluster still exists" and "cluster already gone" paths) so
  teardown leaves no lingering log groups.
- Help text, configuration banner, and post-install summary updated to
  reflect the new options.
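
The logging flow described in this commit can be sketched as follows. The defaults and log group names come from this PR; the helper names `set_retention_for_group`, `log_groups_for_cluster`, and `delete_cluster_log_groups`, and the three-attempt retry, are assumptions made for illustration.

```shell
LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-3}"
CONTROL_PLANE_LOG_TYPES="${CONTROL_PLANE_LOG_TYPES:-api,authenticator}"

# Ship the selected control-plane log streams to CloudWatch.
enable_control_plane_logging() {
  eksctl utils update-cluster-logging --cluster "$1" --region "$2" \
    --enable-types "$CONTROL_PLANE_LOG_TYPES" --approve
}

# List the log groups this variant manages for a cluster.
log_groups_for_cluster() {
  printf '%s\n' "/aws/eks/$1/cluster" \
    "/aws/containerinsights/$1/application" \
    "/aws/containerinsights/$1/dataplane" \
    "/aws/containerinsights/$1/host" \
    "/aws/containerinsights/$1/performance"
}

# Lazy-create a log group (ignoring "already exists"), then set retention.
# Retries a few times because the group may not be visible immediately.
set_retention_for_group() {
  local group="$1" region="$2" attempt
  for attempt in 1 2 3; do
    aws logs create-log-group --region "$region" \
      --log-group-name "$group" 2>/dev/null || true
    if aws logs put-retention-policy --region "$region" \
      --log-group-name "$group" --retention-in-days "$LOG_RETENTION_DAYS"; then
      return 0
    fi
    sleep "$attempt"
  done
  echo "could not set retention on $group" >&2
  return 1
}

# Teardown: remove every managed log group, tolerating ones that never existed.
delete_cluster_log_groups() {
  local cluster="$1" region="$2" group
  log_groups_for_cluster "$cluster" | while read -r group; do
    aws logs delete-log-group --region "$region" \
      --log-group-name "$group" 2>/dev/null || true
  done
}
```

Creating the group before setting retention is what lets the policy apply even when the collector has not yet emitted its first log line.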

This keeps the slim aws-setup/ scripts unchanged; the advanced logging
behavior only applies when using the contrib variant.

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

- install_cloudwatch_observability_addon() in create-cluster.sh installs
  the eks-pod-identity-agent add-on (if needed), wires up the
  amazon-cloudwatch/cloudwatch-agent pod identity association with the
  CloudWatchAgentServerPolicy permission, then installs (or updates) the
  amazon-cloudwatch-observability EKS add-on and waits for it to become
  ACTIVE. This is the managed replacement for hand-rolled Fluent Bit
  manifests and enables Container Insights log shipping.
- Wired into main() after install_cert_manager, before set_log_retention,
  so retention policies apply to the container-insights log groups the
  add-on creates.
- delete-cluster.sh deletes the add-on (waiting for addon-deleted) before
  uninstalling the AWS Load Balancer Controller and cert-manager, so the
  collector stops writing logs before the cluster is torn down.
- Summary output lists the add-on alongside control-plane logging.
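
The install and teardown order described above might look like this in outline. It is a sketch, not the PR's function: the names are hypothetical, it does not handle the update path or an already-existing association, and it assumes an IAM role with CloudWatchAgentServerPolicy attached (and a trust policy suitable for EKS Pod Identity) already exists and is passed in as an ARN.

```shell
# Hypothetical helpers; the PR's install_cloudwatch_observability_addon() may differ.

install_observability_addon() {
  local cluster="$1" region="$2" role_arn="$3"

  # Install the Pod Identity agent only if it is not already present.
  if ! aws eks describe-addon --region "$region" --cluster-name "$cluster" \
    --addon-name eks-pod-identity-agent >/dev/null 2>&1; then
    aws eks create-addon --region "$region" --cluster-name "$cluster" \
      --addon-name eks-pod-identity-agent || return 1
  fi

  # Bind the collector's service account to the IAM role.
  aws eks create-pod-identity-association --region "$region" \
    --cluster-name "$cluster" --namespace amazon-cloudwatch \
    --service-account cloudwatch-agent --role-arn "$role_arn" || return 1

  # Install the managed add-on and block until it is ACTIVE.
  aws eks create-addon --region "$region" --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability || return 1
  aws eks wait addon-active --region "$region" --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability
}

# Delete the add-on and wait, so the collector stops writing before teardown.
remove_observability_addon() {
  local cluster="$1" region="$2"
  aws eks delete-addon --region "$region" --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability || return 1
  aws eks wait addon-deleted --region "$region" --cluster-name "$cluster" \
    --addon-name amazon-cloudwatch-observability
}
```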

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

- diagnose_documentdb() runs after deploy_documentdb_instance() (no-op
  when --skip-instance is set). It mirrors the DocumentDB troubleshooting
  playbook: check pods, tail recent operator + instance logs, verify
  the service, and run an in-cluster mongosh ping via a throwaway pod.
  Uses kubectl logs (not CloudWatch) intentionally, since the add-on
  may not have flushed the last few lines yet.
- Extend print_summary with a CloudWatch log-group listing and a full
  troubleshooting block that points users at:
    - aws logs tail with filter-pattern examples for operator and
      instance logs (primary path via Container Insights),
    - kubectl logs -f fallback,
    - add-on health checks,
    - local tooling sanity checks,
    - port-forward + mongosh endpoint validation,
    - TLS / self-signed certificate guidance.
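
A rough sketch of the checks listed above. The helper names are hypothetical, the connection string is supplied by the caller, and the `mongo:7` image (which bundles mongosh) is an assumption rather than what the PR uses.

```shell
# Hypothetical helpers; the PR's diagnose_documentdb() may differ.

# Tail recent logs from every pod in a namespace via kubectl (not CloudWatch),
# since the add-on may not have flushed the last few lines yet.
tail_pod_logs() {
  local ns="$1" pod
  for pod in $(kubectl get pods -n "$ns" -o name); do
    echo "== $pod =="
    kubectl logs -n "$ns" "$pod" --all-containers --tail=50 || true
  done
}

# Run a one-off mongosh ping from a throwaway pod inside the cluster.
# $2 is a caller-supplied connection string; the image tag is an assumption.
mongosh_ping() {
  local ns="$1" uri="$2"
  kubectl run mongosh-ping -n "$ns" --rm -i --restart=Never --image=mongo:7 \
    --command -- mongosh "$uri" --quiet --eval 'db.runCommand({ ping: 1 })'
}

# CloudWatch path once Container Insights is flowing: filter operator logs.
tail_operator_logs_cloudwatch() {
  local cluster="$1" region="$2"
  aws logs tail "/aws/containerinsights/$cluster/application" --region "$region" \
    --filter-pattern '{ $.kubernetes.namespace_name = "documentdb-operator" }' \
    --since 5m
}
```

For a self-signed certificate, the TLS options would go in the connection string or as extra mongosh flags chosen by the caller.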

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor

Replace the scaffold README with documentation covering:
- What this variant adds vs base aws-setup/ (comparison table).
- Simple options inherited from aws-setup/ and contrib-only logging
  options (--log-retention, --control-plane-log-types).
- Logging model table (which log group carries what) plus ready-to-use
  aws logs tail filter-pattern examples for operator/instance/control
  plane.
- Cost-optimization summary (Graviton default, Spot, 2-AZ, S3 Gateway,
  retention) with a rough dev/test cost ballpark.
- Troubleshooting quickstart that mirrors the block printed by
  create-cluster.sh so users see consistent guidance.
- Teardown summary covering the add-on, VPC endpoints, log groups, and
  the cluster.

Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor