feat(contrib): add telemetry-and-cost-optimized-eks variant #352
Open
michaelraney wants to merge 7 commits into documentdb:main
Introduce documentdb-playground/contrib/telemetry-and-cost-optimized-eks/ as a self-contained fork of the base aws-setup/ scripts. This scaffold commit ships:
- create-cluster.sh and delete-cluster.sh copied from aws-setup/ with the same simple options (NODE_TYPE/EKS_VERSION/USE_SPOT/CLUSTER_TAGS) that just landed in aws-setup/, so the contrib variant is usable standalone.
- Default CLUSTER_NAME changed to documentdb-contrib-cluster to avoid collisions when running alongside the base setup.
- Placeholder README that will be replaced with the full docs in a later commit on this branch.
Subsequent commits will add 2-AZ deployment, CloudFormation diagnostics, S3 Gateway VPC endpoint, EKS control-plane logging, CloudWatch Container Insights, and CloudWatch-aware DocumentDB diagnostics.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
…sh prereq
- Deploy to 2 AZs (the minimum EKS supports) instead of the eksctl default of 3 to reduce per-node cross-AZ data transfer for dev/test clusters.
- Print CloudFormation stack events after eksctl create succeeds/fails, so users can see the underlying AWS resource status (helps when a create is failing partway through).
- Warn (non-fatal) when mongosh is missing, matching the DocumentDB troubleshooting guidance of verifying client tooling before blaming the server on connection issues.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
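The 2-AZ selection and the non-fatal mongosh check described in this commit could look roughly like the sketch below. The helper names pick_two_azs and warn_if_missing are hypothetical and not taken from create-cluster.sh, and passing the result to eksctl via --zones is an assumption about how the script pins the zones.

```shell
# Hypothetical helpers; the real create-cluster.sh may be structured differently.

# Read whitespace-separated zone names on stdin and print the first two, comma-joined.
pick_two_azs() {
    tr -s '[:space:]' '\n' | sed '/^$/d' | head -n 2 | paste -sd, -
}

# Warn on stderr when an optional client tool is missing, but never fail.
warn_if_missing() {
    local tool="$1"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "WARNING: $tool not found on PATH; client-side connection checks will be skipped." >&2
    fi
    return 0
}

# Possible usage (assumed wiring, requires AWS credentials, so left commented out):
#   zones="$(aws ec2 describe-availability-zones --region "$REGION" \
#       --query 'AvailabilityZones[].ZoneName' --output text | pick_two_azs)"
#   eksctl create cluster --name "$CLUSTER_NAME" --region "$REGION" --zones "$zones"
#   warn_if_missing mongosh
```

Keeping the warning non-fatal means a missing client tool never blocks cluster creation; it only limits the later connection checks.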
Provision a free S3 Gateway VPC endpoint after the cluster is up and attach it to every route table in the cluster VPC. S3 traffic from the cluster then bypasses the NAT Gateway, eliminating NAT data-transfer charges for S3 (useful for pulls from S3, CloudWatch exports, backups, etc.).
Teardown in delete-cluster.sh enumerates and deletes every VPC endpoint in the cluster VPC before the rest of the VPC dependency cleanup runs, so the VPC can be fully destroyed by eksctl/CloudFormation.
Summary output in create-cluster.sh is updated to list the endpoint.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
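As a rough illustration of the endpoint provisioning and teardown described above, the following sketch uses standard aws ec2 subcommands. The function names and argument order are hypothetical; the actual script may look up the VPC ID differently or add tags.

```shell
# Hypothetical helpers; names and arguments are illustrative, not from the PR.

# Create an S3 Gateway endpoint and attach it to every route table in the VPC.
create_s3_gateway_endpoint() {
    local vpc_id="$1" region="$2" route_tables
    route_tables="$(aws ec2 describe-route-tables \
        --region "$region" \
        --filters "Name=vpc-id,Values=$vpc_id" \
        --query 'RouteTables[].RouteTableId' \
        --output text)"
    # Word splitting is intended here: one argument per route table ID.
    aws ec2 create-vpc-endpoint \
        --region "$region" \
        --vpc-id "$vpc_id" \
        --vpc-endpoint-type Gateway \
        --service-name "com.amazonaws.${region}.s3" \
        --route-table-ids $route_tables
}

# Delete every VPC endpoint in the VPC so the VPC itself can be removed afterwards.
delete_vpc_endpoints() {
    local vpc_id="$1" region="$2" endpoint_ids
    endpoint_ids="$(aws ec2 describe-vpc-endpoints \
        --region "$region" \
        --filters "Name=vpc-id,Values=$vpc_id" \
        --query 'VpcEndpoints[].VpcEndpointId' \
        --output text)"
    if [ -z "$endpoint_ids" ]; then
        echo "No VPC endpoints found in $vpc_id"
        return 0
    fi
    aws ec2 delete-vpc-endpoints --region "$region" --vpc-endpoint-ids $endpoint_ids
}
```

Gateway endpoints work by adding routes to the associated route tables, which is why the sketch attaches the endpoint to all of them rather than to subnets.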
- LOG_RETENTION_DAYS (default: 3) and CONTROL_PLANE_LOG_TYPES (default: api,authenticator) variables, both overridable via env or the new --log-retention / --control-plane-log-types flags.
- enable_control_plane_logging() calls eksctl utils update-cluster-logging to ship the selected control-plane log streams to CloudWatch.
- set_log_retention() applies the retention policy to the cluster's log groups (control plane + container insights) with lazy-create and retry, since the container-insights groups only appear after the collector emits its first log.
- delete-cluster.sh gets delete_cloudwatch_logs() and wires it into main() (both the "cluster still exists" and "cluster already gone" paths) so teardown leaves no lingering log groups.
- Help text, configuration banner, and post-install summary updated to reflect the new options.
This keeps the slim aws-setup/ scripts unchanged; the advanced logging behavior only applies when using the contrib variant.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
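A minimal sketch of the lazy-create plus retention step follows. CloudWatch Logs only accepts a fixed set of retention values, so the sketch validates the input first. The function names is_valid_retention_days and apply_retention are hypothetical, and the retry loop mentioned in the commit is omitted for brevity.

```shell
# Hypothetical helpers; the PR's set_log_retention() may differ in detail.

# CloudWatch Logs accepts only these retention periods (in days).
is_valid_retention_days() {
    case "$1" in
        1|3|5|7|14|30|60|90|120|150|180|365|400|545|731|1096|1827|2192|2557|2922|3288|3653) return 0 ;;
        *) return 1 ;;
    esac
}

# Create the log group if it does not exist yet, then set its retention policy.
apply_retention() {
    local group="$1" days="$2" region="$3"
    if ! is_valid_retention_days "$days"; then
        echo "Invalid retention period: $days days" >&2
        return 1
    fi
    # create-log-group fails when the group already exists; that error is ignored on purpose.
    aws logs create-log-group --region "$region" --log-group-name "$group" >/dev/null 2>&1 || true
    aws logs put-retention-policy \
        --region "$region" \
        --log-group-name "$group" \
        --retention-in-days "$days"
}

# Possible usage (assumed wiring, left commented out because it needs AWS credentials):
#   eksctl utils update-cluster-logging --cluster "$CLUSTER_NAME" --region "$REGION" \
#       --enable-types "$CONTROL_PLANE_LOG_TYPES" --approve
#   apply_retention "/aws/eks/$CLUSTER_NAME/cluster" "$LOG_RETENTION_DAYS" "$REGION"
```

Creating the group up front lets the retention policy be set before the collector writes its first log line, which is one way to handle groups that only appear later.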
- install_cloudwatch_observability_addon() in create-cluster.sh installs the eks-pod-identity-agent add-on (if needed), wires up the amazon-cloudwatch/cloudwatch-agent pod identity association with the CloudWatchAgentServerPolicy permission, then installs (or updates) the amazon-cloudwatch-observability EKS add-on and waits for it to become ACTIVE. This is the managed replacement for hand-rolled Fluent Bit manifests and enables Container Insights log shipping.
- Wired into main() after install_cert_manager, before set_log_retention, so retention policies apply to the container-insights log groups the add-on creates.
- delete-cluster.sh deletes the add-on (waiting for addon-deleted) before uninstalling the AWS Load Balancer Controller and cert-manager, so the collector stops writing logs before the cluster is torn down.
- Summary output lists the add-on alongside control-plane logging.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
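The install-or-update flow and the teardown wait could be sketched as below, using standard aws eks subcommands. The function names are hypothetical, and the eks-pod-identity-agent and pod identity association steps are left out because the IAM role they need is not shown in this PR text.

```shell
# Hypothetical helpers; simplified relative to the flow described in the commit.

# Install the add-on if absent, update it otherwise, then wait until it is ACTIVE.
install_observability_addon() {
    local cluster="$1" region="$2" addon="amazon-cloudwatch-observability"
    if aws eks describe-addon --region "$region" --cluster-name "$cluster" \
        --addon-name "$addon" >/dev/null 2>&1; then
        aws eks update-addon --region "$region" --cluster-name "$cluster" --addon-name "$addon"
    else
        aws eks create-addon --region "$region" --cluster-name "$cluster" --addon-name "$addon"
    fi
    aws eks wait addon-active --region "$region" --cluster-name "$cluster" --addon-name "$addon"
}

# Delete the add-on and block until EKS reports it gone, so the collector stops writing logs.
remove_observability_addon() {
    local cluster="$1" region="$2" addon="amazon-cloudwatch-observability"
    aws eks delete-addon --region "$region" --cluster-name "$cluster" --addon-name "$addon"
    aws eks wait addon-deleted --region "$region" --cluster-name "$cluster" --addon-name "$addon"
}
```

Waiting on addon-deleted before removing other components matches the ordering the commit describes: the collector is gone before log groups and the cluster are cleaned up.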
- diagnose_documentdb() runs after deploy_documentdb_instance() (no-op
  when --skip-instance). It mirrors the DocumentDB troubleshooting
  playbook: check pods, tail recent operator + instance logs, verify
  the service, and run an in-cluster mongosh ping via a throwaway pod.
  Uses kubectl logs (not CloudWatch) intentionally, since the add-on
  may not have flushed the last few lines yet.
- Extend print_summary with a CloudWatch log-group listing and a full
  troubleshooting block that points users at:
  - aws logs tail with filter-pattern examples for operator and
    instance logs (primary path via Container Insights),
  - kubectl logs -f fallback,
  - add-on health checks,
  - local tooling sanity checks,
  - port-forward + mongosh endpoint validation,
  - TLS / self-signed certificate guidance.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
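The two log paths named in this commit (aws logs tail through Container Insights, with kubectl logs as the fallback) could be wrapped as in the sketch below. All function names are hypothetical. The log group path and the filter pattern follow the example in the PR's test plan; the 5m and 50-line defaults are assumptions.

```shell
# Hypothetical helpers; not taken from create-cluster.sh.

# Build a CloudWatch Logs JSON filter pattern that matches one Kubernetes namespace.
namespace_filter() {
    printf '{ $.kubernetes.namespace_name = "%s" }' "$1"
}

# Primary path: tail the Container Insights application log group for a namespace.
tail_namespace_logs() {
    local cluster="$1" region="$2" namespace="$3" since="${4:-5m}"
    aws logs tail "/aws/containerinsights/${cluster}/application" \
        --region "$region" \
        --filter-pattern "$(namespace_filter "$namespace")" \
        --since "$since"
}

# Fallback path: read recent lines straight from the pods, bypassing CloudWatch.
kubectl_namespace_logs() {
    local namespace="$1" lines="${2:-50}" pod
    for pod in $(kubectl get pods -n "$namespace" -o name); do
        echo "--- $pod"
        kubectl logs -n "$namespace" "$pod" --all-containers --tail="$lines"
    done
}
```

For example, tail_namespace_logs "$CLUSTER_NAME" "$REGION" documentdb-operator would show recent operator logs once the add-on has shipped them, while kubectl_namespace_logs documentdb-operator reads the same pods directly when the newest lines have not reached CloudWatch yet.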
Replace the scaffold README with documentation covering:
- What this variant adds vs base aws-setup/ (comparison table).
- Simple options inherited from aws-setup/ and contrib-only logging options (--log-retention, --control-plane-log-types).
- Logging model table (which log group carries what) plus ready-to-use aws logs tail filter-pattern examples for operator/instance/control plane.
- Cost-optimization summary (Graviton default, Spot, 2-AZ, S3 Gateway, retention) with a rough dev/test cost ballpark.
- Troubleshooting quickstart that mirrors the block printed by create-cluster.sh so users see consistent guidance.
- Teardown summary covering the add-on, VPC endpoints, log groups, and the cluster.
Signed-off-by: michaelraney <raneymike83@gmail.com>
Made-with: Cursor
Summary
Follow-up to #349 as requested by @hossain-rayhan.
Adds a new self-contained variant of the AWS EKS playground under documentdb-playground/contrib/telemetry-and-cost-optimized-eks/ that layers CloudWatch-based observability and additional cost-optimization features on top of the base scripts. The base documentdb-playground/aws-setup/ scripts stay simple and DocumentDB-focused; users who want the full telemetry + cost-optimized variant opt in to this contrib folder.
The variant is delivered as 7 logical commits:
- Scaffold: copy of aws-setup/ with the simple options from #349 (feat: add NODE_TYPE, EKS_VERSION, CLUSTER_TAGS, USE_SPOT options to AWS EKS setup) and a distinct CLUSTER_NAME default.
- 2-AZ deployment, CloudFormation stack events, and a mongosh prerequisite warning.
- S3 Gateway VPC endpoint, with teardown in delete-cluster.sh.
- Control-plane logging: --control-plane-log-types, --log-retention, log-group creation / cleanup.
- Container Insights: eks-pod-identity-agent, pod identity association, Amazon CloudWatch Observability EKS add-on, teardown (addon-deleted wait).
- diagnose_documentdb + CloudWatch-aware troubleshooting block in print_summary.
- README: comparison with aws-setup/, logging model, cost model, troubleshooting quickstart.
Why
Keeps the core aws-setup/ scripts easy for the maintainers (only the four simple options from #349) while still making the advanced CloudWatch + cost-optimization work available to users who want it. Default CLUSTER_NAME is documentdb-contrib-cluster so this variant can be run alongside the base setup.
Test plan
- ./documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/create-cluster.sh --deploy-instance: cluster created on m7g.large with default tags, S3 Gateway VPC endpoint attached, amazon-cloudwatch-observability add-on reports ACTIVE.
- /aws/eks/<CLUSTER>/cluster, /aws/containerinsights/<CLUSTER>/application, .../dataplane, .../host, .../performance.
- aws logs tail /aws/containerinsights/<CLUSTER>/application --region <REGION> --filter-pattern '{ $.kubernetes.namespace_name = "documentdb-operator" }' --since 5m returns operator logs.
- kubectl get pods -n amazon-cloudwatch shows collector pods running.
- ./documentdb-playground/contrib/telemetry-and-cost-optimized-eks/scripts/delete-cluster.sh -y: add-on deleted, VPC endpoints removed, all log groups removed, CloudFormation stacks torn down, no orphan resources.
Related