feat(observability-on-aws): Add AWS Observability & FinOps plugin #68
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| { | ||
| "author": { | ||
| "name": "Amazon Web Services" | ||
| }, | ||
| "description": "Comprehensive AWS observability and FinOps platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis for monitoring, troubleshooting, cost optimization, and incident response.", | ||
| "homepage": "https://github.com/awslabs/agent-plugins", | ||
| "keywords": [ | ||
| "aws", | ||
| "observability", | ||
| "cloudwatch", | ||
| "monitoring", | ||
| "logs", | ||
| "metrics", | ||
| "alarms", | ||
| "application-signals", | ||
| "apm", | ||
| "cloudtrail", | ||
| "security", | ||
| "tracing", | ||
| "billing", | ||
| "cost-management", | ||
| "finops", | ||
| "incident-response" | ||
| ], | ||
| "license": "Apache-2.0", | ||
| "name": "observability-on-aws", | ||
| "repository": "https://github.com/awslabs/agent-plugins", | ||
| "version": "1.0.0" | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,52 @@ | ||
| { | ||
| "mcpServers": { | ||
| "awsknowledge": { | ||
| "type": "http", | ||
| "url": "https://knowledge-mcp.global.api.aws" | ||
| }, | ||
| "awslabs.billing-cost-management-mcp-server": { | ||
| "args": [ | ||
| "awslabs.billing-cost-management-mcp-server@latest" | ||
| ], | ||
| "command": "uvx", | ||
| "env": { | ||
| "AWS_PROFILE": "default", | ||
| "AWS_REGION": "us-east-1", | ||
| "FASTMCP_LOG_LEVEL": "ERROR" | ||
| } | ||
| }, | ||
| "awslabs.cloudtrail-mcp-server": { | ||
| "args": [ | ||
| "awslabs.cloudtrail-mcp-server@latest" | ||
| ], | ||
| "command": "uvx", | ||
| "env": { | ||
| "AWS_PROFILE": "default", | ||
| "AWS_REGION": "us-east-1", | ||
| "FASTMCP_LOG_LEVEL": "ERROR" | ||
| } | ||
| }, | ||
| "awslabs.cloudwatch-applicationsignals-mcp-server": { | ||
| "args": [ | ||
| "awslabs.cloudwatch-applicationsignals-mcp-server@latest" | ||
| ], | ||
| "command": "uvx", | ||
| "env": { | ||
| "AWS_PROFILE": "default", | ||
| "AWS_REGION": "us-east-1", | ||
| "FASTMCP_LOG_LEVEL": "ERROR" | ||
| } | ||
| }, | ||
| "awslabs.cloudwatch-mcp-server": { | ||
| "args": [ | ||
| "awslabs.cloudwatch-mcp-server@latest" | ||
| ], | ||
| "command": "uvx", | ||
| "env": { | ||
| "AWS_PROFILE": "default", | ||
| "AWS_REGION": "us-east-1", | ||
| "FASTMCP_LOG_LEVEL": "ERROR" | ||
| } | ||
| } | ||
| } | ||
| } |
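
The configuration above mixes one HTTP server with several stdio servers launched through `uvx`, and the stdio servers depend on `AWS_PROFILE` and `AWS_REGION` being present in their `env` block. A minimal sketch of a sanity check for that shape follows. The function name `check_mcp_config` and the `REQUIRED_ENV` tuple are hypothetical helpers introduced here for illustration, not part of the plugin, and the embedded config is an excerpt with two of the five servers.

```python
import json

# Excerpt of the MCP configuration from the diff above (two of the five servers).
MCP_CONFIG = json.loads("""
{
  "mcpServers": {
    "awsknowledge": {
      "type": "http",
      "url": "https://knowledge-mcp.global.api.aws"
    },
    "awslabs.cloudwatch-mcp-server": {
      "args": ["awslabs.cloudwatch-mcp-server@latest"],
      "command": "uvx",
      "env": {
        "AWS_PROFILE": "default",
        "AWS_REGION": "us-east-1",
        "FASTMCP_LOG_LEVEL": "ERROR"
      }
    }
  }
}
""")

# Assumption: these are the env keys every stdio server needs to reach AWS.
REQUIRED_ENV = ("AWS_PROFILE", "AWS_REGION")


def check_mcp_config(config: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, server in config.get("mcpServers", {}).items():
        if server.get("type") == "http":
            # HTTP servers only need a reachable https endpoint.
            if not server.get("url", "").startswith("https://"):
                problems.append(f"{name}: http server needs an https url")
            continue
        # Everything else is treated as a stdio server started by a command.
        if "command" not in server:
            problems.append(f"{name}: missing command")
        env = server.get("env", {})
        for key in REQUIRED_ENV:
            if not env.get(key):
                problems.append(f"{name}: env is missing {key}")
    return problems


if __name__ == "__main__":
    for problem in check_mcp_config(MCP_CONFIG):
        print(problem)
```

A check like this could run in CI before the plugin is published, so that a server added later without region or profile settings is caught early. Whether the maintainers want such a check is an open question.
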
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,88 @@ | ||||||||
| --- | ||||||||
| name: observability-on-aws | ||||||||
| description: "Comprehensive AWS observability platform combining CloudWatch Logs, Metrics, Alarms, Application Signals (APM), CloudTrail security auditing, Billing & Cost Management, and automated codebase observability gap analysis. Triggers on phrases like: CloudWatch logs, metrics, alarms, monitoring, observability, application signals, APM, distributed tracing, performance, latency, errors, troubleshooting, root cause analysis, security audit, CloudTrail, log analysis, alerting, SLO, incident response, observability gaps, missing instrumentation, AWS costs, billing, cost anomaly." | ||||||||
| --- | ||||||||
|
|
||||||||
| # AWS Observability | ||||||||
|
|
||||||||
| Requires AWS CLI credentials. All stdio MCP servers use `AWS_PROFILE` and `AWS_REGION` from their env config (defaults: `default` profile, `us-east-1`). | ||||||||
|
|
||||||||
|
||||||||
| Note: This plugin is read-only. It should only query and inspect AWS resources and provide recommendations. It must not provision, modify, or delete AWS resources unless the user explicitly asks for a change, and such changes should preferably be executed via a dedicated deployment or provisioning workflow/plugin. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Advanced Alerting | ||
|
|
||
| ## Purpose | ||
|
|
||
| Advanced alerting patterns including composite alarms, anomaly detection, SLO-based alerting, and alarm tuning. For basic alarm setup and configuration patterns, see `alerting-setup.md`. | ||
|
|
||
| ## Composite Alarms | ||
|
|
||
| **Service Health** - combine metrics to determine overall health: | ||
|
|
||
| ``` | ||
| Composite Alarm: "api-service-unhealthy" | ||
| Logic: (high-error-rate OR high-latency) AND low-success-rate | ||
| Components: Errors > 5%, p99 Latency > 2000ms, Success rate < 95% | ||
| ``` | ||
|
|
||
| **Dependency Failure** - detect cascading failures: | ||
|
|
||
| ``` | ||
| Composite Alarm: "service-and-dependency-down" | ||
| Logic: service-errors AND (database-errors OR cache-errors) | ||
| Components: Lambda Errors > 10, RDS CPU > 90%, ElastiCache Evictions > 1000 | ||
| ``` | ||
|
|
||
| ## Anomaly Detection Alarms | ||
|
|
||
| **When to Use**: Metrics with predictable patterns (daily/weekly cycles), metrics where absolute thresholds are hard to define, or detecting unusual behavior vs normal patterns. | ||
|
|
||
| ``` | ||
| Metric: AWS/ApiGateway - Count | Anomaly Detection: Enabled | ||
| Threshold: 2 standard deviations | Evaluation Period: 10 min | ||
| Rationale: Learns normal request patterns, adapts to traffic growth over time | ||
| ``` | ||
|
|
||
| ## SLO-Based Alerting | ||
|
|
||
| Use SLO error budget consumption to drive alerting thresholds: | ||
|
|
||
| ``` | ||
| SLO: 99.9% availability (30-day window) | ||
| Error Budget: 0.1% = 43.2 minutes downtime/month | ||
| Warning (50% consumed, 21.6 min): Notify team, review recent changes | ||
| Critical (80% consumed, 34.6 min): Page on-call, implement feature freeze | ||
| Emergency (100% consumed, 43.2 min): All hands, immediate mitigation | ||
| ``` | ||
|
|
||
| **Implementation**: Set up SLO in Application Signals, create CloudWatch alarm on error budget metric, configure multi-level thresholds, link to incident response procedures. | ||
|
|
||
| ## Alarm Actions | ||
|
|
||
| - **Critical**: Page on-call (PagerDuty/Opsgenie), post to critical alerts channel, create high-priority ticket | ||
| - **Warning**: Post to team channel, create normal-priority ticket, email team distribution list | ||
| - **Info**: Log to monitoring system, email individual owner, no immediate action required | ||
|
|
||
| ## Alarm Tuning and Maintenance | ||
|
|
||
| ### Reducing False Positives | ||
|
|
||
| When alarms trigger frequently without real issues (alarm fatigue): | ||
|
|
||
| 1. **Adjust Thresholds**: Review alarm history, analyze patterns, increase threshold if too sensitive, use percentiles instead of max/min | ||
| 2. **Increase Datapoints to Alarm**: Change from 1 out of 1 to 2 out of 3 datapoints so that only a sustained breach triggers the alarm | ||
| 3. **Use Composite Alarms**: Combine multiple signals for more accurate detection | ||
| 4. **Implement Maintenance Windows**: Suppress alarm notifications during deployments by temporarily disabling alarm actions, or by using action suppression on composite alarms | ||
|
|
||
| ### Handling Alarm Flapping | ||
|
|
||
| When an alarm rapidly switches between OK and ALARM: | ||
|
|
||
| 1. **Increase Evaluation Period**: Longer time windows smooth oscillations | ||
| 2. **Add Hysteresis**: Different thresholds for alarm and recovery (e.g., alarm at 80%, recover at 70%) | ||
| 3. **Use Anomaly Detection**: Adapts to patterns, less sensitive to threshold proximity | ||
|
|
||
| ## Alarm Testing | ||
|
|
||
| **Test Checklist**: Alarm triggers on breach, recovers on return to normal, actions execute correctly, description is actionable, runbook link works, on-call receives notification within SLA. | ||
|
|
||
| **Testing Approaches**: | ||
|
|
||
| 1. **Synthetic Testing**: Inject errors or load, verify alarm triggers, confirm notifications | ||
| 2. **Historical Analysis**: Review past incidents, check if alarm would have triggered, adjust as needed | ||
| 3. **Chaos Engineering**: Deliberately cause failures, validate detection and incident response | ||
|
|
||
| ## Integration with Incident Response | ||
|
|
||
| **Alarm-Triggered Investigation**: Alarm triggers notification, on-call checks details, query CloudWatch Logs for errors, analyze Application Signals traces, check CloudTrail for recent changes (use data source priority), implement mitigation, update alarm if needed. | ||
|
|
||
| **Proactive Monitoring**: Review alarm history daily, identify patterns and trends, tune thresholds before issues occur, add missing alarms for coverage gaps, document learnings in runbooks. |
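
The error budget figures in the SLO-Based Alerting section of this file can be reproduced with a few lines of arithmetic. The sketch below assumes the 30-day window and the 50% / 80% / 100% tiers stated in the document. The function names `error_budget_minutes` and `severity` are hypothetical, and the code does not call Application Signals or CloudWatch.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo_percent / 100) * window_days * MINUTES_PER_DAY


# Tiers from the document, checked from most to least severe:
# fraction of the error budget consumed -> alert level.
TIERS = [(1.0, "emergency"), (0.8, "critical"), (0.5, "warning")]


def severity(consumed_minutes: float, budget_minutes: float) -> str:
    """Map consumed downtime to the alert level of the highest tier reached."""
    fraction = consumed_minutes / budget_minutes
    for threshold, level in TIERS:
        if fraction >= threshold:
            return level
    return "ok"


if __name__ == "__main__":
    budget = error_budget_minutes(99.9)
    print(f"budget: {budget:.1f} minutes")
    for consumed in (10, 22, 35, 44):
        print(consumed, severity(consumed, budget))
```

For a 99.9% SLO over 30 days the budget is 0.001 × 43,200 minutes, which matches the 43.2 minutes quoted in the document. The tier boundaries then fall at 21.6 and 34.56 minutes, the latter rounded to 34.6 in the file.
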
This marketplace entry description omits Billing/Cost Management, but the plugin ships with the billing-cost-management MCP server and the skill docs describe cost analysis workflows. Updating the marketplace description/keywords would make the listing accurately reflect the plugin's capabilities.