AI-native SRE for Kubernetes incidents.
RootCause is a local-first MCP server that turns natural-language requests into evidence-backed incident analysis, Kubernetes diagnostics, and safer operations.
Built in Go as a single binary, RootCause is optimized for low-friction local workflows using your existing kubeconfig identity.
RootCause is built for SRE/operator workflows where speed matters, but unsafe automation is unacceptable.
- Stop context-switching: investigate incidents, rollout risk, Helm/Terraform/AWS signals, and remediation from one MCP server.
- AI-powered diagnostics: evidence-first analysis with RCA, timelines, and action-oriented next checks.
- Built-in cost optimization: combine resource usage, workload best-practice checks, Terraform plan analysis, and cloud context for optimization decisions.
- Enterprise-ready guardrails: role/namespace policy enforcement, redaction, read-only mode, destructive tool controls, and mutation preflight.
- Zero learning curve: ask natural-language operational questions and use provided prompt templates for common SRE flows.
- Universal compatibility: works with MCP-compatible clients across Claude, Cursor, Copilot, Codex, and more.
- Production-grade workflow: single Go binary, kubeconfig-native auth, deterministic structured outputs, and broad test coverage.
| Need | RootCause answer |
|---|---|
| "What changed and why did this break?" | rootcause.incident_bundle, rootcause.change_timeline, rootcause.rca_generate |
| "Is it safe to restart or roll out now?" | k8s.restart_safety_check, k8s.best_practice, k8s.safe_mutation_preflight |
| "Is my platform ecosystem healthy?" | k8s.*_detect + k8s.diagnose_* for ArgoCD/Flux/cert-manager/Kyverno/Gatekeeper/Cilium |
| "Can I standardize SRE responses?" | Prompt templates + structured output from shared render/evidence pipeline |
Ask your AI assistant in natural language:
- "Why did this deployment fail after rollout?"
- "Is this workload safe to restart right now?"
- "Why are ArgoCD apps out of sync?"
- "Is Flux healthy in this cluster?"
- "Why are certs failing to renew?"
- "Before patch/apply, is this mutation safe?"
RootCause keeps its depth-first model: evidence-first diagnosis, root-cause analysis, and remediation flow instead of raw tool sprawl.
Power users can map these prompts to concrete tools in this README (Complete Feature Set, Toolchains, and Tools sections).
- Build end-to-end incident evidence with rootcause.incident_bundle
- Generate probable causes with rootcause.rca_generate
- Export timeline and postmortem artifacts for follow-up
- Evaluate rollout/restart risk with k8s.restart_safety_check and k8s.best_practice
- Run k8s.safe_mutation_preflight before apply/patch/delete/scale operations
- ArgoCD: detect installation and diagnose sync/health drift
- Flux: detect controllers and diagnose reconciliation failures
- cert-manager / Kyverno / Gatekeeper / Cilium: detect footprint and diagnose control-plane or policy issues
| Area | RootCause Capability |
|---|---|
| Incident analysis | rootcause.incident_bundle, rootcause.rca_generate, rootcause.change_timeline, rootcause.postmortem_export, rootcause.capabilities |
| Kubernetes resilience | k8s.restart_safety_check, k8s.best_practice, k8s.safe_mutation_preflight |
| Ecosystem diagnostics | ArgoCD/Flux/cert-manager/Kyverno/Gatekeeper/Cilium via *_detect and diagnose_* tools |
| Deployment safety | Automatic preflight before k8s mutating operations |
| Helm operations | Chart search/list/get, release diff, rollback advisor, template apply/uninstall flows |
| Terraform analysis | Module/provider search + terraform.debug_plan for impact/risk analysis |
| Service mesh & scaling | Linkerd/Istio/Karpenter diagnostics with shared evidence model |
| Category | Representative capabilities |
|---|---|
| Kubernetes core (k8s.*) | CRUD, logs/events, graph-based debug flows, restart safety, best-practice scoring, mutation preflight |
| Ecosystem diagnostics | ArgoCD, Flux, cert-manager, Kyverno, Gatekeeper, Cilium via *_detect and diagnose_* |
| Incident intelligence (rootcause.*) | Incident bundle orchestration, timeline export, RCA generation, remediation playbook, postmortem export |
| Helm operations (helm.*) | Chart registry search/list/get, release status/diff, rollback advisor, install/upgrade/uninstall, template apply/uninstall |
| Terraform analysis (terraform.*) | Modules/providers/resources/data source discovery + plan debugging |
| Service mesh (istio.*, linkerd.*) | Proxy/config/status diagnostics, policy/routing visibility, mesh resource health |
| Cluster autoscaling (karpenter.*) | Provisioning, nodepool/nodeclass, interruption and scheduling diagnostics |
| Cloud context (aws.*, gcp.*) | AWS: IAM, VPC, EC2, EKS, ECR, STS, KMS diagnostics. GCP: Cloud Monitoring metrics + SLOs, Cloud Logging entries, workload-scoped error timelines for cross-layer incident analysis |
| Safety and controls | Read-only mode, destructive gating, explicit confirmation, auto preflight checks before mutating K8s operations |
Extend your AI coding agent with Kubernetes and RootCause expertise using the built-in skills library in skills/.
Skills metadata is schema-versioned and embedded in the CLI from internal/skills/catalog/manifest.json.
# Copy all skills to Claude
cp -r skills/claude/* ~/.claude/skills/
# Or install a specific skill
cp -r skills/claude/k8s-helm ~/.claude/skills/
One command syncs both prompts (as slash commands) and skills (as agent-side
guidance) into your AI client's native directories. By default everything is
synced; use --prompts-only or --skills-only for granular control. Custom
prompts under ~/.rootcause/prompts/ and custom skills under
~/.rootcause/skills/ are picked up automatically.
# List supported agents (shows both commands + skills directories)
rootcause sync --list-agents
# List everything available (built-in + custom prompts and skills)
rootcause sync --list
# Default: sync both prompts and skills for one agent (project-local)
rootcause sync --agent claude --project-dir .
# User-global: writes ~/.claude/commands/ and ~/.claude/skills/
rootcause sync --agent claude --user
# All supported agents
rootcause sync --all-agents
# Granular control
rootcause sync --agent claude --prompts-only
rootcause sync --agent claude --skills-only
rootcause sync --agent claude --prompt gcp_workload_diagnose
rootcause sync --agent claude --skill k8s-incident
# Existing files are NOT overwritten by default. Opt in:
rootcause sync --agent claude --overwrite
# Dry-run to see what would be written
rootcause sync --agent claude --dry-run
# Ignore custom directories and sync only built-ins
rootcause sync --agent claude --builtin-only
Prompts that take {{namespace}} / {{workload}} tokens get positional
$1 / $2 placeholders in the generated slash command. Optional {{name|default}} tokens
render with a "Defaults" preamble so the agent applies fallbacks when an
argument is omitted.
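The token conversion described above can be sketched roughly as follows. This is a minimal illustration, not RootCause's actual generator: the function name `to_slash_command`, the numbering by order of first appearance, and the exact wording of the "Defaults" preamble are all assumptions.

```python
import re

# Matches {{name}} and {{name|default}} tokens.
TOKEN = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:\|([^}]*))?\}\}")


def to_slash_command(template: str) -> str:
    """Replace named tokens with positional $1, $2, ... (hypothetical helper)."""
    positions = {}  # token name -> positional index, in order of first use
    defaults = {}   # token name -> default value, for optional tokens only

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name not in positions:
            positions[name] = len(positions) + 1
        if default is not None:
            defaults.setdefault(name, default.strip())
        return f"${positions[name]}"

    body = TOKEN.sub(substitute, template)
    if not defaults:
        return body
    # Assumed preamble layout; the real generated text may differ.
    lines = [f"- ${positions[n]} ({n}): {d}" for n, d in defaults.items()]
    return "Defaults:\n" + "\n".join(lines) + "\n\n" + body
```

Under these assumptions, a template with one required and one optional token yields a body using `$1` and `$2`, preceded by a short list the agent can consult when `$2` is omitted.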
Per-agent target directories:
| Agent | Slash commands | Skills |
|---|---|---|
| claude | .claude/commands/ | .claude/skills/ |
| cursor | .cursor/commands/ | .cursor/skills/ |
| codex | .codex/commands/ | .codex/skills/ |
| copilot | .github/prompts/ | .github/skills/ |
| gemini | .gemini/commands/ | .gemini/skills/ |
| opencode | .opencode/commands/ | .opencode/skills/ |
| windsurf | .windsurf/commands/ | .windsurf/skills/ |
| aider | .aider/commands/ | .aider/skills/ |
| devin, cody, amazonq | (no slash-command directory yet) | .devin/skills/ etc. |
Users can add team or personal skills in a folder containing one subdirectory per skill:
~/.rootcause/skills/
  team-runbook/
    SKILL.md
Use YAML front matter to standardize metadata and tags. Tags decide which MCP tool calls receive the skill as guidance:
---
category: Root Cause Analysis
description: Team-specific RCA checklist
tags: [rootcause, rca, payments]
---
# Team RCA
Always check the payments dashboard before declaring database root cause.
RootCause matches tags on every tool call using the toolset (rootcause, k8s, helm), exact tool name (rootcause.rca_generate), tool-name tokens (rca, events, timeline), plus optional call arguments skillTags or customSkillTags.
For all RootCause incident and issue analysis, tag the skill with rootcause. That applies it to every rootcause.* tool, including rootcause.incident_bundle, rootcause.rca_generate, rootcause.remediation_playbook, rootcause.postmortem_export, and rootcause.change_timeline.
Use narrower tags when the guidance should only apply to part of the flow:
| Goal | Recommended tags |
|---|---|
| All RootCause issue workflows | [rootcause] |
| RCA drafting only | [rca] or [rootcause.rca_generate] |
| Kubernetes issue analysis plus RootCause workflows | [rootcause, k8s, incident] |
| A team/service-specific workflow | [rootcause, payments] plus pass skillTags: ["payments"] when needed |
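The matching rules above can be pictured with a small sketch. It is not the server's implementation: `call_tags` and `matching_skills` are hypothetical helpers, and splitting the tool name on "." and "_" to obtain tokens is an assumption about how tool-name tokens are derived.

```python
import re


def call_tags(tool_name, arguments):
    """Collect every tag a tool call could match (hypothetical helper)."""
    toolset, _, rest = tool_name.partition(".")
    tags = {toolset, tool_name}
    # Assumed tokenization: split the part after the toolset on "." and "_".
    tags.update(token for token in re.split(r"[._]", rest) if token)
    # Optional call arguments that contribute extra tags.
    for key in ("skillTags", "customSkillTags"):
        tags.update(arguments.get(key, []))
    return tags


def matching_skills(skills, tool_name, arguments):
    """Return names of skills that share at least one tag with the call."""
    tags = call_tags(tool_name, arguments)
    return sorted(name for name, skill_tags in skills.items() if tags & set(skill_tags))
```

With this model, a skill tagged [rootcause] matches every rootcause.* call through the toolset tag, while a skill tagged only [payments] matches only when the caller passes skillTags: ["payments"].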
Sync custom skills into supported agent directories:
rootcause sync --agent opencode --skills-only
rootcause sync --agent claude --custom-skill-dir ~/.rootcause/skills --skill team-runbook
Expose custom skills through MCP resources by initializing the home config:
rootcause init-config
or adding them to config manually:
[skills]
custom_dirs = ["~/.rootcause/skills", "./skills/custom"]
allow_custom_overrides = false
MCP clients can read skill://catalog for the merged skill list and skill://team-runbook for the skill content. Custom names cannot collide with built-ins unless overrides are explicitly enabled.
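As a rough sketch of the collision rule, the merge behind skill://catalog might behave like the function below. The name `merge_catalog` is hypothetical, and raising an error on collision is an assumption; the server may instead skip or warn about the conflicting skill.

```python
def merge_catalog(builtin, custom, allow_custom_overrides=False):
    """Merge built-in and custom skills into one catalog (hypothetical helper)."""
    collisions = sorted(set(builtin) & set(custom))
    if collisions and not allow_custom_overrides:
        # Assumed behavior: reject the merge rather than silently shadow a built-in.
        raise ValueError("custom skills collide with built-ins: " + ", ".join(collisions))
    # Custom entries win only when overrides are enabled or names are unique.
    return {**builtin, **custom}
```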
Every tool call includes matching tagged custom skills in response metadata/payload as customSkillGuidance, so MCP agents can consider team-specific runbook instructions during root-cause analysis and other workflows.
Syncing skills into .claude/, .codex/, .opencode/, or other agent-specific directories is optional. Claude, Codex, OpenCode, and any MCP-compatible client can use configured custom skills through RootCause tool responses and skill://... resources without local skill sync. Syncing is only needed when you want the agent's native skill system to discover RootCause skills outside MCP tool calls.
Do not put secrets, credentials, kubeconfigs, tokens, or private incident data in custom SKILL.md files. Matching skills can be returned in MCP tool responses for the connected client to read.
Skill file formats per agent:
| Agent | Format |
|---|---|
| Claude Code, Codex, Gemini CLI, OpenCode, Aider, Sourcegraph Cody, Amazon Q | SKILL.md |
| Cursor | .mdc |
| GitHub Copilot, Windsurf, Devin | plain .md |
For the matching slash-command directories per agent, see the unified table earlier in this section.
22 skills are currently included.
| Category | Skills |
|---|---|
| Incident Response | k8s-incident, rootcause-rca |
| Core and Operations | k8s-core, k8s-operations |
| Diagnostics and Debugging | k8s-diagnostics, k8s-troubleshoot |
| Deployment and Delivery | k8s-deploy, k8s-helm, k8s-rollouts |
| GitOps | k8s-gitops |
| Networking and Mesh | k8s-networking, k8s-service-mesh, k8s-cilium |
| Security and Policy | k8s-security, k8s-policy, k8s-gatekeeper, k8s-certs |
| Cost and Scaling | k8s-cost, k8s-autoscaling |
| Storage | k8s-storage |
| Browser Automation | k8s-browser |
| Cloud Observability | k8s-gcp |
Supported agents include Claude, Cursor, Codex, Gemini CLI, GitHub Copilot, Goose, Windsurf, Roo, Amp, and more.
Skills include consistent triggers, workflow steps, tool references, troubleshooting notes, and output contracts.
See skills/README.md for full documentation and skills/CATALOG.md for auto-generated catalog output.
Access Kubernetes data as browsable resources:
| Resource URI | Description |
|---|---|
| kubeconfig://contexts | List all available kubeconfig contexts |
| kubeconfig://current-context | Get current active context |
| namespace://current | Get current namespace |
| namespace://list | List all namespaces |
| cluster://info | Get cluster connection info |
| cluster://nodes | Get detailed node information |
| cluster://version | Get Kubernetes version |
| cluster://api-resources | List available API resources |
| manifest://deployments/{namespace}/{name} | Get deployment YAML |
| manifest://services/{namespace}/{name} | Get service YAML |
| manifest://pods/{namespace}/{name} | Get pod YAML |
| manifest://configmaps/{namespace}/{name} | Get ConfigMap YAML |
| manifest://secrets/{namespace}/{name} | Get secret YAML (data masked) |
| manifest://ingresses/{namespace}/{name} | Get ingress YAML |
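The manifest:// URIs follow a kind/namespace/name shape. A client-side helper for validating them before a resource read could look like the sketch below; `parse_manifest_uri` is a hypothetical name, and the set of kinds is simply copied from the table above.

```python
from urllib.parse import urlparse

# Kinds listed in the resource table above.
MANIFEST_KINDS = {"deployments", "services", "pods", "configmaps", "secrets", "ingresses"}


def parse_manifest_uri(uri):
    """Split manifest://<kind>/<namespace>/<name> into (kind, namespace, name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "manifest":
        raise ValueError(f"not a manifest URI: {uri}")
    kind = parsed.netloc  # the segment right after "manifest://"
    parts = [p for p in parsed.path.split("/") if p]
    if kind not in MANIFEST_KINDS or len(parts) != 2:
        raise ValueError(f"expected manifest://<kind>/<namespace>/<name>: {uri}")
    namespace, name = parts
    return kind, namespace, name
```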
Pre-built workflow prompts for Kubernetes and platform operations:
| Prompt | Description |
|---|---|
| troubleshoot_workload | Comprehensive troubleshooting guide for pods/deployments |
| deploy_application | Step-by-step deployment workflow |
| security_audit | Security scanning and RBAC analysis workflow |
| cost_optimization | Resource optimization and cost analysis workflow |
| disaster_recovery | Backup and recovery planning workflow |
| debug_networking | Network debugging for services and connectivity |
| scale_application | Scaling guide with HPA/VPA best practices |
| upgrade_cluster | Kubernetes cluster upgrade planning |
| sre_incident_commander | Severity-based SRE incident coordination workflow |
| istio_mesh_diagnose | Diagnose Istio control-plane and traffic policy issues |
| linkerd_mesh_diagnose | Diagnose Linkerd control-plane, proxy, and policy health |
| helm_release_recovery | Recover failed Helm install/upgrade with rollback strategy |
| terraform_drift_triage | Investigate Terraform drift and plan safety |
| aws_eks_operational_check | EKS health, nodegroup, and IAM integration diagnostics |
| karpenter_capacity_debug | Debug Karpenter provisioning and scheduling issues |
| gcp_workload_diagnose | Triage a workload using GCP Cloud Monitoring + Cloud Logging (any cluster shipping to a GCP project) |
For a full walkthrough with a realistic example (AWS PrivateLink debugging),
see docs/AUTHORING.md. It covers writing a prompt,
writing a complementary skill, syncing both into your client, and what happens
when the AI runs the resulting workflow end-to-end.
Quick reference below.
Drop one markdown file per prompt into ~/.rootcause/prompts/. Each file declares the prompt's metadata in YAML front-matter and the rendered text below.
~/.rootcause/prompts/
  team-status.md
  payments-p1-drill.md
  verify-deploy.md
Example (~/.rootcause/prompts/team-status.md):
---
name: team_status
description: Daily status check for a workload
arguments:
  - name: workload
    description: Deployment name
    required: true
  - name: namespace
    description: Namespace (defaults to payments)
    required: false
---
Give me the current health of workload {{workload}} in namespace {{namespace|payments}}.
Check:
- Pod status (running / restarting / pending)
- Recent errors in logs (last 30 minutes)
- Any k8s events that look unusual
Keep it to 5 bullet points.
After saving, expose it as a bare /<name> slash command:
rootcause sync --agent claude
Resolution order (first match wins for the directory; the legacy single-file path is also loaded and merged on top):
| Priority | Directory (recommended) | Single file (legacy) |
|---|---|---|
| 1 | ROOTCAUSE_PROMPTS_DIR env | MCP_PROMPTS_FILE env |
| 2 | [prompts].dir in config.toml | ROOTCAUSE_PROMPTS_FILE env |
| 3 | ~/.rootcause/prompts/ | [prompts].file in config.toml |
| 4 | ~/.config/rootcause/prompts/ | ~/.rootcause/prompts.toml |
| 5 | ./rootcause-prompts.d/ | ~/.config/rootcause/prompts.toml, ./rootcause-prompts.toml |
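The directory column of the resolution table can be read as a first-match-wins search. The sketch below assumes that a "match" means the candidate is set and exists as a directory; `resolve_prompts_dir` and its parameters are hypothetical names, and the legacy single-file merge is left out.

```python
import os


def resolve_prompts_dir(env, config_dir=None, is_dir=os.path.isdir):
    """Return the first prompts directory that is set and exists, or None."""
    candidates = [
        env.get("ROOTCAUSE_PROMPTS_DIR"),  # priority 1: environment variable
        config_dir,                        # priority 2: [prompts].dir from config.toml
        "~/.rootcause/prompts/",           # priority 3
        "~/.config/rootcause/prompts/",    # priority 4
        "./rootcause-prompts.d/",          # priority 5
    ]
    for candidate in candidates:
        if candidate and is_dir(os.path.expanduser(candidate)):
            return candidate
    return None
```

Passing `is_dir` as a parameter keeps the lookup testable without touching the filesystem.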
A custom prompt whose name: matches a built-in replaces it, which is useful for org-specific overrides of troubleshoot_workload etc.
Legacy multi-prompt TOML files are still supported. Place a *.toml file inside the prompts directory and it will be parsed in the [[prompt]] format:
# ~/.rootcause/prompts/security.toml
[[prompt]]
name = "security_audit"
description = "Org-specific security policy checks"
template = "Run security audit for {{namespace|all namespaces}} with CIS and policy controls."
[[prompt.arguments]]
name = "namespace"
required = false
- Powerful tool catalog - Kubernetes, ecosystem diagnostics, incident workflows, Helm, Terraform, service mesh, and AWS context.
- Prompt-driven workflows - Repeatable runbook templates for incident and reliability analysis.
- MCP Resources support - Readable resource URIs for kubeconfig, namespace, cluster, and manifest access.
- Security first - Non-destructive modes, policy enforcement, secret masking, and mutation preflight checks.
- Advanced diagnostics - Root-cause oriented outputs with evidence and recommended next actions.
- Strong Helm + Terraform coverage - Chart lifecycle and plan/debug analysis in one server.
- CLI-first operations - Single binary, local kubeconfig usage, and toolset-level controls.
go run . --config config.toml
Use stdio transport and point your MCP client to the rootcause command.
- "Generate an incident bundle for namespace payments and summarize the likely root cause."
- "Run best-practice checks for deployment payment-api and list critical findings."
- "Run safe mutation preflight for this apply operation before execution."
- Run the server:
go run . --config config.example.toml
- Use your existing kubeconfig (default) or point to one:
  - Uses KUBECONFIG if set, otherwise ~/.kube/config.
  - Override with --kubeconfig and --context.
- Connect your MCP client using stdio.
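The kubeconfig lookup in the steps above amounts to a three-level precedence. A minimal sketch, with `resolve_kubeconfig` as a hypothetical helper name:

```python
def resolve_kubeconfig(flag, env):
    """--kubeconfig wins, then the KUBECONFIG variable, then ~/.kube/config."""
    return flag or env.get("KUBECONFIG") or "~/.kube/config"
```

For example, `resolve_kubeconfig(None, {"KUBECONFIG": "/work/kubeconfig"})` returns the environment value, while an explicit flag always takes priority.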
RootCause is built for local development. No API keys are required in this version.
Safe-by-default workflow: diagnose read-only first, then run mutation preflight before any write operation.
Homebrew:
brew install yindia/homebrew-yindia/rootcause
Curl install:
curl -fsSL https://raw.githubusercontent.com/yindia/rootcause/refs/heads/main/install.sh | sh
Go install:
go install .
Or build a local binary:
go build -o rootcause .
Supported OS: macOS, Linux, and Windows.
Windows build example:
go build -o rootcause.exe .
# Build local image
docker build -t rootcause:local .
# Run stdio mode (default)
docker run --rm -it rootcause:local
# Run HTTP transport
docker run --rm -p 8000:8000 rootcause:local --transport http --host 0.0.0.0 --port 8000 --path /mcp
CI image publishing is configured via GitHub Actions in .github/workflows/docker.yml and pushes to GHCR (ghcr.io/<owner>/rootcause) on main and release tags.
Initialize a home config with every built-in toolset enabled and custom skills configured:
rootcause init-config
This writes ${HOME}/.rootcause/config.toml on macOS/Linux and %USERPROFILE%\.rootcause\config.toml on Windows, creates the sibling skills/ directory, and refuses to overwrite unless --overwrite is passed.
Run with a config file:
rootcause --config config.toml
Enable a subset of toolchains:
rootcause --toolsets k8s,istio
Enable read-only mode:
rootcause --read-only
Sync prompts (as slash commands) and skills into agent-specific project directories:
rootcause sync --agent claude --project-dir .
See docs/AUTHORING.md for an end-to-end walkthrough that writes a custom prompt + skill for an AWS PrivateLink debug runbook.
All MCP clients use the same core values:
- command: rootcause
- args: usually --config /path/to/config.toml
- env: optional KUBECONFIG
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: ~/Library/Application Support/Claude/claude_desktop_config.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: ~/.config/claude-code/mcp.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: ~/.cursor/mcp.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: VS Code settings.json (MCP-enabled builds)
{
"mcp.servers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
Format can vary by release. Equivalent TOML entry:
[mcp.servers.rootcause]
command = "rootcause"
args = ["--config", "/Users/you/.config/rootcause/config.toml"]
env = { KUBECONFIG = "/Users/you/.kube/config" }
File: ~/.config/goose/config.yaml
extensions:
  rootcause:
    command: rootcause
    args:
      - --config
      - /Users/you/.config/rootcause/config.toml
File: ~/.gemini/settings.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: ~/.config/roo-code/mcp.json or ~/.config/kilo-code/mcp.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
File: ~/.config/windsurf/mcp.json
{
"mcpServers": {
"rootcause": {
"command": "rootcause",
"args": ["--config", "/Users/you/.config/rootcause/config.toml"],
"env": { "KUBECONFIG": "/Users/you/.kube/config" }
}
}
}
Use the universal template and map keys to the client's schema.
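Since most clients differ only in the top-level key (mcpServers in most examples above, mcp.servers in the VS Code one), the mapping can be scripted. This is an illustrative sketch; `mcp_entry` and `client_config` are hypothetical helpers, and the paths are placeholders.

```python
import json


def mcp_entry(config_path, kubeconfig=None):
    """Build the shared server entry: command, args, and optional env."""
    entry = {"command": "rootcause", "args": ["--config", config_path]}
    if kubeconfig:
        entry["env"] = {"KUBECONFIG": kubeconfig}
    return entry


def client_config(top_level_key, config_path, kubeconfig=None):
    """Wrap the shared entry under the key a given client expects."""
    document = {top_level_key: {"rootcause": mcp_entry(config_path, kubeconfig)}}
    return json.dumps(document, indent=2)
```

Calling `client_config("mcpServers", "/path/to/config.toml")` produces the universal template without the optional env block; check the client's own documentation for the key it expects.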
Works seamlessly with MCP-compatible AI assistants:
| Client | Status | Client | Status |
|---|---|---|---|
| Claude Desktop | Native | Claude Code | Native |
| Cursor | Native | Windsurf | Native |
| GitHub Copilot | Native | OpenAI Codex | Native |
| Gemini CLI | Native | Goose | Native |
| Roo Code | Native | Kilo Code | Native |
| Amp | Compatible | Trae | Compatible |
| OpenCode | Compatible | Kiro CLI | Compatible |
| Antigravity | Compatible | Clawdbot | Compatible |
| Droid (Factory) | Compatible | Any MCP Client | Compatible |
- Restart your client after editing MCP config.
- Ask: "List RootCause tools".
- Ask: "Run k8s.argocd_detect".
- If tools are missing, verify rootcause path, --toolsets, and KUBECONFIG.
- "Run incident bundle for namespace payments and summarize root cause."
- "Check deployment payment-api restart safety before rollout."
- "Diagnose ArgoCD health in namespace argocd."
- "Preflight this patch operation before mutation."
rootcause --config config.toml
Point your MCP client to run the command above and use stdio transport.
- "Create incident bundle for namespace payments"
- "Generate RCA from latest incident bundle"
- "Export postmortem draft"
Tools behind this flow:
- rootcause.incident_bundle
- rootcause.rca_generate
- rootcause.postmortem_export
- "Run restart safety check for deployment payment-api"
- "Run best-practice check for payment-api"
- "Run mutation preflight for rollout restart"
Tools behind this flow:
- k8s.restart_safety_check
- k8s.best_practice
- k8s.safe_mutation_preflight
- "Detect Flux in this cluster"
- "Diagnose Flux reconciliation health in namespace flux-system"
- "Summarize top issues and next actions"
Tools behind this flow:
- k8s.flux_detect
- k8s.diagnose_flux
Enabled by default:
| Toolchain | Primary Purpose | Typical Requirement |
|---|---|---|
| k8s | Core Kubernetes operations and diagnostics | Kubernetes API access |
| linkerd | Linkerd health and policy diagnostics | Linkerd control plane |
| karpenter | Node provisioning and scaling diagnostics | Karpenter controller |
| istio | Service mesh configuration and proxy diagnostics | Istio control plane |
| helm | Chart registry/release workflows and diffing | Helm 3 and cluster access |
| aws | EKS/EC2/VPC/IAM/ECR/KMS/STS diagnostics | AWS credentials |
| gcp | Cloud Monitoring metrics + SLOs, Cloud Logging analysis for any workload shipping telemetry to GCP (GKE, EKS, AKS, or self-managed) | GCP credentials (ADC or GOOGLE_APPLICATION_CREDENTIALS); project from GOOGLE_CLOUD_PROJECT / GCP_PROJECT env, or explicit projectId arg |
| terraform | Registry and plan impact analysis | Terraform workflows |
| rootcause | Incident bundles, RCA, timeline, postmortem export | Kubernetes access |
| browser (optional) | Browser automation via agent-browser | MCP_BROWSER_ENABLED=true + agent-browser install |
Optional toolchains return "not detected" when the control plane is absent. Additional toolchains can be registered via the plugin SDK; see PLUGINS.md.
Enable only what you need:
rootcause --toolsets k8s,helm,rootcause
Automate web-based Kubernetes operations with agent-browser integration.
Quick setup:
# Install agent-browser
npm install -g agent-browser
agent-browser install
# Enable browser tools
export MCP_BROWSER_ENABLED=true
rootcause
What you can do:
- Test deployed apps via Ingress URLs
- Screenshot Grafana, ArgoCD, or any K8s dashboard
- Automate cloud console operations (EKS, GKE, AKS)
- Health check web applications
- Export monitoring dashboards as PDF
- Test authentication flows with persistent sessions
26 available tools: browser_open, browser_screenshot, browser_click, browser_fill, browser_test_ingress, browser_screenshot_grafana, browser_health_check, and 19 more.
Full list: browser_open, browser_screenshot, browser_click, browser_fill, browser_test_ingress, browser_screenshot_grafana, browser_health_check, browser_snapshot, browser_get_text, browser_get_html, browser_evaluate, browser_pdf, browser_wait_for, browser_wait_for_url, browser_press, browser_select, browser_check, browser_uncheck, browser_hover, browser_type, browser_upload, browser_drag, browser_new_tab, browser_switch_tab, browser_close_tab, browser_close.
Advanced features:
- Cloud providers: Browserbase, Browser Use
- Persistent browser profiles
- Remote CDP connections
- Session management
Prompt templates for common debugging flows are in prompts/prompt.md.
- CRUD + discovery: k8s.get, k8s.list, k8s.describe, k8s.create, k8s.apply, k8s.patch, k8s.delete, k8s.api_resources, k8s.crds
- Ops + observability: k8s.logs, k8s.events, k8s.context, k8s.explain_resource, k8s.ping, k8s.events_timeline
- Workload operations and safety: k8s.scale, k8s.rollout, k8s.restart_safety_check, k8s.best_practice, k8s.safe_mutation_preflight
- Ecosystem detection: k8s.argocd_detect, k8s.flux_detect, k8s.cert_manager_detect, k8s.kyverno_detect, k8s.gatekeeper_detect, k8s.cilium_detect
- Ecosystem diagnostics: k8s.diagnose_argocd, k8s.diagnose_flux, k8s.diagnose_cert_manager, k8s.diagnose_kyverno, k8s.diagnose_gatekeeper, k8s.diagnose_cilium
- Debugging: k8s.overview, k8s.crashloop_debug, k8s.scheduling_debug, k8s.hpa_debug, k8s.vpa_debug, k8s.storage_debug, k8s.config_debug, k8s.permission_debug, k8s.network_debug, k8s.private_link_debug, k8s.debug_flow
- Maintenance + topology: k8s.cleanup_pods, k8s.node_management, k8s.graph, k8s.resource_usage
- linkerd.health, linkerd.proxy_status, linkerd.identity_issues, linkerd.policy_debug, linkerd.cr_status, linkerd.virtualservice_status, linkerd.destinationrule_status, linkerd.gateway_status, linkerd.httproute_status
- istio.health, istio.proxy_status, istio.config_summary, istio.service_mesh_hosts, istio.discover_namespaces, istio.pods_by_service, istio.external_dependency_check
- istio.proxy_clusters, istio.proxy_listeners, istio.proxy_routes, istio.proxy_endpoints, istio.proxy_bootstrap, istio.proxy_config_dump
- istio.cr_status, istio.virtualservice_status, istio.destinationrule_status, istio.gateway_status, istio.httproute_status
- karpenter.status, karpenter.node_provisioning_debug, karpenter.nodepool_debug, karpenter.nodeclass_debug, karpenter.interruption_debug
- Repo/registry: helm.repo_add, helm.repo_list, helm.repo_update, helm.list_charts, helm.get_chart, helm.search_charts
- Release operations: helm.list, helm.status, helm.diff_release, helm.rollback_advisor, helm.install, helm.upgrade, helm.uninstall, helm.template_apply, helm.template_uninstall
- aws.iam.list_roles, aws.iam.get_role, aws.iam.get_instance_profile, aws.iam.update_role, aws.iam.delete_role
- aws.iam.list_policies, aws.iam.get_policy, aws.iam.update_policy, aws.iam.delete_policy
- aws.vpc.list_vpcs, aws.vpc.get_vpc, aws.vpc.list_subnets, aws.vpc.get_subnet, aws.vpc.list_route_tables, aws.vpc.get_route_table
- aws.vpc.list_nat_gateways, aws.vpc.get_nat_gateway, aws.vpc.list_security_groups, aws.vpc.get_security_group
- aws.vpc.list_network_acls, aws.vpc.get_network_acl, aws.vpc.list_internet_gateways, aws.vpc.get_internet_gateway
- aws.vpc.list_vpc_endpoints, aws.vpc.get_vpc_endpoint, aws.vpc.list_network_interfaces, aws.vpc.get_network_interface
- aws.vpc.list_resolver_endpoints, aws.vpc.get_resolver_endpoint, aws.vpc.list_resolver_rules, aws.vpc.get_resolver_rule
- aws.ec2.list_instances, aws.ec2.get_instance, aws.ec2.list_auto_scaling_groups, aws.ec2.get_auto_scaling_group, aws.ec2.list_load_balancers, aws.ec2.get_load_balancer
- aws.ec2.list_target_groups, aws.ec2.get_target_group, aws.ec2.list_listeners, aws.ec2.get_listener, aws.ec2.get_target_health
- aws.ec2.list_listener_rules, aws.ec2.get_listener_rule, aws.ec2.list_auto_scaling_policies, aws.ec2.get_auto_scaling_policy, aws.ec2.list_scaling_activities, aws.ec2.get_scaling_activity
- aws.ec2.list_launch_templates, aws.ec2.get_launch_template, aws.ec2.list_launch_configurations, aws.ec2.get_launch_configuration
- aws.ec2.get_instance_iam, aws.ec2.get_security_group_rules, aws.ec2.list_spot_instance_requests, aws.ec2.get_spot_instance_request
- aws.ec2.list_capacity_reservations, aws.ec2.get_capacity_reservation, aws.ec2.list_volumes, aws.ec2.get_volume, aws.ec2.list_snapshots, aws.ec2.get_snapshot, aws.ec2.list_volume_attachments
- aws.ec2.list_placement_groups, aws.ec2.get_placement_group, aws.ec2.list_instance_status, aws.ec2.get_instance_status
- aws.eks.list_clusters, aws.eks.get_cluster, aws.eks.list_nodegroups, aws.eks.get_nodegroup, aws.eks.list_addons, aws.eks.get_addon
- aws.eks.list_fargate_profiles, aws.eks.get_fargate_profile, aws.eks.list_identity_provider_configs, aws.eks.get_identity_provider_config
- aws.eks.list_updates, aws.eks.get_update, aws.eks.list_nodes, aws.eks.debug
- aws.ecr.list_repositories, aws.ecr.describe_repository, aws.ecr.list_images, aws.ecr.describe_images, aws.ecr.describe_registry, aws.ecr.get_authorization_token
- aws.sts.get_caller_identity, aws.sts.assume_role
- aws.kms.list_keys, aws.kms.list_aliases, aws.kms.describe_key, aws.kms.get_key_policy
- gcp.metrics.query: run a raw Cloud Monitoring MQL query and return time series.
- gcp.metrics.workload: CPU, memory, and restart count metrics for a Kubernetes workload over a time window.
- gcp.metrics.list_descriptors: list Cloud Monitoring metric descriptors for discoverability (accepts a Monitoring filter).
- gcp.metrics.slo_list: enumerate Service Monitoring services and their SLO configuration (goal, period, indicator type). Live burn-rate is out of scope; use gcp.metrics.query with MQL select_slo_burn_rate(...) when needed.
- gcp.logs.query: run a raw Cloud Logging filter and return matching entries.
- gcp.logs.workload: recent errors/warnings for a Kubernetes workload over a time window.
- gcp.logs.error_timeline: bucketed error/warning counts for spotting the inflection point.
- gcp.logs.correlated_with_bundle: pull log entries matching a rootcause.incident_bundle event window (accepts the bundle object or explicit startTime/endTime).
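To illustrate what "bucketed counts for spotting the inflection point" means for gcp.logs.error_timeline, here is a minimal sketch of the general technique. It is not the tool's implementation: the 5-minute bucket width, the function names, and the "largest jump between consecutive buckets" heuristic are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def bucket_counts(timestamps, bucket=timedelta(minutes=5)):
    """Count log entries per fixed-width bucket, keyed by bucket start (UTC)."""
    width = int(bucket.total_seconds())
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        start = datetime.fromtimestamp(epoch - epoch % width, tz=timezone.utc)
        counts[start] += 1
    return dict(sorted(counts.items()))


def inflection_bucket(counts):
    """Return the bucket with the largest increase over the previous non-empty bucket."""
    starts = list(counts)
    best, best_jump = None, 0
    for prev, cur in zip(starts, starts[1:]):
        jump = counts[cur] - counts[prev]
        if jump > best_jump:
            best, best_jump = cur, jump
    return best
```

Empty buckets are not materialized here, so the comparison runs between consecutive non-empty buckets only.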
- terraform.debug_plan
- terraform.list_modules, terraform.get_module, terraform.list_module_versions, terraform.search_modules
- terraform.list_providers, terraform.get_provider, terraform.list_provider_versions, terraform.get_provider_package, terraform.search_providers
- terraform.list_resources, terraform.get_resource, terraform.search_resources
- terraform.list_data_sources, terraform.get_data_source, terraform.search_data_sources
- rootcause.incident_bundle, rootcause.change_timeline, rootcause.rca_generate, rootcause.remediation_playbook, rootcause.postmortem_export, rootcause.capabilities
rootcause.incident_bundle accepts an optional workload argument. When provided alongside namespace, and the gcp toolset is enabled, the default chain automatically appends gcp.metrics.workload and gcp.logs.workload so the bundle includes GCP-side metrics and logs for that workload. rca_generate, remediation_playbook, and postmortem_export propagate workload through to the auto-built bundle as well.
browser_open,browser_screenshot,browser_click,browser_fill,browser_test_ingress,browser_screenshot_grafana,browser_health_checkbrowser_snapshot,browser_get_text,browser_get_html,browser_evaluate,browser_pdf,browser_wait_for,browser_wait_for_urlbrowser_press,browser_select,browser_check,browser_uncheck,browser_hover,browser_type,browser_upload,browser_dragbrowser_new_tab,browser_switch_tab,browser_close_tab,browser_close
- kubectl_get, kubectl_list, kubectl_describe, kubectl_create, kubectl_apply, kubectl_delete, kubectl_logs, kubectl_patch, kubectl_scale, kubectl_rollout, kubectl_context, kubectl_generic, kubectl_top, explain_resource, list_api_resources, ping
- --read-only: removes apply/patch/delete/exec tools from discovery.
- --disable-destructive: removes delete and risky write tools unless allowlisted (create/scale/rollout remain available).
- Mutating tools are documented in this README under Complete Feature Set and Safety Modes.
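The effect of the two flags on tool discovery can be sketched as a filter over the tool list. This is only an illustration: the Tool struct, its Mutating and Destructive flags, the visibleTools function, and the tool names in main are hypothetical and are not taken from the RootCause source.

```go
package main

import "fmt"

// Tool is a hypothetical registry entry; the real RootCause types may differ.
type Tool struct {
	Name        string
	Mutating    bool // write-style tools hidden by --read-only (assumed flag)
	Destructive bool // delete and other risky writes (assumed flag)
}

// visibleTools returns the tool names a client would discover under the given
// modes. allowlist names destructive tools that stay visible when
// disableDestructive is set.
func visibleTools(tools []Tool, readOnly, disableDestructive bool, allowlist map[string]bool) []string {
	var out []string
	for _, t := range tools {
		if readOnly && t.Mutating {
			continue
		}
		if disableDestructive && t.Destructive && !allowlist[t.Name] {
			continue
		}
		out = append(out, t.Name)
	}
	return out
}

func main() {
	// Illustrative names only.
	tools := []Tool{
		{Name: "k8s.get"},
		{Name: "k8s.create", Mutating: true},
		{Name: "k8s.delete", Mutating: true, Destructive: true},
	}
	fmt.Println("read-only:", visibleTools(tools, true, false, nil))
	fmt.Println("disable-destructive:", visibleTools(tools, false, true, nil))
	fmt.Println("allowlisted delete:", visibleTools(tools, false, true, map[string]bool{"k8s.delete": true}))
}
```

Filtering at discovery time, rather than rejecting calls later, means a client never sees a tool it is not allowed to run.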
Default safety policy:
- If a user does not explicitly request a mutating action, treat the request as read-only diagnostics.
- Do not run mutating tools implicitly during analysis.
- For investigation-first workflows, prefer running RootCause in --read-only mode.
- K8s mutating tools (create/apply/patch/delete/scale/rollout/cleanup_pods/node_management) run an automatic k8s.safe_mutation_preflight check before execution.
Safety workflow recommendation:
- Run read-only diagnosis (k8s.*_debug, k8s.*_detect, k8s.diagnose_*, rootcause.incident_bundle)
- Run k8s.safe_mutation_preflight for the intended mutation
- Execute the mutation only after preflight passes, and with confirm=true
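The last step of this workflow amounts to a two-condition gate. A minimal sketch, assuming a hypothetical PreflightResult type and gateMutation helper (neither name comes from the RootCause codebase):

```go
package main

import (
	"errors"
	"fmt"
)

// PreflightResult is a hypothetical summary of a k8s.safe_mutation_preflight run.
type PreflightResult struct {
	Passed       bool
	FailedChecks []string
}

// gateMutation calls apply only when preflight passed and the caller confirmed.
func gateMutation(pre PreflightResult, confirm bool, apply func() error) error {
	if !pre.Passed {
		return fmt.Errorf("preflight failed: %v", pre.FailedChecks)
	}
	if !confirm {
		return errors.New("mutation requires confirm=true")
	}
	return apply()
}

func main() {
	apply := func() error {
		fmt.Println("mutation executed")
		return nil
	}

	failed := PreflightResult{Passed: false, FailedChecks: []string{"namespace not allowed"}}
	fmt.Println(gateMutation(failed, true, apply))

	passed := PreflightResult{Passed: true}
	fmt.Println(gateMutation(passed, false, apply))
	fmt.Println(gateMutation(passed, true, apply))
}
```

Checking preflight before the confirm flag means a confirmed request still cannot bypass a failed check.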
rootcause --config config.example.toml --toolsets k8s,linkerd,istio,karpenter,helm,aws
Create a cross-platform home config first if you do not already have one:
rootcause init-config
The generated config enables k8s, linkerd, karpenter, istio, helm, aws, gcp, terraform, and rootcause, sets stdio transport defaults, and configures ~/.rootcause/skills (skills) and ~/.rootcause/prompts/ (custom prompts) as the user-authored content directories.
- --kubeconfig
- --context
- --toolsets (comma-separated)
- --config
- --read-only
- --disable-destructive
- --transport (stdio|http|sse)
- --host (for HTTP/SSE)
- --port (for HTTP/SSE)
- --path (for HTTP/SSE)
- --log-level
If --config is not set, RootCause will use the ROOTCAUSE_CONFIG environment variable when present.
The AWS IAM tools use the standard AWS credential chain and region resolution. Set AWS_REGION or AWS_DEFAULT_REGION (defaults to us-east-1), optionally select a profile with AWS_PROFILE or AWS_DEFAULT_PROFILE, and use any of the normal credential sources (env vars, shared config/credentials files, SSO, or instance metadata).
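The region fallback described above can be sketched as a lookup over environment variables. This sketch assumes AWS_REGION is checked before AWS_DEFAULT_REGION, and it ignores regions set in shared config files; resolveRegion is a hypothetical helper, not a RootCause or AWS SDK function.

```go
package main

import (
	"fmt"
	"os"
)

// resolveRegion returns the first non-empty region variable, else us-east-1.
// Checking AWS_REGION before AWS_DEFAULT_REGION is an assumption of this sketch.
func resolveRegion(getenv func(string) string) string {
	for _, key := range []string{"AWS_REGION", "AWS_DEFAULT_REGION"} {
		if v := getenv(key); v != "" {
			return v
		}
	}
	return "us-east-1"
}

func main() {
	fmt.Println("region:", resolveRegion(os.Getenv))
}
```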
The gcp.* tools use Application Default Credentials by default. Either run gcloud auth application-default login or set GOOGLE_APPLICATION_CREDENTIALS to a service-account key path.
Project ID resolution order:
- Explicit projectId argument on the tool call.
- GOOGLE_CLOUD_PROJECT env var.
- GCP_PROJECT env var.
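This resolution order can be written as a first-match lookup. The resolveProjectID function below is a hypothetical sketch; returning an error when nothing is set is an assumption, since the behaviour in that case is not described here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveProjectID applies the documented order: explicit argument first,
// then GOOGLE_CLOUD_PROJECT, then GCP_PROJECT.
func resolveProjectID(explicit string, getenv func(string) string) (string, error) {
	if explicit != "" {
		return explicit, nil
	}
	for _, key := range []string{"GOOGLE_CLOUD_PROJECT", "GCP_PROJECT"} {
		if v := getenv(key); v != "" {
			return v, nil
		}
	}
	// Assumed behaviour: fail rather than guess a project.
	return "", errors.New("no GCP project: pass projectId or set GOOGLE_CLOUD_PROJECT or GCP_PROJECT")
}

func main() {
	id, err := resolveProjectID("", os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("project:", id)
}
```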
The observability project is not auto-detected from the kubeconfig context: an EKS or AKS cluster can also ship logs and metrics to GCP, so the project must come from the user, not from the cluster's control-plane identity.
When gcp.metrics.workload and gcp.logs.workload are enabled and rootcause.incident_bundle is called with both namespace and workload, the default bundle chain auto-includes GCP workload metrics and logs alongside the k8s evidence.
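The auto-include rule can be pictured as a conditional append to the bundle's tool chain. In this sketch the bundleChain function, its signature, and the k8s.example_step placeholder are hypothetical; only the two gcp.* tool names come from this README.

```go
package main

import "fmt"

// bundleChain returns the tools an incident bundle would run. The two GCP
// workload tools are appended only when the gcp toolset is enabled and both
// namespace and workload are provided.
func bundleChain(base []string, namespace, workload string, gcpEnabled bool) []string {
	chain := append([]string(nil), base...) // copy so the caller's slice is untouched
	if gcpEnabled && namespace != "" && workload != "" {
		chain = append(chain, "gcp.metrics.workload", "gcp.logs.workload")
	}
	return chain
}

func main() {
	base := []string{"k8s.example_step"} // placeholder for the default k8s evidence steps
	fmt.Println(bundleChain(base, "payments", "checkout", true))
	fmt.Println(bundleChain(base, "payments", "", true))
}
```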
If --kubeconfig is not set, RootCause follows standard Kubernetes loading rules: it uses KUBECONFIG when present, otherwise defaults to ~/.kube/config.
Authentication and authorization use your kubeconfig identity only in this version.
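The kubeconfig lookup order can be sketched as follows. The kubeconfigPaths function is a hypothetical illustration of the standard loading rules, not RootCause's implementation. It returns a slice because KUBECONFIG may hold several paths separated by the OS path-list separator.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// kubeconfigPaths resolves candidate kubeconfig files in priority order:
// the --kubeconfig flag value, then KUBECONFIG, then ~/.kube/config.
func kubeconfigPaths(flag string, getenv func(string) string, home string) []string {
	if flag != "" {
		return []string{flag}
	}
	if env := getenv("KUBECONFIG"); env != "" {
		return filepath.SplitList(env) // KUBECONFIG may list multiple files
	}
	return []string{filepath.Join(home, ".kube", "config")}
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	fmt.Println(kubeconfigPaths("", os.Getenv, home))
}
```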
- Verify KUBECONFIG or ~/.kube/config
- Override explicitly with --kubeconfig /path/to/config
- Confirm server is running and client points to rootcause
- Check selected toolsets with --toolsets
- If using --read-only, mutating tools will be hidden by design
- This usually means the ecosystem control plane is not installed in the cluster
- Run k8s.<ecosystem>_detect first, then k8s.diagnose_<ecosystem>
- Run k8s.safe_mutation_preflight explicitly and inspect failed checks
- Fix policy/namespace/resource issues, then retry with confirm=true
AI Client
-> MCP stdio server
-> Tool registry (k8s/linkerd/istio/karpenter/helm/aws/terraform/rootcause)
-> Shared internals (kube clients, evidence, policy, rendering, redaction)
-> Target APIs (Kubernetes + cloud providers)
Why this matters:
- consistent evidence format across toolsets
- reusable diagnostics instead of duplicated logic
- safer operations through centralized policy and preflight checks
RootCause is organized around shared Kubernetes plumbing and toolsets that reuse it.
- Shared clients (typed, dynamic, discovery, RESTMapper) are created once in internal/kube and injected into all toolsets.
- Common safeguards live in internal/policy (namespace vs cluster enforcement and tool allowlists) and internal/redact (token/secret redaction).
- internal/evidence gathers events, owner chains, endpoints, and pod status summaries used by all toolsets.
- internal/render enforces a consistent analysis output format (root causes, evidence, next checks, resources examined) and provides the shared describe helper.
- Toolsets live under toolsets/ and register namespaced tools (k8s.*, linkerd.*, karpenter.*, istio.*, helm.*, aws.iam.*, aws.vpc.*) through a shared MCP registry.
The MCP server runs over stdio using the MCP Go SDK and is designed for local kubeconfig usage. Optional in-cluster deployment is intentionally out of scope for Phase 1.
Send SIGHUP to reload config and rebuild the tool registry. On Windows, SIGHUP is not supported; restart the process to reload config.
RootCause supports MCP over stdio (default), http (streamable HTTP), and sse.
Examples:
# stdio
rootcause --config config.toml --transport stdio
# HTTP (streamable)
rootcause --config config.toml --transport http --host 127.0.0.1 --port 8000 --path /mcp
# SSE
rootcause --config config.toml --transport sse --host 127.0.0.1 --port 8000 --path /mcp
Design focus today:
- best-in-class local reliability for AI-assisted SRE workflows
- deterministic, auditable outputs for incident review
- safe mutation gates instead of broad write-by-default behavior
AWS IAM and GCP Cloud Monitoring/Logging support are now available. The toolset system is designed to add deeper cloud integrations (EKS/EC2/VPC/Azure/extended GCP services) without changing the core MCP or shared Kubernetes libraries.
We welcome code, docs, tests, and operational feedback.
- Report bugs with reproducible steps and expected behavior
- Propose features with concrete operator scenarios
- Improve tests for safety, policy, and ecosystem diagnostics
- Add or improve toolsets via shared SDK and internal libraries
- Fork and create a feature branch
- Implement focused changes with tests
- Run local verification: go test ./...
- Update docs (README.md, prompts/prompt.md) if behavior changed
- Open PR with problem statement, approach, and verification notes
- Contribution rules: CONTRIBUTING.md
- Plugin SDK and external toolsets: PLUGINS.md
- Config example: config.toml
- MCP eval harness: eval/README.md
- Behavior matches user/operator expectations
- Safety model preserved (read-only, destructive gating, preflight)
- Tests added/updated for new behavior
- Tool/docs consistency checked (README.md)