Governance-first, local-first optimization for JVM microservices. ACO combines metric-driven diagnosis, bounded recommendations, audit artifacts, confidence gates, rollback modeling, and benchmark scenarios in one Java codebase.
ACO is an open-source Java Autonomic Reliability Governance (ARG) reference implementation for experimenting with governed optimization of JVM-backed services. ARG is the discipline of designing bounded, auditable, and reversible automation under explicit constraints — particularly where agents plan and act faster than human-in-the-loop validation can operate.
ACO watches runtime signals such as latency, heap, GC, CPU, and thread behavior; produces structured optimization plans; evaluates those plans against policy and actuation budgets; and validates whether changes helped or hurt.
This is not "YOLO let the LLM touch prod." That would be extremely innovative in the worst possible way.
ACO currently supports:
- Local LLM analysis through Ollama
- Deterministic fallback analysis through SimpleAgent
- OptimizationPlan artifacts for auditability
- Policy evaluation before actuation
- Actuation budgets — rate-limiting change magnitude and frequency, not just traffic
- Blast radius constraints through actuation scoping
- Confidence gates and progressive autonomy modes (Advisory / Auto-Governed / Denied)
- Eligibility tiers — action-based permission model separating observational, low-risk, and high-risk actions
- Validation and rollback modeling
- Deterministic benchmark scenarios for amplification testing
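The autonomy modes and eligibility tiers above can be combined into a single permission check. The sketch below is illustrative only — the enum values mirror the README's terms, but the class, method, and rule details are hypothetical, not ACO's actual API:

```java
// Hypothetical sketch of combining eligibility tiers with autonomy modes.
// Names and rules are invented for illustration; ACO's real model lives
// in its governance layer and may differ.
public class EligibilityTiersSketch {
    enum Tier { OBSERVATIONAL, LOW_RISK, HIGH_RISK }
    enum Mode { ADVISORY, AUTO_GOVERNED, DENIED }

    // An action may only be auto-applied when its tier is low enough
    // for the current autonomy mode; everything else stays advisory.
    static boolean mayAutoApply(Tier tier, Mode mode) {
        if (mode == Mode.DENIED) return false;
        if (mode == Mode.ADVISORY) return tier == Tier.OBSERVATIONAL;
        // AUTO_GOVERNED: high-risk actions still require a human.
        return tier != Tier.HIGH_RISK;
    }

    public static void main(String[] args) {
        System.out.println(mayAutoApply(Tier.LOW_RISK, Mode.AUTO_GOVERNED));  // true
        System.out.println(mayAutoApply(Tier.HIGH_RISK, Mode.AUTO_GOVERNED)); // false
    }
}
```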
JVM tuning is still too often a guessing game:
- thread pools get bumped because latency is bad
- heap gets inflated "just to be safe"
- GC settings drift without a recorded reason
- optimization decisions are made in war rooms and forgotten a week later
ACO turns that into a more disciplined loop:
- Observe runtime behavior
- Reason about likely bottlenecks
- Plan a structured optimization with evidence and rollback path
- Govern that plan with policy, budgets, and autonomy checks
- Validate the outcome
- Audit — preserve every decision for review and rollback
The point is not raw automation. The point is bounded, explainable optimization.
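One way to picture the disciplined loop is as a fixed sequence of stages, each writing its decision to an audit trail. This is a minimal, hypothetical sketch of that shape — not ACO's actual control loop or class names:

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the observe -> reason -> plan -> govern -> validate -> audit
// loop. Every stage records itself, so nothing silently disappears.
public class GovernanceLoopSketch {
    static List<String> runCycle(double p99LatencyMs, double sloMs) {
        List<String> audit = new ArrayList<>();
        audit.add("observe: p99=" + p99LatencyMs + "ms");
        boolean breached = p99LatencyMs > sloMs;  // SLO detection
        audit.add("reason: " + (breached ? "bottleneck suspected" : "healthy"));
        if (breached) {
            audit.add("plan: proposed change with rollback path");
            audit.add("govern: policy + budget + autonomy checks");
            audit.add("validate: compare before/after metrics");
        }
        audit.add("audit: cycle recorded");
        return audit;
    }

    public static void main(String[] args) {
        runCycle(480, 200).forEach(System.out::println);
    }
}
```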
Every optimization cycle flows through a fixed governance pipeline. No step can be skipped.
```mermaid
flowchart TD
    WS[Workload Simulator] --> MC[Metrics Collection]
    MC --> SLO[SLO Detector]
    SLO --> AR[Agent Reasoning]
    AR --> PA[Plan Assembly]
    PA --> PE[Policy Evaluation]
    PE --> AB[Actuation Budget]
    AB --> AG{Autonomy Gate}
    AG -->|Advisory| ADV[Advisory Report]
    AG -->|Auto-Governed| VE[Validation]
    AG -->|Denied| RE[Rollback]
    VE --> AUD[Audit & Report]
    RE --> AUD
    ADV --> AUD
```
ACO is organized into six layers. Each layer has a single responsibility and depends only on the layers below it.
```mermaid
flowchart TD
    A["Infrastructure — Ollama · Load Runner · Workload Simulator"]
    B["Observation — Metrics Collection · SLO Detection"]
    C["Reasoning — LLM Agent · Simple Agent"]
    D["Planning — Plan Assembler · Optimization Plan · Rollback Recipe"]
    E["Governance — Policy Engine · Budget Ledger · Autonomy Gate"]
    F["Validation & Audit — Validation · Rollback · Report"]
    A --> B --> C --> D --> E --> F
```
```bash
git clone https://github.com/sibasispadhi/agentic-cloud-optimizer.git
cd agentic-cloud-optimizer
docker compose up --build
```

Open:
- Live dashboard: http://localhost:8081/live-dashboard.html
- Results page: http://localhost:8081/results.html
First run downloads the Ollama model, so yeah, give it a minute.
```bash
git clone https://github.com/sibasispadhi/agentic-cloud-optimizer.git
cd agentic-cloud-optimizer
docker compose -f docker-compose.simple.yml up --build
```

Open:
- Live dashboard: http://localhost:8081/live-dashboard.html
- Results page: http://localhost:8081/results.html
```bash
# Terminal 1
ollama serve
ollama pull llama3.2:3b

# Terminal 2
mvn spring-boot:run
```

Then open:
- Live dashboard: http://localhost:8081/live-dashboard.html
- Results page: http://localhost:8081/results.html
For a more guided local setup, use docs/START_HERE.md.
The Architecture diagram above shows the full pipeline. Each stage maps to a layer:
| Pipeline stage | Layer | Detail |
|---|---|---|
| Workload Simulator | Infrastructure | executes load; drives metric signals |
| Metrics Collection | Observation | latency, throughput, heap, CPU, GC, threads |
| SLO Detector | Observation | triggers optimization when Service Level Objective (SLO) thresholds are breached |
| Agent Reasoning | Reasoning | SpringAiLlmAgent or SimpleAgent |
| Plan Assembly | Planning | OptimizationPlan with evidence and rollback path |
| Policy Evaluation | Governance | PolicyEngine approves, warns, or denies |
| Actuation Budget | Governance | ActuationBudgetLedger enforces change limits |
| Autonomy Gate | Governance | Advisory / Auto-Governed / Denied |
| Validation | Validation & Audit | confirms improvement; triggers rollback if not |
| Audit & Report | Validation & Audit | persists every decision for review |
- local LLM-backed reasoning with Ollama
- deterministic fallback agent
- externalized thresholds in configuration
- Docker startup flows
Implemented in src/main/java/com/cloudoptimizer/agent/artifact/
Key outputs:
- `OptimizationPlan`
- `PlanChange`
- `PlanEvidence`
- `ValidationRecipe`
- `RollbackRecipe`
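As a rough illustration of how these artifacts fit together, here is a hypothetical sketch using Java records. Field names and shapes are invented for the example and will differ from ACO's actual classes in the artifact package:

```java
import java.util.List;

// Invented shapes for the plan-artifact family; ACO's real classes differ.
public class PlanArtifactSketch {
    record PlanChange(String parameter, String from, String to) {}
    record PlanEvidence(String metric, double observed, double threshold) {}
    record RollbackRecipe(List<PlanChange> inverse) {}
    record OptimizationPlan(String id, List<PlanChange> changes,
                            List<PlanEvidence> evidence, RollbackRecipe rollback) {}

    // A rollback recipe simply inverts each change, preserving reversibility.
    static RollbackRecipe invert(List<PlanChange> changes) {
        return new RollbackRecipe(changes.stream()
                .map(c -> new PlanChange(c.parameter(), c.to(), c.from()))
                .toList());
    }

    public static void main(String[] args) {
        var change = new PlanChange("thread-pool.size", "32", "48");
        var rollback = invert(List.of(change));
        System.out.println(rollback.inverse().get(0)); // inverted: 48 -> 32
    }
}
```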
Implemented in src/main/java/com/cloudoptimizer/agent/policy/
Key outputs:
- `PolicyEngine`
- `DefaultPolicyEngine`
- `ActuationPolicy`
- `PolicyDecision`
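The essence of policy evaluation is checking a proposed change against bounds before anything actuates. A minimal sketch, with invented thresholds and names (the real contract is ACO's PolicyEngine):

```java
// Toy policy evaluation: deny changes beyond a hard bound, warn near it,
// approve otherwise. Thresholds and names are invented for illustration.
public class PolicySketch {
    enum Decision { APPROVE, WARN, DENY }

    static Decision evaluate(double proposed, double current, double maxRelativeChange) {
        double delta = Math.abs(proposed - current) / current;
        if (delta > maxRelativeChange) return Decision.DENY;
        if (delta > 0.8 * maxRelativeChange) return Decision.WARN;
        return Decision.APPROVE;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(48, 32, 0.25)); // 50% bump vs 25% bound -> DENY
        System.out.println(evaluate(36, 32, 0.25)); // 12.5% bump -> APPROVE
    }
}
```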
Implemented in src/main/java/com/cloudoptimizer/agent/budget/
Key outputs:
- `ActuationBudget`
- `ActuationBudgetLedger`
- `BudgetConsumption`
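An actuation budget rate-limits both how often and how much the agent may change. The following is a hypothetical ledger sketch — the field semantics and API are assumptions, not ACO's ActuationBudgetLedger:

```java
// Toy actuation budget: a window allows N actions and a total change
// magnitude; a change is admitted only if both limits still hold.
public class BudgetLedgerSketch {
    private int remainingActions;
    private double remainingMagnitude; // e.g. sum of relative changes allowed

    BudgetLedgerSketch(int actions, double magnitude) {
        this.remainingActions = actions;
        this.remainingMagnitude = magnitude;
    }

    // Consumes budget and returns true only when the change fits.
    synchronized boolean tryConsume(double magnitude) {
        if (remainingActions <= 0 || magnitude > remainingMagnitude) return false;
        remainingActions--;
        remainingMagnitude -= magnitude;
        return true;
    }

    public static void main(String[] args) {
        var ledger = new BudgetLedgerSketch(2, 0.30);
        System.out.println(ledger.tryConsume(0.20)); // true
        System.out.println(ledger.tryConsume(0.20)); // false: magnitude exhausted
    }
}
```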
Implemented in src/main/java/com/cloudoptimizer/agent/autonomy/
Key outputs:
- `AutonomyGate`
- `AutonomyMode`
- `AutonomyGateResult`
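Conceptually, the gate turns a plan's confidence and its upstream policy verdict into one of the three modes. The threshold and logic below are assumptions for illustration; ACO's AutonomyGate defines the real rules:

```java
// Toy confidence gate: denied by policy -> DENIED; otherwise high
// confidence earns auto-governed execution, low confidence stays advisory.
public class AutonomyGateSketch {
    enum Mode { ADVISORY, AUTO_GOVERNED, DENIED }

    static Mode decide(double confidence, boolean policyApproved) {
        if (!policyApproved) return Mode.DENIED;
        return confidence >= 0.8 ? Mode.AUTO_GOVERNED : Mode.ADVISORY; // 0.8 is invented
    }

    public static void main(String[] args) {
        System.out.println(decide(0.9, true));  // AUTO_GOVERNED
        System.out.println(decide(0.6, true));  // ADVISORY
        System.out.println(decide(0.9, false)); // DENIED
    }
}
```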
Implemented in the artifact and service layers
Key outputs:
- `ValidationExecutor`
- `RollbackExecutor`
- `ValidationResult`
- `RollbackResult`
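The validation decision boils down to: did the change improve the metric by a meaningful margin, or should the rollback recipe fire? A minimal sketch with an invented improvement threshold (ValidationExecutor and RollbackExecutor are the real entry points):

```java
// Toy validation check: keep a change only if p99 latency improves by at
// least minGain (e.g. 0.05 = 5%); otherwise trigger rollback.
public class ValidationSketch {
    static boolean improved(double beforeP99Ms, double afterP99Ms, double minGain) {
        return afterP99Ms <= beforeP99Ms * (1.0 - minGain);
    }

    public static void main(String[] args) {
        System.out.println(improved(480, 120, 0.05) ? "keep change" : "trigger rollback");
        System.out.println(improved(480, 470, 0.05) ? "keep change" : "trigger rollback");
    }
}
```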
Implemented in src/main/java/com/cloudoptimizer/agent/benchmark/
Included scenarios:
- retry storm
- thread saturation
- CPU throttling
- heap overprovisioning
- burst traffic
ACO includes deterministic benchmark scenarios so governance claims can be tested without needing production data.
Example outcomes from the benchmark layer:
- Retry storm: 3× amplification factor
- Naive latency-reactive agent: p99 worsened by 58%
- Governed agent: p99 recovered by 75% (480 ms → 120 ms); throughput restored from 40 RPS to 78 RPS
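The 3× amplification factor is easy to sanity-check with a geometric model: if every failure triggers a retry up to a cap, expected attempts per request form a geometric series. The parameters below (a 2-retry policy, persistent failure) are my assumptions, since the benchmark's exact configuration isn't stated here:

```java
// Back-of-envelope retry amplification: expected attempts per request
// with up to maxRetries retries and per-attempt failure probability p
// is the geometric series 1 + p + p^2 + ... + p^maxRetries.
public class RetryAmplificationSketch {
    static double amplification(int maxRetries, double failureRate) {
        double attempts = 0;
        double pPow = 1.0;
        for (int i = 0; i <= maxRetries; i++) {
            attempts += pPow;
            pPow *= failureRate;
        }
        return attempts;
    }

    public static void main(String[] args) {
        // 2 retries under total failure: every request costs 3 attempts (3x).
        System.out.println(amplification(2, 1.0)); // 3.0
        System.out.println(amplification(2, 0.5)); // 1.75
    }
}
```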
That is the whole point of the project: useful automation should dampen instability, not cosplay as a chaos monkey.
ACO writes artifacts and reports under artifacts/.
Typical outputs include:
- `baseline.json`
- `after.json`
- `report.json`
- optimization-plan artifacts
- reasoning traces
- validation and rollback records
You can also inspect example outputs in examples/README.md.
```bash
mvn clean package -DskipTests
./scripts/run-agent.sh
# Windows: scripts\run-agent.bat
./scripts/run-web-ui.sh
# Windows: scripts\run-web-ui.bat
./scripts/verify-ollama.sh
```

- Java 21
- Spring Boot 3.2
- Spring AI
- Ollama for local LLM inference
- Jackson for artifact serialization
- JUnit for test coverage
- Docker / Docker Compose for local execution
- docs/START_HERE.md — setup and first run
- docs/INDEX.md — current documentation map
- docs/WHAT_THIS_IS.md — scope and boundaries
- docs/OLLAMA_SETUP.md — Ollama install help
- docs/WINDOWS_SETUP.md — Windows notes
- docs/ARCHITECTURE_PATTERNS.md — design patterns
- docs/STARTUP_READY_PLAN.md — rollout/readiness plan
These are roadmap items, not shipped capabilities:
- GitOps PR generation
- OpenSLO ingestion
- external telemetry adapters beyond the current local flow
- broader multi-service / multi-region rollout controls
If it is not in code and tested, it does not get to sit in the README pretending it exists.
MIT — see LICENSE.