Skip to content

Ashutosh-177/HuggingFace-Pytorch-Hackathon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

title Incident Response Triage
emoji 🚨
colorFrom red
colorTo orange
sdk docker
app_port 7860
pinned false
tags
openenv

🚨 Incident Response Triage β€” OpenEnv

An OpenEnv-compliant reinforcement learning environment simulating production incident response and triage. AI agents act as Site Reliability Engineers (SREs) who must diagnose and resolve real-world infrastructure incidents across a simulated microservices architecture.

Why Incident Response?

Incident response costs the tech industry $70B+ annually in downtime. SREs spend countless hours triaging alerts, correlating logs, and remediating failures. This environment provides a realistic training ground for AI agents to learn:

  • Alert triage β€” prioritize and correlate production alerts
  • Root cause analysis β€” trace through service dependencies to find the source
  • Remediation β€” apply the correct fix (restart, rollback, scale, etc.)
  • Efficient resolution β€” minimize time-to-resolution under pressure

Architecture

The environment simulates a 10-service microservices system:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ web-frontend β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ api-gateway  │──────┬──────┬──────┬──────────────┐
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β–Ό      β–Ό      β–Ό              β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”
              β”‚auth-svc  β”‚ β”‚user-svc    β”‚ β”‚order-svc         β”‚ β”‚notif  β”‚
              β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”˜
                   β–Ό             β–Ό            β–Ό    β–Ό     β–Ό          β–Ό
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”
              β”‚ database β”‚  β”‚ cache β”‚  β”‚payment   β”‚ β”‚ msg-q   β”‚ β”‚ cache β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”˜

Tasks

Task ID Difficulty Description Max Steps
task_easy 🟒 Easy Memory leak in user-service. Clear symptoms, straightforward fix. 15
task_medium 🟑 Medium Database connection pool exhaustion causing cascading timeouts. Multiple services affected. 20
task_hard πŸ”΄ Hard Bad deployment of payment-service causes cascading failures with misleading symptoms and red herrings. 25

Action Space

Text commands sent as strings:

Command Description
INVESTIGATE <service> View recent logs for a service
CHECK_METRICS <service> View CPU, memory, latency, error rate, etc.
CHECK_DEPENDENCIES <service> View upstream/downstream dependencies
RESTART <service> Restart a service
SCALE <service> <count> Scale service instances
ROLLBACK <service> Roll back to previous deployment version
FLUSH_CACHE Clear the Redis cache
INCREASE_CONNECTIONS <service> Increase database connection pool
DIAGNOSE <root_cause> Submit your root cause diagnosis
RESOLVE <summary> Mark incident as resolved

Services: web-frontend, api-gateway, auth-service, user-service, order-service, payment-service, notification-service, database, cache, message-queue


Observation Space

{
  "alerts": [
    {"id": "A1", "severity": "critical", "service": "user-service",
     "message": "Memory usage critical: 92%", "timestamp": "10:24:00"}
  ],
  "services": {
    "user-service": {
      "cpu_percent": 45.0, "memory_percent": 92.0,
      "request_latency_ms": 850, "error_rate_percent": 12.0,
      "active_connections": 156, "status": "degraded"
    }
  },
  "action_result": "Logs for user-service showing memory leak...",
  "step_number": 3,
  "steps_remaining": 12,
  "incident_resolved": false,
  "task_id": "task_easy",
  "done": false
}

Reward Function

Event Reward
Each step -0.01 (time pressure)
Investigate root cause service +0.10
Investigate context/symptoms +0.05
Correct root cause diagnosis +0.20 to +0.25
Correct primary remediation +0.25 to +0.35
Correct secondary remediation (hard) +0.10
Successful resolution +0.25
Wrong remediation action -0.05
Invalid command -0.03

Total possible: ~1.0 per task (minus step penalties). Score normalized to 0.0 – 1.0.


API Endpoints

Method Endpoint Description
GET / Health check
POST /reset Reset environment {"task_id": "task_easy"}
POST /step Execute action {"command": "INVESTIGATE user-service"}
GET /state Get current state with milestones
GET /tasks List all tasks
GET /grade/{task_id} Run grader for a task

Setup & Running

Local

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# 3. Set environment variables
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=your_api_key
export ENV_URL=http://localhost:7860

# 4. Run inference
python inference.py

Docker

docker build -t incident-response-triage .
docker run -p 7860:7860 incident-response-triage

Then in another terminal:

export ENV_URL=http://localhost:7860
python inference.py

Environment Variables

Variable Description
API_BASE_URL LLM API endpoint (e.g., https://api.openai.com/v1)
MODEL_NAME Model identifier (e.g., gpt-4o-mini)
HF_TOKEN Hugging Face / API key
ENV_URL URL where the env server is running

Baseline Scores

Task Greedy Agent Expected LLM Agent
task_easy ~0.94 0.70 – 0.95
task_medium ~0.92 0.50 – 0.85
task_hard ~0.90 0.30 – 0.75

Project Structure

β”œβ”€β”€ app.py              # FastAPI server (OpenEnv endpoints)
β”œβ”€β”€ environment.py      # Core environment logic (reset/step/state/grade)
β”œβ”€β”€ models.py           # Pydantic typed models
β”œβ”€β”€ scenarios.py        # Incident scenario definitions
β”œβ”€β”€ inference.py        # Baseline inference script
β”œβ”€β”€ openenv.yaml        # OpenEnv specification
β”œβ”€β”€ requirements.txt    # Python dependencies
β”œβ”€β”€ Dockerfile          # Container definition
└── README.md           # This file

Meta_Hackathon

Meta_Hackathon

Meta_Hackathon

About

OpenEnv-based RL environment for training AI agents in incident response, root cause analysis, and remediation within microservices systems.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors