
[New Env] Cloud SRE & FinOps Environment #506

Open
naveenkumar982 wants to merge 4 commits into huggingface:main from naveenkumar982:add-cloud-sre-env

Conversation

@naveenkumar982


Implements a new Cloud SRE & FinOps environment for OpenEnv. This environment features 3 difficulty-tiered tasks (Phantom Volume Cleanup, Latency Spike Remediation, and Noisy Neighbor Incident), testing an agent's ability to diagnose outages, optimize costs, and perform multi-step mitigations without causing collateral damage to production workloads.

Features:

  • Deterministic grading with fine-grained breakdowns.
  • Seeded procedural generation for reproducible RL training.
  • Chaos event injection for robustness testing.
  • Full integration with the OpenEnv client/server APIs.
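As a rough illustration of what seeded procedural generation can look like, here is a minimal sketch. The function name `generate_resources` and the resource schema are hypothetical and are not taken from this PR; the only point is that a dedicated `random.Random(seed)` instance yields the same scenario for the same seed.

```python
import random


def generate_resources(seed: int, count: int = 5) -> list[dict]:
    """Build a small synthetic resource inventory from a seed (hypothetical schema)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    resources = []
    for i in range(count):
        resources.append(
            {
                "id": f"res-{i}",
                "type": rng.choice(["ec2", "rds", "ebs"]),
                "cpu_utilization": round(rng.uniform(5.0, 95.0), 1),
            }
        )
    return resources


# The same seed reproduces the same inventory, which is what makes episodes repeatable.
first = generate_resources(seed=42)
second = generate_resources(seed=42)
```

Using a per-environment RNG rather than `random.seed(...)` keeps reproducibility intact even when other code in the process also draws random numbers.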

@meta-cla

meta-cla Bot commented Apr 4, 2026


Hi @naveenkumar982!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@greptile-apps

greptile-apps Bot commented Apr 4, 2026

Contributor

Greptile Summary

This PR adds a new Cloud SRE & FinOps environment with three difficulty-tiered tasks (phantom volume cleanup, latency spike remediation, noisy neighbor incident), seeded procedural generation, chaos injection, and deterministic grading — all following the expected OpenEnv container+client layout.

  • P1: _handle_scale (line 730) checks resource.get("cpu_utilization") after the mutation loop has already set it to 45.0 in-place, making the reward += 0.05 bonus permanently unreachable.
  • Alignment: SREAction/SREObservation/SREState are @dataclasses rather than Pydantic models; this may conflict with the wire-type invariant depending on what the base Action/Observation/State classes are, so it is worth a quick confirmation before merge.
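The P1 finding can be reproduced outside the environment with a few lines. This is a simplified sketch, not the PR's code: `find_resource`, `scale_buggy`, and `scale_fixed` are stand-ins that keep only the CPU logic, assuming (as the review describes) that the lookup returns a live reference to the dict stored in the resource list.

```python
def find_resource(resources, resource_id):
    """Return the live dict from the list (no copy), mirroring the described lookup."""
    for r in resources:
        if r["id"] == resource_id:
            return r
    return None


def scale_buggy(resources, resource_id):
    resource = find_resource(resources, resource_id)
    for r in resources:
        if r["id"] == resource_id:
            if r.get("cpu_utilization", 0) > 80:
                r["cpu_utilization"] = 45.0  # mutates the same object as `resource`
            break
    reward = 0.08
    if resource.get("cpu_utilization", 0) > 80:  # reads the post-mutation value
        reward += 0.05
    return reward


def scale_fixed(resources, resource_id):
    resource = find_resource(resources, resource_id)
    pre_mutation_cpu = resource.get("cpu_utilization", 0)  # capture before mutating
    for r in resources:
        if r["id"] == resource_id:
            if r.get("cpu_utilization", 0) > 80:
                r["cpu_utilization"] = 45.0
            break
    reward = 0.08
    if pre_mutation_cpu > 80:
        reward += 0.05
    return reward


buggy_reward = scale_buggy([{"id": "db-1", "cpu_utilization": 95.0}], "db-1")
fixed_reward = scale_fixed([{"id": "db-1", "cpu_utilization": 95.0}], "db-1")
```

With a resource that starts at 95% CPU, the buggy variant never grants the bonus, while the fixed variant does; a regression test along these lines would catch the dead branch.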

Confidence Score: 4/5

Safe to merge after fixing the _handle_scale reward bug; the alignment question on dataclasses vs Pydantic should be confirmed but is unlikely to break the current wire format.

One genuine P1 logic defect (dead-code reward branch) prevents a score of 5. All other findings are P2 style or alignment questions that do not block correctness of the primary task flows.

Files needing attention: envs/cloud_sre_env/server/cloud_sre_environment.py (lines 720-732) and envs/cloud_sre_env/models.py (dataclass vs Pydantic alignment).

Important Files Changed

| Filename | Overview |
|---|---|
| envs/cloud_sre_env/server/cloud_sre_environment.py | Core env logic is sound, but _handle_scale contains a dead-code reward branch due to mutating the resource dict in-place before reading from it. |
| envs/cloud_sre_env/models.py | Typed models for action/observation/state; uses @dataclass rather than Pydantic; alignment flag raised. |
| envs/cloud_sre_env/client.py | Clean EnvClient subclass; correctly serializes SREAction and deserializes SREObservation/SREState. |
| envs/cloud_sre_env/server/app.py | Minimal create_fastapi_app wiring; follows echo_env pattern exactly. |
| .github/workflows/docker-build.yml | Adds cloud-sre-env to the Docker build matrix but omits context, relying silently on the ` |
| tests/envs/test_cloud_sre_environment.py | Good coverage of all three tasks, lifecycle tests, seeded state, and grading; imports server code directly (acceptable in test context). |
| envs/cloud_sre_env/openenv.yaml | Valid metadata; version: 2.0.0 is unusually high for a new env but is cosmetic. |
| envs/cloud_sre_env/server/Dockerfile | Correct multi-layer copy from repo root; health check and CMD are properly configured. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent sends action] --> B[step()]
    B --> C{cmd?}
    C -- terminate --> D[_handle_terminate]
    C -- scale --> E[_handle_scale]
    C -- reboot --> F[_handle_reboot]
    C -- inspect --> G[_handle_inspect]
    C -- wait --> H[step_reward = -0.01]
    D & E & F & G & H --> I[_action_history.append]
    I --> J{chaos_enabled?}
    J -- yes --> K[_maybe_inject_chaos]
    J -- no --> L[_recalculate_state]
    K --> L
    L --> M{current_step >= MAX_STEPS?}
    M -- yes --> N[done = True]
    M -- no --> O[_build_observation]
    N --> O
    O --> P[Return SREObservation]
    P --> Q[grade() — called by orchestrator]
    Q --> R{task?}
    R -- Task1 --> S[PhantomVolumeCleanup grader]
    R -- Task2 --> T[LatencySpikeRemediation grader]
    R -- Task3 --> U[NoisyNeighborIncident grader]
    S & T & U --> V[Return score, breakdown]
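The step flow in the flowchart above can be sketched as a small dispatch loop. This is an illustrative toy, not the PR's implementation: `MiniSREEnv`, the placeholder handler rewards, the `MAX_STEPS` value, and the returned dict are all assumptions; only the ordering of stages and the `-0.01` reward for `wait` come from the diagram.

```python
MAX_STEPS = 3  # assumed small limit for illustration; the real value is not shown


class MiniSREEnv:
    """Toy stand-in for the step() flow; handlers return placeholder rewards."""

    def __init__(self, chaos_enabled: bool = False):
        self.chaos_enabled = chaos_enabled
        self.current_step = 0
        self.action_history = []
        self.chaos_events = 0
        # Placeholder handlers: the real ones mutate environment state.
        self._handlers = {
            "terminate": lambda: 0.0,
            "scale": lambda: 0.0,
            "reboot": lambda: 0.0,
            "inspect": lambda: 0.0,
        }

    def step(self, cmd: str) -> dict:
        self.current_step += 1
        if cmd == "wait":
            step_reward = -0.01
        else:
            # Unknown commands raise KeyError in this sketch; the flowchart
            # does not say how the real environment treats them.
            step_reward = self._handlers[cmd]()
        self.action_history.append(cmd)
        if self.chaos_enabled:
            self.chaos_events += 1  # stand-in for _maybe_inject_chaos
        # A real environment would recalculate derived state here.
        done = self.current_step >= MAX_STEPS
        return {"reward": step_reward, "done": done, "step": self.current_step}


env = MiniSREEnv(chaos_enabled=True)
observations = [env.step(cmd) for cmd in ("wait", "inspect", "scale")]
```

Grading is kept out of `step()` in this sketch because the flowchart shows `grade()` being called separately by the orchestrator.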

Comments Outside Diff (1)

  1. envs/cloud_sre_env/models.py, line 84-126 (link)

    P2 ALIGNMENT FLAG: Wire types are @dataclass, not Pydantic models

    INVARIANTS.md states: "All wire types (Action, Observation, State) must be Pydantic models — Serialization must be JSON-compatible." SREAction, SREObservation, and SREState are plain Python @dataclasses. If the Action/Observation/State base classes in openenv.core.env_server are Pydantic BaseModel subclasses, mixing @dataclass on top breaks Pydantic's field validation and .model_dump() / .model_validate() contract at the wire boundary.

    Context Used: .claude/docs/INVARIANTS.md (source)

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: envs/cloud_sre_env/models.py
    Line: 84-126
    
    Comment:
    **ALIGNMENT FLAG: Wire types are `@dataclass`, not Pydantic models**
    
    `INVARIANTS.md` states: *"All wire types (Action, Observation, State) must be Pydantic models — Serialization must be JSON-compatible."* `SREAction`, `SREObservation`, and `SREState` are plain Python `@dataclass`es. If the `Action`/`Observation`/`State` base classes in `openenv.core.env_server` are Pydantic `BaseModel` subclasses, mixing `@dataclass` on top breaks Pydantic's field validation and `.model_dump()` / `.model_validate()` contract at the wire boundary.
    
    - **Principle at stake**: Pydantic serialization invariant (`INVARIANTS.md` §API Invariants #3)
    - **Suggested reviewer**: `@darktex`
    
    **Context Used:** .claude/docs/INVARIANTS.md ([source](https://app.greptile.com/review/custom-context?memory=dbd1ab5e-bd4d-4701-9de0-9817404155a9))
    
    How can I resolve this? If you propose a fix, please make it concise.
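One quick way to do the confirmation this comment asks for is to inspect the classes directly. The helper below is a hypothetical sketch (`describe_wire_type` is not part of OpenEnv) that uses only the standard library: it reports whether a class is a dataclass and whether it exposes the `model_dump` / `model_validate` methods named in the comment. `FakePydanticModel` is a stand-in so the example runs without Pydantic installed.

```python
import dataclasses


def describe_wire_type(cls: type) -> dict:
    """Report which serialization style a class appears to follow."""
    return {
        "is_dataclass": dataclasses.is_dataclass(cls),
        # Duck-typed check for the Pydantic v2 method names cited in the review.
        "has_pydantic_api": hasattr(cls, "model_dump") and hasattr(cls, "model_validate"),
    }


@dataclasses.dataclass
class PlainAction:
    """Stand-in for a wire type declared as a plain dataclass."""

    command: str


class FakePydanticModel:
    """Stand-in exposing the two method names; not a real Pydantic model."""

    def model_dump(self):
        return {}

    @classmethod
    def model_validate(cls, data):
        return cls()


plain_report = describe_wire_type(PlainAction)
pydantic_like_report = describe_wire_type(FakePydanticModel)
```

If both flags come back `True` for a real wire type, that would point to the mixed case the comment warns about, where `@dataclass` is applied on top of a Pydantic base class.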
Prompt To Fix All With AI
This is a comment left during a code review.
Path: envs/cloud_sre_env/server/cloud_sre_environment.py
Line: 720-732

Comment:
**`_handle_scale` CPU-reward bonus is dead code**

`_find_resource` returns a live reference to the dict in `self._env_state["resources"]`. The mutation loop (lines 720–726) sets `r["cpu_utilization"] = 45.0` in-place on that same object, so when line 730 checks `resource.get("cpu_utilization", 0) > 80`, the value is already `45.0` — the condition can never be `True` and `reward += 0.05` is unreachable.

Capture the CPU value before the mutation loop:

```suggestion
        old_size = resource.get("instance_size", "unknown")
        pre_mutation_cpu = resource.get("cpu_utilization", 0)

        if resource.get("type") == ResourceType.RDS.value:
            if target_size not in RDS_PRICING:
                return -0.05, f"Error: Invalid RDS size '{target_size}'"
            new_cost = RDS_PRICING[target_size]
        else:
            new_cost = resource.get("cost_per_hour", 0) * 2.0

        for r in self._env_state["resources"]:
            if r["id"] == resource_id:
                r["instance_size"] = target_size
                r["cost_per_hour"] = new_cost
                if r.get("cpu_utilization", 0) > 80:
                    r["cpu_utilization"] = 45.0
                break

        self._resolve_alerts_for(resource_id)
        reward = 0.08
        if pre_mutation_cpu > 80:
            reward += 0.05
        return reward, f"Scaled '{resource_id}' from {old_size} to {target_size}"
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: .github/workflows/docker-build.yml
Line: 117-119

Comment:
**Missing `context` field (silently falls back to repo root)**

Every other matrix entry explicitly sets `context:`. For `cloud-sre-env` it is absent, so the workflow resolves it via `${{ matrix.image.context || '.' }}`. The Dockerfile's `COPY src/openenv/core/` and `COPY envs/cloud_sre_env/` do require the repo root as context, so the build is correct today — but the implicit fallback is inconsistent with every other entry and fragile if the default ever changes.

```suggestion
          - name: cloud-sre-env
            dockerfile: envs/cloud_sre_env/server/Dockerfile
            context: .
```

How can I resolve this? If you propose a fix, please make it concise.
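A small lint check could keep the build matrix consistent going forward. The sketch below assumes the matrix has already been parsed into a list of dicts; `entries_missing_context` is a hypothetical helper, and the `echo-env` entry and its Dockerfile path are made up for illustration rather than copied from the workflow file.

```python
def entries_missing_context(matrix_entries):
    """Return the names of matrix entries that rely on the implicit context fallback."""
    return [entry["name"] for entry in matrix_entries if "context" not in entry]


matrix = [
    # Hypothetical existing entry with an explicit context.
    {"name": "echo-env", "dockerfile": "envs/echo_env/server/Dockerfile", "context": "."},
    # Entry as described in the review: no context key.
    {"name": "cloud-sre-env", "dockerfile": "envs/cloud_sre_env/server/Dockerfile"},
]

missing = entries_missing_context(matrix)
```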


Reviews (1): Last reviewed commit: "feat: Add Cloud SRE & FinOps Environment" | Re-trigger Greptile

Comment thread envs/cloud_sre_env/server/cloud_sre_environment.py
Comment thread .github/workflows/docker-build.yml
@meta-cla meta-cla Bot added the CLA Signed label Apr 4, 2026

@naveenkumar982 naveenkumar982 left a comment

Author


everything is updated now


Labels

CLA Signed

1 participant