[New Env] Cloud SRE & FinOps Environment by naveenkumar982 · Pull Request #506 · huggingface/OpenEnv

naveenkumar982 · 2026-04-04T16:22:56Z

Implements a new Cloud SRE & FinOps environment for OpenEnv. This environment features 3 difficulty-tiered tasks (Phantom Volume Cleanup, Latency Spike Remediation, and Noisy Neighbor Incident), testing an agent's ability to diagnose outages, optimize costs, and perform multi-step mitigations without causing collateral damage to production workloads.

Features:

Deterministic grading with fine-grained breakdowns.
Seeded procedural generation for reproducible RL training.
Chaos event injection for robustness testing.
Full integration with the OpenEnv client/server APIs.
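As a rough illustration of the seeded generation listed above, here is a minimal sketch. Every name in it (`generate_resources`, the resource fields, the ID format) is hypothetical and not taken from this PR. It only assumes that a dedicated `random.Random(seed)` instance drives generation, so the same seed reproduces the same scenario.

```python
import random


def generate_resources(seed: int, count: int = 5) -> list[dict]:
    """Hypothetical generator: the same seed yields the same resource list."""
    rng = random.Random(seed)  # private RNG, independent of global random state
    resources = []
    for i in range(count):
        resources.append(
            {
                "id": f"vol-{i:03d}",
                "attached": rng.random() < 0.5,
                "cost_per_hour": round(rng.uniform(0.01, 0.50), 3),
            }
        )
    return resources
```

Using a per-environment `random.Random` rather than the module-level functions keeps episodes reproducible even if other code draws from the global RNG.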

meta-cla · 2026-04-04T16:23:02Z

Hi @naveenkumar982!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

greptile-apps · 2026-04-04T16:27:20Z

Greptile Summary

This PR adds a new Cloud SRE & FinOps environment with three difficulty-tiered tasks (phantom volume cleanup, latency spike remediation, noisy neighbor incident), seeded procedural generation, chaos injection, and deterministic grading — all following the expected OpenEnv container+client layout.

P1: _handle_scale (line 730) checks resource.get("cpu_utilization") after the mutation loop has already set it to 45.0 in place, making the reward += 0.05 bonus permanently unreachable.
Alignment: SREAction/SREObservation/SREState are @dataclasses rather than Pydantic models; this may conflict with the wire-type invariant, depending on what the base Action/Observation/State classes are. Worth a quick confirmation before merge.

Confidence Score: 4/5

Safe to merge after fixing the _handle_scale reward bug; the alignment question on dataclasses vs Pydantic should be confirmed but is unlikely to break the current wire format.

One genuine P1 logic defect (dead-code reward branch) prevents a score of 5. All other findings are P2 style or alignment questions that do not block correctness of the primary task flows.

Files needing attention: envs/cloud_sre_env/server/cloud_sre_environment.py (lines 720-732) and envs/cloud_sre_env/models.py (dataclass vs Pydantic alignment).

Important Files Changed

Filename	Overview
envs/cloud_sre_env/server/cloud_sre_environment.py	Core env logic is sound, but `_handle_scale` contains a dead-code reward branch due to mutating the resource dict in-place before reading from it.
envs/cloud_sre_env/models.py	Typed models for action/observation/state; uses `@dataclass` rather than Pydantic — alignment flag raised.
envs/cloud_sre_env/client.py	Clean `EnvClient` subclass; correctly serializes `SREAction` and deserializes `SREObservation`/`SREState`.
envs/cloud_sre_env/server/app.py	Minimal `create_fastapi_app` wiring; follows echo_env pattern exactly.
.github/workflows/docker-build.yml	Adds cloud-sre-env to the Docker build matrix but omits `context`, relying silently on the `
tests/envs/test_cloud_sre_environment.py	Good coverage of all three tasks, lifecycle tests, seeded state, and grading; imports server code directly (acceptable in test context).
envs/cloud_sre_env/openenv.yaml	Valid metadata; `version: 2.0.0` is unusually high for a new env but is cosmetic.
envs/cloud_sre_env/server/Dockerfile	Correct multi-layer copy from repo root; health check and CMD are properly configured.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent sends action] --> B["step()"]
    B --> C{cmd?}
    C -- terminate --> D[_handle_terminate]
    C -- scale --> E[_handle_scale]
    C -- reboot --> F[_handle_reboot]
    C -- inspect --> G[_handle_inspect]
    C -- wait --> H[step_reward = -0.01]
    D & E & F & G & H --> I[_action_history.append]
    I --> J{chaos_enabled?}
    J -- yes --> K[_maybe_inject_chaos]
    J -- no --> L[_recalculate_state]
    K --> L
    L --> M{current_step >= MAX_STEPS?}
    M -- yes --> N[done = True]
    M -- no --> O[_build_observation]
    N --> O
    O --> P[Return SREObservation]
    P --> Q["grade() - called by orchestrator"]
    Q --> R{task?}
    R -- Task1 --> S[PhantomVolumeCleanup grader]
    R -- Task2 --> T[LatencySpikeRemediation grader]
    R -- Task3 --> U[NoisyNeighborIncident grader]
    S & T & U --> V[Return score, breakdown]
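
The control flow in the flowchart can be sketched in Python as follows. This is a simplified, hypothetical mirror of the diagram, not the PR's implementation: the class name, the `MAX_STEPS` value, the placeholder rewards, the chaos probability, and the point where `current_step` is incremented are all assumptions. Only the `wait` penalty of -0.01 and the ordering of the stages come from the flowchart.

```python
import random

MAX_STEPS = 20  # assumed value; the real constant is not shown in this thread


class StepFlowSketch:
    """Hypothetical, simplified mirror of the flowchart; not the PR's code."""

    def __init__(self, chaos_enabled: bool = False, seed: int = 0) -> None:
        self.chaos_enabled = chaos_enabled
        self.rng = random.Random(seed)
        self.current_step = 0
        self.action_history: list[str] = []
        self.chaos_events = 0
        self.done = False

    def step(self, cmd: str) -> dict:
        # Dispatch on the command. Only the "wait" penalty comes from the
        # flowchart; the other rewards are placeholders.
        handlers = {
            "terminate": lambda: 0.0,
            "scale": lambda: 0.0,
            "reboot": lambda: 0.0,
            "inspect": lambda: 0.0,
            "wait": lambda: -0.01,
        }
        if cmd not in handlers:
            raise ValueError(f"unknown command: {cmd!r}")
        step_reward = handlers[cmd]()

        # Every branch records the action before any chaos is injected.
        self.action_history.append(cmd)
        self.current_step += 1  # assumed location of the increment

        # Chaos injection is optional and precedes state recalculation.
        if self.chaos_enabled and self.rng.random() < 0.2:
            self.chaos_events += 1

        # State recalculation would happen here; then the budget is checked.
        if self.current_step >= MAX_STEPS:
            self.done = True

        # The observation is built whether or not the episode just ended.
        return {"reward": step_reward, "done": self.done, "step": self.current_step}
```

Grading is deliberately left out of `step()` here, since the flowchart shows `grade()` being called separately by the orchestrator.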

Comments Outside Diff (1)

envs/cloud_sre_env/models.py, line 84-126 (link)

ALIGNMENT FLAG: Wire types are @dataclass, not Pydantic models

INVARIANTS.md states: "All wire types (Action, Observation, State) must be Pydantic models — Serialization must be JSON-compatible." SREAction, SREObservation, and SREState are plain Python @dataclasses. If the Action/Observation/State base classes in openenv.core.env_server are Pydantic BaseModel subclasses, mixing @dataclass on top breaks Pydantic's field validation and .model_dump() / .model_validate() contract at the wire boundary.

Principle at stake: Pydantic serialization invariant (INVARIANTS.md §API Invariants #3)
Suggested reviewer: @darktex

Context Used: .claude/docs/INVARIANTS.md (source)
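
To make the concern concrete, here is a minimal standard-library sketch of what a plain `@dataclass` wire type does and does not provide. `ActionSketch` and its fields are hypothetical stand-ins, not the PR's `SREAction`, and the sketch does not reproduce the OpenEnv base classes. It only shows that dataclass type hints are not enforced and that the Pydantic v2 methods named above are absent.

```python
from dataclasses import dataclass


@dataclass
class ActionSketch:
    """Hypothetical stand-in for a wire type declared as a plain dataclass."""

    command: str
    resource_id: str = ""


# Type hints on a dataclass are not enforced: this constructs without error.
bad = ActionSketch(command=123)  # type: ignore[arg-type]

# A plain dataclass also lacks the Pydantic v2 serialization methods.
has_model_dump = hasattr(bad, "model_dump")
has_model_validate = hasattr(ActionSketch, "model_validate")
```

Both flags come out `False`, so any server code that calls `.model_dump()` or `.model_validate()` at the wire boundary would fail on such a type.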

Prompt To Fix All With AI

This is a comment left during a code review.
Path: envs/cloud_sre_env/server/cloud_sre_environment.py
Line: 720-732

Comment:
**`_handle_scale` CPU-reward bonus is dead code**

`_find_resource` returns a live reference to the dict in `self._env_state["resources"]`. The mutation loop (lines 720–726) sets `r["cpu_utilization"] = 45.0` in-place on that same object, so when line 730 checks `resource.get("cpu_utilization", 0) > 80`, the value is already `45.0` — the condition can never be `True` and `reward += 0.05` is unreachable.

Capture the CPU value before the mutation loop:

```suggestion
        old_size = resource.get("instance_size", "unknown")
        pre_mutation_cpu = resource.get("cpu_utilization", 0)

        if resource.get("type") == ResourceType.RDS.value:
            if target_size not in RDS_PRICING:
                return -0.05, f"Error: Invalid RDS size '{target_size}'"
            new_cost = RDS_PRICING[target_size]
        else:
            new_cost = resource.get("cost_per_hour", 0) * 2.0

        for r in self._env_state["resources"]:
            if r["id"] == resource_id:
                r["instance_size"] = target_size
                r["cost_per_hour"] = new_cost
                if r.get("cpu_utilization", 0) > 80:
                    r["cpu_utilization"] = 45.0
                break

        self._resolve_alerts_for(resource_id)
        reward = 0.08
        if pre_mutation_cpu > 80:
            reward += 0.05
        return reward, f"Scaled '{resource_id}' from {old_size} to {target_size}"
```

How can I resolve this? If you propose a fix, please make it concise.
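
The aliasing problem described in this comment can be reproduced in isolation. The sketch below is a stripped-down, hypothetical reduction: the function names and the flat `resources` list are illustrative and omit pricing, alerts, and size handling. It assumes only what the comment states, namely that the lookup returns the live dict and that the loop resets `cpu_utilization` to 45.0 in place.

```python
from typing import Optional


def find_resource(resources: list, resource_id: str) -> Optional[dict]:
    """Return the live dict from the list (no copy)."""
    for r in resources:
        if r["id"] == resource_id:
            return r
    return None


def _mutate(resources: list, resource_id: str) -> None:
    # Mirrors the in-place reset performed by the mutation loop.
    for r in resources:
        if r["id"] == resource_id:
            if r.get("cpu_utilization", 0) > 80:
                r["cpu_utilization"] = 45.0
            break


def scale_buggy(resources: list, resource_id: str) -> float:
    resource = find_resource(resources, resource_id)
    _mutate(resources, resource_id)
    reward = 0.08
    if resource.get("cpu_utilization", 0) > 80:  # reads the mutated value
        reward += 0.05
    return reward


def scale_fixed(resources: list, resource_id: str) -> float:
    resource = find_resource(resources, resource_id)
    pre_mutation_cpu = resource.get("cpu_utilization", 0)  # captured first
    _mutate(resources, resource_id)
    reward = 0.08
    if pre_mutation_cpu > 80:
        reward += 0.05
    return reward
```

Called on a resource with `cpu_utilization` of 95.0, `scale_buggy` returns only the base reward, while `scale_fixed` includes the bonus.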

---

This is a comment left during a code review.
Path: .github/workflows/docker-build.yml
Line: 117-119

Comment:
**Missing `context` field (silently falls back to repo root)**

Every other matrix entry explicitly sets `context:`. For `cloud-sre-env` it is absent, so the workflow resolves it via `${{ matrix.image.context || '.' }}`. The Dockerfile's `COPY src/openenv/core/` and `COPY envs/cloud_sre_env/` do require the repo root as context, so the build is correct today — but the implicit fallback is inconsistent with every other entry and fragile if the default ever changes.

```suggestion
          - name: cloud-sre-env
            dockerfile: envs/cloud_sre_env/server/Dockerfile
            context: .
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat: Add Cloud SRE & FinOps Environment"

naveenkumar982

everything is updated now

feat: Add Cloud SRE & FinOps Environment

bf831b0

greptile-apps Bot reviewed Apr 4, 2026

Comment thread envs/cloud_sre_env/server/cloud_sre_environment.py

Comment thread .github/workflows/docker-build.yml

fix(envs): Capture pre-mutation CPU for scaling reward in SRE env

cce62af

meta-cla Bot added the CLA Signed label Apr 4, 2026

naveenkumar982 added 2 commits April 4, 2026 22:04

chore(ci): Explicitly define docker build context for cloud-sre-env

4d067a4

fix(envs): Convert SRE wire types to pure Pydantic models

954f01c

naveenkumar982 wants to merge 4 commits into huggingface:main from naveenkumar982:add-cloud-sre-env
