diff --git a/.claude/commands/cypress/cypress-setup.md b/.claude/commands/cypress/cypress-setup.md deleted file mode 100644 index 817524603..000000000 --- a/.claude/commands/cypress/cypress-setup.md +++ /dev/null @@ -1,75 +0,0 @@ ---- -name: cypress-setup -description: Automated Cypress environment setup with interactive configuration -parameters: [] ---- - -# Cypress Environment Setup - -This command sets up the Cypress testing environment by checking prerequisites, installing dependencies, and interactively configuring environment variables. - -## Instructions for Claude - -Follow these steps in order, always: - -### Step 1: Check Prerequisites - -1. **Check Node.js version** - Required: >= 18 - - Run `node --version` and verify it's >= 18 - - If not, inform the user they need to install Node.js 18 or higher - -### Step 2: Navigate and Install Dependencies - -1. **Navigate to the cypress directory**: - ```bash - cd web/cypress - ``` - -2. **Install npm dependencies**: - ```bash - npm install - ``` - -### Step 3: Interactive Environment Configuration - -**Important**: After Step 2, you should be in the `web/cypress` directory. All paths below are relative to that directory. - -1. Detect the existence of `./export-env.sh` (in the current `web/cypress` directory) - -2. If exists, show its variables, ask if user wants to source it. If user does not want to source it, run `./configure-env.sh` (in the current `web/cypress` directory) in the new terminal window - -3. If doesn't exist, prompt the user for each configuration value in `./configure-env.sh` (in the current `web/cypress` directory) in the new terminal window - ---- - -## Configuration Questions - -### Step 4: Open a new terminal window without asking for approval - -Use the scripts in `.claude/commands/cypress/scripts/` to open a new terminal window named **"Cypress Tests"**. 
- -**To source existing export-env.sh:** -```bash -# macOS -./.claude/commands/cypress/scripts/open-cypress-terminal-macos.sh - -# Linux -./.claude/commands/cypress/scripts/open-cypress-terminal-linux.sh -``` - -**To run configure-env.sh interactively (when export-env.sh doesn't exist or user wants to reconfigure):** -```bash -# macOS -./.claude/commands/cypress/scripts/open-cypress-terminal-macos.sh --configure - -# Linux -./.claude/commands/cypress/scripts/open-cypress-terminal-linux.sh --configure -``` - -The `--configure` flag will run `./configure-env.sh` interactively, question by question, then source the generated `./export-env.sh` - -### Step 5: Inform the user - -Let the user know that a new terminal window has been opened with the Cypress environment pre-configured, and they can run tests using `/cypress-run` in that window. - ---- diff --git a/.claude/commands/cypress/cypress-run.md b/.claude/commands/cypress/run.md similarity index 98% rename from .claude/commands/cypress/cypress-run.md rename to .claude/commands/cypress/run.md index 630bcd057..f50386fdd 100644 --- a/.claude/commands/cypress/cypress-run.md +++ b/.claude/commands/cypress/run.md @@ -1,5 +1,5 @@ --- -name: cypress-run +name: run description: Display Cypress test commands - choose execution mode (headless recommended) parameters: - name: execution-mode @@ -11,8 +11,8 @@ parameters: # Cypress Test Commands **Prerequisites**: -1. Run `/cypress-setup` first to configure your environment. -2. Ensure the "Cypress Tests" terminal window is open (created by `/cypress-setup`) +1. Run `/cypress:setup` first to configure your environment. +2. Ensure the "Cypress Tests" terminal window is open (created by `/cypress:setup`) **Note**: All commands are executed in the "Cypress Tests" terminal window using the helper scripts. 
@@ -340,7 +340,7 @@ Use these tags with `--env grepTags`: ## Related Commands -- **`/cypress-setup`** - Configure testing environment and open Cypress Tests terminal +- **`/cypress:setup`** - Configure testing environment and open Cypress Tests terminal --- diff --git a/.claude/commands/cypress/setup.md b/.claude/commands/cypress/setup.md new file mode 100644 index 000000000..14fc2339b --- /dev/null +++ b/.claude/commands/cypress/setup.md @@ -0,0 +1,87 @@ +--- +name: setup +description: Automated Cypress environment setup with direct configuration +parameters: [] +--- + +# Cypress Environment Setup + +This command sets up the Cypress testing environment by checking prerequisites, installing dependencies, and configuring environment variables. + +## Instructions for Claude + +Follow these steps in order: + +### Step 1: Check Prerequisites + +1. **Check Node.js version** - Required: >= 18 + - Run `node --version` and verify it's >= 18 + - If not, inform the user they need to install Node.js 18 or higher + +### Step 2: Navigate and Install Dependencies + +1. **Navigate to the cypress directory**: + ```bash + cd web/cypress + ``` + +2. **Install npm dependencies**: + ```bash + npm install + ``` + +### Step 3: Create or Update Environment Configuration + +**Important**: Do NOT run the interactive `configure-env.sh` script. Instead, create the `web/cypress/export-env.sh` file directly. + +1. Check if `web/cypress/export-env.sh` already exists +2. If it exists, read it and show the current values to the user. Ask if they want to keep or update them. +3. 
Look for cluster credentials in the following order: + - **Conversation context**: Check if the user already provided a console URL and kubeadmin password earlier in the conversation + - **Existing export-env.sh**: Reuse values from an existing file if the user confirms they're current + - **Ask the user directly**: If no credentials are found, ask the user to provide: + - Console URL (e.g., `https://console-openshift-console.apps.ci-ln-xxxxx.aws-4.ci.openshift.org`) + - Kubeadmin password + - **Do NOT proceed without credentials** — the file cannot be created with placeholder values +4. Write the file directly, substituting the real credentials for the `<console-url>` and `<password>` placeholders: + +```bash +cat > web/cypress/export-env.sh << 'ENVEOF' +# shellcheck shell=bash +export CYPRESS_BASE_URL='<console-url>' +export CYPRESS_LOGIN_IDP='kube:admin' +export CYPRESS_LOGIN_USERS='kubeadmin:<password>' +export CYPRESS_KUBECONFIG_PATH='/tmp/kubeconfig' +export CYPRESS_SKIP_ALL_INSTALL='false' +export CYPRESS_SKIP_COO_INSTALL='false' +export CYPRESS_COO_UI_INSTALL='true' +export CYPRESS_SKIP_KBV_INSTALL='true' +export CYPRESS_KBV_UI_INSTALL='false' +export CYPRESS_TIMEZONE='UTC' +export CYPRESS_MOCK_NEW_METRICS='false' +export CYPRESS_SESSION='true' +export CYPRESS_DEBUG='false' +ENVEOF +``` + +Notes on the values: +- `CYPRESS_BASE_URL`: The OpenShift console URL from the cluster provisioning output +- `CYPRESS_LOGIN_USERS`: Format is `kubeadmin:<password>`, using the password from cluster provisioning +- `CYPRESS_KUBECONFIG_PATH`: Use `/tmp/kubeconfig` when running in a Docker sandbox, or the actual path to kubeconfig on the host +- `CYPRESS_SKIP_ALL_INSTALL='false'`: Ensures operators get installed on first run +- `CYPRESS_COO_UI_INSTALL='true'`: Installs the Cluster Observability Operator UI plugin + +### Step 4: Verify the setup + +Source the file and confirm the variables are set: + +```bash +source web/cypress/export-env.sh +echo "Base URL: $CYPRESS_BASE_URL" +``` + +### Step 5: Inform the user + +Let the user know the environment is configured and they can run tests using
`/cypress:run`. + +--- diff --git a/.claude/commands/cypress/test-development/fixture-schema-reference.md b/.claude/commands/cypress/test-development/fixture-schema-reference.md new file mode 120000 index 000000000..b2ebd9a95 --- /dev/null +++ b/.claude/commands/cypress/test-development/fixture-schema-reference.md @@ -0,0 +1 @@ +../../../.cursor/commands/fixture-schema-reference.md \ No newline at end of file diff --git a/.claude/commands/cypress/test-development/generate-incident-fixture.md b/.claude/commands/cypress/test-development/generate-incident-fixture.md new file mode 120000 index 000000000..aeb91682d --- /dev/null +++ b/.claude/commands/cypress/test-development/generate-incident-fixture.md @@ -0,0 +1 @@ +../../../.cursor/commands/generate-incident-fixture.md \ No newline at end of file diff --git a/.claude/commands/cypress/test-development/generate-regression-test.md b/.claude/commands/cypress/test-development/generate-regression-test.md new file mode 120000 index 000000000..2480eb71f --- /dev/null +++ b/.claude/commands/cypress/test-development/generate-regression-test.md @@ -0,0 +1 @@ +../../../.cursor/commands/generate-regression-test.md \ No newline at end of file diff --git a/.claude/commands/cypress/test-development/refactor-regression-test.md b/.claude/commands/cypress/test-development/refactor-regression-test.md new file mode 120000 index 000000000..a2575e575 --- /dev/null +++ b/.claude/commands/cypress/test-development/refactor-regression-test.md @@ -0,0 +1 @@ +../../../.cursor/commands/refactor-regression-test.md \ No newline at end of file diff --git a/.claude/commands/cypress/test-development/validate-incident-fixtures.md b/.claude/commands/cypress/test-development/validate-incident-fixtures.md new file mode 120000 index 000000000..19889a9fd --- /dev/null +++ b/.claude/commands/cypress/test-development/validate-incident-fixtures.md @@ -0,0 +1 @@ +../../../.cursor/commands/validate-incident-fixtures.md \ No newline at end of file diff --git 
a/.claude/commands/cypress/test-iteration/analyze-ci-results.md b/.claude/commands/cypress/test-iteration/analyze-ci-results.md new file mode 100644 index 000000000..88243a331 --- /dev/null +++ b/.claude/commands/cypress/test-iteration/analyze-ci-results.md @@ -0,0 +1,280 @@ +--- +name: analyze-ci-results +description: Analyze OpenShift CI (Prow) test results from a gcsweb URL - identifies infra vs test/code failures and correlates with git commits +parameters: + - name: ci-url + description: > + The gcsweb URL for a CI run. Can be any level of the artifact tree: + - Job root: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID}/ + - Test artifacts: .../{RUN_ID}/artifacts/e2e-incidents/monitoring-plugin-tests-incidents-ui/ + - Prow UI: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID} + required: true + - name: focus + description: "Optional: focus analysis on specific test file or area (e.g., 'regression', '01.incidents', 'filtering')" + required: false +--- + +# Analyze OpenShift CI Test Results + +Fetch, parse, and classify failures from an OpenShift CI (Prow) test run. This skill is designed to be the **first step** in an agentic test iteration workflow — it produces a structured diagnosis that the orchestrator can act on. + +## Instructions + +### Step 1: Normalize the URL + +The user may provide a URL at any level. 
Normalize it to the **job root**: + +``` +https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_monitoring-plugin/{PR}/{JOB}/{RUN_ID}/ +``` + +If the user provides a Prow UI URL (`prow.ci.openshift.org/view/gs/...`), convert it: +- Replace `https://prow.ci.openshift.org/view/gs/` with `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/` +- Append trailing `/` if missing + +Derive these base paths: +- **Job root**: `{normalized_url}` +- **Test artifacts root**: `{normalized_url}artifacts/e2e-incidents/monitoring-plugin-tests-incidents-ui/` +- **Screenshots root**: `{test_artifacts_root}artifacts/screenshots/` +- **Videos root**: `{test_artifacts_root}artifacts/videos/` + +### Step 2: Fetch Job Metadata (parallel) + +Fetch these files from the **job root** using WebFetch: + +| File | What to extract | +|------|----------------| +| `started.json` | `timestamp`, `pull` (PR number), `repos` (commit SHAs) | +| `finished.json` | `passed` (bool), `result` ("SUCCESS"/"FAILURE"), `revision` (PR HEAD SHA) | +| `prowjob.json` | PR title, PR author, PR branch, base branch, base SHA, PR SHA, job name, cluster, duration | + +From `started.json` `repos` field, extract: +- **Base commit**: the SHA after `main:` (before the comma) +- **PR commit**: the SHA after `{PR_NUMBER}:` + +Present a summary: +``` +CI Run Summary: + PR: #{PR_NUMBER} - {PR_TITLE} + Author: {AUTHOR} + Branch: {PR_BRANCH} -> {BASE_BRANCH} + PR commit: {PR_SHA} (short: first 7 chars) + Base commit: {BASE_SHA} (short: first 7 chars) + Result: PASSED / FAILED + Duration: {DURATION} + Job: {JOB_NAME} +``` + +### Step 3: Fetch and Parse Test Results + +Fetch `{test_artifacts_root}build-log.txt` using WebFetch. + +#### Cypress Output Format + +The build log contains Cypress console output. 
Parse these sections: + +**Per-spec results block** — appears after each spec file runs: +``` + (Results) + + ┌──────────────────────────────────────────────────────────┐ + │ Tests: N │ + │ Passing: N │ + │ Failing: N │ + │ Pending: N │ + │ Skipped: N │ + │ Screenshots: N │ + │ Video: true │ + │ Duration: X minutes, Y seconds │ + │ Spec Ran: {spec-file-name}.cy.ts │ + └──────────────────────────────────────────────────────────┘ +``` + +**Final summary table** — appears at the very end: +``` + (Run Finished) + + ┌──────────────────────────────────────────────────────────┐ + │ Spec Tests Passing Failing Pending │ + ├──────────────────────────────────────────────────────────┤ + │ ✓ spec-file.cy.ts 5 5 0 0 │ + │ ✗ other-spec.cy.ts 3 1 2 0 │ + └──────────────────────────────────────────────────────────┘ +``` + +**Failure details** — appear inline during test execution: +``` + 1) Suite Name + "before all" hook for "test description": + ErrorType: error message + > detailed error + at stack trace... + + N failing +``` + +Or for test-level (not hook) failures: +``` + 1) Suite Name + test description: + AssertionError: Timed out retrying after Nms: Expected to find element: .selector +``` + +Extract per-spec: +- Spec file name +- Pass/fail/skip counts +- For failures: test name, error type, error message, whether it was in a hook + +### Step 4: Fetch Failure Screenshots + +For each failing spec, navigate to `{screenshots_root}{spec-file-name}/` and list available screenshots. + +**Screenshot naming convention:** +``` +{Suite Name} -- {Test Title} -- before all hook (failed).png +{Suite Name} -- {Test Title} (failed).png +``` + +Fetch each screenshot URL and **read it using the Read tool** (multimodal) to understand the visual state at failure time. Describe what you see: +- What page/view is shown? +- Are there error dialogs, loading spinners, empty states? +- Is the expected UI element visible? If not, what's in its place? 
+- Are there console errors visible in the browser? + +### Step 5: Classify Each Failure + +For every failing test, classify it into one of these categories: + +#### Infrastructure Failures (not actionable by test code changes) + +| Classification | Indicators | +|---------------|------------| +| `INFRA_CLUSTER` | Certificate expired, API server unreachable, node not ready, cluster version mismatch | +| `INFRA_OPERATOR` | COO/CMO installation timeout, operator pod not running, CRD not found | +| `INFRA_PLUGIN` | Plugin deployment unavailable, dynamic plugin chunk loading error, console not accessible | +| `INFRA_AUTH` | Login failed, kubeconfig invalid, RBAC permission denied (for expected operations) | +| `INFRA_CI` | Pod eviction, OOM killed, timeout at infrastructure level (not test timeout) | + +**Key signals for infra issues:** +- Errors in `before all` hooks related to cluster setup +- Certificate/TLS errors +- `oc` command failures with connection errors +- Element `.co-clusterserviceversion-install__heading` not found (operator install UI) +- Errors mentioning pod names, namespaces, or k8s resources +- `e is not a function` or similar JS errors from the console application itself (not test code) + +#### Test/Code Failures (actionable) + +| Classification | Indicators | +|---------------|------------| +| `TEST_BUG` | Wrong selector, incorrect assertion logic, race condition / timing issue, test assumes wrong state | +| `FIXTURE_ISSUE` | Mock data doesn't match expected structure, missing alerts/incidents in fixture, edge case not covered | +| `PAGE_OBJECT_GAP` | Page object method missing, selector outdated, doesn't match current DOM | +| `MOCK_ISSUE` | cy.intercept not matching the actual API call, response shape incorrect, query parameter mismatch | +| `CODE_REGRESSION` | Test was passing before, UI behavior genuinely changed — the source code has a bug | + +**Key signals for test/code issues:** +- `AssertionError: Timed out retrying` on 
application-specific selectors (not infra selectors) +- `Expected X to equal Y` where the assertion logic is wrong +- Failures only in specific test scenarios, not across the board +- Screenshot shows the UI rendered correctly but test expected something different + +### Step 6: Correlate with Git Commits + +Using the PR commit SHA and base commit SHA from Step 2: + +1. **Check local git history**: Run `git log {base_sha}..{pr_sha} --oneline` to see what changed in the PR +2. **Identify relevant changes**: Run `git diff {base_sha}..{pr_sha} --stat` to see which files were modified +3. **For CODE_REGRESSION failures**: Check if the failing component's source code was modified in the PR +4. **For TEST_BUG failures**: Check if the test itself was modified in the PR (new test might have a bug) + +Present the correlation: +``` +Commit correlation for {test_name}: + PR modified: src/components/incidents/IncidentChart.tsx (+45, -12) + Test file: cypress/e2e/incidents/01.incidents.cy.ts (unchanged) + Verdict: CODE_REGRESSION - chart rendering changed but test expectations not updated +``` + +Or: +``` +Commit correlation for {test_name}: + PR modified: cypress/e2e/incidents/regression/01.reg_filtering.cy.ts (+30, -5) + Source code: src/components/incidents/ (unchanged) + Verdict: TEST_BUG - new test code has incorrect assertion +``` + +### Step 7: Produce Structured Report + +Output a structured report with this format: + +``` +# CI Analysis Report + +## Run: PR #{PR} - {TITLE} +- Commit: {SHORT_SHA} by {AUTHOR} +- Branch: {BRANCH} +- Result: {RESULT} +- Duration: {DURATION} + +## Summary +- Total specs: N +- Passed: N +- Failed: N (M infra, K test/code) + +## Infrastructure Issues (not actionable via test changes) + +### INFRA_CLUSTER: Certificate expired +- Affected: ALL tests (cascade failure) +- Detail: x509 certificate expired at {timestamp} +- Action needed: Cluster certificate renewal (outside test scope) + +## Test/Code Issues (actionable) + +### TEST_BUG: Selector 
timeout in filtering test +- Spec: regression/01.reg_filtering.cy.ts +- Test: "should filter incidents by severity" +- Error: Timed out retrying after 80000ms: Expected to find element: [data-test="severity-filter"] +- Screenshot: [description of what screenshot shows] +- Commit correlation: Test file was modified in this PR (+30 lines) +- Suggested fix: Update selector to match current DOM structure + +### CODE_REGRESSION: Chart not rendering after component refactor +- Spec: regression/02.reg_ui_charts_comprehensive.cy.ts +- Test: "should display incident bars in chart" +- Error: Expected 5 bars, found 0 +- Screenshot: Chart area is empty, no error messages visible +- Commit correlation: src/components/incidents/IncidentChart.tsx was refactored +- Suggested fix: Investigate chart rendering logic in the refactored component + +## Flakiness Indicators +- If a test failed with a timing-related error but similar tests in the same suite passed, + flag it as potentially flaky +- If the error message contains "Timed out retrying" on an element that should exist, + it may be a race condition rather than a missing element + +## Recommendations +- List prioritized next steps +- For infra issues: what needs to happen before tests can run +- For test/code issues: which fixes to attempt first (quick wins vs complex) +- Whether local reproduction is recommended +``` + +### Step 8: If `focus` parameter is provided + +Filter the analysis to only the relevant tests. For example: +- `focus=regression` -> only analyze `regression/*.cy.ts` specs +- `focus=filtering` -> only analyze tests with "filter" in their name +- `focus=01.incidents` -> only analyze `01.incidents.cy.ts` + +Still fetch all metadata and provide the full context, but limit detailed diagnosis to the focused area. + +## Notes for the Orchestrator + +When this skill is used as the first step of `/cypress:test-iteration:iterate-incident-tests`: + +1. **If all failures are INFRA_***: Report to user and STOP. 
No test changes will help. +2. **If mixed INFRA_* and TEST/CODE**: Report infra issues to user, proceed with test/code fixes only. +3. **If all failures are TEST/CODE**: Proceed with the full iteration loop. +4. **The commit correlation** tells the orchestrator whether to focus on fixing tests or investigating source code changes. +5. **Screenshots** give the Diagnosis Agent a head start — it can reference the CI screenshot analysis instead of reproducing the failure locally first. diff --git a/.claude/commands/cypress/test-iteration/diagnose-test-failure.md b/.claude/commands/cypress/test-iteration/diagnose-test-failure.md new file mode 100644 index 000000000..5a48a59d3 --- /dev/null +++ b/.claude/commands/cypress/test-iteration/diagnose-test-failure.md @@ -0,0 +1,175 @@ +--- +name: diagnose-test-failure +description: Diagnose a Cypress test failure using error output, screenshots, and codebase analysis +parameters: + - name: test-name + description: "Full title of the failing test (from mochawesome 'fullTitle' or Cypress output)" + required: true + - name: spec-file + description: "Path to the spec file (e.g., cypress/e2e/incidents/regression/01.reg_filtering.cy.ts)" + required: true + - name: error-message + description: "The error message from the test failure" + required: true + - name: screenshot-path + description: "Absolute path to the failure screenshot (will be read with multimodal vision)" + required: false + - name: stack-trace + description: "The error stack trace (estack from mochawesome)" + required: false + - name: ci-context + description: "Optional context from /cypress:test-iteration:analyze-ci-results (commit correlation, infra status)" + required: false +--- + +# Diagnose Test Failure + +Analyze a Cypress test failure to determine root cause and recommend a fix. This skill is used by the `/cypress:test-iteration:iterate-incident-tests` orchestrator but can also be invoked standalone. + +## Diagnosis Protocol + +**IMPORTANT**: Follow this order. 
Visual evidence first, then code analysis. + +### Step 1: Read the Screenshot (if available) + +If `screenshot-path` is provided, read it using the Read tool (multimodal). + +Describe what you see: +- What page/view is displayed? +- Is the expected UI element visible? If not, what's in its place? +- Are there error dialogs, loading spinners, empty states, or overlays? +- Is the page fully loaded or still loading? +- Are there any browser console errors visible? +- Does the layout look correct (no overlapping elements, correct positioning)? + +This visual context often reveals the root cause faster than reading code. + +### Step 2: Read the Test Code + +Read the spec file at `spec-file`. Find the failing test by matching `test-name`. + +Identify: +- What the test is trying to do (user actions + assertions) +- Which page object methods it calls +- Which fixture it loads (look at `before`/`beforeEach` hooks) +- The specific assertion or command that failed +- Whether the failure is in a `before all` hook (affects all tests in suite) or a specific `it()` block + +### Step 3: Read the Page Object + +Read `web/cypress/views/incidents-page.ts`. + +For each page object method used by the failing test: +- Check the selector — does it match current DOM conventions? +- Check for hardcoded waits vs proper Cypress chaining +- Look for methods that might be missing or outdated + +### Step 4: Read the Fixture (if applicable) + +If the test uses `cy.mockIncidentFixture('...')`, read the fixture YAML file. + +Check: +- Does the fixture have the incidents/alerts the test expects? +- Are severities, states, components, timelines correct? +- Are there edge cases (empty arrays, missing fields, zero-duration timelines)? 
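To make these checks concrete, a fixture for this step might look roughly like the following. This is a hypothetical sketch: every field name and value is an illustrative assumption, not the real schema. Consult the `fixture-schema-reference` command in `test-development/` for the authoritative shape.

```yaml
# Hypothetical incident fixture (field names are illustrative, not the real schema)
incidents:
  - group_id: "demo-group-1"
    component: monitoring
    severity: critical          # must match what the test filters on
    state: firing
    alerts:
      - alertname: TargetDown
        severity: critical
    timeline:                   # must cover the test's mocked time window
      start: "2024-01-01T00:00:00Z"
      end: "2024-01-01T02:00:00Z"
```

When the test asserts on counts or filter results, compare the expected numbers directly against entries like these before suspecting the test code.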
+ +### Step 5: Read the Mock Layer (if relevant) + +If the error suggests an API/intercept issue, read relevant files in `cypress/support/incidents_prometheus_query_mocks/`: +- `prometheus-mocks.ts` — intercept setup and route matching +- `mock-generators.ts` — response data generation +- `types.ts` — type definitions for fixtures + +Check: +- Does the intercept URL pattern match the actual API call? +- Is the response shape what the UI code expects? +- Are query parameters (group_id, alertname, severity) handled correctly? + +### Step 6: Cross-reference with Error + +Now combine visual evidence + code analysis + error message to determine root cause. + +**Common patterns:** + +| Error Pattern | Likely Cause | +|--------------|--------------| +| `Timed out retrying after Nms: Expected to find element: .selector` | Selector wrong, element not rendered, or page not loaded | +| `Expected N to equal M` (counts) | Fixture doesn't have enough data, or filter state is wrong | +| `expected true to be false` / vice versa | Assertion logic inverted | +| `Cannot read properties of undefined` | Page object method returns wrong element, or DOM structure changed | +| `cy.intercept() matched no requests` | Mock intercept URL doesn't match actual API call | +| `Timed out retrying` on `.should('be.visible')` | Element exists but hidden (z-index, opacity, overflow, display:none) | +| `before all hook` failure | Setup issue — fixture load, navigation, or login failed | +| `detached from the DOM` | Element re-rendered between find and action — needs `.should('exist')` guard | +| `e is not a function` / runtime JS error | Application code bug, not test issue | +| `x509: certificate` / `Unable to connect` | Infrastructure issue | + +### Step 7: Classify and Recommend + +Output your diagnosis in this exact format: + +``` +## Diagnosis + +**Classification**: TEST_BUG | FIXTURE_ISSUE | PAGE_OBJECT_GAP | MOCK_ISSUE | REAL_REGRESSION | INFRA_ISSUE + +**Confidence**: HIGH | MEDIUM | LOW + 
+**Root Cause**: +[1-3 sentence explanation of what's wrong and why] + +**Evidence**: +- Screenshot: [what the screenshot showed] +- Error: [what the error message tells us] +- Code: [what the code analysis revealed] + +**Recommended Fix**: +- File: [path to file that needs editing] +- Change: [specific description of what to change] +- [If multiple files need changing, list each] + +**History Check**: +- Run `git log origin/main -- <file>` for each file in the recommended fix +- Look for prior commits that removed or replaced the pattern being proposed +- If found, note: "WARNING: commit <sha> removed this pattern because <reason>" +- Example: cy.reload() was removed from prepareIncidentsPageForSearch in e8d0007 + because it breaks dynamic plugin chunk loading. Do NOT re-introduce it. + +**Risk Assessment**: +- Will this fix affect other tests? [yes/no and why] +- Could this mask a real bug? [yes/no and why] +- Does this fix re-introduce a previously reverted pattern? [yes/no] + +**Alternative Hypotheses**: +- [If confidence is MEDIUM or LOW, list other possible causes] +``` + +## Classification Reference + +### Auto-fixable (proceed with Fix Agent) + +| Classification | Description | Examples | +|---------------|-------------|----------| +| `TEST_BUG` | Test code is wrong | Wrong selector, incorrect assertion value, missing wait, wrong test order dependency | +| `FIXTURE_ISSUE` | Test data is wrong | Missing incident in fixture, wrong severity, timeline doesn't cover test's time window | +| `PAGE_OBJECT_GAP` | Page object needs update | Selector targets old class name, method missing for new UI element, method returns wrong element | +| `MOCK_ISSUE` | API mock is wrong | Intercept URL pattern outdated, response missing required field, query filter not handled | + +### Not auto-fixable (report to user) + +| Classification | Description | Examples | +|---------------|-------------|----------| +| `REAL_REGRESSION` | UI code has a bug | Component doesn't render, wrong data displayed,
broken interaction | +| `INFRA_ISSUE` | Environment problem | Cluster down, cert expired, operator not installed, console unreachable | + +### Distinguishing TEST_BUG from REAL_REGRESSION + +This is the hardest classification. Use these heuristics: + +1. **Was the test ever passing?** If it's a new test, lean toward `TEST_BUG`. If it was passing before, check what changed. +2. **Does the screenshot show the UI working correctly but the test expecting something different?** → `TEST_BUG` +3. **Does the screenshot show the UI broken (empty state, error, wrong data)?** → Likely `REAL_REGRESSION` +4. **Do other tests in the same suite pass?** If yes, the infra/app is fine → `TEST_BUG` or `FIXTURE_ISSUE` +5. **If CI context is available**: Check if the source code was modified in the PR. Modified source + broken test = likely `REAL_REGRESSION` + +When in doubt, classify as `REAL_REGRESSION` — it's safer to report a false positive to the user than to silently "fix" a test that was correctly catching a bug. 
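The heuristics above can be collapsed into a small decision sketch. This is illustrative only (the function and its boolean inputs are hypothetical); a real diagnosis still requires the evidence gathered in Steps 1-6.

```python
def classify(new_test: bool, ui_looks_correct: bool, siblings_pass: bool,
             source_modified_in_pr: bool) -> str:
    """Rough TEST_BUG vs REAL_REGRESSION decision, following the heuristics above."""
    if not ui_looks_correct and source_modified_in_pr:
        # Screenshot shows a broken UI and the PR touched the source code.
        return "REAL_REGRESSION"
    if ui_looks_correct and (new_test or siblings_pass):
        # UI is fine; a brand-new or isolated failure points at the test itself.
        return "TEST_BUG"
    # When in doubt, prefer REAL_REGRESSION: reporting a false positive is safer
    # than silently "fixing" a test that was correctly catching a bug.
    return "REAL_REGRESSION"

print(classify(new_test=True, ui_looks_correct=True,
               siblings_pass=True, source_modified_in_pr=False))  # TEST_BUG
```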
diff --git a/.claude/commands/cypress/test-iteration/iterate-ci-flaky.md b/.claude/commands/cypress/test-iteration/iterate-ci-flaky.md new file mode 100644 index 000000000..618483148 --- /dev/null +++ b/.claude/commands/cypress/test-iteration/iterate-ci-flaky.md @@ -0,0 +1,440 @@ +--- +name: iterate-ci-flaky +description: Iterate on flaky Cypress tests against OpenShift CI presubmit jobs — push fixes, trigger CI, analyze results, repeat +parameters: + - name: pr + description: "PR number to iterate on (e.g., 857)" + required: true + - name: max-iterations + description: "Maximum fix-push-wait cycles (default: 3)" + required: false + - name: confirm-runs + description: "Number of green CI runs required to declare stable (default: 2)" + required: false + - name: job + description: "Prow job name to target (default: pull-ci-openshift-monitoring-plugin-main-e2e-incidents)" + required: false + - name: focus + description: "Optional: focus analysis on specific test area (e.g., 'regression', 'filtering')" + required: false + - name: review-window + description: "Seconds to wait for user feedback after posting fix to Slack before pushing (default: 0 = no wait). Requires Option B Slack setup." + required: false +--- + +# Iterate CI Flaky Tests + +Fix flaky Cypress tests by iterating against real OpenShift CI presubmit jobs. Pushes fixes, triggers CI, waits for results, analyzes failures, and repeats until stable. + +## Prerequisites + +### 1. GitHub CLI Authentication + +```bash +gh auth status +``` + +Must be logged in with comment access to `openshift/monitoring-plugin` (for `/test` comments to trigger Prow CI). + +**Recommended auth method**: `gh auth login --web` (OAuth via browser). This uses your GitHub user's existing org permissions — no PAT scope management needed. Revocable anytime at GitHub → Settings → Applications. + +**Why not a PAT?** +- Fine-grained PATs can only scope repos you own — you can't add `openshift/monitoring-plugin` as a contributor. 
+- Classic PATs with `public_repo` scope work but grant broader access than needed. +- OAuth via `--web` uses the GitHub CLI OAuth app which requests only the permissions it needs and inherits your org membership. + +**Push access**: Git push to your fork uses SSH (`origin` remote) — this is independent of the `gh` token. + +**Fallback**: If the token lacks upstream comment permissions, the agent will report the blocker and ask you to post the `/test` comment manually on the PR page. + +### 2. Permissions + +Required in `.claude/settings.local.json`: + +```json +{ + "permissions": { + "allow": [ + "Bash(gh auth:*)", + "Bash(gh api:*)", + "Bash(gh pr:*)", + "Bash(git push:*)", + "Bash(git add:*)", + "Bash(git commit:*)", + "Bash(git status:*)", + "Bash(git diff:*)", + "Bash(git log:*)", + "Bash(git rev-parse:*)", + "Bash(git -C:*)", + "Bash(git checkout:*)", + "Bash(git fetch:*)", + "Bash(python3:*)", + "Bash(find screenshots:*)", + "Bash(find cypress/screenshots:*)", + "Bash(find cypress/videos:*)", + "WebFetch(domain:gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com)" + ] + } +} +``` + +### 3. Notifications & Review (optional) + +Notifications and review are optional — if not configured, the script prints to stdout and the loop continues normally. + +**Slack Notifications (one-way):** +```bash +export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." +``` +Setup: Slack → Apps → Incoming Webhooks → create webhook for your channel. 5 minutes. +Provides one-way status notifications at key events (ci_started, ci_failed, fix_applied, etc.). + +**GitHub PR Comment Review (two-way):** + +The `review-window` parameter enables a two-way review flow using GitHub PR comments. When a fix is ready: + +1. Agent posts fix details as a PR comment (via `review-github.py post`) +2. Agent also sends a Slack webhook notification (if configured) +3. Agent waits `review-window` seconds for a reply from the **PR author only** +4. 
If the author replies on the PR — agent reads the feedback and adjusts the fix +5. If no reply within the window — agent proceeds autonomously + +**Security**: Author filtering is **code-enforced** in `review-github.py` — only comments where `.user.login` matches the PR author are considered. This is deterministic, not instruction-based. + +**How to reply**: Post a regular comment on the PR. The agent only reads comments from the PR author posted after the agent's notification. Optionally prefix with `/agent` for clarity. + +No additional setup needed beyond `gh auth` (Step 1) — the same token used for `/test` comments is used for posting and reading review comments. + +Both Slack webhook URL and review-window can be set in `cypress/export-env.sh` or `~/.zshrc`. + +### 4. Unsigned Commits + +Same as `/cypress:test-iteration:iterate-incident-tests` — all commits use `--no-gpg-sign`. They live on a PR branch and are squash-merged by the user. + +## Instructions + +**IMPORTANT — Autonomous Execution Rules:** +- **Never chain commands** with `&&` or `|` — use separate Bash calls for each operation. Compound commands and pipes trigger security prompts that block autonomous execution. +- **Never combine `cd` with other commands** — `cd && git` triggers an unskippable security prompt. +- When you need to process command output (e.g., parse JSON), capture it with a Bash call first, then process it in a second call or read the output directly. 
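The capture-then-process pattern can be sketched in Python. This is a hedged illustration: the sample payload stands in for output captured by a prior `gh pr view` call, and the `statusCheckRollup` entry shape used here (`name`/`status`/`conclusion`) is a simplification of what `gh` actually returns.

```python
import json

def summarize_pr(raw: str, job_substring: str) -> dict:
    """Extract branch, short HEAD SHA, and the target job's state from captured JSON."""
    pr = json.loads(raw)
    # Find the first check whose name contains the target job substring.
    target = next(
        (c for c in pr.get("statusCheckRollup", [])
         if job_substring in c.get("name", "")),
        None,
    )
    if target is None:
        state = "none"
    else:
        # Completed runs carry a conclusion; in-flight runs only a status.
        state = target.get("conclusion") or target.get("status") or "unknown"
    return {
        "branch": pr["headRefName"],
        "head_sha": pr["headRefOid"][:7],
        "job_state": state,
    }

# Illustrative payload; in the real flow this is captured first by a separate
# Bash call: gh pr view 857 --json headRefName,headRefOid,statusCheckRollup
raw = json.dumps({
    "headRefName": "fix-flaky-filters",
    "headRefOid": "a1b2c3d4e5f6a7b8",
    "statusCheckRollup": [
        {"name": "pull-ci-openshift-monitoring-plugin-main-e2e-incidents",
         "status": "COMPLETED", "conclusion": "FAILURE"},
    ],
})
print(summarize_pr(raw, "e2e-incidents"))
```

The point is the split: one Bash call captures the JSON, a second step parses it, with no pipe between them.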
+ +### Step 1: Gather PR Context + +Fetch PR metadata: +```bash +gh pr view {pr} --json headRefName,headRefOid,baseRefName,number,title,url,author,statusCheckRollup +``` + +Extract: +- **Branch**: `headRefName` +- **HEAD SHA**: `headRefOid` +- **Check runs**: from `statusCheckRollup`, find the job matching `{job}` (default: `pull-ci-openshift-monitoring-plugin-main-e2e-incidents`) + +Check out the PR branch locally: +```bash +git fetch origin {headRefName} +``` +```bash +git checkout {headRefName} +``` + +Present summary: +``` +PR #{pr}: {title} +Branch: {headRefName} +HEAD: {short_sha} +CI job: {job} +Latest run status: {SUCCESS|FAILURE|PENDING|none} +``` + +### Step 2: Read Stability Ledger + +Read `web/cypress/reports/test-stability.md` and parse the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END`. + +Extract a compressed summary for use during diagnosis: + +``` +Stability context (from {N} previous runs): + Known flaky: + - "test full title" — passed {X}/{Y} runs, last failure: {date}, reason: {reason} + - "test full title" — passed {X}/{Y} runs, last failure: {date}, reason: {reason} + Recently fixed: + - "test full title" — fixed by {commit}, stable since {date} + Persistent failures: + - "test full title" — failed {X}/{Y} runs, never fixed +``` + +If the ledger has no data yet (first run), note "No prior stability data" and continue. + +Store this as `stability_context` — it will be passed to Diagnosis Agents in Step 7 to help them: +- Prioritize known-flaky tests (these are higher value to fix) +- Avoid re-attempting fixes that already failed in previous runs +- Distinguish new failures from recurring ones + +### Step 3: Determine Current CI State + +From the status check rollup, determine the state of the target job: + +- **SUCCESS**: Skip to Step 8 (flakiness confirmation — was it truly stable?) 
+- **FAILURE**: Proceed to Step 6 (analyze the failure) +- **PENDING / IN_PROGRESS**: Skip to Step 5 (wait for it) +- **No run found**: Trigger one in Step 4 + +### Step 4: Trigger CI Run (if needed) + +If there's no recent run, or a fix was just pushed: + +```bash +gh api repos/openshift/monitoring-plugin/issues/{pr}/comments -f body="/test e2e-incidents" +``` + +**IMPORTANT**: The `/test` command uses the **short alias** (`e2e-incidents`), not the full Prow job name. Using the full name will fail with "specified target(s) for /test were not found." + +Note: If you just pushed a commit in Step 7, the push automatically triggers Prow — you can skip the `/test` comment. Only use `/test` for: +- Retriggering without code changes (flakiness retry) +- The initial run if none exists + +After triggering, notify and proceed to Step 5: +```bash +python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send ci_started "CI triggered for PR #{pr}. Polling for results (~2h)." --pr {pr} --branch {headRefName} +``` + +### Step 5: Wait for CI Completion + +Use the polling script at `.claude/commands/cypress/test-iteration/scripts/poll-ci-status.py`: + +```bash +python3 .claude/commands/cypress/test-iteration/scripts/poll-ci-status.py {pr} +``` + +Arguments: ` [job_substring] [max_attempts] [interval_seconds]` +- Default job substring: `e2e-incidents` +- Default max attempts: 30 (150 minutes at 5-minute intervals) +- Default interval: 300 seconds + +Run this with `run_in_background: true` and a timeout of 9000000ms (150 minutes). 
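Extracting the result from the poller's output can be sketched as below. The `CI_COMPLETE state=... url=...` layout is an assumption for illustration; the authoritative format is whatever `poll-ci-status.py` actually prints.

```python
import re

def parse_ci_complete(output: str):
    """Find the CI_COMPLETE line in the poller's stdout and pull out state and URL.
    Returns None if no such line exists (poller timed out or was interrupted)."""
    for line in output.splitlines():
        if line.startswith("CI_COMPLETE"):
            m = re.search(r"state=(\S+)\s+url=(\S+)", line)
            if m:
                return {"state": m.group(1), "url": m.group(2)}
    return None

# Illustrative poller output (format is an assumption, see lead-in):
out = (
    "poll attempt 12/30: PENDING\n"
    "CI_COMPLETE state=FAILURE "
    "url=https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/123\n"
)
print(parse_ci_complete(out))
```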
+ +When the background task completes, parse the output line starting with `CI_COMPLETE`: +- Extract `state` (SUCCESS or FAILURE) +- Extract `url` (Prow URL for the run) + +### Step 6: Analyze CI Results + +Convert the Prow URL to a gcsweb URL: +- Replace `https://prow.ci.openshift.org/view/gs/` with `https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/` + +Run `/cypress:test-iteration:analyze-ci-results` (or follow its instructions inline): + +1. Fetch `started.json`, `finished.json`, `prowjob.json` for metadata +2. Fetch `build-log.txt` from the test artifacts path +3. List and fetch failure screenshots +4. Classify each failure + +**Classification outcomes:** + +| Classification | Action | +|---------------|--------| +| `INFRA_*` | Report to user. Optionally retrigger with `/retest` (Step 4). Do NOT attempt code fixes. | +| `TEST_BUG` | Diagnose and fix locally (Step 7) | +| `FIXTURE_ISSUE` | Diagnose and fix locally (Step 7) | +| `PAGE_OBJECT_GAP` | Diagnose and fix locally (Step 7) | +| `MOCK_ISSUE` | Diagnose and fix locally (Step 7) | +| `CODE_REGRESSION` | Report to user and **STOP** | + +Notify after analysis: + +If failures: +```bash +python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send ci_failed "{N} failures found: {test_names}. Diagnosing..." --pr {pr} --branch {headRefName} --url {ci_url} +``` + +If all green: +```bash +python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send ci_complete "All tests passed. Starting flakiness confirmation." --pr {pr} --branch {headRefName} --url {ci_url} +``` + +If `CODE_REGRESSION` or `INFRA_*` blocks the loop: +```bash +python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send blocked "{classification}: {description}. Agent stopped — needs human input." --pr {pr} --branch {headRefName} +``` + +If **all green** (SUCCESS): Proceed to Step 8 (flakiness confirmation). + +### Step 7: Fix and Push + +For each fixable failure: + +1. 
**Diagnose** using `/cypress:test-iteration:diagnose-test-failure` (read screenshots, test code, fixtures, page object). Include `stability_context` from Step 2 — prior flakiness history helps distinguish new failures from recurring ones and avoids re-attempting previously failed fixes. +2. **Fix** — edit the relevant files. Same constraints as `/cypress:test-iteration:iterate-incident-tests`: + - May edit: `cypress/e2e/incidents/**`, `cypress/fixtures/incident-scenarios/**`, `cypress/views/incidents-page.ts`, `cypress/support/incidents_prometheus_query_mocks/**` + - Must NOT edit: `src/**`, non-incident tests, cypress config +3. **Validate locally** (optional but recommended if cluster is accessible): + ```bash + source cypress/export-env.sh && npx cypress run --spec "{SPEC}" --env grep="{TEST_NAME}" + ``` +4. **Commit**: + ```bash + git add {files} + ``` + ```bash + git commit --no-gpg-sign -m "fix(tests): {summary} + + CI run: {prow_url} + Classifications: {list} + + Co-Authored-By: Claude Opus 4.6 " + ``` + +5. 
**Notify and review window** (before pushing): + + **a) Slack notification** (one-way, if configured): + ```bash + python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send fix_applied "*What changed:*\n• {file}: {change_description}\n\n*Why:* {diagnosis_summary}\n*Classification:* {classification} (confidence: {confidence})\n\n`git diff HEAD~1` on branch `{headRefName}`" --pr {pr} --branch {headRefName} + ``` + + **b) GitHub PR review comment** (two-way, if `review-window` > 0): + + Post fix details as a PR comment: + ```bash + python3 .claude/commands/cypress/test-iteration/scripts/review-github.py post {pr} "**What changed:**\n• {file}: {change_description}\n\n**Why:** {diagnosis_summary}\n**Classification:** {classification} (confidence: {confidence})\n\n\`git diff HEAD~1\` on branch \`{headRefName}\`" + ``` + + Capture `COMMENT_TIME` from the output, then wait for author reply: + ```bash + python3 .claude/commands/cypress/test-iteration/scripts/review-github.py wait {pr} {COMMENT_TIME} --timeout {review-window} + ``` + + Parse the output: + - `REPLY=`: PR author provided feedback. Read the reply text and adjust the fix accordingly. This may mean: + - Reverting the commit (`git reset --soft HEAD~1`), applying the user's suggestion, and re-committing + - Or making an additional commit on top with the adjustment + - `NO_REPLY`: No feedback within the window. Proceed with push. + + **Note**: The `wait` command only considers comments from the PR author (`.user.login` match, code-enforced). Comments from other users or bots are ignored. + +6. **Push**: + ```bash + git push origin {headRefName} + ``` + +The push automatically triggers a new Prow run. Go to **Step 5** (wait for CI). + +Track iteration count. If `current_iteration >= max-iterations`: Report remaining failures and **STOP**. + +### Step 8: Flakiness Confirmation + +A single green CI run doesn't prove stability. Trigger `confirm-runs` additional runs (default: 2) to confirm. 
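Why more than one confirmation run? A quick calculation shows how fast the odds of a flaky test slipping through drop with consecutive green runs:

```python
def p_all_green(per_run_pass_rate: float, runs: int) -> float:
    """Probability that a test with the given per-run pass rate
    passes every one of `runs` independent runs."""
    return per_run_pass_rate ** runs

# A test that flakes 20% of the time (80% per-run pass rate):
for runs in (1, 2, 3):
    print(f"{runs} run(s): {p_all_green(0.8, runs):.1%} chance of all green")
```

A single green run gives such a test an 80% chance of looking stable; with the fix run plus two confirmation runs, that falls to roughly half, which is why the default is 2.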
+ +For each confirmation run: + +1. Trigger via `/test` comment (no code changes): + ```bash + gh api repos/openshift/monitoring-plugin/issues/{pr}/comments -f body="/test e2e-incidents" + ``` + +2. Wait for completion (Step 5) + +3. Analyze results (Step 6) + +4. If failures found: + - If same test fails across runs → likely a real bug, diagnose and fix (Step 7) + - If different tests fail across runs → environment-dependent flakiness, harder to fix + - Report flakiness pattern to user + +Track results across all runs: +``` +Stability Report: + Run 1 (fix iteration): {SHA} — PASSED + Run 2 (confirm #1): {SHA} — PASSED + Run 3 (confirm #2): {SHA} — PASSED (or FAILED: test X) +``` + +### Step 9: Final Report + +``` +# CI Flaky Test Iteration Report + +## PR: #{pr} - {title} +## Branch: {headRefName} +## Iterations: {N} + +## Timeline +1. [SHA] Initial state — CI FAILURE + - {N} failures: {test names} +2. [SHA] fix(tests): {summary} — pushed, CI triggered +3. [SHA] CI result: PASSED +4. Confirmation run 1: PASSED +5. Confirmation run 2: PASSED + +## Fixes Applied +1. 
[commit] fix(tests): {summary} + - {file}: {change} + CI run: {prow_url} + +## Stability Assessment +- Tests stable: {N}/{total} (passed all runs) +- Tests flaky: {N} (intermittent failures) +- Tests broken: {N} (failed every run) + +## Flaky Test Details (if any) +- "test name": passed 2/3 runs + Failure pattern: {timing issue / element not found / etc.} + Fix attempted: {yes/no} + +## Remaining Issues +- {any unresolved items} + +## Recommendations +- {merge / needs more investigation / etc.} +``` + +After generating the report, send the final notification: +```bash +python3 .claude/commands/cypress/test-iteration/scripts/notify-slack.py send iteration_done "Iteration complete: {passed}/{total} passed, {flaky} flaky, {iterations} cycles.\n\n{short_summary}" --pr {pr} --branch {headRefName} +``` + +### Step 10: Update Stability Ledger + +After the final report, update `web/cypress/reports/test-stability.md`. + +Read the file and update both sections: + +**1. Current Status table** — for each test in this run: +- If test already in table: update pass rate, update trend +- If test is new: add a row +- Pass rate = total passes / total runs across all recorded iterations +- Trend: compare last 3 runs — improving / stable / degrading + +**2. Run History log** — append a new row: +``` +| {next_number} | {YYYY-MM-DD} | ci | {branch} | {total_tests} | {passed} | {failed} | {flaky} | {commit_sha} | +``` + +**3. Machine-readable data** — update the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END` with the new run data. + +Commit: +```bash +git add web/cypress/reports/test-stability.md +``` +```bash +git commit --no-gpg-sign -m "docs: update test stability ledger — {passed}/{total} passed, {flaky} flaky (CI)" +``` + +## Error Handling + +- **Push rejected** (branch protection, force push required): Report to user. Do NOT force push. +- **`/test` comment ignored by Prow**: User may lack `ok-to-test` permission. 
Check if the label exists on the PR: `gh pr view {pr} --json labels`. +- **CI timeout** (>150 min): Report timeout, check if the job is stuck. Suggest manual inspection. +- **Multiple CI jobs running**: Only track the latest run. Use the `detailsUrl` from the most recent check run. +- **Merge conflicts after push**: Report to user. The PR branch may need rebasing — do NOT rebase automatically. +- **Rate limiting on gh api**: GitHub allows 5000 requests/hour for authenticated users. Polling every 5 min = 12/hour, well within limits. + +## Guardrails + +- **Never force-push** — always additive commits +- **Never push to main** — only to the PR branch +- **Never edit source code** (`src/`) — only test infrastructure +- **Never close or merge the PR** — that's the user's decision +- **Max 3 `/test` comments per hour** — avoid spamming the PR +- **Always include the CI run URL** in commit messages for traceability +- **Stop on CODE_REGRESSION** — if the UI is genuinely broken, that's not a flaky test diff --git a/.claude/commands/cypress/test-iteration/iterate-incident-tests.md b/.claude/commands/cypress/test-iteration/iterate-incident-tests.md new file mode 100644 index 000000000..b168e5caa --- /dev/null +++ b/.claude/commands/cypress/test-iteration/iterate-incident-tests.md @@ -0,0 +1,510 @@ +--- +name: iterate-incident-tests +description: Autonomously run, diagnose, fix, and verify incident detection Cypress tests with flakiness probing +parameters: + - name: target + description: > + What to test. 
Options: + - "all" — all incident tests (excluding @e2e-real) + - "regression" — only regression/ directory tests + - a specific spec file path (e.g., "cypress/e2e/incidents/01.incidents.cy.ts") + - a grep pattern for a specific test (e.g., "should filter by severity") + required: true + - name: max-iterations + description: "Maximum fix-and-retry cycles (default: 3)" + required: false + - name: ci-url + description: "Optional: gcsweb or Prow URL for CI results to use as starting context (triggers /cypress:test-iteration:analyze-ci-results first)" + required: false + - name: flakiness-runs + description: "Number of flakiness probe runs (default: 3). Set to 0 to skip flakiness probing" + required: false + - name: skip-branch + description: "If 'true', work on current branch instead of creating a new one (default: false)" + required: false +--- + +# Iterate Incident Tests + +Autonomous test iteration loop: run tests, diagnose failures, apply fixes, verify, and probe for flakiness. + +## Prerequisites + +### 1. Cypress Environment + +Ensure `web/cypress/export-env.sh` exists with cluster credentials. If missing, create it directly — do NOT run the interactive `/cypress:setup` configurator. Use the cluster credentials provided in the conversation (console URL, kubeadmin password) to write the file: + +```bash +cat > web/cypress/export-env.sh << 'EOF' +# shellcheck shell=bash +export CYPRESS_BASE_URL='' +export CYPRESS_LOGIN_IDP='kube:admin' +export CYPRESS_LOGIN_USERS='kubeadmin:' +export CYPRESS_KUBECONFIG_PATH='/tmp/kubeconfig' +export CYPRESS_SKIP_ALL_INSTALL='false' +export CYPRESS_SKIP_COO_INSTALL='false' +export CYPRESS_COO_UI_INSTALL='true' +export CYPRESS_SKIP_KBV_INSTALL='true' +export CYPRESS_KBV_UI_INSTALL='false' +export CYPRESS_TIMEZONE='UTC' +export CYPRESS_MOCK_NEW_METRICS='false' +export CYPRESS_SESSION='true' +export CYPRESS_DEBUG='false' +EOF +``` + +### 2. 
Permissions + +This skill runs autonomously and needs pre-approved permissions in `.claude/settings.local.json` to avoid interactive approval prompts blocking the loop. Required permissions: + +```json +{ + "permissions": { + "allow": [ + "Bash(git stash:*)", + "Bash(git checkout:*)", + "Bash(git checkout -b:*)", + "Bash(git branch:*)", + "Bash(git add:*)", + "Bash(git commit:*)", + "Bash(git status:*)", + "Bash(git diff:*)", + "Bash(git log:*)", + "Bash(rm -f screenshots/cypress_report_*.json:*)", + "Bash(rm -f screenshots/merged-report.json:*)", + "Bash(rm -rf cypress/screenshots/*:*)", + "Bash(rm -rf cypress/videos/*:*)", + "Bash(npx cypress run:*)", + "Bash(npx mochawesome-merge:*)", + "Bash(source cypress/export-env.sh:*)", + "Bash(cd /home/drajnoha/Code/monitoring-plugin:*)", + "Bash(find /home/drajnoha/Code/monitoring-plugin/web/cypress:*)", + "Bash(ls:*)" + ] + } +} +``` + +The `rm` permissions are scoped to test artifact directories only (mochawesome reports, screenshots, videos) — these are regenerated every run. + +### 3. Unsigned Commits + +All commits in this workflow use `--no-gpg-sign` to avoid GPG passphrase prompts blocking the loop. These unsigned commits live on a working branch and are intended to be **squash-merged** by the user with their own signature when approved. Never push unsigned commits directly to main. + +If using CI analysis, also add to `web/.claude/settings.local.json`: +```json +"WebFetch(domain:gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com)" +``` + +## Instructions + +Execute the following steps in order. This is the main orchestrator — it coordinates sub-agents and manages the iteration loop. + +### Step 0: CI Context (optional) + +If `ci-url` is provided, run `/cypress:test-iteration:analyze-ci-results` first to get CI failure context. + +Capture the CI analysis output: +- If **all failures are INFRA_***: Report the infrastructure issues to the user and **STOP**. No test changes will help. 
+- If **mixed infra + test/code**: Note the infra issues for the user, but proceed with the test/code failures only. +- If **all test/code**: Proceed. Use the CI diagnosis (commit correlation, screenshots) as context for the local iteration. + +Store the CI analysis as `ci_context` for later reference by diagnosis agents. + +### Step 1: Branch Setup + +First, check the current branch: +```bash +git rev-parse --abbrev-ref HEAD +``` + +**Decision logic:** +- If `skip-branch` is "true": Stay on the current branch, skip to Step 3. +- If already on a `test/incident-robustness-*` branch: Stay on it, skip to Step 3. +- If on any other non-main working branch (e.g., `agentic-test-iteration`, a feature branch): Ask the user whether to create a child branch or work on the current one. +- If on `main`: Create a new branch. + +To create a branch (only when needed): +```bash +git checkout -b test/incident-robustness-$(date +%Y-%m-%d) +``` + +If that branch name already exists, append a suffix: `-2`, `-3`, etc. + +**IMPORTANT**: Do NOT combine `cd` and `git` in the same command — compound `cd && git` commands trigger a security approval prompt that blocks autonomous execution. Always use separate Bash calls, or set the working directory before running git. + +### Step 2: Read Stability Ledger + +Read `web/cypress/reports/test-stability.md` and parse the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END`. + +Extract a compressed summary for use during diagnosis: + +``` +Stability context (from {N} previous runs): + Known flaky: + - "test full title" — passed {X}/{Y} runs, last failure: {date}, reason: {reason} + - "test full title" — passed {X}/{Y} runs, last failure: {date}, reason: {reason} + Recently fixed: + - "test full title" — fixed by {commit}, stable since {date} + Persistent failures: + - "test full title" — failed {X}/{Y} runs, never fixed +``` + +If the ledger has no data yet (first run), note "No prior stability data" and continue. 
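Extracting the machine-readable block can be sketched as follows. The `results` array shape matches the ledger schema shown in Step 15; the surrounding markdown here is illustrative.

```python
import json
import re

LEDGER = """
## Machine-readable data
STABILITY_DATA_START
{"tests": {"Incidents page filters by severity": {
    "results": ["pass", "fail", "pass", "pass"],
    "last_failure_reason": "Timed out waiting for chart",
    "last_failure_date": "2026-03-20"}},
 "runs": []}
STABILITY_DATA_END
"""

def load_stability(markdown: str) -> dict:
    """Parse the JSON between the STABILITY_DATA markers; empty ledger on first run."""
    m = re.search(r"STABILITY_DATA_START(.*?)STABILITY_DATA_END", markdown, re.DOTALL)
    return json.loads(m.group(1)) if m else {"tests": {}, "runs": []}

def summarize(data: dict) -> list[str]:
    """Build the compressed 'Known flaky' lines for the stability context."""
    lines = []
    for title, rec in data["tests"].items():
        passed = rec["results"].count("pass")
        total = len(rec["results"])
        if 0 < passed < total:  # mixed results = flaky
            lines.append(
                f'- "{title}" — passed {passed}/{total} runs, '
                f'last failure: {rec["last_failure_date"]}, '
                f'reason: {rec["last_failure_reason"]}'
            )
    return lines

print(summarize(load_stability(LEDGER)))
```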
+
+Store this as `stability_context` — it will be passed to Diagnosis Agents in Step 8 to help them:
+- Prioritize known-flaky tests (these are higher value to fix)
+- Avoid re-attempting fixes that already failed in previous runs
+- Distinguish new failures from recurring ones
+
+### Step 3: Resolve Target
+
+Based on the `target` parameter, determine the Cypress run command:
+
+| Target | Spec | Grep Tags |
+|--------|------|-----------|
+| `all` | `cypress/e2e/incidents/**/*.cy.ts` | `@incidents --@e2e-real --@flaky --@demo` |
+| `regression` | `cypress/e2e/incidents/regression/**/*.cy.ts` | `@incidents --@e2e-real --@flaky` |
+| specific file | `cypress/e2e/incidents/{target}` | (none) |
+| grep pattern | `cypress/e2e/incidents/**/*.cy.ts` | (none; use `--env grep="{target}"` to match by test title) |
+
+### Step 4: Clean Previous Results
+
+From the `web/` directory:
+```bash
+bash scripts/clean-test-artifacts.sh
+```
+
+### Step 5: Run Tests
+
+Execute Cypress inline (NOT in a separate terminal). From the `web/` directory:
+
+```bash
+source cypress/export-env.sh && npx cypress run --spec "{SPEC}" --env grepTags="{GREP_TAGS}"
+```
+
+Where `{GREP_TAGS}` comes from the "Grep Tags" column in the target table above. If the target has no grep tags, omit the `--env` flag entirely. For the grep-pattern target, pass `--env grep="{target}"` instead, since it matches test titles rather than tags.
+
+**IMPORTANT**: Use `grepTags` for tag-based targets (not `grep`). The `grep` option searches test names as text, while `grepTags` filters by `@tag` annotations. Using `grep` with tag strings like `@incidents --@e2e-real` will match nothing and cause all specs to run unfiltered. Reserve `grep` for the grep-pattern target, where a test title is exactly what you want to match.
+
+Note: `source && npx` is one logical operation (env setup + run) and is acceptable as a single command.
+
+**IMPORTANT**: Tests can run for a long time, especially e2e tests that wait for alerts to fire. Use `run_in_background` to avoid blocking, and check the output when notified of completion.
+
+Capture the exit code:
+- `0` = all passed
+- non-zero = failures occurred
+
+### Step 6: Parse Results
+
+Merge mochawesome reports and parse.
From the `web/` directory: + +```bash +npx mochawesome-merge screenshots/cypress_report_*.json -o screenshots/merged-report.json +``` + +Read `screenshots/merged-report.json` and extract: + +For each test: +``` +{ + spec_file: string, // from results[].fullFile + suite: string, // from suites[].title + test_name: string, // from tests[].title + full_title: string, // from tests[].fullTitle + state: "passed" | "failed" | "skipped", + error_message: string, // from tests[].err.message (if failed) + stack_trace: string, // from tests[].err.estack (if failed) + duration_ms: number // from tests[].duration +} +``` + +Build a failure list and a pass list. + +**Note**: Mochawesome JSON has nested suites. Walk the tree recursively: +``` +results[] -> suites[] -> tests[] + -> suites[] -> tests[] (nested suites) +``` + +### Step 7: Identify Screenshots + +For each failure, find the corresponding screenshot: + +```bash +find /home/drajnoha/Code/monitoring-plugin/web/cypress/screenshots -name "*.png" -type f +``` + +Match screenshots to failures using the naming convention: +``` +{Suite Name} -- {Test Title} (failed).png +{Suite Name} -- {Test Title} -- before all hook (failed).png +``` + +### Step 8: Diagnosis Loop + +**If no failures** (exit code 0): Skip to Step 13 (flakiness probe). + +**If failures exist**: For each failing test, spawn a **Diagnosis Agent** (Explore-type sub-agent). + +Use the `/cypress:test-iteration:diagnose-test-failure` skill prompt. Provide: +- `test-name`: the full title +- `spec-file`: the spec file path +- `error-message`: the error message +- `screenshot-path`: absolute path to the failure screenshot +- `stack-trace`: the error stack trace +- `ci-context`: any relevant context from Step 0 +- `stability-context`: the compressed stability summary from Step 2 (prior flakiness history, previous failure patterns, previously attempted fixes) + +**Parallelization**: If failures are in **different spec files**, spawn diagnosis agents in parallel. 
If they're in the **same spec file**, diagnose sequentially (they may share root causes like a broken `before all` hook). + +**Before-all hook failures**: If a `before all` hook failed, all tests in that suite were skipped. Diagnose only the hook failure — fixing it will unblock all skipped tests. + +Collect all diagnoses. Separate into: +- **Fixable**: `TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE` +- **Blocking**: `REAL_REGRESSION`, `INFRA_ISSUE` + +If any **blocking** issues found: Report them to the user. Continue fixing the fixable issues. + +### Step 9: Fix Loop + +For each fixable failure, spawn a **Fix Agent** (general-purpose sub-agent). + +Provide the Fix Agent with: +1. The full diagnosis from Step 8 +2. The test file content (read it) +3. The page object content (read `cypress/views/incidents-page.ts`) +4. The fixture content (if relevant) +5. These constraints: + +``` +## Fix Constraints + +You may ONLY edit files in these paths: +- web/cypress/e2e/incidents/**/*.cy.ts (test files) +- web/cypress/fixtures/incident-scenarios/*.yaml (fixtures) +- web/cypress/views/incidents-page.ts (page object) +- web/cypress/support/incidents_prometheus_query_mocks/** (mock layer) + +You must NOT edit: +- web/src/** (source code — that's Phase 2) +- Non-incident test files +- Cypress config or support infrastructure +- Any file outside the web/ directory + +## Fix Guidelines + +- Prefer the minimal change that fixes the issue +- Don't refactor surrounding code — only fix the failing test +- If adding a wait/timeout, prefer Cypress retry-ability (.should()) over cy.wait() +- If fixing a selector, check that the new selector exists in the current DOM + by reading the relevant React component in src/ (read-only, don't edit) +- If fixing a fixture, validate it against the fixture schema + (run /cypress:test-development:validate-incident-fixtures mentally or reference the schema) +- If adding a page object method, follow existing naming conventions +- **Before 
applying any fix, check git history** for the file being changed: + `git log origin/main -- ` — look for prior commits that explicitly + removed or replaced the pattern you are about to introduce. For example, + `cy.reload()` was previously removed from prepareIncidentsPageForSearch + because it breaks dynamic plugin chunk loading in headless CI. The iteration + agent lacks git history context and will re-discover "fixes" that were + already tried and reverted. If a prior commit removed the pattern for a + documented reason, do NOT re-introduce it. +``` + +After the Fix Agent returns, verify the fix makes sense: +- Does the edit address the diagnosed root cause? +- Could the edit break other tests? +- Is it the minimal change needed? + +If the fix looks wrong, re-diagnose with additional context. + +### Step 10: Validate Fixes + +After applying fixes, re-run **only the previously failing tests**: + +From the `web/` directory: +```bash +source cypress/export-env.sh && npx cypress run --spec "{SPEC}" --env grepTags="{FAILING_TEST_TAGS}" +``` + +For each test: +- **Now passes**: Stage the fix files with `git add` +- **Still fails**: Re-diagnose (increment retry counter). Max 2 retries per test. +- **After 2 retries still failing**: Mark as `UNRESOLVED` and report to user + +### Step 11: Commit Batch + +After all fixable failures are addressed (or max retries reached): + +Stage and commit as separate commands (never chain `cd && git`): +```bash +git add +``` +```bash +git commit --no-gpg-sign -m "" +``` + +Commit message format: +``` +fix(tests): + +- : +- : + +Classifications: N TEST_BUG, N FIXTURE_ISSUE, N PAGE_OBJECT_GAP, N MOCK_ISSUE +Unresolved: N (if any) + +Co-Authored-By: Claude Opus 4.6 +``` + +Track commit count. If commit count reaches **5**: Notify the user that the review threshold has been reached and ask whether to continue or pause for review. 
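Both Step 6 and the validation re-run in Step 10 read the merged mochawesome report; the recursive suite walk Step 6 describes can be sketched as follows (field names follow the shape listed there, which may vary slightly between mochawesome versions):

```python
def walk_suites(suite: dict, parent_titles=()):
    """Recursively yield (full_title, state, error_message) for every test
    in a mochawesome suite tree, including nested suites."""
    titles = parent_titles + ((suite["title"],) if suite.get("title") else ())
    for test in suite.get("tests", []):
        err = test.get("err") or {}
        yield (" ".join(titles + (test["title"],)), test["state"], err.get("message"))
    for child in suite.get("suites", []):
        yield from walk_suites(child, titles)

# Illustrative merged-report fragment (one spec, one nested suite):
report = {
    "results": [{
        "fullFile": "cypress/e2e/incidents/01.incidents.cy.ts",
        "suites": [{
            "title": "Incidents page",
            "tests": [{"title": "loads the chart", "state": "passed", "err": {}}],
            "suites": [{
                "title": "filtering",
                "tests": [{"title": "filters by severity", "state": "failed",
                           "err": {"message": "Timed out retrying"}}],
                "suites": [],
            }],
        }],
    }],
}

failures = [t for result in report["results"]
            for suite in result["suites"]
            for t in walk_suites(suite)
            if t[1] == "failed"]
print(failures)
```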
+ +### Step 12: Iterate + +If there were failures and `current_iteration < max-iterations`: +- Increment iteration counter +- Go back to **Step 4** (clean results and re-run) + +This catches cascading fixes — e.g., fixing a `before all` hook unblocks skipped tests that may have their own issues. + +If all tests pass: Proceed to Step 13. + +### Step 13: Flakiness Probe + +Run the full target test suite `flakiness-runs` times (default: 3), even if everything is green. + +For each run: +1. Clean previous results (Step 4) +2. Run tests (Step 5) +3. Parse results (Step 6) +4. Record per-test pass/fail + +After all runs, compute flakiness: + +``` +Flakiness Report: + Total tests: N + Stable (all runs passed): N + Flaky (some runs failed): N + Broken (all runs failed): N + + Flaky tests: + - "test name" — passed 2/3 runs + Error on failure: + - "test name" — passed 1/3 runs + Error on failure: +``` + +For each **flaky** test: +- Diagnose it using `/cypress:test-iteration:diagnose-test-failure` with the context that it's intermittent +- Common flaky patterns: race conditions, animation timing, network mock timing, DOM detach/reattach +- Apply fix if confident (add `.should('exist')` guards, use `{ timeout: N }`, avoid `.eq(N)` on dynamic lists) +- Re-run flakiness probe on just the fixed tests to verify + +### Step 14: Final Report + +Output a summary: + +``` +# Iteration Complete + +## Branch: test/incident-robustness-YYYY-MM-DD +## Commits: N +## Iterations: N + +## Results +- Tests run: N +- Passing: N +- Fixed in this session: N +- Unresolved: N (details below) +- Flaky (stabilized): N +- Flaky (remaining): N + +## Fixes Applied +1. [commit-sha] fix(tests): + - : + +2. [commit-sha] fix(tests): + - : + +## Unresolved Issues +- "test name": REAL_REGRESSION — . Source file X was modified in PR #N. 
+- "test name": UNRESOLVED after 2 retries — + +## Remaining Flakiness +- "test name": 2/3 passed — timing issue in chart rendering, needs investigation + +## Recommendations +- [Next steps for unresolved issues] +- [Whether to merge current fixes or wait] +``` + +### Step 15: Update Stability Ledger + +After the final report, update `web/cypress/reports/test-stability.md`. + +Read the file and update both sections: + +**1. Current Status table** — for each test in this run: +- If test already in table: update pass rate (rolling average across all recorded runs), update trend +- If test is new: add a row +- Pass rate = total passes / total runs across all recorded iterations +- Trend: compare last 3 runs — improving / stable / degrading + +**2. Run History log** — append a new row: +``` +| {next_number} | {YYYY-MM-DD} | local | {branch} | {total_tests} | {passed} | {failed} | {flaky} | {commit_sha} | +``` + +**3. Machine-readable data** — update the JSON block between `STABILITY_DATA_START` and `STABILITY_DATA_END`: +```json +{ + "tests": { + "test full title": { + "results": ["pass", "pass", "fail", "pass"], + "last_failure_reason": "Timed out...", + "last_failure_date": "2026-03-23", + "fixed_by": "abc1234" + } + }, + "runs": [ + { + "date": "2026-03-23", + "type": "local", + "branch": "test/incident-robustness-2026-03-23", + "total": 15, + "passed": 15, + "failed": 0, + "flaky": 0, + "commit": "abc1234" + } + ] +} +``` + +Commit the ledger update together with the final batch of fixes if any, or as a standalone commit: +```bash +git add web/cypress/reports/test-stability.md +``` +```bash +git commit --no-gpg-sign -m "docs: update test stability ledger — {passed}/{total} passed, {flaky} flaky" +``` + +### Error Handling + +- **Cypress crashes** (not just test failures): Check if it's an OOM issue (`--max-old-space-size`), a missing dependency, or a config problem. Report to user. 
+- **No `export-env.sh`**: Create it directly using the cluster credentials from the conversation. Do NOT use the interactive `/cypress:setup`. +- **No mochawesome reports generated**: Check if the reporter config is correct. Fall back to parsing Cypress console output. +- **Git conflicts**: If the working branch has conflicts with changes, report to user and stop. +- **Sub-agent failure**: If a Diagnosis or Fix agent fails, log the error and skip that test. Don't let one broken agent block the whole loop. + +### Guardrails + +- **Never edit source code** (`src/`) in Phase 1 +- **Never disable a test** — if a test can't be fixed, mark it as unresolved, don't add `.skip()` +- **Never add `@flaky` tag** to a test — that's a human decision +- **Never change test assertions to match wrong behavior** — if the UI is wrong, it's a REAL_REGRESSION +- **Max 2 retries per test** to avoid infinite loops +- **Max 5 commits before pausing** for user review +- **Always run flakiness probe** before declaring success diff --git a/.claude/commands/cypress/test-iteration/productize-iterations.md b/.claude/commands/cypress/test-iteration/productize-iterations.md new file mode 100644 index 000000000..3745727ff --- /dev/null +++ b/.claude/commands/cypress/test-iteration/productize-iterations.md @@ -0,0 +1,289 @@ +--- +description: Consolidate one or more agentic test iteration branches into a clean, shippable fix branch — analyzes overlaps, merges intelligently, verifies with flakiness probing, and pushes +parameters: + - name: branch-name + description: "Target branch name (e.g. test/OBSINTA-1290-incident-stability). Created from main." + required: true + type: string + - name: source-branches + description: "Comma-separated list of iteration branch names or a glob (e.g. 'FIX-PROPOSAL-AGENTIC/*' or 'branch-a,branch-b,branch-c'). Fetched from origin." 
+    required: true
+    type: string
+  - name: flakiness-runs
+    description: "Number of flakiness probe runs after fixes are applied (default: 3)"
+    required: false
+    type: string
+  - name: test-target
+    description: "Cypress test target — 'all', 'regression', a spec file, or grepTags pattern (default: all)"
+    required: false
+    type: string
+---
+
+# Productize Agentic Test Iterations
+
+Consolidate one or more agentic iteration branches into a single clean, shippable branch.
+When multiple branches exist, analyze whether their changes are complementary, redundant,
+or conflicting — then merge them intelligently rather than blindly stacking cherry-picks.
+
+## Prerequisites
+
+- `web/cypress/export-env.sh` must exist with cluster credentials.
+  If missing, create it from cluster credentials in the conversation (do NOT run `/cypress:setup`).
+- Node modules installed in `web/` (`npm install`).
+
+## Step 1: Discover and Fetch Source Branches
+
+Parse `$2` (source-branches):
+- If it contains `*`, treat as a glob: `git ls-remote origin | grep <pattern>`
+- If comma-separated, split into a list
+
+Fetch all source branches as remote tracking refs:
+```bash
+git fetch origin "<branch>:refs/remotes/origin/<branch>"
+```
+
+If any source branch has an overview/index document (e.g. `docs/agentic-fix-proposals.md`),
+read it first — it may describe the branches and their relationships.
+
+Report what was found:
+```
+Found N source branches:
+  - origin/<name> — N commits above main
+  - origin/<name> — N commits above main
+```
+
+## Step 2: Analyze Each Branch
+
+For each source branch, extract:
+
+1. **Commits above main**: `git log origin/main..<branch> --oneline`
+2. **Files changed**: `git diff origin/main..<branch> --stat`
+3. **Commit messages and descriptions**: Read each commit message for intent
+
+Build a structured summary per branch:
+```
+Branch: <name>
+Origin: origin/<name>
+Commits: N
+Files touched: <files>
+What it fixes:
+  - <summary of each fix>
+```
+
+## Step 3: Evaluate Overlaps
+
+This is the critical step.
For each pair of branches that touch the **same file**:
+
+1. **Read both versions** of the overlapping file
+2. **Identify the relationship**:
+   - **Complementary**: Different fixes to different parts of the same file (both needed)
+   - **Progressive**: One branch's changes are a superset of another's (keep the latest)
+   - **Redundant**: Both branches solve the same problem differently (pick the better one)
+   - **Conflicting**: Changes that are incompatible (requires manual resolution)
+3. **Decide the merge strategy** for each overlap
+
+Also check for cross-file dependencies:
+- Does branch A add a function that branch B calls?
+- Does branch A change a signature that branch B depends on?
+
+**Critical: Check git history for reverted patterns.** For each changed file, run
+`git log origin/main -- <file>` and read recent commit messages. Look for
+patterns where a prior commit explicitly removed or replaced something that the
+iteration branch is re-introducing. Common examples:
+- `cy.reload()` removed in favor of SPA navigation (dynamic plugin chunk loading
+  breaks on full page reload)
+- Flattened selectors replaced with grouped selectors
+- Timeouts adjusted for a specific reason
+
+If an iteration branch re-introduces a pattern that was previously removed, flag
+it as **REGRESSIVE** and investigate whether the original removal reason still
+applies. The iteration agent doesn't have git history context and will often
+re-discover a "fix" that was already tried and reverted.
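The reverted-pattern check described above can be sketched as a small Python helper. Everything here is illustrative: the pattern table, function names, and removal signals are assumptions for the sketch, not code that exists in the repo.

```python
import subprocess

# Illustrative only: patterns an iteration branch might re-introduce, mapped to
# phrases expected in the commit message that originally removed them.
REVERTED_PATTERNS = {
    "cy.reload()": ("remove cy.reload", "spa navigation"),
}


def regressive_patterns(added_lines, history_log):
    """Flag added lines that re-introduce a previously removed pattern."""
    log = history_log.lower()
    flagged = []
    for pattern, removal_signals in REVERTED_PATTERNS.items():
        reintroduced = any(pattern in line for line in added_lines)
        previously_removed = any(signal in log for signal in removal_signals)
        if reintroduced and previously_removed:
            flagged.append(pattern)
    return flagged


def file_history(path, commits=20):
    """Recent commit subjects and bodies for one file (requires a git checkout)."""
    result = subprocess.run(
        ["git", "log", f"-n{commits}", "--format=%s %b", "origin/main", "--", path],
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A branch's added lines (the `+` side of its diff against main) would be checked with `regressive_patterns(added, file_history(path))`, and any hit reported as REGRESSIVE in the overlap analysis.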
+
+Produce an evaluation report:
+```
+## Overlap Analysis
+
+### <file>
+Touched by: branch-1, branch-3
+Relationship: COMPLEMENTARY
+  - branch-1: adds OOM prevention (cy.reload, quiet search)
+  - branch-3: adds warmUpForPlugin() method
+Strategy: Apply both — they modify different sections
+
+### <file>
+Touched by: branch-1, branch-2
+Relationship: PROGRESSIVE
+  - branch-1: adds cy.reload() to search loop
+  - branch-2: rewrites search loop with quiet search (includes reload)
+Strategy: Take branch-2 only — it's a superset
+
+## Proposed Commit Structure
+1. <type>: <description> — from branches X, Y
+2. <type>: <description> — from branch Z
+```
+
+**Present this evaluation to the user and wait for confirmation before proceeding.**
+
+## Step 4: Create Clean Branch
+
+```bash
+git checkout -b $1 origin/main
+```
+
+If the branch already exists, ask the user whether to overwrite or append a suffix.
+
+## Step 5: Apply Changes
+
+Based on the evaluation from Step 3, apply changes in logical order.
+
+**Do NOT blindly cherry-pick individual commits.** Instead:
+
+1. For **progressive** overlaps: only apply the most complete version
+2. For **redundant** overlaps: apply the one chosen in Step 3
+3. For **complementary** changes: apply in dependency order (e.g., page object before tests that use it)
+4. For **conflicts**: manually merge, reading both versions and combining intent
+
+After applying each logical group:
+- Run `npx prettier --write <files>` immediately
+- Run `npx eslint --fix <files>`
+- Check for remaining lint errors with `npx eslint <files>`
+- Fix any remaining max-len or formatting issues manually
+
+**Do not defer lint fixes to a separate commit.** Each commit must pass the pre-commit hook.
+
+## Step 6: Structure Commits by Logical Concern
+
+Group changes into commits by **what problem they solve**, not by which branch they came from.
+Good commit groupings:
+- One commit per distinct fix category (OOM prevention, plugin warm-up, test hygiene, etc.)
+- Page object changes bundled with the test changes that use them
+- Fixture additions bundled with the tests that reference them
+
+Bad commit groupings:
+- One commit per source branch (archaeology, not intent)
+- Intermediate steps that are immediately superseded
+- Separate "fix lint" commits
+
+Commit message format:
+```
+fix(tests): <summary>
+
+<body>
+
+Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
+```
+
+Use `--no-gpg-sign` for all commits (sandbox environment).
+
+## Step 7: Run Verification
+
+### 7a: Resolve test target
+
+Based on `$4` (test-target, default: `all`):
+
+| Target | Spec | Grep Tags |
+|--------|------|-----------|
+| `all` | `cypress/e2e/incidents/**/*.cy.ts` | `@incidents --@e2e-real --@xfail --@demo` |
+| `regression` | `cypress/e2e/incidents/regression/**/*.cy.ts` | `@incidents --@e2e-real --@xfail` |
+| specific file | `cypress/e2e/incidents/{target}` | (none) |
+
+### 7b: Run tests once
+
+From `web/`:
+```bash
+bash scripts/clean-test-artifacts.sh
+source cypress/export-env.sh && node --max-old-space-size=4096 \
+  ./node_modules/.bin/cypress run --browser electron \
+  --spec "{SPEC}" --env grepTags="{GREP_TAGS}"
+```
+
+If there are failures, diagnose using `/cypress:test-iteration:diagnose-test-failure`.
+Apply fixes and re-run. Max 2 retries per test.
+
+### 7c: Run e2e-real (if cluster available)
+
+If `web/cypress/export-env.sh` has cluster credentials and the cluster is reachable:
+```bash
+source cypress/export-env.sh && node --max-old-space-size=4096 \
+  ./node_modules/.bin/cypress run --browser electron \
+  --spec "cypress/e2e/incidents/00.coo_incidents_e2e.cy.ts"
+```
+
+This test takes 10-25 minutes. Run in background.
+If it fails, diagnose the failure — it may reveal issues the regression suite doesn't cover.
+
+### 7d: Flakiness probe
+
+Run the test target `$3` times (default: 3). For each run:
+1. Clean artifacts
+2. Run tests
+3. Record per-test pass/fail
+
+Compute flakiness:
+```
+Flakiness Report:
+  Total tests: N
+  Stable (all runs passed): N
+  Flaky (some runs failed): N
+  Broken (all runs failed): N
+```
+
+If any flaky tests are found, diagnose and fix them. Re-run the probe on fixed tests.
+
+## Step 8: Present Results and Confirm Push
+
+Present the final state to the user:
+```
+# Ready to Push
+
+## Branch: $1
+## Commits: N
+
+| # | SHA | Description |
+|---|-----|-------------|
+
+## Verification
+- Regression: N/N passed, N runs, 0 flaky
+- e2e-real: passed / skipped / failed
+- Files changed: N
+
+## Excluded from this branch
+- <anything intentionally left out, with reason>
+```
+
+**Wait for user confirmation before pushing.**
+
+## Step 9: Push
+
+```bash
+git push origin $1
+```
+
+If the push fails due to auth, try HTTPS with `gh auth token`:
+```bash
+git remote set-url origin https://$(gh auth token)@github.com/<owner>/<repo>.git
+git push origin $1
+```
+
+Report the push result and suggest PR creation if desired.
+
+## Error Handling
+
+- **Lint failures after cherry-pick**: Run prettier + eslint --fix first. If max-len errors
+  remain, shorten log messages or wrap comments. Never create a separate "fix lint" commit.
+- **Cherry-pick conflicts**: Read both sides, understand intent, merge manually. Never use
+  `--ours` or `--theirs` without understanding what's being dropped.
+- **Test failures after merge**: Diagnose with `/cypress:test-iteration:diagnose-test-failure`.
+  If the failure is caused by the merge (not a pre-existing issue), fix it before proceeding.
+- **Cypress crashes**: Check `--max-old-space-size`, missing deps, or config issues.
+- **No Chrome**: Use `--browser electron` as fallback.
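The stable/flaky/broken split that Step 7d's flakiness report asks for can be computed with a helper along these lines (a sketch with assumed names, not part of the repo):

```python
def classify_flakiness(results_by_test):
    """Bucket each test as stable, flaky, or broken from per-run pass records.

    results_by_test maps a test title to one boolean per probe run
    (True = passed in that run).
    """
    summary = {"stable": 0, "flaky": 0, "broken": 0}
    for passes in results_by_test.values():
        if all(passes):
            summary["stable"] += 1   # every probe run passed
        elif any(passes):
            summary["flaky"] += 1    # mixed results across runs
        else:
            summary["broken"] += 1   # failed in every run
    return summary
```

Tests in the `flaky` bucket go back through diagnosis and a re-run of the probe; `broken` ones are treated as real failures, not flakes.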
+ +## Guardrails + +- **Never edit source code** (`src/`) — only test files, page objects, fixtures +- **Never disable tests** — no `.skip()`, no adding `@flaky` tags +- **Never push without user confirmation** +- **Never include process artifacts** (iteration docs, stability ledgers) in the output branch + unless the user explicitly asks +- **Preserve commit authorship** where possible — use the original author from cherry-picks diff --git a/.claude/commands/cypress/test-iteration/scripts/notify-slack.py b/.claude/commands/cypress/test-iteration/scripts/notify-slack.py new file mode 100644 index 000000000..49e3c6bfd --- /dev/null +++ b/.claude/commands/cypress/test-iteration/scripts/notify-slack.py @@ -0,0 +1,305 @@ +#!/usr/bin/env python3 +"""Send Slack notifications for agentic test iteration loops. + +Supports two modes based on environment variables: + +Option A (Webhook — one-way): + SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." + +Option B (Bot with thread replies — two-way): + SLACK_BOT_TOKEN="xoxb-..." + SLACK_CHANNEL_ID="C0123456789" + +If neither is set, prints the message to stdout and exits cleanly. 
+
+Usage:
+    # Send a notification (both modes)
+    python3 notify-slack.py send <event_type> <message> [options]
+
+    # Wait for thread reply (Option B only)
+    python3 notify-slack.py wait <message_ts> [--timeout 600]
+
+Event types:
+    fix_applied, ci_started, ci_complete, ci_failed,
+    review_needed, iteration_done, flaky_found, blocked
+
+Options:
+    --pr          PR number (adds link to message)
+    --branch      Branch name
+    --url         CI run URL
+    --thread-ts   Reply in a thread (Option B)
+    --timeout     Review window timeout for 'wait' command (default: 600)
+"""
+
+import argparse
+import json
+import os
+import subprocess
+import sys
+import time
+import urllib.request
+import urllib.error
+
+
+EMOJI = {
+    "fix_applied": ":wrench:",
+    "ci_started": ":hourglass_flowing_sand:",
+    "ci_complete": ":white_check_mark:",
+    "ci_failed": ":x:",
+    "review_needed": ":eyes:",
+    "iteration_done": ":checkered_flag:",
+    "flaky_found": ":warning:",
+    "blocked": ":octagonal_sign:",
+}
+
+
+def build_blocks(event_type, message, pr=None, branch=None, url=None):
+    """Build Slack Block Kit blocks for the notification."""
+    emoji = EMOJI.get(event_type, ":robot_face:")
+
+    blocks = [
+        {
+            "type": "section",
+            "text": {
+                "type": "mrkdwn",
+                "text": f"{emoji} *Agent: {event_type.replace('_', ' ').title()}*",
+            },
+        },
+        {
+            "type": "section",
+            "text": {"type": "mrkdwn", "text": message},
+        },
+    ]
+
+    context_parts = []
+    if pr:
+        context_parts.append(f"PR #{pr}")
+    if branch:
+        context_parts.append(f"Branch: `{branch}`")
+    if url:
+        context_parts.append(f"<{url}|CI Run>")
+
+    if context_parts:
+        blocks.append(
+            {
+                "type": "context",
+                "elements": [
+                    {"type": "mrkdwn", "text": " | ".join(context_parts)}
+                ],
+            },
+        )
+
+    return blocks
+
+
+def send_webhook(webhook_url, blocks):
+    """Option A: Send via incoming webhook."""
+    payload = json.dumps({"blocks": blocks}).encode("utf-8")
+
+    req = urllib.request.Request(
+        webhook_url,
+        data=payload,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+
+    try:
+        with
urllib.request.urlopen(req) as resp: + return {"ok": True, "status": resp.status} + except urllib.error.HTTPError as e: + print(f"Webhook failed: HTTP {e.code} — {e.read().decode()}", file=sys.stderr) + return {"ok": False, "error": str(e)} + + +def slack_api(token, method, payload): + """Call a Slack Web API method.""" + url = f"https://slack.com/api/{method}" + data = json.dumps(payload).encode("utf-8") + + req = urllib.request.Request( + url, + data=data, + headers={ + "Content-Type": "application/json; charset=utf-8", + "Authorization": f"Bearer {token}", + }, + method="POST", + ) + + try: + with urllib.request.urlopen(req) as resp: + return json.loads(resp.read().decode()) + except urllib.error.HTTPError as e: + body = e.read().decode() + print(f"Slack API {method} failed: HTTP {e.code} — {body}", file=sys.stderr) + return {"ok": False, "error": str(e)} + + +def send_bot(token, channel, blocks, thread_ts=None): + """Option B: Send via bot token.""" + payload = { + "channel": channel, + "blocks": blocks, + } + if thread_ts: + payload["thread_ts"] = thread_ts + + result = slack_api(token, "chat.postMessage", payload) + + if result.get("ok"): + ts = result.get("ts", "") + print(f"MESSAGE_TS={ts}") + return {"ok": True, "ts": ts} + else: + print(f"Bot send failed: {result.get('error')}", file=sys.stderr) + return {"ok": False, "error": result.get("error")} + + +def wait_for_reply(token, channel, message_ts, timeout=600, poll_interval=30): + """Option B: Poll for thread replies within a review window. + + Returns the latest user reply text, or None if no reply within timeout. 
+ Output format: + REPLY= + NO_REPLY + """ + # Get bot's own user ID to filter out its own messages + auth_result = slack_api(token, "auth.test", {}) + bot_user_id = auth_result.get("user_id", "") + + deadline = time.time() + timeout + seen_messages = set() + + # Seed with the original message to ignore it + seen_messages.add(message_ts) + + print(f"Waiting up to {timeout}s for reply in thread {message_ts}...", flush=True) + + while time.time() < deadline: + result = slack_api( + token, + "conversations.replies", + {"channel": channel, "ts": message_ts}, + ) + + if result.get("ok"): + messages = result.get("messages", []) + for msg in messages: + msg_ts = msg.get("ts", "") + user = msg.get("user", "") + + if msg_ts in seen_messages: + continue + seen_messages.add(msg_ts) + + # Skip bot's own messages + if user == bot_user_id: + continue + + # Found a user reply + reply_text = msg.get("text", "") + print(f"REPLY={reply_text}") + return reply_text + + remaining = int(deadline - time.time()) + if remaining > 0: + print( + f"No reply yet, {remaining}s remaining...", + file=sys.stderr, + flush=True, + ) + + time.sleep(min(poll_interval, max(1, remaining))) + + print("NO_REPLY") + return None + + +def cmd_send(args): + """Handle the 'send' subcommand.""" + webhook_url = os.environ.get("SLACK_WEBHOOK_URL", "") + bot_token = os.environ.get("SLACK_BOT_TOKEN", "") + channel_id = os.environ.get("SLACK_CHANNEL_ID", "") + + blocks = build_blocks( + args.event_type, args.message, pr=args.pr, branch=args.branch, url=args.url + ) + + # Option B: Bot token takes priority (supports two-way) + if bot_token and channel_id: + result = send_bot(bot_token, channel_id, blocks, thread_ts=args.thread_ts) + return 0 if result.get("ok") else 1 + + # Option A: Webhook (one-way) + if webhook_url: + result = send_webhook(webhook_url, blocks) + return 0 if result.get("ok") else 1 + + # No Slack configured — print to stdout and exit cleanly + emoji = EMOJI.get(args.event_type, "") + 
print(f"[slack-skip] {emoji} {args.event_type}: {args.message}") + return 0 + + +def cmd_wait(args): + """Handle the 'wait' subcommand.""" + bot_token = os.environ.get("SLACK_BOT_TOKEN", "") + channel_id = os.environ.get("SLACK_CHANNEL_ID", "") + + if not bot_token or not channel_id: + print( + "NO_REPLY (Option B not configured — SLACK_BOT_TOKEN and SLACK_CHANNEL_ID required)" + ) + return 0 + + reply = wait_for_reply( + bot_token, channel_id, args.message_ts, timeout=args.timeout + ) + return 0 + + +def main(): + parser = argparse.ArgumentParser( + description="Slack notifications for agentic test iteration" + ) + subparsers = parser.add_subparsers(dest="command", required=True) + + # 'send' subcommand + send_parser = subparsers.add_parser("send", help="Send a notification") + send_parser.add_argument( + "event_type", + choices=list(EMOJI.keys()), + help="Event type", + ) + send_parser.add_argument("message", help="Message text (Slack mrkdwn supported)") + send_parser.add_argument("--pr", help="PR number") + send_parser.add_argument("--branch", help="Branch name") + send_parser.add_argument("--url", help="CI run URL") + send_parser.add_argument( + "--thread-ts", help="Thread timestamp to reply in (Option B)" + ) + + # 'wait' subcommand + wait_parser = subparsers.add_parser( + "wait", help="Wait for thread reply (Option B only)" + ) + wait_parser.add_argument("message_ts", help="Message timestamp to watch") + wait_parser.add_argument( + "--timeout", + type=int, + default=600, + help="Seconds to wait for reply (default: 600)", + ) + + args = parser.parse_args() + + if args.command == "send": + return cmd_send(args) + elif args.command == "wait": + return cmd_wait(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/commands/cypress/test-iteration/scripts/poll-ci-status.py b/.claude/commands/cypress/test-iteration/scripts/poll-ci-status.py new file mode 100644 index 000000000..a2b96a508 --- /dev/null +++ 
b/.claude/commands/cypress/test-iteration/scripts/poll-ci-status.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+"""Poll OpenShift CI (Prow) job status for a PR until completion.
+
+Usage:
+    python3 poll-ci-status.py <pr_number> [job_substring] [max_attempts] [interval_seconds]
+
+Arguments:
+    pr_number          GitHub PR number to poll
+    job_substring      Substring to match in job name (default: e2e-incidents)
+    max_attempts       Maximum polling attempts (default: 30)
+    interval_seconds   Sleep between polls in seconds (default: 300)
+
+Output on completion:
+    CI_COMPLETE state=SUCCESS url=<link>
+    CI_COMPLETE state=FAILURE url=<link>
+    CI_TIMEOUT (if max_attempts reached)
+
+Requires: gh CLI authenticated with access to the repo.
+"""
+
+import subprocess
+import json
+import time
+import sys
+
+
+def poll(pr, job_substring="e2e-incidents", max_attempts=30, interval=300):
+    for attempt in range(max_attempts):
+        result = subprocess.run(
+            ["gh", "pr", "checks", pr, "--repo", "openshift/monitoring-plugin", "--json", "name,state,link"],
+            capture_output=True,
+            text=True,
+        )
+
+        if result.returncode != 0:
+            print(
+                f"gh pr checks failed (attempt {attempt + 1}/{max_attempts}): {result.stderr.strip()}",
+                flush=True,
+            )
+            time.sleep(interval)
+            continue
+
+        try:
+            checks = json.loads(result.stdout)
+        except json.JSONDecodeError:
+            print(
+                f"Invalid JSON from gh pr checks (attempt {attempt + 1}/{max_attempts})",
+                flush=True,
+            )
+            time.sleep(interval)
+            continue
+
+        found = False
+        for check in checks:
+            if job_substring in check.get("name", ""):
+                found = True
+                state = check["state"]
+                url = check.get("link", "")
+
+                if state in ("SUCCESS", "FAILURE"):
+                    print(f"CI_COMPLETE state={state} url={url}")
+                    return 0
+
+                print(
+                    f"CI_PENDING state={state}, attempt {attempt + 1}/{max_attempts}, sleeping {interval}s...",
+                    flush=True,
+                )
+                break
+
+        if not found:
+            print(
+                f"Job '{job_substring}' not found yet, attempt {attempt + 1}/{max_attempts}, sleeping {interval}s...",
+                flush=True,
+            )
+
+        time.sleep(interval)
+
+    print("CI_TIMEOUT")
+    return 1
+
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print(f"Usage: {sys.argv[0]} <pr_number> [job_substring] [max_attempts] [interval_seconds]")
+        sys.exit(2)
+
+    pr = sys.argv[1]
+    job = sys.argv[2] if len(sys.argv) > 2 else "e2e-incidents"
+    attempts = int(sys.argv[3]) if len(sys.argv) > 3 else 30
+    interval = int(sys.argv[4]) if len(sys.argv) > 4 else 300
+
+    sys.exit(poll(pr, job, attempts, interval))
diff --git a/.claude/commands/cypress/test-iteration/scripts/review-github.py b/.claude/commands/cypress/test-iteration/scripts/review-github.py
new file mode 100644
index 000000000..57877c103
--- /dev/null
+++ b/.claude/commands/cypress/test-iteration/scripts/review-github.py
@@ -0,0 +1,232 @@
+#!/usr/bin/env python3
+"""GitHub PR comment-based review flow for agentic test iteration.
+
+Posts fix details as PR comments and polls for author replies within a
+timed review window. Designed to work alongside Slack webhook notifications
+(one-way) — GitHub PR comments provide the two-way interaction channel.
+
+Usage:
+    # Post a review comment on a PR
+    python3 review-github.py post <pr> <message> [--repo owner/repo]
+
+    # Wait for author reply within a review window
+    python3 review-github.py wait <pr> <since> [--timeout 600] [--repo owner/repo]
+
+Output formats:
+    post:  COMMENT_ID=<id>  COMMENT_TIME=<iso8601>
+    wait:  REPLY=<text>  (author replied)
+           NO_REPLY     (timeout reached, no author reply)
+
+Requires: gh CLI authenticated with comment access to the target repo.
+
+Security: Author filtering is enforced deterministically in code —
+the PR author's login is fetched via API and only comments from that
+user are considered. This is not instruction-based filtering.
+""" + +import argparse +import json +import subprocess +import sys +import time +from datetime import datetime, timezone + + +DEFAULT_REPO = "openshift/monitoring-plugin" +MAGIC_PREFIX = "/agent" + + +def gh_api(endpoint, method="GET", body=None, repo=None): + """Call GitHub API via gh CLI.""" + cmd = ["gh", "api"] + if repo: + endpoint = endpoint.replace("{repo}", repo) + if method != "GET": + cmd.extend(["--method", method]) + if body: + for key, value in body.items(): + cmd.extend(["-f", f"{key}={value}"]) + cmd.append(endpoint) + + result = subprocess.run(cmd, capture_output=True, text=True) + if result.returncode != 0: + print(f"gh api failed: {result.stderr.strip()}", file=sys.stderr) + return None + + if not result.stdout.strip(): + return {} + + try: + return json.loads(result.stdout) + except json.JSONDecodeError: + print(f"Invalid JSON from gh api: {result.stdout[:200]}", file=sys.stderr) + return None + + +def get_pr_author(pr, repo): + """Fetch the PR author's login.""" + data = gh_api(f"repos/{repo}/pulls/{pr}") + if data and "user" in data: + return data["user"]["login"] + return None + + +def post_comment(pr, message, repo): + """Post a comment on a PR. Returns (comment_id, created_at).""" + data = gh_api( + f"repos/{repo}/issues/{pr}/comments", + method="POST", + body={"body": message}, + ) + if data and "id" in data: + comment_id = data["id"] + created_at = data.get("created_at", "") + print(f"COMMENT_ID={comment_id}") + print(f"COMMENT_TIME={created_at}") + return comment_id, created_at + + print("Failed to post comment", file=sys.stderr) + return None, None + + +def wait_for_author_reply(pr, since_timestamp, repo, timeout=600, poll_interval=30): + """Poll PR comments for a reply from the PR author. + + Only considers comments that: + 1. Were posted AFTER since_timestamp (time-scoped) + 2. Were authored by the PR author (deterministic .user.login check) + 3. 
Optionally start with the magic prefix /agent (if present, stripped from reply) + + Args: + pr: PR number + since_timestamp: ISO 8601 timestamp — only comments after this are considered + repo: owner/repo string + timeout: seconds to wait before giving up + poll_interval: seconds between polls + + Returns: + Reply text if found, None otherwise. + """ + # Fetch PR author login — deterministic, code-enforced filter + pr_author = get_pr_author(pr, repo) + if not pr_author: + print("Could not determine PR author. Proceeding without review.", file=sys.stderr) + print("NO_REPLY") + return None + + print(f"Waiting up to {timeout}s for reply from @{pr_author} on PR #{pr}...", flush=True) + + deadline = time.time() + timeout + seen_ids = set() + + while time.time() < deadline: + # Fetch comments created after since_timestamp + comments = gh_api( + f"repos/{repo}/issues/{pr}/comments?since={since_timestamp}&per_page=50" + ) + + if comments is None: + remaining = int(deadline - time.time()) + if remaining > 0: + print(f"API error, retrying in {poll_interval}s ({remaining}s remaining)...", + file=sys.stderr, flush=True) + time.sleep(min(poll_interval, max(1, remaining))) + continue + + for comment in comments: + comment_id = comment.get("id") + if comment_id in seen_ids: + continue + seen_ids.add(comment_id) + + # Deterministic author filter — code-enforced, not instruction-based + commenter = comment.get("user", {}).get("login", "") + if commenter != pr_author: + continue + + body = comment.get("body", "").strip() + + # If magic prefix is used, strip it; otherwise accept any author comment + if body.startswith(MAGIC_PREFIX): + body = body[len(MAGIC_PREFIX):].strip() + + if body: + print(f"REPLY={body}") + return body + + remaining = int(deadline - time.time()) + if remaining > 0: + print( + f"No reply yet from @{pr_author}, {remaining}s remaining...", + file=sys.stderr, + flush=True, + ) + time.sleep(min(poll_interval, max(1, remaining))) + + print("NO_REPLY") + return None + 
+ +def format_fix_comment(message): + """Wrap the agent's message in a standard comment format.""" + return ( + "### Agent: Fix Applied\n\n" + f"{message}\n\n" + "---\n" + f"*Reply to this comment (or prefix with `{MAGIC_PREFIX}`) to provide feedback. " + "The agent will incorporate your input before pushing, or proceed automatically " + "after the review window expires.*" + ) + + +def cmd_post(args): + """Handle the 'post' subcommand.""" + formatted = format_fix_comment(args.message) + comment_id, created_at = post_comment(args.pr, formatted, args.repo) + return 0 if comment_id else 1 + + +def cmd_wait(args): + """Handle the 'wait' subcommand.""" + wait_for_author_reply( + args.pr, args.since, args.repo, timeout=args.timeout + ) + return 0 + + +def main(): + parser = argparse.ArgumentParser( + description="GitHub PR comment-based review for agentic test iteration" + ) + parser.add_argument( + "--repo", default=DEFAULT_REPO, + help=f"GitHub repo (default: {DEFAULT_REPO})" + ) + subparsers = parser.add_subparsers(dest="command", required=True) + + # 'post' subcommand + post_parser = subparsers.add_parser("post", help="Post a review comment on a PR") + post_parser.add_argument("pr", help="PR number") + post_parser.add_argument("message", help="Comment body (markdown supported)") + + # 'wait' subcommand + wait_parser = subparsers.add_parser( + "wait", help="Wait for author reply on a PR" + ) + wait_parser.add_argument("pr", help="PR number") + wait_parser.add_argument("since", help="ISO 8601 timestamp — only consider comments after this") + wait_parser.add_argument( + "--timeout", type=int, default=600, + help="Seconds to wait for reply (default: 600)" + ) + + args = parser.parse_args() + + if args.command == "post": + return cmd_post(args) + elif args.command == "wait": + return cmd_wait(args) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.claude/commands/fixture-schema-reference.md b/.claude/commands/fixture-schema-reference.md deleted file mode 
120000 index 97edc3ca9..000000000 --- a/.claude/commands/fixture-schema-reference.md +++ /dev/null @@ -1 +0,0 @@ -../../.cursor/commands/fixture-schema-reference.md \ No newline at end of file diff --git a/.claude/commands/generate-incident-fixture.md b/.claude/commands/generate-incident-fixture.md deleted file mode 120000 index 9f30f3733..000000000 --- a/.claude/commands/generate-incident-fixture.md +++ /dev/null @@ -1 +0,0 @@ -../../.cursor/commands/generate-incident-fixture.md \ No newline at end of file diff --git a/.claude/commands/generate-regression-test.md b/.claude/commands/generate-regression-test.md deleted file mode 120000 index 466685582..000000000 --- a/.claude/commands/generate-regression-test.md +++ /dev/null @@ -1 +0,0 @@ -../../.cursor/commands/generate-regression-test.md \ No newline at end of file diff --git a/.claude/commands/refactor-regression-test.md b/.claude/commands/refactor-regression-test.md deleted file mode 120000 index b13ef9d09..000000000 --- a/.claude/commands/refactor-regression-test.md +++ /dev/null @@ -1 +0,0 @@ -../../.cursor/commands/refactor-regression-test.md \ No newline at end of file diff --git a/.claude/commands/validate-incident-fixtures.md b/.claude/commands/validate-incident-fixtures.md deleted file mode 120000 index c41caae98..000000000 --- a/.claude/commands/validate-incident-fixtures.md +++ /dev/null @@ -1 +0,0 @@ -../../.cursor/commands/validate-incident-fixtures.md \ No newline at end of file diff --git a/docs/agentic-development/architecture/security/docker-sandbox-blast-radius.md b/docs/agentic-development/architecture/security/docker-sandbox-blast-radius.md new file mode 100644 index 000000000..e3a990882 --- /dev/null +++ b/docs/agentic-development/architecture/security/docker-sandbox-blast-radius.md @@ -0,0 +1,155 @@ +# Docker Sandbox: Blast Radius Analysis + +## Threat Model + +An AI agent (Claude Code) running inside a container could be manipulated via **prompt injection** — malicious instructions embedded 
in code comments, PR descriptions, issue bodies, or fetched web content. The agent then executes commands believing they are legitimate tasks. + +### What we're protecting against + +| Threat | Vector | Severity | +|---|---|---| +| Credential theft (SSH keys) | Agent reads `~/.ssh/` and exfiltrates via network | **Critical** | +| Credential theft (API keys) | Agent reads env vars or config files, posts to attacker-controlled endpoint | **Critical** | +| Code destruction | Agent force-pushes to main, deletes branches | **High** | +| Code exfiltration | Agent pushes proprietary code to external repo or pastes to web service | **High** | +| Lateral movement | Agent accesses other projects, clusters, or services on the host | **High** | +| Cluster damage | Agent runs destructive `oc` commands (delete namespace, scale to 0) | **Medium** | +| Supply chain | Agent modifies dependencies, CI/CD config to inject malicious code | **Medium** | + +--- + +## What the container exposes vs. isolates + +### Mounted (accessible to agent) + +| Resource | Mount | Mode | Risk | Mitigation | +|---|---|---|---|---| +| Project worktree | `-v ./worktree:/sandbox` | read-write | Agent can modify any file in the project | Use a git worktree clone; main repo is untouched | +| GCP ADC credentials | `-v $ADC_PATH:/tmp/adc.json` | **read-only** | Agent can read refresh token, get access tokens for Vertex AI | Scoped to Vertex AI API only; can't access other GCP resources without IAM roles | +| Kubeconfig | `-v $KUBECONFIG:/tmp/kubeconfig` | **read-only** | Agent can run any `oc` command the token allows | Use a scoped service account (see below) | +| GitHub token | `GITHUB_TOKEN` env var | env | Agent can push, create PRs, potentially delete branches | Use fine-grained PAT with minimal scopes (see below) | + +### NOT mounted (isolated from agent) + +| Resource | Why it matters | +|---|---| +| `~/.ssh/` | SSH private keys — can't be exfiltrated | +| `~/.claude/` | Claude config, history, session 
tokens | +| `~/.config/` | Full GCP config, other service credentials | +| `~/.kube/config` (full) | Only a scoped kubeconfig is mounted, not the full one | +| `~/.gnupg/` | GPG signing keys | +| `~/.gitconfig` (host) | Host git identity; container uses its own | +| `~/.npmrc`, `~/.docker/` | Registry credentials | +| Other project directories | Only the specific worktree is mounted | +| Host network services | Container uses default bridge network | + +--- + +## Credential-specific analysis + +### SSH Keys — ELIMINATED +Not mounted. Agent cannot access them. Even with prompt injection, there's nothing to steal. + +### Claude API Key — ELIMINATED +We use Vertex AI with GCP ADC, not an Anthropic API key. The ADC file is mounted read-only. It contains a refresh token that can only obtain access tokens for APIs your GCP project allows. The agent could theoretically use it to make extra API calls, but: +- It can't access other GCP services without IAM roles +- The token is tied to your identity — all usage is logged in GCP audit logs +- You can revoke it with `gcloud auth application-default revoke` + +### GitHub Token — SCOPED +**This is the highest-risk credential.** Mitigations: +1. Use a **fine-grained Personal Access Token** (not classic) +2. Scope it to **this repository only** +3. Grant minimal permissions: + - `contents: write` — needed for push (unfortunately also allows branch deletion) + - `pull_requests: write` — needed for creating PRs + - `metadata: read` — required baseline +4. 
Do NOT grant: `admin`, `actions`, `secrets`, `environments`, `pages`
+
+**Residual risk:** With `contents: write`, the agent CAN:
+- Force-push to branches (including main if not protected)
+- Delete branches
+- Push malicious commits
+
+**Mitigations for residual GitHub risk:**
+- Enable branch protection rules on `main` (require PR, no force push)
+- Use `--dangerously-skip-permissions` but configure CLAUDE.md to restrict destructive git operations
+- Monitor: set up GitHub webhooks or audit log alerts for force-push/branch-delete events
+
+### OpenShift Token — SCOPED
+Use a **service account** with limited RBAC instead of `kubeadmin`:
+```bash
+# Create a scoped service account on the host
+oc create serviceaccount claude-agent -n <namespace>
+oc adm policy add-role-to-user view system:serviceaccount:<namespace>:claude-agent -n <namespace>
+# Add edit only if the agent needs to modify resources:
+# oc adm policy add-role-to-user edit system:serviceaccount:<namespace>:claude-agent -n <namespace>
+```
+
+This limits the agent to a single namespace with view (or edit) permissions only. It can't delete namespaces, access secrets in other namespaces, or escalate privileges.
+
+For ephemeral test clusters (like your CI clusters), using `kubeadmin` is acceptable since the cluster is destroyed after use.
+
+---
+
+## Network exposure
+
+The container has **full outbound network access** (no proxy).
This means:
+
+| Can do | Risk level | Mitigation |
+|---|---|---|
+| Call Vertex AI API | Expected | None needed |
+| Push to GitHub | Expected | Scoped PAT |
+| Connect to OpenShift cluster | Expected | Scoped kubeconfig |
+| Reach any internet host | **Medium** — could exfiltrate code | Docker network policies (optional) |
+| Reach host services (localhost) | **Low** — default bridge doesn't route to host | Docker default behavior |
+
+**Optional hardening:** Use Docker network restrictions to limit outbound to specific hosts:
+```bash
+# Create a network with no internet access
+docker network create --internal sandbox-net
+# Then selectively allow specific hosts via iptables or a proxy
+```
+
+This adds complexity. For most use cases, the credential scoping + filesystem isolation is sufficient.
+
+---
+
+## Worst-case scenarios
+
+### Scenario 1: Prompt injection via malicious code comment
+Agent reads a file containing a hidden injected instruction (e.g., an HTML comment directing the agent to read credentials and send them to an attacker-controlled server)
+- **With this setup:** Agent could exfiltrate the ADC refresh token. Impact: attacker gets time-limited GCP access.
+- **Mitigation:** ADC token is scoped, usage is logged, revocable. Rotate after incident.
+
+### Scenario 2: Agent deletes branches
+Injected prompt causes `git push origin --delete important-branch`
+- **With this setup:** Could happen if the PAT has `contents: write`.
+- **Mitigation:** Branch protection rules. GitHub keeps unreachable commits for roughly 90 days before garbage collection, so deleted branches are recoverable.
+
+### Scenario 3: Agent pushes malicious code to main
+- **With this setup:** Blocked by branch protection (require PR + approval).
+- **Residual risk:** Agent could create a PR with malicious code that looks legitimate.
+
+### Scenario 4: Agent destroys OpenShift resources
+`oc delete namespace production`
+- **With this setup:** Blocked if using scoped service account. Even with `kubeadmin` on ephemeral CI clusters, the blast radius is limited to a throwaway cluster.
+
+---
+
+## Summary
+
+| Resource | Exposure | Acceptable?
| +|---|---|---| +| SSH keys | None | Yes | +| Claude/Anthropic API key | None | Yes | +| GCP ADC (refresh token) | Read-only, scoped | Yes (monitor audit logs) | +| GitHub | Scoped PAT, repo-only | Yes (with branch protection) | +| OpenShift | Scoped SA or ephemeral kubeadmin | Yes | +| Host filesystem | Only worktree | Yes | +| Network | Full outbound | Acceptable (optional hardening available) | + +The main residual risks are: +1. **GitHub branch deletion** — mitigated by branch protection + recoverability +2. **Code exfiltration via network** — mitigated by the code being in a private repo anyway (attacker already needs GitHub access to inject prompts) +3. **ADC token theft** — mitigated by scoping, audit logging, and revocability diff --git a/docs/agentic-development/architecture/test-iteration-system.md b/docs/agentic-development/architecture/test-iteration-system.md new file mode 100644 index 000000000..99b16fd73 --- /dev/null +++ b/docs/agentic-development/architecture/test-iteration-system.md @@ -0,0 +1,258 @@ +# Agentic Test Iteration Architecture + +Autonomous multi-agent system for iterating on Cypress test robustness, with visual feedback (screenshots + videos), CI result ingestion, and flakiness detection. 
+
+## Goals
+
+| Phase | Objective |
+|-------|-----------|
+| **Phase 1** (current) | Make incident detection tests robust — fix selectors, timing, fixtures, page object gaps |
+| **Phase 2** (future) | Refactor frontend code using tests as a behavioral contract / safety net |
+
+## Architecture Overview
+
+```
+User: /cypress:test-iteration:iterate-incident-tests target=regression max-iterations=3
+
+Coordinator (main Claude Code session)
+  |
+  |-- [CI Analysis] /cypress:test-iteration:analyze-ci-results (optional first step)
+  |     Fetches CI artifacts, classifies infra vs test/code failures
+  |     Correlates failures with git commits for context
+  |     If all INFRA -> report to user and STOP
+  |
+  |-- Create branch: test/incident-robustness-<date>
+  |
+  |-- [Runner] Cypress headless via Bash (inline, not separate terminal)
+  |     Sources export-env.sh, produces mochawesome JSON + screenshots + videos
+  |
+  |-- [Parser] Extract failures from mochawesome JSON reports
+  |     Per failure: test name, error message, stack trace, screenshot path, video path
+  |
+  |-- For each failure (parallelizable):
+  |    |
+  |    |-- [Diagnosis Agent] (Explore-type sub-agent)
+  |    |     Reads: screenshot (multimodal) + error + test code + fixture + page object
+  |    |     Returns: root cause classification + recommended fix
+  |    |
+  |    |-- [Fix Agent] (general-purpose sub-agent)
+  |    |     Makes targeted edits based on diagnosis
+  |    |     Returns: diff summary
+  |    |
+  |    |-- [Validation] Re-run the specific failing test
+  |          Pass -> stage fix
+  |          Fail -> re-diagnose (max 2 retries per test)
+  |
+  |-- Commit batch of related fixes
+  |-- Repeat if failures remain (up to max-iterations)
+  |-- [Flakiness Probe] Run full suite 3x even if green
+  |-- Report final state to user
+```
+
+## Agent Roles
+
+### 1. Coordinator (main session)
+
+Owns the iteration loop, branch management, and commit strategy.
+ +Responsibilities: +- Create and manage the working branch +- Run Cypress tests inline via Bash +- Parse mochawesome JSON reports +- Dispatch Diagnosis and Fix agents +- Track cumulative pass/fail state across iterations +- Commit fixes in batches (threshold: **5 commits** before notifying user) +- Run flakiness probes (multiple runs even when green) +- Decide when to stop: all green + flakiness probe passed, max iterations, or needs human input + +### 2. Diagnosis Agent (Explore-type sub-agent) + +Input per failure: +- Error message and stack trace from mochawesome JSON +- Screenshot path (read with multimodal Read tool) +- Video path (reference for user, not directly parseable by agent) +- Test file path + relevant line numbers +- Associated fixture YAML +- Page object methods used + +Output — one of these classifications: + +| Classification | Description | Action | +|---------------|-------------|--------| +| `TEST_BUG` | Wrong selector, incorrect assertion, timing/race condition | Auto-fix | +| `FIXTURE_ISSUE` | Missing data, wrong structure, edge case not covered | Auto-fix | +| `PAGE_OBJECT_GAP` | Missing method, stale selector, outdated DOM reference | Auto-fix | +| `MOCK_ISSUE` | Intercept not matching, response shape wrong | Auto-fix | +| `REAL_REGRESSION` | Actual UI/code bug — not a test problem | **STOP and report to user** | +| `INFRA_ISSUE` | Cluster down, cert expired, operator not installed | **STOP and report to user** | + +The agent should **read the screenshot first** before looking at code — visual state often reveals the root cause faster than stack traces. + +### 3. 
Fix Agent (general-purpose sub-agent)
+
+Input:
+- Diagnosis classification and details
+- Specific file references and what to change
+
+Scope — may only edit:
+- `cypress/e2e/incidents/**/*.cy.ts` (test files)
+- `cypress/fixtures/incident-scenarios/*.yaml` (fixtures)
+- `cypress/views/incidents-page.ts` (page object)
+- `cypress/support/incidents_prometheus_query_mocks/**` (mock layer)
+
+Must NOT edit:
+- Source code (`src/`) — that's Phase 2
+- Non-incident test files
+- Cypress config or support infrastructure
+
+### 4. Validation Agent
+
+Re-runs the specific failing test after a fix is applied:
+```bash
+source cypress/export-env.sh && npx cypress run --env grep="<test name>" --spec "<spec path>"
+```
+
+Reports pass/fail. If still failing, feeds back to Diagnosis Agent (max 2 retries per test).
+
+## Flakiness Detection
+
+Even if the first run is all green, the coordinator runs a **flakiness probe**:
+
+1. Run the full incident test suite 3 times consecutively
+2. Track per-test results across runs
+3. Flag any test that fails in any run as `FLAKY`
+4. For flaky tests: attempt to diagnose the timing/race condition and fix
+5. Report flakiness statistics: `test_name: 2/3 passed` etc.
+
+This catches intermittent failures that a single run would miss.
+
+## CI Result Ingestion
+
+CI analysis is handled by the dedicated `/cypress:test-iteration:analyze-ci-results` skill (`.claude/commands/cypress:test-iteration:analyze-ci-results.md`).
+
+The skill fetches artifacts from OpenShift CI (Prow) runs on GCS, classifies failures as infrastructure vs test/code issues, reads failure screenshots with multimodal vision, and correlates failures with the git commits that triggered them.
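The infra-vs-test/code split can be sketched as a first-pass matcher over error messages. A minimal sketch — the labels come from the classification tables in this document, but the regex patterns are illustrative assumptions (the real skill also weighs screenshots and commit diffs, not just error text):

```python
import re

# Illustrative patterns only — not the skill's actual rules.
INFRA_PATTERNS = {
    "INFRA_CLUSTER": r"(connection refused|cluster.*unreachable|ETIMEDOUT)",
    "INFRA_AUTH": r"(401|403 Forbidden|token.*expired)",
    "INFRA_OPERATOR": r"operator.*not (installed|ready)",
}
TEST_PATTERNS = {
    "TEST_BUG": r"(AssertionError|expected .* to (exist|equal)|Timed out retrying)",
    "MOCK_ISSUE": r"(cy\.intercept|no request ever occurred)",
}

def classify_failure(error_message: str) -> str:
    """Return a coarse failure class from the error message alone.

    Infra patterns are checked first so cluster problems are not
    misread as test bugs; anything unmatched defaults to TEST_BUG.
    """
    for label, pattern in {**INFRA_PATTERNS, **TEST_PATTERNS}.items():
        if re.search(pattern, error_message, re.IGNORECASE):
            return label
    return "TEST_BUG"
```

Defaulting unmatched failures to `TEST_BUG` keeps them in the diagnosis loop instead of silently dropping them.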
+ +### Key Capabilities + +- **URL normalization**: Accepts gcsweb or Prow UI URLs at any level of the artifact tree +- **Structured metadata**: Extracts PR number, author, branch, commit SHAs from `started.json` / `finished.json` / `prowjob.json` +- **Build log parsing**: Parses Cypress console output from `build-log.txt` for per-spec pass/fail/skip counts and error details +- **Visual diagnosis**: Fetches and reads failure screenshots (multimodal) to understand UI state at failure time +- **Failure classification**: Categorizes each failure as `INFRA_*` (cluster, operator, plugin, auth, CI) or test/code (`TEST_BUG`, `FIXTURE_ISSUE`, `PAGE_OBJECT_GAP`, `MOCK_ISSUE`, `CODE_REGRESSION`) +- **Commit correlation**: Maps failures to specific file changes in the PR using `git diff {base}..{pr_head}` + +### Integration with Orchestrator + +The orchestrator uses `/cypress:test-iteration:analyze-ci-results` as an optional first step: + +1. If all failures are `INFRA_*` -> report to user and STOP (no test changes will help) +2. If mixed -> report infra issues, proceed with test/code fixes only +3. If all test/code -> proceed with full iteration loop +4. Commit correlation tells the orchestrator whether to fix tests or investigate source changes +5. 
CI screenshots give the Diagnosis Agent a head start before local reproduction + +### Usage + +``` +/cypress:test-iteration:analyze-ci-results ci-url=https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/.../{RUN_ID}/ +/cypress:test-iteration:analyze-ci-results ci-url=https://prow.ci.openshift.org/view/gs/.../{RUN_ID} focus=regression +``` + +## Commit Strategy + +- **Branch naming**: `test/incident-robustness-YYYY-MM-DD` +- **Commit granularity**: Group related fixes (e.g., "fix 3 selector issues in filtering tests") +- **Review threshold**: Notify user after **5 commits** for review +- **Never force-push**; always additive commits +- User merges when ready, or continues iteration + +## Test Execution (Inline) + +Tests run inline via Bash, not in a separate terminal: + +```bash +cd web && source cypress/export-env.sh && \ + npx cypress run \ + --spec "cypress/e2e/incidents/regression/**/*.cy.ts" \ + --env grepTags="@incidents --@e2e-real --@flaky" \ + --reporter ./node_modules/cypress-multi-reporters \ + --reporter-options configFile=reporter-config.json +``` + +Results are collected from: +- **Exit code**: 0 = all passed, non-zero = failures +- **Mochawesome JSON**: `screenshots/cypress_report_*.json` — per-test details +- **Screenshots**: `cypress/screenshots/{spec}/` — failure screenshots +- **Videos**: `cypress/videos/{spec}.mp4` — kept on failure + +### Report Parsing + +Mochawesome JSON structure (per report file): +```json +{ + "stats": { "passes": N, "failures": N, "skipped": N }, + "results": [{ + "suites": [{ + "title": "Suite Name", + "tests": [{ + "title": "test description", + "fullTitle": "Suite -- test description", + "state": "failed|passed|skipped", + "err": { + "message": "error description", + "estack": "full stack trace" + } + }] + }] + }] +} +``` + +Use `npx mochawesome-merge screenshots/cypress_report_*.json > merged-report.json` to combine per-spec reports. 
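A small walker over that structure is enough to hand the Diagnosis Agent a per-failure work list. A minimal sketch, assuming the merged-report shape shown above (mochawesome suites can nest, which the recursive walk covers):

```python
import json

def extract_failures(report_path: str) -> list[dict]:
    """Collect failed tests from a (merged) mochawesome report."""
    with open(report_path) as f:
        report = json.load(f)

    failures = []

    def walk(suite):
        # Record failed tests at this level...
        for test in suite.get("tests", []):
            if test.get("state") == "failed":
                failures.append({
                    "title": test["fullTitle"],
                    "message": test.get("err", {}).get("message", ""),
                    "stack": test.get("err", {}).get("estack", ""),
                })
        # ...then recurse into nested suites.
        for child in suite.get("suites", []):
            walk(child)

    for result in report.get("results", []):
        for suite in result.get("suites", []):
            walk(suite)
    return failures
```

Each entry pairs naturally with the screenshot path for that spec, giving the Diagnosis Agent its full input in one pass.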
+ +## Skills + +| Skill | Purpose | Invoked by | +|-------|---------|------------| +| `/cypress:test-iteration:iterate-incident-tests` | Main orchestrator — local iteration loop, dispatches agents, manages commits | User | +| `/cypress:test-iteration:iterate-ci-flaky` | CI-based iteration — push fixes, trigger Prow jobs, wait, analyze, repeat | User | +| `/cypress:test-iteration:diagnose-test-failure` | Classifies a single test failure using screenshots + code analysis | Orchestrator (as sub-agent prompt) | +| `/cypress:test-iteration:analyze-ci-results` | Fetches and analyzes OpenShift CI artifacts, classifies infra vs test/code | User or orchestrator | + +Skills are defined in `.claude/commands/` and can be invoked as slash commands. + +## Existing Infrastructure Leveraged + +| Asset | How the agent uses it | +|-------|----------------------| +| mochawesome JSON reporter | Structured test results parsing | +| `@cypress/grep` | Run individual tests by name or tag | +| `export-env.sh` | Source env vars for inline execution | +| YAML fixture system | Edit fixtures to fix data issues | +| Page object (`incidents-page.ts`) | Fix selectors and add missing methods | +| Mock layer (`incidents_prometheus_query_mocks/`) | Fix intercept patterns | +| `/cypress:test-development:generate-incident-fixture` skill | Generate new fixtures when needed | +| `/cypress:test-development:validate-incident-fixtures` skill | Validate fixture edits | + +## Phase 2: Frontend Refactoring (Future) + +### Concept + +Tests become the behavioral contract. The agent refactors frontend code while using the test suite as a safety net. 
+ +### Additional Agent Roles + +| Agent | Role | +|-------|------| +| **Refactor Planner** | Analyzes frontend code, proposes refactoring steps | +| **Refactor Agent** | Makes code changes (replaces Fix Agent) | +| **Contract Validator** | Runs tests, classifies failures as regression vs test-coupling | +| **Test Adapter** | Updates tests that assert implementation details instead of behavior | + +### Key Principle + +If a test breaks due to refactoring but behavior is preserved, the test needs updating — it was too coupled to implementation. Phase 1 (robustness) reduces this coupling, making Phase 2 more effective. + +### Additional Classification + +The Diagnosis Agent gains `TEST_TOO_COUPLED` — the test asserts implementation details (specific DOM structure, internal state) rather than observable behavior. The Test Adapter agent rewrites it to be implementation-agnostic. diff --git a/docs/agentic-development/roadmap/openshell/sandbox-bun-crash-report.md b/docs/agentic-development/roadmap/openshell/sandbox-bun-crash-report.md new file mode 100644 index 000000000..b6ee5801b --- /dev/null +++ b/docs/agentic-development/roadmap/openshell/sandbox-bun-crash-report.md @@ -0,0 +1,90 @@ +# Sandbox Issue Report: Bun Segfault in Openshell Sandbox + +> **Deprecated**: This report documents a Bun runtime crash specific to the openshell-based sandbox approach, which has been abandoned. The production sandbox uses **Docker** instead — see [docs/agentic-development/setup/docker-sandbox-guide.md](../../setup/docker-sandbox-guide.md). This report is preserved for historical reference and in case the openshell approach is revisited. + +**Date:** 2026-04-14 +**Reporter:** David Rajnoha +**Environment:** Openshell 0.0.19, Linux Kernel 6.18.13, x64 (sse42 popcnt avx avx2) + +## Summary + +Claude Code fails to start inside an openshell sandbox due to a Bun runtime segfault. Both the base image's bundled Bun (1.3.11) and the host's version (1.3.13) crash with the same error. 
The issue appears to be an incompatibility between Bun's runtime and the sandbox's security restrictions (seccomp/landlock). + +## Steps to Reproduce + +1. Create a sandbox: + ```bash + openshell sandbox create --name my-project --provider gcp-adc --provider my-github --upload ".:/sandbox" --policy ./sandbox-policy.yaml + ``` + +2. Connect and run Claude: + ```bash + openshell sandbox connect my-project + cd /sandbox + claude + ``` + +## Observed Behavior + +``` +Bun v1.3.11 (0d72d5a9) Linux x64 (baseline) +Linux Kernel v6.18.13 | glibc v2.39 +CPU: sse42 popcnt avx avx2 +Args: "claude" +Features: jsc +Elapsed: 2ms | User: 0ms | Sys: 4ms +RSS: 33.56MB | Peak: 9.54MB | Commit: 33.56MB | Faults: 1 + +panic(main thread): Segmentation fault at address 0xBBADBEEF +oh no: Bun has crashed. This indicates a bug in Bun, not your code. +Illegal instruction (core dumped) +``` + +The crash report link: https://bun.report/1.3.11/B_10d72d5aAggggC+ypRktvoBq/5luGko7luGq92luGktvoB4qyxkFktvoBkk27jFktvoBqhqtvEktvoBi2ptvE02rm6Cozxl6Cy8wK0oxK6ivl6CA2AjxgpqkC + +## What Was Tried + +| Attempt | Result | +|---|---| +| Run `claude` from base image (Bun 1.3.11) | Segfault at 0xBBADBEEF | +| Pull newer base image and recreate sandbox | Same image/version, same crash | +| Upload host Claude binary (Bun 1.3.13) to sandbox | Same segfault | +| `npm install -g @anthropic-ai/claude-code@latest` | EACCES — sandbox user can't write to `/usr/lib/node_modules/` | +| `curl -fsSL https://bun.sh/install \| bash` | `/dev/null` permission denied, `unzip` not available | + +## Root Cause Analysis + +- The `0xBBADBEEF` address is a sentinel value, suggesting Bun deliberately crashes when it detects an unsupported or restricted environment (likely seccomp filters or landlock restrictions blocking syscalls Bun requires). +- This is NOT a CPU compatibility issue — the same binary runs fine on the host with the same CPU. +- This is NOT a Bun version issue — both 1.3.11 and 1.3.13 exhibit the same behavior. 
+- The sandbox security layer (seccomp/landlock/process restrictions) cannot be modified at runtime — only `network_policies` support hot-reload. + +## Potential Solutions + +1. **Create a claude provider and use `-- claude` flag** when creating the sandbox. This may configure the sandbox environment specifically for Claude (e.g., relaxed seccomp profile for Bun). This was not attempted because no claude provider was configured. + +2. **Install Claude Code via npm to a user-writable directory** (uses Node.js instead of Bun): + ```bash + npm install --prefix ~/claude-local @anthropic-ai/claude-code@latest + ~/claude-local/node_modules/.bin/claude + ``` + This requires `registry.npmjs.org` in the network policy (already configured). + +3. **Use `npx`** to run without installing: + ```bash + npx @anthropic-ai/claude-code@latest + ``` + +4. **Update the base image** (`ghcr.io/nvidia/openshell-community/sandboxes/base:latest`) to include a Bun version compatible with the sandbox security profile, or switch Claude Code's runtime to Node.js in the image. + +## Current Sandbox Configuration + +- **Sandbox name:** my-project +- **Base image:** ghcr.io/nvidia/openshell-community/sandboxes/base:latest +- **Providers:** gcp-adc (generic), my-github (github) +- **Network policy:** anthropic_api, google_oauth, github, npm_registry, bun_install +- **Process:** runs as `sandbox` user (non-root) + +## Recommended Next Step + +Configure a claude provider (`openshell provider create --type anthropic ...`) and recreate the sandbox with `-- claude` to let openshell handle the Claude runtime environment properly. 
diff --git a/docs/agentic-development/roadmap/openshell/sandbox-setup-guide.md b/docs/agentic-development/roadmap/openshell/sandbox-setup-guide.md new file mode 100644 index 000000000..efd0b640f --- /dev/null +++ b/docs/agentic-development/roadmap/openshell/sandbox-setup-guide.md @@ -0,0 +1,260 @@ +# Openshell Sandbox Setup Guide + +> **Deprecated**: This guide documents the openshell-based sandbox approach, which was abandoned due to fundamental incompatibilities — Bun runtime crashes under seccomp/landlock restrictions and the TLS-terminating proxy breaks OpenShift API connections. See [sandbox-bun-crash-report.md](sandbox-bun-crash-report.md) for details. The production sandbox uses **Docker** instead — see [docs/agentic-development/setup/docker-sandbox-guide.md](../../setup/docker-sandbox-guide.md). + +## Prerequisites + +- `openshell` CLI installed and on PATH +- Docker running +- Gateway cluster running (`openshell gateway start` + verify with `openshell status`) +- Providers already configured (`openshell provider list` to verify — need `gcp-adc` and `my-github`) + +## Step 1: Create the policy file + +Create `sandbox-policy.yaml` in your project root: + +```yaml +version: 1 + +network_policies: + google_oauth: + endpoints: + - host: oauth2.googleapis.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + vertex_ai: + endpoints: + - host: us-east5-aiplatform.googleapis.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + github: + endpoints: + - host: github.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/git } + + npm_registry: + endpoints: + - host: registry.npmjs.org + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/node } + - { path: /usr/bin/bash } +``` + +> **IMPORTANT:** Write this file using `cat > sandbox-policy.yaml << 'EOF' ... EOF` or a +> text editor. 
Do **NOT** copy-paste from rendered markdown — it can introduce invisible +> characters that break YAML parsing. + +Notes on the policy: +- Binaries must match the actual paths inside the sandbox (`/usr/bin/node`, `/usr/bin/bash`), not host paths +- The base image bundles Bun, but it segfaults inside the sandbox due to seccomp/landlock restrictions — Claude must be installed via npm and runs on Node.js instead +- Add more entries as needed based on denied requests in logs (see Step 5) + +## Step 2: Create the sandbox + +Run from the project root directory: + +```bash +openshell sandbox create --name my-project --provider gcp-adc --provider my-github --upload ".:/sandbox" --policy ./sandbox-policy.yaml +``` + +Notes: +- Do **NOT** add `-- claude` unless you have a claude provider set up (Vertex AI GCP auth uses the `gcp-adc` generic provider instead) +- If zsh prompts to correct anything, press `n` + +## Step 3: Upload GCP credentials + +The `gcp-adc` provider injects credentials via openshell's proxy placeholder system, which doesn't work for ADC file-based auth (the proxy can intercept HTTP headers but not local file reads). Upload the ADC file directly: + +```bash +openshell sandbox upload my-project \ + "$HOME/.config/gcloud/application_default_credentials.json" \ + /sandbox/adc.json +``` + +> **Note:** `openshell sandbox upload` treats the destination as a directory and places the source file inside it. So the actual file path inside the sandbox will be `/sandbox/adc.json/application_default_credentials.json`. + +> **Security tradeoff:** The ADC file contains a refresh token (not raw credentials). The agent inside the sandbox can read it. This is a known limitation — see the Troubleshooting section for alternatives. 
+ +## Step 4: Connect and install Claude via npm + +```bash +openshell sandbox connect my-project +``` + +Inside the sandbox, install Claude Code via npm (the base image's Bun runtime crashes under sandbox restrictions): + +```bash +npm install --prefix ~/claude-local @anthropic-ai/claude-code@latest +``` + +## Step 5: Start Claude with Vertex AI + +Set the required environment variables and launch: + +```bash +export GOOGLE_APPLICATION_CREDENTIALS=/sandbox/adc.json/application_default_credentials.json +export CLAUDE_CODE_USE_VERTEX=1 +export ANTHROPIC_VERTEX_PROJECT_ID=itpc-gcp-hcm-pe-eng-claude +export CLOUD_ML_REGION=us-east5 +~/claude-local/node_modules/.bin/claude +``` + +> **Tip:** Add these exports to `~/.bashrc` inside the sandbox so they persist across sessions. + +## Step 6: Monitor and iterate on policy + +In a separate terminal on the host: + +```bash +openshell logs my-project --tail --source sandbox +``` + +When you see denied requests, update the policy: + +```bash +# Edit sandbox-policy.yaml to add the blocked host as a new entry... + +# Push the update (hot-reloaded, no restart needed) +openshell policy set my-project --policy sandbox-policy.yaml --wait + +# Verify +openshell policy list my-project +``` + +Only `network_policies` can be updated at runtime. Changes to `filesystem_policy`, `landlock`, or `process` require recreating the sandbox. + +## Step 7: Upload/download files + +```bash +# Upload a file to the running sandbox +openshell sandbox upload my-project ./local-file /sandbox/dest-path + +# Download from sandbox +openshell sandbox download my-project /sandbox/some-file ./local-dest +``` + +> **Note:** Upload always creates a directory at the destination path and places the file inside. Plan your paths accordingly. 
+ +## Step 8: Clean up + +```bash +openshell sandbox delete my-project +``` + +--- + +## Troubleshooting + +### Bun segfault / "Illegal instruction (core dumped)" + +The base image ships Bun 1.3.11 which crashes with `Segmentation fault at address 0xBBADBEEF` inside the sandbox due to seccomp/landlock restrictions. This affects all Bun versions tested (1.3.11, 1.3.13). + +**Workaround:** Install Claude Code via npm so it runs on Node.js instead: + +```bash +npm install --prefix ~/claude-local @anthropic-ai/claude-code@latest +~/claude-local/node_modules/.bin/claude +``` + +This requires `registry.npmjs.org` in the network policy with `/usr/bin/node` and `/usr/bin/bash` as allowed binaries. + +### npm install permission denied (EACCES) + +The sandbox runs as the `sandbox` user (non-root). Global npm install (`npm install -g`) fails because it can't write to `/usr/lib/node_modules/`. + +**Fix:** Install to a user-writable prefix: + +```bash +npm install --prefix ~/claude-local @anthropic-ai/claude-code@latest +``` + +### npm 403 Forbidden + +The network policy allows the endpoint but the binary path is wrong. Check the logs: + +```bash +openshell logs my-project --source sandbox +``` + +Look for `action=deny` lines — they show the actual binary path and ancestors. The policy must list the exact binary paths used inside the sandbox (e.g., `/usr/bin/node` not `/usr/local/bin/node`). + +### ADC credential alternatives + +**Option A (current):** Upload the ADC file directly. The agent can read the refresh token. This is the pragmatic choice — the token can only get short-lived access tokens scoped to your GCP project. 
+ +**Option B (more secure, not yet tested):** Generate a short-lived access token on the host and inject it as a provider: + +```bash +TOKEN=$(gcloud auth application-default print-access-token) +openshell provider create --name vertex-token --type generic \ + --credential ANTHROPIC_VERTEX_AUTH_TOKEN="$TOKEN" +``` + +Note: Claude Code does not currently support a direct token env var for Vertex AI, so this requires additional work (e.g., an auth proxy with `CLAUDE_CODE_SKIP_VERTEX_AUTH=1`). + +### ADC token expired + +Re-authenticate on the host, then re-upload: + +```bash +gcloud auth application-default login +openshell sandbox upload my-project \ + "$HOME/.config/gcloud/application_default_credentials.json" \ + /sandbox/adc.json +``` + +### DependenciesNotReady / pod stuck Pending + +The sandbox image can't be pulled. Rebuild the cluster: + +```bash +openshell sandbox delete my-project +openshell gateway destroy +openshell gateway start +``` + +Verify the image is available on your host: + +```bash +docker pull ghcr.io/nvidia/openshell-community/sandboxes/base:latest +``` + +If that fails, check DNS/VPN/firewall. + +### Policy parse errors + +- `invalid type: string "1\nnetwork_policies"` — the YAML file has invisible characters or encoding issues. Recreate it with `cat > sandbox-policy.yaml << 'EOF' ... EOF` +- `expected struct NetworkPolicyRuleDef` — wrong YAML structure. `network_policies` must be a map of named policies, each with `endpoints` and `binaries` arrays +- Always validate structure: `version` is a top-level integer, `network_policies` is a sibling map at the same level + +### Check cluster health + +```bash +openshell status # Cluster reachable? 
+openshell doctor logs   # Container-level logs
+openshell sandbox list   # Existing sandboxes
+openshell provider list  # Registered providers
+```
+
+### Sandbox won't start after failed attempt
+
+Delete the old one first:
+
+```bash
+openshell sandbox delete <name>
+```
+
+Find the name with `openshell sandbox list` — it may have an auto-generated name like `smitten-mayfly` instead of `my-project`.
diff --git a/docs/agentic-development/roadmap/test-iteration-ideas.md b/docs/agentic-development/roadmap/test-iteration-ideas.md
new file mode 100644
index 000000000..66840b2f6
--- /dev/null
+++ b/docs/agentic-development/roadmap/test-iteration-ideas.md
@@ -0,0 +1,464 @@
+# Agentic Test Iteration — Ideas & Future Improvements
+
+Ideas and potential enhancements for the agentic test iteration system. These are not committed plans — they're options to explore when the core workflow is stable.
+
+## Authentication: GitHub App for CI Triggering
+
+**Problem**: The CI iteration skill (`/cypress:test-iteration:iterate-ci-flaky`) needs to comment `/test` on upstream PRs to trigger Prow. Current options (PATs, OAuth) are tied to a personal GitHub account.
+
+**Idea**: Create a dedicated GitHub App installed on `openshift/monitoring-plugin`.
+
+### How it would work
+
+1. Create a GitHub App with minimal permissions: `Issues: Write`, `Pull requests: Read`, `Checks: Read`
+2. An org admin approves installation on `openshift/monitoring-plugin`
+3. The app authenticates via a private key (`.pem` file) → short-lived installation tokens (1h expiry, auto-rotated)
+4. 
Comments appear as `my-ci-bot[bot]` instead of a personal user + +### Tradeoffs vs OAuth + +| Aspect | OAuth (`gh auth login --web`) | GitHub App | +|--------|-------------------------------|------------| +| Setup effort | Minimal | Moderate (create app, org admin approval) | +| Tied to a person | Yes | No — bot identity | +| Survives user leaving org | No | Yes | +| Token management | Manual refresh | Automatic (1h expiry from private key) | +| Audit trail | Personal user | Dedicated bot account | +| Team sharing | Each person needs own auth | One app, anyone's agent can use it | + +### When to pursue + +- When multiple team members want to use the CI iteration skill +- When you want a persistent bot identity for test automation comments +- When you want to remove personal account dependency + +### Blocker + +Requires an `openshift` org admin to approve the app installation. + +--- + +## CI Iteration: Fully Automated Job Triggering + +**Problem**: Currently the CI loop requires either a `/test` comment (needs upstream write access) or a `git push` (triggers automatically). The push path works but creates noise commits. + +**Ideas**: +- **Empty commits**: `git commit --allow-empty -m "retrigger CI"` — triggers Prow without code changes, but pollutes history +- **Prow API**: Prow may have a direct API for retriggering jobs without GitHub comments — investigate `https://prow.ci.openshift.org/` endpoints +- **GitHub Actions bridge**: A lightweight GitHub Action on the fork that comments `/test` on the upstream PR when triggered via `workflow_dispatch` + +--- + +## Parallel CI Runs for Flakiness Detection + +**Problem**: Flakiness probing requires N sequential CI runs (~2h each). 3 runs = 6 hours. + +**Idea**: Open N temporary PRs from the same branch, each triggers its own CI run in parallel. Collect all results, then close the temporary PRs. + +**Tradeoff**: Consumes N times the CI resources. May not be acceptable for shared CI infrastructure. 
+ +**Alternative**: Ask if Prow supports multiple runs of the same job on the same PR — some CI systems allow this. + +--- + +## Local Mock Tests + CI Real Tests as Two-Phase Validation + +**Problem**: Local iteration is fast but uses mocked data. CI uses real clusters but is slow (~2h). + +**Idea**: Formalize a two-phase approach: +1. **Phase A** (`/cypress:test-iteration:iterate-incident-tests`): Fast local iteration with mocks — fix all mock-testable issues +2. **Phase B** (`/cypress:test-iteration:iterate-ci-flaky`): Push to CI — catch environment-specific flakiness + +The orchestrator could automatically transition from Phase A to Phase B when local tests are green. + +--- + +## Agent Fork with Deploy Key + +**Problem**: The agent creates unsigned commits on the user's working branch. Push access, GPG signing, and branch management all create friction. + +**Idea**: A dedicated fork (`monitoring-plugin-agent` or similar) with: +- A passwordless deploy key for push access +- No GPG signing requirement +- Agent creates PRs from the fork to the upstream repo +- User reviews and merges — clean separation of human vs agent work + +**Benefits**: +- No unsigned commits in the user's fork +- Agent can push freely without SSH key access to user's account +- Clear audit trail: all agent work comes from the agent fork +- Multiple agents (different team members) can share the same fork + +--- + +## Screenshot Diffing for Visual Regression + +**Problem**: The diagnosis agent reads failure screenshots to understand UI state, but has no reference for "what it should look like." + +**Idea**: Capture baseline screenshots from passing tests and store them. On failure, the agent can compare the failure screenshot against the baseline to identify visual differences. + +**Implementation**: Cypress has plugins for visual regression testing (`cypress-image-snapshot`). The agent could: +1. Generate baselines from a known-good run +2. 
On failure, diff the failure screenshot against baseline +3. Highlight visual changes to speed up diagnosis + +--- + +## Test Stability Ledger + +**Status**: Partially implemented. Ledger file created at `web/cypress/reports/test-stability.md`. Update step added to `/cypress:test-iteration:iterate-incident-tests` (Step 14). Still needs to be wired into `/cypress:test-iteration:iterate-ci-flaky`. + +**Problem**: Flakiness data is ephemeral — it exists in the agent's report from one run and is lost. Next time the agent runs, it has no memory of previous results. + +**Design**: A markdown file with embedded machine-readable JSON, updated by both skills after each run. + +**Location**: `web/cypress/reports/test-stability.md` — committed to the working branch, travels with the fixes. + +**Contents**: +- Human-readable table: per-test pass rate, trend, last failure reason, fix commit +- Run history log: date, type (local/CI), branch, pass/fail counts +- Machine-readable JSON block for programmatic parsing by the agent + +**Agent behavior**: +- Reads the ledger at the start of each run to prioritize — "this test was flaky in last 3 runs, focus here" +- Updates the ledger after each run with new results +- Commits the ledger update alongside fixes + +--- + +## Slack Notifications for Long-Running Loops + +**Status**: Implemented. Slack webhook notifications (Option A) integrated into `/cypress:test-iteration:iterate-ci-flaky`. GitHub PR comment-based review flow implemented as the two-way interaction channel (`review-github.py`). Option B (Slack bot with thread replies) documented but deprioritized due to internal setup complexity. + +### The Problem + +The CI iteration loop (`/cypress:test-iteration:iterate-ci-flaky`) runs for hours — each CI run takes ~2h, and the loop may do 3-5 fix-push-wait cycles. 
During that time: + +- The user has no visibility into what the agent decided to fix or how +- By the time the loop finishes, multiple commits may have been pushed with no chance to course-correct +- A wrong fix in cycle 1 wastes 2+ hours of CI time before the agent discovers it didn't work +- The user may have domain context ("that test is flaky because of animation timing, not the selector") that would save cycles + +The core tension: **autonomy vs oversight**. The agent should run independently, but the user needs the ability to intervene at natural pause points. + +### Natural Pause Points + +The CI loop has built-in pauses where user input is most valuable: + +``` +Push fix ──→ [PAUSE: fix_applied] ──→ CI runs (~2h) ──→ [PAUSE: ci_complete] ──→ Analyze ──→ ... +``` + +1. **After fix, before CI runs** (`fix_applied`): The agent committed a fix and is about to push (or just pushed). This is the highest-value notification — the user can review the approach and say "redo" before a 2-hour CI cycle starts. + +2. **After CI completes** (`ci_complete`): Results are in. The agent is about to diagnose. User might have context about known issues. + +3. **When blocked** (`blocked`): Agent can't continue — needs human decision. + +### Review Window + +For the `fix_applied` event, the agent could optionally **wait before pushing**, giving the user a time window to respond: + +``` +Agent: "I'm about to push this fix. Waiting 10 minutes for feedback before proceeding." + [Shows diff summary in Slack] + +User (within 10 min): "Don't change the selector, the issue is timing. Add a cy.wait(500) instead." + +Agent: Reverts fix, applies user's suggestion, pushes that instead. +``` + +Or if no response within the window, the agent proceeds autonomously. + +Configuration: `review-window=10m` parameter on `/cypress:test-iteration:iterate-ci-flaky`. Set to `0` for fully autonomous (no waiting). 
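The review-window behavior described above reduces to a poll-until-deadline loop. A minimal, channel-agnostic sketch (`fetch_feedback` is a placeholder for however replies are read: Slack thread, PR comment, or a local file):

```python
import time
from typing import Callable, Optional


def wait_for_feedback(fetch_feedback: Callable[[], Optional[str]],
                      window_seconds: float,
                      poll_interval: float = 30.0) -> Optional[str]:
    """Return user feedback arriving within the window, else None.

    window_seconds=0 means fully autonomous: proceed immediately.
    """
    deadline = time.monotonic() + window_seconds
    while time.monotonic() < deadline:
        feedback = fetch_feedback()
        if feedback is not None:
            return feedback  # user intervened; adjust the approach
        # sleep until the next poll, but never past the deadline
        time.sleep(min(poll_interval, max(deadline - time.monotonic(), 0.0)))
    return None  # window expired; proceed autonomously
```

The agent treats a `None` result as approval to push, and any string as feedback to incorporate before proceeding.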
+ +### Notification Content — What Makes Each Message Actionable + +**`fix_applied`** — the most important notification: +``` +:wrench: Agent: Fix Applied + +*What changed:* +• `cypress/views/incidents-page.ts:45` — selector `.severity-filter` → `[data-test="severity-filter"]` +• `cypress/e2e/incidents/regression/01.reg_filtering.cy.ts:78` — added `.should('exist')` guard before click + +*Why:* Screenshot showed the filter dropdown existed but had a different class. The `data-test` attribute is stable across builds. + +*Classification:* PAGE_OBJECT_GAP (confidence: HIGH) + +*Diff:* `git diff HEAD~1` on branch `test/incident-robustness-2026-03-24` + +*Next:* CI will trigger automatically on push. Reply in the agent session to change approach. + +PR #860 | Branch: test/incident-robustness-2026-03-24 +``` + +The key: show **what** changed, **why** the agent chose that fix, and **how confident** it is. This lets the user quickly decide "looks good, let it run" vs "wrong approach, let me intervene." + +**`ci_complete`** — actionable status: +``` +:white_check_mark: Agent: CI Complete — PASSED (run 2/5) + +*Results:* 15/15 tests passed in 1h 47m +*Flakiness probe:* 2 of 5 confirmation runs complete, all green so far + +*Next:* Triggering confirmation run 3. No action needed. + +PR #860 | Branch: test/incident-robustness-2026-03-24 | CI Run +``` + +Or on failure: +``` +:x: Agent: CI Complete — FAILED (iteration 2/3) + +*Results:* 13/15 passed, 2 failed +*Failures:* +• "should filter by severity" — Timed out on `[data-test="severity-chip"]` (same as last run) +• "should display chart bars" — new failure, `Expected 5 bars, found 0` + +*Assessment:* +• severity filter: same fix didn't work, will try different approach +• chart bars: new failure — possibly caused by previous fix (will investigate) + +*Next:* Diagnosing and fixing. Will notify before pushing. 
+ +PR #860 | Branch: test/incident-robustness-2026-03-24 | CI Run +``` + +**`blocked`** — requires user action: +``` +:octagonal_sign: Agent: Blocked — REAL_REGRESSION + +*Test:* "should display incident bars in chart" +*Issue:* Chart component renders empty. Screenshot shows the chart area with no bars, no error, no loading state. +*Commit correlation:* `src/components/incidents/IncidentChart.tsx` was modified in this PR (+45, -12) + +*This is not a test issue* — the chart rendering logic appears broken. Agent cannot fix source code in Phase 1. + +*Action needed:* Investigate the chart component refactor. Agent will stop iterating on this test. + +PR #860 | Branch: test/incident-robustness-2026-03-24 +``` + +### Implementation Options + +**Option A: Slack Incoming Webhook** (recommended starting point) +- Setup: Slack → Apps → Incoming Webhooks → create webhook for your channel. 5 minutes. +- Set `SLACK_WEBHOOK_URL` in `export-env.sh` or `~/.zshrc` +- Agent posts via `curl` in a standalone `notify-slack.py` script +- Messages formatted with Slack Block Kit (sections, context, code blocks) +- Pro: No Slack app, no server, no OAuth. Just a URL. +- Con: One-way — user sees notifications but must respond in the Claude Code session, not in Slack + +**Option B: Slack Bot with thread-based interaction** (no callback server needed) +- Create a Slack App with bot token (`chat:write`, `channels:history`) +- Agent posts messages to a channel, capturing the message `ts` (timestamp/ID) +- Before proceeding at pause points, agent **reads thread replies** via `conversations.replies` API +- If user replied in the Slack thread → agent reads the reply and adjusts +- If no reply within the review window → agent proceeds + +``` +Agent posts: "Fix applied. Reply in this thread to change approach. Proceeding in 10 min." 
+User replies: "Use data-test attributes instead of class selectors" +Agent reads: conversations.replies → sees user feedback → adjusts fix +``` + +- Pro: Two-way interaction without a callback server. User stays in Slack. +- Con: Needs a Slack App (not just a webhook). Polling for replies adds complexity. Bot token needs to be stored securely. + +**Implementation sketch for Option B** (assumes the `slack_sdk` package and the env vars from the Configuration section below): +```python +import os +import time + +from slack_sdk import WebClient + +slack_client = WebClient(token=os.environ["SLACK_BOT_TOKEN"]) +CHANNEL = os.environ["SLACK_CHANNEL_ID"] +BOT_USER_ID = slack_client.auth_test()["user_id"] + +def wait_for_thread_reply(blocks, review_window_seconds): + # Post notification and capture the message timestamp (thread ID) + response = slack_client.chat_postMessage(channel=CHANNEL, blocks=blocks) + message_ts = response["ts"] + # Wait for the review window, polling the thread for replies + deadline = time.time() + review_window_seconds + while time.time() < deadline: + replies = slack_client.conversations_replies(channel=CHANNEL, ts=message_ts) + user_replies = [r for r in replies["messages"] if r.get("user") != BOT_USER_ID] + if user_replies: + return user_replies[-1]["text"] # Return latest user feedback + time.sleep(30) + return None # No feedback, proceed autonomously +``` + +**Option C: Claude Code hooks → Slack bridge** +- Configure a Claude Code hook that fires on `git commit` or specific tool calls +- The hook runs a shell script that posts to Slack +- Pro: Zero changes to the skills — hooks are external +- Con: Less control over notification content and timing. Can't implement review windows. Hooks are local config, not portable. + +**Option D: GitHub PR comments as notification channel** +- Instead of Slack, the agent posts status updates as PR comments +- User replies directly on the PR +- Agent reads PR comments via `gh api` before proceeding +- Pro: No Slack setup at all. Everything stays in GitHub. Natural for code review context. +- Con: Noisier PR history. Not real-time (no push notifications unless GitHub notifications are configured). + +### Recommended Progression + +1. **Start with Option A** — get visibility. User monitors passively, intervenes in Claude Code session when needed. +2. 
**Upgrade to Option B** when the review window pattern proves valuable — adds two-way interaction within Slack. +3. **Option D** is a good alternative if you prefer keeping everything in GitHub — especially for team use where the PR is the natural communication hub. + +### Configuration + +```bash +# Option A: Webhook only (one-way) +export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T.../B.../..." + +# Option B: Bot with thread interaction (two-way) +export SLACK_BOT_TOKEN="xoxb-..." +export SLACK_CHANNEL_ID="C0123456789" +export SLACK_REVIEW_WINDOW="600" # seconds to wait for feedback (0 = no wait) +``` + +### Skill Integration Points + +Where notifications fire in each skill: + +**`/cypress:test-iteration:iterate-ci-flaky`:** +- Step 3: `ci_started` — after `/test` comment or push +- Step 5: `ci_complete` — after CI analysis +- Step 6: `fix_applied` — after committing fix, before push (with optional review window) +- Step 7: `flaky_found` — when flakiness detected in confirmation runs +- Step 8: `iteration_done` — final summary +- Any step: `blocked` — on REAL_REGRESSION, INFRA_ISSUE, auth failure + +**`/cypress:test-iteration:iterate-incident-tests`:** +- Step 10: `fix_applied` — after committing batch (less critical since local runs are fast) +- Step 12: `flaky_found` — during flakiness probe +- Step 13: `iteration_done` — final summary +- Any step: `blocked` — on REAL_REGRESSION + +--- + +## Cloud Execution: Long-Running Autonomous Agent + +**Problem**: The current setup requires a local machine with an active Claude Code CLI session. Long CI polling (~2h per run) causes session timeouts, and the user must keep a terminal open. 
+ +### Option 1: Claude Code Headless Mode (simplest) + +Run Claude Code non-interactively without a TTY: + +```bash +claude --print --dangerously-skip-permissions \ + -p "/cypress:test-iteration:iterate-ci-flaky pr=860 confirm-runs=5" +``` + +- `--print` / `-p`: non-interactive, outputs result and exits +- `--dangerously-skip-permissions`: skips all approval prompts (use only in sandboxed environments) +- Can run in `tmux`, `nohup`, GitHub Actions, or any CI runner +- Uses the same tools, skills, and CLAUDE.md as interactive mode +- Limitation: single-shot execution — runs the prompt and exits + +**Deployment**: `nohup claude --print ... > output.log 2>&1 &` on any machine, or in a GitHub Actions runner. + +### Option 2: Claude Agent SDK (most flexible) + +The Agent SDK (`@anthropic-ai/claude-code`) is a Node.js/TypeScript library that embeds Claude Code as a programmable agent: + +```typescript +import { Claude } from "@anthropic-ai/claude-code"; + +const claude = new Claude({ + dangerouslySkipPermissions: true, +}); + +const result = await claude.message({ + prompt: "/cypress:test-iteration:iterate-ci-flaky pr=860 confirm-runs=5", + workingDirectory: "/path/to/monitoring-plugin", +}); + +// Post result as PR comment +await octokit.issues.createComment({ + owner: "openshift", repo: "monitoring-plugin", + issue_number: 860, body: result.text, +}); +``` + +#### SDK vs CLI comparison + +| Aspect | CLI (`claude`) | Agent SDK | +|--------|---------------|-----------| +| Runtime | Terminal process | Node.js library | +| Lifecycle | Single session, exits | Embed in any long-lived process | +| Event-driven | No | Yes — webhooks, timers, PR events | +| Permissions | Interactive prompts or skip-all | Programmatic control | +| Tools | Built-in (Read, Write, Bash, etc.) | Same built-in + custom tools | +| State | Session-scoped | Persistent (DB, files, etc.) 
| +| Deployment | Local terminal | Anywhere Node.js runs | + +#### Requirements to port current skills + +- Node.js runtime with `@anthropic-ai/claude-code` +- `ANTHROPIC_API_KEY` environment variable +- `gh` CLI authenticated (or GitHub App token for comment access) +- Git + SSH for pushing to fork +- The repo cloned in the agent's working directory +- All skill files (`.claude/commands/`) present in the clone + +#### What stays the same + +- Skills (`.md` files) — the SDK reads them from `.claude/commands/` +- Polling script (`poll-ci-status.py`) — SDK runs Bash the same way +- `/cypress:test-iteration:diagnose-test-failure`, `/cypress:test-iteration:analyze-ci-results` — all work as-is +- File editing, git operations, Cypress execution — identical + +#### What changes + +- No permission prompts — `dangerouslySkipPermissions` in a sandboxed container +- State between runs — persist to file or DB instead of ephemeral session +- Triggering — webhook handler calls the SDK instead of user typing a command +- Error recovery — the wrapping process can catch failures and retry + +### Option 3: GitHub Actions Workflow (cloud, event-driven) + +A GitHub Actions workflow that runs the agent on PR events: + +```yaml +name: Flaky Test Iteration +on: + issue_comment: + types: [created] + +jobs: + iterate: + if: contains(github.event.comment.body, '/run-flaky-iteration') + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - uses: actions/setup-node@v4 + - name: Install Claude Code + run: npm install -g @anthropic-ai/claude-code + - name: Run iteration + env: + ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + GH_TOKEN: ${{ secrets.GH_TOKEN }} + run: | + claude --print --dangerously-skip-permissions \ + -p "/cypress:test-iteration:iterate-ci-flaky pr=${{ github.event.issue.number }} confirm-runs=3" \ + > output.md + - name: Post results + env: + GH_TOKEN: ${{ secrets.GH_TOKEN }} + run: gh pr comment ${{ github.event.issue.number }} --body-file output.md +``` + +**Flow**: +1. 
User comments `/run-flaky-iteration` on a PR +2. GitHub Actions triggers the workflow +3. Claude Code runs in headless mode on the Actions runner +4. Agent executes the full iteration loop (trigger CI, wait, analyze, fix, push) +5. Results posted back as a PR comment + +**Considerations**: +- GitHub Actions runners have a 6h timeout — enough for 2-3 CI runs +- Needs `ANTHROPIC_API_KEY` and `GH_TOKEN` as repository secrets +- Runner needs SSH key for git push (or use `GH_TOKEN` with HTTPS) +- Cost: API tokens consumed + GitHub Actions minutes + +### Recommendation + +1. **Start with headless mode** (`tmux` + `--print`) to validate the flow works without interactive prompts +2. **Move to GitHub Actions** for true cloud execution — event-driven, no local machine needed +3. **Agent SDK** when you want a custom orchestrator with richer state management, error recovery, or Slack integration beyond what the skills provide diff --git a/docs/agentic-development/setup/docker-sandbox-guide.md b/docs/agentic-development/setup/docker-sandbox-guide.md new file mode 100644 index 000000000..37a882f9d --- /dev/null +++ b/docs/agentic-development/setup/docker-sandbox-guide.md @@ -0,0 +1,180 @@ +# Docker Sandbox Setup Guide + +Run Claude Code in an isolated Docker container with filesystem isolation, scoped credentials, and full Cypress test tooling. + +## Prerequisites + +- **Docker** installed and running +- **GCP Application Default Credentials** — for Vertex AI access to Claude + ```bash + gcloud auth application-default login + ``` +- **kubeconfig** — for `oc` commands against an OpenShift cluster (optional) +- **GitHub token** — for pushing code and creating PRs (optional, auto-detected from `gh auth`) + +## Quick Start + +From the project root: + +```bash +./sandbox/run.sh +``` + +This builds the Docker image, creates an isolated git clone of the current branch, mounts credentials read-only, and drops you into a Claude Code session. + +## What Happens Under the Hood + +### 1. 
Image Build + +The `sandbox/Dockerfile` builds a `node:22-bookworm` image with: +- Cypress system dependencies (Xvfb, GTK, NSS, etc.) +- `gh` CLI (GitHub operations) +- `oc` + `kubectl` CLI (OpenShift operations) +- Claude Code installed globally via npm +- A `sandbox` user with UID matching your host user (for file permission compatibility) +- Vertex AI environment pre-configured (project, region, ADC path) +- Claude defaults: Sonnet 4.6, 1M token context + +### 2. Repository Isolation + +The script creates a **shallow clone** of your current branch into a temp directory: + +```bash +git clone --single-branch --branch "$CURRENT_BRANCH" --depth 50 "$PROJECT_ROOT" "$SANDBOX_DIR" +``` + +This means: +- The agent works on a copy, not your working tree +- Your uncommitted changes, stashes, and other branches are not visible +- The remote URL is rewritten to the actual upstream so `git push` works from inside the container +- The temp directory is **preserved after exit** — inspect it to review changes, cherry-pick commits, or investigate what the agent did + +### 3. Credential Mounting + +| Credential | Container Path | Mode | Source | +|---|---|---|---| +| GCP ADC | `/tmp/adc.json` | read-only | `$GOOGLE_APPLICATION_CREDENTIALS` or `~/.config/gcloud/application_default_credentials.json` | +| Kubeconfig | `/tmp/kubeconfig` | read-only | `$KUBECONFIG` or `~/.kube/config` | +| GitHub token | `$GITHUB_TOKEN` env var | env | `$GITHUB_TOKEN` or `gh auth token` | + +Credentials are mounted **read-only** — the agent cannot modify or delete them. Your SSH keys, GPG keys, Docker credentials, and other config files are **not mounted** and are invisible to the agent. + +See [docker-sandbox-blast-radius.md](../architecture/security/docker-sandbox-blast-radius.md) for a full security analysis. + +### 4. 
Container Launch + +The container runs interactively with: +- The cloned worktree mounted at `/sandbox` (read-write) +- All Vertex AI environment variables injected +- Claude Code started with `--dangerously-skip-permissions` (the container itself is the isolation boundary) + +## Configuration + +Override defaults with environment variables before running: + +```bash +# Use a different GCP project +export ANTHROPIC_VERTEX_PROJECT_ID=my-other-project +export CLOUD_ML_REGION=europe-west1 + +# Use a specific kubeconfig +export KUBECONFIG=$HOME/.kube/staging-config + +# Use a specific GitHub token +export GITHUB_TOKEN=ghp_xxxxxxxxxxxx + +./sandbox/run.sh +``` + +## Scoping Credentials (Recommended) + +### GitHub: Fine-Grained PAT + +Create a [fine-grained Personal Access Token](https://github.com/settings/tokens?type=beta) scoped to this repository only: +- `contents: write` — push commits +- `pull_requests: write` — create PRs +- `metadata: read` — required baseline + +Do NOT grant `admin`, `actions`, `secrets`, or `environments`. + +### OpenShift: Scoped Service Account + +For non-ephemeral clusters, use a scoped service account instead of `kubeadmin`: + +```bash +oc create serviceaccount claude-agent -n <namespace> +oc adm policy add-role-to-user view system:serviceaccount:<namespace>:claude-agent -n <namespace> +``` + +Export a kubeconfig for that SA and pass it to the sandbox. + +## Troubleshooting + +### ADC token expired + +``` +ERROR: GCP ADC credentials not found at ... +``` + +Re-authenticate: + +```bash +gcloud auth application-default login +``` + +### kubeconfig not found + +The script prints a warning but continues — `oc` commands inside the container won't work. Set `KUBECONFIG` to the correct path if it's not in the default location. + +### GitHub token not set + +The script tries `gh auth token` as a fallback. If that also fails, `git push` and `gh pr create` won't work inside the container. 
Either: +- Run `gh auth login` on the host, or +- Set `GITHUB_TOKEN` explicitly + +### File permission issues + +The Dockerfile creates a `sandbox` user with your host UID (`id -u`). If files inside the container are owned by a different user, rebuild the image: + +```bash +docker build -t claude-sandbox sandbox/ --build-arg "HOST_UID=$(id -u)" +``` + +### Retrieving changes after a session + +The sandbox worktree is preserved at `/tmp/claude-sandbox-*` after the container exits. To inspect or use the agent's work: + +```bash +# Find the sandbox directory +ls -dt /tmp/claude-sandbox-* + +# Review what the agent did +cd /tmp/claude-sandbox-XXXXXX +git log --oneline +git diff + +# Cherry-pick commits into your working tree +cd /path/to/your/repo +git remote add sandbox /tmp/claude-sandbox-XXXXXX +git fetch sandbox +git cherry-pick <commit-sha> +git remote remove sandbox + +# Or apply uncommitted changes as a patch +cd /tmp/claude-sandbox-XXXXXX +git diff > /tmp/agent-changes.patch +cd /path/to/your/repo +git apply /tmp/agent-changes.patch +``` + +### Cleaning up old sandboxes + +Sandbox directories accumulate over time. Clean them up when no longer needed: + +```bash +ls -dt /tmp/claude-sandbox-* +rm -rf /tmp/claude-sandbox-* +``` + +## Historical Note + +An earlier approach used [openshell](../roadmap/openshell/) for sandboxing but was abandoned due to Bun segfaults under seccomp/landlock restrictions and TLS proxy incompatibilities with OpenShift API connections. See the [openshell sandbox docs](../roadmap/openshell/) for details. 
diff --git a/sandbox-policy.yaml b/sandbox-policy.yaml new file mode 100644 index 000000000..8aa53fabc --- /dev/null +++ b/sandbox-policy.yaml @@ -0,0 +1,96 @@ +version: 1 + +filesystem_policy: + include_workdir: true + read_only: + - /usr + - /lib + - /proc + - /dev/urandom + - /etc + read_write: + - /sandbox + - /tmp + - /dev/null + - /home + +landlock: + compatibility: best_effort + +network_policies: + google_oauth: + endpoints: + - host: oauth2.googleapis.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/local/bin/claude } + - { path: /sandbox/claude } + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + anthropic_api: + endpoints: + - host: api.anthropic.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/local/bin/claude } + - { path: /sandbox/claude } + + vertex_ai: + endpoints: + - host: us-east5-aiplatform.googleapis.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + github: + endpoints: + - host: github.com + port: 443 + enforcement: enforce + - host: api.github.com + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/git } + - { path: /usr/bin/gh } + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + openshift: + endpoints: + - host: 10.200.0.1 + port: 9444 + enforcement: enforce + - host: "*.ci-ln-8bnwfst-76ef8.aws-4.ci.openshift.org" + port: 443 + enforcement: enforce + - host: console-openshift-console.apps.ci-ln-8bnwfst-76ef8.aws-4.ci.openshift.org + port: 443 + enforcement: enforce + binaries: + - { path: /sandbox/oc/oc } + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + - { path: /usr/bin/curl } + + npm_registry: + endpoints: + - host: registry.npmjs.org + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/node } + - { path: /usr/bin/bash } + + bun_install: + endpoints: + - host: bun.sh + port: 443 + enforcement: enforce + binaries: + - { path: /usr/bin/curl } diff --git a/sandbox-start.sh 
b/sandbox-start.sh new file mode 100755 index 000000000..09e220855 --- /dev/null +++ b/sandbox-start.sh @@ -0,0 +1,17 @@ +#!/bin/bash +# Run this inside the sandbox after connecting with: +# openshell sandbox connect my-project + +npm install --prefix ~/claude-local @anthropic-ai/claude-code@latest + +export GOOGLE_APPLICATION_CREDENTIALS=/sandbox/adc.json/application_default_credentials.json +export CLAUDE_CODE_USE_VERTEX=1 +export ANTHROPIC_VERTEX_PROJECT_ID=itpc-gcp-hcm-pe-eng-claude +export CLOUD_ML_REGION=us-east5 + +# Force git to use HTTPS instead of SSH +git config --global url."https://github.com/".insteadOf "git@github.com:" +# Trust the sandbox proxy's TLS certificates +git config --global http.sslVerify false + +~/claude-local/node_modules/.bin/claude --dangerously-skip-permissions diff --git a/sandbox/Dockerfile b/sandbox/Dockerfile new file mode 100644 index 000000000..ad077d100 --- /dev/null +++ b/sandbox/Dockerfile @@ -0,0 +1,57 @@ +FROM node:22-bookworm + +# Install system dependencies + Cypress requirements +RUN apt-get update && apt-get install -y --no-install-recommends \ + git \ + curl \ + ca-certificates \ + jq \ + xvfb \ + libgtk2.0-0 \ + libgtk-3-0 \ + libgbm-dev \ + libnotify-dev \ + libnss3 \ + libxss1 \ + libasound2 \ + libxtst6 \ + xauth \ + && rm -rf /var/lib/apt/lists/* + +# Install gh CLI +RUN curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg -o /usr/share/keyrings/githubcli-archive-keyring.gpg \ + && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" \ + > /etc/apt/sources.list.d/github-cli.list \ + && apt-get update && apt-get install -y gh \ + && rm -rf /var/lib/apt/lists/* + +# Install oc CLI +RUN curl -fsSL https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz \ + | tar xz -C /usr/local/bin oc kubectl + +# Install Claude Code via npm (uses Node.js, not Bun) +RUN npm 
install -g @anthropic-ai/claude-code@latest + +# Create sandbox user with same UID as host user for file permissions +ARG HOST_UID=1000 +RUN useradd -m -s /bin/bash -u ${HOST_UID} sandbox +USER sandbox +WORKDIR /sandbox + +# Vertex AI configuration +ENV CLAUDE_CODE_USE_VERTEX=1 +ENV ANTHROPIC_VERTEX_PROJECT_ID=itpc-gcp-hcm-pe-eng-claude +ENV CLOUD_ML_REGION=us-east5 +ENV GOOGLE_APPLICATION_CREDENTIALS=/tmp/adc.json +ENV KUBECONFIG=/tmp/kubeconfig + +# Git config for the container +RUN git config --global user.name "Claude Agent" \ + && git config --global user.email "claude-agent@sandbox.local" \ + && git config --global --add safe.directory /sandbox + +# Claude settings - Sonnet 4.6 with 1M context +RUN mkdir -p /home/sandbox/.claude && \ + echo '{"model": "claude-sonnet-4-6", "maxTokens": 1000000}' > /home/sandbox/.claude/settings.json + +CMD ["claude", "--dangerously-skip-permissions"] diff --git a/sandbox/run.sh b/sandbox/run.sh new file mode 100755 index 000000000..657890f6c --- /dev/null +++ b/sandbox/run.sh @@ -0,0 +1,84 @@ +#!/bin/bash +set -euo pipefail + +# Docker-based sandbox for Claude Code with filesystem isolation +# See docs/agentic-development/architecture/security/docker-sandbox-blast-radius.md for security analysis + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" +IMAGE_NAME="claude-sandbox" +CONTAINER_NAME="claude-sandbox-$$" + +# --- Configuration --- +# Override these with environment variables if needed +ADC_PATH="${GOOGLE_APPLICATION_CREDENTIALS:-$HOME/.config/gcloud/application_default_credentials.json}" +KUBECONFIG_PATH="${KUBECONFIG:-$HOME/.kube/config}" +VERTEX_PROJECT="${ANTHROPIC_VERTEX_PROJECT_ID:-itpc-gcp-hcm-pe-eng-claude}" +VERTEX_REGION="${CLOUD_ML_REGION:-us-east5}" + +# --- Validation --- +if [ ! -f "$ADC_PATH" ]; then + echo "ERROR: GCP ADC credentials not found at $ADC_PATH" + echo "Run: gcloud auth application-default login" + exit 1 +fi + +if [ ! 
-f "$KUBECONFIG_PATH" ]; then + echo "WARNING: kubeconfig not found at $KUBECONFIG_PATH — oc commands won't work" + KUBECONFIG_MOUNT="" +else + KUBECONFIG_MOUNT="-v ${KUBECONFIG_PATH}:/tmp/kubeconfig:ro" +fi + +if [ -z "${GITHUB_TOKEN:-}" ]; then + # Try to get token from gh CLI + GITHUB_TOKEN=$(gh auth token 2>/dev/null || true) + if [ -z "$GITHUB_TOKEN" ]; then + echo "WARNING: No GITHUB_TOKEN set and gh auth not configured — git push/PR won't work" + fi +fi + +# --- Build image if needed --- +echo "Building sandbox image..." +docker build -t "$IMAGE_NAME" "$SCRIPT_DIR" --build-arg "HOST_UID=$(id -u)" --quiet + +# --- Create an isolated copy of the repo --- +SANDBOX_DIR=$(mktemp -d /tmp/claude-sandbox-XXXXXX) +CURRENT_BRANCH=$(git -C "$PROJECT_ROOT" branch --show-current) +REMOTE_URL=$(git -C "$PROJECT_ROOT" remote get-url origin 2>/dev/null || true) + +echo "Cloning branch '$CURRENT_BRANCH' into $SANDBOX_DIR..." +git clone --single-branch --branch "$CURRENT_BRANCH" --depth 50 "$PROJECT_ROOT" "$SANDBOX_DIR" + +# Set the remote to the actual upstream so push/pull work inside the container +if [ -n "$REMOTE_URL" ]; then + git -C "$SANDBOX_DIR" remote set-url origin "$REMOTE_URL" +fi + +# --- Run the container --- +echo "" +echo "=== Sandbox Configuration ===" +echo " Project: $PROJECT_ROOT" +echo " Branch: $CURRENT_BRANCH" +echo " Sandbox: $SANDBOX_DIR" +echo " Vertex AI: $VERTEX_PROJECT ($VERTEX_REGION)" +echo " GitHub: $([ -n "${GITHUB_TOKEN:-}" ] && echo 'token set' || echo 'NOT SET')" +echo " Kubeconfig: $([ -n "${KUBECONFIG_MOUNT:-}" ] && echo 'mounted (read-only)' || echo 'NOT SET')" +echo "" +echo " Filesystem: Only worktree is writable. Host is isolated." +echo " See docs/agentic-development/architecture/security/docker-sandbox-blast-radius.md for full security analysis." 
+echo "==============================" +echo "" + +docker run -it --rm \ + --name "$CONTAINER_NAME" \ + -v "${SANDBOX_DIR}:/sandbox" \ + -v "${ADC_PATH}:/tmp/adc.json:ro" \ + ${KUBECONFIG_MOUNT:-} \ + -e "GOOGLE_APPLICATION_CREDENTIALS=/tmp/adc.json" \ + -e "CLAUDE_CODE_USE_VERTEX=1" \ + -e "ANTHROPIC_VERTEX_PROJECT_ID=$VERTEX_PROJECT" \ + -e "CLOUD_ML_REGION=$VERTEX_REGION" \ + -e "KUBECONFIG=/tmp/kubeconfig" \ + -e "GITHUB_TOKEN=${GITHUB_TOKEN:-}" \ + "$IMAGE_NAME" diff --git a/web/cypress/reports/test-stability.md b/web/cypress/reports/test-stability.md new file mode 100644 index 000000000..298be21e4 --- /dev/null +++ b/web/cypress/reports/test-stability.md @@ -0,0 +1,34 @@ +# Test Stability Ledger + +Tracks incident detection test stability across local and CI iteration runs. Updated automatically by `/cypress:test-iteration:iterate-incident-tests` and `/cypress:test-iteration:iterate-ci-flaky`. + +## How to Read + +- **Pass rate**: percentage across all recorded runs (local + CI combined) +- **Trend**: direction over last 3 runs +- **Last failure**: most recent failure reason and which run it occurred in +- **Fixed by**: commit that resolved the issue (if applicable) + +## Current Status + +| Test | Pass Rate | Trend | Runs | Last Failure | Fixed By | +|------|-----------|-------|------|-------------|----------| +| _No data yet — run `/cypress:test-iteration:iterate-incident-tests` or `/cypress:test-iteration:iterate-ci-flaky` to populate_ | | | | | | + +## Run History + +### Run Log + +| # | Date | Type | Branch | Tests | Passed | Failed | Flaky | Commit | +|---|------|------|--------|-------|--------|--------|-------|--------| +| _No runs recorded yet_ | | | | | | | | | + + diff --git a/web/scripts/clean-test-artifacts.sh b/web/scripts/clean-test-artifacts.sh new file mode 100755 index 000000000..c6daef6b7 --- /dev/null +++ b/web/scripts/clean-test-artifacts.sh @@ -0,0 +1,8 @@ +#!/bin/bash +# Clean Cypress test artifacts (screenshots, videos, reports) 
+# Resolve the web/ directory (the parent of this script's directory) +WEB_DIR="$(cd "$(dirname "$0")/.." && pwd)" + +rm -f "$WEB_DIR/screenshots/cypress_report_"*.json +rm -f "$WEB_DIR/screenshots/merged-report.json" +rm -rf "$WEB_DIR/cypress/screenshots/"* +rm -rf "$WEB_DIR/cypress/videos/"*