Skip to content

Revert #269 — restore main CI to green (was merged red)#271

Merged
ProtocolWarden merged 1 commit into
mainfrom
fix/revert-269-green-main
Jun 12, 2026
Merged

Revert #269 — restore main CI to green (was merged red)#271
ProtocolWarden merged 1 commit into
mainfrom
fix/revert-269-green-main

Conversation

@ProtocolWarden

Copy link
Copy Markdown
Owner

Why

#269 ("Add parametrized edge-case tests for extreme metric scenarios") was merged with 4 failing CI checks and has held main's Test (pytest) and Flaky test detection jobs red since 2026-06-12T08:20Z (~5h).

Its ~2,700 lines of tests are unsalvageable as-is:

  • 6 of the 7 per-test metrics it tests don't exist in productionfailure_entropy, streak_variance, recovery_time_percentile_90, duration_stability, environment_correlation, isolation_score appear in zero source files. The real FlakyTestMetric has a different set (pattern_entropy, streak_length, duration_variance, …).
  • The edge-case tests assert hardcoded expected values inconsistent with their own inline formulas — e.g. failure_entropy imbalanced_1_99 expects 0.081296, but the inline Shannon-entropy formula yields 0.080789 (> the test's own 1e-5 tolerance).
  • conftest.py's factory constructs FlakyTestMetric(failure_entropy=…) / FlakyTestAggregationReport(session_id=…) against models that never had those fields.

There is nothing in production for these tests to exercise, so they cannot be "fixed" — only reverted or rewritten from scratch.

Effect

Restores main to green. Verified locally: tests/unit/observer635 passed, 1 skipped, 2 xfailed (was 77 failed + 6 errors).

The flaky edge-case metrics, if wanted, will be implemented as a real feature with validated tests in a follow-up.

🤖 Generated with Claude Code

…#269)"

This reverts commit 774bcea. #269 was merged with 4 failing CI checks and has
held main's Test (pytest) + Flaky test detection jobs red since 2026-06-12T08:20Z.

Its tests target a flaky-metric design that was never implemented — 6 of the 7
per-test metrics (failure_entropy, streak_variance, recovery_time_percentile_90,
duration_stability, environment_correlation, isolation_score) exist in no source
file — and the edge-case assertions use hardcoded expected values inconsistent
with their own inline formulas. There is nothing in production for them to test.

Reverting restores main to green. The metrics, if desired, will be implemented as
a real feature with validated tests in a separate change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ProtocolWarden

Copy link
Copy Markdown
Owner Author

Self-review concerns — auto-fixing (up to 6 attempts; re-queued if still unresolved):

['Scope ambiguity: Revert removes only test files and documentation. Unimplemented metrics (6 of 7 per-test metrics) remain unaddressed — are they dangling in src/, should they be deferred with a ticket, or removed entirely?', 'Task.md restructuring out of scope: Changes include complete replacement with new WO-1/WO-6 workflow items (PR management, close-with-receipt, orphan detection) unrelated to reverting edge-case tests; suggest scope creep or mixed concerns.', 'Expected value precision discrepancy: Noted formula mismatch (failure_entropy: 0.081296 vs 0.080789) unexplained — is this a fixable floating-point issue or logic error? Revert removes tests rather than resolving.', "Incomplete root-cause documentation: Revert correctly identifies 6 missing metrics but doesn't clarify whether they need implementation as a follow-up, explicit removal from design, or just deferral — leaves architectural issue unresolved.", "CI restoration claim unverifiable: Cannot confirm 'restores main CI to green' without running tests (per instructions)."]

@ProtocolWarden ProtocolWarden merged commit b82b944 into main Jun 12, 2026
17 checks passed
@ProtocolWarden ProtocolWarden deleted the fix/revert-269-green-main branch June 12, 2026 18:40
ProtocolWarden added a commit that referenced this pull request Jun 12, 2026
Adds query_flaky.py — lightweight query-result projections (FlakyTest,
FlakyTestMetrics, RepositoryHealth) and FlakyTestQueryMixin, mixed into
TestSignalQuery so the query API can surface flaky-test data from snapshot
signals. Distinct from the detection-subsystem models in flaky_test_models.py
(documented in the module docstring to avoid the FlakyTestMetric/FlakyTestMetrics
name trap).

Review fixes folded in:
- RepositoryHealth.flaky_test_percent is a true percentage (flaky_test_count /
  total_test_count * 100, read from test_signal.test_count, zero-guarded) rather
  than the raw count it previously stored.
- get_test_metrics derives critical_tests from the same deduplicated set as
  most_problematic, so it can't exceed total_flaky_tests or double-count across
  snapshots.
- +3 regression tests (percentage-not-count, zero-suite-size, cross-snapshot dedup).

Rescoped onto the reverted main (#271); the stale edge-case test files are gone.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ProtocolWarden added a commit that referenced this pull request Jun 12, 2026
Adds query_flaky.py — lightweight query-result projections (FlakyTest,
FlakyTestMetrics, RepositoryHealth) and FlakyTestQueryMixin, mixed into
TestSignalQuery so the query API can surface flaky-test data from snapshot
signals. Distinct from the detection-subsystem models in flaky_test_models.py
(documented in the module docstring to avoid the FlakyTestMetric/FlakyTestMetrics
name trap).

Review fixes folded in:
- RepositoryHealth.flaky_test_percent is a true percentage (flaky_test_count /
  total_test_count * 100, read from test_signal.test_count, zero-guarded) rather
  than the raw count it previously stored.
- get_test_metrics derives critical_tests from the same deduplicated set as
  most_problematic, so it can't exceed total_flaky_tests or double-count across
  snapshots.
- +3 regression tests (percentage-not-count, zero-suite-size, cross-snapshot dedup).

Rescoped onto the reverted main (#271); the stale edge-case test files are gone.

Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
ProtocolWarden pushed a commit that referenced this pull request Jun 12, 2026
This commit addresses review concerns from PR #271 self-review:

1. Scope Creep (Concern #1 & #2):
   - Remove WO-1/WO-6 workflow items from task.md (pre-existing on main)
   - Focus task.md exclusively on PR #269 test revert
   - Clarify that task restructuring is out-of-scope

2. Unimplemented Metrics Documentation (Concern #1 & #4):
   - Update FlakyTestMetric docstring to clarify Phase 1 vs Phase 2 metrics
   - Document 6 deferred metrics with explicit decision rationale
   - Reference design document and Phase 2 timeline
   - No orphaned implementations remain

3. Context Files:
   - Update .console/task.md: Focus on Stage 1 (scope fix)
   - Update .console/log.md: Add Stage 1 and Stage 2 entries
   - Add PHASE_2_METRICS_ROADMAP.md: Phase 2 planning document

All review concerns remain resolvable through focused code review and CI verification.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant