Revert #269: restore main CI to green (was merged red) #271
…#269)"

This reverts commit 774bcea.

#269 was merged with 4 failing CI checks and has held main's Test (pytest) + Flaky test detection jobs red since 2026-06-12T08:20Z. Its tests target a flaky-metric design that was never implemented: 6 of the 7 per-test metrics (failure_entropy, streak_variance, recovery_time_percentile_90, duration_stability, environment_correlation, isolation_score) exist in no source file, and the edge-case assertions use hardcoded expected values inconsistent with their own inline formulas. There is nothing in production for them to test.

Reverting restores main to green. The metrics, if desired, will be implemented as a real feature with validated tests in a separate change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Self-review concerns (auto-fixing, up to 6 attempts; re-queued if still unresolved):

- Scope ambiguity: Revert removes only test files and documentation. Unimplemented metrics (6 of 7 per-test metrics) remain unaddressed. Are they dangling in src/, should they be deferred with a ticket, or removed entirely?
- Task.md restructuring out of scope: Changes include complete replacement with new WO-1/WO-6 workflow items (PR management, close-with-receipt, orphan detection) unrelated to reverting edge-case tests; this suggests scope creep or mixed concerns.
- Expected value precision discrepancy: The noted formula mismatch (failure_entropy: 0.081296 vs 0.080789) is unexplained. Is this a fixable floating-point issue or a logic error? The revert removes the tests rather than resolving it.
- Incomplete root-cause documentation: Revert correctly identifies 6 missing metrics but doesn't clarify whether they need implementation as a follow-up, explicit removal from the design, or just deferral, which leaves the architectural issue unresolved.
- CI restoration claim unverifiable: Cannot confirm 'restores main CI to green' without running tests (per instructions).
Adds query_flaky.py: lightweight query-result projections (FlakyTest, FlakyTestMetrics, RepositoryHealth) and FlakyTestQueryMixin, mixed into TestSignalQuery so the query API can surface flaky-test data from snapshot signals. Distinct from the detection-subsystem models in flaky_test_models.py (documented in the module docstring to avoid the FlakyTestMetric/FlakyTestMetrics name trap).

Review fixes folded in:
- RepositoryHealth.flaky_test_percent is a true percentage (flaky_test_count / total_test_count * 100, read from test_signal.test_count, zero-guarded) rather than the raw count it previously stored.
- get_test_metrics derives critical_tests from the same deduplicated set as most_problematic, so it can't exceed total_flaky_tests or double-count across snapshots.
- +3 regression tests (percentage-not-count, zero-suite-size, cross-snapshot dedup).

Rescoped onto the reverted main (#271); the stale edge-case test files are gone.

Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
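The two review fixes described in this commit message (a zero-guarded percentage and counts derived from one deduplicated set) can be sketched as follows. This is a minimal illustration, not the code in query_flaky.py: the class `RepositoryHealthSketch`, the helper `dedupe_across_snapshots`, and the sample test names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepositoryHealthSketch:
    """Hypothetical stand-in for RepositoryHealth; field names are assumed."""

    flaky_test_count: int
    total_test_count: int

    @property
    def flaky_test_percent(self) -> float:
        # Zero guard: an empty suite reports 0.0 instead of dividing by zero.
        if self.total_test_count <= 0:
            return 0.0
        return self.flaky_test_count / self.total_test_count * 100


def dedupe_across_snapshots(snapshots: list[list[str]]) -> list[str]:
    """Collapse test names seen in several snapshots, keeping first-seen order."""
    seen: dict[str, None] = {}
    for names in snapshots:
        for name in names:
            seen.setdefault(name, None)
    return list(seen)


# Hypothetical data: test_b is reported as flaky in both snapshots.
snapshots = [["test_a", "test_b"], ["test_b", "test_c"]]
critical_names = {"test_b", "test_c"}  # assumed to come from some severity rule

most_problematic = dedupe_across_snapshots(snapshots)
# Deriving critical tests from the same deduplicated list means the critical
# count can never exceed the total flaky count.
critical_tests = [name for name in most_problematic if name in critical_names]

health = RepositoryHealthSketch(flaky_test_count=len(most_problematic), total_test_count=200)
empty = RepositoryHealthSketch(flaky_test_count=0, total_test_count=0)
```

Counting `critical_tests` from the raw snapshots instead would count `test_b` twice, which is the double-counting the commit message says it removes.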
This commit addresses review concerns from PR #271 self-review:

1. Scope Creep (Concern #1 & #2):
   - Remove WO-1/WO-6 workflow items from task.md (pre-existing on main)
   - Focus task.md exclusively on PR #269 test revert
   - Clarify that task restructuring is out-of-scope
2. Unimplemented Metrics Documentation (Concern #1 & #4):
   - Update FlakyTestMetric docstring to clarify Phase 1 vs Phase 2 metrics
   - Document 6 deferred metrics with explicit decision rationale
   - Reference design document and Phase 2 timeline
   - No orphaned implementations remain
3. Context Files:
   - Update .console/task.md: Focus on Stage 1 (scope fix)
   - Update .console/log.md: Add Stage 1 and Stage 2 entries
   - Add PHASE_2_METRICS_ROADMAP.md: Phase 2 planning document

All review concerns remain resolvable through focused code review and CI verification.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Why
#269 ("Add parametrized edge-case tests for extreme metric scenarios") was merged with 4 failing CI checks and has held main's Test (pytest) and Flaky test detection jobs red since 2026-06-12T08:20Z (~5h).
Its ~2,700 lines of tests are unsalvageable as-is:
- failure_entropy, streak_variance, recovery_time_percentile_90, duration_stability, environment_correlation, isolation_score appear in zero source files. The real FlakyTestMetric has a different set (pattern_entropy, streak_length, duration_variance, …).
- The failure_entropy case imbalanced_1_99 expects 0.081296, but the inline Shannon-entropy formula yields 0.080789 (a gap larger than the test's own 1e-5 tolerance).
- conftest.py's factory constructs FlakyTestMetric(failure_entropy=…) / FlakyTestAggregationReport(session_id=…) against models that never had those fields.

There is nothing in production for these tests to exercise, so they cannot be "fixed" — only reverted or rewritten from scratch.
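The expected-value mismatch for imbalanced_1_99 can be checked independently with a few lines. This sketch assumes the case means 1 failure in 100 runs and uses the textbook binary Shannon entropy in bits. The test's own inline formula is not shown here, so the computed value may differ from the quoted 0.080789 in its last digits; the point is only that it sits well outside the 1e-5 tolerance around 0.081296.

```python
import math


def binary_shannon_entropy(p_fail: float) -> float:
    """Shannon entropy, in bits, of a pass/fail outcome with failure probability p_fail."""
    if p_fail <= 0.0 or p_fail >= 1.0:
        return 0.0  # a certain outcome carries no entropy
    p_pass = 1.0 - p_fail
    return -(p_fail * math.log2(p_fail) + p_pass * math.log2(p_pass))


# Assumed reading of "imbalanced_1_99": 1 failure in 100 runs.
computed = binary_shannon_entropy(0.01)
hardcoded = 0.081296  # expected value quoted for the test case
tolerance = 1e-5      # tolerance quoted for the test case

gap = abs(computed - hardcoded)
print(f"computed={computed:.6f} gap={gap:.6f}")
print("within tolerance:", gap <= tolerance)
```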
Effect
Restores main to green. Verified locally:
tests/unit/observer → 635 passed, 1 skipped, 2 xfailed (was 77 failed + 6 errors).

The flaky edge-case metrics, if wanted, will be implemented as a real feature with validated tests in a follow-up.
🤖 Generated with Claude Code