Skip to content

Fix failed terminal retention floor#350

Merged
hardbyte merged 3 commits into
mainfrom
fix/337-failed-terminal-retention
Jun 13, 2026
Merged

Fix failed terminal retention floor#350
hardbyte merged 3 commits into
mainfrom
fix/337-failed-terminal-retention

Conversation

@hardbyte

@hardbyte hardbyte commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Summary

  • keep non-DLQ failed terminal rows retryable for at least failed_retention in queue storage by carrying in-floor rows forward during prune
  • add pruned_failed_count / QueueCounts.pruned_failed and return RetryFailedOutcome from failed-job retry admin surfaces
  • update Rust CLI, Python bindings/CLI, docs, ADR-032, changelog, and deterministic regression coverage

Fixes #337.

Verification

  • git diff --check
  • cargo fmt --all -- --check
  • cargo fmt --manifest-path awa-python/Cargo.toml -- --check
  • SQLX_OFFLINE=true cargo build --workspace
  • SQLX_OFFLINE=true cargo clippy --workspace --all-targets --all-features -- -D warnings
  • PYO3_PYTHON=/Users/brian/.local/bin/python3.14 SQLX_OFFLINE=true cargo clippy --manifest-path awa-python/Cargo.toml --all-targets -- -D warnings
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa --test queue_storage_runtime_test
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa --test migration_test test_v032_backfills_queue_storage_pruned_failed_rollup_column
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa-model
  • maturin develop --uv from awa-python/
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/ -k "retry_failed or dlq"
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/test_awa.py tests/test_sync.py
  • DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/test_cli.py tests/test_awa.py::test_admin_retry_failed_and_discard_failed tests/test_sync.py::test_retry_failed_sync tests/test_sync.py::test_retry_failed_sync_queue_storage_reports_pruned_count

Note: uv run --locked maturin develop still hits the existing project.readme.file ../README.md resolves outside allowed metadata root packaging issue; direct maturin develop --uv succeeds.

Summary by CodeRabbit

  • New Features

    • Failed jobs now retain longer and remain retryable for a configurable duration before permanent removal.
    • Retry operations report detailed outcomes: matched jobs, retried count, and pruned-past-retention count.
    • Enhanced CLI and SDK visibility into retry status and pruning.
  • Documentation

    • New ADR-032 documents the failed terminal retention floor design.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@hardbyte, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 28 minutes and 11 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c29db3df-8946-48f2-b60c-a88be9de7063

📥 Commits

Reviewing files that changed from the base of the PR and between b92a2f5 and 23c7924.

⛔ Files ignored due to path filters (1)
  • awa-python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (30)
  • CHANGELOG.md
  • awa-cli/README.md
  • awa-cli/src/main.rs
  • awa-model/migrations/v023_install_queue_storage_substrate.sql
  • awa-model/migrations/v032_failed_terminal_retention.sql
  • awa-model/src/admin.rs
  • awa-model/src/lib.rs
  • awa-model/src/migrations.rs
  • awa-model/src/queue_storage.rs
  • awa-python/python/awa/__init__.py
  • awa-python/python/awa/__main__.py
  • awa-python/python/awa/_awa.pyi
  • awa-python/python/awa/client.py
  • awa-python/src/client.rs
  • awa-python/src/lib.rs
  • awa-python/tests/test_awa.py
  • awa-python/tests/test_cli.py
  • awa-python/tests/test_sync.py
  • awa-worker/src/maintenance.rs
  • awa/tests/migration_test.rs
  • awa/tests/queue_storage_benchmark_test.rs
  • awa/tests/queue_storage_runtime_test.rs
  • awa/tests/receipt_plane_chaos_test.rs
  • correctness/storage/MAPPING.md
  • docs/adr/026-narrow-terminal-history.md
  • docs/adr/032-failed-terminal-retention.md
  • docs/adr/README.md
  • docs/configuration.md
  • docs/troubleshooting.md
  • examples/python/email_campaign.py
📝 Walkthrough

Walkthrough

This PR implements ADR-032 "failed-terminal retention floor" for queue-storage pruning. During queue-ring prune, failed terminal rows within a configured failed_retention duration are carried forward into the live segment as synthetic wide rows instead of being truncated; rows aged past the floor are folded into rollups as pruned_failed_count. Admin retry APIs return structured RetryFailedOutcome including matched vs retried counts and optional pruned-past-retention counts. Schema version advances from 31 to 32 with a backfill migration. Python bindings and CLI updated to expose pruning visibility.

Changes

Failed Terminal Retention Floor

Layer / File(s) Summary
Schema contracts and migration infrastructure
awa-model/migrations/v023_install_queue_storage_substrate.sql, awa-model/migrations/v032_failed_terminal_retention.sql, awa-model/src/migrations.rs, awa-model/src/queue_storage.rs, awa-model/src/admin.rs
pruned_failed_count column added to queue_terminal_rollups (v023 initial, v032 backfill); QueueCounts extended with pruned_failed field; PruneOutcome::Pruned struct extended with carried_failed_rows; new RetryFailedOutcome struct introduced; schema version bumped to 32.
Core prune carry-forward and terminal accounting
awa-model/src/queue_storage.rs
prune_oldest accepts failed_retention: Duration, carries failed rows within floor into done_entries, splits pruned counts into pruned_completed/pruned_failed; queue_counts_exact SQL updated to project pruned_failed; new pruned_failed_count_for_queue helper; retry_jobs_by_ids returns (Vec<JobRow>, u64) tuple; lease/claim prune return shapes updated.
Admin retry outcome APIs
awa-model/src/admin.rs, awa-model/src/lib.rs
retry_failed_by_kind and retry_failed_by_queue return RetryFailedOutcome with matched/retried/pruned fields instead of Vec<JobRow>; both queue-storage (with pruned_failed_count) and canonical (without) paths implemented; bulk_retry destructures new tuple shape; re-exports reformatted.
Python bindings and client updates
awa-python/src/client.rs, awa-python/python/awa/client.py, awa-python/python/awa/_awa.pyi, awa-python/python/awa/__init__.py
New PyRetryFailedResult class wraps jobs, matched, and optional pruned count; async/sync Client.retry_failed return type changed to RetryFailedResult; type stubs extended; package-level exports added.
CLI and user-facing rendering
awa-cli/src/main.rs, awa-python/python/awa/__main__.py, examples/python/email_campaign.py
retry-failed handler uses structured outcome to render richer message (retried, raced/pruned split, pruned-past-retention warning); CLI dispatch and example script updated.
Worker maintenance integration
awa-worker/src/maintenance.rs
Queue-storage prune passes self.failed_retention to prune_oldest; PruneOutcome::Pruned patterns updated to destructure carried_failed_rows; test recovery case updated for new shape.
Retention floor tests and migrations
awa/tests/queue_storage_runtime_test.rs, awa/tests/migration_test.rs, awa/tests/queue_storage_benchmark_test.rs, awa/tests/receipt_plane_chaos_test.rs, awa-python/tests/test_*.py
New #337 retention-focused tests: carry-forward, pruning floor boundary, admin outcome reporting, retry/rebuild survival, re-carry-once semantics; migration backfill test; CLI outcome test; sync pruned-count test; pattern match updates across all test files.
Documentation and ADRs
docs/adr/032-failed-terminal-retention.md, docs/adr/026-narrow-terminal-history.md, docs/adr/README.md, docs/configuration.md, docs/troubleshooting.md, correctness/storage/MAPPING.md, CHANGELOG.md
New ADR-032 specifying retention floor, carry-forward guarantees, and operational visibility; ADR-026 amended to scope carry-forward exception; configuration docs clarified for cross-engine failed_retention semantics; troubleshooting guidance expanded; TLA+ model scope note; CHANGELOG unreleased entry.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #337: The changes directly implement the retention-floor and outcome API requirements originally tracked in this issue, including carry-forward behavior, pruned-count visibility, and operator-facing retry outcome changes.
  • #335: Related context on queue-ring prune semantics and pre-prune-race flakiness that motivates the retention floor design.

Possibly related PRs

  • hardbyte/awa#304: Both PRs modify QueueStorage::prune_oldest terminal-state folding logic; this PR adds the failed_retention carry-forward mechanism and revised pruned-count accounting that work alongside counter-row cleanup.
  • hardbyte/awa#306: Both PRs update QueueCounts terminal-total computation; this PR adds pruned_failed field and revised SQL projection that builds on the terminal-count refactoring in that PR.
  • hardbyte/awa#289: Both PRs extend queue-count reporting; this PR adds pruned_failed accounting to complement the queue_counts_fast method shape introduced in that PR.

Poem

🐰 Failed rows feared the pruning sweep,
So we carry them forth from the slot so deep,
failed_retention now guards the floor,
While pruned_failed_count tracks what was more.
Retries survive the rotation ring—
No silent loss, just clear accounting! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'Fix failed terminal retention floor' clearly describes the main objective of implementing a retention floor for failed terminal rows so they remain retryable, which is the primary change across all 21 modified files.
Linked Issues check ✅ Passed All issue #337 requirements are met: failed-retention floor prevents premature pruning (queue_storage.rs prune_oldest with failed_retention parameter), RetryFailedOutcome exposes pruned_failed_count to distinguish silent losses, and tests provide deterministic coverage of retention/prune behavior.
Out of Scope Changes check ✅ Passed All changes directly support failed-retention floor implementation: database schema updates (migrations, pruned_failed_count column), queue-storage logic (prune_oldest, retry_jobs_by_ids, PruneOutcome.carried_failed_rows), admin APIs (RetryFailedOutcome), CLI/Python bindings, documentation (ADR-032, configuration, troubleshooting), and comprehensive test coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@hardbyte hardbyte force-pushed the fix/337-failed-terminal-retention branch from b92a2f5 to 31a4fe4 Compare June 13, 2026 00:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
awa-model/src/queue_storage.rs (1)

6456-6479: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Deduplicate ids before retrying or counting them.

The new matched value inherits raw ids.len(), but the admin queue retry path feeds this helper from a plain SELECT job_id FROM ...done_entries WHERE state = 'failed', so one job with multiple failed attempts can appear more than once. In that case this loop retries the same job_id multiple times in one transaction, which can either enqueue it twice or fail on the second insert, and matched over-reports unique jobs. Dedup the ids up front and report the deduped count instead.

💡 Proposed fix
     pub async fn retry_jobs_by_ids(
         &self,
         pool: &PgPool,
         ids: &[i64],
     ) -> Result<(Vec<JobRow>, u64), AwaError> {
         if ids.is_empty() {
             return Ok((Vec::new(), 0));
         }
 
+        let mut seen = HashSet::with_capacity(ids.len());
+        let unique_ids: Vec<i64> = ids.iter().copied().filter(|id| seen.insert(*id)).collect();
+
         let mut tx = pool.begin().await.map_err(map_sqlx_error)?;
-        let mut rows = Vec::with_capacity(ids.len());
-        for job_id in ids {
+        let mut rows = Vec::with_capacity(unique_ids.len());
+        for job_id in &unique_ids {
             if let Some(row) = self.retry_job_tx(&mut tx, *job_id).await? {
                 rows.push(row);
             }
         }
         tx.commit().await.map_err(map_sqlx_error)?;
-        Ok((rows, ids.len() as u64))
+        Ok((rows, unique_ids.len() as u64))
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@awa-model/src/queue_storage.rs` around lines 6456 - 6479, The function
retry_jobs_by_ids currently iterates the raw ids slice and returns ids.len(),
which double-counts jobs that appear multiple times; before starting the
transaction, deduplicate the incoming ids (e.g., build a Vec of unique ids by
tracking seen ids with a HashSet or by sorting+dedup) and use that deduped
collection for the loop that calls retry_job_tx and for sizing rows; finally
return the deduped count (unique_ids.len() as u64) instead of ids.len(); update
references to ids in retry_jobs_by_ids to use the unique collection while
leaving retry_job_tx and JobRow unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awa-cli/src/main.rs`:
- Around line 692-719: The RetryFailed branch of JobCommands currently allows
both --kind and --queue and silently prefers kind; update the validation so
exactly one of kind or queue is provided (i.e., reject when both are Some or
both are None). In the JobCommands::RetryFailed match arm, replace the nested
if/else with a guard that checks (kind.is_some() ^ queue.is_some()), and if that
check fails print a clear error (e.g., "Must specify exactly one of --kind or
--queue") and exit non-zero; when valid, call the appropriate helpers
awa_model::admin::retry_failed_by_kind(&pool, &kind) or
awa_model::admin::retry_failed_by_queue(&pool, &queue) as before. Ensure the
error path mirrors existing behavior (eprintln! and std::process::exit(1)).

---

Outside diff comments:
In `@awa-model/src/queue_storage.rs`:
- Around line 6456-6479: The function retry_jobs_by_ids currently iterates the
raw ids slice and returns ids.len(), which double-counts jobs that appear
multiple times; before starting the transaction, deduplicate the incoming ids
(e.g., build a Vec of unique ids by tracking seen ids with a HashSet or by
sorting+dedup) and use that deduped collection for the loop that calls
retry_job_tx and for sizing rows; finally return the deduped count
(unique_ids.len() as u64) instead of ids.len(); update references to ids in
retry_jobs_by_ids to use the unique collection while leaving retry_job_tx and
JobRow unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7ff99388-8141-4b01-bcfe-725c40c8ff89

📥 Commits

Reviewing files that changed from the base of the PR and between 7be5bd8 and b92a2f5.

⛔ Files ignored due to path filters (1)
  • awa-python/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (29)
  • CHANGELOG.md
  • awa-cli/src/main.rs
  • awa-model/migrations/v023_install_queue_storage_substrate.sql
  • awa-model/migrations/v032_failed_terminal_retention.sql
  • awa-model/src/admin.rs
  • awa-model/src/lib.rs
  • awa-model/src/migrations.rs
  • awa-model/src/queue_storage.rs
  • awa-python/python/awa/__init__.py
  • awa-python/python/awa/__main__.py
  • awa-python/python/awa/_awa.pyi
  • awa-python/python/awa/client.py
  • awa-python/src/client.rs
  • awa-python/src/lib.rs
  • awa-python/tests/test_awa.py
  • awa-python/tests/test_cli.py
  • awa-python/tests/test_sync.py
  • awa-worker/src/maintenance.rs
  • awa/tests/migration_test.rs
  • awa/tests/queue_storage_benchmark_test.rs
  • awa/tests/queue_storage_runtime_test.rs
  • awa/tests/receipt_plane_chaos_test.rs
  • correctness/storage/MAPPING.md
  • docs/adr/026-narrow-terminal-history.md
  • docs/adr/032-failed-terminal-retention.md
  • docs/adr/README.md
  • docs/configuration.md
  • docs/troubleshooting.md
  • examples/python/email_campaign.py

Comment thread awa-cli/src/main.rs Outdated
@hardbyte hardbyte merged commit 58eb350 into main Jun 13, 2026
12 checks passed
@hardbyte hardbyte deleted the fix/337-failed-terminal-retention branch June 13, 2026 01:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failed terminal rows are prunable with no retention floor; admin retry/DLQ-move silently return empty

1 participant