Fix failed terminal retention floor#350
Conversation
|
Warning Review limit reached
More reviews will be available in 28 minutes and 11 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (30)
📝 WalkthroughWalkthroughThis PR implements ADR-032 "failed-terminal retention floor" for queue-storage pruning. During queue-ring prune, failed terminal rows within a configured ChangesFailed Terminal Retention Floor
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
b92a2f5 to
31a4fe4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
awa-model/src/queue_storage.rs (1)
6456-6479:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDeduplicate
idsbefore retrying or counting them.The new
matchedvalue inherits rawids.len(), but the admin queue retry path feeds this helper from a plainSELECT job_id FROM ...done_entries WHERE state = 'failed', so one job with multiple failed attempts can appear more than once. In that case this loop retries the samejob_idmultiple times in one transaction, which can either enqueue it twice or fail on the second insert, andmatchedover-reports unique jobs. Dedup the ids up front and report the deduped count instead.💡 Proposed fix
pub async fn retry_jobs_by_ids( &self, pool: &PgPool, ids: &[i64], ) -> Result<(Vec<JobRow>, u64), AwaError> { if ids.is_empty() { return Ok((Vec::new(), 0)); } + let mut seen = HashSet::with_capacity(ids.len()); + let unique_ids: Vec<i64> = ids.iter().copied().filter(|id| seen.insert(*id)).collect(); + let mut tx = pool.begin().await.map_err(map_sqlx_error)?; - let mut rows = Vec::with_capacity(ids.len()); - for job_id in ids { + let mut rows = Vec::with_capacity(unique_ids.len()); + for job_id in &unique_ids { if let Some(row) = self.retry_job_tx(&mut tx, *job_id).await? { rows.push(row); } } tx.commit().await.map_err(map_sqlx_error)?; - Ok((rows, ids.len() as u64)) + Ok((rows, unique_ids.len() as u64)) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@awa-model/src/queue_storage.rs` around lines 6456 - 6479, The function retry_jobs_by_ids currently iterates the raw ids slice and returns ids.len(), which double-counts jobs that appear multiple times; before starting the transaction, deduplicate the incoming ids (e.g., build a Vec of unique ids by tracking seen ids with a HashSet or by sorting+dedup) and use that deduped collection for the loop that calls retry_job_tx and for sizing rows; finally return the deduped count (unique_ids.len() as u64) instead of ids.len(); update references to ids in retry_jobs_by_ids to use the unique collection while leaving retry_job_tx and JobRow unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@awa-cli/src/main.rs`:
- Around line 692-719: The RetryFailed branch of JobCommands currently allows
both --kind and --queue and silently prefers kind; update the validation so
exactly one of kind or queue is provided (i.e., reject when both are Some or
both are None). In the JobCommands::RetryFailed match arm, replace the nested
if/else with a guard that checks (kind.is_some() ^ queue.is_some()), and if that
check fails print a clear error (e.g., "Must specify exactly one of --kind or
--queue") and exit non-zero; when valid, call the appropriate helpers
awa_model::admin::retry_failed_by_kind(&pool, &kind) or
awa_model::admin::retry_failed_by_queue(&pool, &queue) as before. Ensure the
error path mirrors existing behavior (eprintln! and std::process::exit(1)).
---
Outside diff comments:
In `@awa-model/src/queue_storage.rs`:
- Around line 6456-6479: The function retry_jobs_by_ids currently iterates the
raw ids slice and returns ids.len(), which double-counts jobs that appear
multiple times; before starting the transaction, deduplicate the incoming ids
(e.g., build a Vec of unique ids by tracking seen ids with a HashSet or by
sorting+dedup) and use that deduped collection for the loop that calls
retry_job_tx and for sizing rows; finally return the deduped count
(unique_ids.len() as u64) instead of ids.len(); update references to ids in
retry_jobs_by_ids to use the unique collection while leaving retry_job_tx and
JobRow unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7ff99388-8141-4b01-bcfe-725c40c8ff89
⛔ Files ignored due to path filters (1)
awa-python/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (29)
CHANGELOG.mdawa-cli/src/main.rsawa-model/migrations/v023_install_queue_storage_substrate.sqlawa-model/migrations/v032_failed_terminal_retention.sqlawa-model/src/admin.rsawa-model/src/lib.rsawa-model/src/migrations.rsawa-model/src/queue_storage.rsawa-python/python/awa/__init__.pyawa-python/python/awa/__main__.pyawa-python/python/awa/_awa.pyiawa-python/python/awa/client.pyawa-python/src/client.rsawa-python/src/lib.rsawa-python/tests/test_awa.pyawa-python/tests/test_cli.pyawa-python/tests/test_sync.pyawa-worker/src/maintenance.rsawa/tests/migration_test.rsawa/tests/queue_storage_benchmark_test.rsawa/tests/queue_storage_runtime_test.rsawa/tests/receipt_plane_chaos_test.rscorrectness/storage/MAPPING.mddocs/adr/026-narrow-terminal-history.mddocs/adr/032-failed-terminal-retention.mddocs/adr/README.mddocs/configuration.mddocs/troubleshooting.mdexamples/python/email_campaign.py
Summary
failed_retentionin queue storage by carrying in-floor rows forward during prunepruned_failed_count/QueueCounts.pruned_failedand returnRetryFailedOutcomefrom failed-job retry admin surfacesFixes #337.
Verification
git diff --checkcargo fmt --all -- --checkcargo fmt --manifest-path awa-python/Cargo.toml -- --checkSQLX_OFFLINE=true cargo build --workspaceSQLX_OFFLINE=true cargo clippy --workspace --all-targets --all-features -- -D warningsPYO3_PYTHON=/Users/brian/.local/bin/python3.14 SQLX_OFFLINE=true cargo clippy --manifest-path awa-python/Cargo.toml --all-targets -- -D warningsDATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa --test queue_storage_runtime_testDATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa --test migration_test test_v032_backfills_queue_storage_pruned_failed_rollup_columnDATABASE_URL=postgres://postgres:test@localhost:15432/awa_test SQLX_OFFLINE=true cargo test -p awa-modelmaturin develop --uvfromawa-python/DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/ -k "retry_failed or dlq"DATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/test_awa.py tests/test_sync.pyDATABASE_URL=postgres://postgres:test@localhost:15432/awa_test .venv/bin/pytest tests/test_cli.py tests/test_awa.py::test_admin_retry_failed_and_discard_failed tests/test_sync.py::test_retry_failed_sync tests/test_sync.py::test_retry_failed_sync_queue_storage_reports_pruned_countNote:
uv run --locked maturin developstill hits the existingproject.readme.file ../README.md resolves outside allowed metadata rootpackaging issue; directmaturin develop --uvsucceeds.Summary by CodeRabbit
New Features
Documentation