fix(consolidation): skip null-payload operations in deduplication query #1187
CRMbyRSM wants to merge 1 commit into vectorize-io:main
Conversation
The deduplication check in _submit_async_operation matched pending operations by bank_id + operation_type + status='pending' without verifying task_payload was non-null. This caused a deadlock when a corrupted row (null task_payload, left behind by older code where the INSERT didn't include task_payload atomically) existed:

1. A zombie consolidation row sits with status='pending', task_payload=NULL.
2. Every new consolidation request deduplicates to the zombie (deduplicated=True).
3. The worker claim SQL requires task_payload IS NOT NULL, so it rejects the zombie.
4. Result: no new consolidation can ever be claimed or processed.

Fix: add AND task_payload IS NOT NULL to the deduplication query so corrupted rows are skipped. New consolidation operations bypass the zombie and create fresh rows with a proper task_payload.

Also adds a diagnostic warning log when null-payload zombies are detected, helping operators identify and clean up corrupted operations.
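A minimal sketch of the fixed lookup, assuming an asyncpg-style connection; the actual code in _submit_async_operation may select different columns and has more surrounding logic:

```python
# Sketch only: dedup lookup with the fix applied. The extra predicate makes
# corrupted rows (task_payload = NULL) invisible to deduplication.
existing = await conn.fetchrow(
    """
    SELECT operation_id
    FROM async_operations
    WHERE bank_id = $1
      AND operation_type = $2
      AND status = 'pending'
      AND task_payload IS NOT NULL
    """,
    bank_id,
    operation_type,
)
if existing is not None:
    return existing["operation_id"], True  # deduplicated to a valid pending op
```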
nicoloboschi
left a comment
Review
The core fix (AND task_payload IS NOT NULL in the dedup query) is correct and well-motivated. The PR description and commit message are excellent — clear root cause analysis with observed symptoms. A few things to address before merging:
1. Extra query on every deduplicated submission (should fix)
The zombie-counting query runs on every call when dedupe_by_bank=True, even though null-payload rows are a legacy artifact that will eventually disappear. This adds an unnecessary DB round-trip to every consolidation/dedup submission.
Suggestion: Combine both queries into one. For example:
```sql
SELECT operation_id, (task_payload IS NULL) AS is_zombie
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
```

Then partition the results in Python: if any non-zombie row exists, dedup to it; if only zombies exist, log the warning. One query instead of two; see the sketch below.
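A minimal sketch of that partitioning, assuming an asyncpg-style connection and the column names above (the surrounding submit logic is elided):

```python
rows = await conn.fetch(
    """
    SELECT operation_id, (task_payload IS NULL) AS is_zombie
    FROM async_operations
    WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
    """,
    bank_id,
    operation_type,
)
valid = [r for r in rows if not r["is_zombie"]]
zombies = [r for r in rows if r["is_zombie"]]

if valid:
    # At least one well-formed pending operation exists: dedup to it as before.
    return valid[0]["operation_id"], True
if zombies:
    # Only corrupted rows exist: warn, then fall through to insert a fresh row.
    logger.warning(
        "Skipping %d pending %s operation(s) with NULL task_payload for bank %s",
        len(zombies), operation_type, bank_id,
    )
```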
2. No cleanup of zombie rows (consider)
The PR detects and logs zombie rows but never cleans them up. They'll sit there forever, triggering the warning log on every submission. Consider adding a simple cleanup — e.g., mark them as failed or delete them — right after the warning. This would make the fix self-healing rather than just diagnostic.
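If you go the mark-as-failed route, a sketch of what that could look like right after the warning (treating result_metadata as jsonb is an assumption; adjust to the actual schema):

```python
# Hypothetical cleanup: flip zombies to 'failed' so they stop matching the
# status = 'pending' dedup filter on every future submission.
await conn.execute(
    """
    UPDATE async_operations
    SET status = 'failed',
        result_metadata = jsonb_build_object('zombie_cleanup', true)
    WHERE operation_id = ANY($1)
    """,
    [r["operation_id"] for r in zombies],
)
```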
3. No test coverage (should fix)
There's no test for this change. A test that:
- Inserts a zombie row (status='pending', task_payload=NULL)
- Calls _submit_async_operation with dedupe_by_bank=True
- Asserts a new operation is created (not deduplicated to the zombie)
...would be straightforward and would prevent regressions. The existing test_async_batch_retain.py already tests similar atomic-INSERT scenarios, so the pattern is established.
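A rough shape for that test, with hypothetical fixture names (engine wrapping the memory engine, db_conn a raw test-database connection); the exact signature of _submit_async_operation may differ, and the real test would follow the conventions in test_async_batch_retain.py:

```python
import pytest

@pytest.mark.asyncio
async def test_zombie_does_not_block_new_submission(engine, db_conn):
    bank_id = "bank-with-zombie"
    # Plant a zombie: pending status with no task_payload, as older code could
    # leave behind (other columns assumed to have defaults).
    await db_conn.execute(
        """
        INSERT INTO async_operations (bank_id, operation_type, status, task_payload)
        VALUES ($1, 'consolidation', 'pending', NULL)
        """,
        bank_id,
    )

    op_id, deduplicated = await engine._submit_async_operation(
        bank_id=bank_id,
        operation_type="consolidation",
        dedupe_by_bank=True,
    )

    # The zombie must be bypassed: a fresh row with a real payload is created.
    assert deduplicated is False
    payload = await db_conn.fetchval(
        "SELECT task_payload FROM async_operations WHERE operation_id = $1",
        op_id,
    )
    assert payload is not None
```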
4. Minor: redundant null check
`if null_count and null_count > 0:` is redundant: COUNT(*) never returns None in PostgreSQL; it always returns an integer (0 or more). `if null_count > 0:` is sufficient.
… add tests

Reviewer feedback from PR vectorize-io#1187:

1. Single query for dedup+zombie (was: extra DB round-trip per submission). Before: query existing operations, then count nulls separately. After: SELECT operation_id, (task_payload IS NULL) AS is_zombie, partitioned in Python; one DB round-trip instead of two.
2. Self-healing cleanup (was: detected but never removed). Zombie rows (status='pending', task_payload=NULL) are now marked status='failed' with result_metadata.zombie_cleanup=true instead of accumulating and logging on every subsequent submission.
3. Test coverage added (was: none):
   - test_zombie_does_not_block_new_submission
   - test_valid_pending_blocks_new_submission
   - test_mixed_zombie_and_valid_pending_uses_valid
4. Minor: removed redundant null_count > 0 guard (COUNT(*) in PostgreSQL always returns an integer, never None).
```markdown
## Bug
Consolidation tasks permanently stuck as pending and never claimed by workers after LLM failures.
## Root Cause
The deduplication check in _submit_async_operation() matches existing pending operations by bank_id + operation_type + status='pending' without verifying task_payload IS NOT NULL.
When a corrupted row exists (status='pending', task_payload=NULL, left by older code where the INSERT didn't include task_payload atomically, or by a crash between INSERT and UPDATE), every new submission deduplicates to it while the worker claim SQL rejects it, so nothing is ever processed.

## Observed Symptoms

- Consolidation operations stuck as pending indefinitely (10+ hours observed), backlog growing, zero consolidation LLM calls

## Fix

Two changes in hindsight-api-slim/hindsight_api/engine/memory_engine.py:

1. AND task_payload IS NOT NULL added to the deduplication query so corrupted rows are skipped and fresh operations are created
2. Diagnostic warning log when null-payload zombies are detected
## Verification

- Before: consolidation stuck 10+ hours, 158+ pending items, zero consolidation LLM calls
- After: consolidation resumed within 60 seconds, backlog draining, 3 successful LLM calls within 60 seconds
```