
fix(consolidation): skip null-payload operations in deduplication query#1187

Open
CRMbyRSM wants to merge 1 commit into vectorize-io:main from CRMbyRSM:fix/consolidation-null-payload-dedup

Conversation

@CRMbyRSM

Bug

Consolidation tasks permanently stuck as pending and never claimed by workers after LLM failures.

Root Cause

The deduplication check in _submit_async_operation() matches existing pending operations by bank_id + operation_type + status='pending' without verifying task_payload IS NOT NULL.

When a corrupted row exists (status='pending', task_payload=NULL — from older code where the INSERT didn't include task_payload atomically, or from a crash between INSERT and UPDATE):

  1. Zombie consolidation row sits with status='pending', task_payload=NULL
  2. Every new POST /consolidate request deduplicates to the zombie (deduplicated=True)
  3. Worker claim SQL requires task_payload IS NOT NULL → rejects the zombie
  4. Deadlock: no new consolidation can ever be claimed or processed

Observed Symptoms

  • Worker logs: consolidation: total=1 claimable=1 payload_null=0 retry_blocked=0 assigned=0 — task appears claimable but never claimed across 40+ poll cycles
  • LLM metrics: zero scope="consolidation" calls despite retain working fine
  • pending_consolidation grows continuously
  • last_consolidated_at frozen

Fix

Two changes in hindsight-api-slim/hindsight_api/engine/memory_engine.py:

  1. Add `AND task_payload IS NOT NULL` to the deduplication query — skips corrupted zombie rows
  2. Add a diagnostic warning log when null-payload zombies are detected
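The change to the query can be sketched as plain SQL strings. This is a sketch, not the project's actual code: the `async_operations` table name and `$1`/`$2` placeholder style follow the reviewer's example later in this thread.

```python
# Sketch of the dedup query before and after the fix (illustrative names).
DEDUP_QUERY_BEFORE = """
SELECT operation_id
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
"""

# The added predicate makes null-payload zombie rows invisible to dedup,
# so a fresh operation with a proper task_payload is created instead.
DEDUP_QUERY_AFTER = """
SELECT operation_id
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
  AND task_payload IS NOT NULL
"""
```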

Verification
Before: consolidation stuck 10+ hours, 158+ pending items, zero consolidation LLM calls

After: consolidation resumed within 60 seconds, backlog draining, 3 successful LLM calls within 60 seconds

The deduplication check in _submit_async_operation matched pending operations
by bank_id + operation_type + status='pending' without verifying task_payload
was non-null. This caused a deadlock when a corrupted row (null task_payload
from older code where INSERT didn't include task_payload atomically) existed:

1. Zombie consolidation row sits with status='pending', task_payload=NULL
2. Every new consolidation request deduplicates to the zombie (deduplicated=True)
3. Worker claim SQL requires task_payload IS NOT NULL → rejects the zombie
4. Result: no new consolidation can ever be claimed or processed

Fix: Add AND task_payload IS NOT NULL to the deduplication query so corrupted
rows are skipped. New consolidation operations bypass the zombie and create
fresh rows with proper task_payload.

Also adds a diagnostic warning log when null-payload zombies are detected,
helping operators identify and clean up corrupted operations.
Collaborator

@nicoloboschi nicoloboschi left a comment


Review

The core fix (AND task_payload IS NOT NULL in the dedup query) is correct and well-motivated. The PR description and commit message are excellent — clear root cause analysis with observed symptoms. A few things to address before merging:

1. Extra query on every deduplicated submission (should fix)

The zombie-counting query runs on every call when dedupe_by_bank=True, even though null-payload rows are a legacy artifact that will eventually disappear. This adds an unnecessary DB round-trip to every consolidation/dedup submission.

Suggestion: Combine both queries into one. For example:

```sql
SELECT operation_id, (task_payload IS NULL) AS is_zombie
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
```

Then partition results in Python: if any non-zombie exists, dedup to it; if only zombies exist, log the warning. One query instead of two.
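The partition step could look like this pure helper — a sketch, with the row shape `(operation_id, is_zombie)` matching the combined query and the function name being illustrative, not project API:

```python
def partition_pending(rows):
    """Split pending rows into valid operation ids and null-payload zombies.

    rows: iterable of (operation_id, is_zombie) tuples, as returned by the
    combined SELECT. Illustrative helper, not project code.
    """
    valid = [op_id for op_id, is_zombie in rows if not is_zombie]
    zombies = [op_id for op_id, is_zombie in rows if is_zombie]
    return valid, zombies
```

If `valid` is non-empty, dedup to `valid[0]`; if only `zombies` remain, log the warning and fall through to creating a fresh operation.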

2. No cleanup of zombie rows (consider)

The PR detects and logs zombie rows but never cleans them up. They'll sit there forever, triggering the warning log on every submission. Consider adding a simple cleanup — e.g., mark them as failed or delete them — right after the warning. This would make the fix self-healing rather than just diagnostic.

3. No test coverage (should fix)

There's no test for this change. A test that:

  1. Inserts a zombie row (pending, task_payload=NULL)
  2. Calls _submit_async_operation with dedupe_by_bank=True
  3. Asserts a new operation is created (not deduplicated to the zombie)

...would be straightforward and would prevent regressions. The existing test_async_batch_retain.py already tests similar atomic-INSERT scenarios, so the pattern is established.
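A self-contained sketch of that scenario, using an in-memory stand-in for the operations table rather than the project's real fixtures (all names here are illustrative):

```python
# In-memory stand-in for the fixed dedup logic; not the project's API.
def submit_with_dedup(ops, bank_id, op_type, payload):
    """Mimic the fixed dedup: match only pending rows with a non-null payload."""
    for op in ops:
        if (op["bank_id"] == bank_id and op["operation_type"] == op_type
                and op["status"] == "pending"
                and op["task_payload"] is not None):
            return op["operation_id"], True  # deduplicated to existing op
    new_id = f"op-{len(ops) + 1}"
    ops.append({"operation_id": new_id, "bank_id": bank_id,
                "operation_type": op_type, "status": "pending",
                "task_payload": payload})
    return new_id, False  # fresh operation created

# 1. Zombie row: pending but task_payload is NULL
ops = [{"operation_id": "op-1", "bank_id": "b1",
        "operation_type": "consolidation", "status": "pending",
        "task_payload": None}]
# 2./3. Submitting with dedup enabled bypasses the zombie
op_id, deduplicated = submit_with_dedup(ops, "b1", "consolidation", {"k": "v"})
```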

4. Minor: redundant null check

```python
if null_count and null_count > 0:
```

COUNT(*) never returns None in PostgreSQL — it always returns an integer (0 or greater). `if null_count > 0:` is sufficient.

CRMbyRSM pushed a commit to CRMbyRSM/hindsight that referenced this pull request Apr 22, 2026
… add tests

Reviewer feedback from PR vectorize-io#1187:

1. Single query for dedup+zombie (was: extra DB roundtrip per submission)
   Before: query existing, then count nulls separately
   After:  SELECT operation_id, (task_payload IS NULL) AS is_zombie
           — partitions in Python; one DB roundtrip instead of two

2. Self-healing cleanup (was: detected but never removed)
   Zombie rows (status=pending, task_payload=NULL) are now marked
   status='failed' with result_metadata.zombie_cleanup=true instead of
   accumulating and logging on every subsequent submission.

3. Test coverage added (was: none)
   - test_zombie_does_not_block_new_submission
   - test_valid_pending_blocks_new_submission
   - test_mixed_zombie_and_valid_pending_uses_valid

4. Minor: removed redundant null_count > 0 guard (COUNT(*) in PostgreSQL
   always returns an integer, never None)
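The self-healing cleanup in point 2 could be a single UPDATE along these lines — a sketch: treating `result_metadata` as a `jsonb` column is an assumption, and other names follow the review above.

```python
# Sketch of the zombie cleanup: mark null-payload rows failed and tag them,
# so they stop matching dedup and stop triggering the warning log.
# Assumes result_metadata is jsonb; names are illustrative.
ZOMBIE_CLEANUP = """
UPDATE async_operations
SET status = 'failed',
    result_metadata = jsonb_build_object('zombie_cleanup', true)
WHERE bank_id = $1 AND operation_type = $2
  AND status = 'pending' AND task_payload IS NULL
"""
```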