
fix(consolidation): skip null-payload operations in deduplication query#1187

Open
CRMbyRSM wants to merge 1 commit into vectorize-io:main from CRMbyRSM:fix/consolidation-null-payload-dedup

Conversation

@CRMbyRSM

Bug

Consolidation tasks permanently stuck as pending and never claimed by workers after LLM failures.

Root Cause

The deduplication check in _submit_async_operation() matches existing pending operations by bank_id + operation_type + status='pending' without verifying task_payload IS NOT NULL.

When a corrupted row exists (status='pending', task_payload=NULL — from older code where the INSERT didn't include task_payload atomically, or from a crash between INSERT and UPDATE):

  1. Zombie consolidation row sits with status='pending', task_payload=NULL
  2. Every new POST /consolidate request deduplicates to the zombie (deduplicated=True)
  3. Worker claim SQL requires task_payload IS NOT NULL → rejects the zombie
  4. Deadlock: no new consolidation can ever be claimed or processed

Observed Symptoms

  • Worker logs: consolidation: total=1 claimable=1 payload_null=0 retry_blocked=0 assigned=0 — task appears claimable but never claimed across 40+ poll cycles
  • LLM metrics: zero scope="consolidation" calls despite retain working fine
  • pending_consolidation grows continuously
  • last_consolidated_at frozen

Fix

Two changes in hindsight-api-slim/hindsight_api/engine/memory_engine.py:

  1. Add `AND task_payload IS NOT NULL` to the deduplication query — skips corrupted zombie rows
  2. Add a diagnostic warning log when null-payload zombies are detected
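The change to the query can be sketched as plain SQL strings. This is a sketch, not the project's actual code: the `async_operations` table name and `$1`/`$2` placeholder style follow the reviewer's example later in this thread.

```python
# Sketch of the dedup query before and after the fix (illustrative names).
DEDUP_QUERY_BEFORE = """
SELECT operation_id
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
"""

# The added predicate makes null-payload zombie rows invisible to dedup,
# so a fresh operation with a proper task_payload is created instead.
DEDUP_QUERY_AFTER = """
SELECT operation_id
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
  AND task_payload IS NOT NULL
"""
```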

Verification
Before: consolidation stuck 10+ hours, 158+ pending items, zero consolidation LLM calls

After: consolidation resumed within 60 seconds, backlog draining, 3 successful LLM calls within 60 seconds

The deduplication check in _submit_async_operation matched pending operations
by bank_id + operation_type + status='pending' without verifying task_payload
was non-null. This caused a deadlock when a corrupted row (null task_payload
from older code where INSERT didn't include task_payload atomically) existed:

1. Zombie consolidation row sits with status='pending', task_payload=NULL
2. Every new consolidation request deduplicates to the zombie (deduplicated=True)
3. Worker claim SQL requires task_payload IS NOT NULL → rejects the zombie
4. Result: no new consolidation can ever be claimed or processed

Fix: Add AND task_payload IS NOT NULL to the deduplication query so corrupted
rows are skipped. New consolidation operations bypass the zombie and create
fresh rows with proper task_payload.

Also adds a diagnostic warning log when null-payload zombies are detected,
helping operators identify and clean up corrupted operations.
Collaborator

@nicoloboschi nicoloboschi left a comment


Review

The core fix (AND task_payload IS NOT NULL in the dedup query) is correct and well-motivated. The PR description and commit message are excellent — clear root cause analysis with observed symptoms. A few things to address before merging:

1. Extra query on every deduplicated submission (should fix)

The zombie-counting query runs on every call when dedupe_by_bank=True, even though null-payload rows are a legacy artifact that will eventually disappear. This adds an unnecessary DB round-trip to every consolidation/dedup submission.

Suggestion: Combine both queries into one. For example:

```sql
SELECT operation_id, (task_payload IS NULL) AS is_zombie
FROM async_operations
WHERE bank_id = $1 AND operation_type = $2 AND status = 'pending'
```

Then partition results in Python: if any non-zombie exists, dedup to it; if only zombies exist, log the warning. One query instead of two.
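The partition step could look like this pure helper — a sketch, with the row shape `(operation_id, is_zombie)` matching the combined query and the function name being illustrative, not project API:

```python
def partition_pending(rows):
    """Split pending rows into valid operation ids and null-payload zombies.

    rows: iterable of (operation_id, is_zombie) tuples, as returned by the
    combined SELECT. Illustrative helper, not project code.
    """
    valid = [op_id for op_id, is_zombie in rows if not is_zombie]
    zombies = [op_id for op_id, is_zombie in rows if is_zombie]
    return valid, zombies
```

If `valid` is non-empty, dedup to `valid[0]`; if only `zombies` remain, log the warning and fall through to creating a fresh operation.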

2. No cleanup of zombie rows (consider)

The PR detects and logs zombie rows but never cleans them up. They'll sit there forever, triggering the warning log on every submission. Consider adding a simple cleanup — e.g., mark them as failed or delete them — right after the warning. This would make the fix self-healing rather than just diagnostic.

3. No test coverage (should fix)

There's no test for this change. A test that:

  1. Inserts a zombie row (pending, task_payload=NULL)
  2. Calls _submit_async_operation with dedupe_by_bank=True
  3. Asserts a new operation is created (not deduplicated to the zombie)

...would be straightforward and would prevent regressions. The existing test_async_batch_retain.py already tests similar atomic-INSERT scenarios, so the pattern is established.
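A self-contained sketch of that scenario, using an in-memory stand-in for the operations table rather than the project's real fixtures (all names here are illustrative):

```python
# In-memory stand-in for the fixed dedup logic; not the project's API.
def submit_with_dedup(ops, bank_id, op_type, payload):
    """Mimic the fixed dedup: match only pending rows with a non-null payload."""
    for op in ops:
        if (op["bank_id"] == bank_id and op["operation_type"] == op_type
                and op["status"] == "pending"
                and op["task_payload"] is not None):
            return op["operation_id"], True  # deduplicated to existing op
    new_id = f"op-{len(ops) + 1}"
    ops.append({"operation_id": new_id, "bank_id": bank_id,
                "operation_type": op_type, "status": "pending",
                "task_payload": payload})
    return new_id, False  # fresh operation created

# 1. Zombie row: pending but task_payload is NULL
ops = [{"operation_id": "op-1", "bank_id": "b1",
        "operation_type": "consolidation", "status": "pending",
        "task_payload": None}]
# 2./3. Submitting with dedup enabled bypasses the zombie
op_id, deduplicated = submit_with_dedup(ops, "b1", "consolidation", {"k": "v"})
```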

4. Minor: redundant null check

```python
if null_count and null_count > 0:
```

COUNT(*) never returns None in PostgreSQL — it always returns an integer (0 or greater). `if null_count > 0:` is sufficient.

CRMbyRSM pushed a commit to CRMbyRSM/hindsight that referenced this pull request Apr 22, 2026
… add tests

Reviewer feedback from PR vectorize-io#1187:

1. Single query for dedup+zombie (was: extra DB roundtrip per submission)
   Before: query existing, then count nulls separately
   After:  SELECT operation_id, (task_payload IS NULL) AS is_zombie
           — partitions in Python; one DB roundtrip instead of two

2. Self-healing cleanup (was: detected but never removed)
   Zombie rows (status=pending, task_payload=NULL) are now marked
   status='failed' with result_metadata.zombie_cleanup=true instead of
   accumulating and logging on every subsequent submission.

3. Test coverage added (was: none)
   - test_zombie_does_not_block_new_submission
   - test_valid_pending_blocks_new_submission
   - test_mixed_zombie_and_valid_pending_uses_valid

4. Minor: removed redundant null_count > 0 guard (COUNT(*) in PostgreSQL
   always returns an integer, never None)
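The self-healing cleanup in point 2 could be a single UPDATE along these lines — a sketch: treating `result_metadata` as a `jsonb` column is an assumption, and other names follow the review above.

```python
# Sketch of the zombie cleanup: mark null-payload rows failed and tag them,
# so they stop matching dedup and stop triggering the warning log.
# Assumes result_metadata is jsonb; names are illustrative.
ZOMBIE_CLEANUP = """
UPDATE async_operations
SET status = 'failed',
    result_metadata = jsonb_build_object('zombie_cleanup', true)
WHERE bank_id = $1 AND operation_type = $2
  AND status = 'pending' AND task_payload IS NULL
"""
```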