Vault: automated stale pending queue purge #22689
}

func (m *pluginMetrics) trackPendingQueueStaleAutoEmpty(ctx context.Context) {
	if m == nil {
I know we have this elsewhere, but this is a smell -- it means we may be holding a nil pointer and calling methods on it.
In prod we should never have that situation; this struct should always be initialized.
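
To illustrate the concern: Go allows a method call through a nil pointer receiver, so a guard like this hides a missing initialization instead of surfacing it. A minimal sketch, assuming a hypothetical `metrics` type and `newMetrics` constructor (neither is the PR's `pluginMetrics`):

```go
package main

import "fmt"

// metrics is a hypothetical stand-in for a metrics struct; it is not the PR's pluginMetrics.
type metrics struct {
	staleAutoEmpties int
}

// track tolerates a nil receiver. Go permits the call, so a missing
// initialization goes unnoticed and the event is silently dropped.
func (m *metrics) track() {
	if m == nil {
		return
	}
	m.staleAutoEmpties++
}

// newMetrics is a hypothetical constructor. If every caller builds the struct
// through it, no caller holds a nil *metrics and the guard becomes unnecessary.
func newMetrics() *metrics {
	return &metrics{}
}

func main() {
	var uninitialized *metrics
	uninitialized.track() // no panic, but nothing is recorded either

	m := newMetrics()
	m.track()
	fmt.Println(m.staleAutoEmpties)
}
```

Whether to drop the guard or keep it as defense in depth is a judgment call; the sketch only shows why the guard can mask a wiring bug.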
	}

	lag := seqNr - index.WrittenSeqNr
	if threshold <= 0 {
Nit: move this so it's closer to where we get the threshold limit.
At the moment this reads like:
- fetch threshold
- check some other condition not related to threshold
- establish the lag
- check something to do with threshold

// skipStoreBackedPendingQueue reports whether Observation/ValidateObservation should skip
// reading the KV pending queue this round (manual force-empty or stale-queue TTL).
func (r *ReportingPlugin) skipStoreBackedPendingQueue(ctx context.Context, seqNr uint64, readKV *KVStore) (skip bool, reason string, err error) {
Definitely worth running this logic by Chrysa/Kostis to make sure we have it correct.
		return false, "", fmt.Errorf("could not fetch VaultPendingQueueStaleRoundThreshold: %w", err)
	}

	if seqNr < index.WrittenSeqNr {
Please add some comments here to explain the logic in full
We are basically saying that:
- seqNr increases monotonically
- we should always be reading a pending queue that is fairly fresh (by some lag, currently it would be 1 round in the happy path)
- if we're reading from a pending queue that is significantly older than that, then we've failed to write to it in a while, which means the queue is stuck
- therefore this is a candidate for flushing the queue automatically
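
The reasoning above, combined with the precedence rules in the PR description, could be sketched as one commented decision function. This is only an illustration: `skipDecision`, its parameters, and the `int64` threshold type are hypothetical, and treating `seqNr < writtenSeqNr` as "not stale" is an assumption, not necessarily what the PR does.

```go
package main

import "fmt"

// skipDecision is a hypothetical helper, not the PR's skipStoreBackedPendingQueue.
// It reports whether the store-backed pending queue should be skipped this round, and why.
func skipDecision(forceEmpty, autoEmpty bool, seqNr, writtenSeqNr uint64, threshold int64) (bool, string) {
	// The manual escape hatch takes precedence over the automatic path.
	if forceEmpty {
		return true, "force-empty"
	}
	// Gate off, or an unusable threshold: never auto-empty.
	if !autoEmpty || threshold <= 0 {
		return false, ""
	}
	// No watermark yet: wait for the first write after the gate was enabled.
	if writtenSeqNr == 0 {
		return false, ""
	}
	// seqNr increases monotonically, so a watermark ahead of the current round
	// is unexpected. Assumption: treat it as not stale. The check also keeps
	// the uint64 subtraction below from wrapping around.
	if seqNr < writtenSeqNr {
		return false, ""
	}
	// In the happy path the lag is small (about one round). A lag at or above
	// the threshold means the queue has not been written for a while, i.e. it is stuck.
	lag := seqNr - writtenSeqNr
	if lag >= uint64(threshold) {
		return true, "stale"
	}
	return false, ""
}

func main() {
	fmt.Println(skipDecision(false, true, 100, 99, 30)) // fresh queue
	fmt.Println(skipDecision(false, true, 100, 70, 30)) // lag equals the threshold
}
```

Placing the threshold check directly after the gate check also addresses the ordering nit: everything threshold-related sits together before the lag is computed.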





Summary
Adds a gated recovery path for when the vault KV pending queue is stuck and blocking observation quorum, so ops no longer need to leave VaultForceEmptyOCRRounds enabled globally.

Depends on chainlink-common#2104 (StoredPendingQueueIndex.written_seq_nr, CRE settings VaultPendingQueueStaleAutoEmpty and VaultPendingQueueStaleRoundThreshold).

Stale pending queue auto-empty (gated, default off)
- When VaultPendingQueueStaleAutoEmpty is on and seqNr - written_seq_nr >= VaultPendingQueueStaleRoundThreshold (default 30), Observation and ValidateObservation skip store-backed pending items for that round. This has the same effect as force-empty, for the KV queue only, so a successful StateTransition can rewrite/clear stuck entries.
- written_seq_nr is written to the pending queue index only when the gate is enabled, keeping KV index bytes identical to the pre-change {length} serialization while the feature is off (avoids rollout / determinism issues).
- While written_seq_nr == 0, auto-empty does not run until the first post-enable WritePendingQueue sets the watermark.
- VaultForceEmptyOCRRounds is unchanged and still takes precedence when enabled (manual escape hatch).

Local queue hardening (ungated)
- prepareObservationPendingQueueBlobs skips a single oversize local-queue blob instead of failing the whole observation.

Observability