Skip to content

Vault: automated stale pending queue purge#22689

Draft
russell-stern wants to merge 4 commits into
developfrom
find_bad_requests
Draft

Vault: automated stale pending queue purge#22689
russell-stern wants to merge 4 commits into
developfrom
find_bad_requests

Conversation

@russell-stern
Copy link
Copy Markdown
Contributor

Summary

Adds a gated recovery path when the vault KV pending queue is stuck and blocking observation quorum, so ops no longer need to leave VaultForceEmptyOCRRounds enabled globally.

Depends on chainlink-common#2104 (StoredPendingQueueIndex.written_seq_nr, CRE settings VaultPendingQueueStaleAutoEmpty and VaultPendingQueueStaleRoundThreshold).

Stale pending queue auto-empty (gated, default off)

  • When VaultPendingQueueStaleAutoEmpty is on and seqNr - written_seq_nr >= VaultPendingQueueStaleRoundThreshold (default 30), Observation and ValidateObservation skip store-backed pending items for that round—same effect as force-empty for the KV queue only, so a successful StateTransition can rewrite/clear stuck entries.
  • written_seq_nr is written to the pending queue index only when the gate is enabled, keeping KV index bytes identical to pre-change {length} serialization while the feature is off (avoids rollout / determinism issues).
  • Backwards compatibility: if the queue is non-empty but written_seq_nr == 0, auto-empty does not run until the first post-enable WritePendingQueue sets the watermark.
  • VaultForceEmptyOCRRounds is unchanged and still takes precedence when enabled (manual escape hatch).

Local queue hardening (ungated)

  • prepareObservationPendingQueueBlobs skips a single oversize local-queue blob instead of failing the whole observation.

Observability

  • Beholder counters for stale auto-empty and skipped local blobs.

@github-actions
Copy link
Copy Markdown
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

✅ No conflicts with other open PRs targeting develop

@trunk-io
Copy link
Copy Markdown

trunk-io Bot commented May 29, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

Failed Test Failure Summary Logs
TestHandler_HandleGatewayMessage/capability_execute_success The test failed because the capability request parameters were invalid due to missing enclave signer configuration, causing a signing failure. Logs ↗︎
TestHandler_HandleGatewayMessage The test 'TestHandler_HandleGatewayMessage' failed during execution. Logs ↗︎

View Full Report ↗︎Docs

@cl-sonarqube-production
Copy link
Copy Markdown

}

func (m *pluginMetrics) trackPendingQueueStaleAutoEmpty(ctx context.Context) {
if m == nil {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we have this elsewhere, but this is a smell -- it means we're potentially instantiating a nil pointer and calling methods on it

In prod we should never have that situation; this struct should always be initialized

}

lag := seqNr - index.WrittenSeqNr
if threshold <= 0 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move this so it's closer to where we get the threshold limit

Atm this reads like:

  • fetch threshold
  • check some other condition not related to threshold
  • establish the lag
  • check something to do with threshold


// skipStoreBackedPendingQueue reports whether Observation/ValidateObservation should skip
// reading the KV pending queue this round (manual force-empty or stale-queue TTL).
func (r *ReportingPlugin) skipStoreBackedPendingQueue(ctx context.Context, seqNr uint64, readKV *KVStore) (skip bool, reason string, err error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely worth running this logic by Chrysa/Kostis to make sure we have it correct.

return false, "", fmt.Errorf("could not fetch VaultPendingQueueStaleRoundThreshold: %w", err)
}

if seqNr < index.WrittenSeqNr {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some comments here to explain the logic in full

We are basically saying that:

  • seqNr increases monotonically
  • we should always be reading a pending queue that is fairly fresh (by some lag, currently it would be 1 round in the happy path)
  • if we're reading from a pending queue that is significantly older than that, then we've failed to write to it in a while, which means the queue is stuck
  • therefore this is a candidate for flushing the queue automatically

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants