Vault: automated stale pending queue purge by russell-stern · Pull Request #22689 · smartcontractkit/chainlink

russell-stern · 2026-05-29T19:06:11Z

Summary

Adds a gated recovery path when the vault KV pending queue is stuck and blocking observation quorum, so ops no longer need to leave VaultForceEmptyOCRRounds enabled globally.

Depends on chainlink-common#2104 (StoredPendingQueueIndex.written_seq_nr, CRE settings VaultPendingQueueStaleAutoEmpty and VaultPendingQueueStaleRoundThreshold).

Stale pending queue auto-empty (gated, default off)

When VaultPendingQueueStaleAutoEmpty is on and seqNr - written_seq_nr >= VaultPendingQueueStaleRoundThreshold (default 30), Observation and ValidateObservation skip store-backed pending items for that round—same effect as force-empty for the KV queue only, so a successful StateTransition can rewrite/clear stuck entries.
written_seq_nr is written to the pending queue index only when the gate is enabled, keeping KV index bytes identical to pre-change {length} serialization while the feature is off (avoids rollout / determinism issues).
Backwards compatibility: if the queue is non-empty but written_seq_nr == 0, auto-empty does not run until the first post-enable WritePendingQueue sets the watermark.
VaultForceEmptyOCRRounds is unchanged and still takes precedence when enabled (manual escape hatch).

Local queue hardening (ungated)

prepareObservationPendingQueueBlobs skips a single oversize local-queue blob instead of failing the whole observation.

Observability

Beholder counters for stale auto-empty and skipped local blobs.

github-actions · 2026-05-29T19:07:31Z

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

#added For any new functionality added.
#breaking_change For any functionality that requires manual action for the node to boot.
#bugfix For bug fixes.
#changed For any change to the existing functionality.
#db_update For any feature that introduces updates to database schema.
#deprecation_notice For any upcoming deprecation functionality.
#internal For changesets that need to be excluded from the final changelog.
#nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
#removed For any functionality/config that is removed.
#updated For any functionality that is updated.
#wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

github-actions · 2026-05-29T19:08:02Z

✅ No conflicts with other open PRs targeting develop

trunk-io · 2026-05-29T19:24:52Z

Failed Test	Failure Summary	Logs
`TestHandler_HandleGatewayMessage/capability_execute_success`	The test failed because the capability request parameters were invalid due to missing enclave signer configuration, causing a signing failure.	Logs ↗︎
`TestHandler_HandleGatewayMessage`	The test 'TestHandler_HandleGatewayMessage' failed during execution.	Logs ↗︎

_{View Full Report ↗︎ ⋅ Docs}

cl-sonarqube-production · 2026-06-02T02:16:25Z

Quality Gate passed

Issues
2 New issues
2 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

cedric-cordenier · 2026-06-05T12:20:39Z

 }
+
+func (m *pluginMetrics) trackPendingQueueStaleAutoEmpty(ctx context.Context) {
+	if m == nil {


I know we have this elsewhere, but this is a smell -- it means we're potentially instantiating a nil pointer and calling methods on it

In prod we should never have that situation; this struct should always be initialized

cedric-cordenier · 2026-06-05T12:24:27Z

+	}
+
+	lag := seqNr - index.WrittenSeqNr
+	if threshold <= 0 {


Nit: move this so it's closer to where we get the threshold limit

Atm this reads like:

fetch threshold

check some other condition not related to threshold

establish the lag

check something to do with threshold

cedric-cordenier · 2026-06-05T12:27:49Z

+
+// skipStoreBackedPendingQueue reports whether Observation/ValidateObservation should skip
+// reading the KV pending queue this round (manual force-empty or stale-queue TTL).
+func (r *ReportingPlugin) skipStoreBackedPendingQueue(ctx context.Context, seqNr uint64, readKV *KVStore) (skip bool, reason string, err error) {


Definitely worth running this logic by Chrysa/Kostis to make sure we have it correct.

cedric-cordenier · 2026-06-05T12:30:07Z

+		return false, "", fmt.Errorf("could not fetch VaultPendingQueueStaleRoundThreshold: %w", err)
+	}
+
+	if seqNr < index.WrittenSeqNr {


Please add some comments here to explain the logic in full

We are basically saying that:

seqNr increases monotonically

we should always be reading a pending queue that is fairly fresh (by some lag, currently it would be 1 round in the happy path)

if we're reading from a pending queue that is significantly older than that, then we've failed to write to it in a while, which means the queue is stuck

therefore this is a candidate for flushing the queue automatically

russell-stern added 2 commits May 29, 2026 14:50

Added an automated pending queue purge in vault

9bea202

Put writing seqNr behind the flag

14af338

russell-stern added 2 commits May 29, 2026 15:31

Fixed linting

5b18946

Bump go mod

8ae1c47

cedric-cordenier reviewed Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vault: automated stale pending queue purge#22689

Vault: automated stale pending queue purge#22689
russell-stern wants to merge 4 commits into
developfrom
find_bad_requests

russell-stern commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026 •

edited

Loading

Uh oh!

trunk-io Bot commented May 29, 2026 •

edited

Loading

Uh oh!

cl-sonarqube-production Bot commented Jun 2, 2026

Uh oh!

cedric-cordenier Jun 5, 2026

Uh oh!

cedric-cordenier Jun 5, 2026

Uh oh!

cedric-cordenier Jun 5, 2026

Uh oh!

cedric-cordenier Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

russell-stern commented May 29, 2026

Summary

Stale pending queue auto-empty (gated, default off)

Local queue hardening (ungated)

Observability

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

trunk-io Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cl-sonarqube-production Bot commented Jun 2, 2026

Quality Gate passed

Uh oh!

cedric-cordenier Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cedric-cordenier Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cedric-cordenier Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

cedric-cordenier Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 29, 2026 •

edited

Loading

trunk-io Bot commented May 29, 2026 •

edited

Loading