feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36 by pablodeymo · Pull Request #430 · lambdaclass/ethlambda

pablodeymo · 2026-06-10T18:53:01Z

🗒️ Description / Motivation

Implements the new cross-client counter lean_aggregator_skipped_total{reason=...} proposed in leanMetrics PR #36, so operators can attribute missed aggregations directly instead of deriving them from coverage gauges or logs.

What Changed

reason	When it fires in ethlambda
`not_aggregator`	Every interval-2 tick where the aggregator flag is off — separates "no duty" from genuine misses
`other`	Aggregation jobs the worker never reached because the 750 ms session deadline (or actor shutdown) cancelled it, incremented by the number of dropped jobs
`not_synced`, `missing_state`, `spawn_failed`	Never fire — ethlambda has no sync gate on aggregation, no per-target pre-state resolution, and `spawn_blocking` cannot fail to start. Seeded at zero so fleet dashboards see the full label set

crates/blockchain/src/metrics.rs: IntCounterVec registration, reason-list const documenting the never-firing labels, two inc_* helpers, init seeding
crates/blockchain/src/lib.rs: count the skip at the interval-2 tick when not an aggregator
crates/blockchain/src/aggregation.rs: track attempted vs. total jobs in run_aggregation_worker and count the dropped remainder on cancellation

✅ Verification Checklist

Ran make fmt — clean
Ran make lint (clippy with -D warnings) — clean
cargo test -p ethlambda-blockchain --lib — 29 passed

…cross-client counter attributes missed aggregations by reason: not_aggregator increments on every interval-2 tick where the aggregator flag is off (separating no-duty slots from genuine misses), and other counts jobs the aggregation worker never reached because the 750ms session deadline or actor shutdown cancelled it. The remaining reasons (not_synced, missing_state, spawn_failed) never fire in ethlambda - aggregation is not gated on sync status, needs no per-target pre-state resolution, and spawn_blocking cannot fail to start - but all five label values are seeded at zero in metrics::init() so fleet-wide dashboards see the complete series from startup.

github-actions · 2026-06-10T18:53:55Z

🤖 Kimi Code Review

The PR adds comprehensive metrics for aggregator skip reasons, improving observability for cross-client fleet monitoring. The implementation is correct and follows Rust best practices.

Code Quality & Correctness

Defensive arithmetic (aggregation.rs:456): The subtraction jobs_total - jobs_attempted is logically safe (attempted ≤ total by construction), but consider jobs_total.saturating_sub(jobs_attempted) as future-proofing against potential refactoring bugs.
Type safety: The usize to u64 cast (aggregation.rs:458) is acceptable—realistic job counts will never overflow 32-bit bounds, and ethlambda targets 64-bit platforms.
Metrics initialization (metrics.rs:621-623): Pre-seeding all label values in init() ensures Prometheus receives zero-filled series immediately, preventing "metric disappears" artifacts in dashboards. This is correct Prometheus client usage.

Architecture

Separation of concerns: The AGGREGATOR_SKIP_REASONS constant with detailed documentation (metrics.rs:519-535) cleanly documents cross-client metric compatibility without cluttering the business logic.
Control flow (lib.rs:257-265): Restructuring the interval == 2 logic to explicitly handle the !is_aggregator case eliminates ambiguity and ensures the "not_aggregator" reason is recorded exactly once per missed duty.

Consensus & Security

No consensus impact: These changes are strictly observability-side (counters only). They do not affect attestation validation, aggregation bitfields, or fork choice weight calculations.
Thread safety: LazyLock-protected statics with the Prometheus client’s internal atomic counters are correctly used for the cross-thread aggregation worker → main thread metrics export.

Nitpicks

Documentation consistency: The comment at aggregation.rs:453 refers to "leanMetrics" (CamelCase) while the metric name uses snake_case (lean_aggregator_skipped_total). Consider aligning terminology with the actual metric name.

Verdict: LGTM. The metrics accurately track dropped jobs under cancellation pressure and distinguish "no duty" from genuine operational skips, which is essential for debugging aggregation participation rates in production.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-06-10T18:54:48Z

🤖 Codex Code Review

Minor: crates/blockchain/src/lib.rs increments lean_aggregator_skipped_total{reason="not_aggregator"} on every interval-2 tick whenever the node is configured as a non-aggregator. That is a metric-semantics mismatch with the help text in crates/blockchain/src/metrics.rs ("Aggregation submissions skipped"): a permanently non-aggregating node will accumulate skips even when it had no local validator duties or no aggregateable inputs. If this series is meant to track missed aggregation opportunities, it should be gated on an actual aggregation opportunity, not just role configuration.

I did not find a consensus-critical regression in this diff. The changes are confined to observability around aggregation; they do not alter fork choice, attestation validation, justification/finalization, state transition, XMSS verification, or SSZ handling.

I could not run the targeted Rust tests in this environment because cargo/rustup failed trying to create temp files under a read-only /home/runner/.rustup.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

github-actions · 2026-06-10T18:55:06Z

🤖 Claude Code Review

PR #430 Review: `lean_aggregator_skipped_total` Implementation

Overview: Adds a cross-client Prometheus counter lean_aggregator_skipped_total{reason=...} per leanMetrics PR #36. Three files changed (+71/-3). Clean and well-scoped.

`crates/blockchain/src/metrics.rs`

The registration pattern matches existing codebase conventions (LazyLock, register_int_counter_vec!, public helper functions). A few notes:

The AGGREGATOR_SKIP_REASONS const is well-motivated — it makes the cross-client label contract explicit in one place and prevents the seeding loop and the docblock from drifting apart. Good.
Seeding via LEAN_AGGREGATOR_SKIPPED_TOTAL.with_label_values(&[reason]); (discarding the return value) is the correct Prometheus pattern for creating a series at zero without incrementing. No issue.
inc_aggregator_skipped_other(count: u64) correctly uses inc_by rather than a loop of inc() calls.
Minor: the docblock on AGGREGATOR_SKIP_REASONS re-states information already in the PR description. At the code level, a one-liner like // Cross-client label set; some reasons are seeded-only (never fire in ethlambda). would be lighter without losing the essential context.

`crates/blockchain/src/lib.rs`

The restructuring from if interval == 2 && is_aggregator to a nested if/else is the right shape for adding the new branch. No behavioral regression. The reformatted emit_agg_start_new_coverage call is rustfmt output, not a style choice. ✓

`crates/blockchain/src/aggregation.rs`

The counting logic is correct. Tracing through the cases:

No cancellation: jobs_attempted == jobs_total, jobs_dropped == 0, guard prevents no-op inc_by(0). ✓
Cancel before first job: jobs_attempted == 0, all jobs counted as dropped. ✓
Cancel after N jobs: jobs_attempted == N, jobs_dropped == total - N (jobs the loop never entered). ✓
Empty job list: 0 - 0 == 0, guard short-circuits. ✓

usize → u64 cast is safe on all platforms (usize::MAX <= u64::MAX). ✓

One observation: jobs_attempted is incremented before the job body executes. This is semantically "jobs the loop reached", not "jobs that completed successfully". Given the metric's stated purpose (counting jobs "the loop never reached"), this is the right placement — a job that starts but is aborted mid-processing would still be counted as attempted, not dropped. The naming is consistent with the intent.

No unit test for the dropped-job counting: the run_aggregation_worker cancellation path has no dedicated test. Given the existing test approach (29 lib unit tests, no explicit aggregation cancellation test), this is in-line with the rest of the codebase. It would be worth adding a test that pre-cancels the token before calling run_aggregation_worker and asserts the counter fires, but this is not a blocker.

Summary

The implementation is correct, follows project conventions (LazyLock registration, function-scoped metrics helpers, tracing log field order), and the seeding strategy for never-firing labels is the right approach for fleet dashboards. The jobs_dropped arithmetic is correct across all cancellation timings.

Findings worth addressing before merge:

The docblock on AGGREGATOR_SKIP_REASONS is heavier than CLAUDE.md style guidance calls for. Consider trimming to a single sentence.
Consider a unit test for the cancellation path in run_aggregation_worker (pre-cancel the CancellationToken, call the function with N jobs, assert inc_aggregator_skipped_other fires with count N).

Otherwise, this is a clean and correct metrics addition. Approved pending the optional test addition.

Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps · 2026-06-10T18:56:46Z

Greptile Summary

Adds the cross-client lean_aggregator_skipped_total{reason=...} Prometheus counter proposed in leanMetrics PR #36, wiring two live increment paths and pre-seeding three never-fire labels at startup.

metrics.rs: Registers the IntCounterVec, declares AGGREGATOR_SKIP_REASONS as the canonical label list, seeds all five reasons to zero in init(), and exposes inc_aggregator_skipped_not_aggregator / inc_aggregator_skipped_other.
lib.rs: Expands the interval == 2 guard into an if/else so every tick where the node has no aggregation duty increments not_aggregator.
aggregation.rs: Tracks jobs_total and jobs_attempted around the per-job loop, then increments other by the number of jobs the loop never reached after a cancellation or actor-gone break.

Confidence Score: 4/5

Safe to merge; the new metric paths are additive and do not touch any consensus or data-persistence logic.

All three files add purely observational counter increments alongside existing logic. The one notable subtlety is in aggregation.rs where a job whose actor.send fails is classified as attempted rather than dropped, leaving an ambiguity against the leanMetrics spec that is worth clarifying but does not affect node correctness.

crates/blockchain/src/aggregation.rs — the jobs_attempted/jobs_dropped boundary around the actor-gone break deserves a second look to confirm it matches the leanMetrics spec intent.

Important Files Changed

Filename	Overview
crates/blockchain/src/metrics.rs	Registers IntCounterVec with 5-label seed pattern; seeding via discarded with_label_values is correct prometheus-rs idiom. No issues.
crates/blockchain/src/lib.rs	Refactors the interval-2 guard from a combined && into an if/else; increments not_aggregator on every tick where the duty flag is off. Correct and minimal.
crates/blockchain/src/aggregation.rs	Tracks jobs_total/jobs_attempted around the cancellation loop; drops are correctly counted. Minor semantic nuance around the actor-gone break path worth reviewing.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[interval-2 tick] --> B{is_aggregator?}
    B -- yes --> C[emit_agg_start_new_coverage]
    C --> D[start_aggregation_session\nspawn_blocking run_aggregation_worker]
    B -- no --> E[inc_aggregator_skipped_not_aggregator\nreason=not_aggregator +1]

    D --> F[run_aggregation_worker loop]
    F --> G{cancel.is_cancelled?}
    G -- yes, break --> H[jobs_dropped = jobs_total - jobs_attempted]
    G -- no --> I[jobs_attempted += 1\naggregate_job]
    I --> J{actor.send OK?}
    J -- yes --> G
    J -- no, break --> H
    I --> K{aggregate_job OK?}
    K -- no --> G

    H --> L{jobs_dropped > 0?}
    L -- yes --> M[inc_aggregator_skipped_other\nreason=other +jobs_dropped]
    L -- no --> N[send AggregationDone]
    M --> N

Comments Outside Diff (1)

crates/blockchain/src/aggregation.rs, line 447-453 (link)

Actor-gone break undercounts other skips by one job

When actor.send(AggregateProduced) fails (line 449–453), jobs_attempted was already incremented for the current job, so that job's result is silently discarded without contributing to jobs_dropped. Only the remaining unstarted jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran aggregate_job for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/blockchain/src/aggregation.rs
Line: 447-453

Comment:
**Actor-gone break undercounts `other` skips by one job**

When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/aggregation.rs:447-453
**Actor-gone break undercounts `other` skips by one job**

When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.

_{Reviews (1): Last reviewed commit: "Implement lean_aggregator_skipped_total ..." | Re-trigger Greptile}

pablodeymo requested a review from MegaRedHand June 10, 2026 19:09

Merge branch 'main' into aggregator-skip-metric

607dcf9

MegaRedHand approved these changes Jun 10, 2026

View reviewed changes

MegaRedHand merged commit ed2fea0 into main Jun 10, 2026
1 of 2 checks passed

MegaRedHand deleted the aggregator-skip-metric branch June 10, 2026 19:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36#430

feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36#430
MegaRedHand merged 2 commits into
mainfrom
aggregator-skip-metric

pablodeymo commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading

Important Files Changed

Comments Outside Diff (1)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pablodeymo commented Jun 10, 2026

🗒️ Description / Motivation

What Changed

✅ Verification Checklist

Uh oh!

github-actions Bot commented Jun 10, 2026

🤖 Kimi Code Review

Uh oh!

github-actions Bot commented Jun 10, 2026

🤖 Codex Code Review

Uh oh!

github-actions Bot commented Jun 10, 2026

🤖 Claude Code Review

PR #430 Review: lean_aggregator_skipped_total Implementation

crates/blockchain/src/metrics.rs

crates/blockchain/src/lib.rs

crates/blockchain/src/aggregation.rs

Summary

Uh oh!

greptile-apps Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PR #430 Review: `lean_aggregator_skipped_total` Implementation

`crates/blockchain/src/metrics.rs`

`crates/blockchain/src/lib.rs`

`crates/blockchain/src/aggregation.rs`

greptile-apps Bot commented Jun 10, 2026 •

edited

Loading