Skip to content

feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36#430

Merged
MegaRedHand merged 2 commits into
mainfrom
aggregator-skip-metric
Jun 10, 2026
Merged

feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36#430
MegaRedHand merged 2 commits into
mainfrom
aggregator-skip-metric

Conversation

@pablodeymo

Copy link
Copy Markdown
Collaborator

🗒️ Description / Motivation

Implements the new cross-client counter lean_aggregator_skipped_total{reason=...} proposed in leanMetrics PR #36, so operators can attribute missed aggregations directly instead of deriving them from coverage gauges or logs.

What Changed

reason When it fires in ethlambda
not_aggregator Every interval-2 tick where the aggregator flag is off — separates "no duty" from genuine misses
other Aggregation jobs the worker never reached because the 750 ms session deadline (or actor shutdown) cancelled it, incremented by the number of dropped jobs
not_synced, missing_state, spawn_failed Never fire — ethlambda has no sync gate on aggregation, no per-target pre-state resolution, and spawn_blocking cannot fail to start. Seeded at zero so fleet dashboards see the full label set
  • crates/blockchain/src/metrics.rs: IntCounterVec registration, reason-list const documenting the never-firing labels, two inc_* helpers, init seeding
  • crates/blockchain/src/lib.rs: count the skip at the interval-2 tick when not an aggregator
  • crates/blockchain/src/aggregation.rs: track attempted vs. total jobs in run_aggregation_worker and count the dropped remainder on cancellation

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • cargo test -p ethlambda-blockchain --lib — 29 passed

…cross-client counter attributes missed aggregations by reason: not_aggregator increments on every interval-2 tick where the aggregator flag is off (separating no-duty slots from genuine misses), and other counts jobs the aggregation worker never reached because the 750ms session deadline or actor shutdown cancelled it. The remaining reasons (not_synced, missing_state, spawn_failed) never fire in ethlambda - aggregation is not gated on sync status, needs no per-target pre-state resolution, and spawn_blocking cannot fail to start - but all five label values are seeded at zero in metrics::init() so fleet-wide dashboards see the complete series from startup.
@github-actions

Copy link
Copy Markdown

🤖 Kimi Code Review

The PR adds comprehensive metrics for aggregator skip reasons, improving observability for cross-client fleet monitoring. The implementation is correct and follows Rust best practices.

Code Quality & Correctness

  • Defensive arithmetic (aggregation.rs:456): The subtraction jobs_total - jobs_attempted is logically safe (attempted ≤ total by construction), but consider jobs_total.saturating_sub(jobs_attempted) as future-proofing against potential refactoring bugs.

  • Type safety: The usize to u64 cast (aggregation.rs:458) is acceptable—realistic job counts will never overflow 32-bit bounds, and ethlambda targets 64-bit platforms.

  • Metrics initialization (metrics.rs:621-623): Pre-seeding all label values in init() ensures Prometheus receives zero-filled series immediately, preventing "metric disappears" artifacts in dashboards. This is correct Prometheus client usage.

Architecture

  • Separation of concerns: The AGGREGATOR_SKIP_REASONS constant with detailed documentation (metrics.rs:519-535) cleanly documents cross-client metric compatibility without cluttering the business logic.

  • Control flow (lib.rs:257-265): Restructuring the interval == 2 logic to explicitly handle the !is_aggregator case eliminates ambiguity and ensures the "not_aggregator" reason is recorded exactly once per missed duty.

Consensus & Security

  • No consensus impact: These changes are strictly observability-side (counters only). They do not affect attestation validation, aggregation bitfields, or fork choice weight calculations.
  • Thread safety: LazyLock-protected statics with the Prometheus client’s internal atomic counters are correctly used for the cross-thread aggregation worker → main thread metrics export.

Nitpicks

  • Documentation consistency: The comment at aggregation.rs:453 refers to "leanMetrics" (CamelCase) while the metric name uses snake_case (lean_aggregator_skipped_total). Consider aligning terminology with the actual metric name.

Verdict: LGTM. The metrics accurately track dropped jobs under cancellation pressure and distinguish "no duty" from genuine operational skips, which is essential for debugging aggregation participation rates in production.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Codex Code Review

  1. Minor: crates/blockchain/src/lib.rs increments lean_aggregator_skipped_total{reason="not_aggregator"} on every interval-2 tick whenever the node is configured as a non-aggregator. That is a metric-semantics mismatch with the help text in crates/blockchain/src/metrics.rs ("Aggregation submissions skipped"): a permanently non-aggregating node will accumulate skips even when it had no local validator duties or no aggregateable inputs. If this series is meant to track missed aggregation opportunities, it should be gated on an actual aggregation opportunity, not just role configuration.

I did not find a consensus-critical regression in this diff. The changes are confined to observability around aggregation; they do not alter fork choice, attestation validation, justification/finalization, state transition, XMSS verification, or SSZ handling.

I could not run the targeted Rust tests in this environment because cargo/rustup failed trying to create temp files under a read-only /home/runner/.rustup.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

Copy link
Copy Markdown

🤖 Claude Code Review

PR #430 Review: lean_aggregator_skipped_total Implementation

Overview: Adds a cross-client Prometheus counter lean_aggregator_skipped_total{reason=...} per leanMetrics PR #36. Three files changed (+71/-3). Clean and well-scoped.


crates/blockchain/src/metrics.rs

The registration pattern matches existing codebase conventions (LazyLock, register_int_counter_vec!, public helper functions). A few notes:

  • The AGGREGATOR_SKIP_REASONS const is well-motivated — it makes the cross-client label contract explicit in one place and prevents the seeding loop and the docblock from drifting apart. Good.
  • Seeding via LEAN_AGGREGATOR_SKIPPED_TOTAL.with_label_values(&[reason]); (discarding the return value) is the correct Prometheus pattern for creating a series at zero without incrementing. No issue.
  • inc_aggregator_skipped_other(count: u64) correctly uses inc_by rather than a loop of inc() calls.
  • Minor: the docblock on AGGREGATOR_SKIP_REASONS re-states information already in the PR description. At the code level, a one-liner like // Cross-client label set; some reasons are seeded-only (never fire in ethlambda). would be lighter without losing the essential context.

crates/blockchain/src/lib.rs

The restructuring from if interval == 2 && is_aggregator to a nested if/else is the right shape for adding the new branch. No behavioral regression. The reformatted emit_agg_start_new_coverage call is rustfmt output, not a style choice. ✓


crates/blockchain/src/aggregation.rs

The counting logic is correct. Tracing through the cases:

  • No cancellation: jobs_attempted == jobs_total, jobs_dropped == 0, guard prevents no-op inc_by(0). ✓
  • Cancel before first job: jobs_attempted == 0, all jobs counted as dropped. ✓
  • Cancel after N jobs: jobs_attempted == N, jobs_dropped == total - N (jobs the loop never entered). ✓
  • Empty job list: 0 - 0 == 0, guard short-circuits. ✓

usizeu64 cast is safe on all platforms (usize::MAX <= u64::MAX). ✓

One observation: jobs_attempted is incremented before the job body executes. This is semantically "jobs the loop reached", not "jobs that completed successfully". Given the metric's stated purpose (counting jobs "the loop never reached"), this is the right placement — a job that starts but is aborted mid-processing would still be counted as attempted, not dropped. The naming is consistent with the intent.

No unit test for the dropped-job counting: the run_aggregation_worker cancellation path has no dedicated test. Given the existing test approach (29 lib unit tests, no explicit aggregation cancellation test), this is in-line with the rest of the codebase. It would be worth adding a test that pre-cancels the token before calling run_aggregation_worker and asserts the counter fires, but this is not a blocker.


Summary

The implementation is correct, follows project conventions (LazyLock registration, function-scoped metrics helpers, tracing log field order), and the seeding strategy for never-firing labels is the right approach for fleet dashboards. The jobs_dropped arithmetic is correct across all cancellation timings.

Findings worth addressing before merge:

  • The docblock on AGGREGATOR_SKIP_REASONS is heavier than CLAUDE.md style guidance calls for. Consider trimming to a single sentence.
  • Consider a unit test for the cancellation path in run_aggregation_worker (pre-cancel the CancellationToken, call the function with N jobs, assert inc_aggregator_skipped_other fires with count N).

Otherwise, this is a clean and correct metrics addition. Approved pending the optional test addition.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Adds the cross-client lean_aggregator_skipped_total{reason=...} Prometheus counter proposed in leanMetrics PR #36, wiring two live increment paths and pre-seeding three never-fire labels at startup.

  • metrics.rs: Registers the IntCounterVec, declares AGGREGATOR_SKIP_REASONS as the canonical label list, seeds all five reasons to zero in init(), and exposes inc_aggregator_skipped_not_aggregator / inc_aggregator_skipped_other.
  • lib.rs: Expands the interval == 2 guard into an if/else so every tick where the node has no aggregation duty increments not_aggregator.
  • aggregation.rs: Tracks jobs_total and jobs_attempted around the per-job loop, then increments other by the number of jobs the loop never reached after a cancellation or actor-gone break.

Confidence Score: 4/5

Safe to merge; the new metric paths are additive and do not touch any consensus or data-persistence logic.

All three files add purely observational counter increments alongside existing logic. The one notable subtlety is in aggregation.rs where a job whose actor.send fails is classified as attempted rather than dropped, leaving an ambiguity against the leanMetrics spec that is worth clarifying but does not affect node correctness.

crates/blockchain/src/aggregation.rs — the jobs_attempted/jobs_dropped boundary around the actor-gone break deserves a second look to confirm it matches the leanMetrics spec intent.

Important Files Changed

Filename Overview
crates/blockchain/src/metrics.rs Registers IntCounterVec with 5-label seed pattern; seeding via discarded with_label_values is correct prometheus-rs idiom. No issues.
crates/blockchain/src/lib.rs Refactors the interval-2 guard from a combined && into an if/else; increments not_aggregator on every tick where the duty flag is off. Correct and minimal.
crates/blockchain/src/aggregation.rs Tracks jobs_total/jobs_attempted around the cancellation loop; drops are correctly counted. Minor semantic nuance around the actor-gone break path worth reviewing.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[interval-2 tick] --> B{is_aggregator?}
    B -- yes --> C[emit_agg_start_new_coverage]
    C --> D[start_aggregation_session\nspawn_blocking run_aggregation_worker]
    B -- no --> E[inc_aggregator_skipped_not_aggregator\nreason=not_aggregator +1]

    D --> F[run_aggregation_worker loop]
    F --> G{cancel.is_cancelled?}
    G -- yes, break --> H[jobs_dropped = jobs_total - jobs_attempted]
    G -- no --> I[jobs_attempted += 1\naggregate_job]
    I --> J{actor.send OK?}
    J -- yes --> G
    J -- no, break --> H
    I --> K{aggregate_job OK?}
    K -- no --> G

    H --> L{jobs_dropped > 0?}
    L -- yes --> M[inc_aggregator_skipped_other\nreason=other +jobs_dropped]
    L -- no --> N[send AggregationDone]
    M --> N
Loading

Comments Outside Diff (1)

  1. crates/blockchain/src/aggregation.rs, line 447-453 (link)

    P2 Actor-gone break undercounts other skips by one job

    When actor.send(AggregateProduced) fails (line 449–453), jobs_attempted was already incremented for the current job, so that job's result is silently discarded without contributing to jobs_dropped. Only the remaining unstarted jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran aggregate_job for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/blockchain/src/aggregation.rs
    Line: 447-453
    
    Comment:
    **Actor-gone break undercounts `other` skips by one job**
    
    When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/aggregation.rs:447-453
**Actor-gone break undercounts `other` skips by one job**

When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.

Reviews (1): Last reviewed commit: "Implement lean_aggregator_skipped_total ..." | Re-trigger Greptile

@pablodeymo pablodeymo requested a review from MegaRedHand June 10, 2026 19:09
@MegaRedHand MegaRedHand merged commit ed2fea0 into main Jun 10, 2026
1 of 2 checks passed
@MegaRedHand MegaRedHand deleted the aggregator-skip-metric branch June 10, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants