feat(metrics): implement lean_aggregator_skipped_total from leanMetrics PR #36#430
Conversation
…cross-client counter attributes missed aggregations by reason: not_aggregator increments on every interval-2 tick where the aggregator flag is off (separating no-duty slots from genuine misses), and other counts jobs the aggregation worker never reached because the 750ms session deadline or actor shutdown cancelled it. The remaining reasons (not_synced, missing_state, spawn_failed) never fire in ethlambda - aggregation is not gated on sync status, needs no per-target pre-state resolution, and spawn_blocking cannot fail to start - but all five label values are seeded at zero in metrics::init() so fleet-wide dashboards see the complete series from startup.
🤖 Kimi Code ReviewThe PR adds comprehensive metrics for aggregator skip reasons, improving observability for cross-client fleet monitoring. The implementation is correct and follows Rust best practices. Code Quality & Correctness
Architecture
Consensus & Security
Nitpicks
Verdict: LGTM. The metrics accurately track dropped jobs under cancellation pressure and distinguish "no duty" from genuine operational skips, which is essential for debugging aggregation participation rates in production. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code Review
I did not find a consensus-critical regression in this diff. The changes are confined to observability around aggregation; they do not alter fork choice, attestation validation, justification/finalization, state transition, XMSS verification, or SSZ handling. I could not run the targeted Rust tests in this environment because Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewPR #430 Review:
|
Greptile SummaryAdds the cross-client
Confidence Score: 4/5Safe to merge; the new metric paths are additive and do not touch any consensus or data-persistence logic. All three files add purely observational counter increments alongside existing logic. The one notable subtlety is in aggregation.rs where a job whose actor.send fails is classified as attempted rather than dropped, leaving an ambiguity against the leanMetrics spec that is worth clarifying but does not affect node correctness. crates/blockchain/src/aggregation.rs — the jobs_attempted/jobs_dropped boundary around the actor-gone break deserves a second look to confirm it matches the leanMetrics spec intent.
|
| Filename | Overview |
|---|---|
| crates/blockchain/src/metrics.rs | Registers IntCounterVec with 5-label seed pattern; seeding via discarded with_label_values is correct prometheus-rs idiom. No issues. |
| crates/blockchain/src/lib.rs | Refactors the interval-2 guard from a combined && into an if/else; increments not_aggregator on every tick where the duty flag is off. Correct and minimal. |
| crates/blockchain/src/aggregation.rs | Tracks jobs_total/jobs_attempted around the cancellation loop; drops are correctly counted. Minor semantic nuance around the actor-gone break path worth reviewing. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[interval-2 tick] --> B{is_aggregator?}
B -- yes --> C[emit_agg_start_new_coverage]
C --> D[start_aggregation_session\nspawn_blocking run_aggregation_worker]
B -- no --> E[inc_aggregator_skipped_not_aggregator\nreason=not_aggregator +1]
D --> F[run_aggregation_worker loop]
F --> G{cancel.is_cancelled?}
G -- yes, break --> H[jobs_dropped = jobs_total - jobs_attempted]
G -- no --> I[jobs_attempted += 1\naggregate_job]
I --> J{actor.send OK?}
J -- yes --> G
J -- no, break --> H
I --> K{aggregate_job OK?}
K -- no --> G
H --> L{jobs_dropped > 0?}
L -- yes --> M[inc_aggregator_skipped_other\nreason=other +jobs_dropped]
L -- no --> N[send AggregationDone]
M --> N
Comments Outside Diff (1)
-
crates/blockchain/src/aggregation.rs, line 447-453 (link)Actor-gone break undercounts
otherskips by one jobWhen
actor.send(AggregateProduced)fails (line 449–453),jobs_attemptedwas already incremented for the current job, so that job's result is silently discarded without contributing tojobs_dropped. Only the remaining unstarted jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ranaggregate_jobfor it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.Prompt To Fix With AI
This is a comment left during a code review. Path: crates/blockchain/src/aggregation.rs Line: 447-453 Comment: **Actor-gone break undercounts `other` skips by one job** When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
crates/blockchain/src/aggregation.rs:447-453
**Actor-gone break undercounts `other` skips by one job**
When `actor.send(AggregateProduced)` fails (line 449–453), `jobs_attempted` was already incremented for the current job, so that job's result is silently discarded without contributing to `jobs_dropped`. Only the remaining *unstarted* jobs are counted. Depending on the leanMetrics spec intent — whether a "job whose aggregated output could not be delivered" should also be a skip — this can undercount by one per actor-gone event. If "attempted" is meant to cover "we ran `aggregate_job` for it", the current behaviour is correct; if it should only cover "the result was successfully sent", the increment should be deferred until after the send succeeds.
Reviews (1): Last reviewed commit: "Implement lean_aggregator_skipped_total ..." | Re-trigger Greptile
🗒️ Description / Motivation
Implements the new cross-client counter
lean_aggregator_skipped_total{reason=...}proposed in leanMetrics PR #36, so operators can attribute missed aggregations directly instead of deriving them from coverage gauges or logs.What Changed
not_aggregatorothernot_synced,missing_state,spawn_failedspawn_blockingcannot fail to start. Seeded at zero so fleet dashboards see the full label setcrates/blockchain/src/metrics.rs:IntCounterVecregistration, reason-list const documenting the never-firing labels, twoinc_*helpers, init seedingcrates/blockchain/src/lib.rs: count the skip at the interval-2 tick when not an aggregatorcrates/blockchain/src/aggregation.rs: track attempted vs. total jobs inrun_aggregation_workerand count the dropped remainder on cancellation✅ Verification Checklist
make fmt— cleanmake lint(clippy with-D warnings) — cleancargo test -p ethlambda-blockchain --lib— 29 passed