feat(curtailment): StartCurtailment + dispatch + reconciler#192
feat(curtailment): StartCurtailment + dispatch + reconciler#192rongxin-liu wants to merge 27 commits intomainfrom
Conversation
Adds the foundation primitives a curtailment Start RPC and reconciler will dispatch through: - proto CommandType: COMMAND_TYPE_CURTAIL, COMMAND_TYPE_UNCURTAIL - commandtype.Curtail, commandtype.Uncurtail with String/FromString round-trip - session.ActorCurtailment for self-originated traffic so future command-preflight filters can bypass curtailment-active gating - dto.CurtailPayload carrying the curtailment level - capability mapping for the new command types (OR-relationship across CapabilityCurtailFull and CapabilityCurtailEfficiency, matching the DeviceCurtailment optional interface) - executeCommandOnDevice dispatch arms invoking minerInfo.Curtail and minerInfo.Uncurtail with the SDK request shapes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Unconfirmed curtailment dispatches can lock devices indefinitely
[MEDIUM] Reconciler has no cross-process ownership or state-guarded updates
[MEDIUM] Non-admin callers can set effectively unbounded curtailment durations
[MEDIUM] Shutdown deadline does not actually bound reconciler stop
NotesStartCurtailment is wired with Generated by Codex Security Review | |
There was a problem hiding this comment.
Pull request overview
Adds foundational curtailment command plumbing across the proto surface, server command domain, and execution path so upcoming StartCurtailment + reconciler work can dispatch Curtail/Uncurtail through the existing queue + capability-check framework.
Changes:
- Extend
minercommand.v1.CommandTypewithCOMMAND_TYPE_CURTAILandCOMMAND_TYPE_UNCURTAIL(and regenerate Go/TS outputs). - Add Go domain enums + DTO payload (
commandtype.Curtail/Uncurtail,dto.CurtailPayload) and hook command execution dispatch tominerInfo.Curtail/Uncurtail. - Add session attribution (
session.ActorCurtailment) and update capability mapping/tests for the new command types.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| server/internal/domain/session/models.go | Adds ActorCurtailment session actor for self-originated curtailment traffic attribution. |
| server/internal/domain/session/models_test.go | Pins actor string labels and ensures new actor doesn’t collide with existing values. |
| server/internal/domain/miner/dto/command_dto.go | Introduces CurtailPayload for queue message payloads (curtailment level). |
| server/internal/domain/commandtype/enum.go | Adds Curtail/Uncurtail command types with String/FromString support. |
| server/internal/domain/commandtype/enum_test.go | Adds round-trip and label-stability tests for command type string conversions. |
| server/internal/domain/command/service.go | Extends activity event mapping for curtail / uncurtail. |
| server/internal/domain/command/execution_service.go | Adds execution dispatch arms that call SDK curtailment methods. |
| server/internal/domain/command/execution_service_test.go | Adds unit tests verifying curtail/un-curtail dispatch and payload unmarshalling behavior. |
| server/internal/domain/command/capability_mapping.go | Maps new proto command types to curtailment capability set (OR semantics). |
| server/internal/domain/command/capability_checker_test.go | Ensures new command types are included in capability mapping + RequiresCapabilityCheck. |
| server/generated/grpc/minercommand/v1/command.pb.go | Regenerated Go protobuf output reflecting new command types. |
| proto/minercommand/v1/command.proto | Adds COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL enum values. |
| client/src/protoFleet/api/generated/minercommand/v1/command_pb.ts | Regenerated TS protobuf output reflecting new command types. |
…ties The capability_mapping entries for COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL were added without a matching arm in hasAnyCapability, so the switch fell through and CheckCommandCapabilities silently reported none_supported=true for every device — even when the device's CommandCapabilities advertised CurtailFullSupported or CurtailEfficiencySupported. Add the two missing case arms wired to the existing proto fields, plus unit coverage for both constants (true/false per capability) and an OR-semantics test confirming the map entry now actually gates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reject out-of-range CurtailLevel at the dispatch boundary as FailedPrecondition (which is in the permanent-failure arm of markQueueMessageStatus) so a malformed payload doesn't burn through MaxFailureRetries before being rejected. - Add activitymodels.ActorCurtailment and route session.ActorCurtailment through actorTypeFromSession so reconciler-dispatched commands attribute to the curtailment actor instead of falling through to ActorUser. - Trim the CurtailPayload.Level doc comment of its hardcoded numeric values (1=Efficiency, 2=Full) which had no compile-time anchor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loop the bounds-guard sub-test over level=0 (below Efficiency) and level=3 (above Full) so a mutation on either comparison operator is caught. Each iteration creates a fresh gomock controller and asserts the miner's Curtail is never called. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion A malformed curtail payload is a deterministic, non-recoverable input bug — retrying it MaxFailureRetries times on the per-device FIFO queue just blocks the queue head. Reclassify the unmarshal-error arm to FailedPrecondition (which short-circuits to UpdateMessagePermanentlyFailed on the first attempt) so it matches the invalid-level branch already in this dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CheckCommandCapabilities for COMMAND_TYPE_CURTAIL was OR'ing FULL and Efficiency, so a miner that advertised only Efficiency was reported as supporting CURTAIL even though curtailment dispatches FULL. Operators or automations relying on the capability surface for fleet selection would see false positives and the batch would fail at execution. Tighten the mapping to CapabilityCurtailFull only; UNCURTAIL keeps the OR-set since restore is level-independent. Tests split accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ubtest Simplification pass: tighten three new comments where the code already communicated what the comment said, and split the bounds-guard test loop into named t.Run subtests so failures attribute per level value. - capability_mapping.go: drop "currently" qualifier and 3 redundant lines - commandtype/enum.go: trim Curtail godoc to one-liner matching peers - command_dto.go: drop the type-restating second sentence - execution_service_test.go: wrap level=0 / level=3 in named subtests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the StartCurtailment write path. The handler validates the request, runs the existing selector pipeline shared with Preview, and persists event + per-target rows in a single transaction with baseline_power_w captured from the latest telemetry sample per device. On insufficient curtailable load the handler returns InvalidArgument with the same structured detail Preview emits today; on empty plan after a successful selector run it returns InvalidArgument rather than persisting an empty event. Service-level changes: - Refactor Preview into a shared runSelector pipeline so Start reuses org-config + scope + candidate + classify + buildPlan without duplication; Plan gains an EventUUID populated only on Start success. - Add Service.Start with a StartRequest superset of PreviewRequest (restore batch, durations, idempotency/external attribution, reason). validateStartRequest layers the new bounds checks on top of the existing Preview validator. - Idempotency lookup is left as a TODO at the persistence boundary; the field is plumbed end-to-end so the lookup query can land later without a contract change. Persistence: - CurtailmentStore.InsertEventWithTargets is the transactional helper that wraps both inserts so an Insert that succeeds with zero targets cannot leak into the lifecycle tables. - The sqlstore implementation runs both queries inside db.WithTransaction. Handler: - StartCurtailment derives source_actor_type from session.Info (user / api_key) so the audit trail attributes correctly without pulling session into the service. The admin override gate already fires when restore_batch_size_override or candidate_min_power_w_override are set. - Response echoes the persisted event with target rollup; pending state is the persisted shape since dispatch is the reconciler's job. Out of scope for this commit (separate follow-ups still complete BE-3): - Initial Curtail batch dispatch and the reconciler that picks up pending events. - CurtailmentActiveFilter registration on commandSvc. - Schedule-processor curtailment-skip emission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 129c208fc0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…event filter Curtail / Uncurtail public methods on the command service mirror the existing Reboot / StartMining shape so the curtailment reconciler can dispatch through the standard preflight + queue pipeline rather than poking the queue directly. CurtailmentActiveFilter gates external commands against devices that are part of an active curtailment event. Curtailment-origin traffic (Actor=ActorCurtailment) bypasses the filter so the reconciler can issue Curtail/Uncurtail without self-blocking, mirroring how ScheduleConflictFilter bypasses ActorScheduler for scheduler-origin traffic. The filter short-circuits when no active events exist so the common-case command preflight stays cheap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…leetd wiring Reconciler picks up pending curtailment events and drives them through their lifecycle: dispatch initial Curtail per pending target with Actor=ActorCurtailment so the active-event filter self-bypasses, watch telemetry to confirm targets, transition the event pending->active when all targets confirm, then run drift detection on active events and re-dispatch up to MaxRetries before declaring drift-exhausted. Each tick is single-instance and serial. End-of-tick heartbeat upsert backs the operational liveness alert. Per-event panics are caught so one bad event doesn't kill the tick. Schedule processor now distinguishes curtailment-active filter skips and emits schedule_skipped_due_to_curtailment instead of the generic schedule_executed device_count=0 path, so operators can see when an active curtailment event preempted a scheduled command. fleetd registers CurtailmentActiveFilter on commandSvc alongside the existing ScheduleConflictFilter, and starts the reconciler alongside the schedule processor with the same graceful-shutdown shape. New sqlc queries: ListNonTerminalCurtailmentEvents, UpdateCurtailmentEventState, UpdateCurtailmentTargetState, UpsertCurtailmentReconcilerHeartbeat. Store interface gains ListNonTerminalEvents / UpdateEventState / UpdateTargetState / UpsertHeartbeat to wrap them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 48bbed492e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Several reconciler issues that block real operation under failure modes: - A filter-skipped command (cmd.Curtail returns nil error with non-empty result.Skipped) was silently treated as a successful dispatch. The device never received the curtail command but the target moved to Dispatched, so confirmDispatched would never see telemetry move and the target stayed dispatched indefinitely. Now examines Skipped after the nil-error check; matching device entries fall through to the recordDispatchFailure path with the skip reason as LastError. - A single permanently-failing dispatch held the entire event in pending forever — maybeMarkActive required all targets in Confirmed with no terminal-failure budget. dispatchOneCurtail errors now bump RetryCount, persist LastError, and at MaxRetries transition the target to RestoreFailed; maybeMarkActive admits Confirmed-or-terminal as the promotion condition. When every target is RestoreFailed the event transitions to completed_with_failures rather than sitting non-terminal. - After Confirmed→Drifted→Dispatched (drift redispatch), observeActive's Dispatched arm was an empty fall-through. Targets stayed dispatched even when telemetry showed curtailment had resumed. The arm now calls the same telemetry-confirmation path used during initial dispatch so drift recovery closes back to Confirmed in the same flow. - ListTargetsByEvent fetched three times per pending event per tick (dispatchPending + confirmDispatched + maybeMarkActive). dispatchPending now fetches once and threads the slice through subsequent phases; per-target updates mutate in-place. The structural change also resolves the silent-error-swallowing in confirmDispatched and maybeMarkActive since neither phase fetches anymore. - RetryCount was never reset when a re-dispatched target re-confirmed, so the budget was consumed across unrelated drift episodes; an off-by-one between checkDrift and observeActive's guards meant checkDrift's effective budget was MaxRetries+1. Both now reset on (re)confirm and use >=MaxRetries as the consistent boundary. - runTick lacked a top-level defer recover(); a panic in ListNonTerminalEvents tore down the goroutine. safeTick now wraps each tick with panic recovery and the next tick still runs. - Heartbeat upsert at end of runTick used workCtx; the shutdown watchdog's workCancel could drop the final heartbeat write. The upsert now uses a Background-derived ctx with a 5s timeout so a staleness alert won't fire spuriously after a clean restart. - Reconciler.Start was not idempotent — double-Start spawned parallel tick loops. A running flag guards both Start and Stop. - Successful dispatchOneCurtail now clears LastError on the target row rather than leaving the previous failure string in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h_size Add proto buf.validate length bounds to the four text fields on StartCurtailmentRequest that previously accepted arbitrary input: - idempotency_key, reason, external_source, external_reference all gain string.max_len = 256 so a malicious or buggy caller cannot persist multi-megabyte values through these unbounded TEXT columns. The defense-in-depth length check in validateStartRequest catches non- Connect callers (internal CLIs, tests, future non-Connect surfaces). - restore_batch_size gains uint32.lte = 10000 to match the existing bound on StopCurtailmentRequest.restore_batch_size_override; the proto contract was inconsistent across the two sibling fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip helpers InsertEvent and InsertTarget on CurtailmentStore became unused once InsertEventWithTargets landed as the only transactional write surface. Removing the methods drops the SQL store implementations and the panic-stub-on-each-fake-store boilerplate. countConflictSkips and countCurtailmentActiveSkips in schedule/processor were structural duplicates differing only in the constant compared. Extracted into countSkipsByFilter; the per-filter wrappers are one-liners. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f37d0f9648
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ets stay drifted Two reconciler bugs introduced in 2ff8d38's failure-handling refactor: - checkDrift was bumping RetryCount when transitioning Confirmed→Drifted, and recordDispatchFailure was bumping it again when the redispatch failed. Each drift+failed-redispatch consumed two retries from the budget instead of one, halving MaxRetries for flapping miners. checkDrift now performs only the state transition; the increment lives exclusively in recordDispatchFailure so the budget tracks failed dispatch attempts rather than drift events. - recordDispatchFailure unconditionally wrote TargetStatePending on a non-terminal failure. For active-event drift redispatches the target flipped from Drifted to Pending, but observeActive's Pending case is a no-op once the event is active — the target sat forever with no retries and no terminal transition. dispatchOneCurtail and recordDispatchFailure now take an explicit nonTerminalFailureState; pending-event callers pass Pending, drifted-redispatch callers pass Drifted, and observeActive's Drifted arm picks up the target on the next tick. While in the same scope: confirmDispatched now loops and delegates to confirmOneDispatched (the per-target primitive added in the prior fix commit) so there's a single maintenance surface for the dispatched → confirmed state transition. Stop sets running=false and captures cancel funcs under the mutex before wg.Wait, so a concurrent second Stop hits the running guard and returns immediately rather than racing through the wait. Two existing tests had assertions that codified the double-bump behavior; updated to match the fixed invariant. Added regression coverage: MaxRetries=3 maps to exactly 3 dispatch attempts before RestoreFailed, and a non-terminal drift failure stays Drifted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odoc Internal ticket prefixes don't belong in source comments. Remove the (BE-3) reference left over from when the curtailment-skip query support was prospective. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 388ab94841
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ect whitespace reasons
Two Start-validation contract bugs flagged in review:
- The translator rejected max_duration_seconds=0 with allow_unbounded
false, but the request contract treats 0 as the "use org default"
sentinel (with allow_unbounded reserved for explicit opt-out of
normalization). Ordinary callers using default values were getting
InvalidArgument and could not start curtailment without sending an
explicit duration. The translator now leaves MaxDurationSeconds nil
for the sentinel, the validator allows nil with !allow_unbounded,
and Service.Start normalizes nil to OrgConfig.MaxDurationDefaultSec
before persistence. runSelector returns the OrgConfig so the
normalization avoids a second DB read.
- Whitespace-only reasons (" ") passed the `req.Reason == ""` check
but tripped the DB's `length(trim(reason)) > 0` constraint, surfacing
as a 500 instead of an InvalidArgument. The validator now uses
strings.TrimSpace so whitespace-only is rejected with the same
message as empty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t empty batch as failure Two reconciler dispatch-loop bugs flagged in review: - confirmOneDispatched used isCurtailedByPower, which preserves curtailed=true on missing or non-finite telemetry to avoid spurious redispatch during drift detection. That polarity is wrong for the confirmation path: targets were being promoted to confirmed without any evidence the curtailment actually took effect, and the event could flip to active based on absent telemetry. Introduce a separate isPositivelyCurtailed predicate that requires finite power below the drift threshold (or finite zero-or-negative hash when no baseline). confirmOneDispatched uses the new predicate; checkDrift keeps isCurtailedByPower since the conservative default is correct for drift. - dispatchOneCurtail marked the target dispatched even when the command service returned a nil-error result with empty BatchIdentifier (processCommand can do this when device-ID resolution returns zero rows — e.g. miner unpaired between Start and reconcile). The target stuck in dispatched with no batch to execute, no retry increment, and no eventual termination. Treat the empty-batch case as a dispatch failure so it consumes a retry and the event progresses normally. Documents the known UserID=0 / command_batch_log.created_by FK gap inline at reconcilerContext; the proper fix needs a new migration adding curtailment_event.created_by_user_id and lands as a focused follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0165863b58
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…iler gap doc
- StartCurtailment response now echoes the resolved max_duration_seconds
(the value Service.Start actually persisted after normalizing the
use-org-default sentinel) rather than the request's raw zero. Plan
gains an EffectiveMaxDurationSeconds field that toStartResponse reads
via a small helper; falls back to the request value when Plan does
not carry a resolved value.
- Service.Start guards against a misconfigured org row carrying
MaxDurationDefaultSec <= 0; the resolved value is rejected with
InvalidArgument rather than persisted as-is.
- TestService_Start_RejectsEmptyReason now covers whitespace-only and
tab/newline cases, not just the empty string, so the strings.TrimSpace
check has direct coverage.
- KNOWN GAP comment at reconcilerContext spells out the runtime impact
("dispatch is broken in production until the migration ships") so a
reader doesn't mistake the deferred follow-up for polish.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9939d07cca
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ndary The four uint32 fields on StartCurtailmentRequest (max_duration_seconds, restore_batch_size, restore_batch_interval_sec, min_curtailed_duration_sec) went through uint32ToInt32Saturating, which silently clamped values above math.MaxInt32 to MaxInt32. A caller sending 3_000_000_000 seconds got a different persisted duration with no validation error, breaking request/response accuracy for valid protobuf inputs. Replace the saturating helper with uint32ToInt32Strict that returns InvalidArgument naming the offending field. Connect-validated inputs are unaffected (the proto bounds cap reachable values well below MaxInt32); non-Connect callers and any future field without a proto bound now see a clear error instead of silent saturation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…otency_key A retry with a previously-used (org_id, idempotency_key) pair tripped the partial unique index uq_curtailment_event_idempotency at insert time, surfacing as Internal — defeating the purpose of exposing an idempotency key on this API. Add GetCurtailmentEventByIdempotencyKey sqlc query (org-scoped) and a matching CurtailmentStore.GetEventByIdempotencyKey method. Service.Start performs the lookup before the selector + insert pipeline; on a hit it reconstructs a minimal Plan from the persisted event + targets and returns early so the retry produces the same response shape as the original Start. NotFound on the lookup falls through to the normal path. The reconstructed Plan carries the persisted event_uuid, the persisted max_duration_seconds (so EffectiveMaxDurationSeconds is populated), and SelectedDevice entries built from curtailment_target rows. Skipped candidates and estimated kW values are not re-derived from decision_snapshot_jsonb to avoid coupling the retry path to the snapshot schema; clients can re-fetch full detail via the read APIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 79980009c4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
confirmOneDispatched and checkDrift previously skipped silently when the candidate row was missing — typically after a device unpaired or got deleted mid-event. The target stayed in dispatched/confirmed forever, never consumed its retry budget, and the event could not progress to a terminal state. Both paths now route through recordDispatchFailure so the retry budget moves forward and the target lands in RestoreFailed at exhaustion. The confirm path keeps the target dispatched while retrying; the drift path keeps it drifted. Add two regression tests covering each branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Service.Start short-circuits on a matching idempotency_key, the response was built from the *retry* request's fields (priority, reason, scope, mode params, batch sizes, include_maintenance, etc.) bolted onto the persisted event_uuid + targets. A caller reusing a key with drifted metadata would see the new attributes echoed alongside the original event, which is internally inconsistent. Thread the persisted Event + Target rows through Plan and have the response translator describe the persisted row directly: state, priority, reason, scope (reconstructed from scope_jsonb), mode params (from mode_params_jsonb), batch sizes, observed/baseline power, retry counts, and a real per-state target rollup. Fresh-Start path is unchanged. Add a handler-level regression test that retries with deliberately drifted metadata and asserts the response describes the persisted row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…achable arm planFromExistingEvent populated plan.Selected on every idempotent retry, but the response now reads plan.PersistedTargets exclusively on that path; the Selected loop was a per-retry O(N) allocation with no consumer. Also drop the unreachable "active" arm of desiredStateProto and its const — v1 only writes "curtailed". Future stop/restore work can add the arm when it adds the writer. Inline the remaining literal so the cross-package mirror constant goes away. Trim two godocs to one line each per the project's terse-comment preference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cda1a3c8f8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…only startResponseFromPersisted called effectiveMaxDurationSeconds with a nil request, leaning on the proto getter's nil-safety. The intent on the persisted path is "render plan.EffectiveMaxDurationSeconds, mapping nil to 0 for allow_unbounded events" — a fresh-path helper that falls back to req.GetMaxDurationSeconds() obscures that and would break loudly if the generated getter ever changed. Add a persistedMaxDurationSeconds helper that takes only the *int32 plan field. effectiveMaxDurationSeconds delegates to it for the plan-set branch. Add a regression test for the allow_unbounded retry path: response surfaces 0 (proto default), not the retry request's drifted MaxDurationSeconds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two concurrent Starts with the same (org_id, idempotency_key) can both miss the pre-read short-circuit; one wins the unique-index insert, the other trips uq_curtailment_event_idempotency. The loser's request was returning Internal — defeating the retry-safety guarantee the key is supposed to give operators. Add interfaces.ErrCurtailmentIdempotencyKeyConflict as the typed signal. sqlstore.InsertEventWithTargets matches pgErr Code/ConstraintName against the partial unique index name and surfaces the sentinel; constraint match keeps the sweep narrow so a future unique constraint on curtailment_event isn't silently swallowed. Service.Start, on receiving the sentinel, re-reads via GetEventByIdempotencyKey and short-circuits to planFromExistingEvent so the loser sees the winner's persisted shape — same retry contract as the pre-read hit path. Add a service-level regression test that drives the race deterministically through a fakeStore.idempotencyRaceWinner hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe8af068b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Reduces PR scope creep, fixes the production-blocking FK violation, gates the operator entrypoint until BE-4 ships Stop, and trims the reconciler down to what BE-3 must own. FK fix. Migration 000043 adds curtailment_event.created_by_user_id (NOT NULL FK -> user.id). The operator's session.Info.UserID is captured at StartCurtailment, threaded through StartRequest into the persisted event row, and read back by the reconciler when synthesizing dispatch context. Without this column the reconciler would dispatch with UserID=0 and every command_batch_log insert would fail the FK, burning targets through MaxRetries to RestoreFailed. Drop scope creep. Removes the idempotency-key lookup + retry-shaped response that grew during review (~250 LOC across translate.go, service.go, selector.go, the store interface, and sqlc queries). PR description had always flagged the lookup query as a TODO; restored to original scope. Duplicate idempotency_key Start calls now surface as Internal until BE-5 lands the lookup at the persistence boundary. Feature gate. StartCurtailment handler is gated behind startEnabled (default false in cmd/fleetd/main.go) since BE-3 ships dispatch + reconciler primitives but no Stop / restorer / max_duration_seconds enforcement (those are BE-4). An event reaching active has no exit path short of AdminTerminateEvent, so the operator entrypoint stays dormant in production until BE-4 flips the flag. Reconciler complexity. isPositivelyCurtailed and isCurtailedByPower folded into one isCurtailed(..., requirePositiveEvidence bool); processEvent's explicit terminal-state arm replaced with a default; dispatchPending's empty-targets defensive transition simplified to log + return. Post-review polish. CreatedByUserID assertions added to handler and service happy-path tests; new RejectsMissingCreatedByUserID negative-case test; redundant default arms in resolvePriority and strategyReasonLabel removed; persistedMaxDurationSeconds and the two count-skips wrappers inlined; longer comments on the MaxRetries guard in observeActive's drift arm and on the Start-mid-Stop concurrency edge in Stop() that fleetd's single-lifecycle invariant makes unreachable in practice.
Pulled long rationale paragraphs out of source comments where the depth belongs in PR/commit messages. Stripped roadmap version markers (v1, BE-X) from godocs and field annotations per project convention. Kept the load- bearing whys: invariants the next reader needs to make a correct edit (e.g. detached heartbeat ctx, in-memory mirror semantics, panic-recovery asymmetry, FK-driven UserID flow, NOT-NULL-without-backfill safety claim). Touches reconciler, service, translate, handler, sqlstores, sqlc queries, migration 000043, schedule processor, session models, fleetd wiring, and the StartCurtailmentRequest proto. Generated proto/sqlc outputs follow. No behavior changes; tests + lint clean.
Background
Curtailment is the proto-fleet feature that reduces a fleet's mining power on demand. The contract foundation (#118) shipped the proto surface with handler stubs returning
Unimplemented. The admin layer (#173) addedAdminTerminateEvent, three admin-gated override fields, session-only registration of the recovery RPC, andrequireAdminFromContext. Persistence and the operator-facingPreviewCurtailmentPlan(#188) followed.This PR ships the operator-facing
StartCurtailmentwrite path, the per-device dispatch primitives (Curtail / Uncurtail), the active-event preflight filter, the schedule processor's curtailment-skip emission, and the background reconciler that drives non-terminal events forward. TheStartCurtailmentoperator entrypoint is gated off (startEnabled=falseincmd/fleetd/main.go) until the Stop + restorer +max_duration_secondsenforcement work lands; the dispatch primitives, filter, and reconciler ship live so the wiring soaks in production. With the gate flipped, an operator's Start request lands in DB, the reconciler dispatchesCurtailper target under the operator's user identity, telemetry confirms each target, and the event progressespending → activewith drift detection + bounded redispatch.Summary
Operator surface
StartCurtailmentvalidates the request, runs the selector pipeline shared withPreviewCurtailmentPlan, and persists event + per-target rows in one transaction.baseline_power_wis captured from the latest telemetry sample per target. Insufficient curtailable load returnsInvalidArgumentwith the same structured detail as Preview. Source actor (user / api_key) and the operator'suser.idare derived fromsession.Infoso audit attribution stays correct without pulling session into the service.StartCurtailmentis gated behind astartEnabled boolfield on the handler;cmd/fleetd/main.gopassesfalse. With the gate off, the RPC returnsUnimplementedregardless of caller. This protects production from an event reachingactivewithout an exit path until Stop + restorer ship in a follow-up.Operator-id plumbing and FK fix
Migration
000043addscurtailment_event.created_by_user_id(NOT NULL FK touser(id)). The operator'ssession.Info.UserIDis captured at handler entry, threaded throughStartRequest.CreatedByUserID, persisted on the event row, and read back by the reconciler when synthesizing dispatch context. Without this column the reconciler would dispatch withUserID=0and everycommand_batch_loginsert would fail the FK touser, burning targets toRestoreFailedon the first reconciler tick. NOT NULL with no backfill is safe becausePreviewCurtailmentPlanwrites no rows tocurtailment_event— the table is empty in any environment that has only run earlier migrations.Dispatch primitives
proto/minercommand/v1/command.proto: extendsCommandTypewithCOMMAND_TYPE_CURTAILandCOMMAND_TYPE_UNCURTAIL. Generated Go + TS regenerated.commandtype.Curtail/commandtype.UncurtailGo enum values withString/FromStringround-trip andMarshalText/UnmarshalTextparity.activityEventTypearms emitcurtail/uncurtailfor audit-log rows.session.ActorCurtailmentand the matchingactivitymodels.ActorCurtailmentconstant, plus anactorTypeFromSessionarm so reconciler-dispatched commands attribute to the curtailment actor in audit rows.dto.CurtailPayloadJSON DTO carrying the curtailment level on queue messages (int32mirroringsdk.CurtailLevel).COMMAND_TYPE_CURTAILrequiresCapabilityCurtailFullonly (dispatch sends FULL);COMMAND_TYPE_UNCURTAILuses the OR-set since restore is level-independent.hasAnyCapabilitygains the corresponding switch arms.executeCommandOnDevicedispatch arms invokeminerInfo.Curtail/minerInfo.Uncurtail. ExistingPluginMineralready implements both via theDeviceCurtailmentoptional SDK interface.Service.Curtail/Service.Uncurtailpublic methods on the command service mirror theReboot/StartMiningshape so the reconciler dispatches through the standard preflight + queue pipeline rather than poking the queue directly.Defensive guards on the dispatch boundary: malformed
CurtailPayloadand out-of-rangeLevelsurface asFailedPrecondition(notInternal) so they fail permanently on the first attempt rather than burningMaxFailureRetriesagainst a deterministic input bug.Preflight + schedule integration
CurtailmentActiveFilterblocks external commands targeting devices that are part of a non-terminal curtailment event. Reconciler self-traffic bypasses the filter viaActor == session.ActorCurtailmentANDCommandType ∈ {Curtail, Uncurtail}, mirroring howScheduleConflictFilterbypassesActorScheduler. TheActorfield is internal-only — handlers never set it from request data — so the bypass cannot be forged from outside. The common case (no active events) short-circuits without building a set.The schedule processor distinguishes curtailment-active skips from priority-conflict skips and emits
schedule_skipped_due_to_curtailmentinstead ofschedule_executed device_count=0, so operators can see the actual cause when an active curtailment event preempts a scheduled command.Reconciler
Single-instance background goroutine with a serial 30s tick, end-of-tick heartbeat upsert, and per-event panic isolation. Lifecycle wired in
cmd/fleetd/main.goalongside the schedule processor with the same drain-before-cancel + watchdog shape.Per tick, for each non-terminal event:
pendingevents: dispatchCurtailper pending target under the operator's synthesized session; transition the target todispatched. Confirm any already-dispatched targets via the latest telemetry sample. When every target isconfirmedor terminally failed, transition the eventpending → active. All-terminal-failed events skip pastactivetocompleted_with_failures.activeevents: drift detection on confirmed targets. A device whose telemetry no longer satisfies the curtailed predicate is markeddriftedand re-dispatched up toMaxRetries; budget exhaustion routes the target toRestoreFailed.restoringevents are owned by the future restorer; the reconciler does not write here.workCtxso shutdown-watchdog cancellation cannot drop the final liveness signal.The confirmation predicate (
isCurtailed) takes arequirePositiveEvidenceflag: confirm path requires positive evidence (missing telemetry → not curtailed); drift path preserves curtailed=true on missing telemetry so a flaky sensor cannot trigger a redispatch storm.Idempotency
The
idempotency_keyfield is plumbed through the request, the service request shape, and thecurtailment_eventcolumn. The lookup query at the persistence boundary is intentionally not implemented — the field is plumbed end-to-end so the lookup can land later without a contract change. Until then, duplicate-key Start calls surface asInternalfrom the partial-unique-index violation.Wiring
cmd/fleetd/main.goregistersCurtailmentActiveFilteroncommandSvc(afterScheduleConflictFilter) and starts the reconciler with the same graceful-shutdown pattern as the schedule processor. The handler is constructed withstartEnabled: false.What is intentionally not in this PR
StopCurtailmentand the restorer. The reconciler'srestoringarm is a no-op.max_duration_secondsenforcement (event-level termination on the cap elapsing). Pairs with the restorer.UpdateCurtailmentEvent,GetActiveCurtailment,ListCurtailmentEvents.idempotency_keylookup at the persistence boundary.Curtail/UncurtailinMinerCommandService— these are internal dispatches driven by curtailment events, not operator-facing commands. The proto enum extension is enough forCheckCommandCapabilities.InternaltoFailedPrecondition— separate cleanup PR.Known gap
A target that lands in
Dispatchedand never receives positive curtailment evidence via telemetry stays there indefinitely (no time-based redispatch in this PR). Operators rely onAdminTerminateEventas the escape hatch. The follow-up that implementsmax_duration_secondsenforcement adds the event-level time-based termination path covering this case; until thenStartCurtailmentis gated off in production.Test plan
go build ./...andgolangci-lint run ./...pass cleanly.go test ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./internal/domain/schedule/... ./internal/domain/command/... ./internal/domain/commandtype/... -count=1is green for the unit surface (DB-dependent integration tests fail on connection-refused as usual; not regressions).Coverage by area:
Service.Start— validation rejection per Start-specific field, includingcreated_by_user_id <= 0; selector forwarding (whole-org, device-list); insufficient-load returnsInvalidArgumentwith detail; empty-plan returnsInvalidArgument; persistence withbaseline_power_wandcreated_by_user_idcaptured; source-actor derivation per auth method.StartCurtailmenthandler — happy-path with stub service;startEnabled=falsereturnsUnimplementedeven with valid creds; admin-gate preserved forrestore_batch_size_override/candidate_min_power_w_override; API-key vs. session source-actor attribution;CreatedByUserIDflows fromsession.Info.UserIDinto the persisted event.Service.Curtail/Uncurtail— queue receives the right command type with the right payload.CurtailmentActiveFilter— bypass forActorCurtailment+ Curtail/Uncurtail; no-active-events fast path; partial-skip with multi-device kept/skipped split; empty-input passthrough.RestoreFailed; per-event error isolation; heartbeat advances on every tick;isCurtailedpredicate covers nil / finite / non-finite power × baseline × hash combinations across both confirm and drift modes.schedule_skipped_due_to_curtailmentand notschedule_executed.commandtype.Typeround-trip — pinsString↔FromStringfor all 13 values; rejection arm covers unknown labels.session.Actorconstants — distinct lowercase labels; no collision withActorScheduler.CURTAIL(FULL-only) andUNCURTAIL(OR-set) shapes pinned;hasAnyCapabilitycovered for each curtail capability plus an OR-semantics case.executeCommandOnDevicedispatch — Curtail dispatches with payload-derived level; surfaces unmarshal failure asFailedPrecondition; rejects out-of-range levels (level=0ANDlevel=3) asFailedPreconditionwithCurtailnever invoked; Uncurtail dispatches with the empty request.Closes #191
Refs #118
Refs #173
Refs #188