feat(curtailment): StartCurtailment + dispatch + reconciler by rongxin-liu · Pull Request #192 · block/proto-fleet

rongxin-liu · 2026-05-07T14:52:17Z

Background

Curtailment is the proto-fleet feature that reduces a fleet's mining power on demand. The contract foundation (#118) shipped the proto surface with handler stubs returning Unimplemented. The admin layer (#173) added AdminTerminateEvent, three admin-gated override fields, session-only registration of the recovery RPC, and requireAdminFromContext. Persistence and the operator-facing PreviewCurtailmentPlan (#188) followed.

This PR ships the operator-facing StartCurtailment write path, the per-device dispatch primitives (Curtail / Uncurtail), the active-event preflight filter, the schedule processor's curtailment-skip emission, and the background reconciler that drives non-terminal events forward. The StartCurtailment operator entrypoint is gated off (startEnabled=false in cmd/fleetd/main.go) until the Stop + restorer + max_duration_seconds enforcement work lands; the dispatch primitives, filter, and reconciler ship live so the wiring soaks in production. With the gate flipped, an operator's Start request lands in DB, the reconciler dispatches Curtail per target under the operator's user identity, telemetry confirms each target, and the event progresses pending → active with drift detection + bounded redispatch.

Summary

Operator surface

StartCurtailment validates the request, runs the selector pipeline shared with PreviewCurtailmentPlan, and persists event + per-target rows in one transaction. baseline_power_w is captured from the latest telemetry sample per target. Insufficient curtailable load returns InvalidArgument with the same structured detail as Preview. Source actor (user / api_key) and the operator's user.id are derived from session.Info so audit attribution stays correct without pulling session into the service.

StartCurtailment is gated behind a startEnabled bool field on the handler; cmd/fleetd/main.go passes false. With the gate off, the RPC returns Unimplemented regardless of caller. This protects production from an event reaching active without an exit path until Stop + restorer ship in a follow-up.

Operator-id plumbing and FK fix

Migration 000043 adds curtailment_event.created_by_user_id (NOT NULL FK to user(id)). The operator's session.Info.UserID is captured at handler entry, threaded through StartRequest.CreatedByUserID, persisted on the event row, and read back by the reconciler when synthesizing dispatch context. Without this column the reconciler would dispatch with UserID=0 and every command_batch_log insert would fail the FK to user, burning targets to RestoreFailed on the first reconciler tick. NOT NULL with no backfill is safe because PreviewCurtailmentPlan writes no rows to curtailment_event — the table is empty in any environment that has only run earlier migrations.

Dispatch primitives

proto/minercommand/v1/command.proto: extends CommandType with COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL. Generated Go + TS regenerated.
commandtype.Curtail / commandtype.Uncurtail Go enum values with String / FromString round-trip and MarshalText / UnmarshalText parity. activityEventType arms emit curtail / uncurtail for audit-log rows.
session.ActorCurtailment and the matching activitymodels.ActorCurtailment constant, plus an actorTypeFromSession arm so reconciler-dispatched commands attribute to the curtailment actor in audit rows.
dto.CurtailPayload JSON DTO carrying the curtailment level on queue messages (int32 mirroring sdk.CurtailLevel).
Capability mapping: COMMAND_TYPE_CURTAIL requires CapabilityCurtailFull only (dispatch sends FULL); COMMAND_TYPE_UNCURTAIL uses the OR-set since restore is level-independent. hasAnyCapability gains the corresponding switch arms.
executeCommandOnDevice dispatch arms invoke minerInfo.Curtail / minerInfo.Uncurtail. Existing PluginMiner already implements both via the DeviceCurtailment optional SDK interface.
Service.Curtail / Service.Uncurtail public methods on the command service mirror the Reboot / StartMining shape so the reconciler dispatches through the standard preflight + queue pipeline rather than poking the queue directly.

Defensive guards on the dispatch boundary: malformed CurtailPayload and out-of-range Level surface as FailedPrecondition (not Internal) so they fail permanently on the first attempt rather than burning MaxFailureRetries against a deterministic input bug.

Preflight + schedule integration

CurtailmentActiveFilter blocks external commands targeting devices that are part of a non-terminal curtailment event. Reconciler self-traffic bypasses the filter via Actor == session.ActorCurtailment AND CommandType ∈ {Curtail, Uncurtail}, mirroring how ScheduleConflictFilter bypasses ActorScheduler. The Actor field is internal-only — handlers never set it from request data — so the bypass cannot be forged from outside. The common case (no active events) short-circuits without building a set.

The schedule processor distinguishes curtailment-active skips from priority-conflict skips and emits schedule_skipped_due_to_curtailment instead of schedule_executed device_count=0, so operators can see the actual cause when an active curtailment event preempts a scheduled command.

Reconciler

Single-instance background goroutine with a serial 30s tick, end-of-tick heartbeat upsert, and per-event panic isolation. Lifecycle wired in cmd/fleetd/main.go alongside the schedule processor with the same drain-before-cancel + watchdog shape.

Per tick, for each non-terminal event:

pending events: dispatch Curtail per pending target under the operator's synthesized session; transition the target to dispatched. Confirm any already-dispatched targets via the latest telemetry sample. When every target is confirmed or terminally failed, transition the event pending → active. All-terminal-failed events skip past active to completed_with_failures.
active events: drift detection on confirmed targets. A device whose telemetry no longer satisfies the curtailed predicate is marked drifted and re-dispatched up to MaxRetries; budget exhaustion routes the target to RestoreFailed.
restoring events are owned by the future restorer; the reconciler does not write here.
Heartbeat upserts at end of tick with current tick UUID, duration, and active-event count. Detached from workCtx so shutdown-watchdog cancellation cannot drop the final liveness signal.

The confirmation predicate (isCurtailed) takes a requirePositiveEvidence flag: confirm path requires positive evidence (missing telemetry → not curtailed); drift path preserves curtailed=true on missing telemetry so a flaky sensor cannot trigger a redispatch storm.

Idempotency

The idempotency_key field is plumbed through the request, the service request shape, and the curtailment_event column. The lookup query at the persistence boundary is intentionally not implemented — the field is plumbed end-to-end so the lookup can land later without a contract change. Until then, duplicate-key Start calls surface as Internal from the partial-unique-index violation.

Wiring

cmd/fleetd/main.go registers CurtailmentActiveFilter on commandSvc (after ScheduleConflictFilter) and starts the reconciler with the same graceful-shutdown pattern as the schedule processor. The handler is constructed with startEnabled: false.

What is intentionally not in this PR

StopCurtailment and the restorer. The reconciler's restoring arm is a no-op.
Reconciler-side max_duration_seconds enforcement (event-level termination on the cap elapsing). Pairs with the restorer.
UpdateCurtailmentEvent, GetActiveCurtailment, ListCurtailmentEvents.
idempotency_key lookup at the persistence boundary.
Connect-RPC plumbing for Curtail / Uncurtail in MinerCommandService — these are internal dispatches driven by curtailment events, not operator-facing commands. The proto enum extension is enough for CheckCommandCapabilities.
Reclassifying the other 5 dispatch arms' unmarshal errors from Internal to FailedPrecondition — separate cleanup PR.

Known gap

A target that lands in Dispatched and never receives positive curtailment evidence via telemetry stays there indefinitely (no time-based redispatch in this PR). Operators rely on AdminTerminateEvent as the escape hatch. The follow-up that implements max_duration_seconds enforcement adds the event-level time-based termination path covering this case; until then StartCurtailment is gated off in production.

Test plan

go build ./... and golangci-lint run ./... pass cleanly. go test ./internal/domain/curtailment/... ./internal/handlers/curtailment/... ./internal/domain/schedule/... ./internal/domain/command/... ./internal/domain/commandtype/... -count=1 is green for the unit surface (DB-dependent integration tests fail on connection-refused as usual; not regressions).

Coverage by area:

Service.Start — validation rejection per Start-specific field, including created_by_user_id <= 0; selector forwarding (whole-org, device-list); insufficient-load returns InvalidArgument with detail; empty-plan returns InvalidArgument; persistence with baseline_power_w and created_by_user_id captured; source-actor derivation per auth method.
StartCurtailment handler — happy-path with stub service; startEnabled=false returns Unimplemented even with valid creds; admin-gate preserved for restore_batch_size_override / candidate_min_power_w_override; API-key vs. session source-actor attribution; CreatedByUserID flows from session.Info.UserID into the persisted event.
Service.Curtail / Uncurtail — queue receives the right command type with the right payload.
CurtailmentActiveFilter — bypass for ActorCurtailment + Curtail/Uncurtail; no-active-events fast path; partial-skip with multi-device kept/skipped split; empty-input passthrough.
Reconciler — pending → dispatched transition; telemetry-confirmed → event active; drift detection happy path; retry exhaustion routes to RestoreFailed; per-event error isolation; heartbeat advances on every tick; isCurtailed predicate covers nil / finite / non-finite power × baseline × hash combinations across both confirm and drift modes.
Schedule processor — schedule fires with all devices in an active curtailment event; processor emits schedule_skipped_due_to_curtailment and not schedule_executed.
commandtype.Type round-trip — pins String ↔ FromString for all 13 values; rejection arm covers unknown labels.
session.Actor constants — distinct lowercase labels; no collision with ActorScheduler.
Capability mapping — asymmetric CURTAIL (FULL-only) and UNCURTAIL (OR-set) shapes pinned; hasAnyCapability covered for each curtail capability plus an OR-semantics case.
executeCommandOnDevice dispatch — Curtail dispatches with payload-derived level; surfaces unmarshal failure as FailedPrecondition; rejects out-of-range levels (level=0 AND level=3) as FailedPrecondition with Curtail never invoked; Uncurtail dispatches with the empty request.

Closes #191
Refs #118
Refs #173
Refs #188

Adds the foundation primitives a curtailment Start RPC and reconciler will dispatch through: - proto CommandType: COMMAND_TYPE_CURTAIL, COMMAND_TYPE_UNCURTAIL - commandtype.Curtail, commandtype.Uncurtail with String/FromString round-trip - session.ActorCurtailment for self-originated traffic so future command-preflight filters can bypass curtailment-active gating - dto.CurtailPayload carrying the curtailment level - capability mapping for the new command types (OR-relationship across CapabilityCurtailFull and CapabilityCurtailEfficiency, matching the DeviceCurtailment optional interface) - executeCommandOnDevice dispatch arms invoking minerInfo.Curtail and minerInfo.Uncurtail with the SDK request shapes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-07T14:54:43Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (025d0f501101b5e44f2a824afb9a18f786cd8615...f97ee9e15e3cfc4771e282a8beb6db5be1dbcadd, exact PR three-dot diff)

Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Unconfirmed curtailment dispatches can lock devices indefinitely

Category: Reliability | Concurrency
Location: server/internal/domain/curtailment/reconciler/reconciler.go:485
Description: After a Curtail command is successfully enqueued, the target moves to dispatched. If telemetry never proves curtailment, confirmOneDispatched returns without incrementing retry count, checking command failure status, or timing out the dispatch.
Impact: A failed plugin command, lost queue item, stale telemetry stream, or unsupported miner can leave the target dispatched forever. The event stays pending, and ListActiveCurtailedDevices treats pending/dispatched targets as locked, so normal commands against those miners are blocked indefinitely.
Recommendation: Add a confirmation timeout based on last_dispatched_at, then either inspect command result rows or retry/terminally fail the target within the retry budget. Consider capability-checking CurtailFull before persisting targets.

[MEDIUM] Reconciler has no cross-process ownership or state-guarded updates

Category: Concurrency | Reliability
Location: server/sqlc/queries/curtailment.sql:183
Description: ListNonTerminalCurtailmentEvents selects all pending/active/restoring events without a lease, advisory lock, FOR UPDATE SKIP LOCKED, or per-update state predicate. UpdateCurtailmentEventState and target updates also update by ID only.
Impact: In a multi-replica deployment or rolling restart overlap, multiple fleetd instances can dispatch the same curtailment target and race target/event transitions. A stale worker can also overwrite a state changed by a future stop/admin path.
Recommendation: Add a process-wide DB advisory lock or per-event transactional claiming with FOR UPDATE SKIP LOCKED, and make state transitions conditional on the expected current state.

[MEDIUM] Non-admin callers can set effectively unbounded curtailment durations

Category: Auth | Reliability
Location: server/internal/domain/curtailment/service.go:284
Description: The handler admin-gates allow_unbounded, but max_duration_seconds accepts any positive value up to math.MaxInt32 through the translator/service path.
Impact: Once StartCurtailment is enabled, a non-admin caller can request a duration of decades, effectively bypassing the admin-only unbounded acknowledgement and keeping miners curtailed far beyond the org default.
Recommendation: Enforce a sane upper bound in proto and service validation, or require admin privileges for values above the org default/global maximum.

[MEDIUM] Shutdown deadline does not actually bound reconciler stop

Category: Reliability
Location: server/internal/domain/curtailment/reconciler/reconciler.go:156
Description: Stop schedules workCancel after ShutdownDeadline, but then calls r.wg.Wait() without a timeout.
Impact: If an in-flight DB/queue/command path does not promptly respect context cancellation, fleetd shutdown can hang indefinitely despite the configured deadline.
Recommendation: Wait on a done channel with select against a timer, return/log timeout after cancellation, and let process shutdown continue.

Notes

StartCurtailment is wired with startEnabled=false in fleetd, which reduces immediate exposure through the public RPC. The reconciler still starts in production and will process any non-terminal rows that exist. I found no cryptostealing/pool-hijack changes, raw SQL interpolation, command-injection paths, or protobuf wire-format breakage in the scoped diff.

_{Generated by Codex Security Review |

Triggered by: @rongxin-liu |

Review workflow run}

Copilot

Pull request overview

Adds foundational curtailment command plumbing across the proto surface, server command domain, and execution path so upcoming StartCurtailment + reconciler work can dispatch Curtail/Uncurtail through the existing queue + capability-check framework.

Changes:

Extend minercommand.v1.CommandType with COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL (and regenerate Go/TS outputs).
Add Go domain enums + DTO payload (commandtype.Curtail/Uncurtail, dto.CurtailPayload) and hook command execution dispatch to minerInfo.Curtail/Uncurtail.
Add session attribution (session.ActorCurtailment) and update capability mapping/tests for the new command types.

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
server/internal/domain/session/models.go	Adds `ActorCurtailment` session actor for self-originated curtailment traffic attribution.
server/internal/domain/session/models_test.go	Pins actor string labels and ensures new actor doesn’t collide with existing values.
server/internal/domain/miner/dto/command_dto.go	Introduces `CurtailPayload` for queue message payloads (curtailment level).
server/internal/domain/commandtype/enum.go	Adds `Curtail`/`Uncurtail` command types with `String`/`FromString` support.
server/internal/domain/commandtype/enum_test.go	Adds round-trip and label-stability tests for command type string conversions.
server/internal/domain/command/service.go	Extends activity event mapping for `curtail` / `uncurtail`.
server/internal/domain/command/execution_service.go	Adds execution dispatch arms that call SDK curtailment methods.
server/internal/domain/command/execution_service_test.go	Adds unit tests verifying curtail/un-curtail dispatch and payload unmarshalling behavior.
server/internal/domain/command/capability_mapping.go	Maps new proto command types to curtailment capability set (OR semantics).
server/internal/domain/command/capability_checker_test.go	Ensures new command types are included in capability mapping + `RequiresCapabilityCheck`.
server/generated/grpc/minercommand/v1/command.pb.go	Regenerated Go protobuf output reflecting new command types.
proto/minercommand/v1/command.proto	Adds `COMMAND_TYPE_CURTAIL` and `COMMAND_TYPE_UNCURTAIL` enum values.
client/src/protoFleet/api/generated/minercommand/v1/command_pb.ts	Regenerated TS protobuf output reflecting new command types.

…ties The capability_mapping entries for COMMAND_TYPE_CURTAIL and COMMAND_TYPE_UNCURTAIL were added without a matching arm in hasAnyCapability, so the switch fell through and CheckCommandCapabilities silently reported none_supported=true for every device — even when the device's CommandCapabilities advertised CurtailFullSupported or CurtailEfficiencySupported. Add the two missing case arms wired to the existing proto fields, plus unit coverage for both constants (true/false per capability) and an OR-semantics test confirming the map entry now actually gates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Reject out-of-range CurtailLevel at the dispatch boundary as FailedPrecondition (which is in the permanent-failure arm of markQueueMessageStatus) so a malformed payload doesn't burn through MaxFailureRetries before being rejected. - Add activitymodels.ActorCurtailment and route session.ActorCurtailment through actorTypeFromSession so reconciler-dispatched commands attribute to the curtailment actor instead of falling through to ActorUser. - Trim the CurtailPayload.Level doc comment of its hardcoded numeric values (1=Efficiency, 2=Full) which had no compile-time anchor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 15 out of 17 changed files in this pull request and generated 1 comment.

Loop the bounds-guard sub-test over level=0 (below Efficiency) and level=3 (above Full) so a mutation on either comparison operator is caught. Each iteration creates a fresh gomock controller and asserts the miner's Curtail is never called. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion A malformed curtail payload is a deterministic, non-recoverable input bug — retrying it MaxFailureRetries times on the per-device FIFO queue just blocks the queue head. Reclassify the unmarshal-error arm to FailedPrecondition (which short-circuits to UpdateMessagePermanentlyFailed on the first attempt) so it matches the invalid-level branch already in this dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CheckCommandCapabilities for COMMAND_TYPE_CURTAIL was OR'ing FULL and Efficiency, so a miner that advertised only Efficiency was reported as supporting CURTAIL even though curtailment dispatches FULL. Operators or automations relying on the capability surface for fleet selection would see false positives and the batch would fail at execution. Tighten the mapping to CapabilityCurtailFull only; UNCURTAIL keeps the OR-set since restore is level-independent. Tests split accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ubtest Simplification pass: tighten three new comments where the code already communicated what the comment said, and split the bounds-guard test loop into named t.Run subtests so failures attribute per level value. - capability_mapping.go: drop "currently" qualifier and 3 redundant lines - commandtype/enum.go: trim Curtail godoc to one-liner matching peers - command_dto.go: drop the type-restating second sentence - execution_service_test.go: wrap level=0 / level=3 in named subtests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Implements the StartCurtailment write path. The handler validates the request, runs the existing selector pipeline shared with Preview, and persists event + per-target rows in a single transaction with baseline_power_w captured from the latest telemetry sample per device. On insufficient curtailable load the handler returns InvalidArgument with the same structured detail Preview emits today; on empty plan after a successful selector run it returns InvalidArgument rather than persisting an empty event. Service-level changes: - Refactor Preview into a shared runSelector pipeline so Start reuses org-config + scope + candidate + classify + buildPlan without duplication; Plan gains an EventUUID populated only on Start success. - Add Service.Start with a StartRequest superset of PreviewRequest (restore batch, durations, idempotency/external attribution, reason). validateStartRequest layers the new bounds checks on top of the existing Preview validator. - Idempotency lookup is left as a TODO at the persistence boundary; the field is plumbed end-to-end so the lookup query can land later without a contract change. Persistence: - CurtailmentStore.InsertEventWithTargets is the transactional helper that wraps both inserts so an Insert that succeeds with zero targets cannot leak into the lifecycle tables. - The sqlstore implementation runs both queries inside db.WithTransaction. Handler: - StartCurtailment derives source_actor_type from session.Info (user / api_key) so the audit trail attributes correctly without pulling session into the service. The admin override gate already fires when restore_batch_size_override or candidate_min_power_w_override are set. - Response echoes the persisted event with target rollup; pending state is the persisted shape since dispatch is the reconciler's job. Out of scope for this commit (separate follow-ups still complete BE-3): - Initial Curtail batch dispatch and the reconciler that picks up pending events. - CurtailmentActiveFilter registration on commandSvc. - Schedule-processor curtailment-skip emission. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 129c208fc0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…event filter Curtail / Uncurtail public methods on the command service mirror the existing Reboot / StartMining shape so the curtailment reconciler can dispatch through the standard preflight + queue pipeline rather than poking the queue directly. CurtailmentActiveFilter gates external commands against devices that are part of an active curtailment event. Curtailment-origin traffic (Actor=ActorCurtailment) bypasses the filter so the reconciler can issue Curtail/Uncurtail without self-blocking, mirroring how ScheduleConflictFilter bypasses ActorScheduler for scheduler-origin traffic. The filter short-circuits when no active events exist so the common-case command preflight stays cheap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…leetd wiring Reconciler picks up pending curtailment events and drives them through their lifecycle: dispatch initial Curtail per pending target with Actor=ActorCurtailment so the active-event filter self-bypasses, watch telemetry to confirm targets, transition the event pending->active when all targets confirm, then run drift detection on active events and re-dispatch up to MaxRetries before declaring drift-exhausted. Each tick is single-instance and serial. End-of-tick heartbeat upsert backs the operational liveness alert. Per-event panics are caught so one bad event doesn't kill the tick. Schedule processor now distinguishes curtailment-active filter skips and emits schedule_skipped_due_to_curtailment instead of the generic schedule_executed device_count=0 path, so operators can see when an active curtailment event preempted a scheduled command. fleetd registers CurtailmentActiveFilter on commandSvc alongside the existing ScheduleConflictFilter, and starts the reconciler alongside the schedule processor with the same graceful-shutdown shape. New sqlc queries: ListNonTerminalCurtailmentEvents, UpdateCurtailmentEventState, UpdateCurtailmentTargetState, UpsertCurtailmentReconcilerHeartbeat. Store interface gains ListNonTerminalEvents / UpdateEventState / UpdateTargetState / UpsertHeartbeat to wrap them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48bbed492e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Several reconciler issues that block real operation under failure modes: - A filter-skipped command (cmd.Curtail returns nil error with non-empty result.Skipped) was silently treated as a successful dispatch. The device never received the curtail command but the target moved to Dispatched, so confirmDispatched would never see telemetry move and the target stayed dispatched indefinitely. Now examines Skipped after the nil-error check; matching device entries fall through to the recordDispatchFailure path with the skip reason as LastError. - A single permanently-failing dispatch held the entire event in pending forever — maybeMarkActive required all targets in Confirmed with no terminal-failure budget. dispatchOneCurtail errors now bump RetryCount, persist LastError, and at MaxRetries transition the target to RestoreFailed; maybeMarkActive admits Confirmed-or-terminal as the promotion condition. When every target is RestoreFailed the event transitions to completed_with_failures rather than sitting non-terminal. - After Confirmed→Drifted→Dispatched (drift redispatch), observeActive's Dispatched arm was an empty fall-through. Targets stayed dispatched even when telemetry showed curtailment had resumed. The arm now calls the same telemetry-confirmation path used during initial dispatch so drift recovery closes back to Confirmed in the same flow. - ListTargetsByEvent fetched three times per pending event per tick (dispatchPending + confirmDispatched + maybeMarkActive). dispatchPending now fetches once and threads the slice through subsequent phases; per-target updates mutate in-place. The structural change also resolves the silent-error-swallowing in confirmDispatched and maybeMarkActive since neither phase fetches anymore. - RetryCount was never reset when a re-dispatched target re-confirmed, so the budget was consumed across unrelated drift episodes; an off-by-one between checkDrift and observeActive's guards meant checkDrift's effective budget was MaxRetries+1. Both now reset on (re)confirm and use >=MaxRetries as the consistent boundary. - runTick lacked a top-level defer recover(); a panic in ListNonTerminalEvents tore down the goroutine. safeTick now wraps each tick with panic recovery and the next tick still runs. - Heartbeat upsert at end of runTick used workCtx; the shutdown watchdog's workCancel could drop the final heartbeat write. The upsert now uses a Background-derived ctx with a 5s timeout so a staleness alert won't fire spuriously after a clean restart. - Reconciler.Start was not idempotent — double-Start spawned parallel tick loops. A running flag guards both Start and Stop. - Successful dispatchOneCurtail now clears LastError on the target row rather than leaving the previous failure string in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h_size Add proto buf.validate length bounds to the four text fields on StartCurtailmentRequest that previously accepted arbitrary input: - idempotency_key, reason, external_source, external_reference all gain string.max_len = 256 so a malicious or buggy caller cannot persist multi-megabyte values through these unbounded TEXT columns. The defense-in-depth length check in validateStartRequest catches non- Connect callers (internal CLIs, tests, future non-Connect surfaces). - restore_batch_size gains uint32.lte = 10000 to match the existing bound on StopCurtailmentRequest.restore_batch_size_override; the proto contract was inconsistent across the two sibling fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ip helpers InsertEvent and InsertTarget on CurtailmentStore became unused once InsertEventWithTargets landed as the only transactional write surface. Removing the methods drops the SQL store implementations and the panic-stub-on-each-fake-store boilerplate. countConflictSkips and countCurtailmentActiveSkips in schedule/processor were structural duplicates differing only in the constant compared. Extracted into countSkipsByFilter; the per-filter wrappers are one-liners. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f37d0f9648

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ets stay drifted Two reconciler bugs introduced in 2ff8d38's failure-handling refactor: - checkDrift was bumping RetryCount when transitioning Confirmed→Drifted, and recordDispatchFailure was bumping it again when the redispatch failed. Each drift+failed-redispatch consumed two retries from the budget instead of one, halving MaxRetries for flapping miners. checkDrift now performs only the state transition; the increment lives exclusively in recordDispatchFailure so the budget tracks failed dispatch attempts rather than drift events. - recordDispatchFailure unconditionally wrote TargetStatePending on a non-terminal failure. For active-event drift redispatches the target flipped from Drifted to Pending, but observeActive's Pending case is a no-op once the event is active — the target sat forever with no retries and no terminal transition. dispatchOneCurtail and recordDispatchFailure now take an explicit nonTerminalFailureState; pending-event callers pass Pending, drifted-redispatch callers pass Drifted, and observeActive's Drifted arm picks up the target on the next tick. While in the same scope: confirmDispatched now loops and delegates to confirmOneDispatched (the per-target primitive added in the prior fix commit) so there's a single maintenance surface for the dispatched → confirmed state transition. Stop sets running=false and captures cancel funcs under the mutex before wg.Wait, so a concurrent second Stop hits the running guard and returns immediately rather than racing through the wait. Two existing tests had assertions that codified the double-bump behavior; updated to match the fixed invariant. Added regression coverage: MaxRetries=3 maps to exactly 3 dispatch attempts before RestoreFailed, and a non-terminal drift failure stays Drifted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…odoc Internal ticket prefixes don't belong in source comments. Remove the (BE-3) reference left over from when the curtailment-skip query support was prospective. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 388ab94841

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ect whitespace reasons Two Start-validation contract bugs flagged in review: - The translator rejected max_duration_seconds=0 with allow_unbounded false, but the request contract treats 0 as the "use org default" sentinel (with allow_unbounded reserved for explicit opt-out of normalization). Ordinary callers using default values were getting InvalidArgument and could not start curtailment without sending an explicit duration. The translator now leaves MaxDurationSeconds nil for the sentinel, the validator allows nil with !allow_unbounded, and Service.Start normalizes nil to OrgConfig.MaxDurationDefaultSec before persistence. runSelector returns the OrgConfig so the normalization avoids a second DB read. - Whitespace-only reasons (" ") passed the `req.Reason == ""` check but tripped the DB's `length(trim(reason)) > 0` constraint, surfacing as a 500 instead of an InvalidArgument. The validator now uses strings.TrimSpace so whitespace-only is rejected with the same message as empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t empty batch as failure Two reconciler dispatch-loop bugs flagged in review: - confirmOneDispatched used isCurtailedByPower, which preserves curtailed=true on missing or non-finite telemetry to avoid spurious redispatch during drift detection. That polarity is wrong for the confirmation path: targets were being promoted to confirmed without any evidence the curtailment actually took effect, and the event could flip to active based on absent telemetry. Introduce a separate isPositivelyCurtailed predicate that requires finite power below the drift threshold (or finite zero-or-negative hash when no baseline). confirmOneDispatched uses the new predicate; checkDrift keeps isCurtailedByPower since the conservative default is correct for drift. - dispatchOneCurtail marked the target dispatched even when the command service returned a nil-error result with empty BatchIdentifier (processCommand can do this when device-ID resolution returns zero rows — e.g. miner unpaired between Start and reconcile). The target stuck in dispatched with no batch to execute, no retry increment, and no eventual termination. Treat the empty-batch case as a dispatch failure so it consumes a retry and the event progresses normally. Documents the known UserID=0 / command_batch_log.created_by FK gap inline at reconcilerContext; the proper fix needs a new migration adding curtailment_event.created_by_user_id and lands as a focused follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0165863b58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…iler gap doc - StartCurtailment response now echoes the resolved max_duration_seconds (the value Service.Start actually persisted after normalizing the use-org-default sentinel) rather than the request's raw zero. Plan gains an EffectiveMaxDurationSeconds field that toStartResponse reads via a small helper; falls back to the request value when Plan does not carry a resolved value. - Service.Start guards against a misconfigured org row carrying MaxDurationDefaultSec <= 0; the resolved value is rejected with InvalidArgument rather than persisted as-is. - TestService_Start_RejectsEmptyReason now covers whitespace-only and tab/newline cases, not just the empty string, so the strings.TrimSpace check has direct coverage. - KNOWN GAP comment at reconcilerContext spells out the runtime impact ("dispatch is broken in production until the migration ships") so a reader doesn't mistake the deferred follow-up for polish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9939d07cca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…ndary The four uint32 fields on StartCurtailmentRequest (max_duration_seconds, restore_batch_size, restore_batch_interval_sec, min_curtailed_duration_sec) went through uint32ToInt32Saturating, which silently clamped values above math.MaxInt32 to MaxInt32. A caller sending 3_000_000_000 seconds got a different persisted duration with no validation error, breaking request/response accuracy for valid protobuf inputs. Replace the saturating helper with uint32ToInt32Strict that returns InvalidArgument naming the offending field. Connect-validated inputs are unaffected (the proto bounds cap reachable values well below MaxInt32); non-Connect callers and any future field without a proto bound now see a clear error instead of silent saturation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…otency_key A retry with a previously-used (org_id, idempotency_key) pair tripped the partial unique index uq_curtailment_event_idempotency at insert time, surfacing as Internal — defeating the purpose of exposing an idempotency key on this API. Add GetCurtailmentEventByIdempotencyKey sqlc query (org-scoped) and a matching CurtailmentStore.GetEventByIdempotencyKey method. Service.Start performs the lookup before the selector + insert pipeline; on a hit it reconstructs a minimal Plan from the persisted event + targets and returns early so the retry produces the same response shape as the original Start. NotFound on the lookup falls through to the normal path. The reconstructed Plan carries the persisted event_uuid, the persisted max_duration_seconds (so EffectiveMaxDurationSeconds is populated), and SelectedDevice entries built from curtailment_target rows. Skipped candidates and estimated kW values are not re-derived from decision_snapshot_jsonb to avoid coupling the retry path to the snapshot schema; clients can re-fetch full detail via the read APIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 79980009c4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

confirmOneDispatched and checkDrift previously skipped silently when the candidate row was missing — typically after a device unpaired or got deleted mid-event. The target stayed in dispatched/confirmed forever, never consumed its retry budget, and the event could not progress to a terminal state. Both paths now route through recordDispatchFailure so the retry budget moves forward and the target lands in RestoreFailed at exhaustion. The confirm path keeps the target dispatched while retrying; the drift path keeps it drifted. Add two regression tests covering each branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When Service.Start short-circuits on a matching idempotency_key, the response was built from the *retry* request's fields (priority, reason, scope, mode params, batch sizes, include_maintenance, etc.) bolted onto the persisted event_uuid + targets. A caller reusing a key with drifted metadata would see the new attributes echoed alongside the original event, which is internally inconsistent. Thread the persisted Event + Target rows through Plan and have the response translator describe the persisted row directly: state, priority, reason, scope (reconstructed from scope_jsonb), mode params (from mode_params_jsonb), batch sizes, observed/baseline power, retry counts, and a real per-state target rollup. Fresh-Start path is unchanged. Add a handler-level regression test that retries with deliberately drifted metadata and asserts the response describes the persisted row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…achable arm planFromExistingEvent populated plan.Selected on every idempotent retry, but the response now reads plan.PersistedTargets exclusively on that path; the Selected loop was a per-retry O(N) allocation with no consumer. Also drop the unreachable "active" arm of desiredStateProto and its const — v1 only writes "curtailed". Future stop/restore work can add the arm when it adds the writer. Inline the remaining literal so the cross-package mirror constant goes away. Trim two godocs to one line each per the project's terse-comment preference. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cda1a3c8f8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…only startResponseFromPersisted called effectiveMaxDurationSeconds with a nil request, leaning on the proto getter's nil-safety. The intent on the persisted path is "render plan.EffectiveMaxDurationSeconds, mapping nil to 0 for allow_unbounded events" — a fresh-path helper that falls back to req.GetMaxDurationSeconds() obscures that and would break loudly if the generated getter ever changed. Add a persistedMaxDurationSeconds helper that takes only the *int32 plan field. effectiveMaxDurationSeconds delegates to it for the plan-set branch. Add a regression test for the allow_unbounded retry path: response surfaces 0 (proto default), not the retry request's drifted MaxDurationSeconds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two concurrent Starts with the same (org_id, idempotency_key) can both miss the pre-read short-circuit; one wins the unique-index insert, the other trips uq_curtailment_event_idempotency. The loser's request was returning Internal — defeating the retry-safety guarantee the key is supposed to give operators. Add interfaces.ErrCurtailmentIdempotencyKeyConflict as the typed signal. sqlstore.InsertEventWithTargets matches pgErr Code/ConstraintName against the partial unique index name and surfaces the sentinel; constraint match keeps the sweep narrow so a future unique constraint on curtailment_event isn't silently swallowed. Service.Start, on receiving the sentinel, re-reads via GetEventByIdempotencyKey and short-circuits to planFromExistingEvent so the loser sees the winner's persisted shape — same retry contract as the pre-read hit path. Add a service-level regression test that drives the race deterministically through a fakeStore.idempotencyRaceWinner hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fe8af068b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Reduces PR scope creep, fixes the production-blocking FK violation, gates the operator entrypoint until BE-4 ships Stop, and trims the reconciler down to what BE-3 must own. FK fix. Migration 000043 adds curtailment_event.created_by_user_id (NOT NULL FK -> user.id). The operator's session.Info.UserID is captured at StartCurtailment, threaded through StartRequest into the persisted event row, and read back by the reconciler when synthesizing dispatch context. Without this column the reconciler would dispatch with UserID=0 and every command_batch_log insert would fail the FK, burning targets through MaxRetries to RestoreFailed. Drop scope creep. Removes the idempotency-key lookup + retry-shaped response that grew during review (~250 LOC across translate.go, service.go, selector.go, the store interface, and sqlc queries). PR description had always flagged the lookup query as a TODO; restored to original scope. Duplicate idempotency_key Start calls now surface as Internal until BE-5 lands the lookup at the persistence boundary. Feature gate. StartCurtailment handler is gated behind startEnabled (default false in cmd/fleetd/main.go) since BE-3 ships dispatch + reconciler primitives but no Stop / restorer / max_duration_seconds enforcement (those are BE-4). An event reaching active has no exit path short of AdminTerminateEvent, so the operator entrypoint stays dormant in production until BE-4 flips the flag. Reconciler complexity. isPositivelyCurtailed and isCurtailedByPower folded into one isCurtailed(..., requirePositiveEvidence bool); processEvent's explicit terminal-state arm replaced with a default; dispatchPending's empty-targets defensive transition simplified to log + return. Post-review polish. CreatedByUserID assertions added to handler and service happy-path tests; new RejectsMissingCreatedByUserID negative-case test; redundant default arms in resolvePriority and strategyReasonLabel removed; persistedMaxDurationSeconds and the two count-skips wrappers inlined; longer comments on the MaxRetries guard in observeActive's drift arm and on the Start-mid-Stop concurrency edge in Stop() that fleetd's single-lifecycle invariant makes unreachable in practice.

Pulled long rationale paragraphs out of source comments where the depth belongs in PR/commit messages. Stripped roadmap version markers (v1, BE-X) from godocs and field annotations per project convention. Kept the load- bearing whys: invariants the next reader needs to make a correct edit (e.g. detached heartbeat ctx, in-memory mirror semantics, panic-recovery asymmetry, FK-driven UserID flow, NOT-NULL-without-backfill safety claim). Touches reconciler, service, translate, handler, sqlstores, sqlc queries, migration 000043, schedule processor, session models, fleetd wiring, and the StartCurtailmentRequest proto. Generated proto/sqlc outputs follow. No behavior changes; tests + lint clean.

rongxin-liu requested a review from a team as a code owner May 7, 2026 14:52

Copilot AI review requested due to automatic review settings May 7, 2026 14:52

github-actions Bot assigned rongxin-liu May 7, 2026

github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 7, 2026

Copilot started reviewing on behalf of rongxin-liu May 7, 2026 14:53 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/domain/command/capability_mapping.go Outdated

rongxin-liu and others added 2 commits May 7, 2026 17:36

rongxin-liu requested a review from Copilot May 7, 2026 15:37

Copilot started reviewing on behalf of rongxin-liu May 7, 2026 15:38 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/domain/command/execution_service.go Outdated

rongxin-liu and others added 5 commits May 7, 2026 17:46

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/handlers/curtailment/translate.go Outdated

Comment thread server/internal/domain/curtailment/service.go Outdated

rongxin-liu and others added 2 commits May 7, 2026 22:04

rongxin-liu changed the title ~~feat(curtailment): add Curtail/Uncurtail command type and dispatch path~~ feat(curtailment): StartCurtailment + dispatch + reconciler May 7, 2026

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

rongxin-liu and others added 3 commits May 7, 2026 23:04

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

rongxin-liu and others added 2 commits May 7, 2026 23:36

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

rongxin-liu and others added 2 commits May 8, 2026 02:21

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread server/internal/domain/stores/sqlstores/curtailment.go

Comment thread server/internal/handlers/curtailment/translate.go Outdated

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread server/internal/handlers/curtailment/translate.go Outdated

rongxin-liu and others added 2 commits May 8, 2026 09:36

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread server/internal/handlers/curtailment/translate.go

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go

rongxin-liu and others added 3 commits May 8, 2026 09:57

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread server/internal/handlers/curtailment/translate.go Outdated

Comment thread server/internal/domain/curtailment/service.go

rongxin-liu and others added 2 commits May 8, 2026 12:42

chatgpt-codex-connector Bot reviewed May 8, 2026

View reviewed changes

Comment thread server/internal/domain/curtailment/reconciler/reconciler.go Outdated

Comment thread server/internal/domain/stores/sqlstores/curtailment.go Outdated

rongxin-liu added 2 commits May 8, 2026 14:22

Conversation

rongxin-liu commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Summary

Operator surface

Operator-id plumbing and FK fix

Dispatch primitives

Preflight + schedule integration

Reconciler

Idempotency

Wiring

What is intentionally not in this PR

Known gap

Test plan

Uh oh!

github-actions Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔐 Codex Security Review

Review Summary

Findings

[HIGH] Unconfirmed curtailment dispatches can lock devices indefinitely

[MEDIUM] Reconciler has no cross-process ownership or state-guarded updates

[MEDIUM] Non-admin callers can set effectively unbounded curtailment durations

[MEDIUM] Shutdown deadline does not actually bound reconciler stop

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

rongxin-liu commented May 7, 2026 •

edited

Loading

github-actions Bot commented May 7, 2026 •

edited

Loading