RFC 0001: Agent + cloud-server split by ankitgoswami · Pull Request #145 · block/proto-fleet

ankitgoswami · 2026-04-30T22:32:23Z

Summary

Proposes the inaugural RFC process for proto-fleet (docs/rfcs/) with a README and template.
RFC 0001 proposes splitting fleetd into a thin on-prem fleet-agent and a cloud-deployable fleetd --mode=server, with a phased rollout that keeps today's docker-compose deploy (--mode=combined) working throughout.
Includes wire protocol (AgentGatewayService), per-org credential model, schema impact, drawbacks, alternatives considered, and 6 phases with mermaid diagrams.
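As a rough illustration of the mode switch described above, the sketch below validates a `--mode` value at startup. The set of values (server, agent, combined, defaulting to combined) follows the Phase 1 commit message quoted later in this thread. That commit says the real flag is validated by kong's enum validator; this sketch substitutes the standard library `flag` package, and the names `Mode` and `ParseMode` are assumptions made for the example, not identifiers from the repository.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Mode is the deployment shape selected with --mode (hypothetical type name).
type Mode string

const (
	ModeServer   Mode = "server"
	ModeAgent    Mode = "agent"
	ModeCombined Mode = "combined"
)

// ParseMode validates a raw flag value against the three known modes.
// An empty value falls back to combined, which keeps the existing
// docker-compose deployment working when the flag is not set.
func ParseMode(raw string) (Mode, error) {
	switch Mode(raw) {
	case "":
		return ModeCombined, nil
	case ModeServer, ModeAgent, ModeCombined:
		return Mode(raw), nil
	default:
		return "", fmt.Errorf("invalid --mode %q: want server, agent, or combined", raw)
	}
}

func main() {
	fs := flag.NewFlagSet("fleetd", flag.ContinueOnError)
	raw := fs.String("mode", string(ModeCombined), "deployment mode: server, agent, or combined")
	if err := fs.Parse(os.Args[1:]); err != nil {
		os.Exit(2)
	}
	mode, err := ParseMode(*raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("starting fleetd in mode:", mode)
}
```

Rejecting unknown values up front means a typo in a deployment manifest fails at startup rather than silently running the wrong shape.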

This is a design RFC, not implementation. Land it as draft so review can happen on the markdown; subsequent PRs implement phases 1-6.

Test plan

Render the markdown on GitHub and confirm mermaid diagrams display
Verify all relative file links (../../server/...) resolve to existing source files at the cited line numbers
Verify internal anchor links (#credentials, #phase-4-...) navigate correctly
Read end-to-end and check for unresolved questions or contradictions before promoting from draft -> accepted

🤖 Generated with Claude Code

github-actions · 2026-04-30T22:37:03Z

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

Reviewed pull request diff only (a66ca90edd84fe2285bedb37f569cb80a61db367...4fa61623f43639993816d53e326de749a2917f5c, exact PR three-dot diff)

Model: gpt-5.4

💡 Click "edited" above to see previous reviews for this PR.

Review Summary

Overall Risk: HIGH

Findings

[HIGH] Enrollment flow does not authenticate the miner-signing key

Category: Auth
Location: docs/rfcs/0001-agent-server-split.md:95
Description: The RFC says both public keys are registered during enrollment, but the operator verification step only checks the identity-key fingerprint. The separate miner-signing key used for Proto miner JWTs is not described as being displayed to the operator, signed by the identity key, or otherwise cryptographically bound to the verified identity key. That leaves a substituted-key gap despite the text claiming the comparison blocks it.
Impact: A compromised cloud/control plane could accept the real identity key to satisfy the UI fingerprint check while pairing or migrating miners onto an attacker-controlled miner-signing key, regaining the ability to mint JWTs for those miners and issue unauthorized device operations.
Recommendation: Authenticate both keys as one bundle. Either require operator verification of both fingerprints, or have the identity key sign a manifest containing the miner-signing public key and require all pairing/rotation flows to use only that bound key.
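One way to read the "signed manifest" option in this recommendation is sketched below: the identity key signs a message containing the miner-signing public key, and any pairing or rotation flow refuses a miner-signing key that lacks a valid signature. This is a minimal sketch, not the RFC's design. The manifest layout, the `manifestPrefix` domain-separation tag, and the function names are all assumptions made for illustration.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// manifestPrefix is a hypothetical domain-separation tag so that a manifest
// signature cannot be confused with a signature over any other message.
const manifestPrefix = "fleet-agent/key-manifest/v1\x00"

// newKeyPair generates a fresh ed25519 key pair and panics on RNG failure.
func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// manifestBytes builds the signed message: prefix || miner-signing public key.
func manifestBytes(minerPub ed25519.PublicKey) []byte {
	msg := make([]byte, 0, len(manifestPrefix)+len(minerPub))
	msg = append(msg, manifestPrefix...)
	return append(msg, minerPub...)
}

// SignManifest is run on the agent: the identity key vouches for the miner-signing key.
func SignManifest(identityPriv ed25519.PrivateKey, minerPub ed25519.PublicKey) []byte {
	return ed25519.Sign(identityPriv, manifestBytes(minerPub))
}

// VerifyManifest is run by any party that already trusts identityPub
// (for example after the operator's fingerprint comparison).
func VerifyManifest(identityPub, minerPub ed25519.PublicKey, sig []byte) error {
	// ed25519.Verify panics on a wrong-sized public key, so check lengths first.
	if len(identityPub) != ed25519.PublicKeySize || len(minerPub) != ed25519.PublicKeySize {
		return errors.New("unexpected public key length")
	}
	if !ed25519.Verify(identityPub, manifestBytes(minerPub), sig) {
		return errors.New("miner-signing key is not bound to the verified identity key")
	}
	return nil
}

func main() {
	identityPub, identityPriv := newKeyPair()
	minerPub, _ := newKeyPair()
	attackerPub, _ := newKeyPair()

	sig := SignManifest(identityPriv, minerPub)
	fmt.Println("genuine miner key accepted:", VerifyManifest(identityPub, minerPub, sig) == nil)
	fmt.Println("substituted key accepted:", VerifyManifest(identityPub, attackerPub, sig) == nil)
}
```

With a binding like this, the operator still compares a single fingerprint, but a control plane that swaps in its own miner-signing key cannot produce a matching manifest signature.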

[HIGH] Per-org credential key gives any compromised agent org-wide secret access

Category: Cryptostealing/Pool Hijack
Location: docs/rfcs/0001-agent-server-split.md:105
Description: The design puts one AES-256-GCM key on every agent in an org and explicitly allows any agent to decrypt any pool or miner credential. That collapses isolation across sites and turns the least-trusted agent into a decryption oracle for the entire org’s stored secrets.
Impact: Compromise of a single agent host exposes every stored stratum credential and miner admin credential for the org. That materially increases the blast radius for pool hijack, miner takeover, and follow-on abuse whenever the attacker later gains network reachability, control-plane leverage, or ownership-transfer influence.
Recommendation: Do not distribute org-wide decryption capability for miner and pool secrets. Wrap secrets to the owning agent(s) or device(s), or at minimum split miner credentials from pool credentials and scope decryption material to the smallest authorized set.
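To make the "wrap secrets to the owning agent" suggestion concrete, here is a minimal sketch in which each agent has its own AES-256-GCM key and the agent ID is bound to the ciphertext as additional authenticated data. It is not the RFC's credential model. How keys are generated, distributed, and rotated is out of scope, and `Seal`, `Open`, and the agent IDs are hypothetical names chosen for the example.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newKey returns a fresh random 32-byte key, which selects AES-256.
func newKey() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts secret under a single agent's key. The agent ID is passed as
// additional authenticated data, so a ciphertext cannot be replayed under a
// different agent's identity. Output layout: nonce || ciphertext || tag.
func Seal(agentKey []byte, agentID string, secret []byte) ([]byte, error) {
	gcm, err := newGCM(agentKey)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, secret, []byte(agentID)), nil
}

// Open decrypts a value produced by Seal. It fails if the key or agent ID differ.
func Open(agentKey []byte, agentID string, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(agentKey)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, []byte(agentID))
}

func main() {
	siteA, siteB := newKey(), newKey()
	sealed, err := Seal(siteA, "agent-site-a", []byte("stratum-password"))
	if err != nil {
		panic(err)
	}
	_, errA := Open(siteA, "agent-site-a", sealed)
	_, errB := Open(siteB, "agent-site-a", sealed)
	fmt.Println("owning agent can decrypt:", errA == nil)
	fmt.Println("other agent can decrypt:", errB == nil)
}
```

Under this scoping, compromising one agent host exposes only the secrets sealed to that agent, which is the blast-radius reduction the finding asks for.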

[MEDIUM] Unplanned agent loss has no non-destructive recovery path for Proto miners

Category: Reliability
Location: docs/rfcs/0001-agent-server-split.md:141
Description: For an unexpected agent loss, the RFC’s only recovery path is manual factory-reset-and-re-pair of each affected miner. There is no backup signer, escrowed recovery flow, or pre-authorized transfer path once the original host is gone.
Impact: A disk failure, accidental key deletion, or host loss can strand an entire site’s Proto miners behind a manual per-device recovery process, creating a severe availability and operational recovery problem.
Recommendation: Define a disaster-recovery mechanism before shipping this model: tested key backup/restore, controlled escrow, or a secondary trusted signer/transfer path that can recover devices without factory resets.

Notes

The diff is documentation-only under docs/rfcs/; there are no runtime code, migration, protobuf, or deployment artifact changes in scope for this review.
I did not score the spool-overflow/data-loss tradeoff or the local health-page exposure as formal findings because the RFC already treats those areas as open design detail, but they should be specified tightly before implementation.

Generated by Codex Security Review | Triggered by: @ankitgoswami | Review workflow run

…plit) Adds a docs/rfcs/ directory with a README and template for the RFC process, plus the inaugural RFC 0001 proposing splitting fleetd into a thin on-prem agent and a cloud-deployable server, with a phased rollout that preserves today's combined-mode docker-compose deployment throughout. Includes the per-org symmetric encryption design for pool/miner credentials and the per-agent ed25519 design for Proto miner JWT signing, with the trade-offs explicitly documented. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Introduces an RFC process under docs/rfcs/ and adds the first design RFC (RFC 0001) proposing a future split between an on-prem agent and a cloud-deployable server mode for proto-fleet.

Changes:

Add docs/rfcs/README.md describing when to write RFCs, lifecycle states, numbering, and format.
Add docs/rfcs/_template.md to standardize RFC structure and metadata.
Add docs/rfcs/0001-agent-server-split.md as a draft architectural proposal for an agent + server split with auth/credential model and phased rollout.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
docs/rfcs/README.md	Defines the RFC process, lifecycle, numbering, and usage guidance.
docs/rfcs/0001-agent-server-split.md	Draft RFC describing the agent/server split, security model, and rollout phases.
docs/rfcs/_template.md	Provides a standard RFC template for consistent structure and metadata.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>

mcharles-square

Generally looks good

Lands Phase 1 of RFC 0001 (the agent + cloud-server split): the wire protocol scaffold, agent identity schema, and a stub handler that returns CodeUnimplemented for every RPC. No behavior change in any deployment shape (combined / server / agent).

Wire protocol (proto/agentgateway/v1/agentgateway.proto):
- Register, BeginAuthHandshake, CompleteAuthHandshake, UploadTelemetry, UploadEvents, UploadHeartbeat, ControlStream.
- Post-handshake RPCs derive agent identity from a session_token in Authorization metadata rather than any body field.
- RegisterRequest carries an operator-issued, org-scoped enrollment_token.
- buf.validate rules pin ed25519 key/signature lengths, bound api_key/session_token/name, cap opaque payloads at 1 MiB, require timestamps, and require a populated ControlStream oneof variant.

Handler (server/internal/handlers/agentgateway):
- Embeds UnimplementedAgentGatewayServiceHandler. Registered on the shared mux and added to grpcreflect.
- All seven RPCs are in UnauthenticatedProcedures because the user-session AuthInterceptor cannot validate the agent's session_token; the handler is responsible for credential validation when implemented.

Logging redaction (server/internal/handlers/interceptors):
- Handshake request/response procedures added to redaction lists; ControlStream, UploadTelemetry, UploadEvents added to SensitiveBodyProcedures.
- The streaming logger now suppresses per-message bodies for sensitive procedures.

Schema (server/migrations/000039_create_agent_tables):
- agent_device is the single source of truth for device-to-agent ownership; combined mode is the absence of a row.
- agent_device carries (agent_id, device_id, org_id) with composite FKs to both agent(id, org_id) and device(id, org_id), so cross-tenant pairings are rejected by the DB.
- agent identity_pubkey and (org_id, name) uniqueness are partial indexes scoped to deleted_at IS NULL, so a soft-deleted agent does not block re-enrollment.
- created_at and updated_at are NOT NULL.

--mode flag (server/cmd/fleetd/config.go):
- Accepted via kong's enum validator (server/agent/combined, default combined). Not yet load-bearing.

Closes #157
Refs #145
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
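The length and size constraints that the commit message attributes to buf.validate can be pictured as plain Go checks. The sketch below is only an approximation of what such rules enforce. It is not generated validation code, and the function names are hypothetical. The ed25519 sizes come from the Go standard library constants (32-byte public keys, 64-byte signatures); the 1 MiB cap is the figure stated in the commit message.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// maxOpaquePayload mirrors the 1 MiB cap on opaque payloads.
const maxOpaquePayload = 1 << 20

// ValidatePublicKey requires exactly ed25519.PublicKeySize (32) bytes.
func ValidatePublicKey(key []byte) error {
	if len(key) != ed25519.PublicKeySize {
		return fmt.Errorf("public key must be %d bytes, got %d", ed25519.PublicKeySize, len(key))
	}
	return nil
}

// ValidateSignature requires exactly ed25519.SignatureSize (64) bytes.
func ValidateSignature(sig []byte) error {
	if len(sig) != ed25519.SignatureSize {
		return fmt.Errorf("signature must be %d bytes, got %d", ed25519.SignatureSize, len(sig))
	}
	return nil
}

// ValidatePayload rejects opaque payloads larger than 1 MiB.
func ValidatePayload(payload []byte) error {
	if len(payload) > maxOpaquePayload {
		return errors.New("payload exceeds 1 MiB")
	}
	return nil
}

func main() {
	fmt.Println("32-byte key ok:", ValidatePublicKey(make([]byte, 32)) == nil)
	fmt.Println("31-byte key ok:", ValidatePublicKey(make([]byte, 31)) == nil)
	fmt.Println("64-byte signature ok:", ValidateSignature(make([]byte, 64)) == nil)
	fmt.Println("1 MiB + 1 payload ok:", ValidatePayload(make([]byte, maxOpaquePayload+1)) == nil)
}
```

Pinning exact lengths at the wire layer matters in Go specifically because `ed25519.Verify` panics when handed a public key of the wrong size.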

Server foundation for issue #158. Lands the operator and agent paths of the enrollment flow end-to-end against the database; the fleet-agent CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new subject_kind ('user'|'agent'), CHECK that exactly one of user_id / agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes. State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces; atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register + Confirm transitions. Confirm flips pending_enrollment to CONFIRMED, marks agent.enrollment_status='CONFIRMED', and issues the agent's api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake / ResolveSession state machine. BeginHandshake verifies the api_key, cross-checks the supplied identity_pubkey against the enrolled key, and mints a one-shot challenge. CompleteHandshake atomically consumes the challenge and verifies the ed25519 signature against the agent's identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key flows stay unchanged. Validate keeps a single signature; callers branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by AgentAuthInterceptor (mirrors session.Info for users, via connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures (Upload* and ControlStream); the user-session AuthInterceptor short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs: Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake / CompleteAuthHandshake stop returning Unimplemented and delegate to the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the existing user-session AuthInterceptor; org_id resolved from session info.

Tests:
- Integration tests against a real timescaledb cover the happy path (create code -> register -> confirm -> handshake -> resolve session) plus the security cases called out in the AC: replay of a consumed code, expired code, replayed challenge, and identity_pubkey mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
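The one-shot challenge described in this commit message (mint a nonce, consume it atomically, then verify an ed25519 signature) can be sketched with an in-memory store. This is an illustration under assumptions, not the merged implementation: the real flow uses the agent_auth_challenge table with DELETE ... RETURNING, and the commit does not say exactly what the agent signs. Here the agent is assumed to sign the raw nonce, and `ChallengeStore`, `defaultTTL`, and the helper names are hypothetical. Minting the session_token on success is omitted.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
	"time"
)

// defaultTTL is a hypothetical challenge lifetime; the source only says "short-TTL".
const defaultTTL = 30 * time.Second

type challenge struct {
	nonce   []byte
	expires time.Time
}

// ChallengeStore is an in-memory stand-in for the agent_auth_challenge table.
type ChallengeStore struct {
	mu      sync.Mutex
	pending map[string]challenge
}

func NewChallengeStore() *ChallengeStore {
	return &ChallengeStore{pending: make(map[string]challenge)}
}

// Begin mints a random nonce and stores it under a random challenge ID.
func (s *ChallengeStore) Begin(ttl time.Duration) (string, []byte, error) {
	buf := make([]byte, 48) // 16 bytes of ID material + 32-byte nonce
	if _, err := rand.Read(buf); err != nil {
		return "", nil, err
	}
	id, nonce := hex.EncodeToString(buf[:16]), buf[16:]
	s.mu.Lock()
	s.pending[id] = challenge{nonce: nonce, expires: time.Now().Add(ttl)}
	s.mu.Unlock()
	return id, nonce, nil
}

// consume removes and returns the challenge in one critical section, so a
// second caller with the same ID finds nothing (the DELETE ... RETURNING idea).
func (s *ChallengeStore) consume(id string) (challenge, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.pending[id]
	delete(s.pending, id)
	return c, ok
}

// Complete consumes the challenge first, then checks expiry and signature, so
// a failed attempt also burns the challenge.
func (s *ChallengeStore) Complete(id string, identityPub ed25519.PublicKey, sig []byte) error {
	c, ok := s.consume(id)
	if !ok {
		return errors.New("unknown or already consumed challenge")
	}
	if time.Now().After(c.expires) {
		return errors.New("challenge expired")
	}
	if len(identityPub) != ed25519.PublicKeySize || !ed25519.Verify(identityPub, c.nonce, sig) {
		return errors.New("signature does not match the enrolled identity key")
	}
	return nil
}

// newIdentity and signNonce play the agent's side of the exchange.
func newIdentity() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func signNonce(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

func main() {
	store := NewChallengeStore()
	pub, priv := newIdentity()
	id, nonce, err := store.Begin(defaultTTL)
	if err != nil {
		panic(err)
	}
	sig := signNonce(priv, nonce)
	fmt.Println("first attempt:", store.Complete(id, pub, sig))
	fmt.Println("replay:", store.Complete(id, pub, sig))
}
```

Consuming before verifying is the property that makes a replayed challenge fail even when the signature is valid, which matches the "replayed challenge" case listed in the integration tests.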

Add the agent-side CLI that drives the bootstrap flow against the already-merged enrollment + handshake server (PR #181). Supports three subcommands: enroll, status, refresh.

- enroll generates ed25519 identity + miner-signing keypairs, calls Register against AgentGatewayService, prints the local fingerprint for operator visual comparison, accepts the api_key the operator pastes back after confirming in the UI, runs the BeginAuth/CompleteAuth handshake, and persists everything to a 0600 YAML state file under $XDG_STATE_HOME/fleet-agent (or ~/.local/state/fleet-agent by default).
- refresh re-runs the handshake against the stored api_key.
- status reads back the local state.

Tests are wire-level: they spin up an httptest server with the real connect handler against a fake AgentGateway that verifies the signature with ed25519.Verify, mirroring the server.

Refs: #158, #145
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
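Two local steps of the enroll flow described here, printing a key fingerprint and persisting state with 0600 permissions, might look roughly like the sketch below. Several details are assumptions: the commit message does not specify the fingerprint format (hex-encoded SHA-256 of the public key is used here), the file name `state.yaml` is made up, and the YAML encoding itself is left out because the standard library has no YAML package. Only the directory resolution rule and the 0600 mode come from the text above.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// Fingerprint returns a hex-encoded SHA-256 digest of the public key.
// The real CLI's fingerprint format is not stated in the source.
func Fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return hex.EncodeToString(sum[:])
}

// stateDirFrom applies the rule from the commit message:
// $XDG_STATE_HOME/fleet-agent if set, else ~/.local/state/fleet-agent.
func stateDirFrom(xdgStateHome, home string) string {
	if xdgStateHome != "" {
		return filepath.Join(xdgStateHome, "fleet-agent")
	}
	return filepath.Join(home, ".local", "state", "fleet-agent")
}

// WriteState writes data to dir/name with owner-only permissions.
func WriteState(dir, name string, data []byte) (string, error) {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name)
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return "", err
	}
	// os.WriteFile only applies the mode when it creates the file, so tighten
	// the permissions explicitly in case the file already existed.
	if err := os.Chmod(path, 0o600); err != nil {
		return "", err
	}
	return path, nil
}

func main() {
	// A nil reader makes GenerateKey use crypto/rand.Reader.
	pub, _, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("identity fingerprint:", Fingerprint(pub))

	home, _ := os.UserHomeDir()
	fmt.Println("default state dir:", stateDirFrom(os.Getenv("XDG_STATE_HOME"), home))

	// Write to a throwaway directory so the demo leaves no state behind.
	tmp, err := os.MkdirTemp("", "fleet-agent-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)
	path, err := WriteState(tmp, "state.yaml", []byte("agent_name: demo\n"))
	if err != nil {
		panic(err)
	}
	info, err := os.Stat(path)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s written with mode %v\n", path, info.Mode().Perm())
}
```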

github-actions Bot assigned ankitgoswami Apr 30, 2026

github-actions Bot added the documentation Improvements or additions to documentation label Apr 30, 2026

ankitgoswami force-pushed the ankitg/rfc-0001-agent-server-split branch 3 times, most recently from 69e41b0 to 3b65048 May 4, 2026 16:24

ankitgoswami force-pushed the ankitg/rfc-0001-agent-server-split branch from 3b65048 to 9849f6e May 4, 2026 16:28

ankitgoswami marked this pull request as ready for review May 4, 2026 16:34

ankitgoswami requested a review from a team as a code owner May 4, 2026 16:34

Copilot AI review requested due to automatic review settings May 4, 2026 16:34

Copilot started reviewing on behalf of ankitgoswami May 4, 2026 16:35

Copilot AI reviewed May 4, 2026

Comment thread docs/rfcs/0001-agent-server-split.md Outdated

Comment thread docs/rfcs/0001-agent-server-split.md

Potential fix for pull request finding

4fa6162

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>

flesher reviewed May 4, 2026

Comment thread docs/rfcs/0001-agent-server-split.md

mcharles-square reviewed May 4, 2026

Comment thread docs/rfcs/0001-agent-server-split.md

Comment thread docs/rfcs/0001-agent-server-split.md

Comment thread docs/rfcs/0001-agent-server-split.md

ankitgoswami mentioned this pull request May 5, 2026

feat(rfc-0001): agent enrollment + auth model [phase 2 server] #181

Merged

6 tasks

ankitgoswami mentioned this pull request May 6, 2026

feat(agent): agentbootstrap library for enroll + handshake [phase 2 agent] #187

Merged

4 tasks

flesher self-requested a review May 6, 2026 23:29

flesher approved these changes May 6, 2026

ankitgoswami merged commit b1b6479 into main May 7, 2026
26 checks passed

ankitgoswami deleted the ankitg/rfc-0001-agent-server-split branch May 7, 2026 16:59

ankitgoswami merged 2 commits into main from ankitg/rfc-0001-agent-server-split

4 participants

Conversation

ankitgoswami commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

github-actions Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔐 Codex Security Review

Review Summary

Findings

[HIGH] Enrollment flow does not authenticate the miner-signing key

[HIGH] Per-org credential key gives any compromised agent org-wide secret access

[MEDIUM] Unplanned agent loss has no non-destructive recovery path for Proto miners

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mcharles-square left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ankitgoswami commented Apr 30, 2026 •

edited

Loading

github-actions Bot commented Apr 30, 2026 •

edited

Loading