RFC 0001: Agent + cloud-server split#145

Merged
ankitgoswami merged 2 commits into main from ankitg/rfc-0001-agent-server-split
May 7, 2026

Conversation

Contributor

@ankitgoswami ankitgoswami commented Apr 30, 2026

Summary

  • Proposes the inaugural RFC process for proto-fleet (docs/rfcs/) with a README and template.
  • RFC 0001 proposes splitting fleetd into a thin on-prem fleet-agent and a cloud-deployable fleetd --mode=server, with a phased rollout that keeps today's docker-compose deploy (--mode=combined) working throughout.
  • Includes wire protocol (AgentGatewayService), per-org credential model, schema impact, drawbacks, alternatives considered, and 6 phases with mermaid diagrams.

This is a design RFC, not an implementation. Land it as a draft so review can happen on the markdown; subsequent PRs implement phases 1-6.

Test plan

  • Render the markdown on GitHub and confirm mermaid diagrams display
  • Verify all relative file links (../../server/...) resolve to existing source files at the cited line numbers
  • Verify internal anchor links (#credentials, #phase-4-...) navigate correctly
  • Read end-to-end and check for unresolved questions or contradictions before promoting from draft -> accepted

🤖 Generated with Claude Code

@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Apr 30, 2026

github-actions Bot commented Apr 30, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (a66ca90edd84fe2285bedb37f569cb80a61db367...4fa61623f43639993816d53e326de749a2917f5c, exact PR three-dot diff)
  • Model: gpt-5.4


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Enrollment flow does not authenticate the miner-signing key

  • Category: Auth
  • Location: docs/rfcs/0001-agent-server-split.md:95
  • Description: The RFC says both public keys are registered during enrollment, but the operator verification step only checks the identity-key fingerprint. The separate miner-signing key used for Proto miner JWTs is not described as being displayed to the operator, signed by the identity key, or otherwise cryptographically bound to the verified identity key. That leaves a substituted-key gap despite the text claiming the comparison blocks it.
  • Impact: A compromised cloud/control plane could accept the real identity key to satisfy the UI fingerprint check while pairing or migrating miners onto an attacker-controlled miner-signing key, regaining the ability to mint JWTs for those miners and issue unauthorized device operations.
  • Recommendation: Authenticate both keys as one bundle. Either require operator verification of both fingerprints, or have the identity key sign a manifest containing the miner-signing public key and require all pairing/rotation flows to use only that bound key.
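The second option in the recommendation (the identity key signing a manifest that contains the miner-signing public key) could look roughly like the Go sketch below. It is illustrative only: the `manifestPrefix` label, the function names, and the manifest layout are assumptions made for this example and are not taken from the RFC.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// manifestPrefix domain-separates manifest signatures from any other use of
// the identity key. Hypothetical label, not defined by the RFC.
const manifestPrefix = "fleet-agent-key-manifest-v1:"

// signManifest has the identity key sign the miner-signing public key, so the
// two keys travel as one authenticated bundle.
func signManifest(identityPriv ed25519.PrivateKey, minerPub ed25519.PublicKey) []byte {
	msg := append([]byte(manifestPrefix), minerPub...)
	return ed25519.Sign(identityPriv, msg)
}

// verifyManifest reports whether minerPub was bound to identityPub by sig.
func verifyManifest(identityPub, minerPub ed25519.PublicKey, sig []byte) bool {
	if len(identityPub) != ed25519.PublicKeySize {
		return false // ed25519.Verify panics on a wrong-sized key
	}
	msg := append([]byte(manifestPrefix), minerPub...)
	return ed25519.Verify(identityPub, msg, sig)
}

// newKeyPair wraps ed25519.GenerateKey; it panics on entropy failure to keep
// the sketch short.
func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	idPub, idPriv := newKeyPair()
	minerPub, _ := newKeyPair()
	attackerPub, _ := newKeyPair()

	sig := signManifest(idPriv, minerPub)
	fmt.Println("bound key accepted:", verifyManifest(idPub, minerPub, sig))
	fmt.Println("substituted key accepted:", verifyManifest(idPub, attackerPub, sig))
}
```

With a binding like this, the operator's fingerprint check on the identity key would transitively cover the miner-signing key, provided every pairing and rotation flow refuses keys that lack a valid manifest signature.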

[HIGH] Per-org credential key gives any compromised agent org-wide secret access

  • Category: Cryptostealing/Pool Hijack
  • Location: docs/rfcs/0001-agent-server-split.md:105
  • Description: The design puts one AES-256-GCM key on every agent in an org and explicitly allows any agent to decrypt any pool or miner credential. That collapses isolation across sites and turns the least-trusted agent into a decryption oracle for the entire org’s stored secrets.
  • Impact: Compromise of a single agent host exposes every stored stratum credential and miner admin credential for the org. That materially increases the blast radius for pool hijack, miner takeover, and follow-on abuse whenever the attacker later gains network reachability, control-plane leverage, or ownership-transfer influence.
  • Recommendation: Do not distribute org-wide decryption capability for miner and pool secrets. Wrap secrets to the owning agent(s) or device(s), or at minimum split miner credentials from pool credentials and scope decryption material to the smallest authorized set.
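One way to read "wrap secrets to the owning agent" is sketched below in Go: each credential is sealed with AES-256-GCM under a key held only by its owning agent, with the agent ID bound in as additional authenticated data. The function names, the per-agent key model, and the nonce-prefixed ciphertext layout are assumptions for illustration; the RFC describes a per-org key instead.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newKey returns a random 32-byte key, i.e. an AES-256 key.
func newKey() []byte {
	k := make([]byte, 32)
	if _, err := rand.Read(k); err != nil {
		panic(err)
	}
	return k
}

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext under one agent's key. The agent ID is authenticated
// (not encrypted) so a ciphertext cannot be replayed under another agent's name.
// Output layout (assumed): nonce || ciphertext || tag.
func seal(key []byte, agentID string, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, []byte(agentID)), nil
}

// open reverses seal; it fails if the key or the agent ID does not match.
func open(key []byte, agentID string, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("ciphertext too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], []byte(agentID))
}

func main() {
	keyA, keyB := newKey(), newKey()
	sealed, err := seal(keyA, "agent-a", []byte("stratum-password"))
	if err != nil {
		panic(err)
	}
	_, errOwner := open(keyA, "agent-a", sealed)
	_, errOther := open(keyB, "agent-b", sealed)
	fmt.Println("owner can decrypt:", errOwner == nil)
	fmt.Println("other agent can decrypt:", errOther == nil)
}
```

Under this model, a compromised agent host exposes only the credentials sealed to that agent, at the cost of re-wrapping secrets whenever device ownership moves between agents.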

[MEDIUM] Unplanned agent loss has no non-destructive recovery path for Proto miners

  • Category: Reliability
  • Location: docs/rfcs/0001-agent-server-split.md:141
  • Description: For an unexpected agent loss, the RFC’s only recovery path is manual factory-reset-and-re-pair of each affected miner. There is no backup signer, escrowed recovery flow, or pre-authorized transfer path once the original host is gone.
  • Impact: A disk failure, accidental key deletion, or host loss can strand an entire site’s Proto miners behind a manual per-device recovery process, creating a severe availability and operational recovery problem.
  • Recommendation: Define a disaster-recovery mechanism before shipping this model: tested key backup/restore, controlled escrow, or a secondary trusted signer/transfer path that can recover devices without factory resets.

Notes

  • The diff is documentation-only under docs/rfcs/; there are no runtime code, migration, protobuf, or deployment artifact changes in scope for this review.
  • I did not score the spool-overflow/data-loss tradeoff or the local health-page exposure as formal findings because the RFC already treats those areas as open design detail, but they should be specified tightly before implementation.

Generated by Codex Security Review | Triggered by: @ankitgoswami | Review workflow run

@ankitgoswami ankitgoswami force-pushed the ankitg/rfc-0001-agent-server-split branch 3 times, most recently from 69e41b0 to 3b65048 May 4, 2026 16:24
…plit)

Adds a docs/rfcs/ directory with a README and template for the RFC
process, plus the inaugural RFC 0001 proposing splitting fleetd into
a thin on-prem agent and a cloud-deployable server, with a phased
rollout that preserves today's combined-mode docker-compose deployment
throughout. Includes the per-org symmetric encryption design for
pool/miner credentials and the per-agent ed25519 design for Proto
miner JWT signing, with the trade-offs explicitly documented.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/rfc-0001-agent-server-split branch from 3b65048 to 9849f6e May 4, 2026 16:28
@ankitgoswami ankitgoswami marked this pull request as ready for review May 4, 2026 16:34
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 4, 2026 16:34
Copilot AI review requested due to automatic review settings May 4, 2026 16:34
Contributor

Copilot AI left a comment

Pull request overview

Introduces an RFC process under docs/rfcs/ and adds the first design RFC (RFC 0001) proposing a future split between an on-prem agent and a cloud-deployable server mode for proto-fleet.

Changes:

  • Add docs/rfcs/README.md describing when to write RFCs, lifecycle states, numbering, and format.
  • Add docs/rfcs/_template.md to standardize RFC structure and metadata.
  • Add docs/rfcs/0001-agent-server-split.md as a draft architectural proposal for an agent + server split with auth/credential model and phased rollout.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File | Description
docs/rfcs/README.md | Defines the RFC process, lifecycle, numbering, and usage guidance.
docs/rfcs/0001-agent-server-split.md | Draft RFC describing the agent/server split, security model, and rollout phases.
docs/rfcs/_template.md | Provides a standard RFC template for consistent structure and metadata.

Comment thread docs/rfcs/0001-agent-server-split.md Outdated
Comment thread docs/rfcs/0001-agent-server-split.md
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ankit Goswami <ankit.goswami@gmail.com>
Comment thread docs/rfcs/0001-agent-server-split.md
Collaborator

@mcharles-square mcharles-square left a comment

Generally looks good

Comment thread docs/rfcs/0001-agent-server-split.md
Comment thread docs/rfcs/0001-agent-server-split.md
Comment thread docs/rfcs/0001-agent-server-split.md
ankitgoswami added a commit that referenced this pull request May 5, 2026
Lands Phase 1 of RFC 0001 (the agent + cloud-server split): the wire
protocol scaffold, agent identity schema, and a stub handler that
returns CodeUnimplemented for every RPC. No behavior change in any
deployment shape (combined / server / agent).

Wire protocol (proto/agentgateway/v1/agentgateway.proto):
  Register, BeginAuthHandshake, CompleteAuthHandshake, UploadTelemetry,
  UploadEvents, UploadHeartbeat, ControlStream. Post-handshake RPCs
  derive agent identity from a session_token in Authorization metadata
  rather than any body field. RegisterRequest carries an
  operator-issued, org-scoped enrollment_token. buf.validate rules pin
  ed25519 key/signature lengths, bound api_key/session_token/name,
  cap opaque payloads at 1 MiB, require timestamps, and require a
  populated ControlStream oneof variant.
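
As a rough illustration of deriving agent identity from Authorization metadata rather than a body field, the Go sketch below extracts a session token from a header value. It assumes a `Bearer` scheme, which this commit message does not state; the function name is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// sessionTokenFromAuthorization extracts the session token from the value of
// an Authorization header. The "Bearer " scheme is an assumption.
func sessionTokenFromAuthorization(value string) (string, error) {
	const prefix = "Bearer "
	if len(value) < len(prefix) || !strings.EqualFold(value[:len(prefix)], prefix) {
		return "", errors.New("missing bearer session token")
	}
	tok := strings.TrimSpace(value[len(prefix):])
	if tok == "" {
		return "", errors.New("empty session token")
	}
	return tok, nil
}

func main() {
	tok, err := sessionTokenFromAuthorization("Bearer abc123")
	fmt.Println(tok, err)

	_, err = sessionTokenFromAuthorization("Basic abc123")
	fmt.Println(err)
}
```

A handler built this way never trusts an agent ID supplied in the request body; the token is resolved server-side to the agent it was issued to.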

Handler (server/internal/handlers/agentgateway):
  Embeds UnimplementedAgentGatewayServiceHandler. Registered on the
  shared mux and added to grpcreflect. All seven RPCs are in
  UnauthenticatedProcedures because the user-session AuthInterceptor
  cannot validate the agent's session_token; the handler is responsible
  for credential validation when implemented.

Logging redaction (server/internal/handlers/interceptors):
  Handshake request/response procedures added to redaction lists;
  ControlStream, UploadTelemetry, UploadEvents added to
  SensitiveBodyProcedures. The streaming logger now suppresses
  per-message bodies for sensitive procedures.

Schema (server/migrations/000039_create_agent_tables):
  agent_device is the single source of truth for device-to-agent
  ownership; combined mode is the absence of a row. agent_device
  carries (agent_id, device_id, org_id) with composite FKs to both
  agent(id, org_id) and device(id, org_id), so cross-tenant pairings
  are rejected by the DB. agent identity_pubkey and (org_id, name)
  uniqueness are partial indexes scoped to deleted_at IS NULL, so a
  soft-deleted agent does not block re-enrollment. created_at and
  updated_at are NOT NULL.

--mode flag (server/cmd/fleetd/config.go):
  Accepted via kong's enum validator (server/agent/combined, default
  combined). Not yet load-bearing.
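
The commit relies on kong's enum validator for this. The standard-library Go sketch below only illustrates the equivalent check (three accepted values, `combined` as the default) and is not the code in config.go; the `Mode` type and `parseMode` function are hypothetical names.

```go
package main

import "fmt"

// Mode is the deployment shape selected by --mode.
type Mode string

const (
	ModeServer   Mode = "server"
	ModeAgent    Mode = "agent"
	ModeCombined Mode = "combined"
)

// parseMode validates a --mode value. An empty string falls back to the
// default, combined, which matches today's docker-compose deployment.
func parseMode(s string) (Mode, error) {
	if s == "" {
		return ModeCombined, nil
	}
	switch m := Mode(s); m {
	case ModeServer, ModeAgent, ModeCombined:
		return m, nil
	}
	return "", fmt.Errorf("invalid --mode %q: want server, agent, or combined", s)
}

func main() {
	for _, in := range []string{"", "agent", "hybrid"} {
		m, err := parseMode(in)
		fmt.Printf("input=%q mode=%q err=%v\n", in, m, err)
	}
}
```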

Closes #157
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.
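
The "shown once, only the SHA-256 hash is stored" rule for bootstrap codes can be sketched in Go as follows. The code length, the base64url encoding, and the function names are assumptions for illustration, not details from the migration.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// newEnrollmentCode returns a random bootstrap code and its SHA-256 hash.
// Only the hash would be persisted; the plaintext is shown to the operator once.
func newEnrollmentCode() (string, [32]byte, error) {
	raw := make([]byte, 32) // assumed length: 256 bits of entropy
	if _, err := rand.Read(raw); err != nil {
		return "", [32]byte{}, err
	}
	plaintext := base64.RawURLEncoding.EncodeToString(raw)
	return plaintext, sha256.Sum256([]byte(plaintext)), nil
}

// codeMatches hashes the presented code and compares it to the stored hash in
// constant time.
func codeMatches(presented string, stored [32]byte) bool {
	sum := sha256.Sum256([]byte(presented))
	return subtle.ConstantTimeCompare(sum[:], stored[:]) == 1
}

func main() {
	code, hash, err := newEnrollmentCode()
	if err != nil {
		panic(err)
	}
	fmt.Println("correct code matches:", codeMatches(code, hash))
	fmt.Println("tampered code matches:", codeMatches(code+"x", hash))
}
```

An unsalted hash is reasonable here only because the code is high-entropy random data rather than a human-chosen password.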

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.
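
The handshake described above (mint a one-shot challenge, atomically consume it, then verify the ed25519 signature) is sketched below in Go with an in-memory map standing in for the agent_auth_challenge table. The type and method names are hypothetical, and TTL expiry and the api_key check are omitted.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

// challengeStore is an in-memory stand-in for the challenge table.
type challengeStore struct {
	mu      sync.Mutex
	pending map[string][]byte // challenge ID -> nonce
}

func newChallengeStore() *challengeStore {
	return &challengeStore{pending: make(map[string][]byte)}
}

// begin mints a one-shot challenge and records it as pending.
func (s *challengeStore) begin() (string, []byte, error) {
	buf := make([]byte, 48)
	if _, err := rand.Read(buf); err != nil {
		return "", nil, err
	}
	id, nonce := hex.EncodeToString(buf[:16]), buf[16:]
	s.mu.Lock()
	s.pending[id] = nonce
	s.mu.Unlock()
	return id, nonce, nil
}

// complete removes the challenge before verifying, mirroring the effect of an
// atomic DELETE ... RETURNING: a second attempt finds nothing to consume.
func (s *challengeStore) complete(id string, pub ed25519.PublicKey, sig []byte) error {
	s.mu.Lock()
	nonce, ok := s.pending[id]
	delete(s.pending, id)
	s.mu.Unlock()
	if !ok {
		return errors.New("unknown or already-consumed challenge")
	}
	if len(pub) != ed25519.PublicKeySize || !ed25519.Verify(pub, nonce, sig) {
		return errors.New("signature does not match the enrolled identity key")
	}
	return nil
}

// newAgentKey and signChallenge play the agent's side of the exchange.
func newAgentKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func signChallenge(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}

func main() {
	store := newChallengeStore()
	pub, priv := newAgentKey()
	id, nonce, err := store.begin()
	if err != nil {
		panic(err)
	}
	sig := signChallenge(priv, nonce)
	fmt.Println("first attempt:", store.complete(id, pub, sig))
	fmt.Println("replay:", store.complete(id, pub, sig))
}
```

Consuming the challenge before verification means a failed signature also burns the challenge, so an attacker cannot retry guesses against the same nonce.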

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.
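
The split between the two interceptors amounts to a per-procedure routing decision, sketched below in Go. The `/agentgateway.v1.AgentGatewayService/` procedure prefix is assumed from the proto path, and the `authKind` type and `authFor` function are hypothetical names rather than identifiers from the codebase.

```go
package main

import "fmt"

// servicePrefix is an assumed Connect procedure prefix for the gateway service.
const servicePrefix = "/agentgateway.v1.AgentGatewayService/"

type authKind string

const (
	authNone  authKind = "none"          // bootstrap RPCs, validated by the handler
	authAgent authKind = "agent-session" // AgentAuthInterceptor
	authUser  authKind = "user-session"  // existing AuthInterceptor
)

var unauthenticatedProcedures = map[string]bool{
	servicePrefix + "Register":              true,
	servicePrefix + "BeginAuthHandshake":    true,
	servicePrefix + "CompleteAuthHandshake": true,
}

var agentAuthenticatedProcedures = map[string]bool{
	servicePrefix + "UploadTelemetry": true,
	servicePrefix + "UploadEvents":    true,
	servicePrefix + "UploadHeartbeat": true,
	servicePrefix + "ControlStream":   true,
}

// authFor decides which credential check applies to a procedure. Anything not
// listed falls through to the user-session check.
func authFor(procedure string) authKind {
	switch {
	case unauthenticatedProcedures[procedure]:
		return authNone
	case agentAuthenticatedProcedures[procedure]:
		return authAgent
	default:
		return authUser
	}
}

func main() {
	for _, p := range []string{
		servicePrefix + "Register",
		servicePrefix + "ControlStream",
		"/agentadmin.v1.AgentAdminService/ListAgents",
	} {
		fmt.Printf("%s -> %s\n", p, authFor(p))
	}
}
```

Defaulting unknown procedures to the user-session check keeps a newly added RPC from being exposed unauthenticated by omission.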

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Server foundation for issue #158. Lands the operator and agent paths of
the enrollment flow end-to-end against the database; the fleet-agent
CLI binary and the operator UI are deferred to follow-up PRs.

Schema (migration 000040):
- api_key gains agent support: user_id nullable, new agent_id, new
  subject_kind ('user'|'agent'), CHECK that exactly one of user_id /
  agent_id is set. Existing rows back-fill to subject_kind='user'.
- New pending_enrollment table holds operator-issued bootstrap codes.
  State machine: PENDING -> AWAITING_CONFIRMATION -> CONFIRMED, with
  EXPIRED/CANCELLED terminal failure states. Plaintext is shown to the
  operator once and never persisted; only the SHA-256 hash is stored.
- New agent_auth_challenge table holds short-TTL handshake nonces;
  atomic DELETE ... RETURNING gives replay safety without a consumed_at.
- New agent_session table holds short-lived bearer tokens issued by
  CompleteAuthHandshake; mirrors the user-side session table.

Domain:
- agentenrollment.Service implements code lifecycle and the Register +
  Confirm transitions. Confirm flips pending_enrollment to CONFIRMED,
  marks agent.enrollment_status='CONFIRMED', and issues the agent's
  api_key via apikey.Service.CreateAgent.
- agentauth.Service implements the BeginHandshake / CompleteHandshake /
  ResolveSession state machine. BeginHandshake verifies the api_key,
  cross-checks the supplied identity_pubkey against the enrolled key,
  and mints a one-shot challenge. CompleteHandshake atomically consumes
  the challenge and verifies the ed25519 signature against the agent's
  identity key, minting a session_token on success.
- apikey.Service / store gain a CreateAgent path; existing user-key
  flows stay unchanged. Validate keeps a single signature; callers
  branch on SubjectKind.
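
The challenge-response step above can be sketched as follows. This is an in-memory stand-in, not the real agentauth.Service: the map replaces the agent_auth_challenge table, `consume` imitates the atomic DELETE ... RETURNING, and it assumes the agent signs the raw nonce (the actual signed payload is not specified here). `ChallengeStore`, `NewIdentity`, and `Sign` are hypothetical names.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"sync"
)

var (
	ErrUnknownChallenge = errors.New("challenge unknown or already consumed")
	ErrBadSignature     = errors.New("signature does not verify")
)

// ChallengeStore is an in-memory stand-in for the agent_auth_challenge table.
type ChallengeStore struct {
	mu      sync.Mutex
	pending map[string][]byte // challenge ID -> nonce
}

func NewChallengeStore() *ChallengeStore {
	return &ChallengeStore{pending: make(map[string][]byte)}
}

// Begin mints a one-shot challenge: a random ID plus a 32-byte nonce.
func (s *ChallengeStore) Begin() (id string, nonce []byte, err error) {
	raw := make([]byte, 48)
	if _, err = rand.Read(raw); err != nil {
		return "", nil, err
	}
	id, nonce = hex.EncodeToString(raw[:16]), raw[16:]
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[id] = nonce
	return id, nonce, nil
}

// consume removes and returns the nonce under one lock, the in-memory
// analogue of DELETE ... RETURNING: a second caller finds nothing.
func (s *ChallengeStore) consume(id string) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	nonce, ok := s.pending[id]
	delete(s.pending, id)
	return nonce, ok
}

// Complete consumes the challenge first, then verifies the signature
// against the enrolled identity key. A failed verification still burns
// the challenge, so it cannot be retried.
func (s *ChallengeStore) Complete(id string, enrolled ed25519.PublicKey, sig []byte) error {
	nonce, ok := s.consume(id)
	if !ok {
		return ErrUnknownChallenge
	}
	// ed25519.Verify panics on a wrong-sized key, so check the length first.
	if len(enrolled) != ed25519.PublicKeySize || !ed25519.Verify(enrolled, nonce, sig) {
		return ErrBadSignature
	}
	return nil
}

// NewIdentity generates an ed25519 keypair, as the agent would at enroll time.
func NewIdentity() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// Sign is the agent side of the handshake (assumed: sign the raw nonce).
func Sign(priv ed25519.PrivateKey, nonce []byte) []byte {
	return ed25519.Sign(priv, nonce)
}
```

Consuming before verifying is the design choice that matters: a replayed CompleteHandshake fails with ErrUnknownChallenge regardless of whether the signature was valid the first time.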

Auth context + interceptor:
- agentauth.Subject is the typed value placed on ctx by
  AgentAuthInterceptor (mirrors session.Info for users, via
  connectrpc.com/authn).
- AgentAuthInterceptor only fires on AgentAuthenticatedProcedures
  (Upload* and ControlStream); the user-session AuthInterceptor
  short-circuits those so the two interceptors don't fight.
- UnauthenticatedProcedures now contains only the bootstrap RPCs:
  Register / BeginAuthHandshake / CompleteAuthHandshake.
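
One way to picture the split between the two interceptors is a single classification function over the procedure name. This is a sketch under assumptions: the `/agentgateway.v1.AgentGatewayService/` prefix is a hypothetical Connect procedure path, and `AuthFor` and `AuthKind` do not exist in the codebase; only the three bootstrap RPC names, `Upload*`, and `ControlStream` come from this commit message.

```go
package main

import "strings"

// AuthKind says which interceptor, if any, is responsible for a procedure.
type AuthKind int

const (
	AuthNone  AuthKind = iota // bootstrap RPCs, no credentials yet
	AuthAgent                 // agent session token required
	AuthUser                  // user session required (the default)
)

// gatewayPrefix is a hypothetical procedure prefix for AgentGatewayService.
const gatewayPrefix = "/agentgateway.v1.AgentGatewayService/"

// unauthenticated lists the bootstrap RPCs named in the commit message.
var unauthenticated = map[string]bool{
	gatewayPrefix + "Register":              true,
	gatewayPrefix + "BeginAuthHandshake":    true,
	gatewayPrefix + "CompleteAuthHandshake": true,
}

// AuthFor classifies a procedure so that exactly one auth path handles it.
func AuthFor(procedure string) AuthKind {
	if unauthenticated[procedure] {
		return AuthNone
	}
	if strings.HasPrefix(procedure, gatewayPrefix) {
		method := strings.TrimPrefix(procedure, gatewayPrefix)
		if method == "ControlStream" || strings.HasPrefix(method, "Upload") {
			return AuthAgent
		}
	}
	return AuthUser
}
```

With a function like this, each interceptor can return early for procedures that belong to the other, which is the "short-circuit" behaviour described above.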

Handlers:
- AgentGatewayService.Register / BeginAuthHandshake /
  CompleteAuthHandshake stop returning Unimplemented and delegate to
  the new domain services.
- New AgentAdminService (proto/agentadmin/v1) gives operators
  CreateEnrollmentCode, ListAgents, ConfirmAgent. Authorized via the
  existing user-session AuthInterceptor; org_id resolved from session
  info.

Tests:
- Integration tests against a real timescaledb cover the happy path
  (create code -> register -> confirm -> handshake -> resolve
  session) plus the security cases called out in the AC: replay of a
  consumed code, expired code, replayed challenge, and identity_pubkey
  mismatch.

Out of scope (follow-up PRs):
- cmd/fleet-agent/ Go binary with `enroll` subcommand.
- Operator UI: Agents settings page + EnrollAgentModal.

Refs #158
Refs #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 6, 2026
Add the agent-side CLI that drives the bootstrap flow against the
already-merged enrollment + handshake server (PR #181). Supports three
subcommands: enroll, status, refresh.

enroll generates ed25519 identity + miner-signing keypairs, calls
Register against AgentGatewayService, prints the local fingerprint for
operator visual comparison, accepts the api_key the operator pastes back
after confirming in the UI, runs the BeginAuth/CompleteAuth handshake,
and persists everything to a 0600 YAML state file under
$XDG_STATE_HOME/fleet-agent (or ~/.local/state/fleet-agent by default).
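
For illustration, the state-directory lookup and 0600 write might look like the sketch below. The fallback path and file mode come from the description above; the 0700 directory mode, the SHA-256 hex fingerprint format, and the helper names (`StateDir`, `Fingerprint`, `WriteState`) are assumptions. Serialization to YAML is left to the caller, since the standard library has no YAML encoder.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

// StateDir resolves the agent's state directory: $XDG_STATE_HOME/fleet-agent
// when the variable is set, otherwise <home>/.local/state/fleet-agent.
// getenv is injected (pass os.Getenv in real use) so the lookup is testable.
func StateDir(getenv func(string) string, home string) string {
	if base := getenv("XDG_STATE_HOME"); base != "" {
		return filepath.Join(base, "fleet-agent")
	}
	return filepath.Join(home, ".local", "state", "fleet-agent")
}

// Fingerprint renders a public key for visual comparison by the operator.
// Assumed format: hex-encoded SHA-256 of the raw key bytes.
func Fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return hex.EncodeToString(sum[:])
}

// WriteState writes data to dir/name with mode 0600, creating dir (0700)
// if needed, and returns the full path.
func WriteState(dir, name string, data []byte) (string, error) {
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name)
	if err := os.WriteFile(path, data, 0o600); err != nil {
		return "", err
	}
	// os.WriteFile only applies the mode when it creates the file, so
	// tighten permissions explicitly in case the file already existed.
	return path, os.Chmod(path, 0o600)
}
```

The explicit chmod matters because the state file holds the api_key and private keys; a pre-existing file with looser permissions would otherwise keep them.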

refresh re-runs the handshake against the stored api_key. status reads
back the local state. Tests are wire-level: they spin up an httptest
server with the real connect handler against a fake AgentGateway that
verifies the signature with ed25519.Verify, mirroring the server.

Refs: #158, #145

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@flesher flesher self-requested a review May 6, 2026 23:29
@ankitgoswami ankitgoswami merged commit b1b6479 into main May 7, 2026
26 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/rfc-0001-agent-server-split branch May 7, 2026 16:59

Labels

documentation Improvements or additions to documentation
