feat(contrail): community DID provisioning on stock PDS #31

Open
tompscanlan wants to merge 22 commits into flo-bit:main from tompscanlan:feature/pds-provisioning

tompscanlan commented May 1, 2026

Summary

Adds <ns>.community.provision — a third community-creation mode alongside mint and adopt. It produces a fresh did:plc on a stock @atproto/pds, with the caller keeping rotation custody and Contrail publishing through a revocable app password. No PDS fork, no PDS admin creds in Contrail.

How the three modes line up

| | mint | adopt | provision (new) |
|---|---|---|---|
| DID origin | Contrail creates fresh did:plc | pre-existing | Contrail creates fresh did:plc |
| PDS account | none | pre-existing | created on a stock PDS |
| Rotation key holder | Contrail (+ caller's recovery key returned once) | caller | caller |
| Signing-key holder | Contrail | n/a (PDS holds it) | n/a (new PDS holds it) |
| How Contrail writes | signs records directly | app password | app password |
| Publishing supported? | no (identity-only) | yes | yes |

mint and provision both produce a fresh DID, but they split on who holds the rotation key and how publishing happens: mint keeps Contrail in control of the signing key (no PDS, no publishing yet), while provision puts the caller on rotationKeys[0] and routes writes through an app password on a real PDS. adopt is the only mode that touches an existing identity.

The caller-rotates invariant is what makes provision safe to expose without giving Contrail PDS admin creds: if Contrail is unavailable, the caller can move the DID elsewhere on their own.

Plan is to rebase onto PR #30 once it lands and relocate into packages/contrail-community/ — the new code is self-contained in community/ so the move is a path-update, not a refactor.

What's in

  • 5-RPC orchestrator: genesis → createAccount → getRecommendedDidCredentials → PLC update op (caller's rotation key at [0], Contrail subordinate at [1]) → activateAccount. Status persisted per step.
  • Schema: provision_attempts, community_sessions, provision_attempts_archive (rows that fail to make progress are archived).
  • pds.ts / plc.ts / service-auth.ts: ES256 service-auth signer, PLC update + tombstone helpers (low-S, DAG-CBOR), CID computed locally to match the live PLC parser.
  • Per-community session cache, refresh-on-near-expiry + createSession fallback. Pre-warmed by the JWTs createAccount returns. 401s on publish drop the cached session so the next call mints fresh.
  • Service-auth JWT aud resolved per-request via describeServer(body.pdsEndpoint), so a single Contrail can mint communities on multiple PDSes without a config-pinned target DID.
  • Idempotent retry on attemptId for the post-activation app-password mint: a 5xx returns the attemptId, and re-calling provision with the same id picks up only at createAppPassword — no PLC ops or createAccount re-issued.
  • allowedPdsEndpoints with URL normalization (scheme/host case, trailing slash, default port, IDN) so allowlist checks aren't bypassed by trivial encoding differences.
  • contrail reap CLI — operator-run, audited. Walks stuck provision_attempts directly (no separate "orphaned" status to manage), submits PLC tombstones, and archives. --dry-run defaults true; --no-dry-run is required to submit (irrevocable).
  • Surface-area cleanup as part of pre-review pass: dropped listProvisionAttemptsByStatus, PdsCreateSessionResult, the privileged flag on createAppPassword, resumeFromAccountCreated (stuck rows are archived, not resumed), runUpdateAndActivate (inlined), jwkPubToDidKey, PlcClient.getLastOpCid, and the caller_rotation_did_key column. Net −468 lines.
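The "status persisted per step" behaviour of the orchestrator can be sketched roughly as below. This is an illustrative outline, not the PR's code: the status names come from the commit messages in this PR, but `runProvision`, the `Step` shape, the synchronous stubs standing in for network calls, and the mapping of each RPC to a status are all assumptions.

```typescript
// Status names as they appear in this PR's commit messages.
type ProvisionStatus =
  | "keys_generated"
  | "genesis_submitted"
  | "account_created"
  | "did_doc_updated"
  | "activated";

interface AttemptRow {
  attemptId: string;
  status: ProvisionStatus;
  lastError?: string;
}

// Hypothetical step shape: `run` stands in for one RPC, `statusAfter` is the
// status recorded once it succeeds (assumed mapping, not confirmed by the PR).
interface Step {
  label: string;
  run: () => void;
  statusAfter?: ProvisionStatus;
}

function runProvision(
  row: AttemptRow,
  steps: Step[],
  persist: (row: AttemptRow) => void,
): AttemptRow {
  for (const step of steps) {
    try {
      step.run();
    } catch (err) {
      // Record why the attempt stopped; the status stays at the last completed step.
      row.lastError = `${step.label}: ${err instanceof Error ? err.message : String(err)}`;
      persist(row);
      return row;
    }
    if (step.statusAfter) {
      row.status = step.statusAfter;
      persist(row);
    }
  }
  return row;
}

// Example run where the PLC update op fails.
const saved: ProvisionStatus[] = [];
const attempt = runProvision(
  { attemptId: "attempt-1", status: "keys_generated" },
  [
    { label: "genesis", run: () => {}, statusAfter: "genesis_submitted" },
    { label: "createAccount", run: () => {}, statusAfter: "account_created" },
    { label: "getRecommendedDidCredentials", run: () => {} },
    {
      label: "plcUpdate",
      run: () => {
        throw new Error("PLC unavailable");
      },
      statusAfter: "did_doc_updated",
    },
    { label: "activateAccount", run: () => {}, statusAfter: "activated" },
  ],
  (r) => saved.push(r.status),
);
```

In this run the row stops at `account_created` with `lastError` set, which is the kind of row the `contrail reap` CLI would later treat as stuck.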

Tests added

  • Unit coverage for the orchestrator, service-auth signer (P1363 round-trip + low-S), PLC update + tombstone helpers, CID computation, session cache, allowlist enforcement (incl. URL-normalization edges), createAppPassword failure-then-retry, --dry-run default, dynamic aud resolution via describeServer, 401-clears-session-cache on publish.
  • Live-devnet e2e: full provision + 2 publishes against PDS:4000 + PLC:2582; cache-reuse assertion (1 createSession across the burst); rotation-JWK-not-persisted invariant (caller's private rotation JWK never appears in any decrypted column); reap tombstone round-trip against live PLC; full community-provision lifecycle walkthrough.

Asks

  1. Should community.provision auto-enroll on the configured record host as a 6th step (when running in-process), or stay strictly identity-only?
  2. Worth writing tools.atmo.space.declaration on the new community PDS during provision, so the owner can later rebind authority/record-host without rotation-key access? Or follow-up?
  3. Lexicon namespace {ns}.community.provision — anything to change before this freezes?

Known limitations (still open)

  • AES-GCM credential columns lack AAD; ciphertext from one column would decrypt in another.
  • No rate limit / quota on community.provision.
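To make the first limitation concrete: AES-GCM accepts additional authenticated data (AAD), and binding the column name as AAD makes a ciphertext fail authentication if it is moved to a different column. The sketch below uses Node's built-in `node:crypto` module; `seal`, `unseal`, and the second column name are hypothetical, and this is not what the PR currently does.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Sealed {
  iv: Buffer;
  tag: Buffer;
  data: Buffer;
}

// Encrypt `plaintext`, authenticating the column name alongside it.
function seal(key: Buffer, column: string, plaintext: string): Sealed {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  cipher.setAAD(Buffer.from(column, "utf8"));
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// Decrypt; throws if the key, ciphertext, tag, or column name does not match.
function unseal(key: Buffer, column: string, sealed: Sealed): string {
  const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
  decipher.setAAD(Buffer.from(column, "utf8"));
  decipher.setAuthTag(sealed.tag);
  return Buffer.concat([decipher.update(sealed.data), decipher.final()]).toString("utf8");
}

const masterKey = randomBytes(32);
const sealedPassword = seal(masterKey, "encrypted_password", "example-secret");

// Same column: decrypts normally.
const roundTrip = unseal(masterKey, "encrypted_password", sealedPassword);

// Different column (hypothetical name): authentication fails.
let crossColumnRejected = false;
try {
  unseal(masterKey, "some_other_column", sealedPassword);
} catch {
  crossColumnRejected = true;
}
```

Without the `setAAD` calls, the second `unseal` would succeed, which is the cross-column behaviour the limitation describes.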

Test plan

  • pnpm test in packages/contrail
  • e2e against live devnet (apps/contrail-e2e)
  • Rotation-JWK-not-persisted invariant verified e2e
  • Cache reuse verified e2e
  • Rebase + relocate after PR #30 ("another big refactor") merges

tompscanlan added 22 commits May 1, 2026 09:02
Lays the storage foundation for community provisioning:
- provision_attempts table tracking the 5-step provision flow per attempt
- community_sessions cache table for PDS session reuse across publishes
- CRUD on CommunityAdapter for both tables
- shared types: ProvisionStatus, ProvisionAttemptRow, CustodyMode

Network primitives for the provision orchestrator:
- ES256 service-auth JWT signer (com.atproto.server.createAccount lxm)
- pdsCreateAccount / pdsGetRecommendedDidCredentials / pdsActivateAccount
- PLC update-op helpers: buildUpdateOp, signUpdateOp, cidForOp (low-S sig
  normalization, DAG-CBOR encoding compatible with the live PLC parser)
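As a rough illustration of the low-S normalization mentioned above: an ECDSA signature (r, s) and (r, n − s) both verify, so implementations that require a canonical form keep whichever s is at most n/2. The sketch assumes a 64-byte P1363 (r‖s) signature over P-256, the curve behind ES256; `normalizeLowS` and the byte helpers are hypothetical names, not the PR's helpers.

```typescript
// Order n of the NIST P-256 group.
const P256_ORDER = BigInt(
  "0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551",
);
const P256_HALF_ORDER = P256_ORDER / BigInt(2); // bigint division truncates

// Big-endian bytes -> bigint.
function bytesToBigInt(bytes: Uint8Array): bigint {
  let value = BigInt(0);
  for (let i = 0; i < bytes.length; i++) {
    value = value * BigInt(256) + BigInt(bytes[i]);
  }
  return value;
}

// bigint -> fixed-length big-endian bytes.
function bigIntToBytes(value: bigint, length: number): Uint8Array {
  const out = new Uint8Array(length);
  let rest = value;
  for (let i = length - 1; i >= 0; i--) {
    out[i] = Number(rest % BigInt(256));
    rest = rest / BigInt(256);
  }
  return out;
}

// Return the signature unchanged if s is already low, else replace s with n - s.
function normalizeLowS(signature: Uint8Array): Uint8Array {
  if (signature.length !== 64) {
    throw new Error("expected a 64-byte r||s signature");
  }
  const s = bytesToBigInt(signature.slice(32));
  if (s <= P256_HALF_ORDER) return signature;
  const out = new Uint8Array(64);
  out.set(signature.slice(0, 32), 0);
  out.set(bigIntToBytes(P256_ORDER - s, 32), 32);
  return out;
}
```

The r half is copied through untouched; only s is ever rewritten.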
… e2e

The full managed-mode provision flow:
- ProvisionOrchestrator: 5-RPC sequence (genesis op → createAccount →
  recommendedCreds → PLC update op → activateAccount), persisting status
  after each step.
- ProvisionSweeper: orphan detection + PLC-log-based resume so a stuck
  attempt can be picked up via resumeFromAccountCreated.
- /xrpc/{ns}.community.provision endpoint wiring the orchestrator into
  Hono.
- Publishing path for provision-mode communities, with a per-community
  community_sessions cache so bursts of writes share one PDS session.
- Live-devnet e2e test exercising the full flow (PDS+PLC+postgres).

Adds a custody mode where the caller supplies their own rotation
public did:key and Contrail holds a SUBORDINATE rotation key as
rotationKeys[1]. The caller retains highest-priority rotation
authority on PLC and can recover the DID independently.

- New custody_mode column on provision_attempts (managed | self_sovereign)
- caller_rotation_did_key persisted so PLC update ops (initial + resume)
  preserve the caller at rotationKeys[0]
- Mints a revocable app password post-activation so Contrail keeps a
  publishing credential without holding the user's root password
- Live-devnet e2e contrast: caller's private d/x/y JWK never appears in
  any encrypted column; PLC log lastOp.rotationKeys[0] === caller's
  did:key

The sweeper auto-set status='orphaned' on cron: too aggressive once
self-sovereign mode landed (caller may not have committed the genesis
op yet). Replaced by an operator-run `contrail reap` subcommand:
explicit, audited, idempotent, and gated by --dry-run by default.

Closes H1 (om-q0vq).

- pnpm-lock.yaml: workspace re-resolution from running builds during
  PDS provisioning work; pure additive (apps/atproto-starter deps).
- .gitignore: *.tsbuildinfo — typescript build cache, never want it
  tracked.

…ation in PLC

- M4: provision orchestrator pre-warms community_sessions with the JWTs
  createAccount returns, so the first publish doesn't waste a createSession
  round-trip. ensureSession's existing 30s skew + refresh + fallback paths
  cover token aging.
- H2 verification: e2e self-sovereign test now fetches the PLC log post-
  activation and asserts rotationKeys[0] === caller's did:key, locking in
  the H2 invariant against live devnet.
- E2e provision routing test: flipped publishesCreateSessionCount from
  === 1 to === 0 — both putRecords now hit the warm cache.

Closes om-q0vq follow-up M4. H2 (om-rpb1) now has unit + e2e coverage.

Caller-controlled pdsEndpoint flowed straight into orchestrator.provision
with no validation, letting any auth'd caller mint PLC entries pointing
at attacker-controlled PDSes signed by Contrail's rotation key.

Add CommunityConfig.allowedPdsEndpoints. When set to a non-empty array,
the route rejects pdsEndpoint values not in the list before any PLC op
is signed. Undefined or empty → no restriction (back-compat). Operators
running on a public/multi-tenant Contrail SHOULD set this.

Closes om-iyym (M3 from adversarial review).

…s PLC

`getLastOpCid` expected `{cid: string}` from PLC's /log/last, but PLC
returns the bare signed op object (no envelope, no `cid` field). The
function threw "PLC log/last response missing cid" against any real PLC
deployment — both the reap CLI's `prev` lookup and the provision
orchestrator's resume-from-crash path were broken end-to-end. Existing
unit tests passed because the mocks reproduced the wrong shape.

Fix: fetch the op, compute its CID locally with `cidForOp`. PLC stores
its log entries' CIDs from the same canonical DAG-CBOR encoding, so the
locally-computed CID matches the entry visible at /log/audit.

Also extends `cidForOp` to accept `SignedTombstoneOp` directly (the
tombstone shape is a strict subset of update; the encoder already handled
it via an `as never` cast in reap.ts:146 — closes M6's type concern).

Tests:
  * `tests/plc-log-last.test.ts` rewritten to mock the real PLC shape; CID
    assertions cross-check against `cidForOp`.
  * `tests/cli-reap.test.ts` mock updated to return the real op shape;
    expected `prev` CID computed via `cidForOp(FAKE_LAST_OP)`.
  * `apps/contrail-e2e/tests/reap-tombstone.test.ts` provisions a real DID
    on devnet PLC, builds + signs + submits a tombstone, asserts PLC's
    /log/last reports the same CID `cidForOp` computed locally.

Public-API delta: `cidForOp` now exported from package root (needed by
the e2e test; was already exported from `./core/community`).

Closes om-5690 (M6 from adversarial review). Caught a separate latent
bug in `getLastOpCid` along the way — same fix.
C1 — dynamic aud resolution via describeServer
  Route now calls com.atproto.server.describeServer(body.pdsEndpoint) to
  discover the target PDS DID and uses that as `aud` in the service-auth
  JWT, instead of falling back to cfg.serviceDid (Contrail's own DID).
  Without this fix, any Contrail with allowedPdsEndpoints != [single-pds]
  would 401 at createAccount with BadJwtAudience. New helper
  pdsDescribeServer() in pds.ts.

C2 — communities.custody_mode migration
  Adds ALTER TABLE communities ADD COLUMN custody_mode TEXT to MIGRATIONS.
  CREATE TABLE IF NOT EXISTS short-circuits on existing pre-PR
  communities tables, so the new column was never being added on upgrade
  and the next provision INSERT would fail. Also runs migrations against
  the community DB target when it differs from the main DB.

C3 — idempotent retry on attemptId (self-sovereign createAppPassword)
  When createAppPassword fails post-activation in self-sovereign mode, the
  DID is already minted on PLC and the PDS account is active. Previously
  the route 502'd and discarded the user's root credentials, leaving an
  unrecoverable state. Now:
    - 502 body always includes attemptId so callers can retry
    - provision() detects an existing row at status='activated' +
      custody_mode='self_sovereign' + no encrypted_password and dispatches
      to retryAppPasswordOnly()
    - retry path uses pds.createSession (new optional PdsClient method)
      with the user's password to obtain a fresh accessJwt, then re-runs
      createAppPassword
  No new column, no setup_complete flag, no follow-up endpoint — same
  call shape, idempotent on attemptId.
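The dispatch described in C3 might look roughly like the sketch below. The predicate uses only status and a missing `encrypted_password`, since a later commit in this PR removes the custody field from it. `choosePath`, `rpcsFor`, and the `"other"` bucket are hypothetical; how the real `provision()` treats rows in any other state is not modelled here.

```typescript
interface AttemptRow {
  attemptId: string;
  status: string;
  encryptedPassword: string | null;
}

type ProvisionPath = "fresh" | "retry_app_password_only" | "other";

// Decide what a provision call should do, given any existing row for the attemptId.
function choosePath(existing: AttemptRow | undefined): ProvisionPath {
  if (!existing) return "fresh";
  if (existing.status === "activated" && existing.encryptedPassword === null) {
    return "retry_app_password_only";
  }
  return "other"; // not modelled in this sketch
}

// RPCs each path would issue. The retry path never touches PLC or createAccount.
function rpcsFor(path: ProvisionPath): string[] {
  switch (path) {
    case "fresh":
      return [
        "genesis",
        "createAccount",
        "getRecommendedDidCredentials",
        "plcUpdate",
        "activateAccount",
        "createAppPassword",
      ];
    case "retry_app_password_only":
      return ["createSession", "createAppPassword"];
    default:
      return [];
  }
}

const retryRpcs = rpcsFor(
  choosePath({ attemptId: "attempt-1", status: "activated", encryptedPassword: null }),
);
```

Because the decision depends only on the stored row, repeating the same call with the same attemptId is safe as long as the row is updated once the app password is saved.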

C4 — `contrail reap` --dry-run defaults true
  Spec required dry-run by default; previous code defaulted to false
  (any operator running `reap --all-orphaned --yes` without --dry-run
  would submit irrevocable PLC tombstones for every flagged row). Now:
    - runReap defaults dryRun: true if unspecified (safety default)
    - CLI exposes explicit --no-dry-run to actually act
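A minimal sketch of the safety default, assuming a plain argv array: dry-run stays on unless `--no-dry-run` is passed explicitly. `resolveDryRun` is a hypothetical helper; the real CLI's argument parsing may differ.

```typescript
// Dry-run is the default; only an explicit --no-dry-run turns it off.
// If both flags appear, the later one wins (an assumption of this sketch).
function resolveDryRun(argv: string[]): boolean {
  let dryRun = true;
  for (const arg of argv) {
    if (arg === "--no-dry-run") dryRun = false;
    else if (arg === "--dry-run") dryRun = true;
  }
  return dryRun;
}

// The invocation called out above no longer submits anything.
const accidental = resolveDryRun(["--all-orphaned", "--yes"]);
// An operator has to opt in to the irrevocable path.
const deliberate = resolveDryRun(["--all-orphaned", "--yes", "--no-dry-run"]);
```

Here `accidental` is `true` and `deliberate` is `false`.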

Tests
  +5 new unit tests (one per Critical, including a retry-after-failure
  scenario that asserts no PLC ops or createAccount calls re-issue).
  Existing router/allowlist mocks updated to handle describeServer.
  367 unit pass (was 357 + 5 new + 5 fixture updates).
  39 e2e pass against live devnet (PDS:4000 + PLC:2582).
  7 pre-existing FTS failures unchanged.

…n key

Removes the managed-mode branch from community.provision. Every provisioned
DID now requires a caller-supplied rotation public did:key; Contrail holds a
subordinate at rotationKeys[1] and mints a revocable app password for
publishing. The user's account password is returned once and never persisted.

Cuts the CustodyMode discriminator, the custody_mode columns
(communities + provision_attempts + orphaned_archive), and the corresponding
ALTER TABLE migration. Reduces orchestrator branching and removes the
key-custody question that a single AES-GCM master key without per-column AAD
couldn't answer cleanly. If a managed flow is needed later it should ship
with proper key infra and its own threat model.

Fallout from 19d3ff5 (drop managed custody mode, require caller
rotation key): the e2e tests still constructed ProvisionInput without
rotationKey and asserted on the removed custodyMode column.

- provision.test.ts: delete the managed-mode test (whole flow no
  longer expressible); drop the custodyMode === "self_sovereign"
  assertion in the sovereign test; add callerRotation to the XRPC
  route test.
- reap-tombstone.test.ts: import generateKeyPair, pass rotationKey
  to the orchestrator.

Verified against devnet:
  provision.test.ts (sovereign + XRPC route) and reap-tombstone.test.ts
  all green; full e2e suite has one pre-existing Jetstream-timing flake
  in ingest-roundtrip > rsvpsGoingCount, unrelated to this change.

19d3ff5 removed the custodyMode discriminator. The doc-comment on
retryAppPasswordOnly still described the trigger as
"status='activated' + custodyMode='self_sovereign' + no
encrypted_password". The custody field is gone; the predicate is
just "activated + no encrypted_password" now.

Comment-only; no behavior change.

Chains provision (sovereign, caller-supplied rotation key) → grant
publisher → publish event as the community → cross-account RSVP from
a separate user PDS → indexed assertion against the in-process
ingester. Asserts the caller's rotation key persists in PLC, the
publisher (not the provisioner) writes the event, and the indexed
event hydrates with rsvpsGoingCount=1 attributed to the community DID.

Doubles as living documentation of the surface community.provision
unlocks end-to-end.

- Add Provisioned to the Three modes list (caller-supplied rotation key,
  fresh PDS account, contrail uses an app password).
- Add community.provision and the contrail reap CLI to the XRPCs section.
- Add a neutral Choosing a mode section: two yes/no questions
  (existing DID? where to publish?) determine which mode applies.
- Rewrite the sovereignty bullet in What's not here to describe each
  mode's revoke/recovery mechanics in parallel, without coloring.

…ated assertion

Previously the resume path null-checked the row and its rotation key but
not the row's status — a caller handing it an attemptId for an already-
`activated` row (or one still at `keys_generated`/`genesis_submitted`/
`did_doc_updated`) would re-run steps 4-5 and corrupt state. Refuse
unless the row is exactly at `account_created`.

reap previously required `status='orphaned'`, but production code never
sets that status — making the CLI unreachable without manual SQL. Fix by
relaxing the precondition: reap now operates on any provision_attempts
row that did not reach `activated`. Active communities are still
protected (refusal is per-row).

The intermediate `mark-orphaned` step proposed in the PR body is dropped
as redundant — `--dry-run` (default) plus the per-row confirmation
prompt already provide the review affordance the extra state was meant
to give.

Surface changes:
- CommunityAdapter.listOrphanedAttempts() -> listStuckAttempts()
- --all-orphaned flag -> --all-stuck
- Help text and prompts reworded to match
- The 'orphaned' value remains in the schema CHECK constraint
  (vestigial; harmless, no writer)

allowedPdsEndpoints used exact string equality against `body.pdsEndpoint`,
so trailing slash, default port, scheme case, host case, and IDN encoding
all bypassed an otherwise-correct allowlist. Add `normalizePdsEndpoint`
(uses URL.origin to canonicalize), apply it both to the body and to each
allowlist entry on the check, and reject unparseable URLs with 400.

The normalized form is also written back to body.pdsEndpoint so all
downstream call sites (describeServer, createAccount, stored
pds_endpoint column) see one canonical value per PDS.
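The normalization and allowlist check can be sketched as follows. `URL.origin` lowercases the scheme and host, drops the default port, and discards the path, which covers the bypasses listed above. The function bodies, the http/https restriction, the result shape, and the choice to parse before consulting the allowlist are assumptions for illustration; only the name `normalizePdsEndpoint` and the "empty or undefined means unrestricted" rule come from this PR.

```typescript
// Canonicalize an endpoint to its origin, or return null if it cannot be parsed.
function normalizePdsEndpoint(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null;
  }
  // Assumption: only http(s) endpoints make sense for a PDS.
  if (url.protocol !== "https:" && url.protocol !== "http:") return null;
  return url.origin;
}

type EndpointCheck =
  | { ok: true; endpoint: string }
  | { ok: false; reason: "unparseable" | "not_allowed" };

function checkPdsEndpoint(raw: string, allowlist?: string[]): EndpointCheck {
  const endpoint = normalizePdsEndpoint(raw);
  if (endpoint === null) return { ok: false, reason: "unparseable" };
  // Undefined or empty allowlist means no restriction (back-compat).
  if (!allowlist || allowlist.length === 0) return { ok: true, endpoint };
  // Normalize each entry too, so operators can write entries loosely.
  for (const entry of allowlist) {
    if (normalizePdsEndpoint(entry) === endpoint) return { ok: true, endpoint };
  }
  return { ok: false, reason: "not_allowed" };
}

const allowed = checkPdsEndpoint("HTTPS://PDS.Example.COM:443/", ["https://pds.example.com"]);
const rejected = checkPdsEndpoint("https://evil.example", ["https://pds.example.com"]);
```

Returning the normalized endpoint from the check lets the caller write it back to the request body, so every downstream call sees the same canonical value.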
When createRecord/deleteRecord against a community PDS returned 401, the
existing code surfaced UpstreamFailure 502 but left the bad session in
the cache. Every subsequent publish then hit the same 401 permanently,
because ensureSession's "is the cache fresh?" check only inspects the
locally-stored accessExp, not whether the PDS actually still honors the
token.

Drop the cached session row when the PDS returns 401. The next request
goes cold through ensureSession, which mints a new session from the
stored app password -- or fails with a clear error if the password itself
was revoked, which is the right outcome.

No retry-in-place: the simpler fix is two extra lines per route, and the
caller's natural retry covers the case.
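The interaction between the freshness check and the 401 handling can be sketched like this. The 30-second skew and the `accessExp` field are mentioned in this PR; `decideSessionAction`, `SessionCache`, and the in-memory map standing in for the `community_sessions` table are hypothetical.

```typescript
interface CachedSession {
  accessJwt: string;
  refreshJwt: string;
  accessExp: number; // unix seconds, as stored locally
}

type SessionAction = "reuse" | "refresh" | "create";

const SKEW_SECONDS = 30;

// The freshness check only looks at local expiry; it cannot know whether the
// PDS still honors the token. That blind spot is why a 401 must evict the row.
function decideSessionAction(
  cached: CachedSession | undefined,
  nowSec: number,
): SessionAction {
  if (!cached) return "create"; // cold path: createSession from the app password
  return cached.accessExp - nowSec > SKEW_SECONDS ? "reuse" : "refresh";
}

class SessionCache {
  private rows = new Map<string, CachedSession>();

  get(communityDid: string): CachedSession | undefined {
    return this.rows.get(communityDid);
  }

  put(communityDid: string, session: CachedSession): void {
    this.rows.set(communityDid, session);
  }

  // Called with the HTTP status of a publish; a 401 drops the cached session.
  handlePublishStatus(communityDid: string, httpStatus: number): void {
    if (httpStatus === 401) this.rows.delete(communityDid);
  }
}

const cache = new SessionCache();
const communityDid = "did:plc:example";
cache.put(communityDid, { accessJwt: "a", refreshJwt: "r", accessExp: 1000 });

const beforeReject = decideSessionAction(cache.get(communityDid), 900);
cache.handlePublishStatus(communityDid, 401);
const afterReject = decideSessionAction(cache.get(communityDid), 900);
```

Before the 401 the locally fresh token is reused; after it, the next publish goes through the cold path instead of repeating the same rejected token.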
L5 dropped the mark-orphaned step entirely, leaving 'orphaned' in the
schema CHECK constraint and the PROVISION_STATUSES TS union with no
writer and no reader. Pre-release, no migration -- so just remove the
value from both, retire the corresponding test fixture, and delete the
tautological types-test (which only asserted the literal === itself).

The provision-attempts test that exercised lastError persistence on a
status transition now uses 'did_doc_updated' instead, which is a real
status the orchestrator emits.

Covers the 20 commits since main that ship the provision community-
creation mode: the community.provision XRPC route, the contrail reap
CLI subcommand, the new community config block (masterKey,
allowedPdsEndpoints, plcDirectory), the provision_attempts and
community_credentials tables, and the sovereign-dominant custody
model with contrail's subordinate rotation key under
rotationKeys[1].

Removes code paths and test fixtures that no caller exercises, ahead of
PR flo-bit#31 review. No behavior change for the in-use provision flow.

Drops:
  - listProvisionAttemptsByStatus adapter method (no caller)
  - PdsCreateSessionResult type + re-exports
  - pdsCreateAppPassword 'privileged' option (always false in practice)
  - resumeFromAccountCreated orchestrator method + status-guard suite
  - runUpdateAndActivate private method (inlined into provision)
  - jwkPubToDidKey helper (no longer reachable)
  - PlcClient.getLastOpCid interface field (orchestrator never uses it;
    the reap CLI imports the function directly)
  - caller_rotation_did_key column + ProvisionAttemptRow field
  - tautological column-list assertions in schema.test.ts

Renames for clarity:
  - archiveOrphanedAttempt -> archiveStuckAttempt
  - provision_attempts_orphaned_archive -> provision_attempts_archive
@tompscanlan tompscanlan marked this pull request as ready for review May 5, 2026 12:01