feat(contrail): community DID provisioning on stock PDS #31
Open
tompscanlan wants to merge 22 commits into flo-bit:main
Lays the storage foundation for community provisioning:
- provision_attempts table tracking the 5-step provision flow per attempt
- community_sessions cache table for PDS session reuse across publishes
- CRUD on CommunityAdapter for both tables
- shared types: ProvisionStatus, ProvisionAttemptRow, CustodyMode
Network primitives for the provision orchestrator:
- ES256 service-auth JWT signer (com.atproto.server.createAccount lxm)
- pdsCreateAccount / pdsGetRecommendedDidCredentials / pdsActivateAccount
- PLC update-op helpers: buildUpdateOp, signUpdateOp, cidForOp (low-S sig normalization, DAG-CBOR encoding compatible with the live PLC parser)
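The low-S normalization mentioned in the PLC helpers can be sketched as below. This is an illustration only, not the PR's implementation: it assumes a 64-byte compact `r || s` signature and a key on P-256 (the ES256 curve), and the names `normalizeLowS`, `bytesToBigInt`, and `bigIntToBytes` are hypothetical. For a secp256k1 key the order constant would differ.

```typescript
// Order n of the NIST P-256 group (assumption: the signing key is on P-256).
const P256_ORDER = BigInt(
  "0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551",
);

// Interpret big-endian bytes as an unsigned integer.
function bytesToBigInt(bytes: Uint8Array): bigint {
  let value = BigInt(0);
  for (let i = 0; i < bytes.length; i++) {
    value = (value << BigInt(8)) | BigInt(bytes[i]);
  }
  return value;
}

// Encode an unsigned integer as fixed-length big-endian bytes.
function bigIntToBytes(value: bigint, length: number): Uint8Array {
  const out = new Uint8Array(length);
  let v = value;
  for (let i = length - 1; i >= 0; i--) {
    out[i] = Number(v & BigInt(0xff));
    v >>= BigInt(8);
  }
  return out;
}

// If s is in the upper half of the group order, replace it with n - s.
// (r, n - s) verifies for the same message, so only the encoding changes.
function normalizeLowS(
  sig: Uint8Array,
  order: bigint = P256_ORDER,
): Uint8Array {
  if (sig.length !== 64) throw new Error("expected a 64-byte r||s signature");
  const s = bytesToBigInt(sig.subarray(32));
  const halfOrder = order >> BigInt(1);
  if (s <= halfOrder) return sig; // already low-S, return unchanged
  const out = new Uint8Array(sig); // copy so the input is not mutated
  out.set(bigIntToBytes(order - s, 32), 32);
  return out;
}
```

Making the order a parameter keeps the sketch usable for either curve without hard-coding which one the rotation key uses.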
… e2e
The full managed-mode provision flow:
- ProvisionOrchestrator: 5-RPC sequence (genesis op → createAccount →
recommendedCreds → PLC update op → activateAccount), persisting status
after each step.
- ProvisionSweeper: orphan detection + PLC-log-based resume so a stuck
attempt can be picked up via resumeFromAccountCreated.
- /xrpc/{ns}.community.provision endpoint wiring the orchestrator into
Hono.
- Publishing path for provision-mode communities, with a per-community
community_sessions cache so bursts of writes share one PDS session.
- Live-devnet e2e test exercising the full flow (PDS+PLC+postgres).
Adds a custody mode where the caller supplies their own rotation public did:key and Contrail holds a SUBORDINATE rotation key as rotationKeys[1]. The caller retains highest-priority rotation authority on PLC and can recover the DID independently.
- New custody_mode column on provision_attempts (managed | self_sovereign)
- caller_rotation_did_key persisted so PLC update ops (initial + resume) preserve the caller at rotationKeys[0]
- Mints a revocable app password post-activation so Contrail keeps a publishing credential without holding the user's root password
- Live-devnet e2e contrast: caller's private d/x/y JWK never appears in any encrypted column; PLC log lastOp.rotationKeys[0] === caller's did:key
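To make the key ordering concrete, here is a rough sketch of an unsigned PLC update op with the caller at `rotationKeys[0]` and Contrail at `rotationKeys[1]`. The field layout follows the public did:plc operation format; the function name `buildUpdateOpSketch` and its input shape are hypothetical and are not the PR's `buildUpdateOp`.

```typescript
// Unsigned did:plc operation (field names follow the did:plc operation format).
interface UnsignedPlcOp {
  type: "plc_operation";
  rotationKeys: string[];
  verificationMethods: Record<string, string>;
  alsoKnownAs: string[];
  services: Record<string, { type: string; endpoint: string }>;
  prev: string | null;
}

// Hypothetical input shape for this sketch.
interface UpdateOpInput {
  callerRotationDidKey: string; // supplied by the caller
  contrailRotationDidKey: string; // Contrail's subordinate key
  signingDidKey: string; // repo signing key recommended by the PDS
  handle: string;
  pdsEndpoint: string;
  prevCid: string; // CID of the previous op in the PLC log
}

function buildUpdateOpSketch(input: UpdateOpInput): UnsignedPlcOp {
  for (const key of [input.callerRotationDidKey, input.contrailRotationDidKey]) {
    if (!key.startsWith("did:key:")) throw new Error(`not a did:key: ${key}`);
  }
  if (input.callerRotationDidKey === input.contrailRotationDidKey) {
    throw new Error("caller and Contrail rotation keys must differ");
  }
  return {
    type: "plc_operation",
    // Order matters: earlier keys outrank later ones on PLC.
    rotationKeys: [input.callerRotationDidKey, input.contrailRotationDidKey],
    verificationMethods: { atproto: input.signingDidKey },
    alsoKnownAs: [`at://${input.handle}`],
    services: {
      atproto_pds: {
        type: "AtprotoPersonalDataServer",
        endpoint: input.pdsEndpoint,
      },
    },
    prev: input.prevCid,
  };
}
```

Signing and CID computation are left out; the point is only that the caller's key is placed first every time an op is built.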
The sweeper auto-set status='orphaned' on cron, which was too aggressive once self-sovereign mode landed (the caller may not have committed the genesis op yet). Replaced by an operator-run `contrail reap` subcommand: explicit, audited, idempotent, and gated by --dry-run by default. Closes H1 (om-q0vq).
- pnpm-lock.yaml: workspace re-resolution from running builds during PDS provisioning work; pure additive (apps/atproto-starter deps).
- .gitignore: *.tsbuildinfo (TypeScript build cache, never want it tracked).
…ation in PLC
- M4: provision orchestrator pre-warms community_sessions with the JWTs createAccount returns, so the first publish doesn't waste a createSession round-trip. ensureSession's existing 30s skew + refresh + fallback paths cover token aging.
- H2 verification: e2e self-sovereign test now fetches the PLC log post-activation and asserts rotationKeys[0] === caller's did:key, locking in the H2 invariant against live devnet.
- E2e provision routing test: flipped publishesCreateSessionCount from === 1 to === 0, since both putRecords now hit the warm cache.
Closes om-q0vq follow-up M4. H2 (om-rpb1) now has unit + e2e coverage.
Caller-controlled pdsEndpoint flowed straight into orchestrator.provision with no validation, letting any auth'd caller mint PLC entries pointing at attacker-controlled PDSes signed by Contrail's rotation key. Add CommunityConfig.allowedPdsEndpoints. When set to a non-empty array, the route rejects pdsEndpoint values not in the list before any PLC op is signed. Undefined or empty → no restriction (back-compat). Operators running on a public/multi-tenant Contrail SHOULD set this. Closes om-iyym (M3 from adversarial review).
…s PLC
`getLastOpCid` expected `{cid: string}` from PLC's /log/last, but PLC
returns the bare signed op object (no envelope, no `cid` field). The
function threw "PLC log/last response missing cid" against any real PLC
deployment — both the reap CLI's `prev` lookup and the provision
orchestrator's resume-from-crash path were broken end-to-end. Existing
unit tests passed because the mocks reproduced the wrong shape.
Fix: fetch the op, compute its CID locally with `cidForOp`. PLC stores
its log entries' CIDs from the same canonical DAG-CBOR encoding, so the
locally-computed CID matches the entry visible at /log/audit.
Also extends `cidForOp` to accept `SignedTombstoneOp` directly (the
tombstone shape is a strict subset of update; the encoder already handled
it via an `as never` cast in reap.ts:146 — closes M6's type concern).
Tests:
* `tests/plc-log-last.test.ts` rewritten to mock the real PLC shape; CID
assertions cross-check against `cidForOp`.
* `tests/cli-reap.test.ts` mock updated to return the real op shape;
expected `prev` CID computed via `cidForOp(FAKE_LAST_OP)`.
* `apps/contrail-e2e/tests/reap-tombstone.test.ts` provisions a real DID
on devnet PLC, builds + signs + submits a tombstone, asserts PLC's
/log/last reports the same CID `cidForOp` computed locally.
Public-API delta: `cidForOp` now exported from package root (needed by
the e2e test; was already exported from `./core/community`).
Closes om-5690 (M6 from adversarial review). Caught a separate latent
bug in `getLastOpCid` along the way — same fix.
C1 — dynamic aud resolution via describeServer
Route now calls com.atproto.server.describeServer(body.pdsEndpoint) to
discover the target PDS DID and uses that as `aud` in the service-auth
JWT, instead of falling back to cfg.serviceDid (Contrail's own DID).
Without this fix, any Contrail with allowedPdsEndpoints != [single-pds]
would 401 at createAccount with BadJwtAudience. New helper
pdsDescribeServer() in pds.ts.
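A rough sketch of the C1 idea: take the `did` field from the target PDS's `describeServer` response and use it as `aud`. The function name `claimsForCreateAccount`, the `issuerDid` parameter (whichever DID signs the token), and the 60-second lifetime are assumptions for illustration; fetching and ES256 signing are omitted.

```typescript
// Claims carried by the service-auth JWT in this sketch.
interface ServiceAuthClaims {
  iss: string;
  aud: string;
  lxm: string;
  iat: number;
  exp: number;
}

// Build claims from an already-parsed describeServer response body.
function claimsForCreateAccount(
  described: unknown,
  issuerDid: string,
  nowSec: number,
  ttlSec: number = 60, // assumption: short-lived token
): ServiceAuthClaims {
  const did = (described as { did?: unknown } | null)?.did;
  if (typeof did !== "string" || !did.startsWith("did:")) {
    throw new Error("describeServer response has no usable did");
  }
  return {
    iss: issuerDid,
    aud: did, // the target PDS's DID, not Contrail's own service DID
    lxm: "com.atproto.server.createAccount",
    iat: nowSec,
    exp: nowSec + ttlSec,
  };
}
```

Failing closed when `did` is missing avoids silently falling back to a configured audience, which is the behavior this commit removes.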
C2 — communities.custody_mode migration
Adds ALTER TABLE communities ADD COLUMN custody_mode TEXT to MIGRATIONS.
CREATE TABLE IF NOT EXISTS short-circuits on existing pre-PR
communities tables, so the new column was never being added on upgrade
and the next provision INSERT would fail. Also runs migrations against
the community DB target when it differs from the main DB.
C3 — idempotent retry on attemptId (self-sovereign createAppPassword)
When createAppPassword fails post-activation in self-sovereign mode, the
DID is already minted on PLC and the PDS account is active. Previously
the route 502'd and discarded the user's root credentials, leaving an
unrecoverable state. Now:
- 502 body always includes attemptId so callers can retry
- provision() detects an existing row at status='activated' +
custody_mode='self_sovereign' + no encrypted_password and dispatches
to retryAppPasswordOnly()
- retry path uses pds.createSession (new optional PdsClient method)
with the user's password to obtain a fresh accessJwt, then re-runs
createAppPassword
No new column, no setup_complete flag, no follow-up endpoint — same
call shape, idempotent on attemptId.
C4 — `contrail reap` --dry-run defaults true
Spec required dry-run by default; previous code defaulted to false
(any operator running `reap --all-orphaned --yes` without --dry-run
would submit irrevocable PLC tombstones for every flagged row). Now:
- runReap defaults dryRun: true if unspecified (safety default)
- CLI exposes explicit --no-dry-run to actually act
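The safety default can be illustrated with a small sketch. Only the flag names `--dry-run` and `--no-dry-run` and the `dryRun` option come from this commit; `parseReapFlags`, `runReapSketch`, and the result shape are hypothetical.

```typescript
interface ReapOptions {
  dryRun?: boolean;
}

interface ReapResult {
  dryRun: boolean;
  planned: string[]; // DIDs that would be tombstoned
  submitted: string[]; // DIDs actually tombstoned
}

// Leave dryRun unset unless a flag is present, so the default is applied
// in one place. If both flags are given, the safe one wins (an assumption).
function parseReapFlags(argv: string[]): ReapOptions {
  if (argv.includes("--dry-run")) return { dryRun: true };
  if (argv.includes("--no-dry-run")) return { dryRun: false };
  return {};
}

function runReapSketch(
  stuckDids: string[],
  opts: ReapOptions,
  submitTombstone: (did: string) => void,
): ReapResult {
  const dryRun = opts.dryRun ?? true; // unspecified means dry run
  const result: ReapResult = { dryRun, planned: [...stuckDids], submitted: [] };
  if (dryRun) return result;
  for (const did of stuckDids) {
    submitTombstone(did); // irrevocable on a real PLC directory
    result.submitted.push(did);
  }
  return result;
}
```

With this shape, `--yes` alone never reaches `submitTombstone`; the operator has to opt out of the dry run explicitly.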
Tests
+5 new unit tests (one per Critical, including a retry-after-failure
scenario that asserts no PLC ops or createAccount calls re-issue).
Existing router/allowlist mocks updated to handle describeServer.
367 unit pass (was 357 + 5 new + 5 fixture updates).
39 e2e pass against live devnet (PDS:4000 + PLC:2582).
7 pre-existing FTS failures unchanged.
…n key
Removes the managed-mode branch from community.provision. Every provisioned DID now requires a caller-supplied rotation public did:key; Contrail holds a subordinate at rotationKeys[1] and mints a revocable app password for publishing. The user's account password is returned once and never persisted.
Cuts the CustodyMode discriminator, the custody_mode columns (communities + provision_attempts + orphaned_archive), and the corresponding ALTER TABLE migration. Reduces orchestrator branching and removes the key-custody question that a single AES-GCM master key without per-column AAD couldn't answer cleanly. If a managed flow is needed later it should ship with proper key infra and its own threat model.
Fallout from 19d3ff5 (drop managed custody mode, require caller rotation key): the e2e tests still constructed ProvisionInput without rotationKey and asserted on the removed custodyMode column.
- provision.test.ts: delete the managed-mode test (whole flow no longer expressible); drop the custodyMode === "self_sovereign" assertion in the sovereign test; add callerRotation to the XRPC route test.
- reap-tombstone.test.ts: import generateKeyPair, pass rotationKey to the orchestrator.
Verified against devnet: provision.test.ts (sovereign + XRPC route) and reap-tombstone.test.ts all green; full e2e suite has one pre-existing Jetstream-timing flake in ingest-roundtrip > rsvpsGoingCount, unrelated to this change.
19d3ff5 removed the custodyMode discriminator. The doc-comment on retryAppPasswordOnly still described the trigger as "status='activated' + custodyMode='self_sovereign' + no encrypted_password". The custody field is gone; the predicate is just "activated + no encrypted_password" now. Comment-only — no behavior change.
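The simplified predicate ("activated + no encrypted_password") can be sketched as a small dispatch function. Only that one branch is taken from the commit messages; the other branches, the `chooseProvisionAction` name, and the row shape are assumptions added so the example is complete.

```typescript
type ProvisionStatusSketch =
  | "keys_generated"
  | "genesis_submitted"
  | "account_created"
  | "did_doc_updated"
  | "activated";

// Hypothetical, trimmed-down view of a provision_attempts row.
interface AttemptRowSketch {
  attemptId: string;
  status: ProvisionStatusSketch;
  encryptedPassword: string | null;
}

type ProvisionAction =
  | "start_fresh"
  | "retry_app_password_only"
  | "already_complete"
  | "reject";

function chooseProvisionAction(
  row: AttemptRowSketch | undefined,
): ProvisionAction {
  if (row === undefined) return "start_fresh"; // no prior attempt for this id
  if (row.status === "activated") {
    // DID and account already exist; only the app-password mint may be missing.
    return row.encryptedPassword === null
      ? "retry_app_password_only"
      : "already_complete";
  }
  // Assumption: a half-finished attempt is not re-driven through this path.
  return "reject";
}
```

Because the decision depends only on the stored row, calling provision again with the same `attemptId` cannot re-issue PLC ops or `createAccount`.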
Chains provision (sovereign, caller-supplied rotation key) → grant publisher → publish event as the community → cross-account RSVP from a separate user PDS → indexed assertion against the in-process ingester. Asserts the caller's rotation key persists in PLC, the publisher (not the provisioner) writes the event, and the indexed event hydrates with rsvpsGoingCount=1 attributed to the community DID. Doubles as living documentation of the surface community.provision unlocks end-to-end.
- Add Provisioned to the Three modes list (caller-supplied rotation key, fresh PDS account, contrail uses an app password).
- Add community.provision and the contrail reap CLI to the XRPCs section.
- Add a neutral Choosing a mode section: two yes/no questions (existing DID? where to publish?) determine which mode applies.
- Rewrite the sovereignty bullet in What's not here to describe each mode's revoke/recovery mechanics in parallel, without coloring.
…ated assertion
Previously the resume path null-checked the row and its rotation key but not the row's status: a caller handing it an attemptId for an already-`activated` row (or one still at `keys_generated`/`genesis_submitted`/`did_doc_updated`) would re-run steps 4-5 and corrupt state. Refuse unless the row is exactly at `account_created`.
reap previously required `status='orphaned'`, but production code never sets that status, making the CLI unreachable without manual SQL. Fix by relaxing the precondition: reap now operates on any provision_attempts row that did not reach `activated`. Active communities are still protected (refusal is per-row).
The intermediate `mark-orphaned` step proposed in the PR body is dropped as redundant: `--dry-run` (default) plus the per-row confirmation prompt already provide the review affordance the extra state was meant to give.
Surface changes:
- CommunityAdapter.listOrphanedAttempts() -> listStuckAttempts()
- --all-orphaned flag -> --all-stuck
- Help text and prompts reworded to match
- The 'orphaned' value remains in the schema CHECK constraint (vestigial and harmless, since nothing writes it)
allowedPdsEndpoints used exact string equality against `body.pdsEndpoint`, so trailing slash, default port, scheme case, host case, and IDN encoding all bypassed an otherwise-correct allowlist. Add `normalizePdsEndpoint` (uses URL.origin to canonicalize), apply it both to the body and to each allowlist entry on the check, and reject unparseable URLs with 400. The normalized form is also written back to body.pdsEndpoint so all downstream call sites (describeServer, createAccount, stored pds_endpoint column) see one canonical value per PDS.
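A minimal sketch of the normalization and allowlist check described above, using the standard `URL` class. The name `normalizePdsEndpoint` appears in the commit, but this body, the `checkPdsEndpoint` helper, the http/https restriction, and the result shape are assumptions rather than the PR's code.

```typescript
// Canonicalize to the URL origin: lowercased scheme and host, default port
// dropped, IDN hosts converted to punycode, path and trailing slash removed.
function normalizePdsEndpointSketch(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // unparseable input
  }
  // Assumption: only http(s) endpoints make sense for a PDS.
  if (url.protocol !== "https:" && url.protocol !== "http:") return null;
  return url.origin;
}

type EndpointCheck =
  | { ok: true; endpoint: string }
  | { ok: false; reason: "unparseable" | "not_allowed" };

function checkPdsEndpoint(
  raw: string,
  allowlist: string[] | undefined,
): EndpointCheck {
  const endpoint = normalizePdsEndpointSketch(raw);
  if (endpoint === null) return { ok: false, reason: "unparseable" };
  // Undefined or empty allowlist means no restriction (back-compat).
  if (allowlist === undefined || allowlist.length === 0) {
    return { ok: true, endpoint };
  }
  // Normalize each entry too, so both sides compare in canonical form.
  const allowed = allowlist
    .map(normalizePdsEndpointSketch)
    .filter((entry): entry is string => entry !== null);
  return allowed.includes(endpoint)
    ? { ok: true, endpoint }
    : { ok: false, reason: "not_allowed" };
}
```

Returning the normalized endpoint lets the route pass one canonical value downstream instead of the raw request string.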
When createRecord/deleteRecord against a community PDS returned 401, the existing code surfaced UpstreamFailure 502 but left the bad session in the cache. Every subsequent publish then hit the same 401 permanently, because ensureSession's "is the cache fresh?" check only inspects the locally-stored accessExp, not whether the PDS actually still honors the token. Drop the cached session row when the PDS returns 401. The next request goes cold through ensureSession, which mints a new session from the stored app password -- or fails with a clear error if the password itself was revoked, which is the right outcome. No retry-in-place: the simpler fix is two extra lines per route, and the caller's natural retry covers the case.
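The gap between "locally fresh" and "still honored by the PDS" can be sketched with an in-memory cache. The class and function names here are hypothetical, and the real cache is a database table rather than a `Map`; only the 30s skew and the drop-on-401 rule come from the commit messages.

```typescript
interface CachedSession {
  accessJwt: string;
  accessExp: number; // unix seconds
}

class SessionCacheSketch {
  private rows = new Map<string, CachedSession>();
  private skewSec: number;

  constructor(skewSec: number = 30) {
    this.skewSec = skewSec;
  }

  put(communityDid: string, session: CachedSession): void {
    this.rows.set(communityDid, session);
  }

  // Freshness only looks at the stored expiry; it cannot know whether the
  // PDS has revoked the token in the meantime.
  getFresh(communityDid: string, nowSec: number): CachedSession | undefined {
    const row = this.rows.get(communityDid);
    if (row === undefined) return undefined;
    return row.accessExp - this.skewSec > nowSec ? row : undefined;
  }

  drop(communityDid: string): void {
    this.rows.delete(communityDid);
  }
}

// Call after every publish attempt: a 401 evicts the cached session so the
// next request goes cold and mints a new one. No retry happens here.
function handlePublishStatus(
  cache: SessionCacheSketch,
  communityDid: string,
  status: number,
): void {
  if (status === 401) cache.drop(communityDid);
}
```

Other upstream failures leave the cache alone, since they say nothing about whether the token is still valid.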
L5 dropped the mark-orphaned step entirely, leaving 'orphaned' in the schema CHECK constraint and the PROVISION_STATUSES TS union with no writer and no reader. Pre-release, no migration -- so just remove the value from both, retire the corresponding test fixture, and delete the tautological types-test (which only asserted the literal === itself). The provision-attempts test that exercised lastError persistence on a status transition now uses 'did_doc_updated' instead, which is a real status the orchestrator emits.
Covers the 20 commits since main that ship the provision community-creation mode: the community.provision XRPC route, the contrail reap CLI subcommand, the new community config block (masterKey, allowedPdsEndpoints, plcDirectory), the provision_attempts and community_credentials tables, and the sovereign-dominant custody model with contrail's subordinate rotation key under rotationKeys[1].
Removes code paths and test fixtures that no caller exercises, ahead of PR flo-bit#31 review. No behavior change for the in-use provision flow.
Drops:
- listProvisionAttemptsByStatus adapter method (no caller)
- PdsCreateSessionResult type + re-exports
- pdsCreateAppPassword 'privileged' option (always false in practice)
- resumeFromAccountCreated orchestrator method + status-guard suite
- runUpdateAndActivate private method (inlined into provision)
- jwkPubToDidKey helper (no longer reachable)
- PlcClient.getLastOpCid interface field (orchestrator never uses it; the reap CLI imports the function directly)
- caller_rotation_did_key column + ProvisionAttemptRow field
- tautological column-list assertions in schema.test.ts
Renames for clarity:
- archiveOrphanedAttempt -> archiveStuckAttempt
- provision_attempts_orphaned_archive -> provision_attempts_archive
Summary
Adds `<ns>.community.provision`, a third community-creation mode alongside `mint` and `adopt`. It produces a fresh `did:plc` on a stock `@atproto/pds`, with the caller keeping rotation custody and Contrail publishing through a revocable app password. No PDS fork, no PDS admin creds in Contrail.
How the three modes line up
`mint` and `provision` both produce a fresh DID, but they split on who holds the rotation key and how publishing happens: `mint` keeps Contrail in control of the signing key (no PDS, no publishing yet), while `provision` puts the caller on `rotationKeys[0]` and routes writes through an app password on a real PDS. `adopt` is the only mode that touches an existing identity.
The caller-rotates invariant is what makes `provision` safe to expose without giving Contrail PDS admin creds: if Contrail is unavailable, the caller can move the DID elsewhere on their own.
The plan is to rebase onto PR #30 once it lands and relocate into `packages/contrail-community/`. The new code is self-contained in `community/`, so the move is a path-update, not a refactor.
What's in
- … `createAccount` → `getRecommendedDidCredentials` → PLC update op (caller's rotation key at `[0]`, Contrail subordinate at `[1]`) → `activateAccount`. Status persisted per step.
- `provision_attempts`, `community_sessions`, `provision_attempts_archive` (rows that fail to make progress are archived).
- `pds.ts` / `plc.ts` / `service-auth.ts`: ES256 service-auth signer, PLC update + tombstone helpers (low-S, DAG-CBOR), CID computed locally to match the live PLC parser.
- … `createSession` fallback. Pre-warmed by the JWTs `createAccount` returns. 401s on publish drop the cached session so the next call mints fresh.
- `aud` resolved per-request via `describeServer(body.pdsEndpoint)`, so a single Contrail can mint communities on multiple PDSes without a config-pinned target DID.
- … `attemptId` for the post-activation app-password mint: a 5xx returns the `attemptId`, and re-calling provision with the same id picks up only at `createAppPassword`, with no PLC ops or `createAccount` re-issued.
- `allowedPdsEndpoints` with URL normalization (scheme/host case, trailing slash, default port, IDN) so allowlist checks aren't bypassed by trivial encoding differences.
- `contrail reap` CLI: operator-run, audited. Walks stuck `provision_attempts` directly (no separate "orphaned" status to manage), submits PLC tombstones, and archives. `--dry-run` defaults true; `--no-dry-run` is required to submit (irrevocable).
- … `listProvisionAttemptsByStatus`, `PdsCreateSessionResult`, the `privileged` flag on `createAppPassword`, `resumeFromAccountCreated` (stuck rows are archived, not resumed), `runUpdateAndActivate` (inlined), `jwkPubToDidKey`, `PlcClient.getLastOpCid`, and the `caller_rotation_did_key` column. Net −468 lines.
Tests added
- … `createAppPassword` failure-then-retry, `--dry-run` default, dynamic `aud` resolution via `describeServer`, 401-clears-session-cache on publish.
- … `createSession` across the burst); rotation-JWK-not-persisted invariant (caller's private rotation JWK never appears in any decrypted column); reap tombstone round-trip against live PLC; full community-provision lifecycle walkthrough.
Asks
- … `community.provision` auto-enroll on the configured record host as a 6th step (when running in-process), or stay strictly identity-only?
- … `tools.atmo.space.declaration` on the new community PDS during provision, so the owner can later rebind authority/record-host without rotation-key access? Or follow-up?
- … `{ns}.community.provision`: anything to change before this freezes?
Known limitations (still open)
- … `community.provision`.
Test plan
- `pnpm test` in `packages/contrail`
- … `apps/contrail-e2e`)