Group E2EE before discovery: safe one-shot lifecycle orchestration#2

Closed
chgaowei wants to merge 15 commits into main from feature/changshan/group-e2ee

@chgaowei chgaowei commented May 3, 2026

Summary

PR-B3 for hidden/test-only Group E2EE recovery before public discovery.

This CLI slice wires owner/admin same-device recovery orchestration around anp-mls and the hidden message-service P6 control plane. It keeps the user-facing feature gated and diagnostic/repair oriented; public discovery remains off.

Scope

  • Add recovery KeyPackage publishing and owner/admin group e2ee recover-member orchestration.
  • Run prepare -> hidden group.e2ee.recover_member -> finalize/abort without mutating P4 membership.
  • Preserve recovery KeyPackage metadata (purpose, group_did, device_id) through CLI/service boundaries.
  • Keep missed add/remove/leave recovery fail-closed and device-scoped MLS state visible through status/repair paths.
  • Ensure helper execution uses stdin/stdout JSON and does not place sensitive plaintext in argv.
  • Resolve review-blocking base drift from main; PR merge state is now clean after commit 0ca3b48.
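The prepare -> recover_member -> finalize/abort sequence in the scope list can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `mlsHelper` and `controlPlane` interfaces, their method names, and the fake used to trace call order are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// mlsHelper is a hypothetical view of the local anp-mls operations.
type mlsHelper interface {
	Prepare(groupDID, memberDID string) (pendingID string, commit []byte, err error)
	Finalize(pendingID string) error
	Abort(pendingID string) error
}

// controlPlane is a hypothetical view of the hidden service RPC.
type controlPlane interface {
	RecoverMember(groupDID, memberDID string, commit []byte) error
}

// recoverMember prepares a pending commit locally, submits it to the hidden
// control plane, and finalizes only after acceptance. A rejection aborts the
// pending commit. No membership call is made anywhere in this flow.
func recoverMember(h mlsHelper, s controlPlane, groupDID, memberDID string) error {
	pendingID, commit, err := h.Prepare(groupDID, memberDID)
	if err != nil {
		return fmt.Errorf("prepare: %w", err)
	}
	if err := s.RecoverMember(groupDID, memberDID, commit); err != nil {
		if abortErr := h.Abort(pendingID); abortErr != nil {
			return errors.Join(err, abortErr)
		}
		return fmt.Errorf("recover_member rejected, pending commit aborted: %w", err)
	}
	return h.Finalize(pendingID)
}

// fake implements both interfaces and records the call order.
type fake struct {
	calls  []string
	reject bool
}

func (f *fake) Prepare(_, _ string) (string, []byte, error) {
	f.calls = append(f.calls, "prepare")
	return "pending-1", []byte("opaque-commit"), nil
}
func (f *fake) Finalize(string) error { f.calls = append(f.calls, "finalize"); return nil }
func (f *fake) Abort(string) error    { f.calls = append(f.calls, "abort"); return nil }
func (f *fake) RecoverMember(_, _ string, _ []byte) error {
	f.calls = append(f.calls, "recover_member")
	if f.reject {
		return errors.New("service rejected recovery")
	}
	return nil
}

// trace runs one recovery against the fake and returns the call order.
func trace(reject bool) string {
	f := &fake{reject: reject}
	_ = recoverMember(f, f, "did:example:group", "did:example:bob")
	return strings.Join(f.calls, " -> ")
}

func main() {
	fmt.Println(trace(false))
	fmt.Println(trace(true))
}
```

The point of the shape is that the local MLS state only advances after the service has accepted the commit, so a rejected recovery leaves no half-applied epoch behind.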

Guardrails

  • Hidden/test-only: no public group-e2ee discovery claim.
  • No P4 membership mutation in recovery; recover-member is crypto-state repair only.
  • No multi-device support.
  • No k1 DID compatibility.
  • No cloud snapshot or External Commit.
  • No production claim for complete MLS group management.

Validation evidence

  • Local after base-sync merge:
    • go test ./internal/cli ./internal/message ./internal/runtime/listener ./internal/cmdmeta ./internal/doctor ./internal/anpsdk -> pass.
    • go test ./... && go vet ./... -> pass.
  • GitHub Actions on PR #2: unit-test checks green after the base-sync push.
  • Cross-repo focused system evidence:
    • Hidden/default discovery + flag-off smoke: 3 passed.
    • CLI recovery + negative focused E2E: 5 passed in 24.96s.
  • Final review/security evidence: no CRITICAL/HIGH blocker; discovery remains blocked until a separate public-discovery security gate.

Review status

Ready for reviewer attention as part of the four-PR PR-B3 set. This PR intentionally remains hidden/test-only and should not be used to announce production Group E2EE capability.

The secure direct CLI now consumes the published ANP Go SDK v0.8.7,
persists sent secure messages as E2EE, redacts secure status output,
auto-acks decrypted inbound init messages from polling/listener paths,
and flushes queued secure outbox items once peers confirm. This removes
the temporary workspace SDK replacement while keeping local repair and
retry commands available for restart recovery.

Constraint: P5 requires direct.send operation_id/message_id equality, target key-service binding, and no leakage of ratchet material
Constraint: Mainline builds must consume github.com/agent-network-protocol/anp/golang v0.8.7 without a committed workspace replace
Rejected: Keep replace => ../anp/anp/golang | breaks CI and release portability
Confidence: high
Scope-risk: moderate
Directive: Do not print raw p5-e2ee-sessions or reintroduce workspace-local ANP SDK replace in committed go.mod
Tested: go test ./...
Tested: go vet ./...
Tested: awiki-system-test tests_v2 with message-service v2: 85 passed, 8 skipped
Not-tested: Cross-device production relay outside the local system-test stack
The CLI needs a stable integration seam for P6 before real MLS is connected, so this introduces exec-based provider plumbing and diagnostic commands while documenting that the architecture draft remains contract-test only for this slice.

Constraint: awiki-cli must stay pure Go and must not depend on a resident MLS process.
Constraint: plaintext must be passed through stdin rather than argv.
Rejected: Wire group create/add/send into live OpenMLS now | would imply production E2EE before service and SDK contracts are proven.
Confidence: high
Scope-risk: moderate
Directive: Do not advertise real group E2EE from CLI commands until anp-mls uses real OpenMLS state and system tests cover the loop.
Tested: go test ./internal/message ./internal/cli; git diff --check
Not-tested: Real anp-mls installation discovery, transparent group send/decrypt happy path, packaged release binary inclusion
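The stdin-rather-than-argv constraint above can be illustrated with a small one-shot exec wrapper. This is a sketch under assumptions: `callHelper`, `encryptRequest`, and its JSON field names are hypothetical, and the demo uses `cat` as a stand-in helper (it echoes stdin to stdout) because the real anp-mls command names and payload shapes are not shown in this PR.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"os/exec"
	"time"
)

// callHelper runs a one-shot helper process, writes the JSON request to its
// stdin, and decodes a JSON response from its stdout. Sensitive fields travel
// only in the stdin payload, never in args, so they do not show up in process
// listings.
func callHelper(ctx context.Context, binary string, args []string, request, response any) error {
	payload, err := json.Marshal(request)
	if err != nil {
		return fmt.Errorf("encode request: %w", err)
	}
	cmd := exec.CommandContext(ctx, binary, args...)
	cmd.Stdin = bytes.NewReader(payload)
	var stdout bytes.Buffer
	cmd.Stdout = &stdout
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("run %s: %w", binary, err)
	}
	if err := json.Unmarshal(stdout.Bytes(), response); err != nil {
		return fmt.Errorf("decode response: %w", err)
	}
	return nil
}

// encryptRequest is a hypothetical request shape used only for this demo.
type encryptRequest struct {
	GroupDID  string `json:"group_did"`
	Plaintext string `json:"plaintext"`
}

// echoViaCat sends a request through `cat` and reads it back, which shows the
// stdin/stdout round trip without needing the real helper binary.
func echoViaCat(plaintext string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	var out encryptRequest
	err := callHelper(ctx, "cat", nil, encryptRequest{GroupDID: "did:example:group", Plaintext: plaintext}, &out)
	return out.Plaintext, err
}

func main() {
	got, err := echoViaCat("hello group")
	fmt.Println(got, err)
}
```

Because the process exits after each call, there is no resident MLS process to manage, which matches the one-shot constraint stated above.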
The CLI now treats group E2EE as a real anp-mls exec-backed flow instead of a dry diagnostic skeleton. The implementation keeps the CLI pure Go (no CGO), sends plaintext only through stdin to anp-mls, publishes opaque P6 payloads to message-service, and stores only derived group summaries in the business SQLite database.

Constraint: awiki-cli must not link Rust/OpenMLS or store MLS private material in its business DB

Constraint: System tests require AWIKI_ANP_MLS_BINARY/runtime path discovery before PATH fallback

Rejected: Keep group E2EE as contract-test-only CLI commands | real MLS lane requires create/add/send/decrypt orchestration from normal CLI surfaces

Confidence: medium

Scope-risk: moderate

Directive: Do not put application plaintext in anp-mls argv or message-service group.e2ee.send bodies

Tested: go test ./internal/message ./internal/doctor -count=1

Tested: go test ./internal/cli -run TestGroupDryRunPlansRenderStableContracts -count=1

Tested: CGO_ENABLED=0 go test ./... -run '^$' -count=1

Tested: go vet ./internal/message ./internal/doctor ./internal/cli

Not-tested: Full internal/cli suite; existing TestRuntimeValidationErrorsUseStableCodes timed out during remote upgrade/DID replace path
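The lookup order named in the constraint above (explicit `AWIKI_ANP_MLS_BINARY` first, `PATH` last) might look like the following sketch. The function name `resolveHelper` and its injected lookups are hypothetical, the intermediate runtime-path step is omitted, and no existence check is made on the override.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

const helperEnvVar = "AWIKI_ANP_MLS_BINARY"

// resolveHelper returns the helper path and the source that supplied it.
// The environment override wins; PATH is only a fallback. The lookups are
// injected so the precedence can be exercised without touching the real
// environment.
func resolveHelper(getenv func(string) string, lookPath func(string) (string, error)) (string, string, error) {
	if p := getenv(helperEnvVar); p != "" {
		return p, "env", nil
	}
	p, err := lookPath("anp-mls")
	if err != nil {
		return "", "", fmt.Errorf("anp-mls not found: set %s or add it to PATH: %w", helperEnvVar, err)
	}
	return p, "PATH", nil
}

func main() {
	path, source, err := resolveHelper(os.Getenv, exec.LookPath)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("using %s (from %s)\n", path, source)
}
```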
Task-8 focused E2E exposed that anp-mls may include device-scoped provider metadata in the generated KeyPackage result while message-service accepts a tighter public KeyPackage schema. The CLI now whitelists the service payload fields before signing and publishing, so device_id and any private provider-only fields stay local to anp-mls/CLI orchestration.

Constraint: message-service rejects unsupported body.group_key_package fields with RPC 1003

Constraint: MLS private/provider metadata must not leak into service storage

Rejected: Require message-service to accept all provider fields | unnecessarily broadens server storage contract and does not fix private-field leakage

Confidence: high

Scope-risk: narrow

Directive: Keep group.e2ee.publish_key_package body limited to service-supported public KeyPackage fields

Tested: go test ./internal/message -run 'TestBuildGroupE2EEPublishKeyPackageRPCParamsStripsProviderOnlyFields|TestBuildGroupE2EEAddRPCParamsIncludesConsumedKeyPackageID|TestBuildGroupE2EESendRPCParamsSendsOnlyOpaqueCipherObject|TestMLSExecProvider' -count=1

Tested: go test ./internal/message -count=1

Tested: go vet ./internal/message
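The whitelist step described in this commit can be sketched as an allow-list copy. Only `did_wba_binding` and `device_id` appear in this PR; the other field names (`key_package_id`, `cipher_suite`, `key_package`, `private_init_key`) are placeholders chosen for illustration, and the real service schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// publicKeyPackageFields is a hypothetical allow-list of fields the service
// accepts. Anything not listed here stays local.
var publicKeyPackageFields = []string{
	"key_package_id",
	"cipher_suite",
	"key_package",
	"did_wba_binding",
}

// sanitizeKeyPackage copies only allow-listed fields from the provider result.
// Using an allow-list rather than a deny-list means a new provider-only field
// is dropped by default instead of leaking by default.
func sanitizeKeyPackage(providerResult map[string]any) map[string]any {
	out := make(map[string]any, len(publicKeyPackageFields))
	for _, name := range publicKeyPackageFields {
		if v, ok := providerResult[name]; ok {
			out[name] = v
		}
	}
	return out
}

func main() {
	providerResult := map[string]any{
		"key_package_id":   "kp-1",
		"key_package":      "b64-opaque",
		"device_id":        "dev-1",  // provider metadata, not published here
		"private_init_key": "secret", // hypothetical private field
	}
	body, _ := json.Marshal(sanitizeKeyPackage(providerResult))
	fmt.Println(string(body))
}
```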
Doctor now verifies the anp-mls compatibility contract and reports MLS state health before reviewers try the group E2EE path. The release helper documents and stages the Rust binary without changing the pure-Go CLI boundary.

Constraint: awiki-cli must stay pure Go/no CGO and invoke anp-mls through stdin/stdout

Rejected: vendor Rust into the Go binary | would violate the pure-Go packaging boundary

Confidence: high

Scope-risk: narrow

Tested: go test ./internal/message ./internal/doctor ./internal/cli

Tested: go vet ./internal/message ./internal/doctor

Tested: scripts/release/build-anp-mls.sh --dry-run
The real MLS Alice/Bob loop publishes Bob's KeyPackage under a named device, while one-shot message reads previously tried only the default device state. The CLI now keeps anp-mls state agent/device-scoped, processes local welcome notices for stored identities, scans local device state during decrypt, and strips provider-local plaintext/OpenMLS fields before sending opaque P6 objects to the service.

Constraint: Go CLI must remain pure Go / no CGO and invoke anp-mls as a one-shot helper.

Constraint: Group E2EE remains hidden/test-only; this does not enable public discovery.

Rejected: Share one MLS SQLite DB across local identities | OpenMLS private KeyPackage state is not namespaced and Alice's add consumed Bob's local material.

Confidence: high

Scope-risk: moderate

Directive: Do not send provider-local OpenMLS fields or application plaintext in group.e2ee.send payloads.

Tested: go test ./internal/message ./internal/doctor ./internal/cli

Tested: go vet ./internal/message ./internal/doctor

Tested: focused awiki-system-test group E2EE local/negative target passed (2 passed).

Tested: root make local-test passed (84 passed, 11 skipped).
The CLI now drives the hidden Group E2EE loop through the P6 target/security matrix, ratchet-tree welcome replay, explicit send AAD metadata, and durable notice repair without introducing a background process or CGO. The change also strengthens doctor/install diagnostics around the one-shot anp-mls binary and scoped state.

Constraint: Group E2EE must remain hidden/test-only and must not imply public product support.
Constraint: anp-mls receives plaintext only over stdin/stdout JSON, never argv.
Rejected: Cache MLS epoch as P4 group_state_version | service state and MLS epochs are separate and must be bound independently.
Rejected: Implement broad group lifecycle commands | v1 scope is publish/create/add/welcome/send/decrypt plus repair diagnostics.
Confidence: high
Scope-risk: moderate
Directive: Keep help/discovery copy conservative until the service explicitly enables public discovery after security review.
Tested: go test ./internal/message ./internal/cli ./internal/cmdmeta ./internal/doctor -count=1; go vet ./internal/message ./internal/doctor; focused awiki-system-test CLI real-MLS loop 2 passed
Not-tested: Public beta packaging across all OS release artifacts

chgaowei commented May 3, 2026

Security review completed for this PR set (2026-05-03): BLOCK_DISCOVERY.

This PR should stay draft / hidden-test-only. Do not enable public discovery for anp.group.e2ee.v1 / group-e2ee in this PR set.

Key blockers before any separate discovery-enable PR:

  • DID WBA binding proof validation is currently shape-level; cryptographic proof verification and golden vectors are required.
  • Notice pull/fanout semantics need public-client hardening (missed live notification recovery, idempotent mark-delivered, observability).
  • Broad tests_v2 / root make local-test remains deferred and must be green or explicitly accepted before undrafting/merge.

Focused evidence remains:

  • real MLS CLI smoke: 2 passed in 19.52s
  • flag-off guard: 1 passed in 2.43s

chgaowei added 2 commits May 3, 2026 20:39
The CLI now signs anp-mls KeyPackage did_wba_binding objects with the active identity key before publication, keeping private MLS material in anp-mls while giving message-service a cryptographic ownership proof to verify.

Constraint: awiki-cli must remain pure Go/no CGO and Group E2EE discovery stays hidden.
Rejected: Trust provider-supplied binding proof | the provider cannot know the CLI identity key and would leave publish verification shape-only.
Confidence: high
Scope-risk: moderate
Directive: Do not pass plaintext or private key material through argv; keep signing in-process and anp-mls JSON over stdin/stdout.
Tested: cd awiki-cli && go test ./internal/anpsdk ./internal/message
Tested: focused Group E2EE system test after deslop: 2 passed in 23.19s
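In-process signing of a binding object could look like the sketch below. It assumes an Ed25519 identity key and a `proof` field holding a base64 signature over the JSON encoding of the remaining fields; the actual key type, proof format, and canonicalization used by the CLI are not shown in this PR, so every name here is hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// signBinding signs the JSON encoding of the binding and returns a copy with
// a "proof" field attached. json.Marshal sorts map keys, so the signed bytes
// are deterministic for a given map.
func signBinding(priv ed25519.PrivateKey, binding map[string]any) (map[string]any, error) {
	payload, err := json.Marshal(binding)
	if err != nil {
		return nil, err
	}
	signed := make(map[string]any, len(binding)+1)
	for k, v := range binding {
		signed[k] = v
	}
	signed["proof"] = base64.StdEncoding.EncodeToString(ed25519.Sign(priv, payload))
	return signed, nil
}

// verifyBinding re-encodes every field except "proof" and checks the signature.
func verifyBinding(pub ed25519.PublicKey, signed map[string]any) bool {
	proof, ok := signed["proof"].(string)
	if !ok {
		return false
	}
	sig, err := base64.StdEncoding.DecodeString(proof)
	if err != nil {
		return false
	}
	unsigned := make(map[string]any, len(signed))
	for k, v := range signed {
		if k != "proof" {
			unsigned[k] = v
		}
	}
	payload, err := json.Marshal(unsigned)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, payload, sig)
}

// demo signs a binding, verifies it, then tampers with it and verifies again.
func demo() (valid, tamperedValid bool) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false
	}
	signed, err := signBinding(priv, map[string]any{"agent_did": "did:example:alice", "key_package_id": "kp-1"})
	if err != nil {
		return false, false
	}
	valid = verifyBinding(pub, signed)
	signed["agent_did"] = "did:example:mallory"
	return valid, verifyBinding(pub, signed)
}

func main() {
	fmt.Println(demo())
}
```

The private key never leaves the process in this shape, which is the property the directive above asks for.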
PR-A needs awiki-cli to route E2EE membership exits through MLS pending commits without making same-epoch local-terminal self-leave look safe. The CLI now prepares remove commits, finalizes only after hidden service acceptance, repairs commit-delivery notices for one-shot clients, and blocks non-advancing self-leave before submitting any P6 leave mutation.

Constraint: Public discovery remains hidden/test-only for group E2EE.
Constraint: OpenMLS 0.8 self-leave can be local-terminal without advancing the group epoch.
Rejected: Fall back to public group.remove/group.leave for E2EE groups | would separate membership changes from cryptographic epoch changes.
Rejected: Submit same-epoch self-leave to group.e2ee.leave | service acceptance would rely on delivery suppression rather than MLS exclusion.
Confidence: high
Scope-risk: moderate
Directive: Do not enable E2EE self-leave until anp-mls exposes an epoch-advancing commit or a reviewed leave-request/remaining-member flow exists.
Tested: go test ./internal/message -run 'TestLeaveGroupE2EERejectsLocalTerminalSelfLeaveBeforeServiceSubmit|TestUnsupportedGroupE2EESelfLeaveReasonDetectsNonAdvancingEpoch' -v
Tested: go test ./internal/message ./internal/cli ./internal/cmdmeta
Tested: go test ./...
Tested: go vet ./...
Tested: git diff --check
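The guard that blocks a non-advancing self-leave before any service call can be sketched as below. The `leaveArtifact` type, its fields, and `submitLeave` are hypothetical; only the idea (refuse when the MLS epoch does not advance) comes from this commit.

```go
package main

import (
	"errors"
	"fmt"
)

// leaveArtifact is a hypothetical summary of what the helper reports for a
// prepared self-leave.
type leaveArtifact struct {
	EpochBefore uint64
	EpochAfter  uint64
}

// unsupportedSelfLeaveReason returns a non-empty reason when the leave does
// not advance the epoch, meaning remaining members would not be
// cryptographically separated from the leaver.
func unsupportedSelfLeaveReason(a leaveArtifact) string {
	if a.EpochAfter <= a.EpochBefore {
		return fmt.Sprintf("self-leave is local-terminal: epoch did not advance past %d", a.EpochBefore)
	}
	return ""
}

// submitLeave checks the guard first, so a blocked leave never reaches the
// service.
func submitLeave(a leaveArtifact, submit func() error) error {
	if reason := unsupportedSelfLeaveReason(a); reason != "" {
		return errors.New(reason)
	}
	return submit()
}

func main() {
	err := submitLeave(leaveArtifact{EpochBefore: 5, EpochAfter: 5}, func() error {
		fmt.Println("submitted")
		return nil
	})
	fmt.Println("result:", err)
}
```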
@chgaowei chgaowei changed the title from "Group E2EE Step B: keep CLI P6 orchestration hidden" to "Group E2EE before discovery: safe one-shot lifecycle orchestration" May 3, 2026
chgaowei added 2 commits May 3, 2026 23:30
PR-B1 requires one-shot CLI clients to avoid OpenMLS local-terminal self-leave artifacts. E2EE group leave now creates a hidden leave_request control-plane record, while owner/admin processing uses the existing epoch-advancing remove-member orchestration and carries the leave request id into the hidden remove payload.

Constraint: Public group E2EE discovery remains hidden/test-only during PR-B1

Constraint: Go CLI must remain pure Go and delegate MLS commits to anp-mls

Rejected: Submit group.e2ee.leave with a same-epoch local-terminal artifact | fails cryptographic exclusion semantics

Confidence: high

Scope-risk: moderate

Directive: Do not route E2EE group leave back to provider.LeaveGroup unless anp-mls can prove epoch-advancing remaining-member exclusion

Tested: go test -run 'TestBuildGroupE2EE|TestHTTPTransportGroupMethods|TestLeaveGroupE2EE|TestGroupDryRunPlans' ./internal/message ./internal/cli

Tested: go test ./...

Tested: go vet ./...
After an owner/admin remove is accepted, a one-shot CLI process may still hold a pending local MLS commit and initially encrypt at the previous epoch. Detect the service epoch-mismatch response, finalize any local pending commit reported by anp-mls status, repair notices, and retry encryption so remaining members can continue sending ciphertext without falling back to plaintext group messages.

Constraint: Group E2EE remains hidden/test-only; no multi-device, cloud snapshot, rejoin, or public discovery expansion.

Rejected: Fall back to group.base send on epoch mismatch | would store application plaintext in the service DB for E2EE groups.

Confidence: high

Scope-risk: moderate

Directive: Do not bypass group.e2ee.send for known E2EE groups; stale epochs must repair/fail closed, not downgrade.

Tested: go test -run 'Test.*GroupE2EE|Test.*Leave|TestGroup' ./internal/message ./internal/cli ./internal/cmdmeta

Tested: go vet ./...

Tested: live awiki-system-test lifecycle target passed: 3 passed in 26.50s

Tested: git diff --check
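The repair-and-retry behavior on an epoch mismatch can be sketched as a bounded loop with no plaintext fallback. The `e2eeSender` struct, the `errEpochMismatch` sentinel, and the single-retry limit are assumptions made for illustration; the real error code and retry policy are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errEpochMismatch is a hypothetical sentinel for the service's stale-epoch
// response.
var errEpochMismatch = errors.New("group.e2ee.send: epoch mismatch")

// e2eeSender bundles the three steps so they can be faked in the demo.
type e2eeSender struct {
	encrypt func() (epoch uint64, ciphertext []byte, err error)
	submit  func(epoch uint64, ciphertext []byte) error
	repair  func() error // finalize pending commit, replay notices
}

// sendEncrypted tries to send, repairs once on an epoch mismatch, and then
// fails closed. There is deliberately no branch that sends plaintext.
func sendEncrypted(s e2eeSender) error {
	const maxAttempts = 2
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		epoch, ct, err := s.encrypt()
		if err != nil {
			return err
		}
		lastErr = s.submit(epoch, ct)
		if lastErr == nil {
			return nil
		}
		if !errors.Is(lastErr, errEpochMismatch) {
			return lastErr
		}
		if attempt < maxAttempts {
			if err := s.repair(); err != nil {
				return fmt.Errorf("repair after epoch mismatch: %w", err)
			}
		}
	}
	return fmt.Errorf("still stale after repair, failing closed: %w", lastErr)
}

// demoRetry simulates a client one epoch behind the service.
func demoRetry() (attempts int, err error) {
	localEpoch, serviceEpoch := uint64(4), uint64(5)
	s := e2eeSender{
		encrypt: func() (uint64, []byte, error) {
			attempts++
			return localEpoch, []byte("opaque"), nil
		},
		submit: func(epoch uint64, _ []byte) error {
			if epoch != serviceEpoch {
				return errEpochMismatch
			}
			return nil
		},
		repair: func() error { localEpoch = serviceEpoch; return nil },
	}
	err = sendEncrypted(s)
	return attempts, err
}

func main() {
	fmt.Println(demoRetry())
}
```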

chgaowei commented May 4, 2026

PR-B1 update: safe Group E2EE leave-request before discovery

Scope added in this push:

  • Hidden/test-only PR-B1 safe leave-request flow only.
  • No public discovery enablement; anp.group.e2ee.v1 / group-e2ee remain hidden.
  • No multi-device, k1 compatibility, cloud snapshot, member update, rejoin, or External Commit.

Latest commits across the PR-B1 stack:

  • anp/anp: e9e0841, 3e2642a, 5bb6adb
  • awiki-cli: b71f55e, ba7f77f
  • message-service: 4304c5e, 99c82dd
  • awiki-system-test: 4a856b4, e119a9a, d800be5

Validation evidence:

  • Local env: awiki-system-test/manage_local_test_env.py check --with-message-v2 --use-local-anp passed.
  • Live lifecycle: AWIKI_GROUP_E2EE_CONTRACT_TEST=1 uv run pytest tests_v2/cli/test_awiki_cli_group_e2ee_lifecycle_local.py -q → 3 passed in 26.50s.
  • Discovery/flag-off: uv run python manage_local_test_env.py run-tests --keep-env --with-message-v2 --use-local-anp tests_v2/message_service/test_group_e2ee_contract.py tests_v2/message_service/test_group_e2ee_flag_off.py → 3 passed in 1.43s.
  • anp/anp: cargo fmt --check; cargo test group_e2ee_remove_member_prepares_pending_commit_then_finalize_advances_epoch -- --nocapture; git diff --check.
  • awiki-cli: focused group E2EE Go tests; go vet ./...; git diff --check.
  • awiki-system-test: focused ruff check / ruff format --check; git diff --check.

Notes:

  • Live local Postgres message-service test databases had stale migration checksums from prior runs; they were reset (message_service_a/b) before rerun.
  • The final lifecycle run proved the remaining-member post-remove send stays on group.e2ee.send and does not fall back to plaintext group.base delivery.

One-shot CLI recovery now compares local MLS state with the hidden service
crypto head, safely finalizes accepted pending commits, replays durable
welcome/commit notices, treats duplicate already-applied commits as
delivered, and fails closed with needs_snapshot_or_readd when continuity
cannot be proven. Status/send paths scan agent/device-scoped MLS state so
members added through non-default KeyPackages can repair missed add commits
and resume encrypted sends without a resident process.

Constraint: PR-B2 is recovery hardening only; group E2EE remains hidden/test-only with no public discovery, multi-device, k1 compatibility, cloud snapshot, External Commit, or rejoin scope.
Rejected: Always encrypt/process repair on the default device | KeyPackage-published members store MLS state under their actual device id and would be stranded after welcome repair.
Rejected: Finalize any local pending commit during repair | local commits are finalized only when the service head proves the target epoch was accepted.
Confidence: high
Scope-risk: moderate
Directive: Keep recovery fail-closed on missing notice gaps until a separately reviewed snapshot/re-add protocol exists.
Tested: go test ./internal/message ./internal/cli ./internal/cmdmeta
Tested: AWIKI_GROUP_E2EE_CONTRACT_TEST=1 uv run python manage_local_test_env.py run-tests --with-message-v2 --use-local-anp tests_v2/cli/test_awiki_cli_group_e2ee_recovery_local.py
Not-tested: Full root make local-test.
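The recovery decision described in this commit (compare local state with the service head, finalize a pending commit only when the service proves it was accepted, fail closed on gaps) might be planned along these lines. This is a sketch: `localState`, `planRepair`, the notice map keyed by resulting epoch, and the step labels are hypothetical, and only `needs_snapshot_or_readd` is a name taken from the PR. Handling of duplicate already-applied notices and of a superseded pending commit is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// localState is a hypothetical summary of what the helper's status reports.
type localState struct {
	Epoch           uint64
	PendingCommitID string // empty when no local commit is pending
}

// planRepair walks every epoch between the local state and the service head.
// notices maps a resulting epoch to the commit ID of the durable notice that
// produced it. The whole plan is built before anything is applied, and any
// missing notice makes the plan fail closed with no steps.
func planRepair(local localState, headEpoch uint64, notices map[uint64]string) ([]string, string) {
	if local.Epoch > headEpoch {
		return nil, "needs_snapshot_or_readd"
	}
	var steps []string
	for epoch := local.Epoch + 1; epoch <= headEpoch; epoch++ {
		commitID, ok := notices[epoch]
		if !ok {
			return nil, "needs_snapshot_or_readd"
		}
		if local.PendingCommitID != "" && commitID == local.PendingCommitID {
			// The service recorded our own pending commit: safe to finalize.
			steps = append(steps, fmt.Sprintf("finalize_pending@%d", epoch))
		} else {
			steps = append(steps, fmt.Sprintf("apply_notice@%d", epoch))
		}
	}
	return steps, "in_sync"
}

// describe renders a plan as one line for display and testing.
func describe(local localState, headEpoch uint64, notices map[uint64]string) string {
	steps, status := planRepair(local, headEpoch, notices)
	return strings.Join(steps, ",") + " => " + status
}

func main() {
	local := localState{Epoch: 3, PendingCommitID: "c4"}
	fmt.Println(describe(local, 5, map[uint64]string{4: "c4", 5: "c5"}))
	fmt.Println(describe(local, 5, map[uint64]string{4: "c4"}))
}
```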

chgaowei commented May 4, 2026

PR-B2 update: missed add-commit recovery before discovery

Scope added in the latest push:

  • Hidden/test-only PR-B2 recovery/repair hardening only.
  • No public discovery enablement; anp.group.e2ee.v1 / group-e2ee remain hidden.
  • No multi-device, k1 compatibility, cloud snapshot, member update, rejoin, or External Commit.

What changed across the four-PR stack:

  • missed add commit recovery: one-shot clients can diagnose local-vs-service epoch lag and replay durable commit-delivery notices after being offline.
  • device-scoped MLS state: CLI status/repair/send scan agent/device-scoped anp-mls state instead of assuming only default, so KeyPackage-published devices can recover and send.
  • service durable commit notice: group.e2ee.add stores commit-delivery notices for existing active members while keeping message-service opaque-only; no MLS private state or plaintext is stored server-side.

Latest PR-B2 commits:

  • anp/anp: 8eefd04 — status reports local_epoch and inactive left/removed bindings correctly.
  • awiki-cli: debbe39 — repair/status/send recovery across service head, durable notices, and device-scoped MLS state.
  • message-service: a050f28 — durable commit-delivery fanout for existing active members on add.
  • awiki-system-test: 6968177 — Alice/Bob/Carol recovery system test.

Focused validation evidence:

  • go test ./internal/message ./internal/cli ./internal/cmdmeta → passed.
  • cargo fmt --check → passed.
  • cargo test -p im-group group_e2ee_add_commit_notice_carries_replay_artifact_for_existing_members -- --nocapture → passed.
  • cargo test group_e2ee_leave_prepares_and_finalize_marks_local_state_left -- --nocapture → passed.
  • uv run ruff check manage_local_test_env.py tests_v2/message_service/test_group_e2ee_contract.py tests_v2/cli/test_awiki_cli_group_e2ee_recovery_local.py → passed.
  • uv run pytest tests_v2/message_service/test_group_e2ee_contract.py::test_group_e2ee_real_mls_loop_is_covered_by_focused_cli_tests -q → 1 passed in 0.03s.
  • AWIKI_GROUP_E2EE_CONTRACT_TEST=1 uv run python manage_local_test_env.py run-tests --with-message-v2 --use-local-anp tests_v2/cli/test_awiki_cli_group_e2ee_recovery_local.py → 1 passed in 11.63s and local env stopped.

CI / broad smoke after push:

  • anp/anp GitHub checks: CodeQL + Rust Python Interop all passed on feature/changshan/group-e2ee.
  • awiki-cli GitHub Actions: unit-test passed on debbe39.
  • message-service / awiki-system-test: no GitHub checks are currently reported on this branch.
  • awiki-cli broad: go test ./... → passed.
  • message-service broad: cargo test -p im-group -- --nocapture → 36 passed; cargo check --workspace → passed.
  • message-service v2 system broad: uv run python manage_local_test_env.py run-tests --with-message-v2 --use-local-anp tests_v2/message_service → 14 passed in 10.59s.
  • full tests_v2 smoke: uv run python manage_local_test_env.py run-tests --suite message-v2 --with-message-v2 --use-local-anp tests_v2 → 85 passed, 14 skipped in 151.89s; local env stopped.

Discovery remains blocked by the existing security-review gate; this update is recovery plumbing, not public readiness.

chgaowei added 3 commits May 4, 2026 18:21
The CLI now exposes hidden/test-only same-device group E2EE recovery UX: recovery KeyPackage publication, recover-member orchestration through anp-mls prepare/finalize/abort, and needs_snapshot_or_readd recovery artifacts. The recovery path submits group.e2ee.recover_member and keeps P4 group.add out of the flow.

Constraint: Worker-3 scope is awiki-cli only after leader correction
Constraint: Recovery must stay hidden/test-only and must not mutate P4 membership
Rejected: Reuse group.add for recovery | violates PR-B3 P4/P6 separation
Confidence: medium
Scope-risk: moderate
Tested: gofmt on modified Go files
Tested: go test ./internal/message ./internal/cli ./internal/cmdmeta
Tested: go test ./...
Tested: go vet ./...
Not-tested: Live message-service/anp-mls PR-B3 end-to-end because sibling service lane is owned by other workers
PR-B3 recovery KeyPackages must reach message-service with purpose, group_did, and device_id intact, while normal KeyPackages must not send empty optional recovery fields. The recover-member result also stops echoing a group.add debug marker so hidden/test-only recovery cannot be mistaken for P4 membership mutation.

Constraint: Group E2EE stays hidden/test-only and recover-member must not mutate or imply P4 group.add.

Rejected: Keep group.add as a forbidden-method debug echo | focused E2E treats any public group.add marker as an overclaim and it is unnecessary for orchestration.

Confidence: high

Scope-risk: narrow

Directive: Do not drop purpose/group_did from recovery group_key_package payloads; service recovery lookup depends on them.

Tested: go test ./internal/message -run 'PublishKeyPackage|SanitizeGroupKeyPackage|RecoverMember|GroupE2EE'

Tested: go test ./internal/message ./internal/cli ./internal/cmdmeta

Tested: go test ./...

Tested: go vet ./...

Tested: awiki-system-test focused CLI PR-B3 E2E 5 passed with --with-message-v2 --use-local-anp

Not-tested: Public discovery enablement; intentionally out of scope.
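The requirement that recovery KeyPackages carry purpose, group_did, and device_id while normal KeyPackages send no empty optional fields maps naturally onto `omitempty` struct tags. In this sketch the struct name, the `key_package_id` field, and the `"recovery"` purpose value are assumptions; only the three optional field names come from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// groupKeyPackage is a hypothetical payload shape. The optional recovery
// fields use omitempty so they are absent, not empty strings, for normal
// KeyPackages.
type groupKeyPackage struct {
	KeyPackageID string `json:"key_package_id"`
	Purpose      string `json:"purpose,omitempty"`
	GroupDID     string `json:"group_did,omitempty"`
	DeviceID     string `json:"device_id,omitempty"`
}

// encode returns the JSON body for a KeyPackage.
func encode(kp groupKeyPackage) string {
	body, err := json.Marshal(kp)
	if err != nil {
		return ""
	}
	return string(body)
}

func main() {
	fmt.Println(encode(groupKeyPackage{KeyPackageID: "kp-1"}))
	fmt.Println(encode(groupKeyPackage{
		KeyPackageID: "kp-2",
		Purpose:      "recovery",
		GroupDID:     "did:example:group",
		DeviceID:     "dev-1",
	}))
}
```

A service with a strict schema can then distinguish "field not sent" from "field sent empty" without extra client-side branching.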
The PR branch had fallen behind main far enough for GitHub to mark the pull request DIRTY. This merge keeps the hidden/test-only group E2EE recovery work reviewable while preserving current main changes around handle completion, listener probes, and English skill references.

Constraint: Final PR confirmation must not claim merge readiness while GitHub reports conflicts

Rejected: Leave awiki-cli PR dirty and only document the blocker | it would fail the requested pre-merge confirmation

Confidence: medium

Scope-risk: moderate

Directive: Keep group E2EE discovery hidden; this merge only resolves base drift and must not enable public capability advertising

Tested: go test ./internal/cli ./internal/message ./internal/runtime/listener ./internal/cmdmeta ./internal/doctor ./internal/anpsdk

Not-tested: full cross-repo system test after base-drift merge
@chgaowei chgaowei marked this pull request as ready for review May 4, 2026 11:40

chgaowei commented May 4, 2026

Withdrawn for now per maintainer request. Keeping the branch for later follow-up; not merging and not deleting the branch.

@chgaowei chgaowei closed this May 4, 2026