Skip to content

Challenge flow overhaul#125

Open
ilchu wants to merge 8 commits into
devfrom
ic/challenge-workflow-overhaul
Open

Challenge flow overhaul#125
ilchu wants to merge 8 commits into
devfrom
ic/challenge-workflow-overhaul

Conversation

@ilchu

@ilchu ilchu commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

The provider was where trust was supposed to land but the chain couldn't actually punish a provider that lied — respond_to_challenge slashed only on timeout, signed commitments could be replayed forever, and several pallet placeholders made the math wrong. This PR closes all of those.

Seven independently-reviewable commits, starting from a baseline test commit so the security-critical changes land into a covered area (the pallet had zero challenge tests before this).

Pallet — security

Slash on demonstrably-false response, not just on timeout. respond_to_challenge now adjudicates each variant into Ok(()) or Err(SlashReason). Invalid Proof (chunk-Merkle or MMR mismatch), forged Deleted admin signature, and bogus Superseded claim all slash immediately. Parameter-shape errors (stale nonce, non-admin signer) still bubble as DispatchError — those are caller mistakes, not lies. New SlashReason enum carried in the ChallengeSlashed event.

Replay protection on CommitmentPayload. New nonce: u64 field; CURRENT_VERSION bumped to 0x02 so V1-shaped signatures can't verify against the V2 layout. New T::MaxNonceAge runtime constant (24h on live runtimes), recency-checked on every signature path — checkpoint, extend_checkpoint, challenge_offchain, Deleted response arm. BucketSnapshot carries commitment_nonce so late-arriving extend_checkpoint signatures verify against the same payload the original signers signed.

Three placeholder fixes. challenge_offchain now takes leaf_count: u64 from the challenger rather than reconstructing with hardcoded 0 (which only worked if the provider also signed with 0). Replica.last_sync is now Option<ReplicaSyncRecord { mmr_root, start_seq, leaf_count, block }> populated from the bucket snapshot at confirm_replica_sync time; challenge_replica uses the stored start_seq instead of 0u64. Challenge deposit is now T::ChallengeDeposit (1 token live, was 100u32 ≈ 1e-10 of a token — effectively free spam).

Pallet — observability

ChallengerStats on-chain aggregate. New storage map ChallengerStats<AccountId, ChallengerStatRecord> with ValueQuery. Bumped at four sites: create_challenge increments total_challenges; defended path increments failed_challenges; slash_provider_for_failed_challenge (used by both timeout and invalid-response paths) increments successful_challenges and adds the 10%-of-slashed-stake reward to total_earnings.

Provider node

Wire ChallengeResponder into startup. The module was defined and exported but never instantiated in command.rs — entire path was dead code in production. New --enable-challenge-responder CLI flag alongside the other coordinators.

Real poll_challenges. Was Ok(vec![]). Now resolves the provider's 32-byte account, iterates StorageProvider::Challenges via subxt, manually decodes the SCALE-encoded Vec<Challenge>, and keeps only entries matching our account (preserving the original index for ChallengeId reconstruction). Cost is bounded by ChallengeTimeout — past-deadline entries are reaped in on_finalize.

API surface. /commit, /commitment, /checkpoint-signature, and /delete now take a caller-supplied nonce (the block the caller intends to submit at) and echo it in the response. Signed payloads use the real post-mutation leaf_count (was 0).

Client SDK

get_total_challenge_earnings and get_challenge_stats are real — both read ChallengerStats from chain. commit / get_commitment / get_checkpoint_signature take an explicit nonce: u64 so callers control what block the signature is bound to.

JS demos

examples/papi/{full-flow,bucket-with-storage,checkpoint-rewards,sc-flow,sc-coverage}.js snapshot System.Number and thread the value through uploadChunk, fetchCheckpointSignature, challengeOffchain, submitClientCheckpoint. upload.leafCount and upload.nonce from the /commit response are passed to challenge_offchain.

Tests

22 new challenge tests in the pallet (up from 0) covering baseline happy/timeout/no-snapshot cases, replay-rejection for challenge_offchain and checkpoint, immediate-slash for all three invalid response variants, ChallengerStats updates on each transition, and challenge_replica using the stored start_seq. New SCALE-decoder round-trip unit test in challenge_responder.rs. Provider-node integration tests updated for the new nonce / leaf_count surface. Client integration tests no longer hardcode "earnings = 0" — they assert the real total >= successful + failed invariant.

Migration

Breaking: CommitmentPayload v1 signatures (no nonce) no longer verify on this runtime. Acceptable for a pre-launch parachain — dev-chain state is expected to break. Extrinsic signatures have new args (nonce, leaf_count); JS clients, Rust SDK callers, and benchmarks are all updated.

ilchu added 7 commits June 3, 2026 21:17
Until now the storage-provider pallet had zero tests covering its
challenge workflow — `challenge_checkpoint`, `challenge_replica`,
`respond_to_challenge`, and the on_finalize timeout-slash hook were
all untested. That made the security-critical extrinsics the riskiest
area to change.

Add a `challenge_tests` module with 14 scenarios covering current
behavior:

- challenge_checkpoint creates challenge in storage / fails without
  snapshot / fails if provider not in primaries
- respond_to_challenge with valid Proof defends and applies the
  response-time-based cost split
- respond_to_challenge with bogus chunk or MMR proof currently errors
  with InvalidChallengeProof (commits to come will turn these into
  immediate-slash tests)
- respond_to_challenge with Superseded when canonical has advanced
- respond_to_challenge rejects nonexistent challenge / wrong provider
  / past deadline
- on_finalize at deadline slashes the provider's stake to zero and
  pays the challenger a 10% reward
- challenge_replica reads last_sync, rejects primary providers, and
  rejects replicas without a confirmed sync

Two helpers:

- `advance_to(n)` — runs both system and storage-provider hooks each
  block (the default `run_to_block` in mock.rs only runs system hooks)
- `single_chunk_proof(data)` — builds (mmr_root, MmrProof, MerkleProof)
  for a single-leaf single-chunk MMR so tests can construct genuinely
  valid proofs without standing up real chunking infrastructure
Audit (AUDIT.md, Tier 3) flagged that the same `CommitmentPayload` signed
by a provider could be replayed indefinitely — there was no nonce,
timestamp, or block number bound into the signed bytes. An attacker who
captured a single signature could submit `challenge_offchain` against
the provider forever.

Bind a `nonce: u64` field into `CommitmentPayload` (bumping
`CURRENT_VERSION` from `0x01` to `0x02` so V1-shaped signatures can't
verify against the V2 layout). The pallet validates the nonce via a new
`T::MaxNonceAge` config constant on three sites:

  - `checkpoint` extrinsic — caller passes the nonce all signing
    providers agreed on; stored in `BucketSnapshot::commitment_nonce`
    so `extend_checkpoint` can verify late-arriving signatures
  - `challenge_offchain` extrinsic — caller passes the nonce the
    provider used when signing the captured commitment
  - `respond_to_challenge::Deleted` arm — admin signs over the new
    deletion commitment with a fresh nonce

The pallet rejects nonces that are in the future or older than
`MaxNonceAge` blocks. `provider_checkpoint` is unchanged: it already
signs over `CheckpointProposal` which has its own `window` field for
replay protection.

Test infra:
  - 3 new pallet tests (`challenge_offchain_rejects_old_nonce`,
    `_rejects_future_nonce`, `checkpoint_rejects_old_nonce`) exercise
    the recency gate via the parameter check that fires before
    signature verification — so a dummy signature suffices.
  - Existing challenge_tests baseline updated for the new
    `BucketSnapshot` field.

Plumbing through the surrounding stack:
  - `CommitRequest` / `CommitmentQuery` / `CommitmentResponse` /
    `CheckpointSignatureResponse` / `DeleteRequest` / `DeleteResponse`
    in provider-node carry the nonce
  - Provider-node API handlers thread the caller-supplied nonce into
    the signed `CommitmentPayload`
  - Rust SDK `commit` / `get_commitment` / `get_checkpoint_signature`
    take an explicit `nonce: u64` argument
  - JS demo helpers (`uploadChunk`, `fetchCheckpointSignature`,
    `submitClientCheckpoint`, `challengeOffchain`) snapshot the current
    block number from chain and pass it through
  - Runtime + mock configs set `MaxNonceAge` (24 h on the live
    runtimes, 200 blocks in tests)
  - Benchmarks pass `0u64` as nonce (running at low block numbers
    keeps them inside `MaxNonceAge`)
Before: a provider who couldn't produce a real challenge proof could
submit garbage via respond_to_challenge — the extrinsic failed with
`InvalidChallengeProof` but the challenge remained pending until the
deadline. Slashing fired only on timeout, so a misbehaving provider
got to stall instead of being penalised promptly.

After: in `respond_to_challenge`, each response variant is adjudicated
into one of:

  - Ok(()) — response defends, run the existing cost-split path
  - Err(SlashReason) — response is a demonstrable lie, slash now

`Proof` arm slashes with `InvalidProof` if either the chunk-Merkle
proof or the MMR proof fails to verify. `Deleted` arm slashes with
`InvalidDeletionClaim` if the `new_start_seq` doesn't cover the
challenged seq or the admin signature is forged. `Superseded` arm
slashes with `InvalidSupersededClaim` if the bucket has no snapshot
or the challenged seq is past the canonical end.

Parameter-shape errors (stale nonce, non-admin signer) still bubble up
as DispatchError — those represent caller mistakes, not adversarial
responses. The provider isn't punished for typos.

New primitives type `SlashReason` (Timeout | InvalidProof |
InvalidDeletionClaim | InvalidSupersededClaim) is carried in the
`ChallengeSlashed` event so observers can tell which failure mode
fired. `on_finalize` passes `Timeout`; the new immediate-slash paths
pass the matching variant.

Three new tests; two existing tests renamed from `_currently_errors`
to `_slashes_immediately` and rewritten to assert provider stake is
zero post-response and `challenges_failed` reflects the loss. The
extrinsic itself returns Ok(()) — the slash IS the valid state
transition; the provider sees their tx succeed and pays the price.
Three audit-flagged correctness bugs in the challenge path, all now
addressed:

1. **leaf_count=0 placeholder in challenge_offchain**
   The pallet rebuilt the signed CommitmentPayload with a hardcoded
   leaf_count=0, which only verified if the provider also signed with
   0. Add a `leaf_count` parameter to the extrinsic. Provider-node
   /commit, /commitment, and /delete now sign with the real
   post-mutation leaf_count and surface it in the response so the
   challenger can pass it through.

2. **start_seq=0 placeholder in challenge_replica**
   `Replica.last_sync` was `Option<(H256, BlockNumber)>` — no way to
   recover the start_seq at sync time, so challenges always used 0.
   Replace with `Option<ReplicaSyncRecord { mmr_root, start_seq,
   leaf_count, block }>` populated by `confirm_replica_sync` from
   the bucket snapshot's actual sequence metadata.

3. **Hardcoded `100u32` challenge deposit**
   At 12 decimals this was ~1e-10 of a token — challenge spam was
   effectively free. Replace with `T::ChallengeDeposit::get()`. Live
   runtimes set 1 token (1e12); mocks keep 100 so existing test math
   doesn't shift.

Test updates:
- `challenge_replica_uses_last_sync` now asserts the stored start_seq
  (7) appears on the resulting Challenge (previously asserted 0).
- `challenge_offchain_rejects_old_nonce` /
  `_rejects_future_nonce` updated for the new 10-arg extrinsic
  signature.

Surface-area updates: pallet Config (`ChallengeDeposit`); mocks
(pallet, drive-registry, s3-registry); both runtimes; provider-node
CommitResponse adds `leaf_count`; storage_user.rs SDK
CommitResponse adds `leaf_count`; JS demos thread `upload.leafCount`
into `challengeOffchain`. Benchmarks updated for the extra extrinsic
arg.
The Rust SDK's `get_total_challenge_earnings` and `get_challenge_stats`
have been returning `Ok(0)` with comments noting "no on-chain
per-challenger aggregate stored." This commit adds the aggregate so the
SDK can answer truthfully in the next commit.

New primitives type `ChallengerStatRecord<Balance>` (total_challenges,
successful_challenges, failed_challenges, total_earnings) and new
pallet storage `ChallengerStats: StorageMap<AccountId, _>` with
ValueQuery (default for new challengers).

Stats are bumped at four sites:

  - `create_challenge` — `total_challenges += 1`
  - Defended path in `respond_to_challenge` — `failed_challenges += 1`
    (the challenger lost; the provider successfully proved possession)
  - `slash_provider_for_failed_challenge` (used by both timeout and
    invalid-response paths) — `successful_challenges += 1` and the
    10%-of-slashed-stake reward added to `total_earnings`

Four new pallet tests cover all four state transitions.
ChallengeResponder was defined and exported from the lib but never
instantiated in command.rs — the entire module was dead code in
production. CI integration passes happened to work because the JS test
driver orchestrates challenge responses end-to-end via HTTP proof
endpoints + client SDK extrinsics, so the provider never needed to
detect challenges on its own.

This commit closes that gap.

`poll_challenges` previously returned `Ok(vec![])` with a TODO. Now:
  - Resolves the provider's 32-byte account from the signer's public
    key (preferred) or from the SS58-encoded `provider_id` fallback
  - Iterates `StorageProvider::Challenges` storage via subxt dynamic
    queries
  - For each entry, manually decodes the SCALE-encoded `Vec<Challenge>`
    (the byte layout of `Challenge<T>` is stable for deployed runtimes)
  - Keeps only entries whose `provider` field matches our account,
    preserving the original index for `ChallengeId` reconstruction

A round-trip unit test (`decode_challenge_vec_filters_by_provider`)
hand-rolls the SCALE bytes for two `Challenge` entries, runs the
decoder, and asserts it picks the right one with the right index. If
the pallet's struct layout ever drifts from what the decoder expects,
this test fires immediately.

Wiring: new `ChallengeResponderParams` CLI flags
(`--enable-challenge-responder`, `--challenge-poll-interval`),
`start_challenge_responder` startup helper in command.rs alongside the
other coordinators, instantiation in `run()`. Like the other
coordinators, it requires `--keyfile` to sign and is opt-in via the
enable flag.

Cost is bounded by `ChallengeTimeout` — storage entries past their
deadline are reaped in `on_finalize`, so iteration size is at worst
the number of distinct deadlines with at least one open challenge.

Also drop a redundant `storage_primitives::` qualifier on the
`ChallengerStats` storage definition (the type is already imported in
the prelude).
`get_total_challenge_earnings` and `get_challenge_stats` were the last
of the audit's "Ok(0) with an apology comment" SDK stubs. The pallet
now maintains `ChallengerStats` per account (added in the previous
commit), so the SDK can answer for real.

Both methods now fetch `StorageProvider::ChallengerStats` keyed by
this client's challenger account. The on-chain
`ChallengerStatRecord` decodes via a small `read_u128` helper that
pulls each named field out of the dynamic value. ValueQuery on the
pallet side means a never-bumped account decodes to a zero record —
no None handling needed at the call site.

The integration tests `test_get_challenge_stats` and
`test_get_total_challenge_earnings` no longer assert "must be zero"
— that was a hardcoded reflection of the SDK stubs. They now assert
the more general invariant `total >= successful + failed` and that
the earnings read succeeds without erroring.
@ilchu ilchu self-assigned this Jun 3, 2026
@ilchu ilchu requested a review from bkontur June 3, 2026 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant