Sigma v4 logup backend #62

Open
BidhanRoy wants to merge 6 commits into main from sigma-v4-logup-backend

@BidhanRoy

Sigma-v4 backend performance results

Measured on the same class of host as the v3 results (4-core x86-64
container, release builds, default fixed-point config `scale_bits=20`,
`value_bits=63`, `intermediate_bits=127`, trivial scaling). "v3" numbers are
from `native_v3_results.md` on identical hardware. The v4 backend has no
proving-key or SRS generation, so there is no warm/cold split: the first
proof of a shape costs the same as every other (v3 "cold" paid keygen on
every fresh process, which is the realistic first-use cost).

Reproduce with `cargo run --release --example bench_prove -- <in> <rank> <out> <reps> sigma`
(`halo2` as the last argument benchmarks the legacy backend) and
`python benchmarks/run_benchmarks.py`.

Single proof + verify (Rust, bench_prove, steady state)

| shape (in×rank×out) | v3 prove warm | v3 prove cold | v4 prove | speedup warm / cold | v3 verify | v4 verify | proof size |
|---|---|---|---|---|---|---|---|
| 2×1×2 | 0.73 s | 2.9 s | 15 ms | 49× / 193× | 0.020 s | 16 ms | 31 KB |
| 8×2×8 | 1.4 s | 6.0 s | 17 ms | 82× / 353× | 0.034 s | 17 ms | 36 KB |
| 16×2×16 | 7.0 s | 27.2 s | 21 ms | 333× / 1295× | 0.16 s | 20 ms | 45 KB |
| 64×4×64 | 10.8 s | 47.2 s | 35 ms | 309× / 1349× | 0.22 s | 36 ms | 99 KB |
| 768×2×256 | 81 s | 429 s | 97 ms | 835× / 4423× | 1.3 s | 101 ms | 327 KB |
| 768×4×768 | infeasible (>15 GB) | infeasible | 183 ms | | | 205 ms | 611 KB |
| 768×4×2304 | infeasible (est. >200 GB pk) | infeasible | 470 ms | | | 587 ms | 2.2 MB |

Proving memory drops from gigabytes (12.1 GB at k=19; OOM beyond) to tens of
megabytes for every shape: the halo2 extended-domain evaluations are gone
entirely.

End-to-end Python pipeline (run_benchmarks.py, 4 workers)

| shape | invocations | v3 prove wall | v4 prove wall | speedup | v3 verify wall | v4 verify wall |
|---|---|---|---|---|---|---|
| 16×2×16 | 8 | 71.4 s | 0.098 s | 729× | 3.4 s | 0.23 s |
| 32×4×32 | 6 | 107.4 s | 0.097 s | 1107× | 3.8 s | 0.61 s |
| 768×4×768 | 4 | infeasible | 0.48 s | | | 12.7 s\* |

\* dominated by the one-time adapter-setup verification (Bulletproofs over
all 24,576 committed weights), paid once per adapter per verifier process
and cached afterwards; the per-invocation verification is ~0.2 s.

One-time adapter setup (manifest creation)

The per-weight work that v3 paid inside every invocation proof is paid
once per adapter when the manifest is written: Pedersen row commitments, an
aggregated exact range proof for every weight, and a linking proof.

| adapter shape | weights | setup prove | setup blob |
|---|---|---|---|
| 16×2×16 | 64 | 1.8 s | 17 KB |
| 768×2×256 | 2,048 | 14 s | 485 KB |
| 768×4×2304 | 12,288 | 82 s | 2.5 MB |
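
The weight counts in this table are consistent with an adapter that holds an in×rank matrix A and a rank×out matrix B. That reading is inferred from the numbers rather than stated here, so the helper below is only a sanity-check sketch under that assumption (`lora_weight_count` is a hypothetical name, not part of the crate):

```python
def lora_weight_count(in_dim, rank, out_dim):
    """Adapter weight count, ASSUMING A is in_dim x rank and B is rank x out_dim."""
    return in_dim * rank + rank * out_dim


# Shapes taken from the setup table above.
for shape in [(16, 2, 16), (768, 2, 256), (768, 4, 2304)]:
    print(shape, lora_weight_count(*shape))
```

The same formula gives 49,152 for a rank-16 768×2304 module, which matches the "~49k quantized weights each" figure quoted for the distilgpt2 adapter.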

This artifact ships inside the pinned adapter manifest; it contains no
weight or salt material.

Real-model end-to-end (distilgpt2 + ng0-k1/distilgpt2-finetuned-es)

The full quickstart flow from the README — contributor server, base-model
client over sockets, native proof generation, CLI verification — run on the
same 4-core host. The adapter has six rank-16 768×2304 c_attn modules
(~49k quantized weights each); under v3 a single such module exceeded the
host's memory by an order of magnitude (estimated >200 GB of params/keys), so
none of this was previously possible on any realistic machine.

| stage | wall time |
|---|---|
| manifest write incl. one-time adapter setup proofs, 6 modules | ~6 min (one-time per adapter) |
| inference + proof generation, 60 invocation proofs (6 modules × 10 token rows) | 62 s |
| verification of all 60 proofs incl. one-time adapter-setup verification | 162 s |

A byte-flipped proof artifact is rejected
(`proof bytes failed native sigma-v4 verification`), as are tampered
statements, transcripts, manifests, and verification keys (covered by the
test suites).

Range engines

Per-invocation proofs default to the sumcheck-based LogUp range engine
(microseconds of field work per range entry). `ZKLORA_RANGE_ENGINE=bulletproofs`
opts into Bulletproofs instead: proofs shrink ~5-8× at ~5-10× slower proving
(still 50-100× faster than v3). Both engines prove the identical exact
intervals under the same discrete-log + Fiat-Shamir assumptions, and the
verifier accepts either.

…s with adapter size

Replace the per-invocation halo2 circuit (which re-hashed and re-range-checked
every adapter weight inside every proof) with a commit-and-prove protocol over
ristretto255:

- Adapter setup, once per manifest: salted Pedersen row commitments to A and B,
  per-weight commitments, an aggregated Bulletproofs range proof pinning every
  weight to the exact [-value_bound, value_bound] interval, and a Schnorr link
  proof tying row and value commitments together. The pinned adapter
  commitment string is the SHA-256 of the deterministic commitment core.
- Invocation proofs: Pedersen commitments to the rounding quotients and
  remainders of the exact three-stage quantized pipeline, Fiat-Shamir random
  projections (Schwartz-Zippel) of each matrix equation onto scalar equations
  proven by generalized Schnorr plus one rank-sized quadratic inner-product
  sigma protocol, and aggregated Bulletproofs for the exact remainder and
  quotient intervals. Per-proof work is O(in + rank + out) group operations.
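
To illustrate the random-projection step described above, here is a minimal sketch of checking a vector-matrix equation y = x·A through a single scalar equation. It works on plaintext integers, whereas the real protocol operates on Pedersen-committed values and proves the projected relation with sigma protocols. The function names and the transcript encoding are assumptions made for this example only.

```python
import hashlib

# Prime order of the ristretto255 scalar field, used here as the modulus.
ORDER = 2**252 + 27742317777372353535851937790883648493


def challenge_vector(transcript, n):
    """Derive n field elements from a transcript hash (toy Fiat-Shamir)."""
    out = []
    for i in range(n):
        digest = hashlib.sha512(transcript + i.to_bytes(4, "little")).digest()
        out.append(int.from_bytes(digest, "little") % ORDER)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % ORDER


def matvec(m, v):
    return [dot(row, v) for row in m]


def projected_check(x, a, y):
    """Check y = x*A (x: length in, A: in x rank, y: length rank) by testing
    the single scalar equation <y, r> == <x, A*r> for a challenge vector r."""
    transcript = repr((x, a, y)).encode()  # hypothetical transcript binding
    r = challenge_vector(transcript, len(y))
    return dot(y, r) == dot(x, matvec(a, r))


x = [1, 2, 3]
a = [[1, 0], [0, 1], [2, 2]]
print(projected_check(x, a, [7, 8]))  # honest product
print(projected_check(x, a, [7, 9]))  # tampered output
```

By Schwartz-Zippel, a wrong y passes only if the challenge happens to hit a root of a nonzero low-degree polynomial, which occurs with probability about 1/ORDER. The verifier does O(in + rank) scalar work per equation instead of re-deriving every matrix entry.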

Statement semantics are unchanged: same canonical half-up rounding, same exact
remainder intervals, same value/intermediate bounds, same transcript binding
and verifier trust boundary. Assumptions stay in the same class (discrete log
plus Fiat-Shamir; halo2-IPA was already DLOG-based) and adapter hiding
improves: commitments are perfectly hiding and salted where the v3 Poseidon
chain was unsalted. v3 halo2 artifacts (schema 2) still verify through the
legacy path, and the halo2 backend remains in the crate.
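
The "rounding quotients and remainders" can be pictured with the sketch below. It assumes half-up rounding means q = floor((v + 2^(s-1)) / 2^s) with remainder r in [0, 2^s). The exact canonical definition lives in the crate and is not reproduced in this PR, so treat the formula and the function names as assumptions.

```python
SCALE_BITS = 20  # default scale_bits from the benchmark configuration


def round_half_up(v, scale_bits=SCALE_BITS):
    """Return (quotient, remainder) with v + half == q * 2^s + r and 0 <= r < 2^s."""
    half = 1 << (scale_bits - 1)
    q = (v + half) >> scale_bits  # arithmetic shift floors for negative values too
    r = v + half - (q << scale_bits)
    return q, r


def rounding_relation_holds(v, q, r, scale_bits=SCALE_BITS):
    """The two facts a proof would need: one linear equation and one exact interval."""
    half = 1 << (scale_bits - 1)
    return v + half == (q << scale_bits) + r and 0 <= r < (1 << scale_bits)


for v in (3 << 19, -(3 << 19), 123456789):
    q, r = round_half_up(v)
    print(v, q, r, rounding_relation_holds(v, q, r))
```

Under this reading, the linear equation is the kind of relation a Schnorr-style proof over commitments can cover, and the interval on r is what the range engine enforces.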

Measured on the 4-core benchmark host (same harness as the v3 results):
- end-to-end 16x2x16 x8 invocations: prove 71.4s -> 0.84s, verify 3.4s -> 0.26s
- end-to-end 32x4x32 x6: prove 107.4s -> 1.14s
- single proof 768x2x256: prove 81s warm / 429s cold -> 1.0s (no keygen, no
  warm/cold split), verify 1.3s -> 0.14s; proving memory GBs -> MBs
- 768x4x768 and 768x4x2304: previously out of memory on a 15GB host, now
  seconds per proof

Replace per-invocation Bulletproofs range proofs with a sumcheck-based LogUp
lookup argument as the default range engine. Every committed quotient and
remainder is decomposed into 8-bit digits inside one vector Pedersen
commitment; a LogUp identity over a 256-entry table is proven by two
sumchecks whose round polynomial coefficients are sent as Pedersen
commitments, never in the clear. All sumcheck verifier relations (round
consistency, claim chaining, final evaluations, MLE openings, digit
recomposition) are linear over committed scalars and fold into a single
generalized Schnorr proof; the one bilinear final-evaluation check uses a
product sigma protocol. Zero knowledge is structural (only hiding
commitments and uniform responses are sent) and soundness is standard
sumcheck + Schwartz-Zippel + Pedersen binding under the same discrete-log +
Fiat-Shamir assumptions. ZKLORA_RANGE_ENGINE=bulletproofs opts back into
compact proofs; the adapter setup keeps Bulletproofs (one-time, size-
sensitive); the verifier accepts both engines.
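
As a rough picture of the digit decomposition and the LogUp identity, the sketch below checks, in the clear, that every 8-bit digit appears in the 256-entry table by comparing sums of inverses at a challenge point. In the actual engine the same identity is proven by sumchecks over committed values; nothing here reflects the crate's API, and all names are hypothetical.

```python
import hashlib
from collections import Counter

# Prime order of the ristretto255 scalar field, used as the working field.
ORDER = 2**252 + 27742317777372353535851937790883648493


def digits_8bit(value, n_digits):
    """Little-endian base-256 digits of a value in [0, 2^(8*n_digits))."""
    assert 0 <= value < 1 << (8 * n_digits)
    return [(value >> (8 * i)) & 0xFF for i in range(n_digits)]


def recompose(digits):
    return sum(d << (8 * i) for i, d in enumerate(digits))


def inverse(x):
    return pow(x % ORDER, ORDER - 2, ORDER)  # Fermat inverse, ORDER is prime


def multiplicities(digits):
    """How often each table entry 0..255 is used (out-of-table digits are ignored)."""
    counts = Counter(digits)
    return [counts.get(t, 0) for t in range(256)]


def logup_identity_holds(digits, mult, alpha):
    """sum_i 1/(alpha - d_i) == sum_t m_t/(alpha - t) over the 256-entry table."""
    lhs = sum(inverse(alpha - d) for d in digits) % ORDER
    rhs = sum(m * inverse(alpha - t) for t, m in enumerate(mult) if m) % ORDER
    return lhs == rhs


alpha = int.from_bytes(hashlib.sha512(b"logup-demo").digest(), "little") % ORDER
good = digits_8bit(0x012345, 3)
print(recompose(good) == 0x012345, logup_identity_holds(good, multiplicities(good), alpha))
bad = [0x45, 300, 0x01]  # 300 is not a table entry
print(logup_identity_holds(bad, multiplicities(bad), alpha))
```

A digit outside the table leaves a pole on the left-hand side that no choice of multiplicities can cancel, which is why the identity at a random challenge implies every digit is in range.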

Also: chunk-parallel MSMs, precomputed blinding-generator table, indicator-
gated padding so pad coordinates are exact zeros and cost nothing, parallel
sumcheck rounds and twin derivation, a thread-local ChaCha CSPRNG for
blinding generation, and lock-discipline fixes (never hold a cache lock
across rayon work: a worker waiting on stolen subtasks can steal a job that
blocks on the same lock and deadlock the pool -- this also fixes a latent
hazard in the halo2 key caches).

Measured on the 4-core benchmark host (steady state; v4 has no keygen so
first proofs cost the same):
- 768x2x256: prove 81 s warm / 429 s cold -> 97 ms (835x / 4423x), verify
  1.3 s -> 101 ms
- 16x2x16: prove 7.0 s -> 21 ms (333x); 64x4x64: 10.8 s -> 35 ms (309x)
- 768x4x768 and 768x4x2304 (OOM-infeasible under v3): 183 ms and 470 ms
- end-to-end pipeline: 16x2x16 x8 71.4 s -> 0.098 s (729x); 32x4x32 x6
  107.4 s -> 0.097 s (1107x)
- proving memory: gigabytes -> tens of megabytes

Full README quickstart flow validated on distilgpt2 with the
ng0-k1/distilgpt2-finetuned-es adapter (six rank-16 768x2304 c_attn
modules): 60 invocation proofs generated in 62s and verified in 162s on a
4-core host, with tampered artifacts rejected. Under v3 a single module of
this shape was estimated at >200GB of proving material.

The prover recomputed the adapter commitment (serde + SHA-256 over a
multi-MB core) on every invocation proof, and batch verification re-parsed
the same setup JSON for every artifact of a module. Both are now cached by
content hash, keeping per-proof costs flat in adapter size on both sides.

Hiding (high): blinding factors for adapter row/weight commitments were
derived from (salt, domain, index) only, so two same-shaped adapters from
one contributor shared blindings at every index and commitment differences
C1_i - C2_i = (w1_i - w2_i)*G leaked exact weight differences from public
manifests. Blindings are now derived under a per-adapter key (keyed BLAKE3
of the full canonical adapter payload under the salt), so distinct adapters
never share blindings; a regression test pins this.
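
The leak and its fix can be reproduced in a toy setting. The sketch below uses a small multiplicative group modulo a Mersenne prime instead of ristretto255, and keyed BLAKE2b from the standard library as a stand-in for keyed BLAKE3. The generators, domain strings, and function names are all hypothetical and insecure; they serve only to show the algebra.

```python
import hashlib

P = 2**127 - 1   # toy prime modulus, NOT the ristretto255 group
G, H = 3, 5      # toy "generators" (insecure, illustration only)


def commit(w, r):
    """Toy Pedersen commitment g^w * h^r, written multiplicatively."""
    return pow(G, w % (P - 1), P) * pow(H, r % (P - 1), P) % P


def blinding_old(salt, index):
    """Old scheme: depends on (salt, domain, index) only."""
    d = hashlib.blake2b(b"weight" + index.to_bytes(8, "little"), key=salt).digest()
    return int.from_bytes(d, "little") % (P - 1)


def blinding_new(salt, adapter_payload, index):
    """New scheme: derive a per-adapter key from the full payload first."""
    adapter_key = hashlib.blake2b(adapter_payload, key=salt, digest_size=32).digest()
    d = hashlib.blake2b(b"weight" + index.to_bytes(8, "little"), key=adapter_key).digest()
    return int.from_bytes(d, "little") % (P - 1)


def leaked_difference(c1, c2, bound=1000):
    """Brute-force w1 - w2 from C1 / C2, which works when the blindings cancel."""
    ratio = c1 * pow(c2, P - 2, P) % P
    for d in range(-bound, bound + 1):
        if pow(G, d % (P - 1), P) == ratio:
            return d
    return None


salt = b"\x01" * 32
w1, w2 = 1234, 1200
old = leaked_difference(commit(w1, blinding_old(salt, 0)), commit(w2, blinding_old(salt, 0)))
new = leaked_difference(
    commit(w1, blinding_new(salt, b"adapter-1", 0)),
    commit(w2, blinding_new(salt, b"adapter-2", 0)),
)
print(old, new)
```

With shared blindings the quotient of the two commitments depends only on w1 - w2, and bounded quantized weights make that difference easy to enumerate. Once the blindings differ per adapter, the quotient carries an unknown power of H and the search finds nothing.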

Salt handling (medium): adapter_salt() is now keyed by the resolved salt
file path instead of a single process global (a changed
ZKLORA_ADAPTER_SALT_FILE takes effect immediately), and file creation is
atomic (O_EXCL + temp-file rename) so concurrent first-touch processes
converge on one salt. LoRAServer warns instead of silently sharing when the
env var already points at another server's salt file.
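
One way to get "concurrent first-touch processes converge on one salt" is sketched below. It is not the crate's code: the temp file is created with `O_EXCL` via `tempfile.mkstemp`, and it is published with a hard link rather than a rename, because `os.link` fails if the target exists while a plain rename would overwrite it. The function name and the 32-byte salt length are assumptions.

```python
import os
import secrets
import tempfile


def load_or_create_salt(path, nbytes=32):
    """Return the salt at `path`, creating it atomically if it does not exist."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp opens with O_CREAT | O_EXCL, so the temp name is never shared.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".salt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(secrets.token_bytes(nbytes))
            f.flush()
            os.fsync(f.fileno())
        try:
            os.link(tmp, path)  # atomic publish; raises if another process won
        except FileExistsError:
            pass  # lost the race: fall through and read the winner's salt
    finally:
        os.unlink(tmp)
    with open(path, "rb") as f:
        return f.read()


demo_dir = tempfile.mkdtemp()
salt_path = os.path.join(demo_dir, "adapter.salt")
print(load_or_create_salt(salt_path) == load_or_create_salt(salt_path))
```

Readers never observe a partially written salt, because the final path only ever points at a fully written and synced file.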

Legacy compatibility (medium): the claim that v3 halo2 artifacts (schema 2,
backend zklora-halo2-v3 -- the id main actually emits) still verify is now
backed by a test that reconstructs artifacts byte-for-byte as main wrote
them and runs them through verify_artifacts, including tamper rejection.

Robustness: duplicate invocation records are rejected before parallel proof
generation can race on artifact paths; invocation-proof deserialization is
allocation-bounded by input size; explicit dimension caps stop hostile
statements from driving unbounded allocations; the LogUp engine absorbs the
entry commitments into its own transcript (defense in depth for the
recomposition challenge); verifier rejects nonzero pad responses, making
the pad-coordinate invariant checked rather than conventional; limb_plan
now enforces the documented at-most-+1 multi-limb slack instead of merely
claiming it, with the general slack formula documented.
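
The allocation-bounding and dimension-cap items can be illustrated with a length-prefixed vector reader. The wire format (a little-endian `u32` count followed by 32-byte scalars), the cap value, and the function name are assumptions for this sketch, not the crate's actual encoding.

```python
import struct

MAX_DIM = 1 << 16   # hypothetical explicit dimension cap
SCALAR_LEN = 32     # assumed encoded scalar size


def read_scalar_vec(buf, offset=0):
    """Read a u32-length-prefixed vector of 32-byte scalars.

    The declared count is validated against both a hard cap and the bytes
    actually remaining BEFORE anything is allocated."""
    if len(buf) - offset < 4:
        raise ValueError("truncated length prefix")
    (count,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    if count > MAX_DIM:
        raise ValueError("dimension exceeds cap")
    if count * SCALAR_LEN > len(buf) - offset:
        raise ValueError("declared length exceeds input size")
    items = [buf[offset + SCALAR_LEN * i: offset + SCALAR_LEN * (i + 1)]
             for i in range(count)]
    return items, offset + SCALAR_LEN * count


honest = struct.pack("<I", 2) + b"\x01" * 32 + b"\x02" * 32
items, end = read_scalar_vec(honest)
print(len(items), end)

hostile = struct.pack("<I", 0xFFFFFFFF)  # claims ~4 billion scalars, sends none
try:
    read_scalar_vec(hostile)
except ValueError as err:
    print("rejected:", err)
```

Checking the count against the remaining input means memory use is bounded by the size of the message the attacker actually sent, not by a number they merely claimed.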

Also fixes a stale sumcheck comment and documents the LoRAServer last_*
snapshot-slot locking invariant.