Sigma v4 logup backend #62
Open
BidhanRoy wants to merge 6 commits into
…s with adapter size

Replace the per-invocation halo2 circuit (which re-hashed and re-range-checked every adapter weight inside every proof) with a commit-and-prove protocol over ristretto255:

- Adapter setup, once per manifest: salted Pedersen row commitments to A and B, per-weight commitments, an aggregated Bulletproofs range proof pinning every weight to the exact [-value_bound, value_bound] interval, and a Schnorr link proof tying row and value commitments together. The pinned adapter commitment string is the SHA-256 of the deterministic commitment core.
- Invocation proofs: Pedersen commitments to the rounding quotients and remainders of the exact three-stage quantized pipeline, Fiat-Shamir random projections (Schwartz-Zippel) of each matrix equation onto scalar equations proven by generalized Schnorr plus one rank-sized quadratic inner-product sigma protocol, and aggregated Bulletproofs for the exact remainder and quotient intervals. Per-proof work is O(in + rank + out) group operations.

Statement semantics are unchanged: same canonical half-up rounding, same exact remainder intervals, same value/intermediate bounds, same transcript binding and verifier trust boundary. Assumptions stay in the same class (discrete log plus Fiat-Shamir; halo2-IPA was already DLOG-based) and adapter hiding improves: commitments are perfectly hiding and salted where the v3 Poseidon chain was unsalted. v3 halo2 artifacts (schema 2) still verify through the legacy path, and the halo2 backend remains in the crate.

Measured on the 4-core benchmark host (same harness as the v3 results):

- end-to-end 16x2x16 x8 invocations: prove 71.4s -> 0.84s, verify 3.4s -> 0.26s
- end-to-end 32x4x32 x6: prove 107.4s -> 1.14s
- single proof 768x2x256: prove 81s warm / 429s cold -> 1.0s (no keygen, no warm/cold split), verify 1.3s -> 0.14s; proving memory GBs -> MBs
- 768x4x768 and 768x4x2304: previously out of memory on a 15GB host, now seconds per proof
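The random-projection step can be illustrated in the clear. The following is a minimal sketch, not the crate's code: it works on plain integers modulo a toy prime `P` (an assumption; the protocol itself operates on committed ristretto255 scalars), and the helper names are hypothetical. If y != M x, the difference vector is non-zero, so a uniformly random `r` makes the projected scalar equation hold with probability only 1/P.

```python
import secrets

P = 2**61 - 1  # toy prime modulus (assumption); real scalars live in the ristretto255 scalar field


def dot(u, v):
    """Inner product of two equal-length vectors mod P."""
    return sum(a * b for a, b in zip(u, v)) % P


def project_rows(r, M):
    """Fold the rows of M with weights r: the row vector r^T M mod P."""
    cols = len(M[0])
    return [sum(r[i] * M[i][j] for i in range(len(M))) % P for j in range(cols)]


def projection_holds(M, x, y, r):
    """Check the scalar equation <r^T M, x> == <r, y>, which y = M x implies."""
    return dot(project_rows(r, M), x) == dot(r, y)


def random_challenge(n):
    """Stand-in for a Fiat-Shamir challenge: n uniform field elements."""
    return [secrets.randbelow(P) for _ in range(n)]
```

An honest `y` passes for every challenge, for example `projection_holds(M, x, y, random_challenge(len(y)))`, while a single wrong output coordinate fails except with probability 1/P.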
Replace per-invocation Bulletproofs range proofs with a sumcheck-based LogUp lookup argument as the default range engine. Every committed quotient and remainder is decomposed into 8-bit digits inside one vector Pedersen commitment; a LogUp identity over a 256-entry table is proven by two sumchecks whose round polynomial coefficients are sent as Pedersen commitments, never in the clear. All sumcheck verifier relations (round consistency, claim chaining, final evaluations, MLE openings, digit recomposition) are linear over committed scalars and fold into a single generalized Schnorr proof; the one bilinear final-evaluation check uses a product sigma protocol. Zero knowledge is structural (only hiding commitments and uniform responses are sent) and soundness is standard sumcheck + Schwartz-Zippel + Pedersen binding under the same discrete-log + Fiat-Shamir assumptions.

ZKLORA_RANGE_ENGINE=bulletproofs opts back into compact proofs; the adapter setup keeps Bulletproofs (one-time, size-sensitive); the verifier accepts both engines.

Also: chunk-parallel MSMs, precomputed blinding-generator table, indicator-gated padding so pad coordinates are exact zeros and cost nothing, parallel sumcheck rounds and twin derivation, a thread-local ChaCha CSPRNG for blinding generation, and lock-discipline fixes (never hold a cache lock across rayon work: a worker waiting on stolen subtasks can steal a job that blocks on the same lock and deadlock the pool -- this also fixes a latent hazard in the halo2 key caches).
Measured on the 4-core benchmark host (steady state; v4 has no keygen so first proofs cost the same):

- 768x2x256: prove 81 s warm / 429 s cold -> 97 ms (835x / 4423x), verify 1.3 s -> 101 ms
- 16x2x16: prove 7.0 s -> 21 ms (333x); 64x4x64: 10.8 s -> 35 ms (309x)
- 768x4x768 and 768x4x2304 (OOM-infeasible under v3): 183 ms and 470 ms
- end-to-end pipeline: 16x2x16 x8 71.4 s -> 0.098 s (729x); 32x4x32 x6 107.4 s -> 0.097 s (1107x)
- proving memory: gigabytes -> tens of megabytes
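The digit decomposition and the LogUp identity behind the range engine can be sketched in the clear. This is a toy illustration only: it checks the rational identity sum_i 1/(x - d_i) = sum_t m_t/(x - t) at one challenge point over a small prime field `P` (an assumption), with no commitments, sumchecks, or zero knowledge. All function names are hypothetical.

```python
from collections import Counter

P = 2**61 - 1       # toy prime field (assumption)
TABLE = range(256)  # the 256-entry table of valid 8-bit digits


def to_digits(value, n_digits):
    """Little-endian base-256 digits; raises if the value does not fit."""
    if not 0 <= value < 256 ** n_digits:
        raise ValueError("value out of range")
    return [(value >> (8 * i)) & 0xFF for i in range(n_digits)]


def recompose(digits):
    """Inverse of to_digits: sum of d_i * 256^i."""
    return sum(d << (8 * i) for i, d in enumerate(digits))


def multiplicities(digits):
    """m_t = number of digits equal to table entry t (non-table values are not counted)."""
    counts = Counter(d for d in digits if d in TABLE)
    return [counts.get(t, 0) for t in TABLE]


def inv(a):
    """Modular inverse in the toy field (Python 3.8+)."""
    return pow(a % P, -1, P)


def logup_holds(digits, mults, x):
    """Check sum_i 1/(x - d_i) == sum_t m_t/(x - t) at the challenge point x."""
    lhs = sum(inv(x - d) for d in digits) % P
    rhs = sum(m * inv(x - t) for t, m in zip(TABLE, mults)) % P
    return lhs == rhs
```

A "digit" outside 0..255 adds a pole to the left-hand side that no choice of table multiplicities can cancel, so the check fails at a random `x` except with negligible probability; that is what turns the lookup into a range proof once the digits recompose to the committed value.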
Full README quickstart flow validated on distilgpt2 with the ng0-k1/distilgpt2-finetuned-es adapter (six rank-16 768x2304 c_attn modules): 60 invocation proofs generated in 62s and verified in 162s on a 4-core host, with tampered artifacts rejected. Under v3 a single module of this shape was estimated at >200GB of proving material.
The prover recomputed the adapter commitment (serde + SHA-256 over a multi-MB core) on every invocation proof, and batch verification re-parsed the same setup JSON for every artifact of a module. Both are now cached by content hash, keeping per-proof costs flat in adapter size on both sides.
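A minimal sketch of the verifier-side idea, assuming a plain in-process dictionary keyed by the SHA-256 of the raw setup bytes. `load_setup` and `STATS` are hypothetical names, and the crate's actual cache is not shown in this description.

```python
import hashlib
import json

_SETUP_CACHE = {}
STATS = {"parses": 0}  # instrumentation for this example only


def load_setup(raw: bytes):
    """Parse setup JSON at most once per distinct byte content."""
    key = hashlib.sha256(raw).digest()
    if key not in _SETUP_CACHE:
        STATS["parses"] += 1
        _SETUP_CACHE[key] = json.loads(raw)
    return _SETUP_CACHE[key]
```

Hashing here is still linear in the input size, only much cheaper than parsing; an implementation that already knows the digest (for instance from the pinned manifest) could key on that and skip the rehash.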
Hiding (high): blinding factors for adapter row/weight commitments were derived from (salt, domain, index) only, so two same-shaped adapters from one contributor shared blindings at every index and commitment differences C1_i - C2_i = (w1_i - w2_i)*G leaked exact weight differences from public manifests. Blindings are now derived under a per-adapter key (keyed BLAKE3 of the full canonical adapter payload under the salt), so distinct adapters never share blindings; a regression test pins this.

Salt handling (medium): adapter_salt() is now keyed by the resolved salt file path instead of a single process global (a changed ZKLORA_ADAPTER_SALT_FILE takes effect immediately), and file creation is atomic (O_EXCL + temp-file rename) so concurrent first-touch processes converge on one salt. LoRAServer warns instead of silently sharing when the env var already points at another server's salt file.

Legacy compatibility (medium): the claim that v3 halo2 artifacts (schema 2, backend zklora-halo2-v3 -- the id main actually emits) still verify is now backed by a test that reconstructs artifacts byte-for-byte as main wrote them and runs them through verify_artifacts, including tamper rejection.

Robustness:

- duplicate invocation records are rejected before parallel proof generation can race on artifact paths;
- invocation-proof deserialization is allocation-bounded by input size;
- explicit dimension caps stop hostile statements from driving unbounded allocations;
- the LogUp engine absorbs the entry commitments into its own transcript (defense in depth for the recomposition challenge);
- the verifier rejects nonzero pad responses, making the pad-coordinate invariant checked rather than conventional;
- limb_plan now enforces the documented at-most-+1 multi-limb slack instead of merely claiming it, with the general slack formula documented.

Also fixes a stale sumcheck comment and documents the LoRAServer last_* snapshot-slot locking invariant.
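The hiding issue is easy to reproduce on a toy model. The sketch below makes several loud assumptions: the "group" is the integers mod `Q` under addition with small fixed generators (insecure, chosen only to show the algebra), keyed BLAKE2b from the Python standard library stands in for keyed BLAKE3, and the payload encoding is invented for the example. None of the function names come from the crate.

```python
import hashlib

Q = 2**61 - 1  # toy additive group Z_Q (assumption, NOT secure)
G, H = 3, 5    # toy stand-ins for the Pedersen generators


def commit(w, r):
    """Toy Pedersen commitment C = w*G + r*H."""
    return (w * G + r * H) % Q


def prf(key, domain, index):
    """Keyed hash to a scalar (BLAKE2b standing in for keyed BLAKE3)."""
    h = hashlib.blake2b(f"{domain}:{index}".encode(), key=key, digest_size=32)
    return int.from_bytes(h.digest(), "big") % Q


def shared_blinding(salt, index):
    """Old scheme: depends on (salt, domain, index) only."""
    return prf(salt, "weight", index)


def per_adapter_blinding(salt, weights, index):
    """New scheme: first derive a key from the whole adapter payload under the salt."""
    payload = ",".join(str(w) for w in weights).encode()  # hypothetical canonical encoding
    adapter_key = hashlib.blake2b(payload, key=salt, digest_size=32).digest()
    return prf(adapter_key, "weight", index)


def recover_difference(c1, c2, bound):
    """Attacker: search small d with d*G == C1 - C2; None if no match."""
    target = (c1 - c2) % Q
    for d in range(-bound, bound + 1):
        if (d * G) % Q == target:
            return d
    return None
```

With shared blindings the blinding terms cancel in C1_i - C2_i, so the bounded search recovers each weight difference exactly; with per-adapter keys the difference carries an unknown (r1_i - r2_i)*H term and the search finds nothing.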
Sigma-v4 backend performance results
Measured on the same class of host as the v3 results (4-core x86-64
container, release builds, default fixed-point config
`scale_bits=20, value_bits=63, intermediate_bits=127`, trivial scaling). "v3" numbers are from
`native_v3_results.md` on identical hardware. The v4 backend has no proving-key or SRS generation, so there is no warm/cold split: the first
proof of a shape costs the same as every other (v3 "cold" paid keygen on
every fresh process, which is the realistic first-use cost).
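For orientation on what `scale_bits=20` means for the rounding witnesses, here is a minimal sketch of one common way to state half-up rounding as a quotient/remainder relation. It is an assumption that this matches the backend's exact relation; the function names are hypothetical.

```python
SCALE_BITS = 20  # the default scale_bits quoted above


def round_half_up_shift(z, bits=SCALE_BITS):
    """Return (q, r) with z + 2^(bits-1) == q * 2^bits + r and 0 <= r < 2^bits."""
    half = 1 << (bits - 1)
    q = (z + half) >> bits        # arithmetic shift: floor division, also for negative z
    r = (z + half) - (q << bits)  # exact remainder
    return q, r


def witness_ok(z, q, r, bits=SCALE_BITS):
    """The relation a verifier would check: exact identity plus remainder interval."""
    half = 1 << (bits - 1)
    return z + half == (q << bits) + r and 0 <= r < (1 << bits)
```

Under this formulation the remainder interval is what pins the quotient: for a given `z` exactly one pair `(q, r)` satisfies both the identity and `0 <= r < 2^bits`, which is why the remainders need an exact range proof.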
Reproduce with `cargo run --release --example bench_prove -- <in> <rank> <out> <reps> sigma`
(`halo2` as the last argument benchmarks the legacy backend) and `python benchmarks/run_benchmarks.py`.

Single proof + verify (Rust, `bench_prove`, steady state)

Proving memory drops from gigabytes (12.1 GB at k=19; OOM beyond) to tens of
megabytes for every shape: the halo2 extended-domain evaluations are gone
entirely.
End-to-end Python pipeline (`run_benchmarks.py`, 4 workers)

\* dominated by the one-time adapter-setup verification (Bulletproofs over
all 24,576 committed weights), paid once per adapter per verifier process
and cached afterwards; the per-invocation verification is ~0.2 s.
One-time adapter setup (manifest creation)
The per-weight work that v3 paid inside every invocation proof is paid
once per adapter when the manifest is written: Pedersen row commitments, an
aggregated exact range proof for every weight, and a linking proof.
This artifact ships inside the pinned adapter manifest; it contains no
weight or salt material.
Real-model end-to-end (distilgpt2 + ng0-k1/distilgpt2-finetuned-es)
The full quickstart flow from the README — contributor server, base-model
client over sockets, native proof generation, CLI verification — run on the
same 4-core host. The adapter has six rank-16 768×2304 c_attn modules
(~49k quantized weights each); under v3 a single such module exceeded the
host by an order of magnitude (estimated >200 GB of params/keys), so none
of this was previously possible on any realistic machine.
A byte-flipped proof artifact is rejected
(`proof bytes failed native sigma-v4 verification`), as are tampered
statements, transcripts, manifests, and verification keys (covered by the
test suites).
Range engines
Per-invocation proofs default to the sumcheck-based LogUp range engine
(microseconds of field work per range entry). `ZKLORA_RANGE_ENGINE=bulletproofs`
opts into Bulletproofs instead: proofs shrink ~5-8× at ~5-10× slower proving
(still 50-100× faster than v3). Both engines prove the identical exact
intervals under the same discrete-log + Fiat-Shamir assumptions, and the
verifier accepts either.
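To compare the two engines on one shape, a small driver can build the benchmark command with and without the environment variable. This sketch only assembles the arguments; it assumes that the `bench_prove` example honors `ZKLORA_RANGE_ENGINE` the same way the library does, and `bench_invocation` is a hypothetical helper.

```python
import os


def bench_invocation(in_dim, rank, out_dim, reps, bulletproofs=False, base_env=None):
    """Build (argv, env) for one bench_prove run with the chosen range engine."""
    argv = ["cargo", "run", "--release", "--example", "bench_prove", "--",
            str(in_dim), str(rank), str(out_dim), str(reps), "sigma"]
    env = dict(os.environ if base_env is None else base_env)
    if bulletproofs:
        env["ZKLORA_RANGE_ENGINE"] = "bulletproofs"
    else:
        env.pop("ZKLORA_RANGE_ENGINE", None)  # unset selects the default LogUp engine
    return argv, env
```

The pair can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root, once per engine.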