Sigma v4 logup backend by BidhanRoy · Pull Request #62 · bageldotcom/zkLoRA

BidhanRoy · 2026-06-11T18:24:04Z

Sigma-v4 backend performance results

Measured on the same class of host as the v3 results (4-core x86-64
container, release builds, default fixed-point config: scale_bits=20,
value_bits=63, intermediate_bits=127, trivial scaling). The "v3" numbers are
from native_v3_results.md on identical hardware. The v4 backend has no
proving-key or SRS generation, so there is no warm/cold split: the first
proof of a shape costs the same as every later one. (v3 "cold" paid keygen in
every fresh process, which is the realistic first-use cost.)

Reproduce with `cargo run --release --example bench_prove -- <in> <rank> <out> <reps> sigma`
(passing `halo2` as the last argument benchmarks the legacy backend) and
`python benchmarks/run_benchmarks.py`.

Single proof + verify (Rust, `bench_prove`, steady state)

shape (in×rank×out)	v3 prove warm	v3 prove cold	v4 prove	speedup warm / cold	v3 verify	v4 verify	proof bytes
2×1×2	0.73 s	2.9 s	15 ms	49× / 193×	0.020 s	16 ms	31 KB
8×2×8	1.4 s	6.0 s	17 ms	82× / 353×	0.034 s	17 ms	36 KB
16×2×16	7.0 s	27.2 s	21 ms	333× / 1295×	0.16 s	20 ms	45 KB
64×4×64	10.8 s	47.2 s	35 ms	309× / 1349×	0.22 s	36 ms	99 KB
768×2×256	81 s	429 s	97 ms	835× / 4423×	1.3 s	101 ms	327 KB
768×4×768	infeasible (>15 GB)	infeasible	183 ms	∞	—	205 ms	611 KB
768×4×2304	infeasible (est. >200 GB pk)	infeasible	470 ms	∞	—	587 ms	2.2 MB

Proving memory drops from gigabytes (12.1 GB at k=19; OOM beyond) to tens of
megabytes for every shape: the halo2 extended-domain evaluations are gone
entirely.
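
The speedup column in the table above is the v3 time divided by the v4 time, rounded to the nearest integer. A minimal Python sketch that recomputes it from the table rows (the helper name `speedup` is illustrative and not part of the repository):

```python
# Times in milliseconds, copied from the single-proof table:
# (shape, v3 prove warm, v3 prove cold, v4 prove)
ROWS = [
    ("2x1x2", 730, 2_900, 15),
    ("8x2x8", 1_400, 6_000, 17),
    ("16x2x16", 7_000, 27_200, 21),
    ("64x4x64", 10_800, 47_200, 35),
    ("768x2x256", 81_000, 429_000, 97),
]


def speedup(v3_ms: int, v4_ms: int) -> int:
    """Ratio of v3 time to v4 time, rounded to the nearest integer."""
    return round(v3_ms / v4_ms)


for shape, warm, cold, v4 in ROWS:
    print(f"{shape}: {speedup(warm, v4)}x warm / {speedup(cold, v4)}x cold")
```

The two largest shapes are left out because v3 has no measured time to divide by.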

End-to-end Python pipeline (`run_benchmarks.py`, 4 workers)

shape	invocations	v3 prove wall	v4 prove wall	speedup	v3 verify wall	v4 verify wall
16×2×16	8	71.4 s	0.098 s	729×	3.4 s	0.23 s
32×4×32	6	107.4 s	0.097 s	1107×	3.8 s	0.61 s
768×4×768	4	infeasible	0.48 s	∞	—	12.7 s*

* dominated by the one-time adapter-setup verification (Bulletproofs over
all 24,576 committed weights), paid once per adapter per verifier process
and cached afterwards; the per-invocation verification is ~0.2 s.

One-time adapter setup (manifest creation)

The per-weight work that v3 paid inside every invocation proof is paid
once per adapter when the manifest is written: Pedersen row commitments, an
aggregated exact range proof for every weight, and a linking proof.

adapter shape	weights	setup prove	setup blob
16×2×16	64	1.8 s	17 KB
768×2×256	2,048	14 s	485 KB
768×4×2304	12,288	82 s	2.5 MB

The setup blob ships inside the pinned adapter manifest; it contains no
weight or salt material.
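
The weight counts in the table are consistent with an adapter holding two matrices, A (in×rank) and B (rank×out), so in·rank + rank·out weights in total. A short Python sketch under that assumption (the function name is illustrative):

```python
def lora_weight_count(n_in: int, rank: int, n_out: int) -> int:
    """Number of weights in A (n_in x rank) plus B (rank x n_out)."""
    return n_in * rank + rank * n_out


for shape in [(16, 2, 16), (768, 2, 256), (768, 4, 2304), (768, 16, 2304)]:
    print(shape, lora_weight_count(*shape))
```

The last shape is the rank-16 768×2304 module from the real-model run; the formula gives 49,152 weights, matching the "~49k" quoted there.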

Real-model end-to-end (distilgpt2 + ng0-k1/distilgpt2-finetuned-es)

The full quickstart flow from the README (contributor server, base-model
client over sockets, native proof generation, CLI verification) was run on
the same 4-core host. The adapter has six rank-16 768×2304 c_attn modules
(~49k quantized weights each); under v3 a single such module exceeded the
host's memory by an order of magnitude (estimated >200 GB of params/keys),
so none of this was previously possible on any realistic machine.

stage	wall time
manifest write incl. one-time adapter setup proofs, 6 modules	~6 min (one-time per adapter)
inference + proof generation, 60 invocation proofs (6 modules × 10 token rows)	62 s
verification of all 60 proofs incl. one-time adapter-setup verification	162 s

A byte-flipped proof artifact is rejected
(`proof bytes failed native sigma-v4 verification`), as are tampered
statements, transcripts, manifests, and verification keys (covered by the
test suites).

Range engines

Per-invocation proofs default to the sumcheck-based LogUp range engine
(microseconds of field work per range entry). ZKLORA_RANGE_ENGINE=bulletproofs
opts into Bulletproofs instead: proofs shrink ~5-8× at ~5-10× slower proving
(still 50-100× faster than v3). Both engines prove the identical exact
intervals under the same discrete-log + Fiat-Shamir assumptions, and the
verifier accepts either.

…s with adapter size

Replace the per-invocation halo2 circuit (which re-hashed and re-range-checked every adapter weight inside every proof) with a commit-and-prove protocol over ristretto255:

- Adapter setup, once per manifest: salted Pedersen row commitments to A and B, per-weight commitments, an aggregated Bulletproofs range proof pinning every weight to the exact [-value_bound, value_bound] interval, and a Schnorr link proof tying row and value commitments together. The pinned adapter commitment string is the SHA-256 of the deterministic commitment core.
- Invocation proofs: Pedersen commitments to the rounding quotients and remainders of the exact three-stage quantized pipeline, Fiat-Shamir random projections (Schwartz-Zippel) of each matrix equation onto scalar equations proven by generalized Schnorr plus one rank-sized quadratic inner-product sigma protocol, and aggregated Bulletproofs for the exact remainder and quotient intervals. Per-proof work is O(in + rank + out) group operations.

Statement semantics are unchanged: same canonical half-up rounding, same exact remainder intervals, same value/intermediate bounds, same transcript binding and verifier trust boundary. Assumptions stay in the same class (discrete log plus Fiat-Shamir; halo2-IPA was already DLOG-based) and adapter hiding improves: commitments are perfectly hiding and salted where the v3 Poseidon chain was unsalted. v3 halo2 artifacts (schema 2) still verify through the legacy path, and the halo2 backend remains in the crate.

Measured on the 4-core benchmark host (same harness as the v3 results):

- end-to-end 16x2x16 x8 invocations: prove 71.4s -> 0.84s, verify 3.4s -> 0.26s
- end-to-end 32x4x32 x6: prove 107.4s -> 1.14s
- single proof 768x2x256: prove 81s warm / 429s cold -> 1.0s (no keygen, no warm/cold split), verify 1.3s -> 0.14s; proving memory GBs -> MBs
- 768x4x768 and 768x4x2304: previously out of memory on a 15GB host, now seconds per proof
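
The random-projection step named under "Invocation proofs" can be illustrated in isolation. The Python sketch below checks a matrix-vector equation y = M·x by projecting both sides onto a challenge vector derived by hashing the statement. It works in the clear over the ristretto255 scalar field, whereas the protocol proves the projected scalar equation over Pedersen commitments. The challenge derivation (SHA-256, powers of a single scalar) and all names are assumptions for illustration, not the crate's transcript.

```python
import hashlib

# Order of the ristretto255 group, i.e. the modulus of its scalar field.
L = 2**252 + 27742317777372353535851937790883648493


def challenge(*parts: bytes) -> int:
    """Fiat-Shamir style challenge: hash length-prefixed parts to a scalar."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return int.from_bytes(h.digest(), "little") % L


def encode(values) -> bytes:
    """Fixed-width encoding of field elements (negatives reduced mod L)."""
    return b"".join((v % L).to_bytes(32, "little") for v in values)


def projected_check(M, x, y) -> bool:
    """Check y == M x through one random linear combination of the rows."""
    rho = challenge(encode(v for row in M for v in row), encode(x), encode(y))
    powers = [pow(rho, i, L) for i in range(len(M))]
    # c = powers^T * M: one entry per column, so the check costs O(rows + cols)
    # once c is known.
    c = [sum(powers[i] * M[i][j] for i in range(len(M))) % L
         for j in range(len(x))]
    lhs = sum(cj * xj for cj, xj in zip(c, x)) % L
    rhs = sum(pi * yi for pi, yi in zip(powers, y)) % L
    return lhs == rhs
```

If y differs from M·x in any coordinate, the difference is a nonzero polynomial in the challenge of degree below the row count, so by Schwartz-Zippel it vanishes for at most that many of the roughly 2^252 possible challenges.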

Replace per-invocation Bulletproofs range proofs with a sumcheck-based LogUp lookup argument as the default range engine. Every committed quotient and remainder is decomposed into 8-bit digits inside one vector Pedersen commitment; a LogUp identity over a 256-entry table is proven by two sumchecks whose round polynomial coefficients are sent as Pedersen commitments, never in the clear. All sumcheck verifier relations (round consistency, claim chaining, final evaluations, MLE openings, digit recomposition) are linear over committed scalars and fold into a single generalized Schnorr proof; the one bilinear final-evaluation check uses a product sigma protocol.

Zero knowledge is structural (only hiding commitments and uniform responses are sent) and soundness is standard sumcheck + Schwartz-Zippel + Pedersen binding under the same discrete-log + Fiat-Shamir assumptions.

ZKLORA_RANGE_ENGINE=bulletproofs opts back into compact proofs; the adapter setup keeps Bulletproofs (one-time, size-sensitive); the verifier accepts both engines.

Also: chunk-parallel MSMs, precomputed blinding-generator table, indicator-gated padding so pad coordinates are exact zeros and cost nothing, parallel sumcheck rounds and twin derivation, a thread-local ChaCha CSPRNG for blinding generation, and lock-discipline fixes (never hold a cache lock across rayon work: a worker waiting on stolen subtasks can steal a job that blocks on the same lock and deadlock the pool -- this also fixes a latent hazard in the halo2 key caches).

Measured on the 4-core benchmark host (steady state; v4 has no keygen so first proofs cost the same):

- 768x2x256: prove 81 s warm / 429 s cold -> 97 ms (835x / 4423x), verify 1.3 s -> 101 ms
- 16x2x16: prove 7.0 s -> 21 ms (333x); 64x4x64: 10.8 s -> 35 ms (309x)
- 768x4x768 and 768x4x2304 (OOM-infeasible under v3): 183 ms and 470 ms
- end-to-end pipeline: 16x2x16 x8 71.4 s -> 0.098 s (729x); 32x4x32 x6 107.4 s -> 0.097 s (1107x)
- proving memory: gigabytes -> tens of megabytes
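
The identity behind the LogUp range check can be shown in the clear. In one common formulation, digits a_1..a_n all lie in a table T exactly when sum_i 1/(alpha - a_i) equals sum_{t in T} m_t/(alpha - t) as rational functions in alpha, where m_t counts how often t occurs; the verifier checks this at a random alpha. The Python sketch below evaluates both sides over the ristretto255 scalar field for 8-bit digits. It leaves out the commitments, sumchecks, and zero-knowledge machinery described in the commit message, and the sign convention and function names are assumptions.

```python
from collections import Counter

# Order of the ristretto255 group, i.e. the modulus of its scalar field.
L = 2**252 + 27742317777372353535851937790883648493
TABLE = range(256)  # the 256-entry table of valid 8-bit digits


def to_digits(value: int, n_digits: int) -> list:
    """Little-endian base-256 digits of a non-negative value."""
    return [(value >> (8 * k)) & 0xFF for k in range(n_digits)]


def recompose(digits) -> int:
    """Inverse of to_digits: sum of digit_k * 256^k."""
    return sum(d << (8 * k) for k, d in enumerate(digits))


def logup_sides(digits, alpha: int):
    """Evaluate both sides of the LogUp identity at the point alpha."""
    mult = Counter(digits)  # multiplicity of each digit value
    lhs = sum(pow((alpha - d) % L, -1, L) for d in digits) % L
    rhs = sum(mult[t] * pow((alpha - t) % L, -1, L)
              for t in TABLE if mult[t]) % L
    return lhs, rhs
```

A value that fits in n digits recomposes exactly and both sides agree, while an entry outside the table (for example 300) leaves an unmatched 1/(alpha - 300) term on the left. Bounding a value by 2^(8n) then reduces to n table lookups plus one linear recomposition check.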

Full README quickstart flow validated on distilgpt2 with the ng0-k1/distilgpt2-finetuned-es adapter (six rank-16 768x2304 c_attn modules): 60 invocation proofs generated in 62s and verified in 162s on a 4-core host, with tampered artifacts rejected. Under v3 a single module of this shape was estimated at >200GB of proving material.

The prover recomputed the adapter commitment (serde + SHA-256 over a multi-MB core) on every invocation proof, and batch verification re-parsed the same setup JSON for every artifact of a module. Both are now cached by content hash, keeping per-proof costs flat in adapter size on both sides.
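
A content-hash cache of this kind can be sketched in a few lines of Python. This is an illustration of the idea for the verifier-side JSON parse only; the cache layout, the choice of SHA-256 as key, and every name here are assumptions, and the crate's actual caches (including their locking) may look different.

```python
import hashlib
import json

_SETUP_CACHE = {}          # content hash -> parsed setup object
STATS = {"parses": 0}      # instrumentation for this example only


def parse_setup(raw: bytes):
    """Parse setup JSON once per distinct byte string."""
    key = hashlib.sha256(raw).digest()
    if key not in _SETUP_CACHE:
        STATS["parses"] += 1
        _SETUP_CACHE[key] = json.loads(raw)
    return _SETUP_CACHE[key]
```

Every artifact of a module that carries the same setup bytes then shares one parsed object, so the parse cost is paid once per adapter rather than once per proof.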

Hiding (high): blinding factors for adapter row/weight commitments were derived from (salt, domain, index) only, so two same-shaped adapters from one contributor shared blindings at every index and commitment differences C1_i - C2_i = (w1_i - w2_i)*G leaked exact weight differences from public manifests. Blindings are now derived under a per-adapter key (keyed BLAKE3 of the full canonical adapter payload under the salt), so distinct adapters never share blindings; a regression test pins this.

Salt handling (medium): adapter_salt() is now keyed by the resolved salt file path instead of a single process global (a changed ZKLORA_ADAPTER_SALT_FILE takes effect immediately), and file creation is atomic (O_EXCL + temp-file rename) so concurrent first-touch processes converge on one salt. LoRAServer warns instead of silently sharing when the env var already points at another server's salt file.

Legacy compatibility (medium): the claim that v3 halo2 artifacts (schema 2, backend zklora-halo2-v3 -- the id main actually emits) still verify is now backed by a test that reconstructs artifacts byte-for-byte as main wrote them and runs them through verify_artifacts, including tamper rejection.

Robustness: duplicate invocation records are rejected before parallel proof generation can race on artifact paths; invocation-proof deserialization is allocation-bounded by input size; explicit dimension caps stop hostile statements from driving unbounded allocations; the LogUp engine absorbs the entry commitments into its own transcript (defense in depth for the recomposition challenge); verifier rejects nonzero pad responses, making the pad-coordinate invariant checked rather than conventional; limb_plan now enforces the documented at-most-+1 multi-limb slack instead of merely claiming it, with the general slack formula documented.

Also fixes a stale sumcheck comment and documents the LoRAServer last_* snapshot-slot locking invariant.
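
The hiding issue can be reproduced with a toy Pedersen commitment. The Python sketch below uses the multiplicative group modulo the Mersenne prime 2^127 - 1 with arbitrary bases, which is not secure and only stands in for ristretto255. Keyed BLAKE2b from the standard library stands in for keyed BLAKE3, and the derivation layout and all names are assumptions for illustration.

```python
import hashlib

P = 2**127 - 1   # toy modulus (a Mersenne prime); insecure, illustration only
G, H = 3, 5      # toy bases; a real scheme needs independent generators


def commit(w: int, r: int) -> int:
    """Toy Pedersen commitment G^w * H^r mod P."""
    return pow(G, w, P) * pow(H, r, P) % P


def blinding(key: bytes, domain: bytes, index: int) -> int:
    """Derive a blinding exponent from a key, a domain tag, and an index."""
    msg = domain + index.to_bytes(8, "little")
    digest = hashlib.blake2b(msg, key=key, digest_size=32).digest()
    return int.from_bytes(digest, "little") % (P - 1)


def adapter_key(salt: bytes, payload: bytes) -> bytes:
    """Per-adapter key: keyed hash of the adapter payload under the salt."""
    return hashlib.blake2b(payload, key=salt, digest_size=32).digest()


salt = b"contributor-salt"
w1, w2 = 1200, 1187  # the weight at one index in two same-shaped adapters

# Old derivation: the blinding depends only on (salt, domain, index),
# so both adapters reuse it and it cancels in the quotient.
r = blinding(salt, b"weight", 0)
leak = commit(w1, r) * pow(commit(w2, r), -1, P) % P
recovered = next(d for d in range(64) if pow(G, d, P) == leak)
print("recovered weight difference:", recovered)

# New derivation: the key depends on the adapter payload, so the two
# adapters get different blindings at the same index.
r1 = blinding(adapter_key(salt, b"adapter-1 payload"), b"weight", 0)
r2 = blinding(adapter_key(salt, b"adapter-2 payload"), b"weight", 0)
print("blindings differ:", r1 != r2)
```

In multiplicative notation the leaked quantity is G^(w1 - w2), the analogue of (w1_i - w2_i)*G above; because quantized weight differences are small, a short brute-force search recovers them.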

BidhanRoy added 6 commits June 11, 2026 04:15

Apply rustfmt to the sigma-v4 backend modules

f36b08c

BidhanRoy wants to merge 6 commits into main from sigma-v4-logup-backend
