Hello Salvatore, I've been exploring ds4 on a DGX Spark and have two changes that might help it run better. Both are CUDA-only and don't touch Metal or agent paths.
Please excuse my Claude writeup:
Result (ds4-bench standard sweep, promessi_sposi.txt, gen=128)
| ctx | stock main | this work, plain | this work, --mtp |
|---|---|---|---|
| 2048 | ~13.9 | 14.24 | 16.13 |
| 16384 | ~13.5 | 13.97 | 15.81 |
The two changes (stacked, rebased on current main f91c12b)
DGX Spark / GB10 backend support — HBM-resident model. Budgeted HBM cache (cap + MoE filter), a real GPU argmax kernel (replacing an indexer-topk-at-k=1 workaround), cudaHostRegisterReadOnly dropped (unsupported on GB10), and a couple of small-N decode kernel fuses. Plain decode only; no behavior change off Spark. (+234/-92)
→ cuda: DGX Spark / GB10 backend support — HBM-resident model TrevorS/ds4#13
MTP combined-forward speculative decode. Folds eval(first_token)+verify into one batched-N=2 forward; a Q8 share-weight matmul kernel that's bit-equal to the N=1 path by construction; --mtp-draft default 1→2. Strict mode (--quality / DS4_MTP_STRICT) falls back to the canonical path for byte-equality with plain decode. (+750/-124)
→ mtp: combined-forward speculative decode beats plain on GB10 (+2.4 t/s) (stacked on #13) TrevorS/ds4#14
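To make the verify step concrete, here is a minimal host-side sketch of greedy draft verification in plain C. It is not the code from either PR: `argmax_f32` and `count_accepted` are hypothetical names, the logits layout (one row of `n_vocab` floats per drafted position) is an assumption, and the real work happens in CUDA kernels. It only illustrates the idea that a drafted token is kept when it equals the target model's greedy argmax at that position.

```c
#include <stddef.h>

/* Index of the largest logit; ties resolve to the lowest index. */
static int argmax_f32(const float *logits, int n) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/*
 * Greedy verification of drafted tokens (illustrative layout, an assumption):
 * logits holds n_draft rows of n_vocab target-model logits, and row i is the
 * target's prediction for the position that draft[i] proposes to fill.
 * Returns how many leading draft tokens match the target's greedy choice.
 */
static int count_accepted(const float *logits, int n_vocab,
                          const int *draft, int n_draft) {
    int accepted = 0;
    while (accepted < n_draft) {
        const float *row = logits + (size_t)accepted * (size_t)n_vocab;
        if (argmax_f32(row, n_vocab) != draft[accepted]) break;
        accepted++;
    }
    return accepted;
}
```

Verification stops at the first mismatch, so with two drafted tokens the result is 0, 1, or 2 accepted tokens. Resolving ties to the lowest index keeps the choice deterministic for a fixed set of logits.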
Test status
Both build clean (make + make cpu) and pass --long-context, --tool-call-quality, --server, --metal-kernels. Two ds4_test checks fail — both fail identically on stock main (f91c12b), so they're not introduced by these changes. I root-caused both:
--logprob-vectors short_code_completion: the only divergence across the 4 steps is the markdown code-fence tag case — the full-precision official API emits ```c, the IQ2XXS 2-bit local model emits ```C; the code (return snprintf) is byte-identical. A quant near-tie.
--metal-tensor-equivalence: long-context only. The MoE routed-expert down-projection accumulates via float atomicAdd at n_tokens >= 128 (use_atomic_down); atomicAdd order is scheduling-dependent, so two runs of the same config drift at ulp scale and occasionally flip a greedy argmax. DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 makes it bit-exact, confirming the cause. Flagging in case the non-determinism is of interest — it's orthogonal to these PRs.
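The order dependence can be reproduced on the host without any GPU. The sketch below is plain C, not the MoE kernel: `sum_in_order` is a hypothetical helper, and the magnitudes are deliberately exaggerated so the effect is visible at a glance, whereas the drift described above is at ulp scale. It shows only that single-precision addition is not associative, which is why an accumulation whose order depends on scheduling can differ between runs.

```c
#include <stddef.h>

/*
 * Sum v[order[0]], v[order[1]], ... in single precision.
 * The volatile accumulator forces the running sum to be rounded to float
 * after every addition, mimicking a float accumulator updated one add at a time.
 */
static float sum_in_order(const float *v, const int *order, int n) {
    volatile float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc = acc + v[order[i]];
    }
    return acc;
}
```

For the values `{1e8f, 1.0f, -1e8f}`, summing in index order `{0, 1, 2}` returns `0.0f`, because `1e8f + 1.0f` rounds back to `1e8f` (the float spacing near 1e8 is 8). Summing in order `{0, 2, 1}` returns `1.0f`. Same inputs, different order, different result.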
Ok it's me again --
Would you like this upstreamed? If so, how would you prefer it? If not, that's all good too. Thanks!