ds4 / DGX Spark / MTP and Performance Improvements #244

Hello Salvatore, I've been exploring `ds4` on a DGX Spark and have two changes that might help it run better. Both are CUDA-only and don't touch Metal or agent paths.

Please excuse my Claude writeup:

### Result (ds4-bench standard sweep, promessi_sposi.txt, gen=128)

| ctx   | stock main | this work, plain | this work, --mtp |
| ----- | ---------: | ---------------: | ---------------: |
| 2048  | ~13.9      | 14.24            | **16.13**        |
| 16384 | ~13.5      | 13.97            | **15.81**        |

### The two changes (stacked, rebased on current main `f91c12b`)

1. **DGX Spark / GB10 backend support: HBM-resident model.** Budgeted HBM cache (cap + MoE filter), a real GPU argmax kernel (replacing an indexer-topk-at-k=1 workaround), `cudaHostRegisterReadOnly` dropped (unsupported on GB10), and a couple of small-N decode kernel fusions. Plain decode only; no behavior change off Spark. (+234/-92)
   → https://github.com/TrevorS/ds4/pull/13

2. **MTP combined-forward speculative decode.** Folds eval(first_token)+verify into one batched-N=2 forward; a Q8 share-weight matmul kernel that's bit-equal to the N=1 path by construction; `--mtp-draft` default 1→2. Strict mode (`--quality` / `DS4_MTP_STRICT`) falls back to the canonical path for byte-equality with plain decode. (+750/-124)
   → https://github.com/TrevorS/ds4/pull/14
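
For readers less familiar with the verify step, here is a minimal C sketch of generic greedy speculative verification. It is not the code from either PR: `argmax_f32` and `verify_drafts` are hypothetical names, the logits layout (one row per position, all from a single batched forward) is an assumption, and the combined-forward trick that folds eval(first_token) into the same batch is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the largest logit; ties resolve to the lowest index. */
static size_t argmax_f32(const float *logits, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/*
 * Greedy verification of n_draft proposed tokens (illustrative only).
 *
 * target_logits holds (n_draft + 1) rows of n_vocab logits produced by one
 * batched forward pass: row i is the target model's prediction at position i,
 * given the accepted context plus draft[0..i-1].
 *
 * Writes the tokens to emit into out (capacity n_draft + 1) and returns how
 * many were written: every accepted draft token, followed by one token chosen
 * by the target itself (the correction on a mismatch, or a bonus token when
 * all drafts were accepted).
 */
static size_t verify_drafts(const int *draft, size_t n_draft,
                            const float *target_logits, size_t n_vocab,
                            int *out) {
    size_t n_out = 0;
    for (size_t i = 0; i <= n_draft; i++) {
        int tok = (int)argmax_f32(target_logits + i * n_vocab, n_vocab);
        out[n_out++] = tok;
        /* Stop after the bonus token, or at the first disagreement. */
        if (i == n_draft || tok != draft[i]) break;
    }
    return n_out;
}
```

Because every emitted token is the target model's own argmax, the output of this scheme matches plain greedy decode whenever the batched logits equal the single-token logits, which is why a bit-equal batched kernel matters for the strict mode described above.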

### Test status

Both PRs build cleanly (`make` and `make cpu`) and pass `--long-context`, `--tool-call-quality`, `--server`, `--metal-kernels`. Two `ds4_test` checks fail, but both fail identically on stock `main` (`f91c12b`), so these changes do not introduce them. I root-caused both:

- **`--logprob-vectors short_code_completion`**: the only divergence across the 4 steps is the case of the markdown code-fence tag. The full-precision official API emits `` ```c `` and the IQ2_XXS 2-bit local model emits `` ```C ``; the code itself (`return snprintf`) is byte-identical, so this is a quantization near-tie.

- **`--metal-tensor-equivalence`**: long-context only. The MoE routed-expert down-projection accumulates via float `atomicAdd` at `n_tokens >= 128` (`use_atomic_down`); atomicAdd order is scheduling-dependent, so two runs of the same config drift at ulp scale and occasionally flip a greedy argmax. `DS4_CUDA_MOE_NO_ATOMIC_DOWN=1` makes it bit-exact, confirming the cause. Flagging in case the non-determinism is of interest — it's orthogonal to these PRs.
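
The order dependence behind the second failure can be reproduced on the CPU. The sketch below is a plain C illustration, not ds4 code: `sum_forward`, `sum_reverse`, and `greedy_pick` are hypothetical helpers, and the input values are deliberately exaggerated so the effect is visible in a three-element sum rather than at ULP scale.

```c
#include <assert.h>
#include <stddef.h>

/* Sum left to right, rounding to float after every addition. */
static float sum_forward(const float *x, size_t n) {
    volatile float acc = 0.0f; /* volatile forces a float store per step */
    for (size_t i = 0; i < n; i++) acc = acc + x[i];
    return acc;
}

/* Same values, opposite order: stands in for a different atomicAdd schedule. */
static float sum_reverse(const float *x, size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; i--) acc = acc + x[i - 1];
    return acc;
}

/* Greedy pick: index of the largest logit, lowest index wins ties. */
static size_t greedy_pick(const float *logits, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}
```

With the inputs `{1e8f, -1e8f, 1.0f}`, the forward sum is `1.0f` while the reverse sum is `0.0f`, because `1.0f + -1e8f` rounds back to `-1e8f` in single precision. Feeding each result in as the first of two logits next to a fixed `0.5f` makes `greedy_pick` return index 0 in one order and index 1 in the other, which is the same argmax flip in miniature.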

Ok it's me again --

Would you like this upstreamed? If so, how would you prefer it? If not, that's all good too. Thanks!

