fix: resolve SparseDistribution crash on zero/NaN probability mass#83

Open
Jonathangadeaharder wants to merge 1 commit into youssofal:main from Jonathangadeaharder:fix-sparse-distribution-crash
@Jonathangadeaharder Jonathangadeaharder commented May 24, 2026

Resolves #82

Changes

  1. Fallback in SparseDistribution Constructor: Made SparseDistribution.__post_init__ fall back to a valid one-hot distribution on the first available token (greedy choice) if the sum of the probabilities is not finite or <= 0, rather than raising a crash-inducing ValueError.
  2. Robust residual_distribution: Added not np.isfinite(total) checks to the fallback paths in residual_distribution (sparse, dense, and non-sparse branches) to prevent NaN values from bypassing target fallbacks.
  3. Unit Tests: Added test cases in tests/test_sampling.py verifying fallback recovery on zero/NaN probability mass, as well as a mock test for residual_distribution NaN robustness.
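The constructor fallback in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class name `SparseDistributionSketch`, the field names `token_ids` and `probs`, and the normalization step are assumptions made so the example is self-contained.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SparseDistributionSketch:
    """Hypothetical stand-in for SparseDistribution; field names are assumed."""

    token_ids: np.ndarray
    probs: np.ndarray

    def __post_init__(self) -> None:
        token_ids = np.asarray(self.token_ids, dtype=np.int64).reshape(-1)
        probs = np.asarray(self.probs, dtype=np.float64).reshape(-1)
        total = float(probs.sum())
        if not np.isfinite(total) or total <= 0:
            # Degenerate mass: fall back to a one-hot on the first token,
            # or on token 0 if no tokens were supplied, instead of raising.
            token_ids = token_ids[:1] if token_ids.size > 0 else np.array([0], dtype=np.int64)
            probs = np.array([1.0], dtype=np.float64)
            total = 1.0
        self.token_ids = token_ids
        self.probs = probs / total


# Zero mass, NaN mass, and a healthy input (assumed example values).
zero_mass = SparseDistributionSketch(np.array([7, 3]), np.array([0.0, 0.0]))
nan_mass = SparseDistributionSketch(np.array([5, 9]), np.array([np.nan, 0.5]))
healthy = SparseDistributionSketch(np.array([1, 2]), np.array([1.0, 3.0]))
```

In this sketch both degenerate inputs collapse to a single token with probability 1.0, while the healthy input is only normalized. Tests along the lines of item 3 would assert exactly that.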

All tests pass successfully.

Copilot AI review requested due to automatic review settings May 24, 2026 11:41

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@Jonathangadeaharder Jonathangadeaharder force-pushed the fix-sparse-distribution-crash branch from 4c2dc4f to e4cf880 Compare May 24, 2026 11:44

@xlinbsd xlinbsd left a comment

Reproduced on v0.3.7; the manual patch from PR #83 works.

Environment: M5 Pro 64 GB, macOS 26.5, Youssofal/Qwen3.6-27B-MTPLX-Optimized-Quality, --profile sustained, --reasoning off.

After a few minutes of agentic sessions the server crashes with ValueError: SparseDistribution probabilities must have positive mass, and the client gets Connection reset by peer (os error 54).

Applied the patch from PR #83 manually to site-packages/mtplx/sampling.py — two changes:

1. SparseDistribution.__post_init__ — fallback instead of raise:

# before
if not np.isfinite(total) or total <= 0:
    raise ValueError("SparseDistribution probabilities must have positive mass")

# after (mirrors fast_sampling.py which already has this fix)
if not np.isfinite(total) or total <= 0:
    token_ids = token_ids[:1] if token_ids.size > 0 else np.array([0], dtype=np.int64)
    probs = np.array([1.0], dtype=np.float64)
    total = 1.0

2. residual_distribution — add NaN guard to all 4 if total <= 0: checks:

if total <= 0 or not np.isfinite(total):

Sessions are now stable. Note that fast_sampling.py already has the equivalent fallback at lines 98-100, 151-153 and 208-211 — sampling.py was just missing it.
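For context on why the extra check in change 2 matters: every ordered comparison with NaN evaluates to False, so a bare `total <= 0` guard does not catch a NaN total. The sketch below illustrates this; the helper names `needs_fallback_old` and `needs_fallback_new` are hypothetical and exist only for the example.

```python
import numpy as np


def needs_fallback_old(total: float) -> bool:
    # Original guard: NaN <= 0 is False, so a NaN total slips through.
    return bool(total <= 0)


def needs_fallback_new(total: float) -> bool:
    # Patched guard: also catches NaN and infinite totals.
    return bool(total <= 0 or not np.isfinite(total))


totals = [0.0, -1.0, 0.5, float("nan"), float("inf")]
old = [needs_fallback_old(t) for t in totals]
new = [needs_fallback_new(t) for t in totals]

for t, o, n in zip(totals, old, new):
    print(f"total={t!r:>5}  old guard={o}  new guard={n}")
```

Only the last two entries differ: the old guard lets NaN and infinity pass, whereas the patched guard routes both to the fallback path.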

Please merge PR #83.

Development

Successfully merging this pull request may close these issues.

ValueError: SparseDistribution probabilities must have positive mass during speculative drafting
