Skip to content

fix: prevent arithmetic overflow in U64Segment encoding selection for sparse/extreme row id ranges#6516

Open
ivscheianu wants to merge 6 commits intolance-format:mainfrom
ivscheianu:segment-overflow-u128-encoding-selection
Open

fix: prevent arithmetic overflow in U64Segment encoding selection for sparse/extreme row id ranges#6516
ivscheianu wants to merge 6 commits intolance-format:mainfrom
ivscheianu:segment-overflow-u128-encoding-selection

Conversation

@ivscheianu
Copy link
Copy Markdown
Contributor

@ivscheianu ivscheianu commented Apr 14, 2026

U64Segment::from_stats_and_sequence crashes when row IDs span a large range or include values near u64::MAX. Fixes #6515

There are two independent overflow classes:

  1. Cost estimation: n_holes() and sorted_sequence_sizes() compute range spans in u64/usize that wrap for large ranges, making infeasible encodings (RangeWithHoles, RangeWithBitmap) appear cheapest. The code then attempts to materialize billions of holes or allocate multi-exabyte bitmaps.

  2. Exclusive-end: All range-backed encodings construct Range<u64> with stats.max + 1 as the exclusive end. When max == u64::MAX, this overflows even for small, memory-feasible sets (e.g., [u64::MAX - 3, u64::MAX - 1, u64::MAX]).

Both classes cause process aborts in debug and OOM in release. Across JNI this kills the JVM with no recoverable exception.

Fix

n_holes()u128 return type: The total slot count max - min + 1 can be up to 2^64, which exceeds u64::MAX. Widening to u128 gives the correct value instead of wrapping.

sorted_sequence_sizes()u128 arithmetic: All cost estimates computed in u128 with saturating arithmetic, then converted via usize::try_from(...).unwrap_or(usize::MAX). Infeasible encodings saturate and always lose the min() comparison.

from_stats_and_sequence()checked_add(1) gate: exclusive_end = stats.max.checked_add(1) computed once and used as a gate for all range-backed branches. When None (i.e., max == u64::MAX), falls through to SortedArray. The bare expression stats.max + 1 no longer appears in the function.

@github-actions github-actions Bot added the bug Something isn't working label Apr 14, 2026
@ivscheianu ivscheianu marked this pull request as ready for review April 14, 2026 15:26
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 91.07143% with 10 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-table/src/rowids/segment.rs 91.07% 5 Missing and 5 partials ⚠️

📢 Thoughts on this report? Let us know!

@westonpace
Copy link
Copy Markdown
Member

when row IDs span a large range or include values near u64::MAX

Are you actually encountering this in practice? u64::MAX is huge. You would need to add 1 billion rows per second for over 500 years to get here.

Or are you triggering this in some other way besides data volume? Is there some way to manually specify the row ids? Are we using u64::MAX style sentinels in some place?

@ivscheianu
Copy link
Copy Markdown
Contributor Author

when row IDs span a large range or include values near u64::MAX

Are you actually encountering this in practice? u64::MAX is huge. You would need to add 1 billion rows per second for over 500 years to get here.

Or are you triggering this in some other way besides data volume? Is there some way to manually specify the row ids? Are we using u64::MAX style sentinels in some place?

hey @westonpace, thank you for taking a look!

Good points, let me explain how I encountered this problem. Sorry for not adding more context in the first place. I used trivial operations, talking about few rows, found it during my initial debugging for #6465.

  1. Why it was triggered on low volume operations:
  • Both transaction.rs:build_manifest and dataset.rs:write_manifest_file call apply_feature_flags. Many internal operations (index creation, compaction) invoke these with ManifestWriteConfig::default(), which has use_stable_row_ids = false. On any such write to a dataset that already had FLAG_STABLE_ROW_IDS set, the flag was silently cleared, causing subsequent writes to stop assigning stable row IDs.

Fix: inherit the flag from the existing manifest via use_stable_row_ids || manifest.uses_stable_row_ids(): https://github.com/lance-format/lance/pull/6465/changes#diff-4f6a9a13b1528c3e44b78f5847e0300debf414e4cb5bc62ff0bcf8a359f3363dR3241

  1. Silent logic error in version metadata:
  • The old code tried to recover created_at version metadata for rewritten rows by treating stable row IDs as old-style bit-packed fragment addresses:
  let orig_frag_id = row_id >> 32;
  let row_offset = (row_id & 0xFFFFFFFF) as usize;

Stable row IDs are sequential integers from next_row_id = 0, not (fragment_id << 32) | row_offset. After a RewriteRows update, rows receive new sequential IDs with no relationship to their original positional offset, so this would silently index into the wrong position in the version sequence and produce incorrect created_at metadata.
Fix: build a proper row_id -> (fragment, positional_offset) map by iterating each fragment's actual RowIdSequence. This bug is unrelated to the U64Segment overflow, stable IDs starting from 0 never produce extreme values here.

  1. U64Segment arithmetic overflow (defensive, not directly triggered now)
    n_holes() and sorted_sequence_sizes() overflow for values near u64::MAX.
let total_slots = self.max - self.min + 1;      // wraps to 0 when min=0, max=u64::MAX
let range_with_holes = 24 + 4 * n_holes as usize; // overflows when n_holes ≈ 2^63

Sequential stable row IDs starting from 0 never reach these extremes in practice, so this wasn't triggered by the minimal test. The fix (promoting intermediate arithmetic to u128, guarding max + 1 with checked_add) was found by code inspection. It's included because TOMBSTONE_ROW = u64::MAX is a live sentinel in the codebase, and the overflow would silently pick the wrong encoding variant in release builds or panic in debug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

U64Segment::sorted_sequence_sizes overflows for sparse row ID ranges

2 participants