fix: serialize version metadata through JNI and correct row-ID lookup#6465
fix: serialize version metadata through JNI and correct row-ID lookup#6465ivscheianu wants to merge 19 commits intolance-format:mainfrom
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
| let total_slots = stats.max - stats.min + 1; | ||
|
|
||
| let range_with_holes = 24 + 4 * n_holes as usize; | ||
| let range_with_holes = 24usize.saturating_add(4usize.saturating_mul(n_holes as usize)); |
There was a problem hiding this comment.
created #6515 for a long-term fix, will try to take care of the impl in a separate PR
| } | ||
|
|
||
| impl IntoJava for &RowDatasetVersionMeta { | ||
| fn into_java<'a>(self, env: &mut JNIEnv<'a>) -> Result<JObject<'a>> { |
There was a problem hiding this comment.
Nit pick... Let's use 'local for the lifetime, similar to the other Fragment.into_java().
There was a problem hiding this comment.
'a is actually the dominant convention across all the other IntoJava impls in this file (DataFile, DeletionFile, RowIdMeta, FragmentMergeResult, etc.). Fragment is actually the outlier. My main concern was about consistency, happy to change it if you think it's the appropriate move
|
|
||
| /// Fallback version for rows whose original creation version cannot be determined. | ||
| /// Version 1 is the initial dataset version in the Lance format. | ||
| const UNKNOWN_CREATED_AT_VERSION: u64 = 1; |
There was a problem hiding this comment.
I would suggest to rename it to DEFAULT_VERSION.
| const UNKNOWN_CREATED_AT_VERSION: u64 = 1; | |
| const DEFAULT_VERSION: u64 = 1; |
There was a problem hiding this comment.
I'd prefer to keep UNKNOWN_CREATED_AT_VERSION because the name captures why this value is used (fallback when the original creation version can't be determined), not just what it is. DEFAULT_VERSION would be a little ambiguous at the call sites since a reader would have to trace back to the constant definition to understand the intent
| frag_indices.sort_by_key(|&i| existing_fragments[i].id); | ||
| for &i in &frag_indices { |
There was a problem hiding this comment.
I see that we do to iterations over the existing_fragments. Can this be improved?
There was a problem hiding this comment.
updated, thanks!
| None => None, | ||
| }; | ||
|
|
||
| if let Some(row_ids) = row_ids { |
There was a problem hiding this comment.
I think this should be moved in the match above. This will also make code more readable and move the else branch in the None case.
There was a problem hiding this comment.
Folding the if let Some(row_ids) into the Inline arm would require duplicating the fallback build_version_meta logic across three places: the Inline/Err sub-arm, the External arm, and the None arm. The current structure keeps that fallback in one else branch, which is the only reason the intermediate row_ids: Option<_> variable exists. Agreed it's not the most elegant shape, but I think it's less duplication than the alternative
|
|
||
| fragment.last_updated_at_version_meta = build_version_meta(fragment, new_version); | ||
| } else { | ||
| let version_meta = build_version_meta(fragment, new_version); |
There was a problem hiding this comment.
This can be moved in the None case.
| let inherited = current_manifest | ||
| .map(|m| m.uses_stable_row_ids()) | ||
| .unwrap_or(false); | ||
| let stable_row_ids = config.use_stable_row_ids || inherited; |
There was a problem hiding this comment.
Nitpick... the name of the variable should let us understand that it is a boolean, so use_stable_row_ids would be better.
| // build_manifest may have already set FLAG_STABLE_ROW_IDS on the manifest. | ||
| // Preserve it here so this second apply_feature_flags call does not clear it | ||
| // when config.use_stable_row_ids is false (the ManifestWriteConfig default). | ||
| let stable_row_ids = config.use_stable_row_ids || manifest.uses_stable_row_ids(); |
There was a problem hiding this comment.
Ditto... use_stable_row_ids to show that is a boolean.
Fixes #6464
Summary
Two bugs in the stable row ID + CDF version tracking pipeline caused
_row_created_at_versionto report incorrect values after updates.Bug 1: JNI serialization gap for version metadata
FragmentMetadataround-trips through JNI (Rust -> Java -> Rust) lostcreated_at_version_metaandlast_updated_at_version_meta. The JNI layer serializedRowIdMetabut hard-codedNonefor both version metadata fields on the Java->Rust path, so any version information attached by lance-core was silently dropped when the connector returned fragments.Fix: Add
VersionMeta.java(mirrorsRowIdMeta), extendFragmentMetadatawith two newVersionMetafields, and updatelance-jni/src/fragment.rsto serialize/deserializeRowDatasetVersionMetavia JSON through the new Java type. Also extracted a genericextract_nullable_field<T>()helper to reduce duplicated nullable-getter boilerplate.Bug 2: Incorrect row-ID lookup in
Operation::UpdateWhen preserving
created_atversions for updated rows, the code assumedrow_id >> 32yields the original fragment ID. This is only true for unstable row addresses. Stable row IDs are sequential integers unrelated tofragment layout. Any dataset with stable row IDs would look up the wrong fragment and fall back to a default version of 1.
Fix: Build a reverse index (stable row ID -> (fragment ref, offset)) from each existing fragment's
RowIdMetasequence, then look up each new fragment's row IDs in this index to find the correct original fragment and row offset.Additional improvements
RowIdMeta.fromRowIds(long[])— static factory that encodes stable row IDs into the inline protobuf/JSON format expected by lance-core, centralizing logic that was previously hand-rolled in the connector.Companion PR
This PR is paired with lance-format/lance-spark#407 which implements the native UPDATE path in the Spark connector: capturing
_rowidduringupdate(), attachingRowIdMetato new fragments, and consolidating deletion handling on the driver.