
feat: Honor compression settings for metadata.json on write #1876

Open

emkornfield wants to merge 57 commits into apache:main from emkornfield:feat/metadata-compression

Conversation

emkornfield (Contributor) commented Nov 19, 2025:

Which issue does this PR close?

Split off from #1851

What changes are included in this PR?

This change honors the compression setting (write.metadata.compression-codec) for the metadata.json file.

Are these changes tested?

Adds a unit test to verify that files are gzipped when the setting is enabled.

BREAKING CHANGE: write_to now takes a MetadataLocation, and MetadataLocation::with_next_version now takes properties.
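
A compilable sketch of the API shape implied by that breaking change; the struct fields, the property-key handling, and the bodies are illustrative assumptions, not the crate's exact definitions:

use std::collections::HashMap;

pub struct FileIO; // stand-in for the crate's FileIO

pub struct MetadataLocation {
    table_location: String,
    version: i32,
    compressed: bool,
}

impl MetadataLocation {
    /// with_next_version now takes properties, so a changed
    /// write.metadata.compression-codec is reflected in the next path.
    pub fn with_next_version(&self, properties: &HashMap<String, String>) -> Self {
        let codec = properties
            .get("write.metadata.compression-codec")
            .map(String::as_str)
            .unwrap_or("none");
        MetadataLocation {
            table_location: self.table_location.clone(),
            version: self.version + 1,
            compressed: codec.eq_ignore_ascii_case("gzip"),
        }
    }
}

pub struct TableMetadata;

impl TableMetadata {
    /// write_to now takes the strongly typed location instead of
    /// impl AsRef<str>, so the compression-aware path cannot be bypassed.
    pub fn write_to(&self, _file_io: &FileIO, location: &MetadataLocation) {
        // Serialization and I/O plumbing omitted in this sketch.
        let _ = (location.version, location.compressed, &location.table_location);
    }
}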

emkornfield (Contributor Author):

CC @kevinjqliu @liurenjie1024

key: &str,
default: T,
) -> Result<T, anyhow::Error>
) -> crate::error::Result<T>
Contributor:

Suggested change
) -> crate::error::Result<T>
) -> Result<T>

Unnecessary fully qualified name.

Contributor Author:

done.
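
For context, the suggestion works because the crate exports a Result alias from its error module; a minimal compilable sketch of the pattern, with a stand-in error type rather than iceberg-rust's actual one:

mod error {
    // Stand-in for the crate's error type and alias.
    pub type Result<T> = std::result::Result<T, String>;
}

use error::Result;

// With the alias imported, the return type reads as in the suggestion.
fn property_as_bool(value: &str) -> Result<bool> {
    value.parse::<bool>().map_err(|e| e.to_string())
}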

/// The target file size for files.
pub write_target_file_size_bytes: usize,
/// Compression codec for metadata files (JSON)
pub metadata_compression_codec: String,
Contributor:

This should be something like Option<String>; an unset value should be None in memory.

Contributor Author:

done.
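
A minimal sketch of the revised field (struct and field names are illustrative, not the PR's exact code):

pub struct TableConfig {
    /// Compression codec for metadata files (JSON). None means the property
    /// is unset; callers resolve it to a default at use sites instead of
    /// storing the string "none" in memory.
    pub metadata_compression_codec: Option<String>,
}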


// Modify filename to add .gz before .metadata.json
let location = metadata_location.as_ref();
let new_location = if location.ends_with(".metadata.json") {
Contributor:

I don't think we should simply replace the suffix with this approach. Currently, all metadata-location generation logic lives in MetadataLocation; we should add a new method like the following and deprecate the old one:

impl MetadataLocation {
    #[deprecated(note = "Use new_with_table instead.")]
    pub fn new_with_table_location(table_location: impl ToString) -> Self {
        ...
    }

    pub fn new_with_table(table: &Table) -> Self {
        ...
    }
}

emkornfield (Contributor Author), Nov 21, 2025:

The Table option seemed a little awkward given the control flow around how it is used; instead I added a method that takes table properties. I also added a similar method for incrementing the version number.

I modified this logic a bit and kept it. My concern here is that if we only deprecate new_with_table_location, users can still write out metadata.json files in a format that would be unreadable by other readers.
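
A compilable sketch of how such a properties-taking method might derive the file name; the property key is the real one from this PR, but the helper name, version formatting, and defaulting are assumptions:

use std::collections::HashMap;

const COMPRESSION_KEY: &str = "write.metadata.compression-codec";

fn metadata_file_name(version: i32, properties: Option<&HashMap<String, String>>) -> String {
    let codec = properties
        .and_then(|p| p.get(COMPRESSION_KEY))
        .map(String::as_str)
        .unwrap_or("none");
    // Per the Iceberg convention, the codec goes before the
    // .metadata.json suffix, e.g. v2.gz.metadata.json for gzip.
    if codec.eq_ignore_ascii_case("gzip") {
        format!("v{version}.gz.metadata.json")
    } else {
        format!("v{version}.metadata.json")
    }
}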

.map_err(from_aws_sdk_error)?;
let warehouse_location = get_resp.warehouse_location().to_string();
MetadataLocation::new_with_table_location(warehouse_location).to_string()
get_resp.warehouse_location().to_string()
Contributor Author:

This is a semantic change because of the update to creation.location below; I need to investigate further what the correct thing to do here is.

Contributor:

good point, from the comment above (L451-452) it seems like the "s3tables catalog" generates the metadata file location and does not allow any modification

so maybe we should just use

        let metadata_location =
            MetadataLocation::new_with_properties(metadata_location, None)
                .to_string(); 

so as to not do any path modification before writing

Contributor Author:

I think the original code was wrong. As best I can tell from testing, warehouseLocation is actually the table location.

I've updated it to assign to table_location and then use it below (as far as I can tell, the .location field is meant to be the table location and not the metadata.json file).

Contributor:

While I think this fix is correct, is it really related to this PR? We should move it to a standalone PR.

emkornfield (Contributor Author), Feb 2, 2026:

We needed to change this anyway due to the reordering of calls, so it seemed better to fix it. I'll open a separate PR.

Contributor Author:

Opened #2115 to keep the change isolated.

kevinjqliu (Contributor) left a comment:

Generally LGTM, a few nit comments.
would be good to get a second pair of 👀


/// Creates a completely new metadata location starting at version 0,
/// with compression settings from the table's properties.
/// Only used for creating a new table. For updates, see `with_next_version`.
Contributor:

Suggested change
/// Only used for creating a new table. For updates, see `with_next_version`.
/// Only used for creating a new table. For updates, see `with_next_version_and_properties`.

since with_next_version is also deprecated

Contributor Author:

I'll wait until I understand @liurenjie1024's concerns about with_next_version_and_properties before updating this.


liurenjie1024 (Contributor) left a comment:

Thanks @emkornfield for this PR; it looks much better now but still needs some refinements.

/// Creates a new metadata location for an updated metadata file.
/// Takes table properties to determine compression settings, which may have changed
/// from the previous version.
pub fn with_next_version_and_properties(&self, properties: &HashMap<String, String>) -> Self {
Contributor:

We don't need this.

emkornfield (Contributor Author), Nov 21, 2025:

Can you clarify what you mean by "we don't need it"? I updated crates/iceberg/src/catalog/mod.rs to use it; if new properties change the compression settings, I'm not sure how this would be reflected otherwise. I might have missed it, but I don't think there is anything in the spec that prevents users from changing compression properties.

Contributor:

If the properties have changed, we should build a new instance of this MetadataLocation struct.

Contributor Author:

If the properties have changed, we should build a new instance of this MetadataLocation struct.

My main concern here is that I don't think clients of MetadataLocation would actually understand this and check for a new version of the metadata (i.e. the path of least resistance is to call with_next_version on whatever metadata they used for creation). As a solution, I've created a static factory, next_version, which combines the two steps for the only call site that exists for with_next_version. If you think it is better not to deprecate with_next_version, I can also revert this change.
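
A sketch of the static-factory shape described above; the stand-in types exist only so the snippet compiles, and the body is deliberately left as a todo:

use std::collections::HashMap;

pub struct MetadataLocation; // stand-in for the real struct
type Result<T> = std::result::Result<T, String>; // stand-in error type

impl MetadataLocation {
    /// One step instead of two: parse the previous location, bump the
    /// version, and apply the (possibly changed) compression properties,
    /// so callers cannot forget the properties half of the update.
    pub fn next_version(
        previous_location: &str,
        properties: &HashMap<String, String>,
    ) -> Result<Self> {
        let _ = (previous_location, properties);
        todo!("parse the version from previous_location, then apply the codec suffix")
    }
}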

emkornfield and others added 4 commits November 21, 2025 08:54
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
…nfield/iceberg-rust into feat/metadata-compression
emkornfield (Contributor Author):

@liurenjie1024 thanks for the review, I think this is ready for another look.


/// Compression codec for metadata files (JSON).
#[derive(Debug, PartialEq)]
pub enum MetadataCompressionCodec {
Contributor:

We already have a CompressionCodec; we should move it to the spec module and reuse it.

Contributor Author:

Note that that compression codec doesn't actually include GZIP, which means extra validation for this code path. I'm not opposed to it, but given that it will take extra logic to validate the allowed options both for this PR and for the existing puffin code (which must not allow writing GZIP), and that we also have a different concept for Avro files in #1851, it seems better to do the refactoring in a separate PR once everything is merged (conveniently reaching the rule of 3?).

Contributor:

I have concerns with this approach. OSS development is somewhat different from enterprise software development: if everyone just moves forward with a lot of divergences and leaves the cleanup to others, the codebase will become a mess. I think a better approach would be another PR that moves the compression codec enum into the spec module; then we can move on with this PR after that one is merged.

Contributor Author:

OK, thank you for the guidance.

Contributor Author:

Created #2081

Contributor Author:

Updated to use that code.

liurenjie1024 pushed a commit that referenced this pull request Jan 29, 2026

## Which issue does this PR close?

As discussed in the PR for [allowing compressed metadata.json](#1876 (comment)) writes, we want CompressionCodec in a central place so it can be used in puffin and for other compression use cases. Happy to move it to `spec/` as suggested in the original comment, but this seemed more natural.

## What changes are included in this PR?

This moves compression.rs to the top level (this seems like the best location for common code, but please let me know if there is a better place).

- It adds Gzip as a compression option and replaces the current direct use of the GzEncoder package.
- It adds validation to puffin that Gzip is not currently supported (per the spec).

## Are these changes tested?

Added unit tests to cover the additions; existing tests cover the rest.
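
An illustrative sketch of the puffin-side validation that commit describes; the LZ4 and ZSTD variants follow the puffin spec, but the enum layout, function name, and error shape here are assumptions:

#[derive(Debug, PartialEq)]
pub enum CompressionCodec {
    None,
    Lz4,
    Zstd,
    Gzip, // newly added for metadata.json; not valid for puffin blobs
}

fn check_puffin_codec(codec: &CompressionCodec) -> Result<(), String> {
    match codec {
        // The puffin spec only defines LZ4 and ZSTD blob compression,
        // so Gzip must be rejected on that code path.
        CompressionCodec::Gzip => Err("gzip is not supported for puffin blobs".to_string()),
        _ => Ok(()),
    }
}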
blackmwk (Contributor) left a comment:

Thanks @emkornfield for this PR, we are quite close!


/// Creates a new metadata location for an updated metadata file.
/// Uses compression settings from the new metadata.
pub fn next_version(
Contributor:

I'm still confused about why this method is better than self.with_next_version. self.with_next_version looks more reasonable to me, since we need the original version, which is contained in self.

emkornfield (Contributor Author), Feb 2, 2026:

The current with_next_version needs to be deprecated either way because it doesn't take the new properties (i.e. compression can change). If we don't deprecate it and just replace it, that could be viable.

If using the member function is more idiomatic, I'm happy to switch to that; IIRC @liurenjie1024 had an issue with that approach, though.

Contributor:

I think I got your point. Then a more idiomatic approach would be to add a new member function as follows:

pub fn with_new_metadata(&self, metadata: &TableMetadata) -> Self {
    ...
}

And the user is expected to call it as follows:

let meta_loc: MetadataLocation = ...;
let new_meta_loc = meta_loc.with_next_version().with_new_metadata(&metadata);

emkornfield (Contributor Author), Feb 5, 2026:

The issue I have with this API is that it requires callers to remember to call with_new_metadata, but the metadata is fundamental to correctly generating the location.

Contributor Author:

@blackmwk I've updated the APIs here rather than preserving backwards compatibility; I would appreciate your thoughts on the current approach.

@@ -468,9 +483,44 @@ impl TableMetadata {
file_io: &FileIO,
metadata_location: impl AsRef<str>,
Contributor:

I think we may need to change metadata_location's type to the MetadataLocation struct rather than impl AsRef<str>. Also, we should not infer the compression codec from the suffix; instead we should infer it from the TableMetadata itself. With this change, we could avoid a lot of validation in the method body.

Contributor Author:

I think we may need to change metadata_location's type to the MetadataLocation struct rather than impl AsRef<str>.

I was trying to avoid a breaking change here. Are there general policies I should be aware of? (For some reason I thought there was a preference not to have breaking changes.) If we don't mind breaking changes, I'm happy to make the change.

Also we should not infer compression codec from suffix, instead we should infer it from TableMetadata itself. With this change, we could avoid a lot of validation in the method body.

I don't believe we are inferring the compression codec from the suffix. The complexity here is that, since we are only deprecating with_next_version, we end up writing the wrong name under the following scenario:

  1. Metadata file 0 has no compression.
  2. The caller uses with_next_version to generate the next location.
  3. The caller updates the table to use compression = 'gz'.

Without this logic we would write out a gzipped file without the compression suffix, which would break other readers (e.g. the Java library).

I realize I actually missed handling the same transition for CompressionCodec::None => (json_data, metadata_location.as_ref().to_string()), which should check whether the suffix indicates compression and remove it.

If we choose to just delete with_next_version (or update the existing method to take TableMetadata), a lot of the complexity goes away (we no longer need to parse the suffix, although it might be nice for display purposes) and we don't need the special checks here.

So in short I think the two paths forward are:

  1. A breaking change to MetadataLocation that forces taking properties for the next version (rather than deprecating the method).
  2. Fix the match on None to handle a similar downgrade.

I think the other options introduce a risk that clients end up writing out metadata.json files that other readers will not be able to read.
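
A sketch (hypothetical helper, not code from the PR) of the suffix handling this scenario requires; when the codec changes between versions, the previous suffix must be stripped before the new one is applied, in both directions:

fn next_file_name(previous: &str, gzip: bool) -> String {
    // Strip whichever suffix the previous version carried so a codec
    // change (none -> gzip or gzip -> none) is honored in the new name.
    let stem = previous
        .strip_suffix(".gz.metadata.json")
        .or_else(|| previous.strip_suffix(".metadata.json"))
        .unwrap_or(previous);
    if gzip {
        format!("{stem}.gz.metadata.json")
    } else {
        format!("{stem}.metadata.json")
    }
}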

Contributor:

I have added comments here: https://github.com/apache/iceberg-rust/pull/1876/changes#r2756683326
I think with this approach we will be able to provide an idiomatic API.

Contributor Author:

Responded there. This code is much simpler now that we are storing the codec directly and taking TableLocation.


emkornfield (Contributor Author) commented Feb 2, 2026:

@blackmwk thanks for the review. I think we need to close on a direction for this: https://github.com/apache/iceberg-rust/pull/1876/files#r2752454670

  • I'll separate out the S3 fix into its own PR.
  • Use the new compression enum on MetadataLocation.

If we can make breaking changes then:

  1. If write_to takes a MetadataLocation, then we don't need to store the compression codec on MetadataLocation at all (we can have it take a compression codec when generating the final path). (I actually need to double-check this; there might be other places that try to reserialize the string.)
  2. We can eliminate a lot of the validation logic.

blackmwk (Contributor) commented Feb 3, 2026:

Hi, @emkornfield

I'll separate out the S3 fix into its own PR.
Use the new compression enum on MetadataLocation.

I think this is the right direction to go.

If we can make breaking changes then:

If write_to takes a MetadataLocation, then we don't need to store the compression codec on MetadataLocation at all (we can have it take a compression codec when generating the final path).
We can eliminate a lot of the validation logic.

This is the place where we have to make a breaking change, so I think it's fine.


#[test]
fn test_parse_metadata_file_compression_valid() {
use super::parse_metadata_file_compression;
Contributor Author:

Need to move these imports out and clean up this code a little bit.



Development

Successfully merging this pull request may close these issues.

[Bug] iceberg-rust does not respect compression settings for metadata & avro
