
Append method [in memory vs disk size] #1994

@Vasampa23

Description


Apache Iceberg version

0.8.1

Please describe the bug 🐞

While exploring the append() method, I found that the execution flow is roughly: append() -> _dataframe_to_data_files() -> bin_pack_arrow_table()

  1. In bin_pack_arrow_table, the table's size is measured with .nbytes, i.e. its in-memory size.
  2. The write.target-file-size-bytes property is therefore compared against the in-memory size rather than the on-disk size.
  3. When I loaded a 300MB Parquet file as a PyArrow table, it occupied 21637095875 bytes (~20GB) in memory. After ingestion there were 50 Parquet files in MinIO (~20GB / 512MB, roughly 50 files).

It would be better to target the on-disk (serialized, compressed) size of the files to be written rather than the in-memory size. Since Parquet files are typically much smaller than their in-memory representation, sizing bins by memory can result in many small files.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
