Apache Iceberg version
0.8.1
Please describe the bug 🐞
During my initial exploration of the append() method, I found that the execution flow is roughly: append() -> _dataframe_to_data_files() -> bin_pack_arrow_table().
- In bin_pack_arrow_table(), chunk sizes are measured with .nbytes, i.e. the in-memory Arrow size.
- The write.target-file-size-bytes property is therefore compared against the in-memory size rather than the on-disk (compressed Parquet) size.
- For example, I loaded a 300 MB Parquet file as a PyArrow table; it occupied 21637095875 bytes (~20 GB) in memory. After ingestion, the table had 50 Parquet files in MinIO (20 GB / 512 MB ≈ roughly 50 files).
It would be better to bin pack by the expected on-disk size of the files to be written rather than the in-memory size, since using the in-memory size can result in many small files.
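To illustrate the effect, here is a minimal, self-contained sketch of greedy bin packing (not the actual PyIceberg implementation). It assumes a ~70x ratio between Arrow in-memory size and compressed Parquet on-disk size, roughly matching the 300 MB file expanding to ~20 GB above; the chunk sizes and helper name are hypothetical.

```python
TARGET_FILE_SIZE = 512 * 1024 * 1024  # write.target-file-size-bytes default (512 MB)

def bin_pack(chunk_sizes, target):
    """Greedy bin packing: start a new bin whenever adding the next
    chunk would push the current bin past the target size."""
    bins, current, current_size = [], [], 0
    for size in chunk_sizes:
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 320 chunks of 64 MiB each: ~20 GiB of in-memory Arrow data.
in_memory = [64 * 1024 * 1024] * 320

# Assumed compression ratio (in-memory Arrow size / on-disk Parquet size).
COMPRESSION_RATIO = 70
on_disk = [s // COMPRESSION_RATIO for s in in_memory]

files_by_memory = len(bin_pack(in_memory, TARGET_FILE_SIZE))  # many small files
files_by_disk = len(bin_pack(on_disk, TARGET_FILE_SIZE))      # few large files
print(files_by_memory, files_by_disk)
```

Under these assumptions, packing by in-memory size yields dozens of files whose on-disk size is only a few MB each, while packing by estimated on-disk size would pack everything into a single target-sized file.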
Willingness to contribute