Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
Apologies, this is a bit of a fuzzy one right now, but I thought I'd report it anyway.
Context:
We're using Iceberg with AWS Glue and AWS S3 as storage. In S3 there are, roughly speaking, 3 kinds of files (metadata, manifests, and data files). The first one read when loading a table via `catalog.load_table()` is the metadata file. The metadata file contains information on all current snapshots and schema versions of the table. py-iceberg seems to load these completely into memory.
Issue:
As we worked on the Iceberg table, a lot of snapshots were created over time, and with that a lot of schema versions. This caused the latest metadata file to grow to ~10 MB gzip-compressed (or ~250 MB of uncompressed JSON). When we load this table via `catalog.load_table()`, it consumes ~4 GB of memory (total usage of the Python process, measured with memray). That is a lot, especially since we only need the latest snapshot and its schema version (which is probably true for most users).
Semi-Workaround:
One could try to expire some snapshots, e.g. via Spark's [`expire_snapshots` procedure](https://iceberg.apache.org/docs/1.10.0/spark-procedures/#expire_snapshots), but it will not get rid of the old / unused schemas unless you set `clean_expired_metadata` as well (which is only supported since 1.10.x, so relatively new).
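For reference, the workaround looks roughly like this from Spark (catalog name, table name, and timestamp are placeholders; `clean_expired_metadata` is only honored on Iceberg 1.10+):

```python
# Sketch of the expire_snapshots workaround. "glue_catalog", "db.events",
# and the timestamp are placeholders for your own setup.
stmt = (
    "CALL glue_catalog.system.expire_snapshots("
    "table => 'db.events', "
    "older_than => TIMESTAMP '2024-01-01 00:00:00', "
    "clean_expired_metadata => true)"
)
# spark.sql(stmt)  # run against a SparkSession with the Iceberg extensions enabled
print(stmt)
```

Without `clean_expired_metadata => true`, the old schemas stay in the metadata file even after their snapshots are expired.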
(Preliminary) Root-Cause:
I believe the issue is that we leverage Pydantic's `model_validate_json` in `pyiceberg/table/metadata.py`:

```python
return TableMetadataWrapper.model_validate_json(data).root
```

which loads the whole JSON into memory, and then we seem to keep the full `TableMetadata` object around.
Suggestion:
Would it make sense to not parse the JSON fully into memory and instead load the needed snapshots and schemas lazily / on demand? (It would also be fine if that were a configurable option of `catalog.load_table()`.)
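To illustrate the idea, here is a minimal user-side sketch (not pyiceberg's actual parsing code) that prunes a table-metadata document down to the current snapshot and schema before any heavyweight validation. The field names follow the Iceberg table-metadata spec:

```python
import json

def prune_metadata(raw: bytes) -> dict:
    """Keep only the current snapshot and schema from a table-metadata JSON.

    Illustration only: field names follow the Iceberg table-metadata spec,
    but this is not how pyiceberg parses metadata today.
    """
    meta = json.loads(raw)
    meta["snapshots"] = [
        s for s in meta.get("snapshots", [])
        if s.get("snapshot-id") == meta.get("current-snapshot-id")
    ]
    meta["schemas"] = [
        s for s in meta.get("schemas", [])
        if s.get("schema-id") == meta.get("current-schema-id")
    ]
    # The logs are the other unbounded lists in large metadata files.
    meta["snapshot-log"] = []
    meta["metadata-log"] = []
    return meta

# Toy metadata with two snapshots and two schemas:
sample = json.dumps({
    "current-snapshot-id": 2,
    "snapshots": [{"snapshot-id": 1}, {"snapshot-id": 2}],
    "current-schema-id": 1,
    "schemas": [{"schema-id": 0}, {"schema-id": 1}],
    "snapshot-log": [{"snapshot-id": 1}, {"snapshot-id": 2}],
    "metadata-log": [],
}).encode()

pruned = prune_metadata(sample)
print(len(pruned["snapshots"]), len(pruned["schemas"]))  # 1 1
```

A real fix inside pyiceberg would presumably stream-parse or defer validation rather than pre-prune a fully loaded dict, but the memory win comes from the same observation: only the current snapshot and schema are needed for most reads.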
Remark:
Obviously we could blame this on an unmaintained Iceberg table, but I think it would be good for the pyiceberg lib to be robust against such scenarios, which is why I opened the issue.
Willingness to contribute