Apache Iceberg version
version = "0.9.1"
Please describe the bug 🐞
There appears to be a memory leak in pyiceberg/avro/reader.py.
I have a long-running service that keeps crashing. I tried to replicate the issue locally, and the local reproduction shows the same behavior.
The following code creates an in-memory catalog and generates some synthetic data for ingestion into Iceberg.
from pyiceberg.catalog.memory import InMemoryCatalog
import tracemalloc
from datetime import datetime, timezone
import polars as pl


def generate_df():
    # Build 1000 rows of synthetic event data, including a struct column.
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df


# Create the catalog and a table whose schema matches the generated data.
df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)
df = pl.DataFrame()
tracemalloc.start()
for i in range(1000):
    df = generate_df()
    df.write_iceberg(table, mode="append")
    # Print the top allocation sites after every append.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)
Slowly but steadily, the memory attributed to the Avro reader in the output increases:
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B
After some more writes, the output looks like this:
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
/Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
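To quantify the growth between two outputs like the ones above rather than comparing them by eye, tracemalloc can diff two snapshots directly. The following is a minimal sketch, not part of the original reproduction: the helper name reader_growth and its default filename pattern are assumptions chosen for illustration, while Filter, filter_traces and compare_to are standard tracemalloc APIs.

```python
import tracemalloc


def reader_growth(before, after, pattern="*/pyiceberg/avro/reader.py", limit=10):
    """Return (location, size_diff_bytes, count_diff) for matching lines that grew.

    `before` and `after` are tracemalloc.Snapshot objects. `pattern` is an
    fnmatch-style filename pattern (an assumption; adjust it to your install path).
    """
    only = [tracemalloc.Filter(True, pattern)]
    diffs = after.filter_traces(only).compare_to(before.filter_traces(only), "lineno")
    grown = [d for d in diffs if d.size_diff > 0]
    return [
        (f"{d.traceback[0].filename}:{d.traceback[0].lineno}", d.size_diff, d.count_diff)
        for d in grown[:limit]
    ]


# Usage sketch, assuming tracemalloc.start() was already called:
#   baseline = tracemalloc.take_snapshot()
#   ... perform a batch of writes ...
#   for row in reader_growth(baseline, tracemalloc.take_snapshot()):
#       print(row)
```

Lines whose size_diff keeps rising across successive batches of writes are the ones retaining memory; lines that fluctuate around zero are ordinary temporary allocations.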
If we take a look at the AvroFile class, it implements the enter and exit dunder methods (__enter__ and __exit__). The enter method assigns the reader to an attribute on the instance, and it seems that the reader objects stick around afterwards instead of being released.
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
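The general Python pattern suspected here can be sketched in isolation. This is not pyiceberg code and does not confirm the root cause: FakeReader, RetainingFile and ReleasingFile are hypothetical names used only to show that an attribute set in __enter__ stays alive for as long as the owning object is referenced, unless __exit__ drops it.

```python
import gc
import weakref


class FakeReader:
    """Hypothetical stand-in for a reader object; not a pyiceberg class."""

    def __init__(self):
        self.buffer = bytearray(1024)


class RetainingFile:
    """Context manager that keeps its reader after the with block ends."""

    def __enter__(self):
        self.reader = FakeReader()
        return self

    def __exit__(self, exc_type, exc, tb):
        return None  # self.reader is still referenced by the instance


class ReleasingFile(RetainingFile):
    """Same as above, but drops the reader reference on exit."""

    def __exit__(self, exc_type, exc, tb):
        self.reader = None
        return None


def reader_survives(file_cls):
    """Return True if the reader is still alive after the with block."""
    f = file_cls()
    with f:
        ref = weakref.ref(f.reader)
    gc.collect()
    # `f` is still in scope here, mimicking a file object held by a cache or list.
    return ref() is not None
```

With these classes, reader_survives(RetainingFile) returns True and reader_survives(ReleasingFile) returns False. If something in the write path keeps the file objects themselves alive, the first variant would accumulate readers in the way the tracemalloc output suggests.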
Willingness to contribute