Arrow: Don't copy the list/map when not needed#252
Conversation
Wrapping the list seems to introduce an odd behavior where `null` values are converted to an empty list `[]`. Resolves #251
.cast(..) instead of wrapping a map/list
.cast(..) instead of wrapping a map/list| if isinstance(list_array, pa.ListArray) and value_array is not None: | ||
| arrow_field = pa.list_(self._construct_field(list_type.element_field, value_array.type)) | ||
| if isinstance(value_array, pa.StructArray): | ||
| # Arrow does not allow reordering of fields, therefore we have to copy the array :( |
There was a problem hiding this comment.
Just to confirm my understanding, another reason that we have to copy the array: Arrow also does not allow field-mismatch, which happens when we have an optional schema field and no values for that field in the file.
There was a problem hiding this comment.
And this limitation will make the issue remaining in this edge case (col_list array<struct<test:int>>):
spark.sql(
f"""
CREATE TABLE {catalog_name}.default.test_table_empty_list_and_map (
col_list array<struct<test:int>>,
col_map map<int, int>
)
USING iceberg
TBLPROPERTIES (
'format-version'='1'
);
"""
)
spark.sql(
f"""
INSERT INTO {catalog_name}.default.test_table_empty_list_and_map
VALUES (null, null)
"""
)
def test_null_list_and_map(catalog: Catalog) -> None:
table_test_empty_list_and_map = catalog.load_table("default.test_table_empty_list_and_map")
arrow_table = table_test_empty_list_and_map.scan().to_arrow()
> assert arrow_table["col_list"].to_pylist() == [None]
E assert [[]] == [None]There was a problem hiding this comment.
It seems adding the following before line 1167
if list_array.is_null():
return Nonecan let me pass the above test.
There was a problem hiding this comment.
Thanks for the test-case, let me add that one 👍 I don't think the last fix is going to work, since not all values might be null 🤔
There was a problem hiding this comment.
I've pushed a fix, but I don't think we can solve this yet without a fix in Arrow upstream 😢 apache/arrow#38809
|
@HonahX I agree, thanks for merging. This PR is already an improvement over the previous situation, so I think it is good to have it in. |
Wrapping the list seems to introduce an odd behavior where
nullvalues are converted to an empty list[].Because the arrays in Arrow have a counter for the null values. When we create a new list by copying, the counter is not taken into account, and just empty lists/maps are injected instead.
Resolves #251