Description
What is the problem the feature request solves?
With the experimental native scans built on DataFusion's ParquetExec and our update to DataFusion 45, we have the opportunity to start adding support for StringView. I have started scoping out this work and would like to aggregate findings here.
Describe the potential solution
Project-level:
- Bump arrow-java version. We're currently on 16.0.0. I believe the view types were added in 17.0.0. I tested bumping to 18.2.0 and so far it doesn't seem too painful.
Java-side:
- Add support for decoding Utf8View and BinaryView to CometVector. I prototyped this here and here for Utf8View and BinaryView, respectively.
Native-side:
- Enable StringViewArray by default in query execution and Parquet reader. We're using a recent enough DataFusion version that this is done already.
- planner.rs and serde.rs should generate Utf8View and BinaryView types when possible.
- Shuffle:
  - Add support to hash_util.
  - Add support to shuffle_writer (slot_size, etc.)
I'm sure there's more than this, and I will keep adding items as I find things broken in my proof-of-concept branch.
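To make the shuffle items more concrete, here is a minimal, std-only sketch of the 16-byte view layout defined by the Arrow format: a 4-byte length, followed either by up to 12 inline bytes or by a 4-byte prefix, a buffer index, and an offset. It is not Comet code. The names `View`, `make_view`, `resolve`, `hash_bytes`, and `estimated_row_bytes` are hypothetical, `DefaultHasher` is only a stand-in for whatever hash hash_util actually applies, and the size estimate is an assumption about what a slot_size-style calculation might need to account for.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Values up to this many bytes are stored inside the view itself.
const MAX_INLINE_LEN: usize = 12;

/// A raw 16-byte view (hypothetical alias, used here for easy slicing).
type View = [u8; 16];

/// Read a little-endian u32 from the first four bytes of `bytes`.
fn read_u32(bytes: &[u8]) -> u32 {
    u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]])
}

/// Build a view for `value`, appending long values to `buffer`.
/// `buffer_index` is the position `buffer` will have in the buffer list.
fn make_view(value: &[u8], buffer_index: u32, buffer: &mut Vec<u8>) -> View {
    let mut view = [0u8; 16];
    view[0..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    if value.len() <= MAX_INLINE_LEN {
        // Short value: bytes live inline, zero-padded.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long value: prefix, buffer index, and offset into that buffer.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(value);
        view[4..8].copy_from_slice(&value[..4]);
        view[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Resolve a view to the bytes it refers to.
fn resolve<'a>(view: &'a View, buffers: &'a [Vec<u8>]) -> &'a [u8] {
    let len = read_u32(&view[0..4]) as usize;
    if len <= MAX_INLINE_LEN {
        &view[4..4 + len]
    } else {
        let buffer_index = read_u32(&view[8..12]) as usize;
        let offset = read_u32(&view[12..16]) as usize;
        &buffers[buffer_index][offset..offset + len]
    }
}

/// Hash the resolved bytes, not the raw view, so that the same string
/// hashes identically whether it arrives as Utf8 or Utf8View.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new(); // stand-in hash function
    hasher.write(bytes);
    hasher.finish()
}

/// Rough per-row footprint: the fixed 16-byte view plus out-of-line data.
fn estimated_row_bytes(view: &View) -> usize {
    let len = read_u32(&view[0..4]) as usize;
    16 + if len > MAX_INLINE_LEN { len } else { 0 }
}
```

The point of the sketch is that a view is not the value: hashing or sizing the 16 bytes directly would give results that depend on buffer placement rather than on the string, so both steps likely need to go through something like `resolve` first.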
Additional context
Related DataFusion blogs:
https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-2/
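One benefit of the view layout is that the length and the first four bytes of every value sit in the view itself, so many comparisons can be decided without touching the data buffers. The following is a simplified sketch of that idea, not DataFusion's or arrow-rs's implementation. `ViewHead`, `head_of`, and `quick_eq` are hypothetical names, and the sketch only models the first eight bytes of a view.

```rust
/// Number of leading bytes stored next to the length in every view.
const PREFIX_LEN: usize = 4;

/// The first eight bytes of a view: the length and a zero-padded prefix.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ViewHead {
    len: u32,
    prefix: [u8; PREFIX_LEN],
}

/// Build the head for a value (hypothetical helper for illustration).
fn head_of(value: &[u8]) -> ViewHead {
    let mut prefix = [0u8; PREFIX_LEN];
    let n = value.len().min(PREFIX_LEN);
    prefix[..n].copy_from_slice(&value[..n]);
    ViewHead { len: value.len() as u32, prefix }
}

/// Try to decide equality from the heads alone.
/// Returns `Some(false)` if length or prefix differ, `Some(true)` if the
/// whole value fits in the prefix and the heads match, and `None` when
/// the remaining bytes would have to be compared.
fn quick_eq(a: ViewHead, b: ViewHead) -> Option<bool> {
    if a != b {
        return Some(false);
    }
    if a.len as usize <= PREFIX_LEN {
        return Some(true);
    }
    None
}
```

For example, `quick_eq(head_of(b"spark"), head_of(b"comet"))` rejects the pair from the prefix alone, while two 12-byte values that share a prefix return `None` and need a full comparison.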