Basic Extension Type Registry Implementation#20312
Basic Extension Type Registry Implementation#20312tobixdev wants to merge 9 commits intoapache:mainfrom
Conversation
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for your continued effort on this!
I don't have any notes on the implementation pattern here...a session-scoped registry where we can define custom behaviour for extension types is exactly what we need to get started integrating types like UUID and variant in a composable way. I love that this PR does something useful out of the gate, and that it is implemented in a way that does not impact existing code.
I can talk all day about why I think extension types are great, but I'll just leave one concrete example in support: this PR lets me remove my custom workaround for displaying geometries in our CLI, Python, and R bindings (~1500 lines of code), and would let me upstream our table printer (e.g., auto hiding columns for wide output) if there's interest. Basically, this lets me upstream parts of SedonaDB that are relevant to supporting UUID and variant instead of continuing to pursue workarounds.
| /// # Why do we need a Registration? | ||
| /// | ||
| /// A good question is why this trait is even necessary. Why not directly register the | ||
| /// [`DFExtensionType`] in a registration? |
There was a problem hiding this comment.
Thank you for this write up! I think the way you've done it here is great...in the context of the existing ExtensionType trait, the registration is basically Self (i.e., Self::try_from_field(...)) and the DFExtensionType is basically self (i.e., self.do_stuff_with_a_data_type()).
Arrow C++ does register an instance of the DFExtensionType equivalent instead of having a separate registration class, where every instance of an extension type has a .Deserialize() method that can create a new instance of itself; however, I think the way you've done it here is much cleaner.
There was a problem hiding this comment.
That's a great analogy! Maybe I can integrate that somehow in the documentation. I'll try it once we've got another review.
| #[tokio::test] | ||
| async fn test_pretty_print_logical_plan() -> Result<()> { | ||
| let result = create_test_table().await?.to_string().await?; | ||
|
|
||
| assert_snapshot!( | ||
| result, | ||
| @r" | ||
| +--------------------------------------+ | ||
| | my_uuids | | ||
| +--------------------------------------+ | ||
| | 00000000-0000-0000-0000-000000000000 | | ||
| | 00010203-0405-0607-0809-000102030506 | | ||
| +--------------------------------------+ | ||
| " | ||
| ); |
There was a problem hiding this comment.
🎉 !
This is actually one of the things I am looking forward the most with pretty printing...it means that assert_batches_eq!() will work out of the box and some of the testing workarounds we've been using can collapse to nicely readable tests.
| pub trait DFExtensionType: Debug + Send + Sync { | ||
| /// Returns an [`ArrayFormatter`] that can format values of this type. | ||
| /// | ||
| /// If `Ok(None)` is returned, the default implementation will be used. | ||
| /// If an error is returned, there was an error creating the formatter. | ||
| fn create_array_formatter<'fmt>( | ||
| &self, | ||
| _array: &'fmt dyn Array, | ||
| _options: &FormatOptions<'fmt>, | ||
| ) -> Result<Option<ArrayFormatter<'fmt>>> { | ||
| Ok(None) | ||
| } | ||
| } |
There was a problem hiding this comment.
Just highlighting for readers that this is the crux: there is now a trait that centralizes the properties of a non-built-in-Arrow DataType that DataFusion internals have access to. Pretty printing is a relatively straightforward initial target that is particularly useful for the CLI and testing; however, there are more things that can be added here (or not):
- Casting extension arrays (this would unlock things like
'00010203-0405-0607-0809-000102030506'::UUIDor for me,'POINT (0 1)'::GEOMETRY). The CSV writer could also leverage this by casting extension types to string. This would probably also be sufficient to makesome_uuid_column = '00010203-0405-0607-0809-000102030506'work as well because I think=currently works by casting both sides to a common type. - Sort order (I think this is what got Tobias started on all of this and it also benefits geometry)
There was a problem hiding this comment.
Thanks @paleolimbot for taking the time to provide additional context! This definitely helps readers to better understand the potential impact of extension type support!
I also appreciate your continued engagement with this topic! Otherwise, I might have already stopped working on it. ;)
There was a problem hiding this comment.
To also add some additional context from my side:
For us, the pretty printer would also be a great improvement. Currently, pretty-printing our RDF terms looks something like this:
+--------------------------------------+-------------------------------------------------------+
| input | STR(?table?.input) |
+--------------------------------------+-------------------------------------------------------+
| {named_node=http://example.com/test} | {string={value: http://example.com/test, language: }} |
| {decimal=1000.0000000000000000} | {string={value: 10, language: }} |
+--------------------------------------+-------------------------------------------------------+
as opposed to
+--------------------------------------+-------------------------------------------------------+
| input | STR(?table?.input) |
+--------------------------------------+-------------------------------------------------------+
| <http://example.com/test> | "http://example.com/test" |
| "1000.00"xsd:decimal | "10" |
+--------------------------------------+-------------------------------------------------------+
We can do that as we're just a research project but if we would offer a usable CLI we would require similar workarounds as SedonaDB.
Which issue does this PR close?
This is a PR based on #18552 that contains a basic implementation of an extension type registry. The driving use case is pretty-printing data frames with custom types.
Ping @paleolimbot @adriangb if you're still interested.
Most Important Changes to the Old PR
DFExtensionTypeinstead ofBoundExtensionTypeto avoid the need of explaining "bind". If you think there is merit in the other term, let me know. I think otherwise, this aligns with your proposal.Rationale for this change
What changes are included in this PR?
ExtensionTypeRegistryDFArrayFormatterFactorywhich creates custom pretty-printers when formatting data frames.SessionState/SessionContextAre these changes tested?
Happy to add more tests if this PR has a chance of being merged
Are there any user-facing changes?
Yes, the entire Extension Type API is new.