Basic Extension Type Registry Implementation #20312

Open
tobixdev wants to merge 9 commits into apache:main from tobixdev:extension-type-registry-2

@tobixdev
Contributor

Which issue does this PR close?

This is a PR based on #18552 that contains a basic implementation of an extension type registry. The driving use case is pretty-printing data frames with custom types.

Ping @paleolimbot @adriangb if you're still interested.

Most Important Changes to the Old PR

  • We no longer use the Logical Type, as there is no real consensus on how DataFusion should allow "inline" references to extension types. As a consequence, the query-plan formatting use case from the old PR no longer works. Extension types can only be used where DataFusion has a reference to a registry (e.g., DataFrame pretty-printing). @paleolimbot I've called it DFExtensionType instead of BoundExtensionType to avoid the need to explain "bind". If you think there is merit in the other term, let me know. Otherwise, I think this aligns with your proposal.
  • Added a more complex example with a parameterized type to demonstrate the full capability of the API.
  • No extension types are registered by default; users must opt in.

Rationale for this change

  • Allow customized behavior based on extension type metadata.

What changes are included in this PR?

  • Add an ExtensionTypeRegistry
  • Add DFArrayFormatterFactory which creates custom pretty-printers when formatting data frames.
  • Add an extension type registry to the SessionState / SessionContext
  • Add a full example of using the API
  • Add an implementation for the UUID canonical extension type
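To make the shape of these changes concrete, here is a minimal, standard-library-only sketch of a registry that starts empty and maps an extension name to formatting behaviour. All names (`ExtensionBehavior`, `Registry`, `HexPrefixed`, `example.hex`) are hypothetical stand-ins for illustration and are not the types added by this PR.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical stand-in for per-type behaviour; here it only formats a raw value.
pub trait ExtensionBehavior: Debug + Send + Sync {
    /// Returns `None` to request the default formatting.
    fn format_value(&self, raw: &[u8]) -> Option<String>;
}

/// Hypothetical registry keyed by extension name. `Registry::default()` is
/// empty, so nothing is customized until the user opts in.
#[derive(Debug, Default)]
pub struct Registry {
    types: HashMap<String, Arc<dyn ExtensionBehavior>>,
}

impl Registry {
    /// Registers a behaviour, returning the one it replaced, if any.
    pub fn register(
        &mut self,
        name: &str,
        ty: Arc<dyn ExtensionBehavior>,
    ) -> Option<Arc<dyn ExtensionBehavior>> {
        self.types.insert(name.to_string(), ty)
    }

    /// Renders a value, falling back to plain hex when no behaviour is
    /// registered for `name` or the behaviour declines by returning `None`.
    pub fn render(&self, name: Option<&str>, raw: &[u8]) -> String {
        name.and_then(|n| self.types.get(n))
            .and_then(|ty| ty.format_value(raw))
            .unwrap_or_else(|| default_format(raw))
    }
}

/// Default rendering: lowercase hex without separators.
fn default_format(raw: &[u8]) -> String {
    raw.iter().map(|b| format!("{b:02x}")).collect()
}

/// Made-up example type that prefixes the hex rendering with `0x`.
#[derive(Debug)]
pub struct HexPrefixed;

impl ExtensionBehavior for HexPrefixed {
    fn format_value(&self, raw: &[u8]) -> Option<String> {
        Some(format!("0x{}", default_format(raw)))
    }
}
```

With an empty registry, `render(Some("example.hex"), &[0xab, 0x01])` gives the default `ab01`; after registering `HexPrefixed` under that name, the same call gives `0xab01`.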

Are these changes tested?

  • Yes, but only two end-to-end tests.
    • One for pretty-printing UUID values
    • One for pretty-printing in the example

Happy to add more tests if this PR has a chance of being merged.

Are there any user-facing changes?

Yes, the entire Extension Type API is new.

@github-actions bot added the logical-expr (Logical plan and expressions), core (Core DataFusion crate), common (Related to common crate), and ffi (Changes to the ffi crate) labels on Feb 12, 2026
@tobixdev changed the title from "Basic Extension Type Implementation" to "Basic Extension Type Registry Implementation" on Feb 12, 2026
@github-actions bot added the datasource (Changes to the datasource crate) label on Feb 12, 2026
Member

@paleolimbot left a comment

Thank you for your continued effort on this!

I don't have any notes on the implementation pattern here...a session-scoped registry where we can define custom behaviour for extension types is exactly what we need to get started integrating types like UUID and variant in a composable way. I love that this PR does something useful out of the gate, and that it is implemented in a way that does not impact existing code.

I can talk all day about why I think extension types are great, but I'll just leave one concrete example in support: this PR lets me remove my custom workaround for displaying geometries in our CLI, Python, and R bindings (~1500 lines of code), and would let me upstream our table printer (e.g., auto hiding columns for wide output) if there's interest. Basically, this lets me upstream parts of SedonaDB that are relevant to supporting UUID and variant instead of continuing to pursue workarounds.

Comment on lines +229 to +232
/// # Why do we need a Registration?
///
/// A good question is why this trait is even necessary. Why not directly register the
/// [`DFExtensionType`] in a registration?
Member

Thank you for this write up! I think the way you've done it here is great...in the context of the existing ExtensionType trait, the registration is basically Self (i.e., Self::try_from_field(...)) and the DFExtensionType is basically self (i.e., self.do_stuff_with_a_data_type()).

Arrow C++ does register an instance of the DFExtensionType equivalent instead of having a separate registration class, where every instance of an extension type has a .Deserialize() method that can create a new instance of itself; however, I think the way you've done it here is much cleaner.
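The `Self` versus `self` analogy can be sketched with two small traits. This is only an illustration under assumed names: `ExtType`, `ExtTypeRegistration`, `Temperature`, and the `example.temperature` name are made up here and do not reflect the signatures in this PR.

```rust
use std::fmt::Debug;

/// Hypothetical "self" side: a resolved extension type instance that already
/// knows its parameters and can act on values.
pub trait ExtType: Debug {
    fn format_value(&self, raw: i64) -> String;
}

/// Hypothetical "Self" side: knows the extension name and how to build an
/// instance from the serialized metadata attached to a field.
pub trait ExtTypeRegistration {
    fn type_name(&self) -> &str;
    fn create(&self, metadata: Option<&str>) -> Result<Box<dyn ExtType>, String>;
}

/// Made-up parameterized type: the unit is carried in the metadata.
#[derive(Debug, PartialEq)]
pub enum Temperature {
    Celsius,
    Fahrenheit,
}

impl ExtType for Temperature {
    fn format_value(&self, raw: i64) -> String {
        match self {
            Temperature::Celsius => format!("{raw} C"),
            Temperature::Fahrenheit => format!("{raw} F"),
        }
    }
}

/// The registration is stateless; each `create` call yields a configured instance.
pub struct TemperatureRegistration;

impl ExtTypeRegistration for TemperatureRegistration {
    fn type_name(&self) -> &str {
        "example.temperature"
    }

    fn create(&self, metadata: Option<&str>) -> Result<Box<dyn ExtType>, String> {
        match metadata {
            Some("celsius") => Ok(Box::new(Temperature::Celsius)),
            Some("fahrenheit") => Ok(Box::new(Temperature::Fahrenheit)),
            other => Err(format!("unsupported temperature metadata: {other:?}")),
        }
    }
}
```

One registration can therefore serve every parameterization of a type, while each created instance is bound to one concrete set of parameters.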

Contributor Author

That's a great analogy! Maybe I can integrate that somehow in the documentation. I'll try it once we've got another review.

Comment on lines +58 to +72
#[tokio::test]
async fn test_pretty_print_logical_plan() -> Result<()> {
    let result = create_test_table().await?.to_string().await?;

    assert_snapshot!(
        result,
        @r"
    +--------------------------------------+
    | my_uuids                             |
    +--------------------------------------+
    | 00000000-0000-0000-0000-000000000000 |
    | 00010203-0405-0607-0809-000102030506 |
    +--------------------------------------+
    "
    );
Member

🎉 !

This is actually one of the things I am looking forward to the most with pretty printing...it means that assert_batches_eq!() will work out of the box and some of the testing workarounds we've been using can collapse to nicely readable tests.
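For readers who want to see what the UUID rows in the snapshot above amount to, here is a minimal sketch that renders a 16-byte value in the 8-4-4-4-12 hyphenated form. `format_uuid` is a hypothetical helper written for this illustration; it is not the formatter implemented in the PR.

```rust
/// Formats a 16-byte value as lowercase hex in the 8-4-4-4-12 hyphenated form.
/// Returns `None` if the slice is not exactly 16 bytes long.
pub fn format_uuid(bytes: &[u8]) -> Option<String> {
    if bytes.len() != 16 {
        return None;
    }
    let mut out = String::with_capacity(36);
    for (i, b) in bytes.iter().enumerate() {
        // Hyphens go before bytes 4, 6, 8 and 10.
        if matches!(i, 4 | 6 | 8 | 10) {
            out.push('-');
        }
        out.push_str(&format!("{b:02x}"));
    }
    Some(out)
}
```

Feeding it the bytes `0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 5, 6` reproduces the second row of the snapshot.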

Comment on lines +59 to +71
pub trait DFExtensionType: Debug + Send + Sync {
    /// Returns an [`ArrayFormatter`] that can format values of this type.
    ///
    /// If `Ok(None)` is returned, the default implementation will be used.
    /// If an error is returned, there was an error creating the formatter.
    fn create_array_formatter<'fmt>(
        &self,
        _array: &'fmt dyn Array,
        _options: &FormatOptions<'fmt>,
    ) -> Result<Option<ArrayFormatter<'fmt>>> {
        Ok(None)
    }
}
Member

Just highlighting for readers that this is the crux: there is now a trait that centralizes the properties of a non-built-in-Arrow DataType that DataFusion internals have access to. Pretty printing is a relatively straightforward initial target that is particularly useful for the CLI and testing; however, there are more things that can be added here (or not):

  • Casting extension arrays (this would unlock things like '00010203-0405-0607-0809-000102030506'::UUID or for me, 'POINT (0 1)'::GEOMETRY). The CSV writer could also leverage this by casting extension types to string. This would probably also be sufficient to make some_uuid_column = '00010203-0405-0607-0809-000102030506' work as well because I think = currently works by casting both sides to a common type.
  • Sort order (I think this is what got Tobias started on all of this and it also benefits geometry)
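As a rough illustration of the casting idea in the first bullet, the sketch below parses a hyphenated string literal into the 16-byte storage form and compares it with a stored value. `parse_uuid` and `uuid_equals_literal` are hypothetical helpers, and this says nothing about how DataFusion's cast or coercion machinery would actually be wired up.

```rust
/// Parses a UUID in the 8-4-4-4-12 hyphenated form into its 16 bytes.
/// Returns `None` for any other length, a misplaced hyphen, or a non-hex digit.
pub fn parse_uuid(s: &str) -> Option<[u8; 16]> {
    let b = s.as_bytes();
    if b.len() != 36 {
        return None;
    }
    let mut out = [0u8; 16];
    let mut n = 0; // index of the next output byte
    let mut i = 0; // index into the input bytes
    while i < 36 {
        if matches!(i, 8 | 13 | 18 | 23) {
            if b[i] != b'-' {
                return None;
            }
            i += 1;
            continue;
        }
        let hi = (b[i] as char).to_digit(16)?;
        let lo = (b[i + 1] as char).to_digit(16)?;
        out[n] = (hi * 16 + lo) as u8;
        n += 1;
        i += 2;
    }
    Some(out)
}

/// Compares a stored value with a string literal by first bringing the
/// literal into the storage representation (the "common type").
pub fn uuid_equals_literal(stored: &[u8; 16], literal: &str) -> bool {
    parse_uuid(literal).map_or(false, |parsed| &parsed == stored)
}
```

An unparseable literal simply compares as not equal here; a real cast would more likely surface an error.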

Contributor Author

Thanks @paleolimbot for taking the time to provide additional context! This definitely helps readers to better understand the potential impact of extension type support!

I also appreciate your continued engagement with this topic! Otherwise, I might have already stopped working on it. ;)

Contributor Author

To also add some additional context from my side:

For us, the pretty printer would also be a great improvement. Currently, pretty-printing our RDF terms looks something like this:

+--------------------------------------+-------------------------------------------------------+
| input                                | STR(?table?.input)                                    |
+--------------------------------------+-------------------------------------------------------+
| {named_node=http://example.com/test} | {string={value: http://example.com/test, language: }} |
| {decimal=1000.0000000000000000}      | {string={value: 10, language: }}                      |
+--------------------------------------+-------------------------------------------------------+

as opposed to

+--------------------------------------+-------------------------------------------------------+
| input                                | STR(?table?.input)                                    |
+--------------------------------------+-------------------------------------------------------+
| <http://example.com/test>            | "http://example.com/test"                             |
| "1000.00"xsd:decimal                 | "10"                                                  |
+--------------------------------------+-------------------------------------------------------+

We can live with that because we're just a research project, but if we were to offer a usable CLI, we would need workarounds similar to SedonaDB's.
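A minimal sketch of the kind of per-term formatting the second table implies is shown below. The `Term` enum is a heavily simplified, hypothetical model invented for this illustration (the project's real encoding is not shown in this thread), and the typed-literal rendering merely mirrors the `"1000.00"xsd:decimal` cell above.

```rust
/// Hypothetical, simplified RDF term model.
#[derive(Debug, Clone, PartialEq)]
pub enum Term {
    NamedNode(String),
    SimpleLiteral(String),
    TypedLiteral { value: String, datatype: String },
}

/// Renders a term compactly instead of dumping the underlying struct layout.
/// No escaping of quotes or angle brackets is attempted in this sketch.
pub fn format_term(term: &Term) -> String {
    match term {
        Term::NamedNode(iri) => format!("<{iri}>"),
        Term::SimpleLiteral(value) => format!("\"{value}\""),
        Term::TypedLiteral { value, datatype } => format!("\"{value}\"{datatype}"),
    }
}
```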

Labels

  • common: Related to common crate
  • core: Core DataFusion crate
  • datasource: Changes to the datasource crate
  • ffi: Changes to the ffi crate
  • logical-expr: Logical plan and expressions

Development

Successfully merging this pull request may close these issues.

API to register behavior for Extension Types

2 participants