Basic Extension Type Registry Implementation by tobixdev · Pull Request #20312 · apache/datafusion

tobixdev · 2026-02-12T11:53:33Z

Which issue does this PR close?

This is a PR based on #18552 that contains a basic implementation of an extension type registry. The driving use case is pretty-printing data frames with custom types.

Closes "API to register behavior for Extension Types" (#18223).

Ping @paleolimbot @adriangb if you're still interested.

Most Important Changes to the Old PR

We no longer use the Logical Type, as there is no real consensus on how DataFusion should allow "inline" references to extension types. As a consequence, the use case of formatting query plans from the old PR no longer works. Extension types can only be used where DataFusion has a reference to a registry (e.g., DataFrame pretty-printing). @paleolimbot I've called it DFExtensionType instead of BoundExtensionType to avoid having to explain "bind". If you think there is merit in the other term, let me know. Otherwise, I think this aligns with your proposal.
Added a more complex example with a parameterized type to demonstrate the full capabilities of the API.
No extension types are registered by default; users must opt in.

Rationale for this change

Allow customized behavior based on extension type metadata.

What changes are included in this PR?

Add an ExtensionTypeRegistry
Add DFArrayFormatterFactory, which creates custom pretty-printers when formatting data frames
Add an extension type registry to the SessionState / SessionContext
Add a full example of using the API
Add an implementation for the UUID canonical extension type

Are these changes tested?

Yes, but only two end-to-end tests.
- One for pretty-printing UUID values
- One for pretty-printing in the example

Happy to add more tests if this PR has a chance of being merged

Are there any user-facing changes?

Yes, the entire Extension Type API is new.

paleolimbot

Thank you for your continued effort on this!

I don't have any notes on the implementation pattern here...a session-scoped registry where we can define custom behaviour for extension types is exactly what we need to get started integrating types like UUID and variant in a composable way. I love that this PR does something useful out of the gate, and that it is implemented in a way that does not impact existing code.

I can talk all day about why I think extension types are great, but I'll just leave one concrete example in support: this PR lets me remove my custom workaround for displaying geometries in our CLI, Python, and R bindings (~1500 lines of code), and would let me upstream our table printer (e.g., auto hiding columns for wide output) if there's interest. Basically, this lets me upstream parts of SedonaDB that are relevant to supporting UUID and variant instead of continuing to pursue workarounds.

paleolimbot · 2026-02-12T15:50:17Z

datafusion/expr/src/registry.rs

+/// # Why do we need a Registration?
+///
+/// A good question is why this trait is even necessary. Why not directly register the
+/// [`DFExtensionType`] in a registration?


Thank you for this write-up! I think the way you've done it here is great...in the context of the existing ExtensionType trait, the registration is basically Self (i.e., Self::try_from_field(...)) and the DFExtensionType is basically self (i.e., self.do_stuff_with_a_data_type()).

Arrow C++ does register an instance of the DFExtensionType equivalent instead of having a separate registration class, where every instance of an extension type has a .Deserialize() method that can create a new instance of itself; however, I think the way you've done it here is much cleaner.

tobixdev · 2026-02-13

That's a great analogy! Maybe I can integrate it into the documentation somehow. I'll try once we've got another review.
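To make the Self/self analogy concrete, here is a minimal, std-only sketch of the split between a registration and a type instance. All names (`ExtType`, `ExtTypeRegistration`, `Registry`, `Temperature`, the `"example.temperature"` name) are hypothetical stand-ins chosen for illustration; they are not the traits or signatures from this PR, and metadata is reduced to an optional string.

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical stand-in for `DFExtensionType`: the behaviour of one concrete,
/// already-parameterized type instance (the `self` side of the analogy).
pub trait ExtType: Debug + Send + Sync {
    fn describe(&self) -> String;
}

/// Hypothetical stand-in for a registration: it knows the extension name and
/// how to build an instance from serialized metadata (the `Self` side).
pub trait ExtTypeRegistration: Debug + Send + Sync {
    fn name(&self) -> &str;
    fn create(&self, metadata: Option<&str>) -> Result<Arc<dyn ExtType>, String>;
}

/// An illustrative parameterized type: a temperature with a unit.
#[derive(Debug)]
pub struct Temperature {
    unit: String,
}

impl ExtType for Temperature {
    fn describe(&self) -> String {
        format!("temperature in {}", self.unit)
    }
}

#[derive(Debug)]
pub struct TemperatureRegistration;

impl ExtTypeRegistration for TemperatureRegistration {
    fn name(&self) -> &str {
        "example.temperature"
    }

    fn create(&self, metadata: Option<&str>) -> Result<Arc<dyn ExtType>, String> {
        // The parameter lives in the metadata, so it is required here.
        let unit = metadata.ok_or_else(|| "missing unit metadata".to_string())?;
        Ok(Arc::new(Temperature { unit: unit.to_string() }))
    }
}

/// A session-scoped lookup table from extension name to registration.
#[derive(Debug, Default)]
pub struct Registry {
    registrations: HashMap<String, Arc<dyn ExtTypeRegistration>>,
}

impl Registry {
    pub fn register(&mut self, registration: Arc<dyn ExtTypeRegistration>) {
        self.registrations
            .insert(registration.name().to_string(), registration);
    }

    /// Returns `Ok(None)` for unknown names so callers can fall back to
    /// treating the column as its plain storage type.
    pub fn resolve(
        &self,
        name: &str,
        metadata: Option<&str>,
    ) -> Result<Option<Arc<dyn ExtType>>, String> {
        match self.registrations.get(name) {
            Some(registration) => registration.create(metadata).map(Some),
            None => Ok(None),
        }
    }
}
```

In this sketch, the registry stores only registrations; a type instance is created on demand from a field's name and metadata, which is what lets a single registration serve every parameterization of the type.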

paleolimbot · 2026-02-12T15:52:51Z

datafusion/core/tests/extension_types/pretty_printing.rs

+#[tokio::test]
+async fn test_pretty_print_logical_plan() -> Result<()> {
+    let result = create_test_table().await?.to_string().await?;
+
+    assert_snapshot!(
+        result,
+        @r"
+    +--------------------------------------+
+    | my_uuids                             |
+    +--------------------------------------+
+    | 00000000-0000-0000-0000-000000000000 |
+    | 00010203-0405-0607-0809-000102030506 |
+    +--------------------------------------+
+    "
+    );


🎉 !

This is actually one of the things I am looking forward to the most with pretty printing...it means that assert_batches_eq!() will work out of the box and some of the testing workarounds we've been using can collapse to nicely readable tests.
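For readers wondering where the snapshot rows come from: the canonical UUID extension type stores each value as 16 raw bytes, and a formatter turns them into the hyphenated 8-4-4-4-12 hex layout. The following is only a sketch of that rendering step; `format_uuid` is a hypothetical helper, not the function used in this PR.

```rust
/// Formats 16 raw bytes in the canonical 8-4-4-4-12 hyphenated hex layout.
pub fn format_uuid(bytes: &[u8; 16]) -> String {
    let mut out = String::with_capacity(36);
    for (i, b) in bytes.iter().enumerate() {
        // A hyphen precedes the bytes at index 4, 6, 8, and 10.
        if matches!(i, 4 | 6 | 8 | 10) {
            out.push('-');
        }
        out.push_str(&format!("{b:02x}"));
    }
    out
}
```

Sixteen zero bytes produce the first row of the snapshot above, and the bytes `00 01 02 03 04 05 06 07 08 09 00 01 02 03 05 06` produce the second.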

paleolimbot · 2026-02-12T16:27:44Z

datafusion/common/src/types/extension.rs

+pub trait DFExtensionType: Debug + Send + Sync {
+    /// Returns an [`ArrayFormatter`] that can format values of this type.
+    ///
+    /// If `Ok(None)` is returned, the default implementation will be used.
+    /// If an error is returned, there was an error creating the formatter.
+    fn create_array_formatter<'fmt>(
+        &self,
+        _array: &'fmt dyn Array,
+        _options: &FormatOptions<'fmt>,
+    ) -> Result<Option<ArrayFormatter<'fmt>>> {
+        Ok(None)
+    }
+}


Just highlighting for readers that this is the crux: there is now a trait that centralizes the properties of a non-built-in-Arrow DataType that DataFusion internals have access to. Pretty printing is a relatively straightforward initial target that is particularly useful for the CLI and testing; however, there are more things that can be added here (or not):

Casting extension arrays (this would unlock things like '00010203-0405-0607-0809-000102030506'::UUID or for me, 'POINT (0 1)'::GEOMETRY). The CSV writer could also leverage this by casting extension types to string. This would probably also be sufficient to make some_uuid_column = '00010203-0405-0607-0809-000102030506' work as well because I think = currently works by casting both sides to a common type.

Sort order (I think this is what got Tobias started on all of this and it also benefits geometry)
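The `Ok(None)` contract of `create_array_formatter` in the trait above can be illustrated with a simplified, std-only sketch. The assumptions are significant: `ExtType`, `create_formatter`, `Opaque`, `HexBytes`, and `format_value` are hypothetical names, values are plain byte slices rather than Arrow arrays, and the "default" formatting is just Rust's `Debug` output. The real method takes an array and `FormatOptions` and returns an `ArrayFormatter`.

```rust
use std::fmt::Debug;

/// Hypothetical, simplified analogue of `DFExtensionType`. The hook returns
/// `Ok(None)` by default, meaning "use the built-in formatting".
pub trait ExtType: Debug + Send + Sync {
    fn create_formatter(&self) -> Result<Option<Box<dyn Fn(&[u8]) -> String>>, String> {
        Ok(None)
    }
}

/// A type that keeps the default and therefore opts out of custom printing.
#[derive(Debug)]
pub struct Opaque;

impl ExtType for Opaque {}

/// A type that overrides the hook to print its bytes as lowercase hex.
#[derive(Debug)]
pub struct HexBytes;

impl ExtType for HexBytes {
    fn create_formatter(&self) -> Result<Option<Box<dyn Fn(&[u8]) -> String>>, String> {
        Ok(Some(Box::new(|value: &[u8]| {
            value.iter().map(|b| format!("{b:02x}")).collect::<String>()
        })))
    }
}

/// Caller side: errors propagate, `None` falls back to the default rendering.
pub fn format_value(ty: &dyn ExtType, value: &[u8]) -> Result<String, String> {
    match ty.create_formatter()? {
        Some(formatter) => Ok(formatter(value)),
        None => Ok(format!("{value:?}")),
    }
}
```

Because the default method body already returns `Ok(None)`, further optional hooks (such as the casting or sort-order ideas listed above) could in principle be added to such a trait without breaking existing implementors.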

tobixdev · 2026-02-13

Thanks @paleolimbot for taking the time to provide additional context! This definitely helps readers to better understand the potential impact of extension type support!

I also appreciate your continued engagement with this topic! Otherwise, I might have already stopped working on it. ;)

To also add some additional context from my side:

For us, the pretty printer would also be a great improvement. Currently, pretty-printing our RDF terms looks something like this:

+--------------------------------------+-------------------------------------------------------+
| input                                | STR(?table?.input)                                    |
+--------------------------------------+-------------------------------------------------------+
| {named_node=http://example.com/test} | {string={value: http://example.com/test, language: }} |
| {decimal=1000.0000000000000000}      | {string={value: 10, language: }}                      |
+--------------------------------------+-------------------------------------------------------+

as opposed to

+--------------------------------------+-------------------------------------------------------+
| input                                | STR(?table?.input)                                    |
+--------------------------------------+-------------------------------------------------------+
| <http://example.com/test>            | "http://example.com/test"                             |
| "1000.00"xsd:decimal                 | "10"                                                  |
+--------------------------------------+-------------------------------------------------------+

We can live with that because we're just a research project, but if we offered a usable CLI, we would need workarounds similar to SedonaDB's.

tobixdev added 3 commits February 12, 2026 10:45

Add draft for basic extension type support

05b37c0

Add an example for custom extension types

2a48e73

Further improvements of the extension type API proposal

0eabd10

github-actions bot added the logical-expr, core, common, and ffi labels Feb 12, 2026

Formatting

f36534a

tobixdev changed the title ~~Basic Extension Type Implementation~~ Basic Extension Type Registry Implementation Feb 12, 2026

tobixdev added 3 commits February 12, 2026 12:57

Docs

f5d5b12

License headers and formatting

5616018

Add extension type registry implementation for mock sessions

4932656

github-actions bot added the datasource Changes to the datasource crate label Feb 12, 2026

tobixdev added 2 commits February 12, 2026 14:33

Fix error in listing_table_factory.rs, Formatting

6e1522e

Lints and formatting

f9b0b36

paleolimbot reviewed Feb 12, 2026

tobixdev wants to merge 9 commits into apache:main from tobixdev:extension-type-registry-2
