@serramatutu serramatutu commented Oct 30, 2025

Which issue does this PR close?

Implement arrow.timestamp_with_offset canonical extension type.

Rationale for this change

Be compliant with Arrow spec: apache/arrow#48002

What changes are included in this PR?

This commit adds a new TimestampWithOffset extension type. This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes.
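
To make the description above concrete, here is a minimal plain-Rust sketch of a single value. It is not the arrow-rs API: the `TimestampWithOffsetValue` struct, its field names, and the choice of seconds as the time unit are all assumptions made for illustration.

```rust
/// Hypothetical model of one `TimestampWithOffset` value (illustration only,
/// not a type from arrow-rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimestampWithOffsetValue {
    /// Seconds since the Unix epoch, always expressed in UTC.
    pub utc_seconds: i64,
    /// Offset of the original time zone from UTC, in minutes.
    pub offset_minutes: i16,
}

impl TimestampWithOffsetValue {
    /// Epoch seconds shifted by the offset, i.e. what a wall clock in the
    /// original zone would have shown.
    pub fn local_seconds(&self) -> i64 {
        self.utc_seconds + i64::from(self.offset_minutes) * 60
    }

    /// Two values denote the same instant when their UTC parts match,
    /// regardless of the offset each one carries.
    pub fn same_instant(&self, other: &Self) -> bool {
        self.utc_seconds == other.utc_seconds
    }
}
```

Because the instant is kept in UTC, two values with different offsets can still be compared directly, while the offset preserves how the time was originally written.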

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, this is a new canonical extension type.

Member

@westonpace westonpace left a comment

This seems pretty straightforward and reasonable

@felipecrv felipecrv marked this pull request as ready for review November 5, 2025 03:52
@alamb alamb changed the title [DRAFT] Add TimestampWithOffset extension type Add TimestampWithOffset extension type Nov 5, 2025
Contributor

alamb commented Nov 5, 2025

There are several TODOs in this PR's description. Is that intended?

Author

serramatutu commented Nov 6, 2025

@alamb We're currently still discussing the FORMAT in the arrow repo and in the mailing list before talking too much about implementation. Not sure why @felipecrv marked this as ready for review?

For arrow-rs, I still need to

  • Add run-end and dict encoding to the offset column
  • Figure out how to do JSON de/encoding using RFC3339
  • Test more

Realistically, even after the FORMAT PR goes into the spec I think this will take me a few more weeks to fully flesh out.
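
As a rough illustration of the RFC 3339 item in the list above, the offset part of such an encoding could look like the sketch below. The PR does not show its JSON code, so the function names `rfc3339_offset` and `parse_rfc3339_offset` are hypothetical, and emitting `Z` for a zero offset (rather than `+00:00`) is an assumption.

```rust
/// Formats an offset in minutes as an RFC 3339 `time-offset`: `Z` or `±hh:mm`.
/// Assumes the offset is within a realistic range (less than 24 hours).
pub fn rfc3339_offset(offset_minutes: i16) -> String {
    if offset_minutes == 0 {
        return "Z".to_string();
    }
    let sign = if offset_minutes < 0 { '-' } else { '+' };
    // Widen before taking the absolute value so i16::MIN cannot overflow.
    let abs = i32::from(offset_minutes).abs();
    format!("{}{:02}:{:02}", sign, abs / 60, abs % 60)
}

/// Parses `Z` or `±hh:mm` back into minutes; returns `None` on malformed input.
pub fn parse_rfc3339_offset(s: &str) -> Option<i16> {
    if s == "Z" || s == "z" {
        return Some(0);
    }
    let b = s.as_bytes();
    if b.len() != 6 || b[3] != b':' {
        return None;
    }
    let sign: i16 = match b[0] {
        b'+' => 1,
        b'-' => -1,
        _ => return None,
    };
    // Reject anything that is not exactly two ASCII digits per field.
    let digits = [b[1], b[2], b[4], b[5]];
    if !digits.iter().all(|d| d.is_ascii_digit()) {
        return None;
    }
    let hours = i16::from(b[1] - b'0') * 10 + i16::from(b[2] - b'0');
    let minutes = i16::from(b[4] - b'0') * 10 + i16::from(b[5] - b'0');
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(sign * (hours * 60 + minutes))
}
```

A full encoder would also need to render the date and time in the local zone before appending this suffix; that calendar arithmetic is left out of the sketch.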

Contributor

alamb commented Nov 7, 2025

Thanks for the clarification @serramatutu -- I'll mark it as a draft again then

@alamb alamb marked this pull request as draft November 7, 2025 20:01
felipecrv added a commit to apache/arrow that referenced this pull request Dec 5, 2025
…48002)

### Rationale for this change

Closes #44248

Arrow has no built-in canonical way of representing the `TIMESTAMP WITH
TIME ZONE` SQL type, which is present across multiple different database
systems. Not having a native way to represent this forces users to
either convert to UTC and drop the time zone, which may have correctness
implications, or use bespoke workarounds. A new
`arrow.timestamp_with_offset` extension type would introduce a standard
canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

[DISCUSS] [thread in the mailing
list](https://lists.apache.org/thread/yhbr3rj9l59yoxv92o2s6dqlop16sfnk).

### What changes are included in this PR?

Proposal and documentation for `arrow.timestamp_with_offset` canonical
extension type.

### Are these changes tested?

N/A

### Are there any user-facing changes?

Yes, this is an extension to the arrow format.

* GitHub Issue: #44248

---------

Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
This commit adds a new `TimestampWithOffset` extension type.

This type represents a timestamp column that stores potentially
different timezone offsets per value. The timestamp is stored in
UTC alongside the original timezone offset in minutes.
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/rs branch from d187bb9 to f6cda58 Compare December 16, 2025 16:07
@serramatutu serramatutu marked this pull request as ready for review December 16, 2025 16:07
Contributor

@alamb alamb left a comment

Looks good to me -- thank you @serramatutu and @westonpace

/// The extension type for `TimestampWithOffset`.
///
/// <https://arrow.apache.org/docs/format/CanonicalExtensions.html#timestamp-with-offset>
TimestampWithOffset(TimestampWithOffset),
Contributor

Since this is a pub enum, unfortunately this will need to wait until the next major arrow release (in Jan):

@alamb alamb changed the title Add TimestampWithOffset extension type Add TimestampWithOffset canonical extension type Dec 18, 2025
@serramatutu
Author

Looks like we got a CI flake and need to rerun:

The hosted runner lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Comment on lines +28 to +51
/// This type represents a timestamp column that stores potentially different timezone offsets per value.
/// The timestamp is stored in UTC alongside the original timezone offset in minutes. This extension type
/// is intended to be compatible with ANSI SQL's `TIMESTAMP WITH TIME ZONE`, which is supported by multiple
/// database engines.
///
/// The storage type of the extension is a `Struct` with 2 fields, in order:
/// - `timestamp`: a non-nullable `Timestamp(time_unit, "UTC")`, where `time_unit` is any Arrow `TimeUnit` (s, ms, us or ns).
/// - `offset_minutes`: a non-nullable signed 16-bit integer (`Int16`) representing the offset in minutes
/// from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent
/// east. Offsets normally range from -779 (-12:59) to +780 (+13:00).
///
/// This type has no type parameters.
///
/// Metadata is either empty or an empty string.
///
/// It is also *permissible* for the `offset_minutes` field to be dictionary-encoded with a preferred (*but not required*)
/// index type of `int8`, or run-end-encoded with a preferred (*but not required*) runs type of `int8`.
///
/// It's worth noting that the data source needs to resolve timezone strings such as `UTC` or
/// `America/Los_Angeles` into an offset in minutes in order to construct a `TimestampWithOffset`.
/// This makes `TimestampWithOffset` type "lossy" in the sense that any original "unresolved"
/// timezone string gets lost in this conversion. It's a tradeoff for optimizing the row
/// representation and simplifying the client code, which does not need to know how to convert
/// from timezone string to its corresponding offset in minutes.
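
The doc comment above allows `offset_minutes` to be run-end-encoded, which pays off because neighbouring rows often share an offset. The following is a minimal std-only sketch of that idea, not the arrow-rs `RunArray` API: the `RunEndEncodedOffsets` type and its functions are hypothetical, and `i32` run ends are an arbitrary choice for the example.

```rust
/// Hypothetical run-end-encoded form of an `offset_minutes` column
/// (illustration only, not an arrow-rs type).
#[derive(Debug, PartialEq, Eq)]
pub struct RunEndEncodedOffsets {
    /// Exclusive end index of each run in the logical column.
    pub run_ends: Vec<i32>,
    /// The offset, in minutes, shared by every row in the matching run.
    pub values: Vec<i16>,
}

/// Collapses consecutive equal offsets into runs.
pub fn run_end_encode(offsets: &[i16]) -> RunEndEncodedOffsets {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<i16> = Vec::new();
    for (i, &offset) in offsets.iter().enumerate() {
        if values.last() == Some(&offset) {
            // Same value as the current run: extend it by moving its end.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // A new value starts a new run that ends just after this row.
            values.push(offset);
            run_ends.push((i + 1) as i32);
        }
    }
    RunEndEncodedOffsets { run_ends, values }
}

impl RunEndEncodedOffsets {
    /// Looks up the offset for a logical row by binary-searching for the
    /// first run whose end lies beyond `index`.
    pub fn value_at(&self, index: usize) -> Option<i16> {
        let run = self.run_ends.partition_point(|&end| (end as usize) <= index);
        self.values.get(run).copied()
    }
}
```

A column where most rows come from the same zone collapses to a handful of runs, while the `timestamp` field keeps one UTC value per row.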
Contributor

Can you `gq` the paragraphs in vim, or do the equivalent in your editor? That keeps the lines at around 70 characters with uniform length.

Labels

arrow Changes to the arrow crate
