[SPARK-57257][SQL] Support nanosecond-precision timestamps in Hive results#56320
[SPARK-57257][SQL] Support nanosecond-precision timestamps in Hive results#56320MaxGekk wants to merge 2 commits into
Conversation
…sults ### What changes were proposed in this pull request? This PR modifies `HiveResult` to support the nanosecond-precision timestamp types `TIMESTAMP_LTZ(p)` (`TimestampLTZNanosType`) and `TIMESTAMP_NTZ(p)` (`TimestampNTZNanosType`), `p` in [7, 9]. Two cases are added to `HiveResult.toHiveStringDefault`, mirroring the existing microsecond timestamp cases: - `(i: Instant, _: TimestampLTZNanosType)` -> rendered in the session time zone. - `(l: LocalDateTime, _: TimestampNTZNanosType)` -> rendered zone-independently. The external collected values are `Instant` (LTZ) and `LocalDateTime` (NTZ); they are converted to the physical `TimestampNanosVal` at the column precision and formatted with the nanosecond-aware `TimestampFormatter` (`formatNanos` / `formatWithoutTimeZoneNanos`, SPARK-57162), flooring sub-`p` digits and trimming trailing zeros. This is the same rendering used by casting these types to string (SPARK-57256), so Hive output stays consistent. ### Why are the changes needed? Before the change, formatting a nanosecond timestamp column through `HiveResult` (e.g. end-to-end SQL / golden-file tests, spark-sql CLI, Thrift server output) hits the catch-all match and fails with a `MatchError`, analogous to the `TimeType` issue fixed in SPARK-51517: ``` scala.MatchError (2020-01-01T00:00:00.123456789Z, TimestampLTZNanosType(9)) (of class scala.Tuple2) ``` ### Does this PR introduce _any_ user-facing change? Yes. It fixes the error above. After the change, nanosecond timestamp values are rendered as proper strings in Hive results (only reachable when `spark.sql.timestampNanosTypes.enabled=true`). ### How was this patch tested? - New cases in `HiveResultSuite` covering `TIMESTAMP_LTZ(p)` / `TIMESTAMP_NTZ(p)` for `p` in [7, 9]: precision-driven fraction width, trailing-zero trimming, nanosWithinMicro 0 and 999, LTZ session-zone rendering vs. zone-independent NTZ, and nested (array/map/struct) values. - New golden-file end-to-end tests `timestamp-ltz-nanos.sql` and `timestamp-ntz-nanos.sql` (as SPARK-51517 added `time.sql`), disabled in `ThriftServerQueryTestSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.7.0
|
@peter-toth @dongjoon-hyun @HyukjinKwon Could you review this PR, please. This PR allows to write end-to-end tests in *.sql for timestamps with nanosecond precision. |
There was a problem hiding this comment.
Do we have any test cases with a pre-1970 value like 1960-01-01 00:00:00.000000001 which exercises the negative-epoch path?
There was a problem hiding this comment.
Good catch -- all the previous cases used a post-epoch base. Added a pre-1970 base (1960-01-01) so we now cover the negative-epochMicros + positive-nanosWithinMicro path, both in the HiveResultSuite matrix and in the timestamp-{ltz,ntz}-nanos.sql golden files. Done in d578fbb.
There was a problem hiding this comment.
It would be nice to add at least some test cases with null values (both top-level and nested/complex).
There was a problem hiding this comment.
Added. Note: NULLs for these columns are handled by the generic (null, _) branch in HiveResult.toHiveString, which runs before the new TimestampLTZNanosType / TimestampNTZNanosType cases, so the path is type-agnostic -- but it's worth locking in. Added top-level (NULL) and nested array/map/struct NULL cases to both golden files (and the suite). Done in d578fbb.
peter-toth
left a comment
There was a problem hiding this comment.
LGTM, but the test gaps that @uros-b's mentioned are valid. Please add them before merging.
…mestamps in Hive results Address review feedback on apache#56320: - Add a pre-1970 base (1960-01-01) to exercise the negative-epoch path (negative epochMicros + positive nanosWithinMicro), in both HiveResultSuite and the timestamp-{ltz,ntz}-nanos.sql golden files. - Add top-level and nested (array/map/struct) NULL cases. NULLs are handled by the generic `(null, _)` branch in `HiveResult.toHiveString`, but this locks in the behavior. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.7.0
|
Thanks for the review @peter-toth and @uros-b! Added the pre-1970 (negative-epoch) and NULL (top-level + nested) coverage in d578fbb, in both |
|
The failure Run / Build modules: pyspark-connect-old-client is not related to this PR, I believe. Merging to master/4.x. Thank you, @peter-toth @uros-b @dongjoon-hyun for review. |
…sults ### What changes were proposed in this pull request? This PR modifies `HiveResult` to support the nanosecond-precision timestamp types `TIMESTAMP_LTZ(p)` (`TimestampLTZNanosType`) and `TIMESTAMP_NTZ(p)` (`TimestampNTZNanosType`), `p` in [7, 9]. Two cases are added to `HiveResult.toHiveStringDefault`, mirroring the existing microsecond timestamp cases: - `(i: Instant, _: TimestampLTZNanosType)` -> rendered in the session time zone. - `(l: LocalDateTime, _: TimestampNTZNanosType)` -> rendered zone-independently. The external collected values are `Instant` (LTZ) and `LocalDateTime` (NTZ); they are converted to the physical `TimestampNanosVal` at the column precision and formatted with the nanosecond-aware `TimestampFormatter` (`formatNanos` / `formatWithoutTimeZoneNanos`, SPARK-57162), flooring sub-`p` digits and trimming trailing zeros. This is the same rendering used by casting these types to string (SPARK-57256), so Hive output stays consistent. ### Why are the changes needed? Before the change, formatting a nanosecond timestamp column through `HiveResult` (e.g. end-to-end SQL / golden-file tests, spark-sql CLI, Thrift server output) hits the catch-all match and fails with a `MatchError`, analogous to the `TimeType` issue fixed in SPARK-51517: ``` scala.MatchError (2020-01-01T00:00:00.123456789Z, TimestampLTZNanosType(9)) (of class scala.Tuple2) ``` ### Does this PR introduce _any_ user-facing change? Yes. It fixes the error above. After the change, nanosecond timestamp values are rendered as proper strings in Hive results (only reachable when `spark.sql.timestampNanosTypes.enabled=true`). ### How was this patch tested? - New cases in `HiveResultSuite` covering `TIMESTAMP_LTZ(p)` / `TIMESTAMP_NTZ(p)` for `p` in [7, 9]: precision-driven fraction width, trailing-zero trimming, nanosWithinMicro 0 and 999, LTZ session-zone rendering vs. zone-independent NTZ, and nested (array/map/struct) values. - New golden-file end-to-end tests `timestamp-ltz-nanos.sql` and `timestamp-ntz-nanos.sql` (as SPARK-51517 added `time.sql`), disabled in `ThriftServerQueryTestSuite`. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.7.0 Closes #56320 from MaxGekk/nanos-hiveresult. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 637803e) Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
This PR modifies
HiveResultto support the nanosecond-precision timestamp typesTIMESTAMP_LTZ(p)(TimestampLTZNanosType) andTIMESTAMP_NTZ(p)(TimestampNTZNanosType),pin [7, 9]. Two cases are added toHiveResult.toHiveStringDefault, mirroring the existing microsecond timestamp cases:(i: Instant, _: TimestampLTZNanosType)-> rendered in the session time zone.(l: LocalDateTime, _: TimestampNTZNanosType)-> rendered zone-independently.The external collected values are
Instant(LTZ) andLocalDateTime(NTZ); they are converted to the physicalTimestampNanosValat the column precision and formatted with the nanosecond-awareTimestampFormatter(formatNanos/formatWithoutTimeZoneNanos, SPARK-57162), flooring sub-pdigits and trimming trailing zeros. This is the same rendering used by casting these types to string (SPARK-57256), so Hive output stays consistent.Why are the changes needed?
Before the change, formatting a nanosecond timestamp column through
HiveResult(e.g. end-to-end SQL / golden-file tests, spark-sql CLI, Thrift server output) hits the catch-all match and fails with aMatchError, analogous to theTimeTypeissue fixed in SPARK-51517:Does this PR introduce any user-facing change?
Yes. It fixes the error above. After the change, nanosecond timestamp values are rendered as proper strings in Hive results (only reachable when
spark.sql.timestampNanosTypes.enabled=true).How was this patch tested?
HiveResultSuitecoveringTIMESTAMP_LTZ(p)/TIMESTAMP_NTZ(p)forpin [7, 9]: precision-driven fraction width, trailing-zero trimming, nanosWithinMicro 0 and 999, LTZ session-zone rendering vs. zone-independent NTZ, and nested (array/map/struct) values.timestamp-ltz-nanos.sqlandtimestamp-ntz-nanos.sql(as SPARK-51517 addedtime.sql), disabled inThriftServerQueryTestSuite.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.7.0