[SPARK-57256][SQL] Cast nanosecond-precision timestamps to string#56317
[SPARK-57256][SQL] Cast nanosecond-precision timestamps to string#56317MaxGekk wants to merge 6 commits into
Conversation
|
@stevomitric @uros-b Could you review this PR, please. |
uros-b
left a comment
There was a problem hiding this comment.
Shall we add some end-to-end SQL test coverage?
uros-b
left a comment
There was a problem hiding this comment.
Also, it would be nice to add at least one complex type cast test (e.g. using array).
Added in a50de21. The new |
### What changes were proposed in this pull request? Implement casting of the nanosecond-precision timestamp types `TIMESTAMP_NTZ(p)` (`TimestampNTZNanosType`) and `TIMESTAMP_LTZ(p)` (`TimestampLTZNanosType`), `p` in [7, 9], to `STRING`. Casting is implemented in `ToStringBase` (mixed into `Cast`), so this also fixes `ToPrettyString` (and therefore `Dataset.show()`) for these types via the shared base. The change wires the SPARK-57162 formatter methods into the existing cast-to-string paths (interpreted and codegen): - `TimestampLTZNanosType(p)` -> `TimestampFormatter.formatNanos(v, p)` (session time zone). - `TimestampNTZNanosType(p)` -> `TimestampFormatter.formatWithoutTimeZoneNanos(v, p)` (zone-independent, UTC wall-clock grid). The fractional-second precision `p` is taken from the source type; sub-`p` digits are floored and trailing zeros are trimmed, consistent with the microsecond cast path (both use `FractionTimestampFormatter`). `Cast.needsTimeZone` is extended so that `TimestampLTZNanosType -> StringType` resolves the session time zone (mirroring `TimestampType -> StringType`); the NTZ variant does not need a time zone. ### Why are the changes needed? Today `Cast` permits these casts at analysis time (the generic `(_, StringType)` rule), but at runtime the nanosecond types have no dedicated case in `ToStringBase` and fall through to the default `String.valueOf(...)` branch, producing the internal form `TimestampNanosVal(epochMicros, nanosWithinMicro)` instead of a proper SQL timestamp string. A correct textual representation is a prerequisite for nanosecond support in expressions, SHOW/pretty output, and downstream text-based sinks. ### Does this PR introduce _any_ user-facing change? User-facing only when `spark.sql.timestampNanosTypes.enabled=true`; these types are not available otherwise. Casting to string never fails, so ANSI and non-ANSI modes behave identically. With `spark.sql.timestampNanosTypes.enabled=true`: ``` SELECT CAST(ts AS STRING); -- TIMESTAMP_NTZ(9) value 2020-01-01 00:00:00.123456789 -- before: TimestampNanosVal(1577836800000000, 789) -- after: 2020-01-01 00:00:00.123456789 ``` ### How was this patch tested? New cases in `CastSuiteBase` (run under both ANSI on/off; `checkEvaluation` exercises the interpreted and codegen paths): precision 7/8/9, trailing-zero trimming, `nanosWithinMicro` 0 and 999, LTZ time-zone shift under a non-UTC session zone vs. NTZ remaining unshifted, pre-epoch and year-9999 boundaries, and null input. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor
… string Co-authored-by: Max Gekk
…tamp types Co-authored-by: Max Gekk
…-string sweep Co-authored-by: Max Gekk
… complex-type cast coverage Rename the two nanosecond pretty-string tests in ToPrettyStringSuite with the `SPARK-57256:` prefix for traceability, matching the CastSuiteBase tests. Add a cast-to-string test for complex types nesting nanosecond timestamps: array<timestamp_ntz_nanos> / array<timestamp_ltz_nanos> (with a null element to cover the recursive nullString path) and a struct nesting both variants.
…d add SQL tests After rebasing onto the merged SPARK-57257, master routes interpreted CAST(nanos AS STRING) through the Types Framework's zone-less format(), which deliberately raised UNSUPPORTED_FEATURE.TIMESTAMP_NANOS_TO_STRING as a placeholder. That shadowed the zone-aware formatting added here. Bypass TypeApiOps for the nanosecond timestamp types in ToStringBase.castToString so they use the zone-aware castToStringDefault cases (LTZ renders in the session time zone, mirroring the microsecond timestamp types). Remove the now-dead codegen error case and drop the obsolete TimestampNanosRowSuite test that asserted the error; positive cast coverage lives in CastSuiteBase/ToPrettyStringSuite. The framework format()/toSQLValue() still raise for the zone-less EXPLAIN / SQL-literal paths. Add end-to-end golden-file checks to cast.sql now that SPARK-57257 wired HiveResult: precision flooring, trailing-zero trimming, nanosWithinMicro boundaries, pre-1970 negative-epoch, complex types with nested NULL, top-level NULL, and string-context use. All result columns are STRING. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.7.0
a50de21 to
ec68a98
Compare
|
@uros-b Now that #56320 (SPARK-57257, |
What changes were proposed in this pull request?
Implement casting of the nanosecond-precision timestamp types
TIMESTAMP_NTZ(p)(TimestampNTZNanosType) andTIMESTAMP_LTZ(p)(TimestampLTZNanosType),pin [7, 9], toSTRING.Casting is implemented in
ToStringBase(mixed intoCast), so this change also fixesToPrettyString(and thereforeDataset.show()) for these types via the shared base.The change wires the SPARK-57162 formatter methods into the existing cast-to-string paths (interpreted and codegen):
TimestampLTZNanosType(p)->TimestampFormatter.formatNanos(v, p)(renders in the session time zone).TimestampNTZNanosType(p)->TimestampFormatter.formatWithoutTimeZoneNanos(v, p)(zone-independent, UTC wall-clock grid).The fractional-second precision
pis taken from the source type; sub-pdigits are floored and trailing zeros are trimmed, consistent with the microsecond cast path (both useFractionTimestampFormatter).Cast.needsTimeZoneis extended so thatTimestampLTZNanosType -> StringTyperesolves the session time zone (mirroringTimestampType -> StringType); the NTZ variant does not need a time zone.Why are the changes needed?
Today
Castpermits these casts at analysis time (the generic(_, StringType)rule), but at runtime the nanosecond types have no dedicated case inToStringBaseand fall through to the defaultString.valueOf(...)branch, producing the internal formTimestampNanosVal(epochMicros, nanosWithinMicro)instead of a proper SQL timestamp string. Producing a correct textual representation is a prerequisite for nanosecond support in expressions, SHOW/pretty output, and downstream text-based sinks.Does this PR introduce any user-facing change?
User-facing only when
spark.sql.timestampNanosTypes.enabled=true; these types are not available otherwise. Casting to string never fails, so ANSI and non-ANSI modes behave identically.With
spark.sql.timestampNanosTypes.enabled=true:How was this patch tested?
New cases in
CastSuiteBase(run under both ANSI on/off;checkEvaluationexercises the interpreted and codegen paths): precision 7/8/9, trailing-zero trimming,nanosWithinMicro0 and 999, LTZ time-zone shift under a non-UTC session zone vs. NTZ remaining unshifted, pre-epoch and year-9999 boundaries, and null input.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor