Describe the bug
SparkWidthBucket::return_type returns Int32, but Spark's WidthBucket.dataType is LongType:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1825
// datafusion/spark/src/function/math/width_bucket.rs
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
    Ok(Int32)
}
The n_bucket input was aligned to i64 to match Spark in #20330, but the return type was left as Int32. The kernel still builds Int32Array.
This produces wrong results in any consumer that plans against Spark's declared output type (Int64) but receives an Int32Array at runtime. With two rows per batch, the consumer reads 16 bytes as two Int64 values from an 8-byte Int32 buffer: the first Int64 packs both int32 values into a single int64, and the second is read from uninitialized bytes.
Concretely, for width_bucket(value, 0.0, 10.0, 5) over Range(0, 10) split into 5 partitions of 2 rows each:
| value | expected (Int64) | observed |
| --- | --- | --- |
| 0 | 1 | 4294967297 (= 0x1_00000001) |
| 1 | 1 | 0 |
| 2 | 2 | 8589934594 (= 0x2_00000002) |
| 3 | 2 | 0 |
| ... | ... | ... |
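The observed values can be reproduced without DataFusion. The following is a minimal sketch, assuming a little-endian layout; `misread_as_i64` is a hypothetical helper written for this report, not part of any library. It only models the first 8 bytes of each batch, since the second Int64 comes from bytes outside the Int32 data.

```rust
/// Hypothetical helper: lays out i32 values as a little-endian byte buffer
/// (as an Int32 values buffer would be) and reads it back as i64 values,
/// the way a consumer expecting Int64 would.
fn misread_as_i64(values: &[i32]) -> Vec<i64> {
    // Serialize the i32 values back to back, 4 bytes each.
    let mut bytes: Vec<u8> = Vec::with_capacity(values.len() * 4);
    for v in values {
        bytes.extend_from_slice(&v.to_le_bytes());
    }

    // Read the same bytes in 8-byte steps; a trailing partial chunk is skipped.
    bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            i64::from_le_bytes(buf)
        })
        .collect()
}
```

For the first batch (buckets 1 and 1), `misread_as_i64(&[1, 1])` yields `[4294967297]`; for the second batch (buckets 2 and 2), `misread_as_i64(&[2, 2])` yields `[8589934594]`, matching the observed column.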
To Reproduce
Run any consumer that respects Spark's declared LongType for WidthBucket against SparkWidthBucket. Reproduces in DataFusion Comet on the width_bucket - with range data test in CometMathExpressionSuite (apache/datafusion-comet#4347).
Expected behavior
SparkWidthBucket::return_type returns Int64 and the kernel builds Int64Array, matching Spark.
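As an illustration of the expected output type only, here is a simplified scalar sketch that returns `i64`. It is not the DataFusion kernel: `width_bucket_sketch` is a hypothetical name, and the sketch assumes an ascending range (`min < max`), a positive bucket count, and no nulls.

```rust
/// Hypothetical, simplified bucket computation for an ascending range.
/// Assumes min < max and n_bucket > 0; null handling, descending ranges,
/// and invalid inputs are deliberately left out.
/// The result is i64 so that it lines up with Spark's LongType.
fn width_bucket_sketch(value: f64, min: f64, max: f64, n_bucket: i64) -> i64 {
    if value < min {
        // Below the range: underflow bucket.
        0
    } else if value >= max {
        // At or above the upper bound: overflow bucket.
        n_bucket + 1
    } else {
        // Inside the range: equal-width buckets numbered from 1.
        (n_bucket as f64 * (value - min) / (max - min)).floor() as i64 + 1
    }
}
```

Applied to the values 0 through 9 with `min = 0.0`, `max = 10.0`, and `n_bucket = 5`, this gives 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, which agrees with the expected column in the table.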
Additional context
Related: #20330 (input parameter alignment).