
Compressing VarBinArray to ConstantArray loses info about offsets_ptype #1021

@XinyuZeng


Hi Vortex, I am not sure whether this is the desired behavior. For example, if we compress a LargeBinary or LargeUtf8 Arrow array into Vortex's ConstantArray and then canonicalize it back, we get a Binary or Utf8 Arrow array instead. This is because VarBinArray::from_iter always uses the u32 offsets builder:

https://github.com/spiraldb/vortex/blob/e75606de2624a9c5b73ee0176fb56582fad9aebe/vortex-array/src/array/varbin/mod.rs#L164
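To make the round trip described above concrete, here is a minimal toy model in plain Rust. It does not use Vortex or Arrow; every type and function in it (`OffsetWidth`, `ToyVarBin`, `ToyConstant`, `compress`, `canonicalize`, `round_trip_offsets`) is hypothetical and only mimics the reported behavior, assuming the constant array keeps just the scalar and a length.

```rust
// Toy model only: these types are hypothetical and are NOT Vortex's API.

/// Width of the offsets buffer backing a variable-length binary array.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OffsetWidth {
    U32, // corresponds to Arrow Binary / Utf8
    U64, // corresponds to Arrow LargeBinary / LargeUtf8
}

/// A variable-length binary array: its values plus the width of its offsets.
#[derive(Debug, Clone, PartialEq)]
struct ToyVarBin {
    values: Vec<Vec<u8>>,
    offsets: OffsetWidth,
}

/// A constant array keeps only the repeated scalar and a length,
/// so the original offset width is not recorded anywhere (assumption).
#[derive(Debug, Clone, PartialEq)]
struct ToyConstant {
    scalar: Vec<u8>,
    len: usize,
}

impl ToyVarBin {
    /// Mimics the reported behavior: building from an iterator always picks u32 offsets.
    fn from_iter<I: IntoIterator<Item = Vec<u8>>>(iter: I) -> Self {
        ToyVarBin {
            values: iter.into_iter().collect(),
            offsets: OffsetWidth::U32,
        }
    }
}

/// Compress to a constant array when the array is non-empty and all values are equal.
fn compress(array: &ToyVarBin) -> Option<ToyConstant> {
    let first = array.values.first()?;
    if array.values.iter().all(|v| v == first) {
        Some(ToyConstant { scalar: first.clone(), len: array.values.len() })
    } else {
        None
    }
}

/// Canonicalize by re-materializing the scalar `len` times through `from_iter`.
fn canonicalize(constant: &ToyConstant) -> ToyVarBin {
    ToyVarBin::from_iter(std::iter::repeat(constant.scalar.clone()).take(constant.len))
}

/// Round-trip a constant-valued array with the given offset width
/// and report the offset width that comes back out.
fn round_trip_offsets(width: OffsetWidth) -> OffsetWidth {
    let original = ToyVarBin { values: vec![b"N".to_vec(); 3], offsets: width };
    let constant = compress(&original).expect("all values are equal");
    canonicalize(&constant).offsets
}
```

In this sketch, `round_trip_offsets(OffsetWidth::U64)` comes back as `OffsetWidth::U32`: the values survive, but the offset width does not, because nothing in the constant representation carries it.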

This can be reproduced by running the round_trip_arrow_compressed test. The test is currently ignored, but Arrow now supports comparing Structs:
https://github.com/spiraldb/vortex/blob/e75606de2624a9c5b73ee0176fb56582fad9aebe/bench-vortex/src/lib.rs#L264-L268

The taxi dataset has a field, store_and_fwd_flag, whose value is mostly N. It is reasonable for a ConstantArray to just use u32 offsets, but if we have a ChunkedArray where the first chunk is Constant and the second chunk is not, could we end up with an inconsistent Arrow schema between the output RecordBatches? (Though this may really be a problem of Arrow lacking a logical type.)
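The chunked case raised above can be sketched the same way. This is again a toy model in plain Rust, not Vortex's or Arrow's API: `ArrowType`, `ToyChunk`, `exported_type`, `schema_is_consistent`, and `example_chunks` are all hypothetical names, and the sketch assumes a constant chunk always exports with u32 offsets while other chunks keep their original type.

```rust
// Hypothetical toy model, NOT Vortex's or Arrow's API.

/// The Arrow string type a chunk would be exported as.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrowType {
    Utf8,      // u32 offsets
    LargeUtf8, // u64 offsets
}

/// One chunk of a chunked string column.
#[derive(Debug, Clone)]
enum ToyChunk {
    /// Constant chunk: only the scalar and length survive compression.
    Constant { scalar: String, len: usize },
    /// Non-constant chunk: keeps the Arrow type it was created with.
    VarBin { values: Vec<String>, arrow_type: ArrowType },
}

/// Arrow type each chunk would export as, under the assumption stated above.
fn exported_type(chunk: &ToyChunk) -> ArrowType {
    match chunk {
        // Assumption: canonicalizing a constant always yields u32 offsets.
        ToyChunk::Constant { .. } => ArrowType::Utf8,
        ToyChunk::VarBin { arrow_type, .. } => *arrow_type,
    }
}

/// True when every chunk exports the same Arrow type (vacuously true if empty).
fn schema_is_consistent(chunks: &[ToyChunk]) -> bool {
    let mut types = chunks.iter().map(exported_type);
    match types.next() {
        Some(first) => types.all(|t| t == first),
        None => true,
    }
}

/// A LargeUtf8 column split into a constant chunk followed by a mixed chunk.
fn example_chunks() -> Vec<ToyChunk> {
    vec![
        ToyChunk::Constant { scalar: "N".to_string(), len: 4 },
        ToyChunk::VarBin {
            values: vec!["N".to_string(), "Y".to_string()],
            arrow_type: ArrowType::LargeUtf8,
        },
    ]
}
```

Under these assumptions, `schema_is_consistent(&example_chunks())` is false: the first chunk would export as Utf8 and the second as LargeUtf8, which is the mismatch between RecordBatches that the question is about.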
