GH-41017: [C++] Preserve ordered flag in DictionaryBuilder#49797
GH-41017: [C++] Preserve ordered flag in DictionaryBuilder#49797tinezivic wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
7960ab8 to
d3323b0
Compare
|
|
paleolimbot
left a comment
There was a problem hiding this comment.
Just a note that this will probably also solve #49674 ...there is a unit test in that PR that would be good to include here if this approach seems reasonable.
I am wondering if there is some way to set the ordering of the values in a dictionary builder when using ordered = True, or otherwise the result will be an array of values probably ordered by where each value was first seen. I am not sure there is a realistic use case for building an ordered dictionary where the values aren't known in advance (but perhaps that is possible with the DictionaryBuilder).
DictionaryBuilderBase::type() did not pass the ordered parameter to ::arrow::dictionary(), causing it to always default to false. This meant that building a DictionaryArray via MakeBuilder or MakeDictionaryBuilder with an ordered DictionaryType would produce an array with ordered=false. Fix: Add ordered_ member and set_ordered() to DictionaryBuilderBase (both the primary template and the NullType specialization). The DictionaryBuilderCase in builder.cc now propagates the ordered flag from the input DictionaryType to the builder after construction. Generated-by: GitHub Copilot
d3323b0 to
03c4560
Compare
|
Thank you for the review and for pointing to #49798! Regarding the test from #49798: I have added Regarding ordering of values: You are right to flag this. The |
Rationale for this change
DictionaryBuilderBase::type()does not pass theorderedparameter to::arrow::dictionary(), causing it to always default tofalse. This means that anyDictionaryArraybuilt throughMakeBuilder()orMakeDictionaryBuilder()with an orderedDictionaryTypewill produce an array wheretype().ordered() == false.This affects PyArrow users:
pa.array(data, type=pa.dictionary(pa.int8(), pa.string(), ordered=True))returns an array withordered=False.Reported in: #41017
Also caused: pandas-dev/pandas#58152
What changes are included in this PR?
bool ordered_ = falsemember andset_ordered()method toDictionaryBuilderBase(both the primary template and theNullTypespecialization)ordered_to::arrow::dictionary()intype()andFinishInternal()bool orderedfield toDictionaryBuilderCaseinbuilder.ccdict_type.ordered()throughMakeBuilderImpl::Visit()andMakeDictionaryBuilder()Are these changes tested?
Yes. Added 3 C++ tests in
array_dict_test.cc:MakeBuilderPreservesOrdered— verifiesMakeBuilderwith ordered dict type produces ordered arrayMakeBuilderUnorderedByDefault— verifies unordered stays unorderedMakeDictionaryBuilderPreservesOrdered— verifiesMakeDictionaryBuilderpreserves orderedAre there any user-facing changes?
No API changes.
DictionaryBuildernow correctly preserves theorderedflag from the inputDictionaryType, fixing the silent data loss.AI Disclosure
This fix was developed with the assistance of GitHub Copilot (Claude Sonnet 4.6). The AI assisted with code generation, C++ template structure, and diff review. The contributor identified the root cause, understood the DictionaryBuilder internals, and verified the fix logic.