AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546

philippeVerney · 2025-11-13T08:54:16Z

What is the purpose of the change

This pull request improves std::vector decoding by allocating memory block by block instead of a default memory allocation strategy.

Verifying this change

This change is already covered by existing tests, such as testArray() in test/SpecificTests.cc

Documentation

Does this pull request introduce a new feature? no

stephanlachnit · 2025-11-17T10:28:30Z

Have you actually tested if this is actually faster? Calling reserve in a loop looks to me it could actually make it slower since it since it reserves exactly, leading to a potential reallocation every loop iteration, while push_back will exponentially over-reserve, leading to potentially less reallocation. See https://www.youtube.com/watch?v=algDLvbl1YY for an in-depth video about this.

philippeVerney · 2025-11-17T10:57:36Z

The short and rigorous answer is : no.

However, it would be hard to make a clear answer about if it is faster or slower because, as often, it depends...
For example, if one decides to encode a very large array in blocks of one item, this change would be dramatic.
Inversely, if one decides to encode a very large array in a single block of all items, this change would be perfect.
As thumb rule, more you use blocks in your array, and worst will be this change.

Now, if we try to guess what AVRO users will do (knowing that they can do exactly what they want in the end), I think the current AVRO C++ code encodes an array as a single block which is the "perfect" case for this change. See

avro/lang/c++/include/avro/Specific.hh

Lines 238 to 249 in 9110c69

    
           static void encode(Encoder &e, const std::vector<T> &b) { 
        
               e.arrayStart(); 
        
               if (!b.empty()) { 
        
                   e.setItemCount(b.size()); 
        
                   for (typename std::vector<T>::const_iterator it = b.begin(); 
        
                        it != b.end(); ++it) { 
        
                       e.startItem(); 
        
                       avro::encode(e, *it); 
        
                   } 
        
               } 
        
               e.arrayEnd(); 
        
           }

where we only see a single call to e.setItemCount(b.size());
Consequently, when using the current AVRO code to encode arrays, you end up with a single block array which will make the "problematic" decoding loop irrelevant.

I think that this change is clearly faster for the default usage of the AVRO C++ library. However, it has indeed drawbacks and can be worst for some exotic and/or proprietary and/or future array encodings.

stephanlachnit · 2025-11-17T13:03:33Z

Right, given the encoder this then makes sense.

Alternative you could iterate twice:

s.clear();
std::size_t new_size = 0;
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    new_size += n;
}
s.reserve(new_size);
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    for (std::size_t i = 0; i < n; ++i) {
        T t;
        avro::decode(d, t);
        s.emplace_back(std::move(t));
    }
}

But since I don't know how costly the arrayStart() method is, this might as well be even more expensive.

philippeVerney · 2025-11-17T16:35:40Z

Indeed, this is a more general solution but it would need either to read twice the entire array or to "seekg" (i.e. to set the input position indicator of the input stream) several times which may in both cases impact performance. In both cases, we would need to seekg to the beginning of the array before the second iteration.
We would also need to change the underlying InputStream to an SeekableInputStream if I understand correctly.

The change would look more heavy, further from the corresponding JIRA issue and, I don't know on top of my head if it would generally be faster or not.
For a single block array which is the current encoding implementation, I guess it would be a little bit slower cause of the seekg cost.

RyanSkraba · 2025-12-02T19:55:32Z

Thanks for this contribution :D I'm not a C++ dev!

Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?

@stephanlachnit I see a 👍 -- are you happy with the PR?

marlowa · 2025-12-02T20:39:30Z

First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised.

philippeVerney · 2025-12-03T10:33:25Z

First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised.

Mentally "analysed" : yes on my side as discussed in my above post.
Analysed implementing a benchmark : no on my side. To me it does not deserve the time I would have to spend on it. But happy if someone else makes it.

philippeVerney · 2025-12-03T10:38:04Z

Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?

IMHO, it would be ideal to be cherry picked on other dev branches since it generally provide only advantages. But this is not required.
I don't know if it exists official rules to decide if it deserves a cherry pick or not. If so I am happy to help answering questions about these rules which you can point me to.

KalleOlaviNiemitalo · 2025-12-03T14:26:33Z

IMO it should be 1.13.0 only. Preallocating like this makes it more vulnerable to malicious data that claims to have a huge number of array elements and then causes the library to allocate a lot of memory, despite the attacker not spending any resources to actually send that much data. #3403 disclaims responsibility on using the library on untrusted data but users of older branches, which were published before that text was added, may still be doing that. Moreover, https://issues.apache.org/jira/plugins/servlet/mobile#issue/AVRO-4134 is categorised as a trivial improvement, rather than a bug.

marlowa · 2025-12-04T07:53:48Z

As the person who originally raised this issue, I want to chip in to agree with KalleOlavNiemitalo. It makes sense to do the change in or beyond the version where that documentation was changed, rather than earlier. I can wait for 1.13.0, that's fine.

Allocate vector block by block when decoding

cc4bbe9

github-actions bot added the C++ Pull Requests for C++ binding label Nov 13, 2025

martin-g approved these changes Nov 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546

AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546

Uh oh!

philippeVerney commented Nov 13, 2025

Uh oh!

stephanlachnit commented Nov 17, 2025 •

edited

Loading

Uh oh!

philippeVerney commented Nov 17, 2025 •

edited

Loading

Uh oh!

stephanlachnit commented Nov 17, 2025

Uh oh!

philippeVerney commented Nov 17, 2025

Uh oh!

RyanSkraba commented Dec 2, 2025

Uh oh!

marlowa commented Dec 2, 2025

Uh oh!

philippeVerney commented Dec 3, 2025 •

edited

Loading

Uh oh!

philippeVerney commented Dec 3, 2025 •

edited

Loading

Uh oh!

KalleOlaviNiemitalo commented Dec 3, 2025

Uh oh!

marlowa commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546

Are you sure you want to change the base?

AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546

Uh oh!

Conversation

philippeVerney commented Nov 13, 2025

What is the purpose of the change

Verifying this change

Documentation

Uh oh!

stephanlachnit commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

philippeVerney commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephanlachnit commented Nov 17, 2025

Uh oh!

philippeVerney commented Nov 17, 2025

Uh oh!

RyanSkraba commented Dec 2, 2025

Uh oh!

marlowa commented Dec 2, 2025

Uh oh!

philippeVerney commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

philippeVerney commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

KalleOlaviNiemitalo commented Dec 3, 2025

Uh oh!

marlowa commented Dec 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

stephanlachnit commented Nov 17, 2025 •

edited

Loading

philippeVerney commented Nov 17, 2025 •

edited

Loading

philippeVerney commented Dec 3, 2025 •

edited

Loading

philippeVerney commented Dec 3, 2025 •

edited

Loading