Skip to content

Conversation

@philippeVerney
Copy link

What is the purpose of the change

This pull request improves std::vector decoding by allocating memory block by block instead of a default memory allocation strategy.

Verifying this change

This change is already covered by existing tests, such as testArray() in test/SpecificTests.cc

Documentation

  • Does this pull request introduce a new feature? no

@github-actions github-actions bot added the C++ Pull Requests for C++ binding label Nov 13, 2025
@stephanlachnit
Copy link
Contributor

stephanlachnit commented Nov 17, 2025

Have you actually tested if this is actually faster? Calling reserve in a loop looks to me it could actually make it slower since it since it reserves exactly, leading to a potential reallocation every loop iteration, while push_back will exponentially over-reserve, leading to potentially less reallocation. See https://www.youtube.com/watch?v=algDLvbl1YY for an in-depth video about this.

@philippeVerney
Copy link
Author

philippeVerney commented Nov 17, 2025

The short and rigorous answer is : no.

However, it would be hard to make a clear answer about if it is faster or slower because, as often, it depends...
For example, if one decides to encode a very large array in blocks of one item, this change would be dramatic.
Inversely, if one decides to encode a very large array in a single block of all items, this change would be perfect.
As thumb rule, more you use blocks in your array, and worst will be this change.

Now, if we try to guess what AVRO users will do (knowing that they can do exactly what they want in the end), I think the current AVRO C++ code encodes an array as a single block which is the "perfect" case for this change. See

static void encode(Encoder &e, const std::vector<T> &b) {
e.arrayStart();
if (!b.empty()) {
e.setItemCount(b.size());
for (typename std::vector<T>::const_iterator it = b.begin();
it != b.end(); ++it) {
e.startItem();
avro::encode(e, *it);
}
}
e.arrayEnd();
}
where we only see a single call to e.setItemCount(b.size());
Consequently, when using the current AVRO code to encode arrays, you end up with a single block array which will make the "problematic" decoding loop irrelevant.

I think that this change is clearly faster for the default usage of the AVRO C++ library. However, it has indeed drawbacks and can be worst for some exotic and/or proprietary and/or future array encodings.

@stephanlachnit
Copy link
Contributor

Right, given the encoder this then makes sense.

Alternative you could iterate twice:

s.clear();
std::size_t new_size = 0;
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    new_size += n;
}
s.reserve(new_size);
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
    for (std::size_t i = 0; i < n; ++i) {
        T t;
        avro::decode(d, t);
        s.emplace_back(std::move(t));
    }
}

But since I don't know how costly the arrayStart() method is, this might as well be even more expensive.

@philippeVerney
Copy link
Author

Indeed, this is a more general solution but it would need either to read twice the entire array or to "seekg" (i.e. to set the input position indicator of the input stream) several times which may in both cases impact performance. In both cases, we would need to seekg to the beginning of the array before the second iteration.
We would also need to change the underlying InputStream to an SeekableInputStream if I understand correctly.

The change would look more heavy, further from the corresponding JIRA issue and, I don't know on top of my head if it would generally be faster or not.
For a single block array which is the current encoding implementation, I guess it would be a little bit slower cause of the seekg cost.

@RyanSkraba
Copy link
Contributor

Thanks for this contribution :D I'm not a C++ dev!

Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?

@stephanlachnit I see a 👍 -- are you happy with the PR?

@marlowa
Copy link
Contributor

marlowa commented Dec 2, 2025

First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised.

@philippeVerney
Copy link
Author

philippeVerney commented Dec 3, 2025

First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised.

Mentally "analysed" : yes on my side as discussed in my above post.
Analysed implementing a benchmark : no on my side. To me it does not deserve the time I would have to spend on it. But happy if someone else makes it.

@philippeVerney
Copy link
Author

philippeVerney commented Dec 3, 2025

Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0?

IMHO, it would be ideal to be cherry picked on other dev branches since it generally provide only advantages. But this is not required.
I don't know if it exists official rules to decide if it deserves a cherry pick or not. If so I am happy to help answering questions about these rules which you can point me to.

@KalleOlaviNiemitalo
Copy link
Contributor

IMO it should be 1.13.0 only. Preallocating like this makes it more vulnerable to malicious data that claims to have a huge number of array elements and then causes the library to allocate a lot of memory, despite the attacker not spending any resources to actually send that much data. #3403 disclaims responsibility on using the library on untrusted data but users of older branches, which were published before that text was added, may still be doing that. Moreover, https://issues.apache.org/jira/plugins/servlet/mobile#issue/AVRO-4134 is categorised as a trivial improvement, rather than a bug.

@marlowa
Copy link
Contributor

marlowa commented Dec 4, 2025

As the person who originally raised this issue, I want to chip in to agree with KalleOlavNiemitalo. It makes sense to do the change in or beyond the version where that documentation was changed, rather than earlier. I can wait for 1.13.0, that's fine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C++ Pull Requests for C++ binding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants