-
Notifications
You must be signed in to change notification settings - Fork 1.7k
AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546
Conversation
|
Have you actually tested if this is actually faster? Calling |
|
The short and rigorous answer is : no. However, it would be hard to make a clear answer about if it is faster or slower because, as often, it depends... Now, if we try to guess what AVRO users will do (knowing that they can do exactly what they want in the end), I think the current AVRO C++ code encodes an array as a single block which is the "perfect" case for this change. See avro/lang/c++/include/avro/Specific.hh Lines 238 to 249 in 9110c69
e.setItemCount(b.size());Consequently, when using the current AVRO code to encode arrays, you end up with a single block array which will make the "problematic" decoding loop irrelevant. I think that this change is clearly faster for the default usage of the AVRO C++ library. However, it has indeed drawbacks and can be worst for some exotic and/or proprietary and/or future array encodings. |
|
Right, given the encoder this then makes sense. Alternative you could iterate twice: s.clear();
std::size_t new_size = 0;
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
new_size += n;
}
s.reserve(new_size);
for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()) {
for (std::size_t i = 0; i < n; ++i) {
T t;
avro::decode(d, t);
s.emplace_back(std::move(t));
}
}But since I don't know how costly the |
|
Indeed, this is a more general solution but it would need either to read twice the entire array or to "seekg" (i.e. to set the input position indicator of the input stream) several times which may in both cases impact performance. In both cases, we would need to seekg to the beginning of the array before the second iteration. The change would look more heavy, further from the corresponding JIRA issue and, I don't know on top of my head if it would generally be faster or not. |
|
Thanks for this contribution :D I'm not a C++ dev! Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0? @stephanlachnit I see a 👍 -- are you happy with the PR? |
|
First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised. |
Mentally "analysed" : yes on my side as discussed in my above post. |
IMHO, it would be ideal to be cherry picked on other dev branches since it generally provide only advantages. But this is not required. |
|
IMO it should be 1.13.0 only. Preallocating like this makes it more vulnerable to malicious data that claims to have a huge number of array elements and then causes the library to allocate a lot of memory, despite the attacker not spending any resources to actually send that much data. #3403 disclaims responsibility on using the library on untrusted data but users of older branches, which were published before that text was added, may still be doing that. Moreover, https://issues.apache.org/jira/plugins/servlet/mobile#issue/AVRO-4134 is categorised as a trivial improvement, rather than a bug. |
|
As the person who originally raised this issue, I want to chip in to agree with KalleOlavNiemitalo. It makes sense to do the change in or beyond the version where that documentation was changed, rather than earlier. I can wait for 1.13.0, that's fine. |
What is the purpose of the change
This pull request improves std::vector decoding by allocating memory block by block instead of a default memory allocation strategy.
Verifying this change
This change is already covered by existing tests, such as testArray() in test/SpecificTests.cc
Documentation