iceberg: harden case_fold and add non-null-term coverage#30579
Open
nvartolomei wants to merge 1 commit into
Open
iceberg: harden case_fold and add non-null-term coverage#30579nvartolomei wants to merge 1 commit into
nvartolomei wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens Iceberg’s Unicode case folding used by iceberg::names_equal when field_name_comparison::lower_case is selected, aiming to make comparisons safe for arbitrary std::string_view inputs and to surface UTF-8/alloc failures rather than silently producing misleading results.
Changes:
- Removes
UTF8PROC_NULLTERMusage soutf8proc_maprespects the providedstring_viewlength. - Refactors the folding output ownership to avoid extra
std::stringcopies by using an RAII wrapper around themalloc-allocated buffer. - Adds unit test coverage for non-null-terminated
std::string_viewinputs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/v/iceberg/unicode.h | Documents the updated names_equal exception behavior for case-folding. |
| src/v/iceberg/unicode.cc | Removes null-terminated folding behavior, introduces RAII for utf8proc output, and changes error handling to throw. |
| src/v/iceberg/tests/unicode_test.cc | Adds regression coverage ensuring folding respects string_view length (non-null-terminated view). |
Comments suppressed due to low confidence (2)
src/v/iceberg/unicode.cc:74
names_equal'sswitch (norm)has nodefaultand the function can fall off the end without returning a value if an unexpectedfield_name_comparisonvalue is passed (e.g. from deserialization or an uninitialized value). Add adefault/catch-all path (e.g.__builtin_unreachable()or throw) and ensure the function always returns abool.
switch (norm) {
case field_name_comparison::verbatim:
return a == b;
case field_name_comparison::lower_case:
return case_fold(a).view() == case_fold(b).view();
src/v/iceberg/unicode.cc:56
utf8proc_maptakes autf8proc_ssize_tlength, buts.size()issize_tand is cast directly. For very large inputs this can overflow/truncate and pass a negative/incorrect length toutf8proc_map. Add an explicit bounds check againststd::numeric_limits<utf8proc_ssize_t>::max()and fail fast (throw) if the input is too large.
const utf8proc_ssize_t len = utf8proc_map(
reinterpret_cast<const uint8_t*>(s.data()),
static_cast<utf8proc_ssize_t>(s.size()),
&result,
wdberkeley
previously approved these changes
May 21, 2026
cb655cd to
a1bbae3
Compare
Contributor
Author
|
Force-pushed
Did not address Copilot's other suggestions:
|
Collaborator
CI test resultstest results on build#84846
test results on build#84888
|
Several related improvements to the Unicode case-folding used by field_name_comparison::lower_case in names_equal: - Drop UTF8PROC_NULLTERM from utf8proc_map. With that flag the size argument is ignored and utf8proc reads until it finds a NUL, which is unsafe for arbitrary string_view inputs. Add a regression test using a non-null-terminated view. - Stop copying the folded result into a std::string. utf8proc_map allocates the buffer with malloc; wrap it in a unique_ptr with a free_deleter and expose it via a small RAII folded_string class. names_equal now compares views into those owners directly. - Surface errors instead of silently mis-comparing: throw std::runtime_error carrying utf8proc_errmsg on invalid UTF-8 and std::bad_alloc on allocation failure. Document the contract in the public header.
a1bbae3 to
2c11dc4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Several related improvements to the Unicode case-folding used by
field_name_comparison::lower_caseiniceberg::names_equal:UTF8PROC_NULLTERMfromutf8proc_map. With that flag the size argument is ignored and utf8proc reads until it finds a NUL, which is unsafe for arbitrarystring_viewinputs. Added a regression test using a non-null-terminated view.std::string.utf8proc_mapallocates the buffer withmalloc; it's now wrapped in aunique_ptrwith afree_deleterand exposed via a small RAIIfolded_stringclass.names_equalcompares views into those owners directly.std::runtime_errorcarryingutf8proc_errmsgon invalid UTF-8 andstd::bad_allocon allocation failure. The contract is documented in the public header.Backports Required
Release Notes