Skip to content

iceberg: harden case_fold and add non-null-term coverage#30579

Open
nvartolomei wants to merge 1 commit into
redpanda-data:devfrom
nvartolomei:nv/iceberg-harden-case-fold
Open

iceberg: harden case_fold and add non-null-term coverage#30579
nvartolomei wants to merge 1 commit into
redpanda-data:devfrom
nvartolomei:nv/iceberg-harden-case-fold

Conversation

@nvartolomei
Copy link
Copy Markdown
Contributor

@nvartolomei nvartolomei commented May 21, 2026

Summary

Several related improvements to the Unicode case-folding used by field_name_comparison::lower_case in iceberg::names_equal:

  • Drop UTF8PROC_NULLTERM from utf8proc_map. With that flag the size argument is ignored and utf8proc reads until it finds a NUL, which is unsafe for arbitrary string_view inputs. Added a regression test using a non-null-terminated view.
  • Stop copying the folded result into a std::string. utf8proc_map allocates the buffer with malloc; it's now wrapped in a unique_ptr with a free_deleter and exposed via a small RAII folded_string class. names_equal compares views into those owners directly.
  • Surface errors instead of silently mis-comparing. Throws std::runtime_error carrying utf8proc_errmsg on invalid UTF-8 and std::bad_alloc on allocation failure. The contract is documented in the public header.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

Copilot AI review requested due to automatic review settings May 21, 2026 20:25
@nvartolomei nvartolomei requested a review from wdberkeley May 21, 2026 20:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Iceberg’s Unicode case folding used by iceberg::names_equal when field_name_comparison::lower_case is selected, aiming to make comparisons safe for arbitrary std::string_view inputs and to surface UTF-8/alloc failures rather than silently producing misleading results.

Changes:

  • Removes UTF8PROC_NULLTERM usage so utf8proc_map respects the provided string_view length.
  • Refactors the folding output ownership to avoid extra std::string copies by using an RAII wrapper around the malloc-allocated buffer.
  • Adds unit test coverage for non-null-terminated std::string_view inputs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/v/iceberg/unicode.h Documents the updated names_equal exception behavior for case-folding.
src/v/iceberg/unicode.cc Removes null-terminated folding behavior, introduces RAII for utf8proc output, and changes error handling to throw.
src/v/iceberg/tests/unicode_test.cc Adds regression coverage ensuring folding respects string_view length (non-null-terminated view).
Comments suppressed due to low confidence (2)

src/v/iceberg/unicode.cc:74

  • names_equal's switch (norm) has no default and the function can fall off the end without returning a value if an unexpected field_name_comparison value is passed (e.g. from deserialization or an uninitialized value). Add a default/catch-all path (e.g. __builtin_unreachable() or throw) and ensure the function always returns a bool.
    switch (norm) {
    case field_name_comparison::verbatim:
        return a == b;
    case field_name_comparison::lower_case:
        return case_fold(a).view() == case_fold(b).view();

src/v/iceberg/unicode.cc:56

  • utf8proc_map takes a utf8proc_ssize_t length, but s.size() is size_t and is cast directly. For very large inputs this can overflow/truncate and pass a negative/incorrect length to utf8proc_map. Add an explicit bounds check against std::numeric_limits<utf8proc_ssize_t>::max() and fail fast (throw) if the input is too large.
    const utf8proc_ssize_t len = utf8proc_map(
      reinterpret_cast<const uint8_t*>(s.data()),
      static_cast<utf8proc_ssize_t>(s.size()),
      &result,

Comment thread src/v/iceberg/unicode.cc
Comment thread src/v/iceberg/unicode.h Outdated
Comment thread src/v/iceberg/tests/unicode_test.cc
wdberkeley
wdberkeley previously approved these changes May 21, 2026
@nvartolomei
Copy link
Copy Markdown
Contributor Author

nvartolomei commented May 21, 2026

Force-pushed cb655cd9fa → a1bbae301a addressing review feedback:

  • unicode.h — scoped the exception contract in the docstring to lower_case mode; verbatim is a pure byte compare and does not throw.
  • unicode_test.cc — added NamesEqualInvalidUtf8.Throws covering EXPECT_THROW(..., std::runtime_error) on invalid UTF-8 in lower_case, and EXPECT_NO_THROW in verbatim.

Did not address Copilot's other suggestions:

  • switch(norm) exhaustiveness — -Werror=switch already enforces this at compile time.
  • size_t → utf8proc_ssize_t overflow — field names are small in practice; not worth a runtime check.
  • string_view(nullptr, 0) — well-defined in C++23; also utf8proc_map never returns a null buffer on success.

@nvartolomei nvartolomei enabled auto-merge May 21, 2026 22:01
@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented May 21, 2026

CI test results

test results on build#84846
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/84846#019e4c9c-0963-495d-baf3-66d8770d76a0 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0909, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.2487, p1=0.0573, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all
test results on build#84888
test_status test_class test_method test_arguments test_kind job_url passed reason test_history
FLAKY(PASS) NodeWiseRecoveryTest test_recovery_local_data_missing {"wait_for_final_manifest_uploads": false} integration https://buildkite.com/redpanda/redpanda/builds/84888#019e510a-2651-4a64-98b7-03992cee7a85 10/11 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0017, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing
FLAKY(PASS) WriteCachingFailureInjectionE2ETest test_crash_all {"use_transactions": false} integration https://buildkite.com/redpanda/redpanda/builds/84888#019e510b-b3de-44cc-a648-2118a909f5fc 18/21 Test PASSES after retries.No significant increase in flaky rate(baseline=0.0849, p0=0.5157, reject_threshold=0.0100. adj_baseline=0.2336, p1=0.1209, trust_threshold=0.5000) https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all

Several related improvements to the Unicode case-folding used by
field_name_comparison::lower_case in names_equal:

- Drop UTF8PROC_NULLTERM from utf8proc_map. With that flag the size
  argument is ignored and utf8proc reads until it finds a NUL, which
  is unsafe for arbitrary string_view inputs. Add a regression test
  using a non-null-terminated view.
- Stop copying the folded result into a std::string. utf8proc_map
  allocates the buffer with malloc; wrap it in a unique_ptr with a
  free_deleter and expose it via a small RAII folded_string class.
  names_equal now compares views into those owners directly.
- Surface errors instead of silently mis-comparing: throw
  std::runtime_error carrying utf8proc_errmsg on invalid UTF-8 and
  std::bad_alloc on allocation failure. Document the contract in the
  public header.
@nvartolomei nvartolomei force-pushed the nv/iceberg-harden-case-fold branch from a1bbae3 to 2c11dc4 Compare May 22, 2026 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants