[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290
[VL] Fix native crash in toVeloxExpr for nested field reference into array/map#12290felipepessoto wants to merge 1 commit into
Conversation
…array/map SubstraitVeloxExprConverter::toVeloxExpr(FieldReference) resolves a nested struct_field chain by calling asRowType() on each child to descend a level. When the reference traverses a non-struct child -- e.g. a field nested under an array, as exercised by Delta's "nested data support - analysis error - updating array type" UpdateSuite test -- asRowType() returns null and the next loop iteration dereferenced that null RowType, crashing the whole forked JVM with a SIGSEGV in libvelox.so (gluten::SubstraitVeloxExprConverter::toVeloxExpr). A SIGSEGV is not catchable, so plan validation could not fall back and the test process died outright. Replace the unchecked navigation with VELOX_USER_CHECK guards (field-index bounds + non-null struct child) that throw a VeloxUserError. The plan validator already wraps expression conversion in try/catch(VeloxException), so the query now falls back to vanilla execution cleanly instead of crashing. Add SubstraitVeloxExprConverterTest with a regression repro: a field reference descending into an array column, and an out-of-range field index, both of which must now throw instead of crashing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds regression coverage and hardens Substrait field-reference conversion to fail gracefully (VeloxUserError) instead of crashing (SIGSEGV) on invalid nested paths and out-of-range indices.
Changes:
- Add bounds checking for direct struct_field indices during field-access conversion.
- Reject nested field descent into non-row types with a user error (instead of nullptr deref).
- Add unit tests covering both failure modes and wire them into the Velox test target.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cpp/velox/tests/SubstraitVeloxExprConverterTest.cc | New regression tests for non-struct nested field references and out-of-range field indices. |
| cpp/velox/tests/CMakeLists.txt | Registers the new test file in the Velox test suite. |
| cpp/velox/substrait/SubstraitToVeloxExpr.cc | Adds range checks and non-row descent checks to prevent SIGSEGV and throw clean user errors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| VELOX_USER_CHECK( | ||
| idx >= 0 && static_cast<uint32_t>(idx) < inputColumnType->size(), | ||
| "Field reference index {} is out of range for the {}-field row type.", | ||
| idx, | ||
| inputColumnType->size()); |
| ::substrait::Expression::FieldReference fieldReference; | ||
| fieldReference.mutable_direct_reference()->mutable_struct_field()->set_field(5); | ||
|
|
||
| VELOX_ASSERT_THROW(SubstraitVeloxExprConverter::toVeloxExpr(fieldReference, inputType), "out of range"); |
|
@zhztheplayer @philo-he could you review this PR? I found the issue when running the Delta CI #12278. I'm not familiar with this code, but with the fix the SIGSEGV are gone. I also created a PR without the fix and I could repro it: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291. Thanks |
What changes are proposed in this pull request?
TL;DR fix this: https://github.com/apache/gluten/actions/runs/27487184961/job/81246596702?pr=12278

PR/CI with test only, before the fix: #12291 / https://github.com/apache/gluten/actions/runs/27490867910/job/81255583450?pr=12291
SubstraitVeloxExprConverter::toVeloxExpr(const ::substrait::Expression::FieldReference&, const RowTypePtr&)resolves a nestedstruct_fieldreference by walking the chain and descending one level at a time withinputColumnType = asRowType(inputColumnType->childAt(idx)).When the field path traverses a non-struct child — for example a field nested under an array —
asRowType()returnsnullptr, and the next loop iteration dereferenced that nullRowType, crashing the whole (forked) JVM with aSIGSEGVinlibvelox.so:Because a
SIGSEGVis not a catchable C++ exception,SubstraitToVeloxPlanValidator(which wraps expression conversion intry/catch (VeloxException)) could not catch it and fall back — the process died outright.This PR replaces the unchecked navigation with
VELOX_USER_CHECKguards that throw aVeloxUserError:RowType::childAt/nameOfuse uncheckedoperator[]), andWith these checks, an unsupported nested reference makes plan validation fail cleanly and the query falls back to vanilla Spark instead of crashing the JVM.
How was this patch tested?
Added
cpp/velox/tests/SubstraitVeloxExprConverterTest.cc, a unit test that builds the exact crashingFieldReference— a reference descending into an array column — and asserts the converter now throws aVeloxUserError(viaVELOX_ASSERT_THROW) instead of crashing the process; it also covers an out-of-range field index. Before this change the same input aborts the JVM with theSIGSEGVshown above.This is the same code path hit by Delta Lake's
UpdateSuitetest"nested data support - analysis error - updating array type"(anUPDATEwhose field path descends through an array), which is exercised by the Gluten Delta Spark UT CI: before this change the forked test JVM aborts with the SIGSEGV; after it, validation falls back to vanilla Spark and the query is handled correctly.clang-format-15reports no changes.Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (claude-opus-4.8)