[fix](retention) Limit param count to 32 to avoid BE heap overflow#64521
[fix](retention) Limit param count to 32 to avoid BE heap overflow#64521924060929 wants to merge 1 commit into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review conclusion: the retention limit fix is directionally correct: FE rejects more than 32 params, BE has a constructor backstop, and the added BE unit test covers the 32/33 boundary. I found one regression-test standards issue that should be fixed before merge.
Critical checkpoint conclusions:
- Goal/test proof: The PR addresses the >32 retention heap overflow with FE and BE validation, plus unit/regression coverage. The regression test needs one cleanup-policy fix.
- Scope/focus: The implementation is small and focused on retention argument count validation.
- Concurrency/lifecycle/config/compatibility/persistence/data writes/FE-BE variables: No new concurrency, lifecycle, config, storage-format, persistence, data-write, or FE-BE variable-passing concerns found.
- Parallel paths: Nereids validation and BE aggregate construction are both covered for the changed SQL/runtime paths.
- Conditional checks: The max-argument condition is tied to RetentionState::MAX_EVENTS and is justified by the fixed array/bitmap representation.
- Tests/results: BE boundary unit test and regression test were added; the regression test currently drops the table after execution, which violates the Doris regression-test rule to preserve state after the suite.
- Observability/performance: No additional observability is needed for this validation-only change; no performance issue found.
- Other issues: No duplicate existing comments were present, and no extra user focus points were provided.
Local review checks: read the code-review skill and relevant FE/Nereids/BE guides; inspected the PR diff and related factory/analysis paths; ran git diff --check successfully. I did not run the full BE or regression test suites.
| exception "at most 32" | ||
| } | ||
|
|
||
| sql "DROP TABLE IF EXISTS retention_param_limit_test" |
There was a problem hiding this comment.
Doris regression tests should drop tables before use but leave them in place after the suite so a failed run preserves the final state for debugging. This suite already does DROP TABLE IF EXISTS retention_param_limit_test at the beginning, so please remove this final drop.
There was a problem hiding this comment.
Done — removed the trailing DROP TABLE IF EXISTS retention_param_limit_test; the suite now only drops at the start and leaves the table in place after the run.
543f0fd to
8655a29
Compare
|
run buildall |
|
/review |
retention() keeps its aggregate state in a fixed-size RetentionState::events[32] array and serializes it into a single int64 bitmap, so it supports at most 32 events. Neither FE nor BE enforced this limit: - AggregateFunctionRetention::add() iterates over all argument columns and calls RetentionState::set(i) (events[i] = 1) with no bound check, so more than 32 params write past events[] (heap out-of-bounds write). - insert_result_into() reads events[] for the actual argument count, reading past the array as well. With retention() called with 102 boolean args this corrupted the heap and later crashed BE in the streaming aggregation output path (ColumnString length corruption "string column length is too large: total_length=4295003729" -> memcpy SIGSEGV in StreamingAggLocalState::_get_results_with_serialized_key). Enforce the 32-event limit on both sides: - FE Retention.checkLegalityBeforeTypeCoercion rejects > 32 params with a clear AnalysisException at planning time. - BE AggregateFunctionRetention ctor throws INVALID_ARGUMENT when given more than RetentionState::MAX_EVENTS argument types, as a backstop for any path reaching BE.
8655a29 to
55590bb
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review opinion: no blocking issues found in the current PR file list.
I reviewed the 4 files reported by GitHub for PR 64521 and the existing inline review threads. The previous comments about removing the internal issue ID and the final regression-test DROP are addressed in the current head.
Critical checkpoint conclusions:
- Goal and tests: the change enforces the retention() 32-condition limit in Nereids FE and adds a BE constructor guard as a fallback. The added BE unit test covers 32 vs 33 args, and the regression test covers the FE rejection path.
- Scope: the current PR is focused on retention() validation and its tests.
- Concurrency/lifecycle/config: no new concurrency, lifecycle, static-init, or config risk found.
- Compatibility/parallel paths: no storage format, persistence, or FE-BE protocol change. The BE guard protects paths that bypass Nereids validation.
- Error handling: FE raises AnalysisException; BE throws doris::Exception during aggregate creation, consistent with existing vectorized function construction patterns.
- Data correctness/transactions: no transaction, visible-version, delete-bitmap, or write-path behavior is touched.
- Test hygiene: the new regression suite drops before use, leaves the table for debugging, uses ORDER BY on the successful grouped query, and uses test/exception for the negative case. git diff --check passes for the current PR files.
- Performance/observability: the added check is constructor-only and does not need additional logs or metrics.
User focus: no additional user-provided review focus was supplied.
I did not run the full BE/FE build or regression suite in this review.
There was a problem hiding this comment.
No blocking code issues found in this pass.
Checkpoint conclusions:
- Goal and proof: The PR enforces the
retention()32-condition limit in both FE analysis and BE aggregate construction. The added regression test covers the SQL 32/33 boundary, and the BE unit test covers factory construction atMAX_EVENTSandMAX_EVENTS + 1. - Scope and clarity: The functional change is small and focused on the retention aggregate plus targeted tests.
- Concurrency and lifecycle: No new shared mutable state, threads, locks, static initialization dependency, or non-obvious lifecycle path is introduced.
- Config and compatibility: No config/storage/protocol format is added. The BE constructor guard is a useful mixed-version backstop for plans that still reach BE.
- Parallel paths: I checked the FE builtin aggregate registration and the BE aggregate factory registration; there is a single
retentionbinding path, and aggregate combinator paths still go through the nested legality/factory construction path. - Error handling: FE raises
AnalysisException; BE raisesdoris::Exception, which is converted at the existing expression context boundary. This avoids silent continuation and prevents the previous heap overwrite path. - Tests and expected results: The new regression test follows the drop-before-use rule and uses the expected-error pattern. No
.outfile is required because the successful query is only used as an execution boundary check and the negative case checks the exception. - Observability: No additional logs or metrics seem necessary for this arity validation.
- Transactions, persistence, writes, MoW, and visible-version correctness: Not applicable to this PR.
- Performance: The added checks are constant-time construction/analysis checks and do not affect the aggregate hot path for valid queries.
Existing review context: I treated the prior inline threads about removing the CIR tag and the final table drop as already known; both are fixed in the current diff.
User focus: No additional user-provided focus points were present.
CI note: Compile, CheckStyle, clang formatter, license, and secret checks are passing. The GitHub-hosted macOS BE UT failure is environmental (JAVA version is 25, it must be JDK-17) before tests run. The TeamCity FE UT log is not accessible from this runner (HTTP 401), and several BE/regression checks were still pending when reviewed.
|
run feut |
TPC-H: Total hot run time: 28958 ms |
TPC-DS: Total hot run time: 168709 ms |
|
run beut |
FE UT Coverage ReportIncrement line coverage |
|
run beut |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Problem Summary:
retention()keeps its aggregate state in a fixed-sizeRetentionState::events[32]array and serializes it into a singleint64bitmap, so it can represent at most 32 events. However neither FE nor BE limited the number of arguments:AggregateFunctionRetention::add()iterates over all argument columns and callsRetentionState::set(i)(events[i] = 1) without any bound check, so with more than 32 params it writes past the end ofevents[](heap out-of-bounds write).RetentionState::insert_result_into()readsevents[]for the actual argument count, so it reads past the array as well.With a query calling
retention()with 102 boolean arguments, this corrupted the heap and later crashed BE in the streaming aggregation output path:This PR enforces the 32-event limit on both sides:
Retention.checkLegalityBeforeTypeCoercion): reject> 32params with a clearAnalysisExceptionat planning time.AggregateFunctionRetentionconstructor): throwINVALID_ARGUMENTwhen more thanRetentionState::MAX_EVENTSargument types are given, as a backstop for any path that reaches BE (e.g. an older FE during a rolling upgrade, or a direct BE call).Release note
Fix BE crash when
retention()is called with more than 32 conditions; it now reports a clear error instead of corrupting memory.Check List (For Author)
Test
Behavior changed:
retention()with more than 32 conditions now fails with a clear error at planning time instead of returning wrong results or crashing BE.Does this need documentation?