Skip to content

[fix](retention) Limit param count to 32 to avoid BE heap overflow#64521

Open
924060929 wants to merge 1 commit into
apache:masterfrom
924060929:fix_retention_param_limit
Open

[fix](retention) Limit param count to 32 to avoid BE heap overflow#64521
924060929 wants to merge 1 commit into
apache:masterfrom
924060929:fix_retention_param_limit

Conversation

@924060929

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Problem Summary:

retention() keeps its aggregate state in a fixed-size RetentionState::events[32] array and serializes it into a single int64 bitmap, so it can represent at most 32 events. However neither FE nor BE limited the number of arguments:

  • AggregateFunctionRetention::add() iterates over all argument columns and calls RetentionState::set(i) (events[i] = 1) without any bound check, so with more than 32 params it writes past the end of events[] (heap out-of-bounds write).
  • RetentionState::insert_result_into() reads events[] for the actual argument count, so it reads past the array as well.

With a query calling retention() with 102 boolean arguments, this corrupted the heap and later crashed BE in the streaming aggregation output path:

[E-3113]string column length is too large: total_length=4295003729, element_number=4064
...
doris::vectorized::ColumnStr<unsigned int>::insert_many_strings -> memcpy   (SIGSEGV)
doris::pipeline::StreamingAggLocalState::_get_results_with_serialized_key

This PR enforces the 32-event limit on both sides:

  • FE (Retention.checkLegalityBeforeTypeCoercion): reject > 32 params with a clear AnalysisException at planning time.
  • BE (AggregateFunctionRetention constructor): throw INVALID_ARGUMENT when more than RetentionState::MAX_EVENTS argument types are given, as a backstop for any path that reaches BE (e.g. an older FE during a rolling upgrade, or a direct BE call).

Release note

Fix BE crash when retention() is called with more than 32 conditions; it now reports a clear error instead of corrupting memory.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. retention() with more than 32 conditions now fails with a clear error at planning time instead of returning wrong results or crashing BE.
  • Does this need documentation?

    • No.
    • Yes.

@hello-stephen

Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@924060929

Copy link
Copy Markdown
Contributor Author

run buildall

@924060929

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review conclusion: the retention limit fix is directionally correct: FE rejects more than 32 params, BE has a constructor backstop, and the added BE unit test covers the 32/33 boundary. I found one regression-test standards issue that should be fixed before merge.

Critical checkpoint conclusions:

  • Goal/test proof: The PR addresses the >32 retention heap overflow with FE and BE validation, plus unit/regression coverage. The regression test needs one cleanup-policy fix.
  • Scope/focus: The implementation is small and focused on retention argument count validation.
  • Concurrency/lifecycle/config/compatibility/persistence/data writes/FE-BE variables: No new concurrency, lifecycle, config, storage-format, persistence, data-write, or FE-BE variable-passing concerns found.
  • Parallel paths: Nereids validation and BE aggregate construction are both covered for the changed SQL/runtime paths.
  • Conditional checks: The max-argument condition is tied to RetentionState::MAX_EVENTS and is justified by the fixed array/bitmap representation.
  • Tests/results: BE boundary unit test and regression test were added; the regression test currently drops the table after execution, which violates the Doris regression-test rule to preserve state after the suite.
  • Observability/performance: No additional observability is needed for this validation-only change; no performance issue found.
  • Other issues: No duplicate existing comments were present, and no extra user focus points were provided.

Local review checks: read the code-review skill and relevant FE/Nereids/BE guides; inspected the PR diff and related factory/analysis paths; ran git diff --check successfully. I did not run the full BE or regression test suites.

exception "at most 32"
}

sql "DROP TABLE IF EXISTS retention_param_limit_test"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doris regression tests should drop tables before use but leave them in place after the suite so a failed run preserves the final state for debugging. This suite already does DROP TABLE IF EXISTS retention_param_limit_test at the beginning, so please remove this final drop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed the trailing DROP TABLE IF EXISTS retention_param_limit_test; the suite now only drops at the start and leaves the table in place after the run.

@924060929

Copy link
Copy Markdown
Contributor Author

run buildall

@924060929

Copy link
Copy Markdown
Contributor Author

/review

Comment thread be/test/exprs/aggregate/vec_retention_test.cpp Outdated
retention() keeps its aggregate state in a fixed-size RetentionState::events[32]
array and serializes it into a single int64 bitmap, so it supports at most 32
events. Neither FE nor BE enforced this limit:

- AggregateFunctionRetention::add() iterates over all argument columns and calls
  RetentionState::set(i) (events[i] = 1) with no bound check, so more than 32
  params write past events[] (heap out-of-bounds write).
- insert_result_into() reads events[] for the actual argument count, reading past
  the array as well.

With retention() called with 102 boolean args this corrupted the heap and later
crashed BE in the streaming aggregation output path (ColumnString length
corruption "string column length is too large: total_length=4295003729" ->
memcpy SIGSEGV in StreamingAggLocalState::_get_results_with_serialized_key).

Enforce the 32-event limit on both sides:
- FE Retention.checkLegalityBeforeTypeCoercion rejects > 32 params with a clear
  AnalysisException at planning time.
- BE AggregateFunctionRetention ctor throws INVALID_ARGUMENT when given more than
  RetentionState::MAX_EVENTS argument types, as a backstop for any path reaching BE.
@924060929 924060929 force-pushed the fix_retention_param_limit branch from 8655a29 to 55590bb Compare June 15, 2026 08:28
@924060929

Copy link
Copy Markdown
Contributor Author

run buildall

@924060929

Copy link
Copy Markdown
Contributor Author

/review

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review opinion: no blocking issues found in the current PR file list.

I reviewed the 4 files reported by GitHub for PR 64521 and the existing inline review threads. The previous comments about removing the internal issue ID and the final regression-test DROP are addressed in the current head.

Critical checkpoint conclusions:

  • Goal and tests: the change enforces the retention() 32-condition limit in Nereids FE and adds a BE constructor guard as a fallback. The added BE unit test covers 32 vs 33 args, and the regression test covers the FE rejection path.
  • Scope: the current PR is focused on retention() validation and its tests.
  • Concurrency/lifecycle/config: no new concurrency, lifecycle, static-init, or config risk found.
  • Compatibility/parallel paths: no storage format, persistence, or FE-BE protocol change. The BE guard protects paths that bypass Nereids validation.
  • Error handling: FE raises AnalysisException; BE throws doris::Exception during aggregate creation, consistent with existing vectorized function construction patterns.
  • Data correctness/transactions: no transaction, visible-version, delete-bitmap, or write-path behavior is touched.
  • Test hygiene: the new regression suite drops before use, leaves the table for debugging, uses ORDER BY on the successful grouped query, and uses test/exception for the negative case. git diff --check passes for the current PR files.
  • Performance/observability: the added check is constructor-only and does not need additional logs or metrics.

User focus: no additional user-provided review focus was supplied.

I did not run the full BE/FE build or regression suite in this review.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking code issues found in this pass.

Checkpoint conclusions:

  • Goal and proof: The PR enforces the retention() 32-condition limit in both FE analysis and BE aggregate construction. The added regression test covers the SQL 32/33 boundary, and the BE unit test covers factory construction at MAX_EVENTS and MAX_EVENTS + 1.
  • Scope and clarity: The functional change is small and focused on the retention aggregate plus targeted tests.
  • Concurrency and lifecycle: No new shared mutable state, threads, locks, static initialization dependency, or non-obvious lifecycle path is introduced.
  • Config and compatibility: No config/storage/protocol format is added. The BE constructor guard is a useful mixed-version backstop for plans that still reach BE.
  • Parallel paths: I checked the FE builtin aggregate registration and the BE aggregate factory registration; there is a single retention binding path, and aggregate combinator paths still go through the nested legality/factory construction path.
  • Error handling: FE raises AnalysisException; BE raises doris::Exception, which is converted at the existing expression context boundary. This avoids silent continuation and prevents the previous heap overwrite path.
  • Tests and expected results: The new regression test follows the drop-before-use rule and uses the expected-error pattern. No .out file is required because the successful query is only used as an execution boundary check and the negative case checks the exception.
  • Observability: No additional logs or metrics seem necessary for this arity validation.
  • Transactions, persistence, writes, MoW, and visible-version correctness: Not applicable to this PR.
  • Performance: The added checks are constant-time construction/analysis checks and do not affect the aggregate hot path for valid queries.

Existing review context: I treated the prior inline threads about removing the CIR tag and the final table drop as already known; both are fixed in the current diff.

User focus: No additional user-provided focus points were present.

CI note: Compile, CheckStyle, clang formatter, license, and secret checks are passing. The GitHub-hosted macOS BE UT failure is environmental (JAVA version is 25, it must be JDK-17) before tests run. The TeamCity FE UT log is not accessible from this runner (HTTP 401), and several BE/regression checks were still pending when reviewed.

@924060929

Copy link
Copy Markdown
Contributor Author

run feut

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-H: Total hot run time: 28958 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 55590bb1aedfd19ceaf02a4913bc4994298b1df1, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17674	4101	4124	4101
q2	q3	10778	1390	819	819
q4	4687	472	344	344
q5	7555	888	577	577
q6	183	174	140	140
q7	765	827	623	623
q8	9760	1614	1559	1559
q9	7104	4509	4508	4508
q10	6781	1816	1483	1483
q11	431	266	242	242
q12	634	420	282	282
q13	18160	3348	2799	2799
q14	260	261	252	252
q15	q16	819	779	712	712
q17	1015	1000	1036	1000
q18	6721	5815	5520	5520
q19	1619	1376	1050	1050
q20	498	418	266	266
q21	5777	2578	2373	2373
q22	424	357	308	308
Total cold run time: 101645 ms
Total hot run time: 28958 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4509	4397	4396	4396
q2	q3	4541	4912	4336	4336
q4	2086	2218	1400	1400
q5	4442	4316	4322	4316
q6	232	174	129	129
q7	1738	2070	1759	1759
q8	2486	2152	2092	2092
q9	7884	7841	7889	7841
q10	4814	4787	4316	4316
q11	583	424	379	379
q12	723	760	611	611
q13	3377	3601	2972	2972
q14	312	299	284	284
q15	q16	731	747	637	637
q17	1359	1336	1335	1335
q18	8031	7406	6801	6801
q19	1100	1096	1098	1096
q20	2246	2217	1942	1942
q21	5323	4621	4472	4472
q22	528	455	403	403
Total cold run time: 57045 ms
Total hot run time: 51517 ms

@hello-stephen

Copy link
Copy Markdown
Contributor
TPC-DS: Total hot run time: 168709 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 55590bb1aedfd19ceaf02a4913bc4994298b1df1, data reload: false

query5	4329	634	490	490
query6	443	193	170	170
query7	4836	581	303	303
query8	357	215	203	203
query9	8740	4070	4047	4047
query10	477	334	242	242
query11	5899	2338	2116	2116
query12	151	103	96	96
query13	1243	628	463	463
query14	6382	5396	5066	5066
query14_1	4402	4419	4405	4405
query15	211	198	176	176
query16	993	462	419	419
query17	958	709	586	586
query18	2436	490	354	354
query19	203	187	145	145
query20	121	108	117	108
query21	219	154	125	125
query22	13538	13521	13432	13432
query23	17368	16522	16173	16173
query23_1	16328	16372	16257	16257
query24	7603	1747	1304	1304
query24_1	1320	1313	1336	1313
query25	575	473	406	406
query26	1295	318	168	168
query27	2692	566	348	348
query28	4475	2011	2039	2011
query29	1081	639	486	486
query30	317	241	201	201
query31	1110	1094	953	953
query32	112	62	59	59
query33	537	323	258	258
query34	1189	1103	683	683
query35	751	798	702	702
query36	1404	1402	1211	1211
query37	152	101	88	88
query38	3185	3133	3098	3098
query39	933	919	913	913
query39_1	882	905	866	866
query40	213	121	100	100
query41	69	61	60	60
query42	97	94	93	93
query43	332	320	284	284
query44	
query45	191	181	172	172
query46	1088	1192	772	772
query47	2368	2398	2218	2218
query48	399	417	301	301
query49	615	452	347	347
query50	1046	365	287	287
query51	4406	4340	4273	4273
query52	87	90	82	82
query53	251	284	185	185
query54	279	236	195	195
query55	80	76	68	68
query56	229	229	206	206
query57	1418	1405	1306	1306
query58	233	213	207	207
query59	1619	1638	1401	1401
query60	273	245	222	222
query61	152	153	149	149
query62	691	647	584	584
query63	232	187	190	187
query64	2545	758	619	619
query65	
query66	1786	491	330	330
query67	29641	29629	29490	29490
query68	
query69	419	306	264	264
query70	968	971	944	944
query71	289	231	217	217
query72	2937	2623	2292	2292
query73	889	786	428	428
query74	5115	4933	4769	4769
query75	2648	2572	2222	2222
query76	2365	1155	799	799
query77	359	364	294	294
query78	12398	12522	11885	11885
query79	1448	1088	733	733
query80	715	458	396	396
query81	476	290	242	242
query82	563	176	128	128
query83	358	275	255	255
query84	
query85	901	500	424	424
query86	410	307	283	283
query87	3395	3320	3147	3147
query88	3691	2754	2775	2754
query89	431	383	334	334
query90	1873	182	185	182
query91	173	165	132	132
query92	64	61	54	54
query93	1518	1419	889	889
query94	615	348	316	316
query95	664	372	345	345
query96	1055	788	333	333
query97	2703	2694	2605	2605
query98	208	203	202	202
query99	1150	1179	1018	1018
Total cold run time: 250693 ms
Total hot run time: 168709 ms

@924060929

Copy link
Copy Markdown
Contributor Author

run beut

@hello-stephen

Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉
Increment coverage report
Complete coverage report

@924060929

Copy link
Copy Markdown
Contributor Author

run beut

@hello-stephen

Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.35% (3/865) 🎉
Increment coverage report
Complete coverage report

1 similar comment
@hello-stephen

Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.35% (3/865) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants