26.3 Antalya port - fixes for s3Cluster distributed calls #1646
Conversation
26.1 Antalya port - fixes for s3Cluster distributed calls
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83ae6b17bf
    if (info.has_join)
    {
        auto join_node = query_node.getJoinTree();
        query_tree_distributed = join_node->as<JoinNode>()->getLeftTableExpression()->clone();
Rewrite query from the active table, not the join's left side
updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() when building query_to_send, which is incorrect when this IStorageCluster::read call is for a table function on the right side of a JOIN (e.g. LEFT JOIN ... s3Cluster(...)). In that case the rewritten remote query is built from the other table, so the shard executes the wrong source (often a local table name that does not exist remotely), causing query failure or incorrect data flow for object_storage_cluster_join_mode='local'.
Method is called when xxCluster is on the left side of the join.
Verification report: Altinity/ClickHouse PR #1646

Conclusion

PR is merged. CI is red on head, but every failure is either a pre-existing flake or a regression-suite scenario broken at baseline on antalya-26.3.
CI on head
| Check | Test FAIL | Class |
|---|---|---|
| Stateless tests (amd_debug, sequential) | 00157_cache_dictionary | Pre-existing flake — 106 fails / 25 PRs / 90d |
| Stateless tests (arm_binary, sequential) | 00157_cache_dictionary | same |
Regression workflow (10 failed checks)
| Check | Top failing test on PR-1646 builds (30d) | Baseline fail rate (30d, all PRs on 26.3) | Class |
|---|---|---|---|
| Swarms (Release + Aarch64) | /swarms/feature/node failure/network failure (×16) | 44% (303 OK / 235 Fail) | Pre-existing broken |
| same | /swarms/feature/cluster discovery/multiple paths (×16) | 30% (431 / 185) | Pre-existing broken |
| same | /swarms/feature/swarm join sanity/join with clause (×16) | 34% (408 / 207) | Pre-existing broken |
| S3Export (partition) (Release + Aarch64) | /s3/minio/export tests/export partition/sanity/no partition by (×16) | 50% (282 / 286) | Pre-existing broken |
| Iceberg (1) (Release + Aarch64) | /iceberg/iceberg engine/rest catalog/sort key timezone/sort key with month transform and utc timezone (×16) | 41% (267 / 188) | Missing-dependency symptom — --iceberg_partition_timezone not exposed by tested binary |
| same | /iceberg/iceberg engine/rest catalog/iceberg iterator race condition (×16) | 28% (528 / 210) | Pre-existing flaky |
| Iceberg (2) (Release + Aarch64) | /iceberg/iceberg engine/glue catalog/iceberg iterator race condition (×16) | 28% (524 / 205) | Pre-existing flaky |
| Parquet (Release + Aarch64) | /parquet/postgresql/compression type/=* and /parquet/mysql/compression type/=* (×16 each) | ~36% | Pre-existing flaky |
Regression DB on /PRs/1646/ builds (30d): 1,234 Fail / 41,062 OK ≈ 3.0%. Every top failure matches the all-PR baseline fail rate on antalya-26.3.
Related to PR diff?
PR fixes s3Cluster distributed-call paths (35 files in storage/cluster/table-function layers).
| Failing test | Diff overlap | Related? |
|---|---|---|
| 00157_cache_dictionary | none (dictionary cache) | No |
| swarms / node failure, cluster discovery, swarm join sanity | thematic overlap (swarm = cluster of object-storage nodes), but failures occur at the same 30–44% baseline on PRs that do not touch s3Cluster | No |
| s3_export_partition / no partition by | none (export-partition path) | No |
| iceberg / sort key timezone month transform utc | unrelated (timezone partitioning feature); failure is UNRECOGNIZED_ARGUMENTS on the binary — missing wiring at the test/binary layer | No (missing-dep symptom) |
| iceberg / iterator race condition (rest + glue) | iterator race condition is a long-standing chronic flake | No |
| parquet / postgresql + mysql function round-trip | none (parquet ↔ external DB) | No |
No failure exceeds the all-PR baseline or intersects this PR's touched code path in a way unique to #1646.
Recommendations
- No action on this PR. Merged and effectively clean.
- Re-verify after the dependency frontports land on `antalya-26.3` (Antalya 26.1: Remote initiator improvements 2 #1608 / Antalya 25.8: Remote initiator improvements 2 #1638, Fix attempt to create table from table function #1701, Cherry-pick: Fix: system.databases always shows data lake catalog databases #1690, plus whatever wires `iceberg_partition_timezone` as a CLI option in the tested binary). The iceberg `UNRECOGNIZED_ARGUMENTS` failure is a sharp tell that the regression-test environment expects features the branch's binary does not (yet) expose.
- Pair with the same chronic-baseline cleanup recommended in VERIFICATION_PR_1640.md: swarms network failure / cluster discovery, export-partition no partition by, iceberg iterator race conditions, parquet postgresql/mysql round-trip — these affect every PR on `antalya-26.3` and should be tracked as single issues rather than re-attributed per PR.
Audit Report — PR #1646

PR: 26.3 Antalya port - fixes for s3Cluster distributed calls

Scope

Frontport of antalya-26.1 PR #1395 (itself a bundle of #583, #584, #703, #720, #972, #1082, #1160, #1168); ~2 000 diff lines, 35 files. Key feature areas:

Call Graph

Confirmed Defects

MEDIUM — ObjectFilterStep carries the hive-partition filter DAG but does not apply it; if push-down fails, the filter is silently dropped
| Anchor | ObjectFilterStep::updatePipeline (line 404): return std::move(pipelines.front()) — no filtering applied |
| Impact | If optimizePrimaryKeyConditionAndLimit does not find a SourceStepWithFilter below the ObjectFilterStep (e.g., the plan is restructured by other optimizations between step insertion and the optimizer pass), the hive-partition filter is silently dropped. The result: all files are read regardless of hive-partition column predicates — MORE rows than expected pass to the coordinator. For COUNT() queries this produces wrong numbers. |
| Trigger | use_hive_partitioning=1 + s3Cluster(...) + WHERE key <= N on a hive-partitioned path, in any scenario where the optimizer's optimizePrimaryKeyConditionAndLimit walk doesn't reach the ReadFromCluster source before another optimization interposes a non-transparent step. The current code guards the insertion with typeid_cast<ReadFromCluster *>(query_plan.getRootNode()->step.get()), which mitigates the most obvious case, but the optimizer runs later and may see a different stack. |
| Why defect | ObjectFilterStep carries the filter DAG but deliberately does not apply it. Correctness depends entirely on the optimizer succeeding. There is no fallback: if push-down fails, no other step applies the filter. A proper defensive implementation would either be a real FilterStep (so rows are always filtered even if push-down fails), or assert that the source step is a SourceStepWithFilter and throw if not. |
| Fix direction | Either implement updatePipeline as a real filter (like FilterStep) so it always applies the predicate, or verify in step-insertion code that the optimizer will always run this step successfully and throw LOGICAL_ERROR if push-down does not occur; see the sketch below. |
| Regression test direction | Add a query plan optimization test that inserts an incompatible step between ObjectFilterStep and ReadFromCluster and asserts the filter is not silently dropped. |
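
To make the fix direction concrete, here is a minimal standalone sketch of the defensive pattern, in plain C++ with illustrative types only (Block, Predicate, and HivePartitionFilterStepSketch are stand-ins, not the actual ClickHouse ObjectFilterStep/QueryPipeline classes): the step applies its own predicate in the pipeline hook unless push-down has been explicitly confirmed, so a failed push-down degrades to in-step filtering instead of silently reading everything.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-ins for a block of rows and a row-level predicate.
using Row = std::vector<int>;
using Block = std::vector<Row>;
using Predicate = std::function<bool(const Row &)>;

// Sketch of a filter-carrying plan step that never silently drops its predicate.
class HivePartitionFilterStepSketch
{
public:
    explicit HivePartitionFilterStepSketch(Predicate predicate_) : predicate(std::move(predicate_)) {}

    // Called by the optimizer pass once the predicate has actually been pushed
    // into the source step (the happy path the current code relies on).
    void markPushedDown() { pushed_down = true; }

    // Pipeline hook: instead of forwarding the input untouched when push-down
    // did not happen, fall back to applying the predicate here.
    Block updatePipeline(Block input) const
    {
        if (pushed_down)
            return input; // the source already pruned files/rows

        Block filtered;
        for (const auto & row : input)
            if (predicate(row))
                filtered.push_back(row);
        return filtered;
    }

private:
    Predicate predicate;
    bool pushed_down = false;
};
```

The alternative named in the fix direction, throwing LOGICAL_ERROR when push-down did not occur, trades this fallback for an explicit failure; either way the invariant stops being implicit.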
MEDIUM — SearcherVisitor always takes the left-most table expression; cluster function on the right side of JOIN produces a wrong rewritten query
| Anchor | IStorageCluster::updateQueryWithJoinToSendIfNeeded / SearcherVisitor (DFS, getSubqueryDepth() <= 2 && !passed_node): finds the first TABLE/TABLE_FUNCTION node encountered; then query_tree_distributed = join_node->getLeftTableExpression()->clone() |
| Impact | When s3Cluster(...) is the RIGHT operand of a JOIN (e.g. FROM local_table JOIN s3Cluster(...)), SearcherVisitor finds local_table first. table_function_node = local_table. query_tree_distributed = left expression = local_table. The rewritten remote query is SELECT … FROM local_table, which does not exist on remote nodes. Result: wrong data or exception on remote workers. |
| Trigger | FROM local_table JOIN s3Cluster(…) SETTINGS object_storage_cluster_join_mode='local'. The Codex P1 comment flagged this; the author's response ("method called when xxCluster is on the left side") is an assertion about caller behavior, not an enforcement in code. The test test_joins result2 places s3Cluster on the right and asserts equality with result1 — if this passes in CI it suggests the planner does reorder the join or read() is not called for the right-side table in this configuration; if so the invariant should be documented or enforced. |
| Why defect | The code assumes the cluster function is always the left-most TABLE/TABLE_FUNCTION in DFS order. This invariant is not checked nor enforced. A JOIN with cluster function on the right silently selects the wrong table expression for the remote query. |
| Fix direction | Replace the generic DFS with a targeted search for the cluster storage node, or assert that table_function_node is of the expected cluster type before proceeding and throw LOGICAL_ERROR otherwise; see the sketch below. |
| Regression test direction | Add an integration test: FROM local_table INNER JOIN s3Cluster(…) ON … SETTINGS object_storage_cluster_join_mode='local' and verify results match a reference computed without the cluster setting. |
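
A standalone sketch of the targeted selection suggested above, with illustrative types only (TableExprSketch and JoinNodeSketch are not the real QueryTreeNode/JoinNode API): inspect both join operands, pick the one that is the cluster table function, and fail loudly if neither side qualifies instead of defaulting to the left expression.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Illustrative stand-in for a resolved table expression in the query tree.
struct TableExprSketch
{
    std::string name;
    bool is_cluster_function = false; // e.g. an s3Cluster(...)-style table function
};

// Illustrative stand-in for a JOIN node with two table expressions.
struct JoinNodeSketch
{
    std::shared_ptr<TableExprSketch> left;
    std::shared_ptr<TableExprSketch> right;
};

// Pick the operand that is actually the cluster function, instead of assuming it is
// always the left-most TABLE/TABLE_FUNCTION found by a generic DFS.
std::shared_ptr<TableExprSketch> pickClusterSide(const JoinNodeSketch & join)
{
    if (join.left && join.left->is_cluster_function)
        return join.left;
    if (join.right && join.right->is_cluster_function)
        return join.right;
    // Mirrors the "throw LOGICAL_ERROR otherwise" option from the fix direction.
    throw std::logic_error("neither JOIN operand is a cluster table function; "
                           "the remote query rewrite would pick the wrong table");
}
```

With this shape, a query like FROM local_table JOIN s3Cluster(...) rewrites the remote query from the cluster side, and the wrong-table rewrite described in the Impact row either cannot occur or surfaces as an explicit error.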
LOW — FutureSetFromSubquery::detachSetAndKey is not protected by the new mutex
| Anchor | PreparedSets.cpp / FutureSetFromSubquery::detachSetAndKey (line 194): moves set_and_key out without acquiring mutex |
| Impact | If another thread concurrently calls get(), getTypes(), or build() (all of which now hold mutex and dereference set_and_key), after detachSetAndKey has moved set_and_key to nullptr, the dereference set_and_key->set is a null pointer dereference. |
| Trigger | Concurrent call to detachSetAndKey and any mutex-protected method. In practice, detachSetAndKey is likely called in a single-threaded setup phase; the risk depends on the usage context. |
| Why defect | All other public methods that access set_and_key were wrapped with the mutex, but detachSetAndKey was not, creating an inconsistent protection boundary. |
| Fix direction | Add std::lock_guard lock(mutex) to detachSetAndKey, or document that detachSetAndKey must not be called concurrently with other methods; see the sketch below. |
| Regression test direction | Thread-sanitizer run of a query that builds a subquery set concurrently with the set being detached. |
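
The inconsistency is easiest to see in a reduced standalone model (FutureSetSketch and its members are illustrative, not the real PreparedSets types): every read accessor takes the mutex, so the detach path should take it too, otherwise a concurrent get() can observe the moved-from pointer.

```cpp
#include <memory>
#include <mutex>
#include <utility>

struct SetSketch { int data = 0; };

// Reduced model of a future-set holder whose read accessors are mutex-protected.
class FutureSetSketch
{
public:
    std::shared_ptr<SetSketch> get() const
    {
        std::lock_guard lock(mutex);
        return set_and_key; // callers dereference this; must not race with detach
    }

    // Fix direction applied: take the same mutex before moving the pointer out,
    // instead of leaving this one method outside the protection boundary.
    std::shared_ptr<SetSketch> detachSet()
    {
        std::lock_guard lock(mutex);
        return std::move(set_and_key);
    }

private:
    mutable std::mutex mutex;
    std::shared_ptr<SetSketch> set_and_key = std::make_shared<SetSketch>();
};
```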
LOW — PlannerJoinTree: removal of to_stage == FetchColumns guard may add unnecessary output columns in aggregation stages
| Anchor | PlannerJoinTree.cpp line 1364: original guard select_query_options.to_stage == QueryProcessingStage::FetchColumns removed |
| Impact | For remote tables or table functions with to_stage == WithMergeableState, output columns are now always added to updated_actions_dag_outputs, even when they are not needed at that aggregation stage. This could interfere with column pruning or produce extra columns in distributed aggregation results. Impact is likely performance (wider blocks), not correctness. |
| Trigger | Any distributed query with aggregation using a remote table function and to_stage = WithMergeableState. |
| Why defect | The stage check existed to avoid adding columns that are irrelevant at the merge stage. Its removal is a correctness tradeoff that may cause plan regressions for non-cluster remote queries. |
| Fix direction | Keep the fix for table functions but scope it: only remove the FetchColumns guard for cluster-function storage, not for all remote tables; see the sketch below. |
| Regression test direction | EXPLAIN the distributed query with aggregation and verify the output column list does not grow unnecessarily after the change. |
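
A hedged sketch of the scoping suggested in the fix direction (the enum and the is_cluster_function flag are illustrative, not the actual PlannerJoinTree code): keep adding the extra output columns only for cluster-function storages and retain the original FetchColumns guard everywhere else.

```cpp
// Illustrative processing stages; ClickHouse's real enum has more values.
enum class ProcessingStage { FetchColumns, WithMergeableState, Complete };

// Decide whether table-expression output columns are appended to the actions DAG outputs.
// Original code: only at FetchColumns. PR change: always. Scoped variant: always for
// cluster-function storages, original behaviour for all other remote tables.
bool shouldAddTableExpressionOutputs(ProcessingStage to_stage, bool is_cluster_function)
{
    if (is_cluster_function)
        return true; // the behaviour the s3Cluster distributed-call fix needs
    return to_stage == ProcessingStage::FetchColumns; // preserve the old guard elsewhere
}
```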
Previously Flagged Defect (Codex P1) — Evaluated
| Codex finding | Assessment |
|---|---|
| P1 — updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() (wrong when cluster is on the right of JOIN). | Partially addressed by author assertion ("method called when xxCluster is on the left side"). The invariant is not enforced in code, and the passing test result2 (cluster on right) requires further investigation. Retained as a MEDIUM finding above. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Frontports for Antalya 26.3
CI/CD Options
Exclude tests:
Regression jobs to run: