26.3 Antalya port - fixes for s3Cluster distributed calls #1646
Conversation
26.1 Antalya port - fixes for s3Cluster distributed calls
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83ae6b17bf
    if (info.has_join)
    {
        auto join_node = query_node.getJoinTree();
        query_tree_distributed = join_node->as<JoinNode>()->getLeftTableExpression()->clone();
Rewrite query from the active table, not the join's left side
updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() when building query_to_send, which is incorrect when this IStorageCluster::read call is for a table function on the right side of a JOIN (e.g. LEFT JOIN ... s3Cluster(...)). In that case the rewritten remote query is built from the other table, so the shard executes the wrong source (often a local table name that does not exist remotely), causing query failure or incorrect data flow for object_storage_cluster_join_mode='local'.
Method is called when xxCluster is on the left side of the join.
Verification report: Altinity/ClickHouse PR #1646

Conclusion

PR is merged. CI is red on head, but every failure is either a pre-existing flake or a regression-suite scenario broken at baseline on antalya-26.3.
CI on head
| Check | Test FAIL | Class |
|---|---|---|
| Stateless tests (amd_debug, sequential) | 00157_cache_dictionary | Pre-existing flake — 106 fails / 25 PRs / 90d |
| Stateless tests (arm_binary, sequential) | 00157_cache_dictionary | same |
Regression workflow (10 failed checks)
| Check | Top failing test on PR-1646 builds (30d) | Baseline fail rate (30d, all PRs on 26.3) | Class |
|---|---|---|---|
| Swarms (Release + Aarch64) | /swarms/feature/node failure/network failure (×16) | 44% (303 OK / 235 Fail) | Pre-existing broken |
| same | /swarms/feature/cluster discovery/multiple paths (×16) | 30% (431 / 185) | Pre-existing broken |
| same | /swarms/feature/swarm join sanity/join with clause (×16) | 34% (408 / 207) | Pre-existing broken |
| S3Export (partition) (Release + Aarch64) | /s3/minio/export tests/export partition/sanity/no partition by (×16) | 50% (282 / 286) | Pre-existing broken |
| Iceberg (1) (Release + Aarch64) | /iceberg/iceberg engine/rest catalog/sort key timezone/sort key with month transform and utc timezone (×16) | 41% (267 / 188) | Missing-dependency symptom — --iceberg_partition_timezone not exposed by tested binary |
| same | /iceberg/iceberg engine/rest catalog/iceberg iterator race condition (×16) | 28% (528 / 210) | Pre-existing flaky |
| Iceberg (2) (Release + Aarch64) | /iceberg/iceberg engine/glue catalog/iceberg iterator race condition (×16) | 28% (524 / 205) | Pre-existing flaky |
| Parquet (Release + Aarch64) | /parquet/postgresql/compression type/=* and /parquet/mysql/compression type/=* (×16 each) | ~36% | Pre-existing flaky |
Regression DB on /PRs/1646/ builds (30d): 1,234 Fail / 41,062 OK ≈ 3.0%. Every top failure matches the all-PR baseline fail rate on antalya-26.3.
Related to PR diff?
PR fixes s3Cluster distributed-call paths (35 files in storage/cluster/table-function layers).
| Failing test | Diff overlap | Related? |
|---|---|---|
| 00157_cache_dictionary | none (dictionary cache) | No |
| swarms / node failure, cluster discovery, swarm join sanity | thematic overlap (swarm = cluster of object-storage nodes), but failures occur at the same 30–44% baseline on PRs that do not touch s3Cluster | No |
| s3_export_partition / no partition by | none (export-partition path) | No |
| iceberg / sort key timezone month transform utc | unrelated (timezone partitioning feature); failure is UNRECOGNIZED_ARGUMENTS on the binary — missing wiring at the test/binary layer | No (missing-dep symptom) |
| iceberg / iterator race condition (rest + glue) | iterator race condition is a long-standing chronic flake | No |
| parquet / postgresql + mysql function round-trip | none (parquet ↔ external DB) | No |
No failure exceeds the all-PR baseline or intersects this PR's touched code path in a way unique to #1646.
Recommendations
- No action on this PR. Merged and effectively clean.
- Re-verify after the dependency frontports land on `antalya-26.3` (Antalya 26.1: Remote initiator improvements 2 #1608 / Antalya 25.8: Remote initiator improvements 2 #1638, Fix attempt to create table from table function #1701, Cherry-pick: Fix: system.databases always shows data lake catalog databases #1690, plus whatever wires `iceberg_partition_timezone` as a CLI option in the tested binary). The iceberg `UNRECOGNIZED_ARGUMENTS` failure is a sharp tell that the regression-test environment expects features the branch's binary does not (yet) expose.
- Pair with the same chronic-baseline cleanup recommended in VERIFICATION_PR_1640.md: swarms network failure / cluster discovery, export-partition no partition by, iceberg iterator race conditions, parquet postgresql/mysql round-trip — these affect every PR on `antalya-26.3` and should be tracked as single issues rather than re-attributed per PR.
Audit Report — PR #1646

PR: 26.3 Antalya port - fixes for s3Cluster distributed calls

Scope

Frontport of antalya-26.1 PR #1395 (itself a bundle of #583, #584, #703, #720, #972, #1082, #1160, #1168); ~2 000 diff lines, 35 files. Key feature areas:

Call Graph

Confirmed Defects

MEDIUM — ObjectFilterStep carries the hive-partition filter DAG but does not apply it; if push-down fails, the filter is silently dropped
| Anchor | ObjectFilterStep::updatePipeline (line 404): return std::move(pipelines.front()) — no filtering applied |
| Impact | If optimizePrimaryKeyConditionAndLimit does not find a SourceStepWithFilter below the ObjectFilterStep (e.g., the plan is restructured by other optimizations between step insertion and the optimizer pass), the hive-partition filter is silently dropped. The result: all files are read regardless of hive-partition column predicates — MORE rows than expected pass to the coordinator. For COUNT() queries this produces wrong numbers. |
| Trigger | use_hive_partitioning=1 + s3Cluster(...) + WHERE key <= N on a hive-partitioned path, in any scenario where the optimizer's optimizePrimaryKeyConditionAndLimit walk doesn't reach the ReadFromCluster source before another optimization interposes a non-transparent step. The current code guards the insertion with typeid_cast<ReadFromCluster *>(query_plan.getRootNode()->step.get()), which mitigates the most obvious case, but the optimizer runs later and may see a different stack. |
| Why defect | ObjectFilterStep carries the filter DAG but deliberately does not apply it. Correctness depends entirely on the optimizer succeeding. There is no fallback: if push-down fails, no other step applies the filter. A proper defensive implementation would either be a real FilterStep (so rows are always filtered even if push-down fails), or assert that the source step is a SourceStepWithFilter and throw if not. |
| Fix direction | Either implement updatePipeline as a real filter (like FilterStep) so it always applies the predicate, or verify in step-insertion code that the optimizer will always run this step successfully and throw LOGICAL_ERROR if push-down does not occur; see the sketch below. |
| Regression test direction | Add a query plan optimization test that inserts an incompatible step between ObjectFilterStep and ReadFromCluster and asserts the filter is not silently dropped. |
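
To make the fix direction concrete, here is a minimal standalone sketch of the defensive pattern, in plain C++ with illustrative types only (Block, Predicate, and HivePartitionFilterStepSketch are stand-ins, not the actual ClickHouse ObjectFilterStep/QueryPipeline classes): the step applies its own predicate in the pipeline hook unless push-down has been explicitly confirmed, so a failed push-down degrades to in-step filtering instead of silently reading everything.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-ins for a block of rows and a row-level predicate.
using Row = std::vector<int>;
using Block = std::vector<Row>;
using Predicate = std::function<bool(const Row &)>;

// Sketch of a filter-carrying plan step that never silently drops its predicate.
class HivePartitionFilterStepSketch
{
public:
    explicit HivePartitionFilterStepSketch(Predicate predicate_) : predicate(std::move(predicate_)) {}

    // Called by the optimizer pass once the predicate has actually been pushed
    // into the source step (the happy path the current code relies on).
    void markPushedDown() { pushed_down = true; }

    // Pipeline hook: instead of forwarding the input untouched when push-down
    // did not happen, fall back to applying the predicate here.
    Block updatePipeline(Block input) const
    {
        if (pushed_down)
            return input; // the source already pruned files/rows

        Block filtered;
        for (const auto & row : input)
            if (predicate(row))
                filtered.push_back(row);
        return filtered;
    }

private:
    Predicate predicate;
    bool pushed_down = false;
};
```

The alternative named in the fix direction, throwing LOGICAL_ERROR when push-down did not occur, trades this fallback for an explicit failure; either way the invariant stops being implicit.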
MEDIUM — SearcherVisitor always takes the left-most table expression; cluster function on the right side of JOIN produces a wrong rewritten query
| Anchor | IStorageCluster::updateQueryWithJoinToSendIfNeeded / SearcherVisitor (DFS, getSubqueryDepth() <= 2 && !passed_node): finds the first TABLE/TABLE_FUNCTION node encountered; then query_tree_distributed = join_node->getLeftTableExpression()->clone() |
| Impact | When s3Cluster(...) is the RIGHT operand of a JOIN (e.g. FROM local_table JOIN s3Cluster(...)), SearcherVisitor finds local_table first. table_function_node = local_table. query_tree_distributed = left expression = local_table. The rewritten remote query is SELECT … FROM local_table, which does not exist on remote nodes. Result: wrong data or exception on remote workers. |
| Trigger | FROM local_table JOIN s3Cluster(…) SETTINGS object_storage_cluster_join_mode='local'. The Codex P1 comment flagged this; the author's response ("method called when xxCluster is on the left side") is an assertion about caller behavior, not an enforcement in code. The test test_joins result2 places s3Cluster on the right and asserts equality with result1 — if this passes in CI it suggests the planner does reorder the join or read() is not called for the right-side table in this configuration; if so the invariant should be documented or enforced. |
| Why defect | The code assumes the cluster function is always the left-most TABLE/TABLE_FUNCTION in DFS order. This invariant is not checked nor enforced. A JOIN with cluster function on the right silently selects the wrong table expression for the remote query. |
| Fix direction | Replace the generic DFS with a targeted search for the cluster storage node, or assert that table_function_node is of the expected cluster type before proceeding and throw LOGICAL_ERROR otherwise; see the sketch below. |
| Regression test direction | Add an integration test: FROM local_table INNER JOIN s3Cluster(…) ON … SETTINGS object_storage_cluster_join_mode='local' and verify results match a reference computed without the cluster setting. |
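
A standalone sketch of the targeted selection suggested above, with illustrative types only (TableExprSketch and JoinNodeSketch are not the real QueryTreeNode/JoinNode API): inspect both join operands, pick the one that is the cluster table function, and fail loudly if neither side qualifies instead of defaulting to the left expression.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Illustrative stand-in for a resolved table expression in the query tree.
struct TableExprSketch
{
    std::string name;
    bool is_cluster_function = false; // e.g. an s3Cluster(...)-style table function
};

// Illustrative stand-in for a JOIN node with two table expressions.
struct JoinNodeSketch
{
    std::shared_ptr<TableExprSketch> left;
    std::shared_ptr<TableExprSketch> right;
};

// Pick the operand that is actually the cluster function, instead of assuming it is
// always the left-most TABLE/TABLE_FUNCTION found by a generic DFS.
std::shared_ptr<TableExprSketch> pickClusterSide(const JoinNodeSketch & join)
{
    if (join.left && join.left->is_cluster_function)
        return join.left;
    if (join.right && join.right->is_cluster_function)
        return join.right;
    // Mirrors the "throw LOGICAL_ERROR otherwise" option from the fix direction.
    throw std::logic_error("neither JOIN operand is a cluster table function; "
                           "the remote query rewrite would pick the wrong table");
}
```

With this shape, a query like FROM local_table JOIN s3Cluster(...) rewrites the remote query from the cluster side, and the wrong-table rewrite described in the Impact row either cannot occur or surfaces as an explicit error.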
LOW — FutureSetFromSubquery::detachSetAndKey is not protected by the new mutex
| Anchor | PreparedSets.cpp / FutureSetFromSubquery::detachSetAndKey (line 194): moves set_and_key out without acquiring mutex |
| Impact | If another thread concurrently calls get(), getTypes(), or build() (all of which now hold mutex and dereference set_and_key), after detachSetAndKey has moved set_and_key to nullptr, the dereference set_and_key->set is a null pointer dereference. |
| Trigger | Concurrent call to detachSetAndKey and any mutex-protected method. In practice, detachSetAndKey is likely called in a single-threaded setup phase; the risk depends on the usage context. |
| Why defect | All other public methods that access set_and_key were wrapped with the mutex, but detachSetAndKey was not, creating an inconsistent protection boundary. |
| Fix direction | Add std::lock_guard lock(mutex) to detachSetAndKey, or document that detachSetAndKey must not be called concurrently with other methods; see the sketch below. |
| Regression test direction | Thread-sanitizer run of a query that builds a subquery set concurrently with the set being detached. |
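
The inconsistency is easiest to see in a reduced standalone model (FutureSetSketch and its members are illustrative, not the real PreparedSets types): every read accessor takes the mutex, so the detach path should take it too, otherwise a concurrent get() can observe the moved-from pointer.

```cpp
#include <memory>
#include <mutex>
#include <utility>

struct SetSketch { int data = 0; };

// Reduced model of a future-set holder whose read accessors are mutex-protected.
class FutureSetSketch
{
public:
    std::shared_ptr<SetSketch> get() const
    {
        std::lock_guard lock(mutex);
        return set_and_key; // callers dereference this; must not race with detach
    }

    // Fix direction applied: take the same mutex before moving the pointer out,
    // instead of leaving this one method outside the protection boundary.
    std::shared_ptr<SetSketch> detachSet()
    {
        std::lock_guard lock(mutex);
        return std::move(set_and_key);
    }

private:
    mutable std::mutex mutex;
    std::shared_ptr<SetSketch> set_and_key = std::make_shared<SetSketch>();
};
```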
LOW — PlannerJoinTree: removal of to_stage == FetchColumns guard may add unnecessary output columns in aggregation stages
| Anchor | PlannerJoinTree.cpp line 1364: original guard select_query_options.to_stage == QueryProcessingStage::FetchColumns removed |
| Impact | For remote tables or table functions with to_stage == WithMergeableState, output columns are now always added to updated_actions_dag_outputs, even when they are not needed at that aggregation stage. This could interfere with column pruning or produce extra columns in distributed aggregation results. Impact is likely performance (wider blocks), not correctness. |
| Trigger | Any distributed query with aggregation using a remote table function and to_stage = WithMergeableState. |
| Why defect | The stage check existed to avoid adding columns that are irrelevant at the merge stage. Its removal is a correctness tradeoff that may cause plan regressions for non-cluster remote queries. |
| Fix direction | Keep the fix for table functions but scope it: only remove the FetchColumns guard for cluster-function storage, not for all remote tables; see the sketch below. |
| Regression test direction | EXPLAIN the distributed query with aggregation and verify the output column list does not grow unnecessarily after the change. |
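
A hedged sketch of the scoping suggested in the fix direction (the enum and the is_cluster_function flag are illustrative, not the actual PlannerJoinTree code): keep adding the extra output columns only for cluster-function storages and retain the original FetchColumns guard everywhere else.

```cpp
// Illustrative processing stages; ClickHouse's real enum has more values.
enum class ProcessingStage { FetchColumns, WithMergeableState, Complete };

// Decide whether table-expression output columns are appended to the actions DAG outputs.
// Original code: only at FetchColumns. PR change: always. Scoped variant: always for
// cluster-function storages, original behaviour for all other remote tables.
bool shouldAddTableExpressionOutputs(ProcessingStage to_stage, bool is_cluster_function)
{
    if (is_cluster_function)
        return true; // the behaviour the s3Cluster distributed-call fix needs
    return to_stage == ProcessingStage::FetchColumns; // preserve the old guard elsewhere
}
```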
Previously Flagged Defect (Codex P1) — Evaluated
| Codex finding | Assessment |
|---|---|
| P1 — updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() (wrong when cluster is on the right of JOIN). | Partially addressed by author assertion ("method called when xxCluster is on the left side"). The invariant is not enforced in code, and the passing test result2 (cluster on right) requires further investigation. Retained as a MEDIUM finding above. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Frontports for Antalya 26.3
CI/CD Options
Exclude tests:
Regression jobs to run: