
26.3 Antalya port - fixes for s3Cluster distributed calls #1646

Merged
zvonand merged 14 commits into antalya-26.3 from frontport/antalya-26.3/fix_remote_calls on May 6, 2026
Conversation

ianton-ru commented Apr 14, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Frontports for Antalya 26.3

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

ianton-ru (Author) commented:

@codex review

github-actions Bot commented Apr 14, 2026

Workflow [PR], commit [e4287e0]

chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83ae6b17bf


if (info.has_join)
{
    auto join_node = query_node.getJoinTree();
    query_tree_distributed = join_node->as<JoinNode>()->getLeftTableExpression()->clone();
}

P1 — Rewrite query from the active table, not the join's left side

updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() when building query_to_send, which is incorrect when this IStorageCluster::read call is for a table function on the right side of a JOIN (e.g. LEFT JOIN ... s3Cluster(...)). In that case the rewritten remote query is built from the other table, so the shard executes the wrong source (often a local table name that does not exist remotely), causing query failure or incorrect data flow for object_storage_cluster_join_mode='local'.


ianton-ru (Author) replied:
Method is called when xxCluster is on the left side of the join.

@zvonand zvonand merged commit d9d3710 into antalya-26.3 May 6, 2026
284 of 311 checks passed
alsugiliazova (Member) commented:

Verification report: Altinity/ClickHouse PR #1646

Conclusion

PR is merged. CI red on head, but every failure is either a pre-existing flake or a regression-suite scenario broken at baseline on antalya-26.3. No PR-caused regression found.

Caveat — partial frontport (same as #1640). PR #1646 lives on antalya-26.3, which is still missing several companion frontports from antalya-26.1 (remote-initiator improvements 2 from #1608/#1638, CREATE TABLE from table function fix from #1701, system.databases data-lake fix from #1690, and miscellaneous export-partition / iceberg / parquet plumbing). The chronic regression failures observed here are missing-dependency symptoms on the branch, not breakage introduced by this PR. One concrete example: the iceberg sort key with month transform and utc timezone test errors with Code: 552. UNRECOGNIZED_ARGUMENTS: Unrecognized option '--iceberg_partition_timezone' — the test asserts on a CLI/setting that the regression test expects, fired against a build that does not surface it on the command line. The setting itself is declared in src/Core/Settings.cpp on the merged frontport branch, so the wiring discrepancy is somewhere between regression-test setup and the binary in this CI run — not from PR #1646's diff.


CI on head e4287e00 — failures

PR test workflow (2 failed checks, 43 success, 1 cancelled)

Check | Failing test | Class
Stateless tests (amd_debug, sequential) | 00157_cache_dictionary | Pre-existing flake — 106 fails / 25 PRs / 90d
Stateless tests (arm_binary, sequential) | 00157_cache_dictionary | same

Regression workflow (10 failed checks)

Check | Top failing test on PR-1646 builds (30d) | Baseline fail rate (30d, all PRs on 26.3) | Class
Swarms (Release + Aarch64) | /swarms/feature/node failure/network failure (×16) | 44% (303 OK / 235 Fail) | Pre-existing broken
same | /swarms/feature/cluster discovery/multiple paths (×16) | 30% (431 / 185) | Pre-existing broken
same | /swarms/feature/swarm join sanity/join with clause (×16) | 34% (408 / 207) | Pre-existing broken
S3Export (partition) (Release + Aarch64) | /s3/minio/export tests/export partition/sanity/no partition by (×16) | 50% (282 / 286) | Pre-existing broken
Iceberg (1) (Release + Aarch64) | /iceberg/iceberg engine/rest catalog/sort key timezone/sort key with month transform and utc timezone (×16) | 41% (267 / 188) | Missing-dependency symptom: --iceberg_partition_timezone not exposed by tested binary
same | /iceberg/iceberg engine/rest catalog/iceberg iterator race condition (×16) | 28% (528 / 210) | Pre-existing flaky
Iceberg (2) (Release + Aarch64) | /iceberg/iceberg engine/glue catalog/iceberg iterator race condition (×16) | 28% (524 / 205) | Pre-existing flaky
Parquet (Release + Aarch64) | /parquet/postgresql/compression type/=* and /parquet/mysql/compression type/=* (×16 each) | ~36% | Pre-existing flaky

Regression DB on /PRs/1646/ builds (30d): 1,234 Fail / 41,062 OK ≈ 3.0%. Every top failure matches the all-PR baseline fail rate on antalya-26.3.


Related to PR diff?

PR fixes s3Cluster distributed-call paths (35 files in storage/cluster/table-function layers).

Failing test | Diff overlap | Related?
00157_cache_dictionary | none (dictionary cache) | No
swarms / node failure, cluster discovery, swarm join sanity | thematic overlap (swarm = cluster of object-storage nodes), but failures occur at the same 30–44% baseline on PRs that do not touch s3Cluster | No
s3_export_partition / no partition by | none (export-partition path) | No
iceberg / sort key timezone month transform utc | unrelated (timezone partitioning feature); failure is UNRECOGNIZED_ARGUMENTS on the binary — missing wiring at the test/binary layer | No (missing-dep symptom)
iceberg / iterator race condition (rest + glue) | long-standing chronic flake | No
parquet / postgresql + mysql round-trip | none (parquet ↔ external DB) | No

No failure exceeds the all-PR baseline or intersects this PR's touched code path in a way unique to #1646.


Recommendations

  1. No action on this PR. Merged and effectively clean.
  2. Re-verify after the dependency frontports land on antalya-26.3 (remote-initiator improvements 2 from #1608/#1638, the fix for creating a table from a table function from #1701, the cherry-pick fixing system.databases always showing data-lake catalog databases from #1690, plus whatever wires iceberg_partition_timezone as a CLI option in the tested binary). The iceberg UNRECOGNIZED_ARGUMENTS failure is a sharp tell that the regression-test environment expects features the branch's binary does not (yet) expose.
  3. Pair with the same chronic-baseline cleanup recommended in VERIFICATION_PR_1640.md: swarms network failure / cluster discovery, export-partition no partition by, iceberg iterator race conditions, parquet postgresql/mysql round-trip — these affect every PR on antalya-26.3 and should be tracked as single issues rather than re-attributed per PR.

@alsugiliazova alsugiliazova added the verified Approved for release label May 11, 2026
alsugiliazova (Member) commented:

Audit Report — PR #1646

PR: 26.3 Antalya port - fixes for s3Cluster distributed calls
Merged: May 6, 2026 into antalya-26.3
Auditor: AI (Sonnet 4.6 via Cursor audit-review skill)
Date: 2026-05-11


Scope

Frontport of antalya-26.1 PR #1395 (itself a bundle of #583, #584, #703, #720, #972, #1082, #1160, #1168). Key feature areas:

Area | Files
JOIN rewriting for cluster queries (object_storage_cluster_join_mode='local') | IStorageCluster.cpp/h
FutureSetFromSubquery concurrency fix | PreparedSets.cpp/h
ObjectFilterStep — hive-partition filter push-down for cluster reads | ObjectFilterStep.cpp/h, InterpreterSelectQuery.cpp, Planner.cpp, optimizePrimaryKeyConditionAndLimit.cpp
remote() + cluster function INITIAL_QUERY identity fix | RemoteQueryExecutor.cpp/h, ReadFromRemote.cpp/h, Context.cpp
PlannerJoinTree remote output columns | PlannerJoinTree.cpp
AST extraction helpers | extractTableFunctionFromSelectQuery.cpp/h
ObjectStorageClusterJoinMode setting | SettingsEnums.h/cpp, Settings.cpp/h, SettingsChangesHistory.cpp

~2,000 diff lines, 35 files.


Call Graph

User query (s3Cluster / icebergCluster JOIN local_table)
  └─ IStorageCluster::getQueryProcessingStage()
       └─ getQueryTreeInfo()        ← checks for JOIN, CROSS JOIN, WHERE with local columns
  └─ IStorageCluster::read()
       └─ updateQueryWithJoinToSendIfNeeded()
            └─ SearcherVisitor       ← finds first TABLE/TABLE_FUNCTION in DFS order
            └─ CollectUsedColumnsForSourceVisitor
            └─ queryNodeToDistributedSelectQuery()
       └─ updateQueryToSendIfNeeded()
       └─ [cluster path] ReadFromCluster step added to plan
            └─ ObjectFilterStep added above ReadFromCluster (hive partitioning)
                 └─ optimizePrimaryKeyConditionAndLimit pushes filter back to ReadFromCluster
            └─ ReadFromCluster::initializePipeline()
                 └─ RemoteQueryExecutor (with is_remote_function / shard_count)
                      └─ sendQueryUnlocked() → INITIAL_QUERY rewrite when is_remote_function && shard_count==1

FutureSetFromSubquery (IN subquery materialisation, concurrent access)
  └─ get() / build() / setQueryPlan() / setExternalTable() / getTypes() → all mutex-protected
  └─ get_unsafe() / build_unsafe() → used inside already-locked callers

Confirmed Defects

MEDIUM — ObjectFilterStep is a pass-through: filter is silently dropped if optimizer push-down fails

Anchor: ObjectFilterStep::updatePipeline (line 404): return std::move(pipelines.front()) — no filtering applied
Impact: If optimizePrimaryKeyConditionAndLimit does not find a SourceStepWithFilter below the ObjectFilterStep (e.g., the plan is restructured by other optimizations between step insertion and the optimizer pass), the hive-partition filter is silently dropped. The result: all files are read regardless of hive-partition column predicates — MORE rows than expected pass to the coordinator. For COUNT() queries this produces wrong numbers.
Trigger: use_hive_partitioning=1 + s3Cluster(...) + WHERE key <= N on a hive-partitioned path, in any scenario where the optimizer's optimizePrimaryKeyConditionAndLimit walk doesn't reach the ReadFromCluster source before another optimization interposes a non-transparent step. The current code guards the insertion with typeid_cast<ReadFromCluster *>(query_plan.getRootNode()->step.get()), which mitigates the most obvious case, but the optimizer runs later and may see a different stack.
Why defect: ObjectFilterStep carries the filter DAG but deliberately does not apply it. Correctness depends entirely on the optimizer succeeding. There is no fallback: if push-down fails, no other step applies the filter. A proper defensive implementation would either be a real FilterStep (so rows are always filtered even if push-down fails), or assert that the source step is a SourceStepWithFilter and throw if not.
Fix direction: Either implement updatePipeline as a real filter (like FilterStep) so it always applies the predicate, or verify in step-insertion code that the optimizer will always run this step successfully and throw LOGICAL_ERROR if push-down does not occur.
Regression test direction: Add a query plan optimization test that inserts an incompatible step between ObjectFilterStep and ReadFromCluster and asserts the filter is not silently dropped.

MEDIUM — SearcherVisitor always takes the left-most table expression; cluster function on the right side of JOIN produces a wrong rewritten query

Anchor: IStorageCluster::updateQueryWithJoinToSendIfNeeded / SearcherVisitor (DFS, getSubqueryDepth() <= 2 && !passed_node): finds the first TABLE/TABLE_FUNCTION node encountered; then query_tree_distributed = join_node->getLeftTableExpression()->clone()
Impact: When s3Cluster(...) is the RIGHT operand of a JOIN (e.g. FROM local_table JOIN s3Cluster(...)), SearcherVisitor finds local_table first. table_function_node = local_table. query_tree_distributed = left expression = local_table. The rewritten remote query is SELECT … FROM local_table, which does not exist on remote nodes. Result: wrong data or exception on remote workers.
Trigger: FROM local_table JOIN s3Cluster(…) SETTINGS object_storage_cluster_join_mode='local'. The Codex P1 comment flagged this; the author's response ("method called when xxCluster is on the left side") is an assertion about caller behavior, not an enforcement in code. The test test_joins result2 places s3Cluster on the right and asserts equality with result1 — if this passes in CI it suggests the planner does reorder the join or read() is not called for the right-side table in this configuration; if so the invariant should be documented or enforced.
Why defect: The code assumes the cluster function is always the left-most TABLE/TABLE_FUNCTION in DFS order. This invariant is not checked nor enforced. A JOIN with cluster function on the right silently selects the wrong table expression for the remote query.
Fix direction: Replace the generic DFS with a targeted search for the cluster storage node, or assert that table_function_node is of the expected cluster type before proceeding and throw LOGICAL_ERROR otherwise.
Regression test direction: Add an integration test: FROM local_table INNER JOIN s3Cluster(…) ON … SETTINGS object_storage_cluster_join_mode='local' and verify results match a reference computed without the cluster setting.

LOW — FutureSetFromSubquery::detachSetAndKey is not protected by the new mutex

Anchor: PreparedSets.cpp / FutureSetFromSubquery::detachSetAndKey (line 194): moves set_and_key out without acquiring mutex
Impact: If another thread concurrently calls get(), getTypes(), or build() (all of which now hold mutex and dereference set_and_key), after detachSetAndKey has moved set_and_key to nullptr, the dereference set_and_key->set is a null pointer dereference.
Trigger: Concurrent call to detachSetAndKey and any mutex-protected method. In practice, detachSetAndKey is likely called in a single-threaded setup phase; the risk depends on the usage context.
Why defect: All other public methods that access set_and_key were wrapped with the mutex, but detachSetAndKey was not, creating an inconsistent protection boundary.
Fix direction: Add std::lock_guard lock(mutex) to detachSetAndKey, or document that detachSetAndKey must not be called concurrently with other methods.
Regression test direction: Thread-sanitizer run of a query that builds a subquery set concurrently with the set being detached.

LOW — PlannerJoinTree: removal of to_stage == FetchColumns guard may add unnecessary output columns in aggregation stages

Anchor: PlannerJoinTree.cpp line 1364: original guard select_query_options.to_stage == QueryProcessingStage::FetchColumns removed
Impact: For remote tables or table functions with to_stage == WithMergeableState, output columns are now always added to updated_actions_dag_outputs, even when they are not needed at that aggregation stage. This could interfere with column pruning or produce extra columns in distributed aggregation results. Impact is likely performance (wider blocks), not correctness.
Trigger: Any distributed query with aggregation using a remote table function and to_stage = WithMergeableState.
Why defect: The stage check existed to avoid adding columns that are irrelevant at the merge stage. Its removal is a correctness tradeoff that may cause plan regressions for non-cluster remote queries.
Fix direction: Keep the fix for table functions but scope it: only remove the FetchColumns guard for cluster-function storage, not for all remote tables.
Regression test direction: EXPLAIN the distributed query with aggregation and verify the output column list does not grow unnecessarily after the change.

Previously Flagged Defect (Codex P1) — Evaluated

Codex finding | Assessment
P1 — updateQueryWithJoinToSendIfNeeded always takes join_node->getLeftTableExpression() (wrong when cluster is on the right of JOIN). | Partially addressed by author assertion ("method called when xxCluster is on the left side"). The invariant is not enforced in code and the passing test result2 (cluster on right) requires further investigation. Retained as a MEDIUM finding above.

@svb-alt svb-alt added the forwardport This is a frontport of code that existed in previous Antalya versions label May 12, 2026

Labels

antalya antalya-26.3 forwardport This is a frontport of code that existed in previous Antalya versions port-antalya PRs to be ported to all new Antalya releases verified Approved for release

4 participants