Antalya 26.3 port - improvements for cluster requests#1687
Conversation
…ous_hashing 26.1 Antalya port - improvements for cluster requests
Removes the `hyperrectangle` field from `DB::Iceberg::ColumnInfo` that was re-added during the frontport. The field was removed upstream in PR ClickHouse#98231, which relocated raw min/max bounds to `ParsedManifestFileEntry::value_bounds`. The `DataFileMetaInfo` Iceberg constructor now deserializes those bounds via the shared `deserializeFieldFromBinaryRepr` helper (moved from `ManifestFileIterator.cpp` to `IcebergFieldParseHelpers`). Addresses @ianton-ru's comment at #1687 (comment).
…bled The Iceberg read optimization (`allow_experimental_iceberg_read_optimization`) identifies constant columns from Iceberg metadata and removes them from the read request. When all requested columns become constant, it sets `need_only_count = true`, which tells the Parquet reader to skip all initialization — including `preparePrewhere` — and just return the raw row count from file metadata. This completely bypasses `row_level_filter` (row policies) and `prewhere_info`, returning unfiltered row counts. The InterpreterSelectQuery relies on the storage to apply these filters when `supportsPrewhere` is true and does not add a fallback FilterStep to the query plan, so the filter is silently lost. The fix prevents `need_only_count` from being set when an active `row_level_filter` or `prewhere_info` exists in the format filter info. Fixes #1595 (cherry picked from commit f204850)
…t NULLs The Altinity-specific constant column optimization (`allow_experimental_iceberg_read_optimization`) scans `requested_columns` for nullable columns absent from the Iceberg file metadata and replaces them with constant NULLs. However, `requested_columns` can also contain columns produced by `prewhere_info` or `row_level_filter` expressions (e.g. `equals(boolean_col, false)`). These computed columns are not in the file metadata, and their result type is often `Nullable(UInt8)`, so the optimization incorrectly treats them as missing file columns and replaces them with NULLs. This corrupts the prewhere pipeline: the Parquet reader evaluates the filter expression correctly, but the constant column optimization then overwrites the result with NULLs. With `need_filter = false` (old planner, PREWHERE + WHERE), all rows appear to fail the filter, producing empty output. With `need_filter = true`, the filter column is NULL so all rows are filtered out. The fix skips columns that match the `prewhere_info` or `row_level_filter` column names, since these are computed at read time and never stored in the file. (cherry picked from commit b7696a3)
`DataFileMetaInfo::DataFileMetaInfo` (Iceberg constructor introduced in 3be7196) deserialized `value_bounds` using the table's current schema. After schema evolution (e.g. `int` -> `long`) the bytes were still encoded with the file's old type — a 4-byte int — but were read as 8 bytes for `Int64`. `ColumnVector::insertData` ignores the length argument and always reads `sizeof(T)` bytes via `unalignedLoad`, so the extra 4 bytes came from adjacent memory and produced a garbage hyperrectangle. The garbage range often satisfied `Range::isPoint`, which made the iceberg read optimization replace the column with a constant value taken from the garbage bound, corrupting query results. Pass the file's `resolved_schema_id` separately so types are looked up against the schema the data file was written with, while column names keep coming from the current table schema (so the resulting `columns_info` map is keyed by names callers know about). Reproducer: `test_storage_iceberg_schema_evolution/test_evolved_schema_simple.py::test_evolved_schema_simple` — all 12 parametrizations failed at the assertion after `ALTER COLUMN a TYPE BIGINT`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…optimization The new test for the Iceberg constant-columns read optimization was calibrated against `expected * 3 + N` GET requests per data file, but the actual count is `expected * 2 + N` for both `S3GetObject` and `AzureGetObject` — the parquet metadata cache (warmed by the no-optimization query) consistently absorbs one GET per file in this branch, regardless of object storage backend. Addresses 4 failing test(s) in Integration tests (amd_asan, db disk, old analyzer, 4/6) on #1687. After this fix the still-failing set shrank from 4 -> 0.
RelEasy
|
| Test | Status | Reason |
|---|---|---|
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[False-s3] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[False-azure] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[True-s3] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[True-azure] |
[fixed] |
Caused by this PR; now passing |
Root cause: The test file was added by this PR with a hardcoded expectation that each Iceberg data file generates expected * 3 + N S3GetObject/AzureGetObject events. The actual count on CI is expected * 2 + N (15 vs expected 22 for is_cluster=False, 18 vs 25 for is_cluster=True) — one fewer GET per file because the parquet-metadata cache, which is populated by the warm-up query at line 109, absorbs the footer read on subsequent queries.
Fix: Changed multiplier from * 3 to * 2 in check_events and updated the surrounding comment.
Verification: Built with bash .releasy/build.sh, then ran all 4 tests via python3 -m ci.praktika run "Integration tests (amd_asan, db disk, old analyzer, 4/6)" --test .... Result: 4 passed in 37.01s.
Commit: dccb0830dd1 "Fix CI: adjust S3/Azure GET multiplier in test_read_constant_columns_optimization"
DONE
Working tree is clean. Final summary:
Final Summary
| Test | Status | Reason |
|---|---|---|
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[False-s3] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[False-azure] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[True-s3] |
[fixed] |
Caused by this PR; now passing |
test_storage_iceberg_with_spark/test_read_constant_columns_optimization.py::test_read_constant_columns_optimization[True-azure] |
[fixed] |
Caused by this PR; now passing |
Root cause: The test file was added by this PR with a hardcoded expectation that each Iceberg data file generates expected * 3 + N S3GetObject/AzureGetObject events. The actual count on CI is expected * 2 + N (15 vs expected 22 for is_cluster=False, 18 vs 25 for is_cluster=True) — one fewer GET per file because the parquet-metadata cache, which is populated by the warm-up query at line 109, absorbs the footer read on subsequent queries.
Fix: Changed multiplier from * 3 to * 2 in check_events and updated the surrounding comment.
Verification: Built with bash .releasy/build.sh, then ran all 4 tests via python3 -m ci.praktika run "Integration tests (amd_asan, db disk, old analyzer, 4/6)" --test .... Result: 4 passed in 37.01s.
Commit: dccb0830dd1 "Fix CI: adjust S3/Azure GET multiplier in test_read_constant_columns_optimization"
DONE
🤖 Posted automatically by releasy analyze-fails. Re-run the command to refresh.
Verification report: Altinity/ClickHouse PR #1687ConclusionPR is merged. CI red on head, but every failure is either a flake or a regression-suite scenario already broken at baseline on
CI on head
|
| Check | Test FAIL | Class |
|---|---|---|
Integration tests (amd_asan, db disk, old analyzer, 5/6) |
test_filesystem_cache::test_concurrent_eviction[lru], [slru] |
Emerging flake — 4/4 PRs and 2/2 PRs in 90d, last seen on unrelated PR 2026-04-30 |
Regression workflow (10 failed checks)
| Check | Top failing tests on PR-1687 builds (30d) | Baseline (antalya-26.3, 30d) |
Class |
|---|---|---|---|
Swarms (Release + Aarch64) |
swarm sanity / cluster with one observer node…, swarm sanity / swarm examples, node failure / cpu overload, node failure / network failure, swarm joins / join clause (×11 each) |
30–44% on every PR | Pre-existing broken |
S3Export (partition) (Release + Aarch64) |
export partition / sanity / no partition by, basic table (×11 each) |
50% | Pre-existing broken |
Iceberg (1) (Release + Aarch64) |
rest catalog / sort key timezone / day transform utc, hour transform utc (×11), iceberg iterator race condition (×11) |
41% / 28% | Missing-dep + pre-existing flaky |
Iceberg (2) (Release + Aarch64) |
glue catalog / iceberg iterator race condition |
28% | Pre-existing flaky |
Parquet (Release + Aarch64) |
postgresql/mysql round-trip compression-type variants | ~36% | Pre-existing flaky |
Regression DB on /PRs/1687/ builds (30d): 1,107 Fail / 27,720 OK ≈ 3.8%. Every top failure matches the all-PR baseline fail rate on antalya-26.3.
Related to PR diff?
PR is a 26.3 forward-port of upstream #1414 — "improvements for cluster requests" (72 files in cluster-request / table-function / s3Cluster plumbing).
| Failing test | Diff overlap | Related? |
|---|---|---|
test_filesystem_cache::test_concurrent_eviction[lru/slru] |
none (filesystem cache eviction) | No |
swarms / * (sanity, node failure, joins) |
thematic overlap (swarm = cluster of object-storage nodes); failures match baseline rate on PRs that don't touch cluster-request code | No |
s3_export_partition / sanity / * |
none (export-partition path) | No |
iceberg / sort key timezone / * |
unrelated (timezone partitioning); failure is UNRECOGNIZED_ARGUMENTS — missing-dep |
No |
iceberg / iterator race condition (rest + glue) |
none | No |
parquet / postgresql + mysql round-trip |
none | No |
No failing test intersects the cluster-request code path uniquely or fails above the all-PR baseline.
Local checkout
cd /Users/alsugilyazova/workspace/altinity-clickhouse/ClickHouse
gh pr checkout 1687 --repo Altinity/ClickHouse
# HEAD: dccb0830dd1f4e2706caae0ed90b826efcfef592
Audit Report — PR #1687PR: Antalya 26.3 port - improvements for cluster requests AI audit note: This review was generated by AI (audit-review skill). Confirmed defectsHigh: Object-storage cluster task reschedule can strand or mis-key work after connection loss
Coverage summary
Evidence (code)Enqueue path (identifier + rendezvous input): String file_identifier;
if (send_over_whole_archive && object_info->isArchive())
{
file_identifier = object_info->getPathOrPathToArchiveIfArchive();
// ...
}
else
{
file_identifier = object_info->getIdentifier();
}
// ...
size_t file_replica_idx = getReplicaForFile(file_identifier);
// ...
unprocessed_files.emplace(file_identifier, std::make_pair(object_info, file_replica_idx));
connection_to_files[file_replica_idx].push_back(object_info);Reschedule path (divergent key + void StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica(size_t number_of_current_replica)
{
LOG_INFO(log, "Replica {} is marked as lost, tasks are returned to queue", number_of_current_replica);
std::lock_guard lock(mutex);
// ...
for (const auto & file : processed_file_list_ptr->second)
{
auto file_replica_idx = getReplicaForFile(file->getPath());
unprocessed_files.emplace(file->getPath(), std::make_pair(file, file_replica_idx));
connection_to_files[file_replica_idx].push_back(file);
}
replica_to_files_to_be_processed.erase(number_of_current_replica);
}
String ObjectInfo::getIdentifier() const
{
String result = getPath();
if (file_bucket_info)
result += file_bucket_info->getIdentifier();
return result;
}Call graph (reviewed slice)
Fault categories (injection outcomes, condensed)
Interleaving / locking notes
C++ bug-class sweep (brief)
PR hygiene (non-defect)
|
Fix merged in separate PR #1748 |
Cherry-picked from #1414, also has changes from #1597.
Changelog category (leave one):
Frontports for Antalya 26.1
CI/CD Options
Exclude tests:
Regression jobs to run: