forked from ClickHouse/ClickHouse
-
Notifications
You must be signed in to change notification settings - Fork 18
Antalya 26.3 Backport of #99935 - Support more flexible data paths validation in iceberg #1769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
mkmkme
wants to merge
2
commits into
antalya-26.3
Choose a base branch
from
backports/antalya-26.3/99935
base: antalya-26.3
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1 change: 1 addition & 0 deletions
1
tests/queries/0_stateless/04033_iceberg_mismatched_location.reference
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| BAD_ARGUMENTS |
47 changes: 47 additions & 0 deletions
47
tests/queries/0_stateless/04033_iceberg_mismatched_location.sh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| #!/usr/bin/env bash | ||
| # Tags: no-fasttest | ||
| # Reproduces https://github.com/ClickHouse/ClickHouse/issues/92348 | ||
| # Iceberg std::out_of_range exception when metadata `location` field differs from | ||
| # ClickHouse table path (e.g. table created by Spark with a different URI scheme/path). | ||
|
|
||
| CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) | ||
| # shellcheck source=../shell_config.sh | ||
| . "$CURDIR"/../shell_config.sh | ||
|
|
||
| TABLE_PATH="04033_iceberg_mismatched/${CLICKHOUSE_TEST_UNIQUE_NAME}" | ||
|
|
||
| ${CLICKHOUSE_CLIENT} --allow_experimental_insert_into_iceberg 1 -q " | ||
| DROP TABLE IF EXISTS t_iceberg_mismatch; | ||
| CREATE TABLE t_iceberg_mismatch (c0 Int) ENGINE = IcebergS3(s3_conn, filename = '${TABLE_PATH}'); | ||
| INSERT INTO t_iceberg_mismatch VALUES (1); | ||
| DROP TABLE IF EXISTS t_iceberg_mismatch; | ||
| " | ||
|
|
||
| # Modify metadata: set location to a very long path, manifest-list to just the filename. | ||
| # This simulates a Spark-created table where ClickHouse rewrites the location field | ||
| # but Spark's manifest-list entries use short relative paths. | ||
| # Before the fix this caused std::out_of_range exception in IcebergPathResolver::resolve | ||
| # because table_location is longer than data_path (the manifest filename), | ||
| # causing unsigned underflow and out-of-range substr. | ||
| ${CLICKHOUSE_CLIENT} -q " | ||
| SELECT * FROM s3(s3_conn, filename='${TABLE_PATH}/metadata/v2.metadata.json', structure='line String', format='LineAsString') | ||
| " | python3 -c " | ||
| import json, sys | ||
| m = json.load(sys.stdin) | ||
| m['location'] = 's3://some-bucket/warehouse/very/deep/nested/path/to/a/spark/created/table/location/with/extra/long/segments/to/ensure/prefix/exceeds/data/path/length' | ||
| for s in m.get('snapshots', []): | ||
| s['manifest-list'] = s['manifest-list'].rsplit('/', 1)[-1] | ||
| print(json.dumps(m)) | ||
| " | ${CLICKHOUSE_CLIENT} -q " | ||
| INSERT INTO FUNCTION s3(s3_conn, filename='${TABLE_PATH}/metadata/v2.metadata.json', structure='line String', format='LineAsString') | ||
| SETTINGS s3_truncate_on_insert=1 | ||
| SELECT * FROM input('line String') FORMAT LineAsString | ||
| " | ||
|
|
||
| # Disable iceberg metadata cache as a client-level setting so the modified metadata | ||
| # file is actually re-read (not served from cache populated during CREATE TABLE above). | ||
| # Before the fix this would crash with std::out_of_range in IcebergPathResolver::resolve. | ||
| # After the fix it returns a proper BAD_ARGUMENTS error about unresolvable path. | ||
| ${CLICKHOUSE_CLIENT} --use_iceberg_metadata_files_cache 0 -q " | ||
| SELECT * FROM icebergS3(s3_conn, filename='${TABLE_PATH}'); | ||
| " 2>&1 | grep -o "BAD_ARGUMENTS" | head -1 |
1 change: 1 addition & 0 deletions
1
tests/queries/0_stateless/04034_iceberg_spark_style_location.reference
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| 42 |
50 changes: 50 additions & 0 deletions
50
tests/queries/0_stateless/04034_iceberg_spark_style_location.sh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,50 @@ | ||
| #!/usr/bin/env bash | ||
| # Tags: no-fasttest | ||
| # Positive test: Spark-created Iceberg table where metadata location field uses | ||
| # a completely different scheme/bucket (s3a://spark-bucket/...) than ClickHouse's path. | ||
| # The manifest-list entries use full absolute paths with Spark's URI. | ||
| # IcebergPathResolver::resolve should strip table_location prefix and prepend table_root. | ||
|
|
||
| CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) | ||
| # shellcheck source=../shell_config.sh | ||
| . "$CURDIR"/../shell_config.sh | ||
|
|
||
| TABLE_PATH="04034_iceberg_spark/${CLICKHOUSE_TEST_UNIQUE_NAME}" | ||
|
|
||
| ${CLICKHOUSE_CLIENT} --allow_experimental_insert_into_iceberg 1 -q " | ||
| DROP TABLE IF EXISTS t_iceberg_spark; | ||
| CREATE TABLE t_iceberg_spark (c0 Int) ENGINE = IcebergS3(s3_conn, filename = '${TABLE_PATH}'); | ||
| INSERT INTO t_iceberg_spark VALUES (42); | ||
| DROP TABLE IF EXISTS t_iceberg_spark; | ||
| " | ||
|
|
||
| # Simulate Spark: change location to s3a://spark-bucket/warehouse/spark_table | ||
| # and rewrite manifest-list paths to use the same Spark prefix. | ||
| SPARK_LOCATION='s3a://spark-bucket/warehouse/db/spark_table' | ||
| ${CLICKHOUSE_CLIENT} --input_format_parallel_parsing 0 --output_format_parallel_formatting 0 -q " | ||
| SELECT * FROM s3(s3_conn, filename='${TABLE_PATH}/metadata/v2.metadata.json', structure='line String', format='LineAsString') | ||
| " | python3 -c " | ||
| import json, sys | ||
| m = json.load(sys.stdin) | ||
| old_location = m['location'].rstrip('/') | ||
| new_location = '${SPARK_LOCATION}'.rstrip('/') | ||
| m['location'] = new_location | ||
| for s in m.get('snapshots', []): | ||
| ml = s['manifest-list'] | ||
| # Strip old_location prefix (with or without trailing slash) and rejoin with new_location | ||
| for prefix in [old_location + '/', old_location]: | ||
| if ml.startswith(prefix): | ||
| s['manifest-list'] = new_location + '/' + ml[len(prefix):].lstrip('/') | ||
| break | ||
| print(json.dumps(m)) | ||
| " | ${CLICKHOUSE_CLIENT} -q " | ||
| INSERT INTO FUNCTION s3(s3_conn, filename='${TABLE_PATH}/metadata/v2.metadata.json', structure='line String', format='LineAsString') | ||
| SETTINGS s3_truncate_on_insert=1 | ||
| SELECT * FROM input('line String') FORMAT LineAsString | ||
| " | ||
|
|
||
| # The query should succeed: IcebergPathResolver::resolve strips the Spark | ||
| # table_location prefix from the manifest-list path and prepends table_root. | ||
| ${CLICKHOUSE_CLIENT} --use_iceberg_metadata_files_cache 0 -q " | ||
| SELECT * FROM icebergS3(s3_conn, filename='${TABLE_PATH}'); | ||
| " |
1 change: 1 addition & 0 deletions
1
tests/queries/0_stateless/04061_iceberg_bad_metadata_file_name.reference
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| BAD_ARGUMENTS |
14 changes: 14 additions & 0 deletions
14
tests/queries/0_stateless/04061_iceberg_bad_metadata_file_name.sh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| #!/usr/bin/env bash | ||
| # Tags: no-fasttest | ||
| # Regression test: passing a garbage iceberg_metadata_file_path (e.g. '.*') | ||
| # used to cause std::length_error inside getMetadataFileAndVersion because | ||
| # find_first_of returned npos. After the fix it must return BAD_ARGUMENTS. | ||
|
|
||
| CURDIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) | ||
| # shellcheck source=../shell_config.sh | ||
| . "$CURDIR"/../shell_config.sh | ||
|
|
||
| # The exact query from the fuzzer report. | ||
| ${CLICKHOUSE_CLIENT} -q " | ||
| SELECT * FROM icebergS3('http://localhost:11111/test/est', 'clickhouse', 'clickhouse', SETTINGS iceberg_metadata_file_path = '.*') | ||
| " 2>&1 | grep -o "BAD_ARGUMENTS" | head -1 |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I see this change is from ClickHouse#100420.
Do we want to backport this huge PR too?