Surface corrupt Iceberg metadata as 422 instead of masking as table-not-found#541
Conversation
…ot-found IllegalArgumentException thrown during metadata parsing (e.g. missing schema ID) was mapped to HTTP 400 by the global exception handler, which the catalog client silently swallowed, causing Spark/Trino to report "Table does not exist" instead of the real error. Fix: catch metadata corruption exceptions in refreshMetadata() and wrap them in InvalidTableMetadataException (extends UnprocessableEntityException → 422), so the actual cause surfaces to users. Caught exceptions cover the full range of Iceberg metadata failures: - IllegalArgumentException: corrupt content (malformed JSON, missing schema IDs) - IllegalStateException: UUID mismatch between refreshes - NotFoundException: metadata file missing from storage (dangling pointer) - ValidationException: snapshot/partition spec inconsistencies I/O exceptions (RuntimeIOException, UncheckedIOException) are intentionally NOT caught here as they represent transient infrastructure failures, not metadata corruption. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| metadataLoc, | ||
| System.currentTimeMillis() - startTime, | ||
| e); | ||
| throw new InvalidTableMetadataException( |
There was a problem hiding this comment.
It seems not intuitive to have user specified invalid or illegal argument throw an invalid table metadata exception
There was a problem hiding this comment.
This is not an user specified invalid or illegal argument. Here, iceberg is throwing IllegalArgumentException because of corrupted metadata. More specifically, the example I've see is - json file is referring to current-schema-id=6 which doesn't exist.
| metadataLoc, | ||
| System.currentTimeMillis() - startTime); | ||
| } catch (IllegalArgumentException | ||
| | IllegalStateException |
There was a problem hiding this comment.
What is the iceberg client side behavior you expect? Cleanup the written commit or leave it there like how it is handled in CUSE?
There was a problem hiding this comment.
I just looked at the read path while making this change but in the write path - commit step, refresh is done before commit. So, if refreshMetadata() fails, it fails before any new metadata is written — nothing to clean up.
maluchari
left a comment
There was a problem hiding this comment.
422 implies request was not processible. I am inclined towards 500 for this usecase. 500 does not promise transience - IMO, it means server failed to fulfill a valid request. It could cause retries but I would also say that implies retry for 500 errors should be revisited. This is a server side issue and hence we should not categorize it as a client class error.
yes, I agree it's not a client issue. We can go with 500 with clear error message as it falls under a generic server side issue. |
| System.currentTimeMillis() - startTime); | ||
| } catch (IllegalArgumentException | ||
| | IllegalStateException | ||
| | NotFoundException |
There was a problem hiding this comment.
We could separate out NotFoundException and ValidationException to clearly categorize them as 4xx error.
There was a problem hiding this comment.
In this context, both NotFoundException and ValidationException are server-side data integrity issues, not client errors:
NotFoundException: the HouseTable record exists but the metadata file it points to is missing from HDFS — a dangling pointer.ValidationException: snapshot sequence number inconsistency in stored metadata.
| metadataLoc, | ||
| System.currentTimeMillis() - startTime, | ||
| e); | ||
| throw new InvalidTableMetadataException( |
There was a problem hiding this comment.
This may not be right exception as we are considering handling 4 types of exception within the catch block.
There was a problem hiding this comment.
Seems related to above comment.
6928731 to
7fa2e59
Compare
Per review feedback: 422 is a 4xx client error class, but corrupt metadata is a server-side issue. 500 correctly signals that the server failed to fulfill a valid request. - InvalidTableMetadataException now extends RuntimeException (not UnprocessableEntityException) - Added dedicated handler in OpenHouseExceptionHandler mapping to 500 - Added controller test verifying the 500 status code mapping Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7fa2e59 to
80e7c6d
Compare
Summary
A table had corrupt metadata — current-schema-id=6 but schema 6 was missing from the schemas array. Both Spark and Trino reported "Table does not exist", making the issue impossible to diagnose without digging through tables-service logs.
The reason behind Spark and Trino reporting "Table does not exist" is -
TableMetadataParser.fromJson()throwsIllegalArgumentExceptionwhen parsing corrupt metadata.OpenHouseExceptionHandlermapsIllegalArgumentException→ HTTP 400 (Bad Request).OpenHouseTableOperations.doRefresh()) silently swallows 400 responses via.onErrorResume(BadRequest.class, e -> Mono.empty()), converting them toNoSuchTableException→ "Table does not exist".There are couple of issues in this code path
IllegalArgumentException→ HTTP 400 (Bad Request) is done assuming that user side validation will be mapped toIllegalArgumentException. It's done correctly in certain places like likeOpenHouseInternalTableOperations.rebuildSortOrderbut seems messed up in certain places likeStorageManagerwhere it indicates server side issue. This is something can be corrected in follow up PR(s).Keeping the scope only to surfacing iceberg metadata corruption issue, this PR adds
InvalidTableMetadataException(extendsUnprocessableEntityException→ HTTP 422) and catch metadata corruption exceptions inOpenHouseInternalTableOperations.refreshMetadata(). The catch covers the full range of Iceberg metadata failures fromrefreshFromMetadataLocation():IllegalArgumentException: corrupt content — malformed JSON fields, missing schema IDs, invalid partition specs, sort orders, snapshots (~60+ throw sites across TableMetadataParser, SchemaParser, PartitionSpecParser, SortOrderParser, SnapshotParser, JsonUtil, and TableMetadata constructor)IllegalStateException: UUID mismatch — refreshed metadata has a different table UUID than expected.NotFoundException: metadata file missing from storage — HouseTable record exists but the metadata file it points to has been deleted (dangling pointer)ValidationException: snapshot/partition spec inconsistencies — e.g. snapshot sequence number exceeds last sequence number.RuntimeIOExceptionandUncheckedIOExceptionare intentionally not caught as they also represent transient infrastructure failures (HDFS down, network timeout), not necessarily metadata corruption. So, 500 server error is fine.Changes
For all the boxes checked, please include additional details of the changes made in this pull request.
Testing Done
For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.