API, Parquet: Map geometry and geography to Parquet logical types#16765
Open
huan233usc wants to merge 1 commit into
Map Iceberg geometry and geography to and from Parquet's geometry / geography logical type annotations on a BINARY column, passing the resolved CRS and edge algorithm through directly (Iceberg and Parquet use the same algorithm names; the read side resolves names with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings, and maps unset annotation parameters to the Iceberg defaults).

To make the plain geography type round-trip through writers that omit default parameters (an unset Parquet crs / algorithm defaults to OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default algorithm as equal to an omitted one: equals, hashCode, and toString use the resolved getters crs() / algorithm() instead of the raw nullable fields, matching how the CRS already resolves through its getter.

Schema mapping only; the value read/write path and metrics handling are follow-ups.

Co-authored-by: Isaac
Hi @szehon-ho, could you help review this when you have a chance? Thank you very much!
Summary
Map Iceberg `geometry` and `geography` primitive types to and from Parquet's geometry / geography logical type annotations on a `BINARY` column, so the geo types survive a schema round-trip through `ParquetSchemaUtil.convert` in both directions.

- `TypeToMessageType`: emit `geometry` / `geography` as `BINARY` annotated with `LogicalTypeAnnotation.geometryType` / `geographyType`, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (`SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`, `KARNEY`), so the algorithm is mapped by name.
- `MessageTypeToType`: read those annotations back into `Types.GeometryType` / `Types.GeographyType`. An unset `crs` / `algorithm` maps to the Iceberg default (the Parquet defaults are the same: `OGC:CRS84` / `SPHERICAL`), and algorithm names are resolved with `EdgeAlgorithm.fromName`, the same conversion used when parsing geography type strings (`Types.fromPrimitiveString`).
- `Types.GeographyType`: treat an explicit default algorithm as equal to an omitted one, so the mapping above round-trips for plain `geography` and for files written by engines that omit default parameters. `equals` / `hashCode` / `toString` now use the resolved getters `crs()` / `algorithm()` (which already apply the defaults) instead of the raw nullable fields. CRS already resolved this way through its getter; this brings the algorithm in line. No public signature changes.

This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only: the generic value read/write path (`BaseParquetReaders` / `BaseParquetWriter`) and the `ParquetMetrics` guard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive: no behavior changes for non-geo types.

This is 1/N for #16650
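For readers who want the read-side defaulting spelled out, here is a minimal, dependency-free sketch of the rule described above: an unset annotation parameter resolves to the default, and a set algorithm is resolved by name. `EdgeAlgorithmSketch` and `GeoAnnotationSketch` are hypothetical stand-ins written for this illustration, not the Iceberg or Parquet classes, and the case-insensitive lookup is an assumption of the sketch rather than something this PR states.

```java
import java.util.Locale;

// Hypothetical stand-in for Iceberg's EdgeAlgorithm; NOT the real class.
enum EdgeAlgorithmSketch {
  SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY;

  // Resolve an algorithm by name. Case-insensitivity is an assumption of this sketch.
  static EdgeAlgorithmSketch fromName(String name) {
    return valueOf(name.toUpperCase(Locale.ROOT));
  }
}

// Hypothetical helper modelling how unset annotation parameters map to defaults.
final class GeoAnnotationSketch {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final EdgeAlgorithmSketch DEFAULT_ALGORITHM = EdgeAlgorithmSketch.SPHERICAL;

  private GeoAnnotationSketch() {}

  // A null CRS stands for "not set on the Parquet annotation".
  static String resolveCrs(String annotationCrs) {
    return annotationCrs == null ? DEFAULT_CRS : annotationCrs;
  }

  // A null algorithm stands for "not set"; otherwise resolve by name.
  static EdgeAlgorithmSketch resolveAlgorithm(String annotationAlgorithm) {
    return annotationAlgorithm == null
        ? DEFAULT_ALGORITHM
        : EdgeAlgorithmSketch.fromName(annotationAlgorithm);
  }

  public static void main(String[] args) {
    System.out.println(resolveCrs(null));
    System.out.println(resolveAlgorithm(null));
    System.out.println(resolveAlgorithm("KARNEY"));
  }
}
```

Because both sides share the five algorithm names, a by-name lookup is enough in this model; an unknown name surfaces as an `IllegalArgumentException` from `valueOf` rather than silently falling back to the default.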
Test plan
- `TestParquetSchemaUtil#testGeospatialTypeRoundTrip` round-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) through `ParquetSchemaUtil.convert`.
- `TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParameters` reads hand-built `MessageType`s with unset / explicit / explicit-default CRS and algorithm, covering files written by engines that omit defaults, and confirms each maps to the expected Iceberg type.
- `TestTypes#testGeospatialTypeDefaultNormalization` covers `equals()` / `hashCode()` parity for the default-CRS and default-algorithm geography forms, that `algorithm()` still reports `SPHERICAL`, and that a non-default algorithm stays distinct; `testGeospatialTypeToString` extended for the explicit-default rendering.
- `./gradlew :iceberg-api:check :iceberg-parquet:check`: clean (tests, checkstyle, revapi, spotless).
- `./gradlew :iceberg-core:test`: full core suite green; no regressions in `TestSchemaParser` / `TestSingleValueParser` / `TestGeospatialTable` or anywhere geography types are serialized.
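The default-normalization behavior that the `TestTypes` case checks can be illustrated with a small self-contained model. `GeographySketch` below is a hypothetical class written for this example, not `Types.GeographyType`; it only assumes what the description states, namely that the raw fields are nullable and that `equals` / `hashCode` go through the resolved getters. `toString` is left out because its exact rendering is not given here.

```java
import java.util.Objects;

// Hypothetical model of a geography type with nullable raw parameters.
final class GeographySketch {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final String DEFAULT_ALGORITHM = "SPHERICAL";

  private final String crs;        // raw field, null when omitted
  private final String algorithm;  // raw field, null when omitted

  GeographySketch(String crs, String algorithm) {
    this.crs = crs;
    this.algorithm = algorithm;
  }

  // Resolved getters: apply the defaults when the raw field is unset.
  String crs() {
    return crs != null ? crs : DEFAULT_CRS;
  }

  String algorithm() {
    return algorithm != null ? algorithm : DEFAULT_ALGORITHM;
  }

  // Equality compares resolved values, so an explicit default equals an omitted one.
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof GeographySketch)) {
      return false;
    }
    GeographySketch that = (GeographySketch) other;
    return crs().equals(that.crs()) && algorithm().equals(that.algorithm());
  }

  // Hash the same resolved values so equal instances share a hash code.
  @Override
  public int hashCode() {
    return Objects.hash(crs(), algorithm());
  }
}

final class GeographySketchDemo {
  public static void main(String[] args) {
    GeographySketch plain = new GeographySketch(null, null);
    GeographySketch explicitDefault = new GeographySketch("OGC:CRS84", "SPHERICAL");
    GeographySketch karney = new GeographySketch(null, "KARNEY");
    System.out.println(plain.equals(explicitDefault));
    System.out.println(plain.equals(karney));
  }
}
```

Comparing raw fields instead would make `new GeographySketch(null, null)` and `new GeographySketch("OGC:CRS84", "SPHERICAL")` unequal even though they describe the same type, which is the round-trip mismatch the PR description sets out to avoid.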