API, Parquet: Map geometry and geography to Parquet logical types #16765

Open
huan233usc wants to merge 1 commit into apache:main from huan233usc:parquet-geo-schema

@huan233usc huan233usc commented Jun 10, 2026

Contributor

Summary

Map Iceberg geometry and geography primitive types to and from Parquet's geometry / geography logical type annotations on a BINARY column, so the geo types survive a schema round-trip through ParquetSchemaUtil.convert in both directions.

  • TypeToMessageType: emit geometry / geography as BINARY annotated with LogicalTypeAnnotation.geometryType / geographyType, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY), so the algorithm is mapped by name.
  • MessageTypeToType: read those annotations back into Types.GeometryType / Types.GeographyType. An unset crs / algorithm maps to the Iceberg default (the Parquet defaults are the same: OGC:CRS84 / SPHERICAL), and algorithm names are resolved with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings (Types.fromPrimitiveString).
  • Types.GeographyType: treat an explicit default algorithm as equal to an omitted one, so the mapping above round-trips for plain geography and for files written by engines that omit default parameters. equals / hashCode / toString now use the resolved getters crs() / algorithm() (which already apply the defaults) instead of the raw nullable fields — CRS already resolved this way through its getter; this brings the algorithm in line. No public signature changes.
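
The default normalization in the last bullet can be illustrated with a minimal, standalone sketch. `GeographyTypeSketch` and `EdgeAlgorithmSketch` are hypothetical stand-ins, not the real `Types.GeographyType` or `EdgeAlgorithm`; the sketch assumes nullable raw fields, getters that apply the `OGC:CRS84` / `SPHERICAL` defaults, and exact string comparison of the CRS, which may differ from the actual implementation.

```java
// Hypothetical stand-in for Iceberg's edge algorithm enum; constant names are from the PR description.
enum EdgeAlgorithmSketch { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

// Hypothetical, simplified stand-in for Types.GeographyType (not the real class).
final class GeographyTypeSketch {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final EdgeAlgorithmSketch DEFAULT_ALGORITHM = EdgeAlgorithmSketch.SPHERICAL;

  private final String crs;                    // null means "not specified"
  private final EdgeAlgorithmSketch algorithm; // null means "not specified"

  GeographyTypeSketch(String crs, EdgeAlgorithmSketch algorithm) {
    this.crs = crs;
    this.algorithm = algorithm;
  }

  // Resolved getters: never return null, fall back to the defaults.
  String crs() {
    return crs != null ? crs : DEFAULT_CRS;
  }

  EdgeAlgorithmSketch algorithm() {
    return algorithm != null ? algorithm : DEFAULT_ALGORITHM;
  }

  // Compare resolved values, so an omitted parameter equals its explicit default.
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof GeographyTypeSketch)) {
      return false;
    }
    GeographyTypeSketch that = (GeographyTypeSketch) other;
    // Assumption: CRS strings are compared exactly (case-sensitive).
    return crs().equals(that.crs()) && algorithm() == that.algorithm();
  }

  // Hash the same resolved values that equals() compares.
  @Override
  public int hashCode() {
    return 31 * crs().hashCode() + algorithm().hashCode();
  }
}
```

With this shape, `new GeographyTypeSketch(null, null)` and `new GeographyTypeSketch("OGC:CRS84", EdgeAlgorithmSketch.SPHERICAL)` compare equal and hash identically, while an instance with `KARNEY` stays distinct. Comparing the raw fields instead would treat the first two as different types.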

This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only — the generic value read/write path (BaseParquetReaders / BaseParquetWriter) and the ParquetMetrics guard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive: no behavior changes for non-geo types.

This is 1/N for #16650

Test plan

  • TestParquetSchemaUtil#testGeospatialTypeRoundTrip round-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) through ParquetSchemaUtil.convert.
  • TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParameters reads hand-built MessageTypes with unset / explicit / explicit-default CRS and algorithm — covering files written by engines that omit defaults — and confirms each maps to the expected Iceberg type.
  • TestTypes#testGeospatialTypeDefaultNormalization checks equals() / hashCode() parity for the default-CRS and default-algorithm geography forms, confirms that algorithm() still reports SPHERICAL, and confirms that a non-default algorithm stays distinct; testGeospatialTypeToString is extended to cover the explicit-default rendering.
  • ./gradlew :iceberg-api:check :iceberg-parquet:check — clean (tests, checkstyle, revapi, spotless).
  • ./gradlew :iceberg-core:test — full core suite green; no regressions in TestSchemaParser / TestSingleValueParser / TestGeospatialTable or anywhere geography types are serialized.
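
The by-name algorithm mapping and the unset-parameter defaults that the first two tests exercise can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `IcebergAlgorithmSketch`, `ParquetAlgorithmSketch`, and `GeoAnnotationMappingSketch` are hypothetical names, and the case-insensitive lookup in `fromName` is an assumption about how such a parser might behave.

```java
// Hypothetical stand-in for the Iceberg-side enum; the real one lives in Iceberg.
enum IcebergAlgorithmSketch {
  SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY;

  // Assumption: lookup by name ignores case.
  static IcebergAlgorithmSketch fromName(String name) {
    for (IcebergAlgorithmSketch value : values()) {
      if (value.name().equalsIgnoreCase(name)) {
        return value;
      }
    }
    throw new IllegalArgumentException("Unknown edge algorithm: " + name);
  }
}

// Hypothetical stand-in for the Parquet-side enum, with the same constant names.
enum ParquetAlgorithmSketch { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

final class GeoAnnotationMappingSketch {
  static final String DEFAULT_CRS = "OGC:CRS84";

  private GeoAnnotationMappingSketch() {}

  // Write side: both enums share constant names, so map by name.
  static ParquetAlgorithmSketch toParquet(IcebergAlgorithmSketch algorithm) {
    return ParquetAlgorithmSketch.valueOf(algorithm.name());
  }

  // Read side: an unset (null) algorithm resolves to the SPHERICAL default.
  static IcebergAlgorithmSketch toIceberg(ParquetAlgorithmSketch algorithm) {
    return algorithm == null
        ? IcebergAlgorithmSketch.SPHERICAL
        : IcebergAlgorithmSketch.fromName(algorithm.name());
  }

  // Read side: an unset (null) CRS resolves to the OGC:CRS84 default.
  static String crsOrDefault(String crs) {
    return crs == null ? DEFAULT_CRS : crs;
  }
}
```

Looping over every constant and checking that `toIceberg(toParquet(a))` returns `a` mirrors the "all five" round-trip described above; passing `null` mirrors a file written by an engine that omits the default parameters.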

@huan233usc huan233usc marked this pull request as draft June 11, 2026 02:38
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from e0c3e18 to f724354 Compare June 11, 2026 05:26
@github-actions github-actions Bot added the API label Jun 11, 2026
@huan233usc huan233usc changed the title Parquet: Map geometry and geography to Parquet logical types API, Parquet: Map geometry and geography to Parquet logical types Jun 11, 2026
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f724354 to f58e12b Compare June 11, 2026 05:36
@huan233usc huan233usc marked this pull request as ready for review June 11, 2026 05:39
Map Iceberg geometry and geography to and from Parquet's geometry /
geography logical type annotations on a BINARY column, passing the
resolved CRS and edge algorithm through directly (Iceberg and Parquet
use the same algorithm names; the read side resolves names with
EdgeAlgorithm.fromName, the same conversion used when parsing
geography type strings, and maps unset annotation parameters to the
Iceberg defaults).

To make the plain geography type round-trip through writers that omit
default parameters (an unset Parquet crs / algorithm defaults to
OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default
algorithm as equal to an omitted one: equals, hashCode, and toString
use the resolved getters crs() / algorithm() instead of the raw
nullable fields, matching how the CRS already resolves through its
getter.

Schema mapping only; the value read/write path and metrics handling
are follow-ups.

Co-authored-by: Isaac
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f58e12b to 57036d7 Compare June 11, 2026 05:53
@huan233usc

Contributor Author

Hi @szehon-ho, could you help review this when you have a chance? Thank you very much!
