[GH-2646] Auto-populate covering metadata for GeoParquet 1.1.0 writes#2665
Conversation
Force-pushed from `66f1720` to `5f51568`.
Pull request overview

Adds an opt-in/opt-out mechanism for automatically generating GeoParquet 1.1.0 "covering" metadata (and corresponding `<geometry>_bbox` columns) during writes, reducing manual setup and making writes safer by default.

Changes:
- Introduces `geoparquet.covering.mode` (`auto` default, `legacy`) and validates it during writes.
- Implements auto-reuse or auto-generation of `<geometryColumnName>_bbox` columns plus covering metadata for GeoParquet 1.1.0 writes when no explicit covering options are provided.
- Updates Scala/R tests and docs to account for auto-added covering columns.
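The "covering" a `_bbox` column carries is just the geometry's envelope. A minimal Python sketch of that idea (illustrative only — not the Sedona implementation, which computes envelopes in Scala via the geometry's JTS envelope):

```python
# Illustrative sketch: derive a GeoParquet-style covering "bbox" struct
# (xmin/ymin/xmax/ymax) from a geometry's coordinates. Plain tuples stand in
# for a real geometry type here.

def envelope(coords):
    """Compute the bounding box of a list of (x, y) points."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return {"xmin": min(xs), "ymin": min(ys), "xmax": max(xs), "ymax": max(ys)}

# A segment from (0, 1) to (2, 5):
print(envelope([(0.0, 1.0), (2.0, 5.0)]))
# {'xmin': 0.0, 'ymin': 1.0, 'xmax': 2.0, 'ymax': 5.0}
```

Readers that understand the covering metadata can use these per-row boxes to prune row groups without deserializing the geometries themselves.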
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala | Adds covering mode parsing + auto-generation/reuse of *_bbox covering columns and metadata during writes. |
| spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala | Defines new constants for covering mode option and documents supported values. |
| spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala | Adds auto-covering tests and adjusts round-trip assertions to ignore auto-generated columns. |
| docs/tutorial/files/geoparquet-sedona-spark.md | Documents default auto-covering behavior and the geoparquet.covering.mode option. |
| R/tests/testthat/test-data-interface.R | Updates round-trip comparison to select only original columns (excluding auto-generated covering columns). |
```scala
it("GeoParquet auto populates covering metadata for single geometry column") {
  val df = sparkSession
    .range(0, 100)
    .toDF("id")
    .withColumn("id", expr("CAST(id AS DOUBLE)"))
    .withColumn("geometry", expr("ST_Point(id, id + 1)"))
    .withColumn(
      "geometry_bbox",
      expr("struct(id AS xmin, id + 1 AS ymin, id AS xmax, id + 1 AS ymax)"))
  val geoParquetSavePath =
```
The new behavior that gracefully handles an existing but invalid `<geometryColumnName>_bbox` struct (log a warning and skip) isn't covered by tests here. Adding a test that writes a DataFrame with a `geometry_bbox` column missing required fields, or with non-float/double field types, would help ensure the write does not fail and that covering metadata is not populated in that case.
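The validation the reviewer wants exercised can be sketched as a schema check. This is a hypothetical Python rendering of the rule (field names come from the GeoParquet covering spec; the dict-based schema representation is an assumption for illustration):

```python
# Hypothetical sketch of covering-struct validation: xmin/ymin/xmax/ymax are
# required, zmin/zmax are optionally allowed, and every field must be a
# float/double. Not Sedona's Scala code, just the rule it enforces.
REQUIRED = {"xmin", "ymin", "xmax", "ymax"}
OPTIONAL = {"zmin", "zmax"}
NUMERIC = {"float", "double"}

def is_valid_covering(fields):
    """fields: mapping of field name -> type name, e.g. {'xmin': 'double', ...}."""
    names = set(fields)
    if not REQUIRED <= names:                 # every required field present?
        return False
    if names - REQUIRED - OPTIONAL:           # no unexpected extra fields?
        return False
    return all(t in NUMERIC for t in fields.values())

print(is_valid_covering({"xmin": "double", "ymin": "double",
                         "xmax": "double", "ymax": "double"}))   # True
print(is_valid_covering({"xmin": "string", "ymin": "double",
                         "xmax": "double", "ymax": "double"}))   # False
print(is_valid_covering({"xmin": "double", "ymin": "double"}))   # False
```

A test along these lines, writing a DataFrame whose `geometry_bbox` fails each check, would confirm the write succeeds while covering metadata is omitted.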
```scala
case _: IllegalArgumentException =>
  logWarning(
    s"Existing column '$coveringColumnName' is not a valid covering struct " +
      s"(expected struct<xmin, ymin, xmax, ymax> with float/double fields). " +
```
The warning message here says the expected struct is exactly `struct<xmin, ymin, xmax, ymax>`, but `createCoveringColumnMetadata` also supports optional `zmin`/`zmax` fields. Consider adjusting the log message to reflect the actual accepted schema (required `xmin`/`ymin`/`xmax`/`ymax`, optional `zmin`/`zmax`) so users aren't misled when troubleshooting.
Suggested change:

```scala
s"(expected struct<xmin, ymin, xmax, ymax> with float/double fields; " +
  s"optional zmin/zmax fields are also supported). " +
```
```scala
val generatedCoveringFields = mutable.ArrayBuffer.empty[StructField]
val geometryColumns =
  geometryColumnInfoMap.keys.toSeq.map(ordinal => ordinal -> schema(ordinal).name)
```
`geometryColumnInfoMap` is a `mutable.Map` (hash-based), so iterating `geometryColumnInfoMap.keys.toSeq` can yield a non-deterministic order. That makes the appended auto-generated `<geom>_bbox` columns (and their ordinals) potentially vary across runs, which can hurt reproducibility and make downstream expectations flaky. Consider iterating geometry ordinals in schema order (e.g., `geometryColumnInfoMap.keys.toSeq.sorted`) when building `geometryColumns`.
Suggested change:

```scala
geometryColumnInfoMap.keys.toSeq.sorted.map(ordinal => ordinal -> schema(ordinal).name)
```
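The review point above is language-agnostic: iterating an unordered set of column ordinals gives no stable order, while sorting them first pins the appended `_bbox` columns to schema order. A tiny Python illustration (the schema and ordinals are made up):

```python
# Illustration of the determinism fix: sort geometry-column ordinals before
# pairing them with names, so auto-generated _bbox columns always land in
# schema order regardless of hash/set iteration order.
schema = ["id", "geom_a", "name", "geom_b"]
geometry_ordinals = {3, 1}  # set iteration order is unspecified

columns = [(i, schema[i]) for i in sorted(geometry_ordinals)]
print(columns)  # [(1, 'geom_a'), (3, 'geom_b')] — deterministic
```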
Did you read the Contributor Guide?

Is this PR related to a ticket?

Closes GeoParquet 1.1 covering metadata should auto populate #2646

What changes were proposed in this PR?
Add a `geoparquet.covering.mode` option to control automatic covering metadata generation when writing GeoParquet files.

Behavior

- In the default `auto` mode, GeoParquet 1.1.0 writes auto-generate `<geometryColumnName>_bbox` covering columns and write corresponding covering metadata. If the user has already provided explicit `geoparquet.covering` or `geoparquet.covering.<col>` options, those take precedence and auto-generation is skipped.

Changes
GeoParquetMetaData.scala
- Adds constants `GEOPARQUET_COVERING_MODE_KEY`, `GEOPARQUET_COVERING_MODE_AUTO`, and `GEOPARQUET_COVERING_MODE_LEGACY`.

GeoParquetWriteSupport.scala
- Parses `geoparquet.covering.mode` from the Hadoop configuration; throws `IllegalArgumentException` for invalid values.
- Adds `maybeAutoGenerateCoveringColumns()`: when auto mode is enabled and no explicit covering options are provided, for each geometry column it reuses an existing valid `_bbox` struct column or generates one from the geometry envelope.
- Excludes `geoparquet.covering.mode` from per-column covering option parsing.
- Gracefully handles an existing `_bbox` column with an invalid structure (logs a warning and skips it instead of crashing).

geoparquetIOTests.scala
- Adds a test that reuses an existing `geometry_bbox` column.
- Adds a test that auto-generates `geometry_bbox` when no covering column exists.
- Adjusts round-trip assertions to ignore the auto-generated `geometry_bbox`.

geoparquet-sedona-spark.md
- Documents the `geoparquet.covering.mode` option, the default behavior for GeoParquet `1.1.0` writes (since `v1.9.0`), and how to opt out.

How was this patch tested?
All 40 geoparquetIOTests pass.
Did this PR include necessary documentation updates?
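The precedence rule described under "Behavior" (explicit covering options win; auto-generation only runs in the default `auto` mode) can be sketched as follows. The option keys are the ones this PR adds; the resolution function itself is a hypothetical Python rendering, not the Scala implementation:

```python
# Assumed sketch of covering-option resolution for a GeoParquet write:
# explicit geoparquet.covering / geoparquet.covering.<col> options take
# precedence; otherwise the mode decides whether to auto-generate.
VALID_MODES = {"auto", "legacy"}
MODE_KEY = "geoparquet.covering.mode"

def covering_action(options):
    """Return 'explicit', 'auto-generate', or 'skip' for a set of write options."""
    mode = options.get(MODE_KEY, "auto")  # auto is the default
    if mode not in VALID_MODES:
        raise ValueError(f"Invalid {MODE_KEY}: {mode}")
    has_explicit = any(
        k == "geoparquet.covering"
        or (k.startswith("geoparquet.covering.") and k != MODE_KEY)
        for k in options)
    if has_explicit:
        return "explicit"                 # user-provided covering wins
    return "auto-generate" if mode == "auto" else "skip"

print(covering_action({}))                                      # auto-generate
print(covering_action({MODE_KEY: "legacy"}))                    # skip
print(covering_action({"geoparquet.covering.geom": "g_bbox"}))  # explicit
```

Opting out is then just a matter of setting the mode option to `legacy` on the writer.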