Support exports to Parquet format #1040
Draft
labianchin wants to merge 77 commits into master from
Conversation
Bumps [org.apache.maven.plugins:maven-enforcer-plugin](https://github.com/apache/maven-enforcer) from 3.5.0 to 3.6.2. - [Release notes](https://github.com/apache/maven-enforcer/releases) - [Commits](apache/maven-enforcer@enforcer-3.5.0...enforcer-3.6.2) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-enforcer-plugin dependency-version: 3.6.2 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.mockito:mockito-core](https://github.com/mockito/mockito) from 5.17.0 to 5.20.0. - [Release notes](https://github.com/mockito/mockito/releases) - [Commits](mockito/mockito@v5.17.0...v5.20.0) --- updated-dependencies: - dependency-name: org.mockito:mockito-core dependency-version: 5.20.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-compiler-plugin](https://github.com/apache/maven-compiler-plugin) from 3.13.0 to 3.14.1. - [Release notes](https://github.com/apache/maven-compiler-plugin/releases) - [Commits](apache/maven-compiler-plugin@maven-compiler-plugin-3.13.0...maven-compiler-plugin-3.14.1) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-compiler-plugin dependency-version: 3.14.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.11.2 to 3.12.0. - [Release notes](https://github.com/apache/maven-javadoc-plugin/releases) - [Commits](apache/maven-javadoc-plugin@maven-javadoc-plugin-3.11.2...maven-javadoc-plugin-3.12.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-javadoc-plugin dependency-version: 3.12.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.2.7 to 3.2.8. - [Release notes](https://github.com/apache/maven-gpg-plugin/releases) - [Commits](apache/maven-gpg-plugin@maven-gpg-plugin-3.2.7...maven-gpg-plugin-3.2.8) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-gpg-plugin dependency-version: 3.2.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.codehaus.mojo:build-helper-maven-plugin](https://github.com/mojohaus/build-helper-maven-plugin) from 3.6.0 to 3.6.1. - [Release notes](https://github.com/mojohaus/build-helper-maven-plugin/releases) - [Commits](mojohaus/build-helper-maven-plugin@3.6.0...3.6.1) --- updated-dependencies: - dependency-name: org.codehaus.mojo:build-helper-maven-plugin dependency-version: 3.6.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-project-info-reports-plugin](https://github.com/apache/maven-project-info-reports-plugin) from 3.8.0 to 3.9.0. - [Release notes](https://github.com/apache/maven-project-info-reports-plugin/releases) - [Commits](apache/maven-project-info-reports-plugin@maven-project-info-reports-plugin-3.8.0...maven-project-info-reports-plugin-3.9.0) --- updated-dependencies: - dependency-name: org.apache.maven.plugins:maven-project-info-reports-plugin dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Beam 2.65.0 reaches end-of-support in May 2026. Beam 2.72.0 is the latest stable release, supported until March 2027. Updated dependency versions to match Beam 2.72.0: - errorprone: 2.10.0 -> 2.31.0 - joda-time: 2.10.14 -> 2.14.0 - netty: 4.1.121.Final -> 4.1.124.Final - slf4j: 1.7.30 -> 2.0.16 - google-cloud-libraries-bom: 26.57.0 -> 26.76.0 Added dependency management overrides to resolve version convergence between Beam BOM and libraries-bom: - jackson-dataformat-xml 2.18.2 (from google-cloud-storage) - google-cloud-bigtable 2.73.1 and proto stubs (from beam-io-gcp) - j2objc-annotations 3.1 (from libraries-bom) - failureaccess 1.0.3 (from transitives)
Aligns with the version provided by google-cloud-libraries-bom 26.76.0, removing the separate jackson-dataformat-xml override.
Aligns with the version provided by google-cloud-libraries-bom 26.76.0.
1.25.0 is the highest version compatible with the current libraries-bom 26.76.0 without introducing dependency conflicts. Versions >= 1.25.1 require a newer google-api-client than what the BOM provides.
The previous URL for avro-tools 1.11.3 is no longer available on dlcdn.apache.org, causing CI failures.
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #1040 +/- ##
============================================
- Coverage 91.92% 85.77% -6.15%
- Complexity 283 406 +123
============================================
Files 27 39 +12
Lines 1015 1645 +630
Branches 86 154 +68
============================================
+ Hits 933 1411 +478
- Misses 54 164 +110
- Partials 28 70 +42
Also bump google-cloud libraries-bom from 26.76.0 to 26.79.0 (the version that Beam 2.73 expects) and remove the now-stale google-cloud-bigtable 2.73.1 overrides — libraries-bom 26.79.0 plus the Beam BOM converge cleanly without them. Co-Authored-By: Claude <noreply@anthropic.com>
Routine update for the BouncyCastle crypto provider; pulls in fixes across 1.79, 1.80, 1.81, 1.82, 1.83, and 1.84 releases (April 2026). Co-Authored-By: Claude <noreply@anthropic.com>
Update to the latest 10.x release (last line that supports JDK 8/11 toolchains; 11.x requires JDK 17 to run the plugin). Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
JDK 23 (JEP 472) disabled implicit annotation processing, so AutoValue silently stopped generating AutoValue_* classes on JDK 25. Declare the processor explicitly via maven-compiler-plugin's annotationProcessorPaths so javac runs it again, and bump auto-value 1.9 -> 1.11.0 for JDK 23+ compiler-internals support. Bump maven.compiler.release 8 -> 11 to drop the obsolete-source-8 warnings; matches the enforcer's existing [11,) floor. Co-Authored-By: Claude <noreply@anthropic.com>
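A sketch of the resulting compiler configuration; only the coordinates and versions come from the message above, the surrounding pom.xml layout is an assumption:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <release>11</release>
    <annotationProcessorPaths>
      <!-- javac no longer runs annotation processors implicitly on JDK 23+,
           so AutoValue must be declared explicitly here -->
      <path>
        <groupId>com.google.auto.value</groupId>
        <artifactId>auto-value</artifactId>
        <version>1.11.0</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>
```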
Install parquet-cli (1.14.4) in CI alongside avro-tools. Update runDBeamParquetDockerCon in e2e.sh to validate Parquet output using parquet-tools: display first 5 records and print the schema. Falls back gracefully if parquet-tools is not available.
Replace the binary string serialization of SQL arrays with the proper
Parquet 3-level LIST convention:
optional group column_name (LIST) {
  repeated group list {
    optional binary element (STRING);
  }
}
Array elements are written individually via java.sql.Array.getArray()
and serialized as STRING elements using toString(). This produces
Parquet files that Spark, Hive, and other analytical tools can read
as native array/list types instead of opaque binary blobs.
Null array elements are supported (optional element field is simply
omitted in the list group).
Two new tests: - shouldHandleNullArrayElements: verifies that null elements within an array produce LIST groups with 0 repetitions for the "element" field (Parquet's standard null representation for optional fields) - shouldHandleNullSqlArray: verifies that a completely null SQL array produces 0 repetitions for the entire LIST field
Array element types are now inferred from the JDBC column type name instead of always using STRING. For PostgreSQL, the type name prefix is stripped (e.g. _int4 -> INT32, _int8 -> INT64, _float4 -> FLOAT, _float8 -> DOUBLE, _bool -> BOOLEAN). For H2 and other databases, patterns like "INTEGER ARRAY" are parsed. Schema generation (JdbcParquetSchema): - buildArrayElementType() maps column type name to Parquet element type - resolveArrayElementTypeName() handles both PostgreSQL underscore prefix and H2-style "TYPE ARRAY" naming conventions Write support (JdbcParquetWriteSupport): - writeArrayElement() dispatches to the correct consumer.add*() method based on element type (addInteger, addLong, addFloat, addDouble, addBoolean, or addBinary for strings) This means integer[] columns produce LIST of INT32, bigint[] produces LIST of INT64, etc., which downstream tools (Spark, BigQuery, Hive) can read as native typed arrays.
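The type-name resolution described above can be sketched as follows; the class and method names here are hypothetical (the real logic lives in JdbcParquetSchema.resolveArrayElementTypeName / buildArrayElementType), and only the listed mappings come from the message:

```java
// Sketch: map a JDBC array column type name to a Parquet element type name.
// Handles both PostgreSQL's underscore prefix (_int4) and H2-style
// "INTEGER ARRAY" naming. Names and return values are illustrative only.
class ArrayElementTypes {
  static String resolveElementType(String columnTypeName) {
    String name = columnTypeName.trim().toUpperCase();
    // H2 and similar databases report "INTEGER ARRAY"; strip the suffix.
    if (name.endsWith(" ARRAY")) {
      name = name.substring(0, name.length() - " ARRAY".length());
    }
    // PostgreSQL reports array types with a leading underscore, e.g. _int4.
    switch (name) {
      case "_INT4": case "INTEGER": return "INT32";
      case "_INT8": case "BIGINT": return "INT64";
      case "_FLOAT4": case "REAL": return "FLOAT";
      case "_FLOAT8": case "DOUBLE PRECISION": return "DOUBLE";
      case "_BOOL": case "BOOLEAN": return "BOOLEAN";
      default: return "BINARY (STRING)"; // fall back to string elements
    }
  }
}
```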
Parquet files now include the Avro schema (avsc JSON) in the file footer under key `parquet.avro.schema` in keyValueMetaData. This is the standard convention used by parquet-avro and enables tools like Spark, Hive, and BigQuery to read the file with full Avro type information. The Avro schema is generated via JdbcAvroSchema from dbeam-core, ensuring the avsc in the footer matches what JdbcAvroJob would produce for the same table. It is also saved as _AVRO_SCHEMA.avsc alongside _PARQUET_SCHEMA.json in the output directory. The avroSchemaJson string is threaded through the serializable chain: JdbcParquetJob -> JdbcParquetIO.createWrite() -> Sink -> WriteOperation -> Writer -> ResultSetParquetWriterBuilder -> ResultSetWriteSupport.init() -> WriteContext extraMetadata. Tests verify: - Footer metadata contains parquet.avro.schema key with correct value - Footer metadata omits key when Avro schema is not provided - Job output directory includes _AVRO_SCHEMA.avsc file
After each Parquet e2e run, verify the file footer contains the parquet.avro.schema key using parquet-tools meta. Sanity check that the value contains a "type" field (valid Avro JSON). Fail the e2e if the key is missing.
The --arrayMode option (bytes, typed_first_row, typed_postgres) from JdbcExportPipelineOptions is now respected by the Parquet export: - bytes: serializes arrays as plain BINARY (raw bytes), matching the Avro bytes mode behavior - typed_first_row / typed_postgres: uses Parquet 3-level LIST type with typed elements (current default behavior) The arrayMode is threaded through: JdbcParquetJob -> JdbcParquetArgs -> JdbcParquetIO -> WriteSupport JdbcParquetJob -> BeamJdbcParquetSchema -> JdbcParquetSchema Schema generation and write support both check arrayMode to decide between BINARY and LIST representations for ARRAY columns.
…ering
The two metering classes were 100% identical except for the log prefix
("jdbcavroio" vs "jdbcparquetio"). Introduce JdbcMetering parameterized
by a format name string and update all callers.
JdbcAvroMetering and JdbcParquetMetering are retained as deprecated
thin subclasses for backwards compatibility.
…etJob The two benchmark classes were ~98% identical, differing only in which job class they instantiate. Introduce BenchJdbcJob parameterized by a JobRunner functional interface and a job name string. BenchJdbcAvroJob and BenchJdbcParquetJob are retained as thin wrappers that supply the format-specific job factory, preserving their main() entry points and test compatibility.
When useLogicalTypes is enabled, the Parquet schema declares UUID columns as FIXED_LEN_BYTE_ARRAY(16) with UUID logical type, but the writer was emitting a variable-length UTF-8 string. This mismatch would produce corrupt files or write failures. Now checks the field's logical type annotation: if it is UUID, writes the 16-byte big-endian binary representation (MSB first, LSB second). Without logical types, UUID is still written as a STRING.
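The 16-byte big-endian encoding described above (MSB first, LSB second, as required for FIXED_LEN_BYTE_ARRAY(16) with the UUID logical type) can be sketched with the standard library; the class name is illustrative:

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Sketch: encode a java.util.UUID as the 16-byte big-endian value Parquet's
// UUID logical type expects: most significant long first, least significant second.
class UuidBytes {
  static byte[] toBytes(UUID uuid) {
    ByteBuffer buf = ByteBuffer.allocate(16); // big-endian by default
    buf.putLong(uuid.getMostSignificantBits());
    buf.putLong(uuid.getLeastSignificantBits());
    return buf.array();
  }
}
```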
GregorianCalendar is not thread-safe. With --queryParallelism > 1, concurrent rs.getTimestamp(col, CALENDAR) calls could corrupt timestamp values. Use a ThreadLocal to give each thread its own Calendar instance.
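The per-thread Calendar fix can be sketched like this; the holder name and the UTC zone are assumptions about the actual code:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Sketch: each thread gets its own Calendar, so concurrent
// rs.getTimestamp(col, calendar) calls cannot corrupt each other's state.
class TimestampCalendars {
  private static final ThreadLocal<Calendar> CALENDAR =
      ThreadLocal.withInitial(() -> new GregorianCalendar(TimeZone.getTimeZone("UTC")));

  static Calendar get() {
    return CALENDAR.get(); // same instance within a thread, distinct across threads
  }
}
```

Call sites would then use `rs.getTimestamp(col, TimestampCalendars.get())` so each worker thread resolves timestamps with its own instance.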
The InputStream from FileSystems.open() and the Scanner were never closed, leaking a file handle. Use try-with-resources.
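The try-with-resources pattern can be sketched as follows, with a plain InputStream standing in for the channel returned by FileSystems.open(); the method name is illustrative:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Sketch: both the stream and the Scanner close automatically, even if
// reading throws, so no file handle is leaked.
class ReadWholeStream {
  static String readAll(InputStream in) {
    try (InputStream stream = in;
         Scanner scanner = new Scanner(stream, StandardCharsets.UTF_8.name())) {
      // "\\A" makes the scanner consume the entire input as one token
      return scanner.useDelimiter("\\A").hasNext() ? scanner.next() : "";
    }
  }
}
```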
Optional.of(null) throws NPE if the metric key (e.g. KbWritePerSec) is absent from the map. Use Optional.ofNullable() instead.
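The difference can be sketched with a plain map lookup; the metric key mirrors the message above, the class and method names are illustrative:

```java
import java.util.Map;
import java.util.Optional;

// Sketch: Optional.of(map.get(key)) throws NullPointerException when the key
// is absent; Optional.ofNullable yields an empty Optional instead.
class MetricLookup {
  static Optional<Long> lookup(Map<String, Long> metrics, String key) {
    return Optional.ofNullable(metrics.get(key)); // empty when the key is absent
  }
}
```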
The API default (JdbcParquetArgs.create()) was 64MB while the CLI default (ParquetPipelineOptions) was 128MB. Standardize on 128MB which matches Parquet's own default and the CLI behavior.
prepareExport() was issuing two separate queries to generate the Parquet MessageType and Avro Schema. Now executes a single sqlQueryWithLimitOne and passes the ResultSet to both JdbcParquetSchema.createParquetSchema() and JdbcAvroSchema.createAvroSchema(). When a --parquetSchemaFilePath is provided, only the Avro schema still needs a DB query since the Parquet schema comes from the file.
The method replaces non-alphanumeric characters with underscores and is not Avro-specific. The old name was misleading in a Parquet module.
These SQL types fall through to the default STRING representation. Document this explicitly so users are not surprised by the lack of fixed-point DECIMAL Parquet type.
Replace platform-specific stat flags (macOS -f%z vs Linux -c%s) with wc -c which works on both platforms without fallback logic.
Parquet's withRowGroupSize() accepts long, allowing row groups larger than 2GB for wide tables. The int type would overflow for such values.
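The overflow the long type avoids is easy to demonstrate with the standard library; the helper name is illustrative:

```java
// Sketch: a 3 GiB row group size fits in a long but wraps to a negative
// value when narrowed to int, which is why withRowGroupSize(long) matters.
class RowGroupSize {
  static long bytes(long gib) {
    return gib * 1024 * 1024 * 1024;
  }
}
```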
Both classes independently applied the same regex to normalize column names. If one changed without the other, schema and data would mismatch. Now WriteSupport calls JdbcParquetSchema.normalizeFieldName.
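The shared normalization can be sketched as below; the exact regex is an assumption based on the "non-alphanumeric characters to underscores" description, and the real method is JdbcParquetSchema.normalizeFieldName:

```java
// Sketch: one normalization used by both schema generation and WriteSupport,
// so column names can never diverge between schema and data.
class FieldNames {
  static String normalizeFieldName(String columnName) {
    // Assumed rule: keep letters, digits, and underscores; replace the rest.
    return columnName.replaceAll("[^a-zA-Z0-9_]", "_");
  }
}
```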
If getAvroCodec() returns null, the method would NPE on avroCodec.equals(). Default to snappy in that case.
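A minimal sketch of the null-safe default; the method name is illustrative and "snappy" is the fallback named in the message:

```java
// Sketch: avoid NPE on avroCodec.equals(...) by defaulting a null codec
// to "snappy" before any comparison.
class CodecDefault {
  static String codecOrDefault(String avroCodec) {
    return avroCodec == null ? "snappy" : avroCodec;
  }
}
```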
When --parquetSchemaFilePath was provided, the Avro schema generation time was not tracked in Beam metrics. Now exposeSchemaMetrics is called on both code paths.
SQL DATE columns are stored as epoch millis (not Parquet DATE) for consistency with the Avro export path. Document this in both the schema builder and writer to prevent future confusion.
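The convention can be sketched in one line; the class name is illustrative:

```java
import java.sql.Date;

// Sketch: a SQL DATE is written as epoch milliseconds (an INT64), matching
// the Avro export path, rather than Parquet's DATE type (days since epoch).
class DateAsMillis {
  static long toEpochMillis(Date sqlDate) {
    return sqlDate.getTime();
  }
}
```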
JdbcParquetJob.prepareExport queries the DB once for both Parquet and Avro schemas and calls exposeSchemaMetrics directly, leaving createSchema and generateParquetSchema referenced only from their own test. Drop the dead helpers and the corresponding test case.
Scanner with a stringly-named "UTF-8" charset and an "\\A" delimiter trick is opaque. Switch to CharStreams.toString with an InputStreamReader and StandardCharsets.UTF_8 for clearer intent and a typed charset.
Checklist for PR author(s)
mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header