
Support exports to Parquet format#1040

Draft
labianchin wants to merge 77 commits into master from parquetQ

Conversation

@labianchin
Collaborator

  • Fixes: Support for parquet format #143
  • Introduces a new dbeam-parquet module and related jobs that export in Parquet format
  • Still needs a more detailed review and adjustments.

Checklist for PR author(s)

  • Changes are covered by unit tests (no major decrease in code coverage %) and/or integration tests.
  • Ensure code formatting (use mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header)
  • Document any relevant additions/changes in the appropriate spot in javadocs/docs/README.

dependabot Bot and others added 18 commits April 7, 2026 16:31
Bumps [org.apache.maven.plugins:maven-enforcer-plugin](https://github.com/apache/maven-enforcer) from 3.5.0 to 3.6.2.
- [Release notes](https://github.com/apache/maven-enforcer/releases)
- [Commits](apache/maven-enforcer@enforcer-3.5.0...enforcer-3.6.2)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-enforcer-plugin
  dependency-version: 3.6.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.mockito:mockito-core](https://github.com/mockito/mockito) from 5.17.0 to 5.20.0.
- [Release notes](https://github.com/mockito/mockito/releases)
- [Commits](mockito/mockito@v5.17.0...v5.20.0)

---
updated-dependencies:
- dependency-name: org.mockito:mockito-core
  dependency-version: 5.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-compiler-plugin](https://github.com/apache/maven-compiler-plugin) from 3.13.0 to 3.14.1.
- [Release notes](https://github.com/apache/maven-compiler-plugin/releases)
- [Commits](apache/maven-compiler-plugin@maven-compiler-plugin-3.13.0...maven-compiler-plugin-3.14.1)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-compiler-plugin
  dependency-version: 3.14.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.11.2 to 3.12.0.
- [Release notes](https://github.com/apache/maven-javadoc-plugin/releases)
- [Commits](apache/maven-javadoc-plugin@maven-javadoc-plugin-3.11.2...maven-javadoc-plugin-3.12.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-javadoc-plugin
  dependency-version: 3.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.2.7 to 3.2.8.
- [Release notes](https://github.com/apache/maven-gpg-plugin/releases)
- [Commits](apache/maven-gpg-plugin@maven-gpg-plugin-3.2.7...maven-gpg-plugin-3.2.8)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-gpg-plugin
  dependency-version: 3.2.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.codehaus.mojo:build-helper-maven-plugin](https://github.com/mojohaus/build-helper-maven-plugin) from 3.6.0 to 3.6.1.
- [Release notes](https://github.com/mojohaus/build-helper-maven-plugin/releases)
- [Commits](mojohaus/build-helper-maven-plugin@3.6.0...3.6.1)

---
updated-dependencies:
- dependency-name: org.codehaus.mojo:build-helper-maven-plugin
  dependency-version: 3.6.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-project-info-reports-plugin](https://github.com/apache/maven-project-info-reports-plugin) from 3.8.0 to 3.9.0.
- [Release notes](https://github.com/apache/maven-project-info-reports-plugin/releases)
- [Commits](apache/maven-project-info-reports-plugin@maven-project-info-reports-plugin-3.8.0...maven-project-info-reports-plugin-3.9.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-project-info-reports-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Beam 2.65.0 reaches end-of-support in May 2026. Beam 2.72.0 is the
latest stable release, supported until March 2027.

Updated dependency versions to match Beam 2.72.0:
- errorprone: 2.10.0 -> 2.31.0
- joda-time: 2.10.14 -> 2.14.0
- netty: 4.1.121.Final -> 4.1.124.Final
- slf4j: 1.7.30 -> 2.0.16
- google-cloud-libraries-bom: 26.57.0 -> 26.76.0

Added dependency management overrides to resolve version convergence
between Beam BOM and libraries-bom:
- jackson-dataformat-xml 2.18.2 (from google-cloud-storage)
- google-cloud-bigtable 2.73.1 and proto stubs (from beam-io-gcp)
- j2objc-annotations 3.1 (from libraries-bom)
- failureaccess 1.0.3 (from transitives)
Aligns with the version provided by google-cloud-libraries-bom 26.76.0,
removing the separate jackson-dataformat-xml override.
Aligns with the version provided by google-cloud-libraries-bom 26.76.0.
1.25.0 is the highest version compatible with the current
libraries-bom 26.76.0 without introducing dependency conflicts.
Versions >= 1.25.1 require a newer google-api-client than what
the BOM provides.
The previous URL for avro-tools 1.11.3 is no longer available on
dlcdn.apache.org, causing CI failures.
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 79.78581% with 151 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.77%. Comparing base (05add13) to head (5dad9ac).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1040      +/-   ##
============================================
- Coverage     91.92%   85.77%   -6.15%     
- Complexity      283      406     +123     
============================================
  Files            27       39      +12     
  Lines          1015     1645     +630     
  Branches         86      154      +68     
============================================
+ Hits            933     1411     +478     
- Misses           54      164     +110     
- Partials         28       70      +42     

labianchin and others added 8 commits May 4, 2026 12:00
Also bump google-cloud libraries-bom from 26.76.0 to 26.79.0 (the
version that Beam 2.73 expects) and remove the now-stale
google-cloud-bigtable 2.73.1 overrides — libraries-bom 26.79.0 plus
the Beam BOM converge cleanly without them.

Co-Authored-By: Claude <noreply@anthropic.com>
Routine update for the BouncyCastle crypto provider; pulls in fixes
across 1.79, 1.80, 1.81, 1.82, 1.83, and 1.84 releases (April 2026).

Co-Authored-By: Claude <noreply@anthropic.com>
Update to the latest 10.x release (last line that supports JDK 8/11
toolchains; 11.x requires JDK 17 to run the plugin).

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
JDK 23 (JEP 472) disabled implicit annotation processing, so AutoValue
silently stopped generating AutoValue_* classes on JDK 25. Declare the
processor explicitly via maven-compiler-plugin's annotationProcessorPaths
so javac runs it again, and bump auto-value 1.9 -> 1.11.0 for JDK 23+
compiler-internals support. Bump maven.compiler.release 8 -> 11 to drop
the obsolete-source-8 warnings; matches the enforcer's existing [11,)
floor.

Co-Authored-By: Claude <noreply@anthropic.com>
labianchin added 29 commits May 7, 2026 13:12
Install parquet-cli (1.14.4) in CI alongside avro-tools. Update
runDBeamParquetDockerCon in e2e.sh to validate Parquet output using
parquet-tools: display first 5 records and print the schema. Falls
back gracefully if parquet-tools is not available.
Replace the binary string serialization of SQL arrays with proper
Parquet 3-level LIST convention:

  optional group column_name (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }

Array elements are written individually via java.sql.Array.getArray()
and serialized as STRING elements using toString(). This produces
Parquet files that Spark, Hive, and other analytical tools can read
as native array/list types instead of opaque binary blobs.

Null array elements are supported (optional element field is simply
omitted in the list group).
Two new tests:
- shouldHandleNullArrayElements: verifies that null elements within
  an array produce LIST groups with 0 repetitions for the "element"
  field (Parquet's standard null representation for optional fields)
- shouldHandleNullSqlArray: verifies that a completely null SQL array
  produces 0 repetitions for the entire LIST field
Array element types are now inferred from the JDBC column type name
instead of always using STRING. For PostgreSQL, the type name prefix
is stripped (e.g. _int4 -> INT32, _int8 -> INT64, _float4 -> FLOAT,
_float8 -> DOUBLE, _bool -> BOOLEAN). For H2 and other databases,
patterns like "INTEGER ARRAY" are parsed.

Schema generation (JdbcParquetSchema):
- buildArrayElementType() maps column type name to Parquet element type
- resolveArrayElementTypeName() handles both PostgreSQL underscore
  prefix and H2-style "TYPE ARRAY" naming conventions

Write support (JdbcParquetWriteSupport):
- writeArrayElement() dispatches to the correct consumer.add*() method
  based on element type (addInteger, addLong, addFloat, addDouble,
  addBoolean, or addBinary for strings)

This means integer[] columns produce LIST of INT32, bigint[] produces
LIST of INT64, etc., which downstream tools (Spark, BigQuery, Hive)
can read as native typed arrays.
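The type-name handling above can be sketched as plain string logic. This is a hypothetical illustration mirroring what the commit describes, not the actual `JdbcParquetSchema` code; the class and method names here are invented for the example.

```java
public final class ArrayElementTypes {

  /**
   * Strips the array markers from a JDBC array column type name:
   * PostgreSQL prefixes the element type with "_" (e.g. "_int4"), while
   * H2 and others report names like "INTEGER ARRAY".
   */
  static String resolveElementTypeName(String columnTypeName) {
    String name = columnTypeName.trim();
    if (name.startsWith("_")) { // PostgreSQL: _int4, _int8, _float8, _bool, ...
      return name.substring(1).toLowerCase();
    }
    if (name.toUpperCase().endsWith(" ARRAY")) { // H2: "INTEGER ARRAY"
      return name.substring(0, name.length() - " ARRAY".length()).trim().toLowerCase();
    }
    return name.toLowerCase();
  }

  /** Maps an element type name to the Parquet primitive named in the commit message. */
  static String toParquetPrimitive(String elementTypeName) {
    switch (elementTypeName) {
      case "int4": case "integer": return "INT32";
      case "int8": case "bigint": return "INT64";
      case "float4": case "real": return "FLOAT";
      case "float8": case "double precision": return "DOUBLE";
      case "bool": case "boolean": return "BOOLEAN";
      default: return "BINARY (STRING)"; // fallback: string elements
    }
  }
}
```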
Parquet files now include the Avro schema (avsc JSON) in the file
footer under key `parquet.avro.schema` in keyValueMetaData. This is
the standard convention used by parquet-avro and enables tools like
Spark, Hive, and BigQuery to read the file with full Avro type
information.

The Avro schema is generated via JdbcAvroSchema from dbeam-core,
ensuring the avsc in the footer matches what JdbcAvroJob would
produce for the same table. It is also saved as _AVRO_SCHEMA.avsc
alongside _PARQUET_SCHEMA.json in the output directory.

The avroSchemaJson string is threaded through the serializable chain:
JdbcParquetJob -> JdbcParquetIO.createWrite() -> Sink ->
WriteOperation -> Writer -> ResultSetParquetWriterBuilder ->
ResultSetWriteSupport.init() -> WriteContext extraMetadata.

Tests verify:
- Footer metadata contains parquet.avro.schema key with correct value
- Footer metadata omits key when Avro schema is not provided
- Job output directory includes _AVRO_SCHEMA.avsc file
After each Parquet e2e run, verify the file footer contains the
parquet.avro.schema key using parquet-tools meta. Sanity check that
the value contains a "type" field (valid Avro JSON). Fail the e2e
if the key is missing.
The --arrayMode option (bytes, typed_first_row, typed_postgres) from
JdbcExportPipelineOptions is now respected by the Parquet export:

- bytes: serializes arrays as plain BINARY (raw bytes), matching
  the Avro bytes mode behavior
- typed_first_row / typed_postgres: uses Parquet 3-level LIST type
  with typed elements (current default behavior)

The arrayMode is threaded through:
  JdbcParquetJob -> JdbcParquetArgs -> JdbcParquetIO -> WriteSupport
  JdbcParquetJob -> BeamJdbcParquetSchema -> JdbcParquetSchema

Schema generation and write support both check arrayMode to decide
between BINARY and LIST representations for ARRAY columns.
…ering

The two metering classes were 100% identical except for the log prefix
("jdbcavroio" vs "jdbcparquetio"). Introduce JdbcMetering parameterized
by a format name string and update all callers.

JdbcAvroMetering and JdbcParquetMetering are retained as deprecated
thin subclasses for backwards compatibility.
…etJob

The two benchmark classes were ~98% identical, differing only in which
job class they instantiate. Introduce BenchJdbcJob parameterized by a
JobRunner functional interface and a job name string.

BenchJdbcAvroJob and BenchJdbcParquetJob are retained as thin wrappers
that supply the format-specific job factory, preserving their main()
entry points and test compatibility.
When useLogicalTypes is enabled, the Parquet schema declares UUID
columns as FIXED_LEN_BYTE_ARRAY(16) with UUID logical type, but the
writer was emitting a variable-length UTF-8 string. This mismatch
would produce corrupt files or write failures.

Now checks the field's logical type annotation: if it is UUID, writes
the 16-byte big-endian binary representation (MSB first, LSB second).
Without logical types, UUID is still written as a STRING.
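The MSB-first/LSB-second layout can be shown in a few lines. This is an illustrative sketch, not the project's actual writer code: it turns a `java.util.UUID` into the 16-byte big-endian value that a `FIXED_LEN_BYTE_ARRAY(16)` column with the UUID logical type expects.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public final class UuidBytes {
  /** Big-endian 16-byte encoding: most-significant long first, then least-significant. */
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())  // bytes 0-7
        .putLong(uuid.getLeastSignificantBits()) // bytes 8-15
        .array();
  }
}
```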
GregorianCalendar is not thread-safe. With --queryParallelism > 1,
concurrent rs.getTimestamp(col, CALENDAR) calls could corrupt
timestamp values. Use a ThreadLocal to give each thread its own
Calendar instance.
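A minimal sketch of that fix, assuming a UTC calendar (the field and class names here are hypothetical): each thread lazily gets its own `GregorianCalendar` instead of sharing one static instance.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public final class CalendarPerThread {
  // One Calendar per thread; GregorianCalendar is not thread-safe.
  static final ThreadLocal<Calendar> UTC_CALENDAR =
      ThreadLocal.withInitial(() -> new GregorianCalendar(TimeZone.getTimeZone("UTC")));

  // Usage at the JDBC call site:
  //   rs.getTimestamp(column, UTC_CALENDAR.get());
}
```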
The InputStream from FileSystems.open() and the Scanner were never
closed, leaking a file handle. Use try-with-resources.
Optional.of(null) throws NPE if the metric key (e.g. KbWritePerSec)
is absent from the map. Use Optional.ofNullable() instead.
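The difference is easy to demonstrate with a hypothetical lookup helper: `Optional.of` rejects null, while `Optional.ofNullable` turns a missing map entry into an empty `Optional`.

```java
import java.util.Map;
import java.util.Optional;

public final class MetricLookup {
  /** Empty Optional when the key is absent, instead of an NPE from Optional.of(null). */
  static Optional<Long> metric(Map<String, Long> metrics, String key) {
    return Optional.ofNullable(metrics.get(key));
  }
}
```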
The API default (JdbcParquetArgs.create()) was 64MB while the CLI
default (ParquetPipelineOptions) was 128MB. Standardize on 128MB
which matches Parquet's own default and the CLI behavior.
prepareExport() was issuing two separate queries to generate the
Parquet MessageType and Avro Schema. Now executes a single
sqlQueryWithLimitOne and passes the ResultSet to both
JdbcParquetSchema.createParquetSchema() and
JdbcAvroSchema.createAvroSchema().

When a --parquetSchemaFilePath is provided, only the Avro schema
still needs a DB query since the Parquet schema comes from the file.
The method replaces non-alphanumeric characters with underscores and
is not Avro-specific. The old name was misleading in a Parquet module.
These SQL types fall through to the default STRING representation.
Document this explicitly so users are not surprised by the lack of
fixed-point DECIMAL Parquet type.
Replace platform-specific stat flags (macOS -f%z vs Linux -c%s) with
wc -c which works on both platforms without fallback logic.
Parquet's withRowGroupSize() accepts long, allowing row groups larger
than 2GB for wide tables. The int type would overflow for such values.
Both classes independently applied the same regex to normalize column
names. If one changed without the other, schema and data would
mismatch. Now WriteSupport calls JdbcParquetSchema.normalizeFieldName.
If getAvroCodec() returns null, the method would NPE on
avroCodec.equals(). Default to snappy in that case.
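A sketch of that null-safe defaulting; the "snappy" default comes from the commit message, while the helper name is hypothetical.

```java
public final class CodecDefaults {
  /** Equivalent to java.util.Objects.requireNonNullElse(avroCodec, "snappy"). */
  static String codecOrDefault(String avroCodec) {
    return avroCodec != null ? avroCodec : "snappy";
  }
}
```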
When --parquetSchemaFilePath was provided, the Avro schema generation
time was not tracked in Beam metrics. Now exposeSchemaMetrics is
called on both code paths.
SQL DATE columns are stored as epoch millis (not Parquet DATE) for
consistency with the Avro export path. Document this in both the
schema builder and writer to prevent future confusion.
JdbcParquetJob.prepareExport queries the DB once for both Parquet and
Avro schemas and calls exposeSchemaMetrics directly, leaving
createSchema and generateParquetSchema referenced only from their own
test. Drop the dead helpers and the corresponding test case.
Scanner with a stringly-named "UTF-8" charset and an "\\A" delimiter
trick is opaque. Switch to CharStreams.toString with an
InputStreamReader and StandardCharsets.UTF_8 for clearer intent and a
typed charset.
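The commit adopts Guava's `CharStreams.toString(new InputStreamReader(in, StandardCharsets.UTF_8))`; a dependency-free JDK sketch of the same idea (hypothetical class name) looks like this, with try-with-resources also covering the file-handle leak fixed earlier:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public final class StreamText {
  /** Reads the whole stream as UTF-8 text, closing it when done. */
  static String readUtf8(InputStream in) {
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[8192];
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```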


Development

Successfully merging this pull request may close these issues.

Support for parquet format
