
Support exports to Parquet format#1040

Draft
labianchin wants to merge 77 commits into master from parquetQ

Conversation

@labianchin
Collaborator

  • Fixes: Support for parquet format #143
  • Introduces a new dbeam-parquet module and related jobs that export in Parquet format
  • Still needs a more detailed review and adjustments.

Checklist for PR author(s)

  • Changes are covered by unit tests (no major decrease in code coverage %) and/or integration tests.
  • Ensure code formatting (use mvn com.coveo:fmt-maven-plugin:format org.codehaus.mojo:license-maven-plugin:update-file-header)
  • Document any relevant additions/changes in the appropriate spot in javadocs/docs/README.

dependabot Bot and others added 18 commits April 7, 2026 16:31
Bumps [org.apache.maven.plugins:maven-enforcer-plugin](https://github.com/apache/maven-enforcer) from 3.5.0 to 3.6.2.
- [Release notes](https://github.com/apache/maven-enforcer/releases)
- [Commits](apache/maven-enforcer@enforcer-3.5.0...enforcer-3.6.2)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-enforcer-plugin
  dependency-version: 3.6.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.mockito:mockito-core](https://github.com/mockito/mockito) from 5.17.0 to 5.20.0.
- [Release notes](https://github.com/mockito/mockito/releases)
- [Commits](mockito/mockito@v5.17.0...v5.20.0)

---
updated-dependencies:
- dependency-name: org.mockito:mockito-core
  dependency-version: 5.20.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-compiler-plugin](https://github.com/apache/maven-compiler-plugin) from 3.13.0 to 3.14.1.
- [Release notes](https://github.com/apache/maven-compiler-plugin/releases)
- [Commits](apache/maven-compiler-plugin@maven-compiler-plugin-3.13.0...maven-compiler-plugin-3.14.1)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-compiler-plugin
  dependency-version: 3.14.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-javadoc-plugin](https://github.com/apache/maven-javadoc-plugin) from 3.11.2 to 3.12.0.
- [Release notes](https://github.com/apache/maven-javadoc-plugin/releases)
- [Commits](apache/maven-javadoc-plugin@maven-javadoc-plugin-3.11.2...maven-javadoc-plugin-3.12.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-javadoc-plugin
  dependency-version: 3.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-gpg-plugin](https://github.com/apache/maven-gpg-plugin) from 3.2.7 to 3.2.8.
- [Release notes](https://github.com/apache/maven-gpg-plugin/releases)
- [Commits](apache/maven-gpg-plugin@maven-gpg-plugin-3.2.7...maven-gpg-plugin-3.2.8)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-gpg-plugin
  dependency-version: 3.2.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.codehaus.mojo:build-helper-maven-plugin](https://github.com/mojohaus/build-helper-maven-plugin) from 3.6.0 to 3.6.1.
- [Release notes](https://github.com/mojohaus/build-helper-maven-plugin/releases)
- [Commits](mojohaus/build-helper-maven-plugin@3.6.0...3.6.1)

---
updated-dependencies:
- dependency-name: org.codehaus.mojo:build-helper-maven-plugin
  dependency-version: 3.6.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [org.apache.maven.plugins:maven-project-info-reports-plugin](https://github.com/apache/maven-project-info-reports-plugin) from 3.8.0 to 3.9.0.
- [Release notes](https://github.com/apache/maven-project-info-reports-plugin/releases)
- [Commits](apache/maven-project-info-reports-plugin@maven-project-info-reports-plugin-3.8.0...maven-project-info-reports-plugin-3.9.0)

---
updated-dependencies:
- dependency-name: org.apache.maven.plugins:maven-project-info-reports-plugin
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Beam 2.65.0 reaches end-of-support in May 2026. Beam 2.72.0 is the
latest stable release, supported until March 2027.

Updated dependency versions to match Beam 2.72.0:
- errorprone: 2.10.0 -> 2.31.0
- joda-time: 2.10.14 -> 2.14.0
- netty: 4.1.121.Final -> 4.1.124.Final
- slf4j: 1.7.30 -> 2.0.16
- google-cloud-libraries-bom: 26.57.0 -> 26.76.0

Added dependency management overrides to resolve version convergence
between Beam BOM and libraries-bom:
- jackson-dataformat-xml 2.18.2 (from google-cloud-storage)
- google-cloud-bigtable 2.73.1 and proto stubs (from beam-io-gcp)
- j2objc-annotations 3.1 (from libraries-bom)
- failureaccess 1.0.3 (from transitives)
Aligns with the version provided by google-cloud-libraries-bom 26.76.0,
removing the separate jackson-dataformat-xml override.
Aligns with the version provided by google-cloud-libraries-bom 26.76.0.
1.25.0 is the highest version compatible with the current
libraries-bom 26.76.0 without introducing dependency conflicts.
Versions >= 1.25.1 require a newer google-api-client than what
the BOM provides.
The previous URL for avro-tools 1.11.3 is no longer available on
dlcdn.apache.org, causing CI failures.
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 79.78581% with 151 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.77%. Comparing base (05add13) to head (5dad9ac).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1040      +/-   ##
============================================
- Coverage     91.92%   85.77%   -6.15%     
- Complexity      283      406     +123     
============================================
  Files            27       39      +12     
  Lines          1015     1645     +630     
  Branches         86      154      +68     
============================================
+ Hits            933     1411     +478     
- Misses           54      164     +110     
- Partials         28       70      +42     

labianchin and others added 8 commits May 4, 2026 12:00
Also bump google-cloud libraries-bom from 26.76.0 to 26.79.0 (the
version that Beam 2.73 expects) and remove the now-stale
google-cloud-bigtable 2.73.1 overrides — libraries-bom 26.79.0 plus
the Beam BOM converge cleanly without them.

Co-Authored-By: Claude <noreply@anthropic.com>
Routine update for the BouncyCastle crypto provider; pulls in fixes
across 1.79, 1.80, 1.81, 1.82, 1.83, and 1.84 releases (April 2026).

Co-Authored-By: Claude <noreply@anthropic.com>
Update to the latest 10.x release (last line that supports JDK 8/11
toolchains; 11.x requires JDK 17 to run the plugin).

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
JDK 23 (JEP 472) disabled implicit annotation processing, so AutoValue
silently stopped generating AutoValue_* classes on JDK 25. Declare the
processor explicitly via maven-compiler-plugin's annotationProcessorPaths
so javac runs it again, and bump auto-value 1.9 -> 1.11.0 for JDK 23+
compiler-internals support. Bump maven.compiler.release 8 -> 11 to drop
the obsolete-source-8 warnings; matches the enforcer's existing [11,)
floor.

Co-Authored-By: Claude <noreply@anthropic.com>
labianchin added 29 commits May 7, 2026 13:12
Install parquet-cli (1.14.4) in CI alongside avro-tools. Update
runDBeamParquetDockerCon in e2e.sh to validate Parquet output using
parquet-tools: display first 5 records and print the schema. Falls
back gracefully if parquet-tools is not available.
Replace the binary string serialization of SQL arrays with proper
Parquet 3-level LIST convention:

  optional group column_name (LIST) {
    repeated group list {
      optional binary element (STRING);
    }
  }

Array elements are written individually via java.sql.Array.getArray()
and serialized as STRING elements using toString(). This produces
Parquet files that Spark, Hive, and other analytical tools can read
as native array/list types instead of opaque binary blobs.

Null array elements are supported (optional element field is simply
omitted in the list group).
Two new tests:
- shouldHandleNullArrayElements: verifies that null elements within
  an array produce LIST groups with 0 repetitions for the "element"
  field (Parquet's standard null representation for optional fields)
- shouldHandleNullSqlArray: verifies that a completely null SQL array
  produces 0 repetitions for the entire LIST field
Array element types are now inferred from the JDBC column type name
instead of always using STRING. For PostgreSQL, the type name prefix
is stripped (e.g. _int4 -> INT32, _int8 -> INT64, _float4 -> FLOAT,
_float8 -> DOUBLE, _bool -> BOOLEAN). For H2 and other databases,
patterns like "INTEGER ARRAY" are parsed.

Schema generation (JdbcParquetSchema):
- buildArrayElementType() maps column type name to Parquet element type
- resolveArrayElementTypeName() handles both PostgreSQL underscore
  prefix and H2-style "TYPE ARRAY" naming conventions

Write support (JdbcParquetWriteSupport):
- writeArrayElement() dispatches to the correct consumer.add*() method
  based on element type (addInteger, addLong, addFloat, addDouble,
  addBoolean, or addBinary for strings)

This means integer[] columns produce LIST of INT32, bigint[] produces
LIST of INT64, etc., which downstream tools (Spark, BigQuery, Hive)
can read as native typed arrays.
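The type-name handling above can be sketched as plain string logic. This is a hypothetical illustration mirroring what the commit describes, not the actual `JdbcParquetSchema` code; the class and method names here are invented for the example.

```java
public final class ArrayElementTypes {

  /**
   * Strips the array markers from a JDBC array column type name:
   * PostgreSQL prefixes the element type with "_" (e.g. "_int4"), while
   * H2 and others report names like "INTEGER ARRAY".
   */
  static String resolveElementTypeName(String columnTypeName) {
    String name = columnTypeName.trim();
    if (name.startsWith("_")) { // PostgreSQL: _int4, _int8, _float8, _bool, ...
      return name.substring(1).toLowerCase();
    }
    if (name.toUpperCase().endsWith(" ARRAY")) { // H2: "INTEGER ARRAY"
      return name.substring(0, name.length() - " ARRAY".length()).trim().toLowerCase();
    }
    return name.toLowerCase();
  }

  /** Maps an element type name to the Parquet primitive named in the commit message. */
  static String toParquetPrimitive(String elementTypeName) {
    switch (elementTypeName) {
      case "int4": case "integer": return "INT32";
      case "int8": case "bigint": return "INT64";
      case "float4": case "real": return "FLOAT";
      case "float8": case "double precision": return "DOUBLE";
      case "bool": case "boolean": return "BOOLEAN";
      default: return "BINARY (STRING)"; // fallback: string elements
    }
  }
}
```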
Parquet files now include the Avro schema (avsc JSON) in the file
footer under key `parquet.avro.schema` in keyValueMetaData. This is
the standard convention used by parquet-avro and enables tools like
Spark, Hive, and BigQuery to read the file with full Avro type
information.

The Avro schema is generated via JdbcAvroSchema from dbeam-core,
ensuring the avsc in the footer matches what JdbcAvroJob would
produce for the same table. It is also saved as _AVRO_SCHEMA.avsc
alongside _PARQUET_SCHEMA.json in the output directory.

The avroSchemaJson string is threaded through the serializable chain:
JdbcParquetJob -> JdbcParquetIO.createWrite() -> Sink ->
WriteOperation -> Writer -> ResultSetParquetWriterBuilder ->
ResultSetWriteSupport.init() -> WriteContext extraMetadata.

Tests verify:
- Footer metadata contains parquet.avro.schema key with correct value
- Footer metadata omits key when Avro schema is not provided
- Job output directory includes _AVRO_SCHEMA.avsc file
After each Parquet e2e run, verify the file footer contains the
parquet.avro.schema key using parquet-tools meta. Sanity check that
the value contains a "type" field (valid Avro JSON). Fail the e2e
if the key is missing.
The --arrayMode option (bytes, typed_first_row, typed_postgres) from
JdbcExportPipelineOptions is now respected by the Parquet export:

- bytes: serializes arrays as plain BINARY (raw bytes), matching
  the Avro bytes mode behavior
- typed_first_row / typed_postgres: uses Parquet 3-level LIST type
  with typed elements (current default behavior)

The arrayMode is threaded through:
  JdbcParquetJob -> JdbcParquetArgs -> JdbcParquetIO -> WriteSupport
  JdbcParquetJob -> BeamJdbcParquetSchema -> JdbcParquetSchema

Schema generation and write support both check arrayMode to decide
between BINARY and LIST representations for ARRAY columns.
…ering

The two metering classes were 100% identical except for the log prefix
("jdbcavroio" vs "jdbcparquetio"). Introduce JdbcMetering parameterized
by a format name string and update all callers.

JdbcAvroMetering and JdbcParquetMetering are retained as deprecated
thin subclasses for backwards compatibility.
…etJob

The two benchmark classes were ~98% identical, differing only in which
job class they instantiate. Introduce BenchJdbcJob parameterized by a
JobRunner functional interface and a job name string.

BenchJdbcAvroJob and BenchJdbcParquetJob are retained as thin wrappers
that supply the format-specific job factory, preserving their main()
entry points and test compatibility.
When useLogicalTypes is enabled, the Parquet schema declares UUID
columns as FIXED_LEN_BYTE_ARRAY(16) with UUID logical type, but the
writer was emitting a variable-length UTF-8 string. This mismatch
would produce corrupt files or write failures.

Now checks the field's logical type annotation: if it is UUID, writes
the 16-byte big-endian binary representation (MSB first, LSB second).
Without logical types, UUID is still written as a STRING.
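The MSB-first/LSB-second layout can be shown in a few lines. This is an illustrative sketch, not the project's actual writer code: it turns a `java.util.UUID` into the 16-byte big-endian value that a `FIXED_LEN_BYTE_ARRAY(16)` column with the UUID logical type expects.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public final class UuidBytes {
  /** Big-endian 16-byte encoding: most-significant long first, then least-significant. */
  static byte[] toBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())  // bytes 0-7
        .putLong(uuid.getLeastSignificantBits()) // bytes 8-15
        .array();
  }
}
```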
GregorianCalendar is not thread-safe. With --queryParallelism > 1,
concurrent rs.getTimestamp(col, CALENDAR) calls could corrupt
timestamp values. Use a ThreadLocal to give each thread its own
Calendar instance.
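A minimal sketch of that fix, assuming a UTC calendar (the field and class names here are hypothetical): each thread lazily gets its own `GregorianCalendar` instead of sharing one static instance.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public final class CalendarPerThread {
  // One Calendar per thread; GregorianCalendar is not thread-safe.
  static final ThreadLocal<Calendar> UTC_CALENDAR =
      ThreadLocal.withInitial(() -> new GregorianCalendar(TimeZone.getTimeZone("UTC")));

  // Usage at the JDBC call site:
  //   rs.getTimestamp(column, UTC_CALENDAR.get());
}
```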
The InputStream from FileSystems.open() and the Scanner were never
closed, leaking a file handle. Use try-with-resources.
Optional.of(null) throws NPE if the metric key (e.g. KbWritePerSec)
is absent from the map. Use Optional.ofNullable() instead.
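The difference is easy to demonstrate with a hypothetical lookup helper: `Optional.of` rejects null, while `Optional.ofNullable` turns a missing map entry into an empty `Optional`.

```java
import java.util.Map;
import java.util.Optional;

public final class MetricLookup {
  /** Empty Optional when the key is absent, instead of an NPE from Optional.of(null). */
  static Optional<Long> metric(Map<String, Long> metrics, String key) {
    return Optional.ofNullable(metrics.get(key));
  }
}
```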
The API default (JdbcParquetArgs.create()) was 64MB while the CLI
default (ParquetPipelineOptions) was 128MB. Standardize on 128MB
which matches Parquet's own default and the CLI behavior.
prepareExport() was issuing two separate queries to generate the
Parquet MessageType and Avro Schema. Now executes a single
sqlQueryWithLimitOne and passes the ResultSet to both
JdbcParquetSchema.createParquetSchema() and
JdbcAvroSchema.createAvroSchema().

When a --parquetSchemaFilePath is provided, only the Avro schema
still needs a DB query since the Parquet schema comes from the file.
The method replaces non-alphanumeric characters with underscores and
is not Avro-specific. The old name was misleading in a Parquet module.
These SQL types fall through to the default STRING representation.
Document this explicitly so users are not surprised by the lack of
fixed-point DECIMAL Parquet type.
Replace platform-specific stat flags (macOS -f%z vs Linux -c%s) with
wc -c which works on both platforms without fallback logic.
Parquet's withRowGroupSize() accepts long, allowing row groups larger
than 2GB for wide tables. The int type would overflow for such values.
Both classes independently applied the same regex to normalize column
names. If one changed without the other, schema and data would
mismatch. Now WriteSupport calls JdbcParquetSchema.normalizeFieldName.
If getAvroCodec() returns null, the method would NPE on
avroCodec.equals(). Default to snappy in that case.
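A sketch of that null-safe defaulting; the "snappy" default comes from the commit message, while the helper name is hypothetical.

```java
public final class CodecDefaults {
  /** Equivalent to java.util.Objects.requireNonNullElse(avroCodec, "snappy"). */
  static String codecOrDefault(String avroCodec) {
    return avroCodec != null ? avroCodec : "snappy";
  }
}
```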
When --parquetSchemaFilePath was provided, the Avro schema generation
time was not tracked in Beam metrics. Now exposeSchemaMetrics is
called on both code paths.
SQL DATE columns are stored as epoch millis (not Parquet DATE) for
consistency with the Avro export path. Document this in both the
schema builder and writer to prevent future confusion.
JdbcParquetJob.prepareExport queries the DB once for both Parquet and
Avro schemas and calls exposeSchemaMetrics directly, leaving
createSchema and generateParquetSchema referenced only from their own
test. Drop the dead helpers and the corresponding test case.
Scanner with a stringly-named "UTF-8" charset and an "\\A" delimiter
trick is opaque. Switch to CharStreams.toString with an
InputStreamReader and StandardCharsets.UTF_8 for clearer intent and a
typed charset.
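The commit adopts Guava's `CharStreams.toString(new InputStreamReader(in, StandardCharsets.UTF_8))`; a dependency-free JDK sketch of the same idea (hypothetical class name) looks like this, with try-with-resources also covering the file-handle leak fixed earlier:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public final class StreamText {
  /** Reads the whole stream as UTF-8 text, closing it when done. */
  static String readUtf8(InputStream in) {
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      StringBuilder sb = new StringBuilder();
      char[] buf = new char[8192];
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
      return sb.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```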


Development

Successfully merging this pull request may close these issues.

Support for parquet format
