
Support all bq load job and ext table config options in GCSToBigQueryOperator#64505

Draft
mlauter wants to merge 1 commit into apache:main from mlauter:gcs_to_bigquery_load_job_config
@mlauter (Contributor) commented Mar 30, 2026

Description

Adds an extra_config parameter that is merged into the BigQuery load job configuration (when external_table=False) or the external table configuration (when external_table=True), allowing callers to set any API field not exposed as a top-level operator param without subclassing.

Deprecates src_fmt_configs in favor of extra_config.

Rationale

GCSToBigQueryOperator explicitly validates every parameter it accepts against a known list of valid BigQuery configuration fields. This makes it impossible to use any API option that isn't already exposed as a top-level operator param without subclassing or monkey-patching. Supporting every possible configuration option individually would add a lot of complexity to the operator, and the operator would need to change any time Google added a parameter.

Rather than continuing to add individual params for every possible API option, this PR adds a single extra_config dict that is merged directly into the underlying BigQuery configuration at execution time: into JobConfigurationLoad when loading into an existing table, or into ExternalDataConfiguration when creating an external table. Keys in extra_config take precedence over the operator's own top-level params (just as keys in src_fmt_configs took precedence over the top-level params).
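
The precedence rule can be illustrated with plain dictionaries. This is a minimal sketch, not the operator's code: `merge_extra_config` is a hypothetical helper, the base values are made up for illustration, and the shallow `dict.update` semantics are an assumption based on the `external_config_api_repr.update(self.extra_config)` line quoted in the review of this PR.

```python
from typing import Any, Dict, Optional


def merge_extra_config(
    base: Dict[str, Any], extra: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Return a copy of ``base`` with ``extra`` applied on top (shallow merge)."""
    merged = dict(base)  # copy so the caller's dict is not mutated
    merged.update(extra or {})  # keys in ``extra`` override keys in ``base``
    return merged


# Configuration derived from top-level operator params (illustrative values).
base = {
    "sourceFormat": "PARQUET",
    "writeDisposition": "WRITE_TRUNCATE",
    "maxBadRecords": 0,
}
# User-supplied passthrough fields; "maxBadRecords" overlaps with ``base``.
extra = {"parquetOptions": {"enableListInference": True}, "maxBadRecords": 10}

merged = merge_extra_config(base, extra)
```

Because the merge in this sketch is shallow, a nested key such as `parquetOptions` in `extra` would replace any existing nested dict wholesale rather than being combined with it. Whether the operator merges nested keys more deeply is not stated in this description.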

This follows a similar pattern to the configuration passthrough in BigQueryInsertJobOperator.

Testing

  • Added unit tests and ran tests to ensure they pass
  • Ran Airflow locally and ran an integration test using a custom dag

Parquet list inference using extra_config and external table

    gcs_to_bq_task = GCSToBigQueryOperator(
        task_id="gcs_to_bigquery_parquet_ext_table_options",
        bucket="etldata-prod-adhoc-data-hkwv8r",
        source_objects=["user/mlauter/gcs_to_bq_parquet/input/part-00007-9c1d416d-505b-490e-b783-31ba43a6befc-c000.snappy.parquet"],
        destination_project_dataset_table="etsy-data-warehouse-dev.mlauter.gcs_to_bigquery_parquet_list_test_ext_table",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
        external_table=True,
        extra_config={"parquetOptions": {"enableListInference": True}},
    )
[Screenshot: extra_config_ext_table]

Parquet list inference using extra_config and a load job

    gcs_to_bq_task = GCSToBigQueryOperator(
        task_id="gcs_to_bigquery_parquet_options",
        bucket="etldata-prod-adhoc-data-hkwv8r",
        source_objects=["user/mlauter/gcs_to_bq_parquet/input/part-00007-9c1d416d-505b-490e-b783-31ba43a6befc-c000.snappy.parquet"],
        destination_project_dataset_table="etsy-data-warehouse-dev.mlauter.gcs_to_bigquery_parquet_list_test",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
        extra_config={"parquetOptions": {"enableListInference": True}, "columnNameCharacterMap": "V2"},
    )
[Screenshot: extra_config_load_job]

The Dag succeeded and the tables were created as expected. The Parquet option here is not related to the PR; it's just a setting I'm familiar with, and I used it to confirm that configuration options passed in extra_config do take effect.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Sonnet 4.6 following the guidelines


…SToBigQueryOperator

Adds an ``extra_config`` parameter that is merged into the BigQuery load job
configuration (when ``external_table=False``) or the external table configuration
(when ``external_table=True``), allowing callers to set any API field not exposed
as a top-level operator param without subclassing.

Deprecates ``src_fmt_configs`` in favor of ``extra_config``.
@boring-cyborg boring-cyborg bot added area:providers provider:google Google (including GCP) related issues labels Mar 30, 2026
@kaxil kaxil requested a review from Copilot April 2, 2026 00:41
Contributor

Copilot AI left a comment

Pull request overview

Adds a generic extra_config passthrough to GCSToBigQueryOperator so callers can set any BigQuery API fields not exposed as explicit operator params, and deprecates src_fmt_configs in favor of this new mechanism.

Changes:

  • Introduces extra_config and merges it into the BigQuery load job configuration or external table configuration at execution time.
  • Deprecates src_fmt_configs and emits an AirflowProviderDeprecationWarning when it’s used.
  • Adds unit tests covering extra_config merge behavior and deprecation warnings.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
providers/google/src/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py | Adds extra_config handling and src_fmt_configs deprecation warning in GCSToBigQueryOperator.
providers/google/tests/unit/google/cloud/transfers/test_gcs_to_bigquery.py | Adds/updates unit tests for extra_config behavior and deprecation warnings.

Comment on lines +593 to 597

    external_config_api_repr.update(self.extra_config)

    external_config = ExternalConfig.from_api_repr(external_config_api_repr)
    if self.schema_fields:
        external_config.schema = [SchemaField.from_api_repr(f) for f in self.schema_fields]

Copilot AI Apr 2, 2026

extra_config is merged into external_config_api_repr before ExternalConfig is instantiated, but later in this method external_config.schema and external_config.max_bad_records are set from top-level params. That means extra_config does not consistently take precedence for overlapping fields (e.g. extra_config={"maxBadRecords": 10} will be overridden when max_bad_records is set). To preserve the documented precedence, apply extra_config last (after all top-level-derived fields are applied) or only set schema/max_bad_records when the corresponding key is not present in extra_config.
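
The first remedy suggested here (apply extra_config last) might look roughly like the following. This is a dependency-free sketch under stated assumptions: `build_external_config_repr` is a hypothetical function, it works on a plain API-style dict instead of a real ExternalConfig object, and the key names follow the camelCase style used elsewhere in this PR.

```python
from typing import Any, Dict, List, Optional


def build_external_config_repr(
    source_format: str,
    max_bad_records: Optional[int] = None,
    schema_fields: Optional[List[Dict[str, Any]]] = None,
    extra_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build an API-style dict, applying ``extra_config`` after all top-level params."""
    api_repr: Dict[str, Any] = {"sourceFormat": source_format}
    if max_bad_records is not None:
        api_repr["maxBadRecords"] = max_bad_records
    if schema_fields:
        api_repr["schema"] = {"fields": schema_fields}
    # Applied last, so overlapping keys in extra_config win over top-level params.
    api_repr.update(extra_config or {})
    return api_repr


config = build_external_config_repr(
    "PARQUET",
    max_bad_records=0,
    extra_config={"maxBadRecords": 10},
)
```

In the operator itself, the resulting dict would presumably still be passed to `ExternalConfig.from_api_repr` as in the quoted diff; that step is left out so the sketch runs without the BigQuery client library.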

    self.schema_update_options = schema_update_options
    self.src_fmt_configs = src_fmt_configs
    if src_fmt_configs:

Copilot AI Apr 2, 2026

The deprecation warning for src_fmt_configs is guarded by if src_fmt_configs:. Because src_fmt_configs is normalized to {} when None (and an explicitly provided empty dict is falsy), using the deprecated parameter can fail to emit a warning. If the intent is to warn whenever the parameter is provided, capture the original argument before defaulting and check src_fmt_configs is not None (or use a sentinel) rather than a truthiness check.

Suggested change

    - if src_fmt_configs:
    + if src_fmt_configs is not None:
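
The difference between the truthiness check and the `is not None` check can be demonstrated in isolation. This is a minimal sketch: `ExampleOperator` and `count_warnings` are hypothetical names, and the built-in `DeprecationWarning` stands in for AirflowProviderDeprecationWarning so the example runs without Airflow installed.

```python
import warnings
from typing import Any, Dict, Optional


class ExampleOperator:
    """Hypothetical stand-in for the operator's constructor logic."""

    def __init__(self, src_fmt_configs: Optional[Dict[str, Any]] = None) -> None:
        # Check the original argument before defaulting, so an explicit {} still warns.
        if src_fmt_configs is not None:
            warnings.warn(
                "src_fmt_configs is deprecated; use extra_config instead.",
                DeprecationWarning,  # stand-in for AirflowProviderDeprecationWarning
                stacklevel=2,
            )
        # Normalize only after the deprecation check has run.
        self.src_fmt_configs = src_fmt_configs or {}


def count_warnings(**kwargs: Any) -> int:
    """Construct the operator and return how many deprecation warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        ExampleOperator(**kwargs)
    return sum(1 for w in caught if issubclass(w.category, DeprecationWarning))
```

With this check, omitting the parameter emits no warning, while passing either an empty or a non-empty dict emits one. With the original `if src_fmt_configs:` guard, the empty-dict case would pass silently.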
