Support all bq load job and ext table config options in GCSToBigQueryOperator#64505
mlauter wants to merge 1 commit into apache:main
…SToBigQueryOperator Adds an ``extra_config`` parameter that is merged into the BigQuery load job configuration (when ``external_table=False``) or the external table configuration (when ``external_table=True``), allowing callers to set any API field not exposed as a top-level operator param without subclassing. Deprecates ``src_fmt_configs`` in favor of ``extra_config``.
Pull request overview
Adds a generic extra_config passthrough to GCSToBigQueryOperator so callers can set any BigQuery API fields not exposed as explicit operator params, and deprecates src_fmt_configs in favor of this new mechanism.
Changes:
- Introduces `extra_config` and merges it into the BigQuery load job configuration or external table configuration at execution time.
- Deprecates `src_fmt_configs` and emits an `AirflowProviderDeprecationWarning` when it's used.
- Adds unit tests covering `extra_config` merge behavior and deprecation warnings.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| providers/google/src/airflow/providers/google/cloud/transfers/gcs_to_bigquery.py | Adds extra_config handling and src_fmt_configs deprecation warning in GCSToBigQueryOperator. |
| providers/google/tests/unit/google/cloud/transfers/test_gcs_to_bigquery.py | Adds/updates unit tests for extra_config behavior and deprecation warnings. |
    external_config_api_repr.update(self.extra_config)

    external_config = ExternalConfig.from_api_repr(external_config_api_repr)
    if self.schema_fields:
        external_config.schema = [SchemaField.from_api_repr(f) for f in self.schema_fields]
extra_config is merged into external_config_api_repr before ExternalConfig is instantiated, but later in this method external_config.schema and external_config.max_bad_records are set from top-level params. That means extra_config does not consistently take precedence for overlapping fields (e.g. extra_config={"maxBadRecords": 10} will be overridden when max_bad_records is set). To preserve the documented precedence, apply extra_config last (after all top-level-derived fields are applied) or only set schema/max_bad_records when the corresponding key is not present in extra_config.
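The ordering fix suggested in this comment can be sketched with plain dictionaries. This is only an illustration, not the operator's code: `build_external_config_repr` and its arguments are hypothetical names, and the keys follow the BigQuery REST representation of an external data configuration.

```python
from __future__ import annotations

from typing import Any


def build_external_config_repr(
    base: dict[str, Any],
    max_bad_records: int | None,
    schema_fields: list[dict[str, Any]] | None,
    extra_config: dict[str, Any] | None,
) -> dict[str, Any]:
    """Hypothetical helper: apply top-level params first, extra_config last."""
    api_repr = dict(base)  # shallow copy so the caller's dict is not mutated
    if max_bad_records is not None:
        api_repr["maxBadRecords"] = max_bad_records
    if schema_fields:
        api_repr["schema"] = {"fields": schema_fields}
    # Merged last, so overlapping keys from extra_config win.
    api_repr.update(extra_config or {})
    return api_repr


merged = build_external_config_repr(
    base={"sourceFormat": "CSV", "sourceUris": ["gs://example-bucket/data.csv"]},
    max_bad_records=0,
    schema_fields=None,
    extra_config={"maxBadRecords": 10},
)
```

Because `dict.update` is shallow, a nested value supplied through `extra_config` replaces the whole nested dict rather than being merged key by key. Whether that is the desired behavior for nested options is a design choice the sketch does not settle.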
    self.schema_update_options = schema_update_options
    self.src_fmt_configs = src_fmt_configs
    if src_fmt_configs:
The deprecation warning for src_fmt_configs is guarded by if src_fmt_configs:. Because src_fmt_configs is normalized to {} when None (and an explicitly provided empty dict is falsy), using the deprecated parameter can fail to emit a warning. If the intent is to warn whenever the parameter is provided, capture the original argument before defaulting and check src_fmt_configs is not None (or use a sentinel) rather than a truthiness check.
    - if src_fmt_configs:
    + if src_fmt_configs is not None:
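A minimal sketch of the difference between the two checks, assuming a stand-in class rather than the real operator. `LoaderSketch` is a hypothetical name, and the built-in `DeprecationWarning` is used here in place of `AirflowProviderDeprecationWarning` so the example runs without Airflow installed.

```python
import warnings


class LoaderSketch:
    """Hypothetical stand-in for the operator's constructor logic."""

    def __init__(self, src_fmt_configs=None, extra_config=None):
        # Check the original argument before defaulting it to {}, so that an
        # explicitly passed empty dict still triggers the warning.
        if src_fmt_configs is not None:
            warnings.warn(
                "src_fmt_configs is deprecated; use extra_config instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        self.src_fmt_configs = src_fmt_configs or {}
        self.extra_config = extra_config or {}


with warnings.catch_warnings(record=True) as caught_empty:
    warnings.simplefilter("always")
    LoaderSketch(src_fmt_configs={})  # explicit but empty: should still warn

with warnings.catch_warnings(record=True) as caught_default:
    warnings.simplefilter("always")
    LoaderSketch()  # parameter not provided: should stay silent
```

With a truthiness check (`if src_fmt_configs:`), the first call would emit no warning, which is the gap the comment points out.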
Description
Adds an `extra_config` parameter that is merged into the BigQuery load job configuration (when `external_table=False`) or the external table configuration (when `external_table=True`), allowing callers to set any API field not exposed as a top-level operator param without subclassing.
Deprecates `src_fmt_configs` in favor of `extra_config`.
Rationale
GCSToBigQueryOperator explicitly validates every parameter it accepts against a known list of valid BigQuery configuration fields. This makes it impossible to use any API option that isn't already exposed as a top-level operator param without subclassing or monkey-patching. To support all possible configuration options, we'd need to add a lot of complexity to the operator, and we'd need to change it any time Google added a parameter.
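To make the limitation concrete, here is a rough sketch contrasting allowlist validation with a passthrough merge. Everything in it is illustrative: `KNOWN_FIELDS` is a made-up allowlist (not the operator's real one), `newlyAddedOption` is a made-up key standing in for any API field added after the allowlist was written, and both function names are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hypothetical allowlist; NOT the operator's actual list of valid fields.
KNOWN_FIELDS = {"skipLeadingRows", "fieldDelimiter"}


def validate_strict(configs: dict[str, Any]) -> dict[str, Any]:
    """Reject any key that is not on the allowlist."""
    unknown = set(configs) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    return dict(configs)


def merge_passthrough(
    load_config: dict[str, Any], extra_config: dict[str, Any] | None
) -> dict[str, Any]:
    """Merge caller-supplied keys into the config without validating them."""
    merged = dict(load_config)
    merged.update(extra_config or {})
    return merged


requested = {"skipLeadingRows": 1, "newlyAddedOption": True}  # made-up key

try:
    validate_strict(requested)
    rejected = False
except ValueError:
    rejected = True  # the allowlist blocks the unknown option

merged = merge_passthrough({"sourceFormat": "CSV"}, requested)
```

The trade-off is that a passthrough leaves validation of unknown keys to the BigQuery API itself, so a typo in a key surfaces at job submission rather than at operator construction.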
Rather than continuing to add individual params for every possible API option, this PR adds a single `extra_config` dict that is merged directly into the underlying BigQuery configuration at execution time: into `JobConfigurationLoad` when loading into an existing table, or into `ExternalDataConfiguration` when creating an external table. Keys in `extra_config` take precedence over the operator's own top-level params (just as keys in `src_fmt_configs` took precedence over the top-level params).
This follows a similar pattern to the `configuration` passthrough in `BigQueryInsertJobOperator`.
Testing
Parquet list inference using extra_config and external table
Parquet list inference using extra config
Dag succeeded and tables were created as expected. The Parquet option here is not related to the PR; it's just a setting I'm familiar with, and I was confirming that configuration options passed in `extra_config` do work.
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Sonnet 4.6 following the guidelines