OpenLineage listener's ProcessPoolExecutor becomes permanently broken #67283

@sergiobuj

Under which category would you file this issue?

Providers

Apache Airflow version

3.1.8+astro.1

What happened and how to reproduce it?

The OpenLineage listener plugin uses a ProcessPoolExecutor to emit lineage events asynchronously from the scheduler. When a child process in the pool terminates abruptly, Python's concurrent.futures marks the pool as permanently broken. After this point, every subsequent OpenLineage event fails with BrokenProcessPool until the scheduler process is restarted.

This causes extended periods of missing lineage data with no self-recovery. The warning is logged but the pool is never recreated, so the problem persists indefinitely.

Scheduler logs showing the error

 2026-05-21T08:01:02.690533Z [warning] OpenLineage received exception in method on_dag_run_success
     [airflow.providers.openlineage.plugins.listener] loc=listener.py:918

 Traceback (most recent call last):
   File ".../airflow/providers/openlineage/plugins/listener.py", line 896, in on_dag_run_success
     self.submit_callable(
   File ".../airflow/providers/openlineage/plugins/listener.py", line 974, in submit_callable
     fut = self.executor.submit(callable, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/usr/local/lib/python3.12/concurrent/futures/process.py", line 805, in submit
     raise BrokenProcessPool(self._broken)
 concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore
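
The failure mode in the traceback above can be reproduced outside Airflow with the standard library alone. The following is a minimal sketch, not the listener's code: `crash`, `emit`, and `reproduce` are hypothetical names, `os._exit` stands in for whatever actually kills the child (for example an OOM kill), and the `fork` start method is an assumption that limits the script to POSIX systems.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    # Simulate an abrupt child termination: exit without any cleanup.
    os._exit(1)


def emit(event):
    # Stand-in for emitting a lineage event (hypothetical).
    return f"emitted {event}"


def reproduce():
    """Kill the only worker, then show that later submissions are rejected."""
    # Assumption: "fork" start method, available on POSIX only.
    ctx = multiprocessing.get_context("fork")
    executor = ProcessPoolExecutor(max_workers=1, mp_context=ctx)
    outcomes = []

    # Step 1: the task whose worker dies fails with BrokenProcessPool.
    try:
        executor.submit(crash).result()
    except BrokenProcessPool as exc:
        outcomes.append(("result", type(exc).__name__))

    # Step 2: the pool is now marked broken, so submit() itself raises,
    # even for a task that would have succeeded.
    try:
        executor.submit(emit, "dag_run_success")
    except BrokenProcessPool as exc:
        outcomes.append(("submit", type(exc).__name__))

    executor.shutdown(wait=False)
    return outcomes


if __name__ == "__main__":
    print(reproduce())
```

The second step corresponds to the scheduler log: once the pool is broken, `submit` raises immediately on every call, and the same executor instance never recovers.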

What you think should happen instead?

The OpenLineage integration could be self-healing and prevent extended outages in lineage reporting.

When a BrokenProcessPool exception is raised in submit_callable, the listener could detect the broken pool state, create a new ProcessPoolExecutor instance, and retry the submission.
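
A rough sketch of that idea follows. It is not the provider's implementation: `SelfHealingSubmitter`, `executor_factory`, and `rebuild_count` are hypothetical names, and only the method name `submit_callable` mirrors the traceback. The sketch assumes one retry is enough and that abandoning the broken pool with `shutdown(wait=False)` is acceptable.

```python
from concurrent.futures import Executor, Future, ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from typing import Any, Callable


class SelfHealingSubmitter:
    """Hypothetical wrapper that replaces a broken pool and retries once."""

    def __init__(self, executor_factory: Callable[[], Executor] = ProcessPoolExecutor):
        # The factory is kept so a fresh pool can be built after a failure.
        self._executor_factory = executor_factory
        self._executor = executor_factory()
        self.rebuild_count = 0

    def submit_callable(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Future:
        try:
            return self._executor.submit(fn, *args, **kwargs)
        except BrokenProcessPool:
            # The current pool is unusable: replace it and retry once.
            # A second failure propagates to the caller.
            self._rebuild()
            return self._executor.submit(fn, *args, **kwargs)

    def _rebuild(self) -> None:
        # Release the broken pool without blocking, then create a new one.
        self._executor.shutdown(wait=False)
        self._executor = self._executor_factory()
        self.rebuild_count += 1
```

This only recovers submissions made after the pool broke. The event that was running when the child died is still lost, and a pool that keeps breaking (for example under memory pressure) would be rebuilt on every call, so a real fix would probably also log each rebuild and possibly rate-limit it.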

Operating System

Debian GNU/Linux 12 (bookworm) — Linux 5.15.0-1110-azure (containerized on Azure)

Deployment

Astronomer

Apache Airflow Provider(s)

openlineage

Versions of Apache Airflow Providers

I think these are the relevant ones from the freeze output:

openlineage-integration-common==1.41.0
openlineage-python==1.45.0
openlineage_sql==1.41.0

Official Helm Chart version

Not Applicable

Kubernetes Version

Not Applicable

Helm Chart configuration

No response

Docker Image customizations

FROM astrocrpublic.azurecr.io/runtime:3.1-14

ENV AIRFLOW__CORE__MAX_MAP_LENGTH=3072
ENV AIRFLOW__PROVIDERS_JDBC__ALLOW_DRIVER_CLASS_IN_EXTRA=true
ENV AIRFLOW__PROVIDERS_JDBC__ALLOW_DRIVER_PATH_IN_EXTRA=true
ENV JAVA_HOME=/usr/lib/jvm/default-java
ENV AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=50
ENV AIRFLOW__CORE__ALLOWED_DESERIALIZATION_CLASSES="[redacted]"

# .jar copies
# [redacted COPY]
# apt installs of ODBC Driver
# [redacted apt-get]

USER astro

Anything else?

Environment details:

  • Python 3.12.13
  • Running on Astronomer Runtime (Medium: Scheduler (1 vCPU, 2GiB RAM), DAG Processor (1 vCPU, 2GiB RAM))

Impact: Downstream consumers of OpenLineage events see extended periods of zero events. The only recovery is a scheduler restart (a Git deploy on Astro), even though the scheduler otherwise functions normally (task execution is unaffected).

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
