KubernetesJobOperator leaks monitoring pods — on_finish_action is ignored #67332

@jykae

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.1.8

What happened and how to reproduce it?

Each task run of KubernetesJobOperator leaves behind one monitoring pod
(or more, when parallelism > 1) in the target namespace. These pods are
never cleaned up; in our environment we observed 25 orphan pods in a
single namespace after a few weeks of scheduled runs.

The leaked pods are the monitoring/log-streaming pods discovered via
self.get_pods(...) and stored on self.pods. They have no ownerReferences
(they are not owned by the V1Job), so:

  • ttl_seconds_after_finished on the Job does not reap them — Job TTL
    only covers Job-owned pods.
  • propagation_policy="Foreground" on on_kill does not reap them —
    cascade only follows ownerReferences.
  • The on_finish_action parameter inherited from KubernetesPodOperator
    is silently ignored — KubernetesJobOperator.execute() never invokes
    post_complete_action() / cleanup() / process_pod_deletion().
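To check whether a namespace is affected, the missing ownerReferences can be looked for directly. The following is a minimal sketch, assuming the output of `kubectl get pods -n <namespace> -o json` has been parsed into a dict. The helper name `find_unowned_pods` is hypothetical and the sample data is hand-written, not taken from a real cluster.

```python
import json
from typing import Any


def find_unowned_pods(pod_list: dict[str, Any]) -> list[str]:
    """Return the names of pods that carry no ownerReferences."""
    unowned = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        # A missing or empty ownerReferences field means neither Job TTL
        # nor cascading deletion will reach this pod.
        if not metadata.get("ownerReferences"):
            unowned.append(metadata.get("name", "<unnamed>"))
    return unowned


# Hand-written sample in the shape of `kubectl get pods -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "job-owned-pod",
                  "ownerReferences": [{"kind": "Job", "name": "my-job"}]},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "leaked-pod"},
     "status": {"phase": "Succeeded"}}
  ]
}
""")

print(find_unowned_pods(sample))
```

Any pod this reports that matches the operator's labels would be a candidate for the leak described above.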

What you think should happen instead?

Same contract as KubernetesPodOperator:

  • With default on_finish_action="delete_pod", pods should be deleted
    after the task finishes (success or failure).
  • With on_finish_action="delete_succeeded_pod", only pods that
    succeeded should be deleted.
  • With on_finish_action="keep_pod", pods should be kept.

For the Job operator, none of those values currently have any effect on
the monitoring pods.
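The expected contract can be written down as a small decision function. This is only a sketch of the behaviour described above, not the provider's code: the enum is a local stand-in rather than an import from the provider, `should_delete_pod` is a hypothetical helper, and "Succeeded" is assumed to be the pod phase that counts as success.

```python
from enum import Enum


class OnFinishAction(str, Enum):
    """Local stand-in for the three values listed in this issue."""

    KEEP_POD = "keep_pod"
    DELETE_POD = "delete_pod"
    DELETE_SUCCEEDED_POD = "delete_succeeded_pod"


def should_delete_pod(action: str, pod_phase: str) -> bool:
    """Decide whether a pod should be removed once the task has finished."""
    resolved = OnFinishAction(action)  # raises ValueError on unknown values
    if resolved is OnFinishAction.DELETE_POD:
        # Delete regardless of success or failure.
        return True
    if resolved is OnFinishAction.DELETE_SUCCEEDED_POD:
        # Delete only pods that finished successfully.
        return pod_phase == "Succeeded"
    # keep_pod: never delete.
    return False


for action in ("delete_pod", "delete_succeeded_pod", "keep_pod"):
    for phase in ("Succeeded", "Failed"):
        print(action, phase, should_delete_pod(action, phase))
```

A fix would presumably apply a check like this to every pod stored on self.pods after the Job completes.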

Operating System

No response

Deployment

Official Apache Airflow Helm Chart

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

1.19.0

Kubernetes Version

1.35.0

Helm Chart configuration

No response

Docker Image customizations

Not applicable

Anything else?

  • 100% reproduction rate on every task run.
  • Affects every KubernetesJobOperator deployment we are aware of —
    silent because the leak is slow (one pod per run) and the pods are in
    Succeeded state, so they don't trip readiness/liveness alerts.
  • Operationally significant on long-running deployments: namespace quotas,
    kubectl get pods latency, and scheduler workload all degrade until
    someone manually cleans up.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

kind:bug, needs-triage
