Under which category would you file this issue?
Airflow Core
Apache Airflow version
3.1.8
What happened and how to reproduce it?
Each task run of KubernetesJobOperator leaves behind one (or more, with
parallelism > 1) monitoring pod in the target namespace. Pods accumulate
forever; in our environment we observed 25 orphan pods in a single
namespace after a few weeks of scheduled runs.
The leaked pods are the monitoring/log-streaming pods discovered via
self.get_pods(...) and stored on self.pods. They have no
ownerReferences (they are not owned by the V1Job), so:
- ttl_seconds_after_finished on the Job does not reap them: Job TTL only covers Job-owned pods.
- propagation_policy="Foreground" on on_kill does not reap them: cascading deletion only follows ownerReferences.
- The on_finish_action parameter inherited from KubernetesPodOperator is silently ignored: KubernetesJobOperator.execute() never invokes post_complete_action() / cleanup() / process_pod_deletion().
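To check which pods in a namespace fall into this category, the following minimal sketch filters a pod list for entries without ownerReferences. It assumes the input has the shape produced by `kubectl get pods -n <namespace> -o json`; the function name find_unowned_pods and the sample pod names are hypothetical and not taken from the provider or from our cluster.

```python
from typing import Any


def find_unowned_pods(pod_list: dict[str, Any]) -> list[str]:
    """Return the names of pods whose metadata carries no ownerReferences."""
    names = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        # The key is absent when a pod has no owner, so a missing key and
        # an empty list are treated the same way.
        if not metadata.get("ownerReferences"):
            names.append(metadata.get("name", "<unnamed>"))
    return names


# Hypothetical sample in the shape of a kubectl pod list; not real output.
SAMPLE_POD_LIST = {
    "items": [
        {
            "metadata": {
                "name": "example-owned-pod",
                "ownerReferences": [{"kind": "Job", "name": "example-job"}],
            },
            "status": {"phase": "Succeeded"},
        },
        {
            "metadata": {"name": "example-unowned-pod"},
            "status": {"phase": "Succeeded"},
        },
    ]
}

if __name__ == "__main__":
    print(find_unowned_pods(SAMPLE_POD_LIST))
```

To run it against a real namespace, replace SAMPLE_POD_LIST with the parsed JSON from kubectl, for example via json.load(sys.stdin).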
What you think should happen instead?
Same contract as KubernetesPodOperator:
- With the default on_finish_action="delete_pod", pods should be deleted after the task finishes (success or failure).
- With on_finish_action="delete_succeeded_pod", only successful pods should be deleted.
- With on_finish_action="keep_pod", pods are kept.
For the Job operator, none of those values currently have any effect on
the monitoring pods.
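The expected contract can be written down as a small decision function. This is only a sketch of the semantics described above, not the provider's implementation; the name should_delete_pod is hypothetical, and it assumes the pod phase is available as the standard Kubernetes phase string (for example "Succeeded" or "Failed").

```python
def should_delete_pod(on_finish_action: str, pod_phase: str) -> bool:
    """Decide whether a pod should be removed once the task has finished."""
    if on_finish_action == "delete_pod":
        # Delete regardless of outcome.
        return True
    if on_finish_action == "delete_succeeded_pod":
        # Keep failed pods around for debugging.
        return pod_phase == "Succeeded"
    if on_finish_action == "keep_pod":
        return False
    raise ValueError(f"Unknown on_finish_action: {on_finish_action!r}")


if __name__ == "__main__":
    for action in ("delete_pod", "delete_succeeded_pod", "keep_pod"):
        for phase in ("Succeeded", "Failed"):
            print(action, phase, should_delete_pod(action, phase))
```

Under this reading, the Job operator would apply the same decision to every pod stored on self.pods after the Job completes.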
Operating System
No response
Deployment
Official Apache Airflow Helm Chart
Apache Airflow Provider(s)
No response
Versions of Apache Airflow Providers
No response
Official Helm Chart version
1.19.0
Kubernetes Version
1.35.0
Helm Chart configuration
No response
Docker Image customizations
Not applicable
Anything else?
- 100% reproduction rate on every task run.
- Affects every KubernetesJobOperator deployment we are aware of. It goes unnoticed because the leak is slow (one pod per run) and the pods are in the Succeeded state, so they don't trip readiness/liveness alerts.
- Operationally significant on long-running deployments: namespace quotas, kubectl get pods latency, and scheduler workload all degrade until someone manually cleans up.
Are you willing to submit PR?
Code of Conduct