
Commit 02bc690

timsaucer and claude committed
docs: document Python-version and import portability caveats for inline UDFs
Reviewer feedback on the Expr-pickle PRs (#1544) asked that the cloudpickle portability caveats be discoverable on the user-facing page, not only in docstrings. The distributing_work.rst page is the designated canonical home for the distribution story, so add them here:

* New 'Portability requirements for inline Python UDFs' subsection covering the matching-Python-minor-version requirement and the by-value vs by-reference import-capture rule (imported modules must be importable on the worker).
* Qualify the 'fully portable' Python-UDF bullet to point at the new requirements.
* Cross-reference the new subsection from the closure-capture note.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f7550ec commit 02bc690

1 file changed

Lines changed: 31 additions & 5 deletions

docs/source/user-guide/io/distributing_work.rst

@@ -103,10 +103,10 @@ What travels with the expression

 * **Built-in functions** (``abs``, ``length``, arithmetic, comparisons,
   etc.) — fully portable. Worker needs nothing pre-registered.
-* **Python UDFs** — fully portable. The callable, its signature, and
-  any state captured in closures travel inside the serialized
-  expression and are reconstructed on the worker automatically.
-  Applies equally to:
+* **Python UDFs** — travel inline (subject to the two portability
+  requirements below). The callable, its signature, and any state
+  captured in closures travel inside the serialized expression and are
+  reconstructed on the worker automatically. Applies equally to:

   * **scalar UDFs** (:py:func:`datafusion.udf`)
   * **aggregate UDFs** (:py:func:`datafusion.udaf`)
@@ -116,6 +116,31 @@ What travels with the expression
   :py:class:`SessionContext`. Without that registration, evaluation
   raises an error.

+Portability requirements for inline Python UDFs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Inline Python UDFs ride on `cloudpickle
+<https://github.com/cloudpipe/cloudpickle>`_, which imposes two
+requirements on the worker environment:
+
+* **Matching Python minor version.** A cloudpickle payload serializes
+  Python bytecode, which is not stable across Python minor versions. A
+  UDF pickled on Python 3.12 cannot be reconstructed on a 3.11 or 3.13
+  worker. The wire format stamps the sender's ``(major, minor)``; a
+  mismatch raises a clear error naming both versions rather than
+  failing obscurely deep inside ``cloudpickle.loads``. Align the Python
+  version on driver and workers.
+* **Imported modules must be importable on the worker.** cloudpickle
+  captures the UDF callable *by value* — bytecode and closure cells are
+  inlined, so locally-defined functions and lambdas travel whole. But
+  any name the callable resolves through ``import`` is captured *by
+  reference* (module path only). If a UDF body does
+  ``from mylib import transform`` and calls ``transform(...)``, the
+  worker reconstructs the reference by importing ``mylib`` — which must
+  therefore be installed on the worker. The same applies to bound
+  methods of imported classes. Self-contained UDFs (no imports beyond
+  what the worker already has, e.g. ``pyarrow``) avoid this entirely.
+
 Session contexts at a glance
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -260,7 +285,8 @@ Practical considerations
   state — local variables, module-level objects, file paths — that
   state is captured at serialization time. Surprises are possible if
   the captured state is large, mutable, or not portable to the
-  worker's environment.
+  worker's environment. See `Portability requirements for inline
+  Python UDFs`_ for the Python-version and imported-module rules.

 Disabling Python UDF inlining
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
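
Illustration of the matching-minor-version rule from the new subsection. This is only a sketch of the idea of stamping the sender's ``(major, minor)`` and checking it before unpickling. It is not datafusion-python's actual wire format: the envelope layout and the names ``dumps_with_version``, ``loads_checked``, and ``PythonVersionMismatch`` are hypothetical, and stdlib ``pickle`` stands in for cloudpickle so the snippet has no dependencies.

```python
import pickle
import sys


class PythonVersionMismatch(RuntimeError):
    """Hypothetical error type for a sender/receiver version mismatch."""


def dumps_with_version(obj, version=None):
    """Wrap a pickled payload in an envelope stamped with (major, minor).

    `version` is overridable only so a mismatch can be simulated below.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    envelope = {"python": tuple(version), "payload": pickle.dumps(obj)}
    return pickle.dumps(envelope)


def loads_checked(data):
    """Compare the stamp with the local interpreter before unpickling."""
    envelope = pickle.loads(data)
    sender = tuple(envelope["python"])
    receiver = tuple(sys.version_info[:2])
    if sender != receiver:
        raise PythonVersionMismatch(
            f"payload was serialized on Python {sender[0]}.{sender[1]} "
            f"but this worker runs Python {receiver[0]}.{receiver[1]}"
        )
    # Only reached when both sides agree on (major, minor).
    return pickle.loads(envelope["payload"])


# Same interpreter on both sides: the round trip succeeds.
assert loads_checked(dumps_with_version([1, 2, 3])) == [1, 2, 3]

# Simulate a sender running a different minor version.
other = (sys.version_info[0], sys.version_info[1] + 1)
try:
    loads_checked(dumps_with_version([1, 2, 3], version=other))
except PythonVersionMismatch as exc:
    print(exc)
```

Checking the stamp before touching the inner payload is what turns an obscure deserialization failure into an error that names both versions.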

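
Illustration of the by-reference rule from the new subsection. In this sketch stdlib ``pickle``, which always stores functions by reference, stands in for the way cloudpickle is described as treating names resolved through ``import``. The module name ``mylib_demo`` and its ``transform`` function are hypothetical, built in-process so the snippet needs no third-party package.

```python
import pickle
import sys
import types

# Build a throwaway module so the example does not need a real package.
mylib_demo = types.ModuleType("mylib_demo")


def transform(x):
    return x * 2


# Make the function look as if it had been defined inside mylib_demo.
transform.__module__ = "mylib_demo"
transform.__qualname__ = "transform"
mylib_demo.transform = transform
sys.modules["mylib_demo"] = mylib_demo

# A by-reference pickle stores the module path and attribute name,
# not the function's bytecode.
payload = pickle.dumps(transform)
assert b"mylib_demo" in payload and b"transform" in payload

# "Driver" side: the module is importable, so the reference resolves.
assert pickle.loads(payload)(21) == 42

# "Worker" side: simulate an environment where mylib_demo is not installed.
del sys.modules["mylib_demo"]
worker_error = None
try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    worker_error = exc
    print(f"worker cannot rebuild the reference: {exc}")
```

The payload is tiny because it holds only a name, which is why the module has to be installed wherever the payload is loaded.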