[WIP][SPARK-55206][PYTHON][SQL] Transpilation minimal functional implementation with python#56327
Conversation
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…file Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… class but reset the python worker logs between Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
… and transpialtion subsequently disabled or ANSI mode swapped and instead just ignore all of the transpiled elems Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…s still there or not. Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
* [SPARK][PYTHON] Refuse to lower non-boolean and/or/not in UDF transpiler; fix Column truthiness in transpile path; surface column in CANNOT_CONVERT_COLUMN_INTO_BOOL The UDF transpiler previously lowered Python `and` / `or` / `not` to Spark's bitwise `&` / `|` / `~` regardless of operand type. For non-bool operands (e.g. `x or 0` against an int column) this silently diverges from Python's truthiness semantics. Detect statically-non-boolean operands (numeric / string literals, arithmetic BinOps, USub/UAdd, IfExp with both arms non-boolean) and refuse to lower; the UDF then falls back to interpreted Python. Also: - Fix `if transpiled_column:` triggering CANNOT_CONVERT_COLUMN_INTO_BOOL during transpilation (which broke test_udf_transpile_basic). Use `is not None` instead. - Include the offending column's repr in the CANNOT_CONVERT_COLUMN_INTO_BOOL error so users can tell which column triggered it. - Add unit tests for boolean and/or transpilation, non-boolean fallback, and the improved error message; add hypothesis tests for both_positive / either_positive. * Fix ruff F841 and CI test failures in transpiler boolean tests - Remove leftover unused ``transpiled_param_names`` local in udf.py (the class attribute ``self._transpiled_param_names`` already covers this; the local was a leftover from an earlier refactor). - Rewrite the unit-test UDFs ``both_positive`` / ``either_positive`` / the matching hypothesis UDFs as single-statement bodies so the transpiler (which only handles one top-level statement) actually lowers them. The previous form short-circuited out before the BoolOp path was exercised. - Stop relying on schema inference for None-only Row inputs in ``test_udf_transpile_falls_back_for_non_boolean_short_circuit``. Pass an explicit LongType schema so ``createDataFrame`` doesn't fail with CANNOT_DETERMINE_TYPE on the ``or_zero_none`` case. - Use a non-null long strategy for the boolean hypothesis tests since the interpreted Python body would raise on a None operand. * Fix ruff format issues in udf.py and test_udf_transpile_unit.py The previous lint failure was masked by ruff check failing first (F841 unused variable). With check passing, ruff format now flags the surrounding region: a stray blank line in the unit test file and a multi-line conditional in udf.py that fits on one line. Apply ruff format auto-fix. * Support comparison operators in transpiler; fix mypy FunctionDef typing Two CI fixes: 1. The new ``test_udf_transpile_boolean_and_or_lowered`` test exercises ``x > 0 and y > 0``, but the transpiler's Compare handler only covered ``Is`` / ``IsNot``. Add Eq, NotEq, Lt, LtE, Gt, GtE so the bool-typed and/or path actually transpiles instead of bailing on the first comparison. Drop the now-stale ``less_than_zero`` case from the ``falls_back_for_unsupported_patterns`` matrix. 2. mypy 1.8.0 (CI) couldn't pick the right ``ast.FunctionDef`` overload from kwargs when ``decorator_list=[]`` was an unannotated empty literal (it inferred ``list[Never]``). Annotate the intermediate list locals so the call resolves to the documented overload. * Drop ast.FunctionDef call to Any to dodge mypy overload mismatch The typed overloads for ``ast.FunctionDef`` in mypy 1.8.0's typeshed require keyword-only ``type_params`` on Python 3.12+, but that field isn't accepted at runtime on every supported Python (it was added in 3.12, before that the constructor rejects it). Setting the constructor to ``Any`` keeps the runtime behaviour unchanged while sidestepping the typeshed overload resolution that the previous typed-locals approach couldn't satisfy. * Raise on NULL in transpiled value comparisons; restore lt test and exercise None Python raises ``TypeError`` when an operand of ``<``, ``<=``, ``>``, ``>=``, ``==`` or ``!=`` is ``None``, but Spark's three-valued logic silently returns ``NULL`` -- a silent semantic divergence. The transpiler now wraps these comparisons with ``when(left.isNull() | right.isNull(), raise_error(...)).otherwise(...)`` so the rewritten plan fails loudly to match the Python source. Code that guards against NULL upstream (``if x is not None: x > 0``) hits the otherwise branch and is unaffected. Tests: - Unit: ``test_udf_transpile_less_than_zero`` restores the ``Lt`` case as a positive transpile assertion (previously in the unsupported-patterns matrix); ``test_udf_transpile_compare_with_none_raises`` pins the new raise behaviour for unguarded comparisons. - Hypothesis: ``_run`` now treats "both sides raised" as equivalent so the boolean hypothesis tests (``both_positive`` / ``either_positive``) can be re-armed with the full nullable ``_long_strategy`` and a ``_BOOLEAN_PAIR_EDGES`` tuple that explicitly seeds NULL combos. * Address Copilot review comments - ``_lower_eq``: Python's ``None == x`` / ``None != x`` returns a bool, not a TypeError. Stop routing ``ast.Eq`` / ``ast.NotEq`` through the raise-on-NULL guard and instead emit the four ``when`` branches matching Python's equality semantics (both NULL -> True/False, one NULL -> False/True, otherwise -> ``==`` / ``!=``). The ``raise_error`` guard now only fires on ordering comparisons (``<``, ``<=``, ``>``, ``>=``). - ``_is_definitely_non_boolean``: narrow the ``ast.BinOp`` match from a bare catch-all to the arithmetic / shift operators we actually want to flag. Bitwise ``&`` / ``|`` / ``^`` are deliberately left out so e.g. ``not ((x > 0) & (y > 0))`` keeps being recognised as boolean. - ``Column.__nonzero__`` (connect): use ``repr(self._expr)`` rather than the dunder directly, in keeping with the standard API. - Hoist ``_BOOLEAN_PAIR_EDGES`` to module-level alongside ``_LONG_PAIR_EDGES`` so the decorator references a stable module attribute rather than a class-body local. --------- Co-authored-by: Claude <noreply@anthropic.com>
…improve warnings. Done with Claude
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…l in Scala for parent is udf and reduce scope remove some intermedite vals for debugging we don't need anymore, and use MDC logging.
|
CC @devin-petersohn would love your input and review, I know it's still in draft phase though. I'm going to try and finish it up over the coming week |
|
I have a question - how correct and maintainable we want this to be? Python Also, do we expect the users to "try it out" and see if it kind of works, or we aim to be "always correct or raising an error"? There's a huge difference. For example: case ast.Compare():
# All comparison operators produce bool.
return TrueNope, case ast.BoolOp():
# and / or of booleans produces bool.
return TrueNope, The non-boolean check for operations are not fully valid either - any user defined type can return a Like I mentioned when this SPIP was firstly discussed - Python is a super dynamic language and having a transpiler that is always correct on where it claims to be correct is not easy. So I want to confirm the purpose of the SPIP - do we want to be always correct, or correct most of the time to cover more transpilable UDFs, with the cost that we could produce wrong result? |
|
For ast module changes I think we're ok with just not transpiling new / different items. For the comparison operators right now we're limiting the UDF transpilation logic to scalar inputs we can scope it down more to also exclude non scalar captures (and expand back out later if we want for arrow / pandas types but handle them explicitly). To be clear this is. WIP PR but was talking with Scott about this in the context of some of his work so I wanted to raise the current draft for folks while |
What changes were proposed in this pull request?
This is the first PR adding initial transpilation from Python to Catalyst and the framework for others to plug in different transpilers which can target other targets. This replaces #53547 and is the first PR as part of the transpilation SPIP https://docs.google.com/document/d/1cHc6tiR4yO3hppTzrK1F1w9RwyEPMvaeEuL2ub2LURg/edit?tab=t.0#heading=h.iz92aps6qbo5
Why are the changes needed?
Python UDF performance has been a perinial challenge in Spark, and while PyArrow makes the data copy slightly less expensive it's still much slower than native code. Additionally, Python UDF usage has increased substantailly with Pandas on Spark and it is especially hard for folks to write catalyst expressions here.
Does this PR introduce any user-facing change?
New configuration flags are user visible (default to off).
How was this patch tested?
New unit tests that trigger always and hypothesis tests that trigger only selectively.
Was this patch authored or co-authored using generative AI tooling?
Hypothesis teests were generated by Claude 4.7, eq semantics null and truthiness w/claude 4.6, GH code review (unknown backing model) on itiial draft). snake camel swap with sonnet.