[SPARK-56375][PANDAS/PS] Implement DataFrame.set_axis and Series.set_axis#56339

Open
HenryBui777 wants to merge 2 commits into apache:master from HenryBui777:SPARK-56375-set-axis

@HenryBui777 HenryBui777 commented Jun 5, 2026

What changes were proposed in this pull request?

This PR implements DataFrame.set_axis() and Series.set_axis() for the Pandas API on Spark, which were previously unsupported (listed in missing/frame.py and missing/series.py).

Changes:

  • python/pyspark/pandas/frame.py: Added DataFrame.set_axis(labels, axis=0) method.
    • axis=0 or axis='index': Reassigns the row index using the provided labels (delegates to set_index with a pd.Index)
    • axis=1 or axis='columns': Reassigns column labels using rename(columns=...)
    • Raises ValueError if the number of labels doesn't match the axis length
  • python/pyspark/pandas/series.py: Added Series.set_axis(labels, axis=0) method.
    • Only axis=0 is supported for Series (consistent with pandas)
    • Raises ValueError if labels length doesn't match the Series length
    • Raises ValueError if axis != 0
  • python/pyspark/pandas/missing/frame.py: Removed set_axis from _unsupported_function list
  • python/pyspark/pandas/missing/series.py: Removed set_axis from _unsupported_function list

Supported behavior (matching pandas):

import pyspark.pandas as ps

# DataFrame - change row index
df = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.set_axis(['x', 'y', 'z'])           # axis=0 (default)
df.set_axis(['x', 'y', 'z'], axis=0)
df.set_axis(['x', 'y', 'z'], axis='index')

# DataFrame - change column labels
df.set_axis(['I', 'II'], axis=1)
df.set_axis(['I', 'II'], axis='columns')

# Series
s = ps.Series([1, 2, 3])
s.set_axis(['a', 'b', 'c'])
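
The axis and length rules listed under Changes can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `normalize_axis` and `check_labels` are hypothetical helper names, the error messages are made up, and the row and column counts are passed in as plain integers because how the PR obtains them is not shown here.

```python
from typing import Sequence, Union

Axis = Union[int, str]


def normalize_axis(axis: Axis) -> int:
    """Map 0/'index' to 0 and 1/'columns' to 1 (hypothetical helper)."""
    if axis in (0, "index"):
        return 0
    if axis in (1, "columns"):
        return 1
    raise ValueError(f"No axis named {axis!r}")


def check_labels(
    labels: Sequence,
    axis: Axis,
    n_rows: int,
    n_cols: int,
    is_series: bool = False,
) -> int:
    """Validate labels against the target axis and return the axis number.

    Hypothetical helper: raises ValueError for an unsupported axis or a
    label count that does not match the axis length.
    """
    axis_no = normalize_axis(axis)
    if is_series and axis_no != 0:
        # A Series has a single axis, so only axis=0 is accepted.
        raise ValueError("Series.set_axis only supports axis=0")
    expected = n_rows if axis_no == 0 else n_cols
    if len(labels) != expected:
        raise ValueError(
            f"Length mismatch: expected {expected} labels, got {len(labels)}"
        )
    return axis_no


# A 3-row, 2-column frame accepts three row labels or two column labels.
row_axis = check_labels(["x", "y", "z"], "index", n_rows=3, n_cols=2)
col_axis = check_labels(["I", "II"], "columns", n_rows=3, n_cols=2)
```

Keeping the validation separate from the relabeling step means the same checks could serve both the DataFrame and the Series method, with `is_series` switching on the axis restriction.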

Why are the changes needed?

DataFrame.set_axis() and Series.set_axis() are standard pandas APIs for reassigning index or column labels. Neither was supported in the Pandas API on Spark: calling them raised PandasNotImplementedError, even though the underlying functionality (index and column renaming) is already supported. This PR enables compatibility with pandas code that uses set_axis.

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_axis.html

Does this PR introduce any user-facing change?

Yes. DataFrame.set_axis() and Series.set_axis() now work in the Pandas API on Spark instead of raising PandasNotImplementedError.

How was this patch tested?

Added new unit test files:

  • python/pyspark/pandas/tests/frame/test_set_axis.py - covers:
    • test_set_axis_index: axis=0 with list, string 'index', and pd.Index labels
    • test_set_axis_columns: axis=1 with list, string 'columns', and pd.Index labels
    • test_set_axis_errors: ValueError on length mismatch and invalid axis
    • test_set_axis_numeric_index: numeric index labels
  • python/pyspark/pandas/tests/series/test_set_axis.py - covers:
    • test_set_axis_index: axis=0 with list, string 'index', and pd.Index labels
    • test_set_axis_errors: ValueError on length mismatch and invalid axis (axis=1)
    • test_set_axis_named: named Series

Was this patch authored or co-authored using generative AI tooling?

No.

Henry Bui added 2 commits June 5, 2026 14:37
…axis

This commit implements DataFrame.set_axis() and Series.set_axis() for
the Pandas API on Spark, matching the behavior of the native pandas API.

- DataFrame.set_axis(labels, axis=0) supports both axis=0 (index)
  and axis=1 (columns).
- Series.set_axis(labels, axis=0) supports only axis=0 (index),
  consistent with pandas.
- Removed set_axis from the missing function lists in missing/frame.py
  and missing/series.py.
- Added unit tests in tests/frame/test_set_axis.py and
  tests/series/test_set_axis.py.

Resolves: https://issues.apache.org/jira/browse/SPARK-56375
… raise TypeError

In the Pandas API on Spark, when performing arithmetic between a float
Series and a decimal.Decimal scalar, the behavior was inconsistent:
- With ANSI mode ON: TypeError is raised (correct)
- With ANSI mode OFF: Operation completes silently (incorrect)

Native pandas always raises TypeError in this case. This commit fixes
FractionalOps to always raise TypeError for decimal-float mixed
arithmetic operations (add, sub), regardless of ANSI mode setting.

Other operations (mul, truediv, floordiv, mod, rmul, rmod) already
checked ANSI mode; they are unchanged as a separate concern.

Resolves: https://issues.apache.org/jira/browse/SPARK-55818
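
The scalar rule behind this commit can be seen in plain Python, where adding or subtracting a float and a decimal.Decimal raises TypeError. The sketch below only illustrates that language-level behavior; it does not exercise FractionalOps, Spark, or ANSI mode, and `raises_type_error` is a hypothetical helper written for this example.

```python
import operator
from decimal import Decimal


def raises_type_error(op, left, right) -> bool:
    """Return True if op(left, right) raises TypeError (hypothetical helper)."""
    try:
        op(left, right)
    except TypeError:
        return True
    return False


value = 1.5            # stands in for one element of a float Series
scalar = Decimal("2")  # the decimal.Decimal operand

# Both operations covered by the commit message (add, sub) are rejected
# for a float/Decimal pair in plain Python.
add_raises = raises_type_error(operator.add, value, scalar)
sub_raises = raises_type_error(operator.sub, value, scalar)
print(add_raises, sub_raises)  # prints: True True
```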
@HenryBui777 HenryBui777 changed the title from "Spark 56375 set axis" to "[SPARK-56375][PANDAS/PS] Implement DataFrame.set_axis and Series.set_axis" on Jun 5, 2026