[SPARK-56373][PYSPARK] Add docstring annotations to classify PySpark APIs for Spark Connect compatibility#55234

Open
garlandz-db wants to merge 3 commits into apache:master from garlandz-db:pyspark-connect-annotations
@garlandz-db
Contributor

What changes were proposed in this pull request?

Adds three RST docstring directives to PySpark modules, classes, and methods to annotate their Spark Connect compatibility:

  • .. classic:: true — the API is only available in Classic Spark (not Spark Connect)
  • .. connect:: true — the API is available in Spark Connect
  • .. connect_migration:: <message> — migration guidance for users transitioning from Classic Spark to Spark Connect

The annotation spec is documented in python/pyspark/__init__.py. Annotations are resolved by inheriting from the nearest annotated ancestor; a child annotation overrides the parent's.
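
The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling this PR targets: the regex, the `parse_annotations` and `resolve` helpers, the sample docstrings, and the choice to treat `classic`/`connect` as one mutually exclusive slot while inheriting `connect_migration` separately are all assumptions made for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for a directive on its own line, e.g. ".. classic:: true".
DIRECTIVE_RE = re.compile(
    r"^\s*\.\.\s+(classic|connect|connect_migration)::\s*(.+?)\s*$",
    re.MULTILINE,
)


def parse_annotations(docstring: Optional[str]) -> Dict[str, str]:
    """Return the directives found in a single docstring, keyed by name."""
    if not docstring:
        return {}
    return {name: value for name, value in DIRECTIVE_RE.findall(docstring)}


def resolve(qualname: str, docstrings: Dict[str, str]) -> Dict[str, str]:
    """Resolve annotations for `qualname` by walking its dotted ancestors.

    The walk goes from the outermost ancestor to the target itself, so an
    annotation nearer to the target overrides one inherited from further up.
    """
    parts = qualname.split(".")
    resolved: Dict[str, str] = {}
    for i in range(1, len(parts) + 1):
        found = parse_annotations(docstrings.get(".".join(parts[:i])))
        # Assumption: classic and connect are mutually exclusive, so a child
        # that sets either one replaces whichever the parent declared.
        if "classic" in found or "connect" in found:
            resolved.pop("classic", None)
            resolved.pop("connect", None)
        resolved.update(found)
    return resolved


# Illustrative docstrings only; the real ones live in the PySpark sources.
DOCSTRINGS = {
    "pyspark.core": "Core APIs.\n\n.. classic:: true\n",
    "pyspark.core.rdd.RDD.map": (
        "Apply a function.\n\n.. connect_migration:: Use DataFrame APIs instead.\n"
    ),
    "pyspark.ml.connect": "Connect ML.\n\n.. connect:: true\n",
}

print(resolve("pyspark.core.rdd.RDD.map", DOCSTRINGS))
print(resolve("pyspark.ml.connect.feature", DOCSTRINGS))
```

Under these assumptions, `RDD.map` inherits `classic` from `pyspark.core` and adds its own migration message, while anything under `pyspark.ml.connect` resolves to `connect`.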

Annotated modules/classes in this PR:

  • pyspark.core and its submodules (RDD, SparkContext, etc.) — Classic only, with per-method migration guidance
  • pyspark.mllib — Classic only, migrate to pyspark.ml.connect
  • pyspark.ml.clustering (LDA family) — Classic only
  • pyspark.ml.deepspeed, pyspark.ml.torch — Classic only
  • pyspark.ml.wrapper — Classic only
  • pyspark.ml.connect — Connect
  • pyspark.sql.classic, pyspark.sql.connect — Classic / Connect respectively
  • pyspark.sql.context — Classic only (use SparkSession instead)
  • pyspark.sql.dataframe.DataFrame.rdd — migration guidance
  • pyspark.sql.readwriter.DataFrameReader.json — migration guidance for RDD arg
  • pyspark.sql.udf.UDFRegistration.registerJavaFunction — migration guidance
  • pyspark.sql.metrics, pyspark.errors.exceptions.connect — Connect
  • Various internal/utility modules — Classic only

Why are the changes needed?

These annotations enable tooling (IDEs, documentation generators, linters) to surface accurate Spark Connect compatibility information to users, helping them understand which APIs are available in Spark Connect and how to migrate.

Does this PR introduce any user-facing change?

No. Docstring-only changes; no functional code modifications.

How was this patch tested?

No tests added — docstring-only changes. Existing test suites continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic), claude-sonnet-4-6

…APIs for Spark Connect compatibility

Adds three RST directives to PySpark modules, classes, and methods to indicate
Spark Connect compatibility status:

- `.. classic:: true` -- API is only available in Classic Spark (not Spark Connect)
- `.. connect:: true` -- API is available in Spark Connect
- `.. connect_migration:: <message>` -- migration guidance for users transitioning to Spark Connect

Annotations are resolved by inheriting from the nearest annotated ancestor; a child
annotation overrides the parent's. No functional code changes -- docstrings only.

The annotation spec is documented in `python/pyspark/__init__.py`.
@garlandz-db force-pushed the pyspark-connect-annotations branch from 31cd9c7 to 242e1ef on April 7, 2026 14:47
- Remove annotation spec documentation from pyspark/__init__.py module
  docstring to prevent it from being parsed as a real directive by tooling
- Fix setLogLevel migration: spark.log.level() does not exist; use
  spark.conf.set("spark.log.level", level) instead
- Fix setJobGroup migration: wrong config key and mechanism; use the
  tag API (spark.addTag/interruptTag) instead
- Fix defaultParallelism migration: remove fabricated "200" default
- Improve setLocalProperty migration: note that spark.conf.set is
  session-global, not thread-local
- Improve readwriter.json migration: handle JSON string input correctly
- Improve udf.registerJavaFunction: cover non-notebook use cases
- Improve mllib migration: list supported/unsupported modules explicitly
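
For illustration, the `setLogLevel` and `setJobGroup` replacements named in the list above might be wrapped as follows. This is a hedged sketch: `set_log_level` and `run_with_tag` are hypothetical helper names, `spark` is assumed to be a `SparkSession` from a PySpark version that provides `addTag`, `removeTag`, and `interruptTag`, and the `spark.log.level` key is taken from the commit message rather than verified here.

```python
from typing import Any, Callable


def set_log_level(spark: Any, level: str) -> None:
    """Stand-in for SparkContext.setLogLevel, using the config key from the commit message."""
    spark.conf.set("spark.log.level", level)


def run_with_tag(spark: Any, tag: str, action: Callable[[], Any]) -> Any:
    """Stand-in for setJobGroup: tag the work so spark.interruptTag(tag) can cancel it."""
    spark.addTag(tag)
    try:
        return action()
    finally:
        # Remove the tag so later operations on this session are not tagged.
        spark.removeTag(tag)
```

A caller might run `run_with_tag(spark, "nightly-etl", lambda: df.count())` and cancel it from another thread with `spark.interruptTag("nightly-etl")`; `df` and the tag name are placeholders.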
…rk/__init__.py

The documentation section explaining the directive syntax was removed in the
previous commit to fix a parser false-match issue. The fix has been moved to
the tooling layer (strip RST inline code spans before regex matching), so the
documentation can be restored here.
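
A rough sketch of the tooling-side fix described above, assuming a deliberately loose directive regex: the actual tooling is not part of this PR, so `DIRECTIVE_RE`, `INLINE_LITERAL_RE`, `find_directives`, and the sample docstring are all hypothetical. RST inline literals are delimited by double backticks, and removing them before matching keeps documentation that merely mentions a directive from being read as one.

```python
import re
from typing import List, Tuple

# Hypothetical, loose directive pattern that is not anchored to the line start.
DIRECTIVE_RE = re.compile(
    r"\.\.\s+(classic|connect|connect_migration)::\s*([^\n`]+)"
)
# RST inline literal: text wrapped in double backticks on a single line.
INLINE_LITERAL_RE = re.compile(r"``.+?``")


def find_directives(docstring: str, strip_literals: bool = True) -> List[Tuple[str, str]]:
    """Return (name, value) pairs, optionally ignoring inline code spans."""
    text = INLINE_LITERAL_RE.sub("", docstring) if strip_literals else docstring
    return [(name, value.strip()) for name, value in DIRECTIVE_RE.findall(text)]


# Illustrative docstring that documents one directive and actually uses another.
SPEC_DOC = """Annotation spec.

Use ``.. classic:: true`` to mark an API as Classic-only.

.. connect:: true
"""

print(find_directives(SPEC_DOC, strip_literals=False))  # includes the false match
print(find_directives(SPEC_DOC))                        # only the real directive
```

Stripping the spans up front keeps the matching regex simple; anchoring the directive pattern to the start of a line would be another way to avoid the false match.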