[SPARK-57234][SS][DOCS] Add Real-time Mode documentation page to the Structured Streaming guide by jerrypeng · Pull Request #56314 · apache/spark

jerrypeng · 2026-06-04T01:16:43Z

What changes were proposed in this pull request?

This PR adds a new documentation page for Real-time Mode in Structured Streaming, introduced in Spark 4.1.0 (SPARK-53736): docs/streaming/real-time-mode.md. The page covers:

How Real-time Mode works: long-running tasks (one per input partition) that process records continuously, in contrast to the per-micro-batch task scheduling of the default engine.
Batch duration is a checkpoint interval, not a latency target.
A comparison with the other execution modes.
Enabling Real-time Mode: the Trigger.RealTime(...) API (Scala/Java) and the realTime trigger keyword argument (Python), plus the requirements to start (update output mode, checkpoint location, minimum batch duration).
Supported queries (stateless only), fault tolerance (exactly-once processing semantics; sinks such as Kafka provide at-least-once delivery), configuration, examples (Python/Scala/Java), and caveats.
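
For readers who want a concrete picture of the "enabling" item above, here is a minimal PySpark sketch, not taken from the new page itself. It assumes PySpark 4.1+ with the Kafka connector on the classpath. The function name, its parameters, and the "5 minutes" duration string are illustrative assumptions; only the realTime keyword, the update output mode, and the checkpoint location requirement come from the PR description.

```python
def start_real_time_query(spark, bootstrap_servers, in_topic, out_topic, checkpoint_dir):
    """Start a stateless Kafka-to-Kafka query in Real-time Mode.

    `spark` is an existing SparkSession. The function and argument names
    are hypothetical and exist only for this illustration.
    """
    source = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", in_topic)
        .load()
    )

    # Stateless projection only: the PR describes the supported queries
    # as stateless.
    events = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )

    return (
        events.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", out_topic)
        # A checkpoint location is listed as a requirement to start.
        .option("checkpointLocation", checkpoint_dir)
        # Update output mode is listed as a requirement to start.
        .outputMode("update")
        # Assumed duration string. Per the PR, the batch duration is a
        # checkpoint interval, not a latency target.
        .trigger(realTime="5 minutes")
        .start()
    )
```

Passing the session in as an argument keeps the sketch free of module-level Spark imports; the returned object would be the running streaming query.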

It also registers the new page in the Structured Streaming left navigation (docs/_data/menu-streaming.yaml).

Real-time Mode (stateless) was added in Spark 4.1.0 but has no user-facing documentation in the Structured Streaming programming guide. This PR adds that page. See SPARK-57234.

Does this PR introduce any user-facing change?

No. This is a documentation-only change.

How was this patch tested?

Documentation-only change. The new page was validated for structure (front matter, code tabs, Liquid {% highlight %} tags), internal links and in-page anchors, navigation-menu anchors, and ASCII-only content. All trigger API signatures, configuration keys and defaults, the supported-operator and sink lists, and error-class references were cross-checked against the Spark 4.1.0 source (Trigger.java, Triggers.scala, RealTimeModeAllowlist.scala, SQLConf.scala, KafkaMicroBatchStream.scala, and error-conditions.json). Reviewers can verify rendering locally with SKIP_API=1 bundle exec jekyll build from the docs/ directory.

Was this patch authored or co-authored using generative AI tooling?

co-authored with Claude Code (Opus 4.8).

viirya

Thanks for filling this gap -- I cross-checked the trigger forms, config keys/defaults, error-class names, the allowlist (sinks/operators), and the cross-doc anchors against the 4.1.0 source and they all match. Two things to address and two minor nits.

Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"

The comparison table lists Continuous Processing under Processing Guarantees as Exactly-once. That's not correct -- Continuous Processing provides at-least-once guarantees, which is also what the page this row links to states: performance-tips.md describes it as "low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees," and notes that switching to continuous mode gives at-least-once. As written, the page contradicts its own link target and will mislead anyone comparing the three modes. Please change the cell to At-least-once.

The page should state that Real-time Mode is experimental

Every Trigger.RealTime(...) overload (Trigger.java) and RealTimeTrigger itself (Triggers.scala) is annotated @Experimental, but the page reads as stable product documentation and recommends "prefer Real-time Mode over Continuous Processing." There's an internal inconsistency too: the page repeatedly calls Continuous Processing experimental but never says the same of Real-time Mode, even though both carry the same annotation. Given the API is @Experimental and the page itself promises stateful support "starting in Spark 4.3," a one-line note up front that the API/feature is experimental and may evolve would set the right expectation and match the source's commitment boundary.

Minor nits

The page opens with * Table of contents but omits the {:toc} directive on the next line (the Spark-wide convention, e.g. sql-getting-started.md). Without it the line renders as a literal bullet and no TOC is generated. This mirrors the sibling performance-tips.md, so it's a propagated inconsistency rather than a new one -- but with 10 sections here a real TOC is worth having.
Under Supported Queries -> Sources, "the in-memory source used for testing [is] not supported as [a] Real-time source" is imprecise: there is a real-time-capable in-memory source (LowLatencyMemoryStream, implements SupportsRealTimeRead) used by the RTM test suites. It's the standard micro-batch MemoryStream that isn't. Users never touch either, so impact is nil -- consider just narrowing this to the rate source.

jerrypeng · 2026-06-05T17:08:51Z

@viirya Thank you for the review. Addressing your comments inline:

Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"

I want to distinguish between exactly-once processing guarantees and at-least-once delivery in sinks. Exactly-once processing means that changes to engine-managed state resulting from processing rows are applied effectively once. At-least-once delivery means output is written to the external system at least once, i.e. duplicates are possible. I think it is an important distinction to make. Real-time Mode offers exactly-once processing semantics, just like the existing engine. The difference is in the sinks it supports. The only sink that supports exactly-once delivery is the Delta sink (through idempotent writes). The Kafka sink supports at-least-once delivery semantics regardless of whether Real-time Mode is used. I want to call this out in the documentation. In theory you can write an exactly-once sink for RTM; there is just no implementation of it yet.
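
To make the processing-versus-delivery distinction concrete, here is a toy Python model. It is not Spark's implementation. The class names, the batch-id deduplication scheme, and the replay function are all assumptions made for illustration. It simulates a crash after a batch was written to the sink but before its offset was committed, so the batch is written again on restart.

```python
from dataclasses import dataclass, field


@dataclass
class AppendSink:
    """Toy at-least-once sink: every write is appended, so a replay duplicates rows."""
    rows: list = field(default_factory=list)

    def write(self, batch_id, records):
        self.rows.extend(records)


@dataclass
class IdempotentSink:
    """Toy idempotent sink: a batch id that was already written is skipped."""
    rows: list = field(default_factory=list)
    written: set = field(default_factory=set)

    def write(self, batch_id, records):
        if batch_id in self.written:
            return  # replay of an already-written batch is a no-op
        self.rows.extend(records)
        self.written.add(batch_id)


def run_with_one_replay(sink, batches, replay_batch):
    """Write batches in order, replaying `replay_batch` once before its commit.

    Returns the list of committed batch ids, standing in for the engine's
    own progress log.
    """
    committed = []
    for batch_id, records in enumerate(batches):
        sink.write(batch_id, records)
        if batch_id == replay_batch:
            # Simulated crash after the write but before the commit:
            # on restart the same batch is written again.
            sink.write(batch_id, records)
        committed.append(batch_id)  # progress is recorded once per batch
    return committed


batches = [["a", "b"], ["c"]]
append_sink, idempotent_sink = AppendSink(), IdempotentSink()
run_with_one_replay(append_sink, batches, replay_batch=1)
run_with_one_replay(idempotent_sink, batches, replay_batch=1)
print(append_sink.rows)      # ['a', 'b', 'c', 'c']
print(idempotent_sink.rows)  # ['a', 'b', 'c']
```

In both runs the committed progress is identical; only the sink decides whether the replay shows up as a duplicate, which is the sense in which the delivery guarantee belongs to the sink rather than to the execution mode.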

Regarding continuous mode, it does not support state, so the argument is moot there. Let's do this:

clearly define the terms
clarify what is supported in continuous mode.

The page should state that Real-time Mode is experimental

I think the @Experimental annotation is a mistake. The Real-time Mode APIs are stable. Let me create a PR to remove the experimental annotation.

I will address the minor nits as well.

jerrypeng · 2026-06-05T17:55:45Z

@viirya I created the PR to remove the experimental annotation from the RTM triggers:

#56346

…Structured Streaming guide

Add a new docs/streaming/real-time-mode.md page documenting Real-time Mode (stateless) for Structured Streaming -- how it works, batch duration as a checkpoint interval, comparison with other modes, enabling it via Trigger.RealTime, supported queries, fault tolerance, examples, configuration, and caveats -- and register the page in the streaming left-navigation.

- Add the {:toc} directive after "* Table of contents" so the page generates an in-page table of contents (matching the Spark-wide convention).
- Narrow the unsupported-source note to the built-in rate source. The standard MemoryStream is unsupported, but the test-only LowLatencyMemoryStream is real-time-capable, so the previous "in-memory source" wording was imprecise.

jerrypeng · 2026-06-05T18:30:22Z

@viirya Thank you for your review. I have addressed your comments. PTAL.

viirya

Thanks for the thorough revision -- re-reviewed and all four points are resolved.

Guarantees: agreed, your processing-vs-delivery distinction is the right framing and an improvement over my blunt "change the cell" suggestion. The comparison table now correctly shows Continuous Processing as At-least-once, and the rewritten Fault Tolerance section cleanly separates exactly-once processing from at-least-once delivery. I spot-checked the new claims against the source -- offsets committed at the end of each batch, Kafka sink at-least-once regardless of mode, and no exactly-once sink shipped yet -- and they all hold.
Experimental: makes sense to fix this at the source rather than caveat the docs. #56346 removes @Experimental from all five Trigger.RealTime(...) overloads and from RealTimeTrigger, which makes the page's stable tone consistent with the API once it lands.
{:toc} added, and the in-memory-source note is narrowed to the rate source.

New prose is ASCII-clean. LGTM pending #56346; the docs read accurately on their own either way.

viirya · 2026-06-05T22:20:05Z

I will merge this after #56346 is merged.

jerrypeng force-pushed the real-time-mode-docs branch 5 times, most recently from 2501bf8 to 4df7bd6 · June 4, 2026 05:55

viirya reviewed Jun 5, 2026

jerrypeng force-pushed the real-time-mode-docs branch from 4df7bd6 to d6a7e0f · June 5, 2026 18:03

jerrypeng requested a review from viirya June 5, 2026 18:30

viirya reviewed Jun 5, 2026

viirya approved these changes Jun 5, 2026

jerrypeng wants to merge 2 commits into apache:master from jerrypeng:real-time-mode-docs
