
[SPARK-57234][SS][DOCS] Add Real-time Mode documentation page to the Structured Streaming guide#56314

Open
jerrypeng wants to merge 2 commits into apache:master from jerrypeng:real-time-mode-docs

Conversation

@jerrypeng
Contributor

@jerrypeng jerrypeng commented Jun 4, 2026

What changes were proposed in this pull request?

This PR adds a new documentation page for Real-time Mode in Structured Streaming, introduced in Spark 4.1.0 (SPARK-53736): docs/streaming/real-time-mode.md. The page covers:

  • How Real-time Mode works: long-running tasks (one per input partition) that process records continuously, in contrast to the per-micro-batch task scheduling of the default engine.
  • Batch duration is a checkpoint interval, not a latency target.
  • A comparison with the other execution modes.
  • Enabling Real-time Mode: the Trigger.RealTime(...) API (Scala/Java) and the realTime trigger keyword argument (Python), plus the requirements to start (update output mode,
    checkpoint location, minimum batch duration).
  • Supported queries (stateless only), fault tolerance (exactly-once processing semantics; sinks such as Kafka provide at-least-once delivery), configuration, examples
    (Python/Scala/Java), and caveats.

It also registers the new page in the Structured Streaming left navigation (docs/_data/menu-streaming.yaml).

Real-time Mode (stateless) was added in Spark 4.1.0 but has no user-facing documentation in the Structured Streaming programming guide. This PR adds that page. See SPARK-57234.

Does this PR introduce any user-facing change?

No. This is a documentation-only change.

How was this patch tested?

Documentation-only change. The new page was validated for structure (front matter, code tabs, Liquid {% highlight %} tags), internal links and in-page anchors, navigation-menu anchors,
and ASCII-only content. All trigger API signatures, configuration keys and defaults, the supported-operator and sink lists, and error-class references were cross-checked against the Spark
4.1.0 source (Trigger.java, Triggers.scala, RealTimeModeAllowlist.scala, SQLConf.scala, KafkaMicroBatchStream.scala, and error-conditions.json). Reviewers can verify rendering
locally with SKIP_API=1 bundle exec jekyll build from the docs/ directory.

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Code (Opus 4.8).

@jerrypeng jerrypeng force-pushed the real-time-mode-docs branch 5 times, most recently from 2501bf8 to 4df7bd6 Compare June 4, 2026 05:55
Member

@viirya viirya left a comment

Thanks for filling this gap -- I cross-checked the trigger forms, config keys/defaults, error-class names, the allowlist (sinks/operators), and the cross-doc anchors against the 4.1.0 source and they all match. Two things to address and two minor nits.

Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"

The comparison table lists Continuous Processing under Processing Guarantees as Exactly-once. That's not correct -- Continuous Processing provides at-least-once guarantees, which is also what the page this row links to states: performance-tips.md describes it as "low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees," and notes that switching to continuous mode gives at-least-once. As written, the page contradicts its own link target and will mislead anyone comparing the three modes. Please change the cell to At-least-once.

The page should state that Real-time Mode is experimental

Every Trigger.RealTime(...) overload (Trigger.java) and RealTimeTrigger itself (Triggers.scala) is annotated @Experimental, but the page reads as stable product documentation and recommends "prefer Real-time Mode over Continuous Processing." There's an internal inconsistency too: the page repeatedly calls Continuous Processing experimental but never says the same of Real-time Mode, even though both carry the same annotation. Given the API is @Experimental and the page itself promises stateful support "starting in Spark 4.3," a one-line note up front that the API/feature is experimental and may evolve would set the right expectation and match the source's commitment boundary.

Minor nits

  • The page opens with * Table of contents but omits the {:toc} directive on the next line (the Spark-wide convention, e.g. sql-getting-started.md). Without it the line renders as a literal bullet and no TOC is generated. This mirrors the sibling performance-tips.md, so it's a propagated inconsistency rather than a new one -- but with 10 sections here a real TOC is worth having.
  • Under Supported Queries -> Sources, "the in-memory source used for testing [is] not supported as [a] Real-time source" is imprecise: there is a real-time-capable in-memory source (LowLatencyMemoryStream, implements SupportsRealTimeRead) used by the RTM test suites. It's the standard micro-batch MemoryStream that isn't. Users never touch either, so impact is nil -- consider just narrowing this to the rate source.

@jerrypeng
Contributor Author

@viirya thank you for the review. Addressing your comments inline:

Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"

I want to make a distinction between exactly-once processing guarantees and at-least-once delivery in sinks. An exactly-once processing guarantee means that changes to engine-managed state resulting from processing rows are applied effectively once. At-least-once delivery means output will be written to the external system at least once, i.e. duplicates are possible. I think it is an important distinction to make. Real-time Mode offers exactly-once processing semantics just like the existing engine. The difference is in the sinks it supports. The only sink that supports exactly-once delivery is the delta sink (through idempotent writes). The kafka sink supports at-least-once delivery semantics regardless of whether real-time mode is used or not. I want to call this out in the documentation. In theory you can write an exactly-once sink for RTM; there is just no implementation of it yet.
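To make the delivery half of that distinction concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `AppendSink`, `IdempotentSink`, and `run_with_replay` are hypothetical names invented for illustration, and deduplicating on a batch id is an assumption about how an idempotent sink could work, not a description of the delta or kafka sink implementations. The replay stands in for a restart that happens after a write but before the corresponding offsets are committed.

```python
from dataclasses import dataclass, field


@dataclass
class AppendSink:
    """Hypothetical sink with no deduplication: every write is appended."""
    rows: list = field(default_factory=list)

    def write(self, batch_id: int, records: list) -> None:
        self.rows.extend(records)


@dataclass
class IdempotentSink:
    """Hypothetical sink that skips any batch id it has already committed."""
    rows: list = field(default_factory=list)
    committed: set = field(default_factory=set)

    def write(self, batch_id: int, records: list) -> None:
        if batch_id in self.committed:
            return  # replayed batch: output is already durable, so skip it
        self.rows.extend(records)
        self.committed.add(batch_id)


def run_with_replay(sink, batches: dict, replayed_batch: int) -> list:
    """Write each batch once, then write one batch a second time,
    as a restart would if it failed before committing that batch."""
    for batch_id, records in sorted(batches.items()):
        sink.write(batch_id, records)
    sink.write(replayed_batch, batches[replayed_batch])
    return sink.rows


batches = {0: ["a", "b"], 1: ["c"]}
at_least_once = run_with_replay(AppendSink(), batches, replayed_batch=1)
effectively_once = run_with_replay(IdempotentSink(), batches, replayed_batch=1)
```

In this model the engine-side replay is identical for both sinks; only the sink decides whether the external system ends up with a duplicate of batch 1.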

Regarding continuous mode, it does not support state, so the argument is moot here. Let's do this:

  1. clearly define the terms
  2. clarify what is supported in continuous mode.

The page should state that Real-time Mode is experimental

I think this is a mistake. The real-time mode APIs are stable. Let me create a PR to remove the experimental annotation.

I will address the minor nits as well.

@jerrypeng
Contributor Author

@viirya I created a PR to remove the experimental annotation from the RTM triggers:

#56346

…Structured Streaming guide

Add a new docs/streaming/real-time-mode.md page documenting Real-time Mode
(stateless) for Structured Streaming -- how it works, batch duration as a
checkpoint interval, comparison with other modes, enabling it via
Trigger.RealTime, supported queries, fault tolerance, examples, configuration,
and caveats -- and register the page in the streaming left-navigation.
@jerrypeng jerrypeng force-pushed the real-time-mode-docs branch from 4df7bd6 to d6a7e0f Compare June 5, 2026 18:03
- Add the {:toc} directive after "* Table of contents" so the page generates an
  in-page table of contents (matching the Spark-wide convention).
- Narrow the unsupported-source note to the built-in rate source. The standard
  MemoryStream is unsupported, but the test-only LowLatencyMemoryStream is
  real-time-capable, so the previous "in-memory source" wording was imprecise.
@jerrypeng jerrypeng requested a review from viirya June 5, 2026 18:30
@jerrypeng
Contributor Author

@viirya Thank you for your review. I have addressed your comments. PTAL.

Member

@viirya viirya left a comment

Thanks for the thorough revision -- re-reviewed and all four points are resolved.

  • Guarantees: agreed, your processing-vs-delivery distinction is the right framing and an improvement over my blunt "change the cell" suggestion. The comparison table now correctly shows Continuous Processing as At-least-once, and the rewritten Fault Tolerance section cleanly separates exactly-once processing from at-least-once delivery. I spot-checked the new claims against the source -- offsets committed at the end of each batch, Kafka sink at-least-once regardless of mode, and no exactly-once sink shipped yet -- and they all hold.
  • Experimental: makes sense to fix this at the source rather than caveat the docs. #56346 removes @Experimental from all five Trigger.RealTime(...) overloads and from RealTimeTrigger, which makes the page's stable tone consistent with the API once it lands.
  • {:toc} added, and the in-memory-source note is narrowed to the rate source.

New prose is ASCII-clean. LGTM pending #56346; the docs read accurately on their own either way.

@viirya
Member

viirya commented Jun 5, 2026

I will merge this after #56346 is merged.
