[SPARK-57234][SS][DOCS] Add Real-time Mode documentation page to the Structured Streaming guide #56314
jerrypeng wants to merge 2 commits into
2501bf8 to 4df7bd6
viirya
left a comment
Thanks for filling this gap -- I cross-checked the trigger forms, config keys/defaults, error-class names, the allowlist (sinks/operators), and the cross-doc anchors against the 4.1.0 source and they all match. Two things to address and two minor nits.
Comparison table: Continuous Processing should be "At-least-once", not "Exactly-once"
The comparison table lists Continuous Processing under Processing Guarantees as Exactly-once. That's not correct -- Continuous Processing provides at-least-once guarantees, which is also what the page this row links to states: performance-tips.md describes it as "low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees," and notes that switching to continuous mode gives at-least-once. As written, the page contradicts its own link target and will mislead anyone comparing the three modes. Please change the cell to At-least-once.
The page should state that Real-time Mode is experimental
Every Trigger.RealTime(...) overload (Trigger.java) and RealTimeTrigger itself (Triggers.scala) is annotated @Experimental, but the page reads as stable product documentation and recommends "prefer Real-time Mode over Continuous Processing." There's an internal inconsistency too: the page repeatedly calls Continuous Processing experimental but never says the same of Real-time Mode, even though both carry the same annotation. Given the API is @Experimental and the page itself promises stateful support "starting in Spark 4.3," a one-line note up front that the API/feature is experimental and may evolve would set the right expectation and match the source's commitment boundary.
Minor nits
- The page opens with `* Table of contents` but omits the `{:toc}` directive on the next line (the Spark-wide convention, e.g. `sql-getting-started.md`). Without it the line renders as a literal bullet and no TOC is generated. This mirrors the sibling `performance-tips.md`, so it's a propagated inconsistency rather than a new one -- but with 10 sections here a real TOC is worth having.
- Under Supported Queries -> Sources, "the in-memory source used for testing [is] not supported as [a] Real-time source" is imprecise: there is a real-time-capable in-memory source (`LowLatencyMemoryStream`, implements `SupportsRealTimeRead`) used by the RTM test suites. It's the standard micro-batch `MemoryStream` that isn't. Users never touch either, so impact is nil -- consider just narrowing this to the `rate` source.
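The first nit above could be caught mechanically. The following is a minimal sketch, assuming the only requirement is that a `* Table of contents` line is immediately followed by a `{:toc}` line; the function name `has_toc_directive` and the sample page strings are hypothetical and are not part of the Spark docs tooling.

```python
def has_toc_directive(markdown: str) -> bool:
    """Return True if a '* Table of contents' line is immediately followed by '{:toc}'."""
    lines = [line.strip() for line in markdown.splitlines()]
    # Walk over adjacent line pairs and look for the bullet plus its directive.
    for current, following in zip(lines, lines[1:]):
        if current.lower() == "* table of contents" and following == "{:toc}":
            return True
    return False


# Hypothetical page bodies used only for illustration.
with_toc = "---\ntitle: Example\n---\n\n* Table of contents\n{:toc}\n\n# Section\n"
without_toc = "---\ntitle: Example\n---\n\n* Table of contents\n\n# Section\n"

print(has_toc_directive(with_toc))     # True
print(has_toc_directive(without_toc))  # False
```

A check like this only inspects the text; whether the TOC actually renders still has to be confirmed with a local docs build.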
@viirya thank you for the review. Addressing your comments inline
I want to make a distinction between exactly-once processing guarantees and at-least-once delivery in sinks. Exactly-once processing means that changes to engine-managed state resulting from processing rows are applied effectively once. At-least-once delivery means output will be written to the external system at least once, i.e. duplicates are possible. I think it is an important distinction to make. Real-time Mode offers exactly-once processing semantics just like the existing engine. The difference is in the sinks it supports. The only sink that supports exactly-once delivery is the Delta sink (through idempotent writes). The Kafka sink supports at-least-once delivery semantics regardless of whether Real-time Mode is used or not. This is an important distinction and I want to call this out in the documentation. In theory you can write an exactly-once sink for RTM; there is just no implementation of it yet. As for continuous mode, it does not support state, so the argument is moot there. Let's do this:
I think this is a mistake. The Real-time Mode APIs are stable. Let me create a PR to remove the experimental annotation. I will address the minor nits as well.
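To make the processing-versus-delivery distinction above concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not Spark's implementation: the `Checkpoint` class, the `process` function, and both in-memory sinks are hypothetical names invented for illustration. It assumes offsets and state are committed together at the end of a batch, and that a crash can happen after output is written but before that commit.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (offset, value)


class Checkpoint:
    """Durable snapshot: last committed offset plus the engine-managed state."""

    def __init__(self) -> None:
        self.committed_offset = -1
        self.row_count = 0


def process(records: List[Record], checkpoint: Checkpoint,
            append_sink: List[str], idempotent_sink: Dict[int, str],
            crash_before_commit: bool) -> None:
    # Always restart from the last durable snapshot, never from in-memory state.
    row_count = checkpoint.row_count
    pending = [r for r in records if r[0] > checkpoint.committed_offset]
    for offset, value in pending:
        row_count += 1
        append_sink.append(value)        # plain append: a replay writes duplicates
        idempotent_sink[offset] = value  # keyed write: a replay overwrites the same key
    if crash_before_commit:
        return  # output was written, but neither offsets nor state were committed
    if pending:
        checkpoint.committed_offset = pending[-1][0]
    checkpoint.row_count = row_count


records = [(0, "a"), (1, "b"), (2, "c")]
checkpoint = Checkpoint()
append_sink: List[str] = []
idempotent_sink: Dict[int, str] = {}

# First attempt crashes before the commit; the second attempt replays the batch.
process(records, checkpoint, append_sink, idempotent_sink, crash_before_commit=True)
process(records, checkpoint, append_sink, idempotent_sink, crash_before_commit=False)

print(checkpoint.row_count)             # 3
print(append_sink)                      # ['a', 'b', 'c', 'a', 'b', 'c']
print(sorted(idempotent_sink.items()))  # [(0, 'a'), (1, 'b'), (2, 'c')]
```

In this model the committed state reflects each row once even though the batch ran twice, the append-only sink ends up with duplicates (at-least-once delivery), and the keyed sink does not, which is the role idempotent writes play in the comment above.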
…Structured Streaming guide Add a new docs/streaming/real-time-mode.md page documenting Real-time Mode (stateless) for Structured Streaming -- how it works, batch duration as a checkpoint interval, comparison with other modes, enabling it via Trigger.RealTime, supported queries, fault tolerance, examples, configuration, and caveats -- and register the page in the streaming left-navigation.
4df7bd6 to d6a7e0f
- Add the {:toc} directive after "* Table of contents" so the page generates an
in-page table of contents (matching the Spark-wide convention).
- Narrow the unsupported-source note to the built-in rate source. The standard
MemoryStream is unsupported, but the test-only LowLatencyMemoryStream is
real-time-capable, so the previous "in-memory source" wording was imprecise.
@viirya Thank you for your review. I have addressed your comments. PTAL.
viirya
left a comment
Thanks for the thorough revision -- re-reviewed and all four points are resolved.
- Guarantees: agreed, your processing-vs-delivery distinction is the right framing and an improvement over my blunt "change the cell" suggestion. The comparison table now correctly shows Continuous Processing as `At-least-once`, and the rewritten Fault Tolerance section cleanly separates exactly-once processing from at-least-once delivery. I spot-checked the new claims against the source -- offsets committed at the end of each batch, Kafka sink at-least-once regardless of mode, and no exactly-once sink shipped yet -- and they all hold.
- Experimental: makes sense to fix this at the source rather than caveat the docs. #56346 removes `@Experimental` from all five `Trigger.RealTime(...)` overloads and from `RealTimeTrigger`, which makes the page's stable tone consistent with the API once it lands.
- `{:toc}` added, and the in-memory-source note is narrowed to the `rate` source.
New prose is ASCII-clean. LGTM pending #56346; the docs read accurately on their own either way.
I will merge this after #56346 is merged.
What changes were proposed in this pull request?
This PR adds a new documentation page for Real-time Mode in Structured Streaming, introduced in Spark 4.1.0 (SPARK-53736): `docs/streaming/real-time-mode.md`. The page covers: `Trigger.RealTime(...)` API (Scala/Java) and the `realTime` trigger keyword argument (Python), plus the requirements to start (update output mode, checkpoint location, minimum batch duration).
(Python/Scala/Java), and caveats.
It also registers the new page in the Structured Streaming left navigation (`docs/_data/menu-streaming.yaml`).
Why are the changes needed?
Real-time Mode (stateless) was added in Spark 4.1.0 but has no user-facing documentation in the Structured Streaming programming guide. This PR adds that page. See SPARK-57234.
Does this PR introduce any user-facing change?
No. This is a documentation-only change.
How was this patch tested?
Documentation-only change. The new page was validated for structure (front matter, code tabs, Liquid `{% highlight %}` tags), internal links and in-page anchors, navigation-menu anchors, and ASCII-only content. All trigger API signatures, configuration keys and defaults, the supported-operator and sink lists, and error-class references were cross-checked against the Spark 4.1.0 source (`Trigger.java`, `Triggers.scala`, `RealTimeModeAllowlist.scala`, `SQLConf.scala`, `KafkaMicroBatchStream.scala`, and `error-conditions.json`). Reviewers can verify rendering locally with `SKIP_API=1 bundle exec jekyll build` from the `docs/` directory.
Was this patch authored or co-authored using generative AI tooling?
co-authored with Claude Code (Opus 4.8).
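The ASCII-only validation mentioned under "How was this patch tested?" is not shown in the PR. One way such a check could be done is sketched below; the function name `find_non_ascii` and the sample text are hypothetical, and this is not the script the author actually ran.

```python
from typing import List, Tuple


def find_non_ascii(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for every non-ASCII character, 1-indexed."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside the 7-bit ASCII range
                hits.append((line_no, col_no, ch))
    return hits


# Hypothetical page content: the second line contains U+2248 (almost equal to).
sample = "Real-time Mode -- low latency\nlatency \u2248 1 ms\n"
print(find_non_ascii(sample))
```

Reporting the line and column of each hit, rather than a single pass/fail result, makes it easier to replace the offending characters by hand.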