
feat: Add scan transform to unnest arrays for realtime ingestion #19379

Open
abhishekrb19 wants to merge 6 commits into apache:master from abhishekrb19:scan_transform_realtime

abhishekrb19 (Contributor) commented Apr 27, 2026

This PR adds a scan transform to Druid's transformSpec. It can unnest array-valued columns into individual rows during streaming ingestion, and can potentially perform other operations that scan query semantics allow. This is similar to the UNNEST support in SQL with MSQ batch ingestion, but for realtime ingestion.

  • New ScanTransform (type: "scan"): a multi-row transform that wraps an embedded scan query and reuses the existing ScanQueryEngine, UnnestDataSource and UnnestCursorFactory machinery to explode arrays at ingest time. Each input row is wrapped in a temporary single-row segment, the configured scan query runs against it, and the resulting rows are emitted downstream.

  • The scan query's data source uses "input" as the base table, with an unnest data source to specify which column to explode and an optional unnestFilter to filter array elements.

  • Supports string arrays, object arrays, and nested arrays. Objects are preserved through unnest.
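
The row-level effect described in the bullets above can be illustrated with a minimal plain-Java sketch. It does not use any Druid classes: `UnnestSketch`, `unnest`, and the parameter names are hypothetical, and the handling of non-array values is an assumption of the sketch rather than the behavior of this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: a plain-Java model of unnest semantics, not Druid's ScanTransform.
final class UnnestSketch
{
  private UnnestSketch() {}

  /**
   * Expands one input row into one output row per element of the list stored under
   * arrayColumn, keeping only elements accepted by elementFilter. Each output row is
   * a copy of the input row with the element added under outputName.
   */
  static List<Map<String, Object>> unnest(
      Map<String, Object> row,
      String arrayColumn,
      String outputName,
      Predicate<Object> elementFilter
  )
  {
    final List<Map<String, Object>> out = new ArrayList<>();
    final Object value = row.get(arrayColumn);
    if (!(value instanceof List)) {
      // Assumption of this sketch: missing or non-list values produce no rows.
      return out;
    }
    for (Object element : (List<?>) value) {
      if (elementFilter.test(element)) {
        final Map<String, Object> copy = new LinkedHashMap<>(row);
        copy.put(outputName, element);
        out.add(copy);
      }
    }
    return out;
  }
}
```

In this model, a row with a three-element `tags` list and a filter that rejects one element yields two output rows, each carrying the original columns plus the unnested element under the output name.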

This simplifies queries by avoiding UNNEST at query time and improves query performance since unnesting is done once at ingest time rather than repeatedly at query time.

Release note

Scan transform for ingestion-time array unnesting. Added a new "scan" transform type in transformSpec that unnests array-valued columns into individual rows during streaming ingestion (similar to existing UNNEST functionality with Druid SQL and the MSQ engine). The scan transform wraps a scan query with an unnest data source to explode arrays at ingest time.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@JsonProperty("unnestFilter") @Nullable final DimFilter unnestFilter
)
{
this.name = name;

[P2] Validate scan transform output name

ScanTransform stores name separately from unnestColumn.getOutputName(), but the transform framework treats Transform.getName() as the generated field name for collision checks and input-column pruning. If a spec sets name: "tag" but the virtual column outputs "elt", Druid accepts it, writes elt, and silently leaves references to tag unresolved while also missing collisions on elt. Reject mismatches or derive the transform name from the virtual column output.
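
A minimal sketch of the "reject mismatches" option, assuming the check runs wherever the transform is constructed. `ScanTransformNameCheck` and `validateName` are hypothetical names for illustration, not code from this PR, and the sketch throws a plain `IllegalArgumentException` rather than any Druid-specific exception type.

```java
// Hypothetical helper for illustration; not part of the PR.
final class ScanTransformNameCheck
{
  private ScanTransformNameCheck() {}

  /**
   * Returns transformName if it equals the virtual column's output name,
   * otherwise throws so the mismatched spec is rejected up front.
   */
  static String validateName(String transformName, String virtualColumnOutputName)
  {
    if (transformName == null || transformName.isEmpty()) {
      throw new IllegalArgumentException("transform name must be non-empty");
    }
    if (!transformName.equals(virtualColumnOutputName)) {
      throw new IllegalArgumentException(
          String.format(
              "transform name [%s] must match unnest output name [%s]",
              transformName,
              virtualColumnOutputName
          )
      );
    }
    return transformName;
  }
}
```

With a check like this, a spec with name: "tag" and a virtual column that outputs "elt" would fail at construction instead of silently writing elt.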


3 participants