feat: Add scan transform to unnest arrays for realtime ingestion #19379
Open
abhishekrb19 wants to merge 6 commits into apache:master from
    @JsonProperty("unnestFilter") @Nullable final DimFilter unnestFilter
)
{
  this.name = name;
[P2] Validate scan transform output name
ScanTransform stores name separately from unnestColumn.getOutputName(), but the transform framework treats Transform.getName() as the generated field name for collision checks and input-column pruning. If a spec sets name: "tag" but the virtual column outputs "elt", Druid accepts it, writes elt, and silently leaves references to tag unresolved while also missing collisions on elt. Reject mismatches or derive the transform name from the virtual column output.
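A minimal sketch of the check suggested above, not the PR's actual code. The class `ScanTransformNameCheck` and the method `resolveName` are hypothetical names for illustration; the only assumption taken from the comment is that the transform has a `name` and that the virtual column exposes an output name. The sketch derives the name when it is omitted and rejects a mismatch.

```java
import java.util.Objects;

// Hypothetical stand-in for the validation; this is not Druid's ScanTransform class.
final class ScanTransformNameCheck
{
  /**
   * Returns the field name the transform should report.
   * If name is null, it is derived from the unnest output name.
   * If name is set but differs from the unnest output name, the spec is rejected.
   */
  static String resolveName(String name, String unnestOutputName)
  {
    Objects.requireNonNull(unnestOutputName, "unnestOutputName");
    if (name == null) {
      // Derive the transform name from the virtual column output.
      return unnestOutputName;
    }
    if (!name.equals(unnestOutputName)) {
      // Reject mismatches so collision checks and column pruning see the real field.
      throw new IllegalArgumentException(
          String.format(
              "Scan transform name[%s] must match unnest output name[%s]",
              name,
              unnestOutputName
          )
      );
    }
    return name;
  }

  public static void main(String[] args)
  {
    System.out.println(resolveName("elt", "elt"));
    System.out.println(resolveName(null, "elt"));
    try {
      resolveName("tag", "elt");
    }
    catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With a check along these lines in the constructor, a spec with name: "tag" and a virtual column that outputs "elt" would fail at parse time instead of silently writing elt.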
This PR adds a scan transform to Druid's transformSpec that can unnest array-valued columns into individual rows during streaming ingestion. Because the transform embeds a scan query, it may also support other operations that scan query semantics allow. This is similar to the UNNEST support in SQL with MSQ batch ingestion, but for realtime ingestion.
New ScanTransform (type: "scan"): a multi-row transform that wraps an embedded scan query and reuses the existing ScanQueryEngine, UnnestDataSource and UnnestCursorFactory machinery to explode arrays at ingest time. Each input row is wrapped in a temporary single-row segment, the configured scan query runs against it, and the resulting rows are emitted downstream.
The scan query's data source uses "input" as the base table, with an unnest data source to specify which column to explode and an optional unnestFilter to filter array elements.
Supports string arrays, object arrays, and nested arrays; objects are preserved through unnest.
This simplifies queries by avoiding UNNEST at query time and improves query performance since unnesting is done once at ingest time rather than repeatedly at query time.
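To illustrate the row-level effect described above, here is a simplified, self-contained sketch in plain Java. It does not use ScanQueryEngine or any Druid class; `UnnestSketch`, `unnest`, and the column names are hypothetical, and returning no rows for a missing or non-array column is an assumption of this sketch, not a statement about Druid's behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration of one-row-in, many-rows-out unnesting.
final class UnnestSketch
{
  /**
   * Emits one output row per element of row[arrayColumn] that passes elementFilter.
   * Each output row copies the input row and adds the element under outputName.
   * A null elementFilter keeps every element.
   */
  static List<Map<String, Object>> unnest(
      Map<String, Object> row,
      String arrayColumn,
      String outputName,
      Predicate<Object> elementFilter
  )
  {
    List<Map<String, Object>> out = new ArrayList<>();
    Object value = row.get(arrayColumn);
    if (!(value instanceof List)) {
      // Assumption of this sketch only: missing or non-array input yields no rows.
      return out;
    }
    for (Object element : (List<?>) value) {
      if (elementFilter != null && !elementFilter.test(element)) {
        continue; // plays the role of the optional unnestFilter
      }
      Map<String, Object> copy = new LinkedHashMap<>(row);
      copy.put(outputName, element); // elements, including objects, are kept as-is
      out.add(copy);
    }
    return out;
  }

  public static void main(String[] args)
  {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("user", "alice");
    row.put("tags", List.of("a", "b", "c"));

    // Keep every element except "b".
    for (Map<String, Object> r : unnest(row, "tags", "tag", e -> !"b".equals(e))) {
      System.out.println(r);
    }
  }
}
```

In this sketch a row with three tags and a filter that drops one element produces two output rows, each carrying the original columns plus the unnested value.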
Release note
Scan transform for ingestion-time array unnesting. Added a new "scan" transform type in transformSpec that unnests array-valued columns into individual rows during streaming ingestion (similar to existing UNNEST functionality with Druid SQL and the MSQ engine). The scan transform wraps a scan query with an unnest data source to explode arrays at ingest time.
This PR has: