SQL: struct support #586
| :learning-objective-2: Query nested fields using ROW field-access syntax | ||
| :learning-objective-3: Recognize and resolve cyclic-reference errors | ||
|
|
||
| When a glossterm:topic[]'s schema includes nested Protobuf or Avro message types, you can map those nested structures as SQL `ROW` columns instead of opaque JSON. This makes nested fields queryable by name, includable in projections, and usable in `WHERE`, `GROUP BY`, and `ORDER BY` clauses, without parsing JSON at query time. |
That may be a nitpick, but stating that we are mapping as SQL ROW columns is not entirely true.
In PostgreSQL, a ROW is an anonymous record, in which you cannot explicitly set the sub-field names (they contain some generic f1, f2, f<n>... names that you cannot change).
What we do in the COMPOUND mapping is we actually create a User-Defined type, and set the names of the fields according to the schema.
Not sure if that's something that we want to explicitly state here, or maybe the ROW meaning here is something other than PostgreSQL ROW.
Changed to
When a glossterm:topic[]'s schema includes nested Protobuf, Avro, or JSON message types, you can map those nested structures as user-defined types (UDTs) with named fields, queryable using SQL `ROW` field-access syntax, instead of opaque JSON. This makes nested fields queryable by name, includable in projections, and usable in `WHERE`, `GROUP BY`, and `ORDER BY` clauses, without parsing JSON at query time.
@mattschumpert do you have a preference on whether we explicitly mention user defined types?
No idea. I defer to @pkonrad1229
I don't see any reason why we shouldn't. Any user can check that for themselves, for example with the `pg_typeof` function.
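A minimal sketch of that check follows. It assumes the `orders` table and its `customer` column from the example on the page under review, and it assumes the table can be referenced in `FROM` with the same `catalog=>table` form that `CREATE TABLE` uses. The type name that comes back is not shown, because it depends on how Redpanda SQL names the generated type.

```sql
-- Inspect the type of the nested column (assumed table and column names).
-- With a COMPOUND mapping, this should report a named user-defined type
-- rather than the anonymous `record` type.
SELECT pg_typeof(customer)
FROM default_redpanda_catalog=>orders
LIMIT 1;
```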
| } | ||
| ---- | ||
|
|
||
| Redpanda SQL maps the table with three columns: `order_id` (text), `customer` (a `ROW` with fields `customer_id`, `name`, and `region`), and `amount` (double precision). |
Ditto here: "a `ROW` with fields".
We may also say something along the lines of "a structure/UDT with fields".
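For context, this is roughly what querying such a structure by field name could look like. It is a sketch under assumptions: the parenthesized `(column).field` form is PostgreSQL's composite-type syntax and is assumed to carry over, the `FROM` reference reuses the `catalog=>table` form from `CREATE TABLE`, and the region value is made up.

```sql
-- Field access on the nested `customer` column by declared field name.
SELECT
    order_id,
    (customer).name   AS customer_name,
    (customer).region AS customer_region,
    amount
FROM default_redpanda_catalog=>orders
WHERE (customer).region = 'EMEA'   -- hypothetical region value
ORDER BY amount DESC;
```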
| (1 row) | ||
| ---- | ||
|
|
||
| === Use implicit tuple syntax |
I'm late to the party, as this was not modified in this PR :D I believe it's worth noting that the implicit tuple syntax works only when there are two or more expressions
@pkonrad1229 Hm, that may have just surfaced as something related based on the Claude Code research... is it ok to leave on this page? The explanation is currently under the first section: https://deploy-preview-586--rp-cloud.netlify.app/redpanda-cloud/reference/sql/sql-data-types/row/#syntax
Yeah sure, it's okay to leave it here. I only meant to say that we follow PostgreSQL rules for the ROW constructor, where the implicit syntax works only when there's more than one expression, so:
- `(col)` returns the `col` expression
- `(col1, col2)` returns a ROW/record of those two columns
Postgres mentions this implicit syntax rule directly in their docs.
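The rule can be checked with `pg_typeof`. The results in the comments are PostgreSQL's behavior; per the comment above, Redpanda SQL is expected to follow the same rule, but that is not verified here.

```sql
-- One parenthesized expression is just that expression.
SELECT pg_typeof((1));      -- PostgreSQL: integer

-- Two or more expressions form an implicit ROW.
SELECT pg_typeof((1, 2));   -- PostgreSQL: record

-- A single-field row needs the explicit ROW keyword.
SELECT pg_typeof(ROW(1));   -- PostgreSQL: record
```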
27ea3c1 to
e1f3231
Compare
|
|
||
| * Enable Redpanda SQL on your Redpanda Bring Your Own Cloud (BYOC) cluster. See xref:sql:get-started/deploy-sql-cluster.adoc[Enable Redpanda SQL]. | ||
| * Connect to Redpanda SQL with `psql` or another PostgreSQL client. See xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL]. | ||
| * The topic has a schema registered in glossterm:schema-registry[Schema Registry]. The schema includes one or more nested message types. |
Shouldn't we specify that the schema must be registered for the topic using the TopicNameStrategy naming convention, @pkonrad1229 @kbatuigas? You have to name it correctly for this to work, right? If people are not already familiar with this in SR, we should educate them (point them to this naming convention).
Yes, the correct `schema_subject` is required for this to work. Not deeply familiar with the SR side, but from Oxla's perspective, the underlying naming strategy doesn't matter; only that the resolved subject matches a registered one. Two cases:
- SR uses Confluent's default TopicNameStrategy → `schema_subject` can be omitted; Oxla defaults to `<topic>-value`.
- SR uses a different strategy → `schema_subject` must be set explicitly in the `CREATE TABLE` options.
Side note: this comment is broad, not specific to nested fields. It's general `CREATE TABLE` + Kafka topic behavior. The question is whether we should explain it inline here or just link to the `CREATE TABLE` reference docs, where `schema_subject` semantics belong.
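The two cases described above could be sketched as follows. These are alternatives, so run one or the other. The closing `);` is assumed, because the snippet in the diff is truncated, and `com.example.Order` is a hypothetical subject name used only for illustration.

```sql
-- Case 1: default TopicNameStrategy. schema_subject is omitted and,
-- per the reply above, resolves to 'orders-value'.
CREATE TABLE default_redpanda_catalog=>orders WITH (
    topic = 'orders',
    struct_mapping_policy = 'COMPOUND'
);

-- Case 2: a different naming strategy. The subject must be set explicitly.
-- 'com.example.Order' is a hypothetical subject name.
CREATE TABLE default_redpanda_catalog=>orders WITH (
    topic = 'orders',
    schema_subject = 'com.example.Order',
    struct_mapping_policy = 'COMPOUND'
);
```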
| ---- | ||
| CREATE TABLE default_redpanda_catalog=>orders WITH ( | ||
| topic = 'orders', | ||
| schema_subject = 'orders-value', |
Is schema_subject required or optional?
If it's required then maybe the naming convention is not mandatory @pkonrad1229 ?
From what I see, it's optional. When omitted, Oxla resolves the subject to `<topic>-value`, so in this example it could be left out and the outcome would be the same.
| CREATE TABLE default_redpanda_catalog=>orders WITH ( | ||
| topic = 'orders', | ||
| schema_subject = 'orders-value', | ||
| struct_mapping_policy = 'COMPOUND' |
It says below that this is optional. Comments should explain the same here (what is optional and what is not).
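One possible shape for such comments is sketched below. It assumes `--` comments are accepted inside the `WITH` list and that the statement closes with `);`. The thread does not say whether `topic` is required, so it is left unannotated.

```sql
CREATE TABLE default_redpanda_catalog=>orders WITH (
    topic = 'orders',
    -- Optional: when omitted, resolves to '<topic>-value' ('orders-value' here).
    schema_subject = 'orders-value',
    -- Optional: the page text describes COMPOUND as the default.
    struct_mapping_policy = 'COMPOUND'
);
```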
|
|
||
| | `JSON` | ||
| | The topic schema is recursive, or you prefer flexible access through JSON functions. | ||
| | Recursive types supported; fields are untyped until extracted with JSON functions. Queries that span the Redpanda topic and its linked Iceberg table do not align cleanly, because Iceberg always exposes nested structures as typed columns. |
@grzebiel this warning would imply something very important to alert the user to (we don't really support querying Iceberg topics with recursive types). However, I don't think this is the correct message here (at least not always), because Iceberg topics have special handling that encodes recursive Protobuf Struct fields as a JSON string in the Iceberg table. So we do have a story for recursive fields, at least in the Protobuf case.
So, how should this be adjusted?
| == Next steps | ||
|
|
||
| * xref:sql:query-data/query-streaming-topics.adoc[Query streaming topics]: query a topic without Iceberg history. | ||
| * xref:sql:query-data/query-iceberg-topics.adoc[Query Iceberg topics]: query the Iceberg-translated history of a topic. Use `struct_mapping_policy = 'COMPOUND'` so nested fields align between the Redpanda topic and the linked Iceberg table. |
@kbatuigas wrong wording IMO. 'Query a topic with Iceberg history' is better.
What's here is technically incorrect because it makes it sound like you're ONLY querying the Iceberg portion (tail), but in fact this link is to how to do a bridge query that queries both the live streaming data and the Iceberg history.
We should ensure we correct this everywhere.
micheleRP
left a comment
docs-team-standards review
Critical (must fix before merging to main)
- Three unresolved `// TODO` markers in the source, two of which are already answered in the review thread.
  - `query-nested-fields.adoc:23`: `// TODO: Confirm TopicNameStrategy requirement.` @pkonrad1229 answered in the review thread: "the underlying naming strategy doesn't matter; only that the resolved subject matches a registered one. (1) Confluent default → `schema_subject` can be omitted, Oxla defaults to `<topic>-value`. (2) other strategy → `schema_subject` must be set explicitly." Bake this answer into the body (and into the Prerequisites bullet that currently sits above the TODO).
  - `query-nested-fields.adoc:42`: `// TODO: Confirm schema_subject required when struct_mapping_policy=COMPOUND.` Same SME reply applies: `schema_subject` is optional; resolves to `<topic>-value` by default. The bullet at line 41 currently only documents `struct_mapping_policy` defaulting to `COMPOUND`; add a parallel bullet for `schema_subject`.
  - `row.adoc:239`: ``// TODO: SME - confirm whether nested array-of-struct access (...) works at GA, and whether wildcard expansion on an empty ROW (`(ROW()).*`) is supported. Tracked under OXLA-9444 and OXLA-9431.`` Still genuinely open; convert to a follow-up doc ticket so it doesn't ship as a comment.
- SME terminology fix only partly applied: "ROW columns" framing still in two attributes/sections.
  - @pkonrad1229 pushed back on the "SQL `ROW` columns" framing in the review thread: "What we do in the COMPOUND mapping is we actually create a User-Defined type… the page body should say 'a structure/UDT with fields'." The body of `query-nested-fields.adoc` was rewritten to use "user-defined types (UDTs)". Good.
  - Still leaking the old framing in two places:
    - `query-nested-fields.adoc:2` (`:description:`): `Map a topic with nested Protobuf, Avro, or JSON fields to SQL ROW columns, then query those fields directly.`
    - `row.adoc:243` (See also): ``xref:reference:sql/sql-statements/create-table.adoc[CREATE TABLE]: maps a Redpanda topic to a SQL table. Use `struct_mapping_policy = 'COMPOUND'` to surface nested topic fields as ROW columns.``
  - Fix: match the body's wording. For example, description → `Map a topic's nested fields to typed SQL columns and query them by name.`; `row.adoc:243` → `...to surface nested topic fields as user-defined types accessible with ROW field-access syntax.`
Suggestions (should consider)
- Open thread (@mattschumpert): `schema_subject` and TopicNameStrategy education.
  - The review thread has an open question from @mattschumpert about whether the page should explicitly explain Schema Registry's TopicNameStrategy convention. @pkonrad1229 clarified the behavior (Oxla just resolves the subject; default is `<topic>-value`). Worth folding that explanation into the Prerequisites or the Map-the-topic-as-a-SQL-table section, even just a one-sentence "If your Schema Registry uses Confluent's default TopicNameStrategy, you can omit `schema_subject`: Redpanda SQL resolves it to `<topic>-value`."
- Open thread (@mattschumpert): Iceberg + recursive types warning may be technically wrong for Protobuf.
  - On line 99 of `query-nested-fields.adoc`, the warning says `COMPOUND` cannot map recursive types. @mattschumpert flagged that the related Iceberg behavior is more nuanced: Protobuf recursive fields are encoded as JSON strings in Iceberg, so there is a story for them. This wasn't fully resolved in the thread. Worth a follow-up with @grzebiel (who was tagged) before this lands in `main`.
- Wording cleanup carryover ("Query an Iceberg topic" → "Query a topic with Iceberg history").
  - @mattschumpert asked for this rename to be applied "everywhere." The new page uses the updated wording (lines 127, 137). The nav still has `*** xref:sql:query-data/query-iceberg-topics.adoc[Query Iceberg Topics]` (line 357). That entry is pre-existing and not modified by this PR, but updating it is consistent with what mattschumpert wants. Either update it now (since nav is already in this diff) or track it as a follow-up.
Co-authored-by: Michele Cyran <michele@redpanda.com>
f3a6858 to
f945fff
Compare
| The `CREATE TABLE` statement maps a Redpanda topic to a SQL table through a catalog. After creating the table, you can query topic data using standard SQL. | ||
| The `CREATE TABLE` statement maps a Redpanda topic to a SQL table through a catalog. After creating the table, you can query the topic using standard SQL. | ||
|
|
||
| NOTE: You must first xref:reference:sql/sql-statements/create-redpanda-catalog.adoc[create a Redpanda catalog connection] before creating tables. `CREATE TABLE` in Redpanda SQL maps Redpanda topics to SQL tables — it does not create standalone tables with user-defined schemas. |
| NOTE: You must first xref:reference:sql/sql-statements/create-redpanda-catalog.adoc[create a Redpanda catalog connection] before creating tables. `CREATE TABLE` in Redpanda SQL maps Redpanda topics to SQL tables — it does not create standalone tables with user-defined schemas. | |
| NOTE: You must first xref:reference:sql/sql-statements/create-redpanda-catalog.adoc[create a Redpanda catalog connection] before creating tables. `CREATE TABLE` in Redpanda SQL maps Redpanda topics to SQL tables. It does not create standalone tables with user-defined schemas. |
|
|
||
| === Access by name | ||
|
|
||
| For composite columns with declared field names — for example, columns mapped from a topic with `struct_mapping_policy = 'COMPOUND'` (see xref:reference:sql/sql-statements/create-table.adoc[CREATE TABLE]) — access fields by their declared names: |
| For composite columns with declared field names — for example, columns mapped from a topic with `struct_mapping_policy = 'COMPOUND'` (see xref:reference:sql/sql-statements/create-table.adoc[CREATE TABLE]) — access fields by their declared names: | |
| For composite columns with declared field names, for example, columns mapped from a topic with `struct_mapping_policy = 'COMPOUND'` (see xref:reference:sql/sql-statements/create-table.adoc[CREATE TABLE]), access fields by their declared names: |
| :page-topic-type: reference | ||
|
|
||
| The `ROW` data type represents a composite value (also known as a struct or record) containing one or more fields of different types. | ||
| The `ROW` data type represents a composite value (also known as a struct or record) containing one or more fields of different types. ROW values support field access, lexicographic comparison, NULL checks, conversion to text, and use in `GROUP BY`, `ORDER BY`, and `JOIN` clauses. |
| The `ROW` data type represents a composite value (also known as a struct or record) containing one or more fields of different types. ROW values support field access, lexicographic comparison, NULL checks, conversion to text, and use in `GROUP BY`, `ORDER BY`, and `JOIN` clauses. | |
| The `ROW` data type represents a composite value (also known as a struct or record) containing one or more fields of different types. `ROW` values support field access, lexicographic comparison, NULL checks, conversion to text, and use in `GROUP BY`, `ORDER BY`, and `JOIN` clauses. |
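To make the listed operations concrete, here is a short sketch. The results in the comments are PostgreSQL's behavior and are only assumed to match Redpanda SQL. The last query assumes the `orders` table with its nested `customer` column from the how-to page, referenced with the `catalog=>table` form.

```sql
-- Lexicographic comparison: fields are compared left to right.
SELECT ROW(1, 2) < ROW(1, 3);      -- PostgreSQL: true

-- NULL check on a whole row.
SELECT ROW(NULL, NULL) IS NULL;    -- PostgreSQL: true (every field is NULL)

-- Conversion to text.
SELECT ROW(1, 2)::text;            -- PostgreSQL: (1,2)

-- A composite column as a grouping and ordering key (assumed table).
SELECT customer, count(*) AS order_count
FROM default_redpanda_catalog=>orders
GROUP BY customer
ORDER BY customer;
```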
micheleRP
left a comment
left 3 little style suggestions!
| * Enable Redpanda SQL on your Redpanda Bring Your Own Cloud (BYOC) cluster. See xref:sql:get-started/deploy-sql-cluster.adoc[Enable Redpanda SQL]. | ||
| * Connect to Redpanda SQL with `psql` or another PostgreSQL client. See xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL]. |
| * Enable Redpanda SQL on your Redpanda Bring Your Own Cloud (BYOC) cluster. See xref:sql:get-started/deploy-sql-cluster.adoc[Enable Redpanda SQL]. | |
| * Connect to Redpanda SQL with `psql` or another PostgreSQL client. See xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL]. | |
| * xref:sql:get-started/deploy-sql-cluster.adoc[Enable Redpanda SQL] on your Redpanda Bring Your Own Cloud (BYOC) cluster. | |
| * xref:sql:connect-to-sql/index.adoc[Connect to Redpanda SQL] with `psql` or another PostgreSQL client. |
Description
This pull request introduces comprehensive documentation improvements for working with nested fields in Redpanda SQL, focusing on mapping nested Protobuf or Avro structures as SQL ROW columns and querying them directly. It clarifies the use of the `struct_mapping_policy` option, expands the reference for the `ROW` data type, and adds a dedicated how-to guide for querying nested fields.

New documentation and feature explanations:
- New guide `query-nested-fields.adoc`, detailing how to map topics with nested schemas as SQL tables using `struct_mapping_policy = 'COMPOUND'`, how to query nested fields with ROW syntax, and how to handle recursive (cyclic) schemas.
- Navigation (`nav.adoc`) updated to include the new "Query Topics with Nested Fields" guide.

Improvements to ROW data type documentation:
- `ROW` data type reference expanded to document field access (by position and name), wildcard projection, lexicographic comparison, NULL checks, text conversion, and usage in `GROUP BY`, `ORDER BY`, and `JOIN` clauses.
- `ROW` type summary updated to mention its support for field access, comparisons, and use in query clauses.

Clarifications to CREATE TABLE options:
- `struct_mapping_policy` option clarified in the `CREATE TABLE` documentation, emphasizing that `COMPOUND` maps nested structures to SQL ROW columns and noting that cyclic types are only supported in `JSON` mode.

Resolves https://github.com/redpanda-data/documentation-private/issues/
Review deadline: 20 May