Skip to content

Fix transpose VARCHAR length drift in unpivot/pivot lowering#5479

Open
songkant-aws wants to merge 2 commits into
opensearch-project:mainfrom
songkant-aws:transpose-varchar-fix
Open

Fix transpose VARCHAR length drift in unpivot/pivot lowering#5479
songkant-aws wants to merge 2 commits into
opensearch-project:mainfrom
songkant-aws:transpose-varchar-fix

Conversation

@songkant-aws
Copy link
Copy Markdown
Collaborator

Summary

PPL transpose lowers via RelBuilder.unpivot() + pivot(). Calcite's unpivot synthesizes a LogicalValues leaf carrying axis literals (the input field names), and types them as CHAR(maxAxisLiteralLen) — the longest literal wins on column-level type inference. That precision propagates through the rest of the lowering and breaks the analytics-engine route end-to-end.

This PR makes the lowering own its type contract:

  1. Build every axis literal at the same CHAR(maxAxisLiteralLen) type. Calcite then space-pads the shorter literals at value-construction time, so the runtime CHAR vector and the declared schema both have the same fixed length. The downstream TRIM strips the padding.
  2. Wrap the trimmed-axis group key in an explicit CAST(... AS VARCHAR) to unbounded VARCHAR. This makes the group key type match _value_transpose_'s unbounded VARCHAR end-to-end, so the aggregate's row-type check sees consistent types.

Why this has to be in sql plugin (not analytics-engine planner)

The typing decisions happen inside Calcite's RelBuilder.unpivot() when it constructs the VALUES leaf — long before any analytics-engine rule sees the plan. By the time the plan reaches the analytics-engine route, the precision drift is already baked into the RelDataType chain. Fixing it downstream would require pattern-matching on transpose-shaped sub-trees inside the planner, which is fragile and mis-attributes the root cause. The lowering author owns the type contract for the operators it emits.

What goes wrong on the analytics-engine route without this fix

Concrete plan from source=accounts | head N | fields firstname, age, balance | transpose:

LogicalAggregate(group=[{1}], row 1_null=[MAX($0) FILTER $2], ...)
  LogicalProject(_value_transpose_=[CAST($5):VARCHAR NOT NULL],   -- $0
                 $f6=[TRIM(BOTH, ' ', $4)],                       -- $1  ← group key
                 $f7=[=($3, 1)], ...)                             -- $2..
    LogicalProject(... column=$4, _value_transpose_=[CASE(...)])
      LogicalJoin(condition=[true])                               -- cross join from pivot
        LogicalProject(..., ROW_NUMBER() OVER ())
          LogicalSort(fetch=N)
            LogicalScan(t1)
        LogicalValues(tuples=[['firstname'},{'age'},{'balance'}]) -- axis literals

Type chain:

  • column (group key): CHAR(9) from longest literal 'firstname' → after TRIM → VARCHAR(9)
  • _value_transpose_: CAST(... AS VARCHAR) — unbounded
  • MAX(_value_transpose_).type: unbounded VARCHAR (inferred from arg 0 at construction)

Two failure modes downstream:

  1. Aggregate.<init> rejects the plan. Non-prefix groupSet={1} aggregates split into PARTIAL+FINAL on the analytics path. PARTIAL hoists group keys to its output prefix, so FINAL's argList=[0] ends up reading the group-key slot (VARCHAR(9)) instead of the agg-state slot. Calcite's typeMatchesInferred check then fires:
    type mismatch:
      aggCall type:    VARCHAR
      inferred type:   VARCHAR(9)
    
  2. Substrait/Arrow FixedChar length mismatch at execution time. Even if the aggregate validation passes, the CHAR(maxAxisLiteralLen) propagates through isthmus → DataFusion as a FixedChar(maxAxisLiteralLen) schema. Runtime arrays carry the actual values whose lengths are shorter (e.g. "age" is length 3, not 9), so DataFusion rejects:
    Row field type (FixedChar{length=3}) does not match schema field type (FixedChar{length=9})
    

Both failures trace back to the same root cause — uneven typing in the unpivot/pivot lowering — and disappear once the lowering emits consistent types.

What changes

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.javavisitTranspose:

+    int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
+    RelDataType axisLiteralType =
+        rx.getTypeFactory().createSqlType(SqlTypeName.CHAR, axisLiteralLength);
     // unpivot axisMap entries:
-                        ImmutableList.of(rx.makeLiteral(f)),
+                        ImmutableList.of(
+                            (RexLiteral) rx.makeLiteral(f, axisLiteralType, false, false)),
     // group key for pivot:
-    RexNode trimmedColumnName =
-        context.rexBuilder.makeCall(SqlStdOperatorTable.TRIM, ...);
+    RexNode trimmedColumnName =
+        context.rexBuilder.makeCast(
+            varchar,
+            context.rexBuilder.makeCall(SqlStdOperatorTable.TRIM, ...),
+            true);

Tests

  • CalcitePPLTransposeTest — added testTransposeColumnAxisUsesUnboundedVarchar regression assertion that pins the output column field's type to unbounded VARCHAR. This catches any future change that re-introduces axis-literal precision into the group key.
  • Updated plan-shape assertions across the existing transpose tests to reflect:
    • Padded axis literals ('cnt ', 'COMM ', 'JOB ', 'SAL ', etc.)
    • CAST(TRIM(...) AS VARCHAR) shape on the group key
    • SparkSQL output GROUP BY CAST(TRIM(\column`) AS STRING)`

End-to-end verification: CalciteTransposeCommandIT 5/5 pass with -Dtests.analytics.parquet_indices=true.

Test plan

  • Unit: ./gradlew :ppl:test --tests "*CalcitePPLTransposeTest"
  • Spotless: ./gradlew spotlessCheck
  • V2 engine path (non-parquet): no regression — unchanged plan shape produces identical results
  • Analytics-engine path (parquet): CalciteTransposeCommandIT 5/5 (vs 0/5 before)

PPL `transpose` lowers via `RelBuilder.unpivot()` + `pivot()`. The unpivot
synthesizes a VALUES leaf carrying axis literals (the input field names),
e.g. `VALUES('firstname'), ('age'), ('balance')`. Calcite types each
RexLiteral as `CHAR(literalLen)` and types the VALUES column as
`CHAR(maxAxisLiteralLen)` — the longest literal wins on column-level
type inference.

This bites the analytics-engine route end-to-end:

1. After unpivot the `column` axis column is `CHAR(9)` (from "firstname").
   Through Calcite's TRIM (`TO_VARYING`) it becomes `VARCHAR(9)`.
2. The `_value_transpose_` value column is built from
   `CAST(input_field AS VARCHAR)` — unbounded VARCHAR.
3. `MAX(_value_transpose_)` aggCall is created with declared return type
   = unbounded VARCHAR (inferred from arg 0 at call-construction time).
4. The downstream non-prefix groupSet aggregate
   (`group=[{1}], MAX($0)`) splits into PARTIAL/FINAL on the analytics
   path. PARTIAL hoists group keys to the output prefix, so FINAL's
   `argList=[0]` reads the group-key slot — `VARCHAR(9)` — instead of
   the agg-state slot. Calcite's `Aggregate.<init>` then runs
   `typeMatchesInferred` and rejects the plan: declared `VARCHAR` ≠
   inferred `VARCHAR(9)`.
5. Even when the aggregate validation passes, the substrait/Arrow path
   sees `FixedChar(maxAxisLiteralLen)` schema vs runtime arrays whose
   actual values are shorter (e.g. "age" with length 3) and trips
   `Row field type (FixedChar{length=3}) does not match schema field
   type (FixedChar{length=9})`.

Two fixes, both in the lowering site:

* Build every axis literal at the same `CHAR(maxAxisLiteralLen)` type.
  Calcite then space-pads the shorter literals at value-construction
  time, so the runtime CHAR vector and the declared schema both have
  the same fixed length. The downstream TRIM strips the padding.
* Wrap the trimmed-axis group key in an explicit
  `CAST(... AS VARCHAR)` to unbounded VARCHAR. This makes the group
  key type match `_value_transpose_`'s unbounded VARCHAR end-to-end,
  so the aggregate's row-type check sees consistent types regardless
  of which side the analytics-engine split rule places the group key
  on.

These have to live in sql plugin, not in the analytics-engine planner:
the typing decisions are made by Calcite's `RelBuilder.unpivot()`
implementation when it constructs the VALUES leaf — long before any
analytics-engine rule sees the plan. By the time the plan reaches the
analytics-engine route, the precision drift is already baked into the
RelDataType chain. Fixing it downstream would require pattern-matching
on transpose-shaped sub-trees inside the planner, which is fragile and
mis-attributes the root cause. The lowering author owns the type
contract for the operators it emits.

Adds:
- `testTransposeColumnAxisUsesUnboundedVarchar` regression assertion
  pinning the output `column` field's type to unbounded VARCHAR. Catches
  any future change that re-introduces axis-literal precision into the
  group key.
- Updated plan-shape assertions across the existing transpose tests to
  reflect the padded axis literals (`'cnt    '`, `'COMM '`, etc.) and
  the `CAST(TRIM(...) AS VARCHAR)` group key.

Verified end-to-end: `CalciteTransposeCommandIT` 5/5 pass with
`tests.analytics.parquet_indices=true`.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

PR Reviewer Guide 🔍

(Review updated until commit 8b89df5)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

When fieldNames is empty, axisLiteralLength becomes 0, and axisLiteralType is CHAR(0). Calcite may reject zero-length CHAR types or produce unexpected behavior. This occurs if all fields are metadata fields or if the input has no fields at all.

int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
RelDataType axisLiteralType =
    rx.getTypeFactory().createSqlType(SqlTypeName.CHAR, axisLiteralLength);

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

PR Code Suggestions ✨

Latest suggestions up to 8b89df5
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Handle empty fieldNames case

The axisLiteralLength calculation may result in 0 when fieldNames is empty, which
could cause issues when creating a CHAR type with length 0. Consider adding
validation to ensure a minimum length of 1, or handle the empty case explicitly to
prevent potential type creation errors.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [916-918]

-int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
+int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(1);
 RelDataType axisLiteralType =
     rx.getTypeFactory().createSqlType(SqlTypeName.CHAR, axisLiteralLength);
Suggestion importance[1-10]: 3

__

Why: While the suggestion addresses a potential edge case where fieldNames is empty, this scenario is unlikely in the context of the transpose operation which requires at least one field. The change from orElse(0) to orElse(1) is a minor defensive improvement but doesn't address a critical bug.

Low

Previous suggestions

Suggestions up to commit 35bd92b
CategorySuggestion                                                                                                                                    Impact
General
Use VARCHAR instead of CHAR type

Using CHAR type with fixed length can cause issues when field names have varying
lengths, as shorter names will be padded with spaces. This padding is why the TRIM
operation is needed later. Consider using VARCHAR with the maximum length instead to
avoid unnecessary padding and simplify the logic.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [916-918]

 int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
 RelDataType axisLiteralType =
-    rx.getTypeFactory().createSqlType(SqlTypeName.CHAR, axisLiteralLength);
+    rx.getTypeFactory().createSqlType(SqlTypeName.VARCHAR, axisLiteralLength);
Suggestion importance[1-10]: 9

__

Why: The suggestion correctly identifies that using CHAR type causes padding with spaces for shorter field names, which necessitates the TRIM operation added in the PR. Using VARCHAR would eliminate this padding issue and simplify the logic, making it a high-impact improvement.

High
Handle empty field names list safely

When fieldNames is empty, orElse(0) returns 0, which could lead to creating a type
with zero length. This may cause issues with type validation or downstream
operations. Ensure a minimum length of 1 to prevent potential errors.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [916]

-int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(0);
+int axisLiteralLength = fieldNames.stream().mapToInt(String::length).max().orElse(1);
Suggestion importance[1-10]: 7

__

Why: The suggestion addresses a potential edge case where fieldNames is empty, which would result in a zero-length type. While this is a valid concern for robustness, the likelihood of this scenario occurring in practice may be low, and the impact is moderate.

Medium

@songkant-aws songkant-aws added PPL Piped processing language calcite calcite migration releated maintenance Improves code quality, but not the product labels May 28, 2026
CalciteExplainIT.testTransposeExplain regenerated against the updated
transpose lowering: axis literals are now padded to a uniform CHAR(N)
(N = max axis literal length, i.e. 14 for 'account_number'), and the
group-key TRIM output is wrapped in a CAST(... AS VARCHAR) to unbounded
VARCHAR. The plan-shape diff exactly mirrors the documented behavior
change in the parent commit:

* `$f20=[TRIM(...)]` → `$f20=[CAST(TRIM(...)):VARCHAR NOT NULL]`
* axis literals e.g. 'firstname' → 'firstname     ' (padded)
* LogicalValues row type tuples are correspondingly padded

Verified locally: `./gradlew :integ-test:integTestRemote --tests
"*CalciteExplainIT.testTransposeExplain"` passes.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 8b89df5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

calcite calcite migration releated maintenance Improves code quality, but not the product PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants