
@Prajwal-banakar
Contributor

Purpose

Linked issue: close #2247

This PR adds support for aggregate functions (specifically MIN and MAX) over complex data types such as ARRAY and ROW. Previously, using these functions on complex types would result in an exception because they were considered incomparable. LAST_VALUE and FIRST_VALUE were already supported but are now verified to work with complex types.

Brief change log

  • Updated InternalRowUtils.compare() to support recursive comparison for ARRAY and ROW types.
  • Updated FieldMinAgg and FieldMaxAgg to use the new InternalRowUtils.compare() method that accepts DataType (preserving nested type information) instead of just DataTypeRoot.
  • Added ComplexTypeAggregationTest to verify LAST_VALUE, MIN, and MAX aggregations on ARRAY and ROW types.
  • Note: MAP type comparison is explicitly not supported and will throw an exception, consistent with SQL standards.
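The recursive comparison described in the change log behaves like lexicographic, element-wise ordering. The snippet below is a simplified, hypothetical illustration for plain int arrays, not the actual `InternalRowUtils.compare()` code, which recurses on the element `DataType`:

```java
// Simplified sketch (not actual Fluss code) of element-wise array
// comparison: compare shared elements pairwise, then fall back to
// length, i.e. lexicographic ordering.
public class ArrayCompareSketch {
    public static int compareIntArrays(int[] left, int[] right) {
        int n = Math.min(left.length, right.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(left[i], right[i]);
            if (c != 0) {
                return c;
            }
        }
        // All shared elements are equal: the shorter array sorts first.
        return Integer.compare(left.length, right.length);
    }
}
```

For nested ROW types the real implementation applies the same idea field by field, recursing into each field's type.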

Tests

  • Added ComplexTypeAggregationTest which includes:
    • testArrayLastValue: Verifies LAST_VALUE on ARRAY<INT>.
    • testArrayMinMax: Verifies MIN/MAX on ARRAY<INT>.
    • testRowMinMax: Verifies MIN/MAX on nested ROW<INT, STRING>.

API and Format

  • No API or storage format changes. This only enhances the runtime capability of existing aggregation functions.

Documentation

  • No documentation changes needed as this supports standard SQL behavior for these types.

@Prajwal-banakar
Contributor Author

Hi @wuchong, I've submitted the PR to support aggregate functions over complex data types.

The current CI failure is in FlinkUnionReadPrimaryKeyTableITCase; it appears to be caused by hardcoded year values (2025) in the test cases, which now mismatch because it is 2026.

@wuchong
Member

wuchong commented Jan 4, 2026

Hi @Prajwal-banakar, thank you for the contribution!

Just a quick note: the test failure in FlinkUnionReadPrimaryKeyTableITCase has already been resolved by PR #2295.

Regarding this PR, MAX and MIN are not intended to support complex data types such as ARRAY, MAP, or ROW. These types are not orderable, and therefore Fluss (like other engines such as Apache Spark and Flink) cannot define a consistent total ordering for them. As a result, using MAX/MIN on such columns should be disallowed. The FIP also declares the supported types for MAX/MIN: https://cwiki.apache.org/confluence/display/FLUSS/FIP-21%3A+Aggregation+Merge+Engine
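The lack of a natural ordering shows up even in plain Java, where arrays and maps do not implement Comparable. The check below is a minimal illustration of that point, not Fluss code:

```java
import java.util.Map;

// Illustration only: Java defines no natural ordering for arrays or
// maps, mirroring why MIN/MAX over ARRAY or MAP has no consistent
// total ordering.
public class OrderingCheck {
    // Returns true only for values with a natural ordering.
    public static boolean isComparable(Object value) {
        return value instanceof Comparable;
    }
}
```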

Moreover, the original issue #2247 was actually aimed at supporting aggregation functions that return complex types, for example ARRAY_AGG, which returns an ARRAY type.

As a gentle reminder: to avoid wasted effort, we encourage contributors to discuss the design and scope with committers in the GitHub issue and request to be assigned before opening a PR. This is part of our official contribution process, which helps ensure alignment and smoother reviews.

Thanks again for your engagement with Fluss!

@wuchong
Member

wuchong commented Jan 4, 2026

One important point you raised and that we should address is early type validation. We should fail fast during table creation if an aggregation function is applied to an unsupported data type. Specifically, in:

org.apache.fluss.server.utils.TableDescriptorValidation#validateAggregationFunctionParameters

we should validate that the column type is within the supported data types for the given aggregation function. If not, we should throw a clear error immediately. This validation also needs dedicated test coverage. I think this is what was missing from the last pull request, right? @platinumhamburg
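A fail-fast check of this shape could look roughly like the following. The class, enum, and method names here are hypothetical stand-ins for illustration, not the real Fluss API:

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of fail-fast validation at table-creation time:
// reject MIN/MAX on non-orderable column types with a clear error.
public class AggTypeValidationSketch {
    enum TypeRoot { INT, BIGINT, STRING, ARRAY, MAP, ROW }

    // Illustrative subset of orderable types for MIN/MAX.
    private static final Set<TypeRoot> ORDERABLE =
            EnumSet.of(TypeRoot.INT, TypeRoot.BIGINT, TypeRoot.STRING);

    // Throws immediately if the column type is unsupported for MIN/MAX.
    public static void validateMinMaxColumn(String column, TypeRoot type) {
        if (!ORDERABLE.contains(type)) {
            throw new IllegalArgumentException(
                    "Aggregation function MIN/MAX does not support column '"
                            + column + "' of type " + type);
        }
    }
}
```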

@Prajwal-banakar
Contributor Author

Hi @wuchong,

Thank you for the detailed explanation! I now understand the reasoning behind not supporting MIN/MAX for complex types like ARRAY or ROW due to the lack of total ordering. I also appreciate the clarification on the original intent of #2247.

As a beginner in the codebase, I'd like to help implement the early type validation you mentioned in TableDescriptorValidation#validateAggregationFunctionParameters.

To make sure I stay on the right track:

Should I pivot this PR to focus on adding those "fail-fast" validations and tests?

Or would you prefer I close this PR and open a new one specifically for the validation logic?

I will make sure to discuss design in the issue tracker moving forward to better align with the project's contribution process. Thanks for your patience!

@wuchong
Member

wuchong commented Jan 4, 2026

@Prajwal-banakar Yeah, I think it’s best to close this issue and open a new one specifically for that purpose, along with a dedicated pull request. This will help keep the scope clear and the discussion focused.

@wuchong
Member

wuchong commented Jan 4, 2026

@Prajwal-banakar I created #2302 for this problem. You can comment in that issue and I can assign it to you.

@wuchong
Member

wuchong commented Jan 4, 2026

Closing this issue as discussed.

@wuchong wuchong closed this Jan 4, 2026
@Prajwal-banakar Prajwal-banakar deleted the Task/complex-type-aggregation branch January 4, 2026 14:46
@platinumhamburg
Contributor

> One important point you raised and that we should address is early type validation. We should fail fast during table creation if an aggregation function is applied to an unsupported data type. Specifically, in:
>
> org.apache.fluss.server.utils.TableDescriptorValidation#validateAggregationFunctionParameters
>
> we should validate that the column type is within the supported data types for the given aggregation function. If not, we should throw a clear error immediately. This validation also needs dedicated test coverage. I think this is what was missing from the last pull request, right? @platinumhamburg

Yes, the previous PR indeed missed the strict type validation.
