Skip to content

Fix #5150: Fix dedup aggregation pushdown nullifying renamed fields#5192

Open
opensearchpplteam wants to merge 4 commits intoopensearch-project:mainfrom
opensearchpplteam:fix/issue-5150
Open

Fix #5150: Fix dedup aggregation pushdown nullifying renamed fields#5192
opensearchpplteam wants to merge 4 commits intoopensearch-project:mainfrom
opensearchpplteam:fix/issue-5150

Conversation

@opensearchpplteam
Copy link

Description

Root cause: The DedupPushdownRule converts dedup operations to LITERAL_AGG top_hits aggregations. When a field is renamed before dedup (e.g., rename value as val), the top_hits response from OpenSearch still uses the original field name (value). However, the output schema expects the renamed name (val). The OpenSearchIndexEnumerator looks up fields by the renamed names, finds nothing in the top_hits response map, and returns null.

Fix: Build a field name mapping (original -> renamed) in AggregateAnalyzer for the LITERAL_AGG case by comparing inferNamedField().getRootName() with the output alias from the project args. Pass this mapping to TopHitsParser, which applies it to translate OpenSearch field names back to the expected output names before results are consumed by the enumerator.

Changes:

  • AggregateAnalyzer.java: Build original-to-renamed field name mapping from project args in LITERAL_AGG case, pass to TopHitsParser
  • TopHitsParser.java: Added fieldNameMapping field and applyFieldNameMapping() method to translate OS field names to renamed output names. Added backward-compatible 4-arg constructor; existing 3-arg constructor delegates with empty map.
  • Unit test: Verifies dedup top_hits parsing with field name remapping
  • YAML REST tests: End-to-end tests for rename-then-dedup and eval-then-dedup scenarios

Related Issues

Resolves #5150

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

PR Reviewer Guide 🔍

(Review updated until commit e8a2e3a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add fieldNameMapping support to TopHitsParser with unit tests

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/response/OpenSearchAggregationResponseParserTest.java

Sub-PR theme: Wire fieldNameMapping in AggregateAnalyzer and add integration tests

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5150.yml

⚡ Recommended focus areas for review

Mapping Collision

The code documents a known limitation where multiple project args referencing the same original field with different output names will result in the later mapping overwriting the earlier one. The LOG.warn handles this, but the behavior may silently produce incorrect results in edge cases like eval pay2 = salary | rename salary as pay | dedup dept_id. This limitation should be validated to ensure it doesn't affect real-world use cases or should be addressed more robustly.

Map<String, String> fieldNameMapping = new HashMap<>();
for (Pair<RexNode, String> arg : args) {
  if (arg.getKey() instanceof RexInputRef) {
    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
    String outputName = arg.getValue();
    if (!originalName.equals(outputName)) {
      String previousMapping = fieldNameMapping.put(originalName, outputName);
      if (previousMapping != null) {
        LOG.warn(
            "Field name mapping collision: field '{}' was mapped to '{}', now"
                + " overwritten by '{}'",
            originalName,
            previousMapping,
            outputName);
      }
    }
  }
}
Mapping Not Applied

The applyFieldNameMapping is only applied in the multi-row path (the last return statement). The returnMergeValue path also iterates over hits and builds maps using field names as keys (lines around the returnMergeValue block), but does not apply the field name mapping. If returnMergeValue is ever used with renamed fields, the mapping would be silently skipped. The comment at line 63-65 acknowledges this but it may be worth verifying the returnMergeValue path is truly not affected.

// Field name mapping is not applied in returnSingleValue or returnMergeValue paths
// because they use the aggregation name (agg.getName()) as the map key, not field names.
// Only the multi-row path below uses actual field names as keys and needs mapping.
Missing Schema Check

The second test case ("Eval new field then dedup on different field retains eval values") does not include a schema match assertion, unlike the first test case. Adding a schema check would make the test more robust and consistent.

- match: { total: 2 }
- length: { datarows: 2 }
- match: { datarows.0.0: "X" }
- match: { datarows.0.1: 100 }
- match: { datarows.0.2: 200 }
- match: { datarows.1.0: "Y" }
- match: { datarows.1.1: 300 }
- match: { datarows.1.2: 600 }

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

PR Code Suggestions ✨

Latest suggestions up to e8a2e3a

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Detect output key collisions during field remapping

If two original field names map to the same output name, the second one will
silently overwrite the first in the result map. This could cause data loss without
any warning. Consider detecting such collisions and logging a warning or throwing an
exception.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [162-172]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      throw new IllegalStateException(
+          String.format(
+              "Field name mapping collision in result: key '%s' already exists when mapping '%s'",
+              mappedName, entry.getKey()));
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion addresses a valid edge case where two original fields could map to the same output name, causing silent data loss. However, this scenario is unlikely in practice given the mapping is built from project args, and throwing an exception may be overly aggressive compared to a warning log.

Low
General
Store defensive copy of mutable map parameter

The fieldNameMapping field is stored as a direct reference to the passed-in map,
which could be mutated externally after construction. Store a defensive copy to
ensure immutability of the parser's state.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [50]

-this.fieldNameMapping = fieldNameMapping;
+this.fieldNameMapping = Map.copyOf(fieldNameMapping);
Suggestion importance[1-10]: 4

__

Why: Storing a defensive copy of fieldNameMapping is a good defensive programming practice to prevent external mutation. The Map.copyOf() change is minimal and correct, though the risk of external mutation is low given the call sites in this PR.

Low

Previous suggestions

Suggestions up to commit e9a24ad
CategorySuggestion                                                                                                                                    Impact
General
Strengthen integration test assertions with exact value checks

The is_true assertion checks for truthiness, which would fail for falsy values like
0. Since value fields contain 100 and 300 (non-zero), this works here, but it's
fragile. Consider using match assertions to verify the exact expected values for
stronger test coverage.

integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5150.yml [49-53]

   - match: { schema: [{"name": "category", "type": "string"}, {"name": "val", "type": "bigint"}] }
   - length: { datarows: 2 }
-  # Each row should have non-null val
-  - is_true: datarows.0.1
-  - is_true: datarows.1.1
+  # Each row should have non-null val — verify exact values
+  - match: { datarows.0: ["X", 100] }
+  - match: { datarows.1: ["Y", 300] }
Suggestion importance[1-10]: 5

__

Why: The suggestion improves test robustness by replacing is_true (truthiness check) with exact match assertions. This is a valid improvement since is_true would fail for falsy values like 0, and exact value checks provide stronger test coverage for the renamed field behavior.

Low
Detect collisions when applying field name mapping

If two original field names map to the same output name, the second one will
silently overwrite the first in the result map. This could cause data loss without
any warning. Consider detecting and logging a warning when a collision occurs during
the mapping application.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [162-172]

     private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
       if (fieldNameMapping.isEmpty()) {
         return map;
       }
       Map<String, Object> result = new LinkedHashMap<>();
       for (Map.Entry<String, Object> entry : map.entrySet()) {
         String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+        if (result.containsKey(mappedName)) {
+          LOG.warn(
+              "Field name mapping collision in result: mapped name '{}' already exists, overwriting value",
+              mappedName);
+        }
         result.put(mappedName, entry.getValue());
       }
       return result;
     }
Suggestion importance[1-10]: 4

__

Why: The suggestion adds collision detection in applyFieldNameMapping, which is a valid defensive measure. However, this is a relatively rare edge case and the impact is low since the primary collision warning is already handled upstream in AggregateAnalyzer. The LOG reference would also need to be added to TopHitsParser.

Low
Suggestions up to commit d82a0a1
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null field resolution result

If helper.inferNamedField(arg.getKey()) returns null, calling .getRootName() on it
will throw a NullPointerException. A null check should be added before accessing
getRootName() to prevent a runtime crash when the field cannot be resolved.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [609-617]

 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField == null) continue;
+    String originalName = namedField.getRootName();
     String outputName = arg.getValue();
     if (!originalName.equals(outputName)) {
       fieldNameMapping.put(originalName, outputName);
     }
   }
 }
Suggestion importance[1-10]: 6

__

Why: If helper.inferNamedField(arg.getKey()) can return null, calling .getRootName() would throw a NullPointerException. Adding a null check is a reasonable defensive measure, though it depends on whether the method contract guarantees non-null returns.

Low
General
Detect field name mapping collisions

If two original field names map to the same output name, or if a mapped output name
collides with another existing key in the result map, one value will silently
overwrite the other. While this may be an edge case, it could lead to data loss
without any warning. Consider detecting and handling such collisions explicitly.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [159-169]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      throw new IllegalStateException(
+          String.format("Field name mapping collision: multiple fields map to '%s'", mappedName));
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 3

__

Why: While collision detection is a valid concern, this is an edge case that is unlikely in practice since field rename mappings are derived from explicit user-defined renames. Throwing an exception here could cause unexpected failures in valid scenarios where the mapping is well-defined.

Low
Suggestions up to commit 6eb5ac5
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for inferNamedField

The code doesn't handle potential null values from
helper.inferNamedField(arg.getKey()). If this method returns null, calling
getRootName() will throw a NullPointerException. Add a null check before accessing
getRootName() to prevent runtime failures.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [608-617]

 Map<String, String> fieldNameMapping = new HashMap<>();
 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
-    String outputName = arg.getValue();
-    if (!originalName.equals(outputName)) {
-      fieldNameMapping.put(originalName, outputName);
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField != null) {
+      String originalName = namedField.getRootName();
+      String outputName = arg.getValue();
+      if (!originalName.equals(outputName)) {
+        fieldNameMapping.put(originalName, outputName);
+      }
     }
   }
 }
Suggestion importance[1-10]: 7

__

Why: Valid defensive programming suggestion to prevent potential NullPointerException if helper.inferNamedField() returns null. However, without knowing the implementation details of inferNamedField(), it's unclear if this is a real issue or if the method guarantees non-null returns for RexInputRef instances.

Medium
Suggestions up to commit eddc6b9
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for inferNamedField

The code assumes helper.inferNamedField(arg.getKey()) never returns null, which
could cause a NullPointerException. Add a null check before calling getRootName() to
prevent potential runtime failures when the field cannot be inferred.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [608-617]

 Map<String, String> fieldNameMapping = new HashMap<>();
 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
-    String outputName = arg.getValue();
-    if (!originalName.equals(outputName)) {
-      fieldNameMapping.put(originalName, outputName);
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField != null) {
+      String originalName = namedField.getRootName();
+      String outputName = arg.getValue();
+      if (!originalName.equals(outputName)) {
+        fieldNameMapping.put(originalName, outputName);
+      }
     }
   }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential NullPointerException if helper.inferNamedField(arg.getKey()) returns null. Adding a null check improves robustness, though the actual likelihood of this occurring depends on the codebase's guarantees about inferNamedField behavior.

Medium
Suggestions up to commit 5cf5355
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for inferNamedField

The code doesn't handle potential null values from
helper.inferNamedField(arg.getKey()). If inferNamedField returns null, calling
getRootName() will throw a NullPointerException. Add a null check before accessing
getRootName() to prevent runtime failures.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [608-617]

 Map<String, String> fieldNameMapping = new HashMap<>();
 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
-    String outputName = arg.getValue();
-    if (!originalName.equals(outputName)) {
-      fieldNameMapping.put(originalName, outputName);
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField != null) {
+      String originalName = namedField.getRootName();
+      String outputName = arg.getValue();
+      if (!originalName.equals(outputName)) {
+        fieldNameMapping.put(originalName, outputName);
+      }
     }
   }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential NullPointerException if helper.inferNamedField(arg.getKey()) returns null. Adding a null check improves robustness, though the actual likelihood of this occurring depends on the implementation of inferNamedField.

Medium
General
Detect field name mapping collisions

If multiple original field names map to the same output name, later entries will
silently overwrite earlier ones, potentially causing data loss. Consider detecting
and handling this collision scenario to prevent unexpected behavior.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [159-169]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      throw new IllegalStateException(
+          String.format("Field name collision detected: multiple fields map to '%s'", mappedName));
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion identifies a potential issue where multiple fields mapping to the same output name could cause data loss. However, this scenario seems unlikely given the context (field renaming via rename command), and throwing an exception might be too aggressive without understanding the broader system behavior.

Low

@penghuo penghuo added PPL Piped processing language bugFix labels Feb 27, 2026
@github-actions
Copy link
Contributor

Persistent review updated to latest commit eddc6b9

…g renamed fields

When the DedupPushdownRule converts a dedup to an aggregation-based
top_hits query, fields that were renamed (via rename or eval) would
return null values. This happened because the TopHitsParser returned
results using original OpenSearch field names, but the output schema
expected the renamed names.

Added a field name mapping to TopHitsParser so it can translate original
OS field names to their renamed output names in the LITERAL_AGG (dedup)
aggregation response path.

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 6eb5ac5

The testNoMvBasic and testNoMvWithEval failures in CalciteExplainIT
are unrelated to this PR's dedup pushdown changes and only occur on
Java 25. Re-triggering CI.

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit d82a0a1

String originalName = helper.inferNamedField(arg.getKey()).getRootName();
String outputName = arg.getValue();
if (!originalName.equals(outputName)) {
fieldNameMapping.put(originalName, outputName);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When two project args reference the same original field with different output names (e.g., eval pay2 = salary | rename salary as pay | dedup dept_id), the second HashMap.put silently overwrites the first mapping. One of the renamed fields would still get null values. This is an edge case from issue #5150 differential testing. Consider either documenting this as a known limitation or using a multimap / detecting the collision.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a warning log when a collision is detected (when the same original field maps to multiple output names). The last mapping wins, which is documented in a code comment as a known limitation. This edge case requires two derived fields from the same original field before dedup, which is rare in practice.

: new LinkedHashMap<>(hit.getSourceAsMap());
hit.getFields().values().forEach(f -> map.put(f.getName(), f.getValue()));
return map;
return applyFieldNameMapping(map);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field name mapping is applied only in the multi-row return path (the else branch at line 140). The returnSingleValue path (used by ARG_MAX/ARG_MIN pushdown) and returnMergeValue path skip mapping entirely. If those paths are ever used with renamed fields via a future pushdown rule, they would have the same null-value bug. Worth confirming this is an intentional scope limitation for now, or if those branches should also apply the mapping defensively.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this is intentional. The returnSingleValue and returnMergeValue paths use agg.getName() as the map key (not field names), so field name mapping is irrelevant for those paths. Added a clarifying comment in the code explaining why mapping is only applied in the multi-row return path.

…apping limitations

- Add warning log when HashMap collision is detected in AggregateAnalyzer
  field name mapping (when same original field maps to multiple output names)
- Add clarifying comments in TopHitsParser explaining why field name mapping
  is only needed in the multi-row return path

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit e9a24ad

Comment on lines 52 to 53
- is_true: datarows.0.1
- is_true: datarows.1.1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare with real value, not just null value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — replaced is_true with match assertions for actual expected values (category and val columns).

Comment on lines 73 to 74
- is_true: datarows.0.2
- is_true: datarows.1.2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare with real value, not just null value

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — replaced is_true with match assertions for actual expected values (category, value, and doubled columns).

…project#5150

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit e8a2e3a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugFix PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Dedup aggregation pushdown nullifies renamed non-dedup fields via top_hits response mapping

2 participants