[BUG] Dedup pushdown disabled when combined with WHERE clause (FilterMergeRule defeats PPLSimplifyDedupRule)

## Query Information
**PPL Command/Query:**
```ppl
source=logs-* 
| where `@timestamp` > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
| dedup `resource.attributes.k8s.namespace.name`
| fields `resource.attributes.k8s.namespace.name`
```

**Expected Result:**
The dedup should push down to the shard as a `composite` aggregation + `top_hits` (the same pushdown that already works without a `where` clause), returning one row per distinct namespace value.

**Actual Result:**
With **any** `where` clause upstream, dedup pushdown is disabled. The scan returns every matching document to the coordinator and dedup runs in memory via a `ROW_NUMBER` window. On large indices (47B+ docs reported by users) this exceeds timeouts or triggers shard failures.

The same query **without** `where` pushes down correctly.

## Dataset Information
**Dataset/Schema Type**
- [x] OpenTelemetry (OTEL)

**Index Mapping**
```json
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        { "resource_attributes": { "path_match": "resource.attributes.*",
                                   "mapping": {"type": "keyword"},
                                   "match_mapping_type": "string" } }
      ],
      "properties": {
        "@timestamp":     { "type": "date" },
        "body":           { "type": "text", "norms": false },
        "severityText":   { "type": "keyword" }
      }
    }
  }
}
```

**Sample Data**
```json
{ "@timestamp": "2026-05-28T16:18:40.000Z", "severityText": "INFO",
  "resource": {"attributes": {"k8s.namespace.name": "kube-system"}}, "body": "..." }
{ "@timestamp": "2026-05-28T16:18:41.000Z", "severityText": "INFO",
  "resource": {"attributes": {"k8s.namespace.name": "default"}},     "body": "..." }
{ "@timestamp": "2026-05-28T16:18:42.000Z", "severityText": "INFO",
  "resource": {"attributes": {"k8s.namespace.name": "monitoring"}},  "body": "..." }
{ "@timestamp": "2026-05-28T16:18:43.000Z", "severityText": "INFO",
  "resource": {"attributes": {"k8s.namespace.name": "kube-system"}}, "body": "..." }
```

## Bug Description
**Issue Summary:**

Explain output for both shapes (Calcite engine, pushdown enabled):

**Without `where` (pushdown works):**
```
CalciteEnumerableIndexScan(
  PushDownContext=[[AGGREGATION->...LITERAL_AGG(1), LIMIT->10000]],
  sourceBuilder={"size":0,"aggregations":{
    "composite_buckets":{"composite":{...,"sources":[{"resource.attributes.k8s.namespace.name":{"terms":{"field":"..."}}}]},
                         "aggregations":{"$f1":{"top_hits":{"size":1,...}}}}}})
```

**With `where` (pushdown fails):**
```
EnumerableLimit(fetch=[10000])
  EnumerableCalc(<= row_number 1)
    EnumerableWindow(ROW_NUMBER() OVER (PARTITION BY $8))
      CalciteEnumerableIndexScan(
        PushDownContext=[[PROJECT, FILTER->AND(>(@ts,...), IS NOT NULL(...)), PROJECT]],
        sourceBuilder={"_source":{"includes":[...]},
                       "query":{"bool":{"must":[{"range":...},{"exists":...}]}}})
```
Three operators of in-memory work appear above the scan; the scan emits `_source` (every matching doc) instead of an aggregation.

**Root cause:**
PPL `dedup` is initially compiled into a `ROW_NUMBER() OVER (PARTITION BY field)` window pattern, with a bucket-non-null `IS_NOT_NULL(field)` filter directly above the scan. A HEP rule, `PPLSimplifyDedupRule` (`core/.../plan/rule/PPLSimplifyDedupRule.java`), is supposed to fold that pattern into a `LogicalDedup` operator, which `DedupPushdownRule` (`opensearch/.../planner/rules/DedupPushdownRule.java`) then pushes to the scan as `composite` + `top_hits`.

The HEP program (`CalciteToolsHelper.java:556`) registers two rules in this order: `FilterMergeRule.DEFAULT, PPLSimplifyDedupRule.DEDUP_SIMPLIFY_RULE`. When a user `where` is present, the user's filter and the bucket-non-null filter are adjacent above the scan; `FilterMergeRule` runs first and merges them into one filter `AND(IS_NOT_NULL(field), <user predicate>)`. `PPLSimplifyDedupRule`'s operand requires the bottom filter to satisfy `mayBeFilterFromBucketNonNull` (`PlanUtils.java:788`), which only accepts pure `IS_NOT_NULL` or AND-of-`IS_NOT_NULL`s. The merged condition fails the predicate, the rule never fires, no `LogicalDedup` is produced, and `DedupPushdownRule` has nothing to match. Dedup falls through to the row-number window form executed on the coordinator.

**Steps to Reproduce:**
1. Apply the index template above; index a few sample documents into `logs-000001`.
2. Set `plugins.calcite.enabled: true` and `plugins.calcite.pushdown.enabled: true`.
3. Run `EXPLAIN` on `source=logs-* | dedup resource.attributes.k8s.namespace.name` — observe `composite` + `top_hits` aggregation in the DSL (pushdown succeeds).
4. Run `EXPLAIN` on the same query with `| where @timestamp > DATE_SUB(NOW(), INTERVAL 5 MINUTE)` inserted before `dedup` — observe the `EnumerableWindow` form, no aggregation in DSL (pushdown fails).

## Environment Information
**OpenSearch Version:** 3.7.0-SNAPSHOT (reproduced on `main`, commit `ec433f4ff`).

**Additional Details:**
- Calcite engine: `plugins.calcite.enabled = true`
- Pushdown: `plugins.calcite.pushdown.enabled = true`
- Code locations:
  - `core/src/main/java/org/opensearch/sql/calcite/utils/CalciteToolsHelper.java:556` — HEP rule order
  - `core/src/main/java/org/opensearch/sql/calcite/plan/rule/PPLSimplifyDedupRule.java` — rule whose operand pattern fails after FilterMerge
  - `core/src/main/java/org/opensearch/sql/calcite/utils/PlanUtils.java:788` — `mayBeFilterFromBucketNonNull` predicate
  - `opensearch/src/main/java/org/opensearch/sql/opensearch/planner/rules/DedupPushdownRule.java` — Volcano rule that depends on `LogicalDedup` being produced

**Possible fixes:**
- Loosen `mayBeFilterFromBucketNonNull` to accept any `AND` whose operands *contain* an `IS_NOT_NULL` on the partition column (other conjuncts pass through as a separate filter that gets pushed normally).
- Reorder the HEP program so `PPLSimplifyDedupRule` runs before `FilterMergeRule` — the dedup pattern matcher then sees the original adjacent-filter shape; FilterMerge can run after.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Dedup pushdown disabled when combined with WHERE clause (FilterMergeRule defeats PPLSimplifyDedupRule) #5482

Query Information

Dataset Information

Bug Description

Environment Information

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[BUG] Dedup pushdown disabled when combined with WHERE clause (FilterMergeRule defeats PPLSimplifyDedupRule) #5482

Description

Query Information

Dataset Information

Bug Description

Environment Information

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions