[Feature] Calcite PPL search result highlighting #5141
RyanL1997 wants to merge 18 commits into opensearch-project:main
📝 Walkthrough
Adds per-request PPL highlighting end-to-end: request parsing, plan-level propagation via ThreadLocal, conditional hidden [...]
Sequence Diagram(s)
sequenceDiagram
participant Client
participant PPLService
participant Planner as CalcitePlanner
participant Worker as ExecThread
participant CalciteCtx as CalcitePlanContext
participant OpenSearch as OpenSearchNode
participant Enumerator as OpenSearchIndexEnumerator
participant Protocol as QueryResult/Formatter
Client->>PPLService: POST /_plugins/_ppl (with optional "highlight")
PPLService->>Planner: build execution plan
PPLService->>Planner: set plan.highlightConfig
Planner->>Worker: submit plan (worker thread)
Worker->>CalciteCtx: setHighlightConfig(plan.highlightConfig)
Worker->>OpenSearch: execute search (HighlightBuilder attached)
OpenSearch-->>Enumerator: return hits with highlight fragments
Enumerator->>Worker: produce rows including hidden _highlight value
Worker->>Protocol: build QueryResult (extract highlights())
Worker->>CalciteCtx: clearHighlightConfig()
Protocol-->>Client: JSON response (optional "highlights" array)
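The set-then-clear hand-off in the diagram (the worker sets the highlight config before executing and clears it afterwards) can be sketched in plain Java. This is a minimal illustration under assumptions, not the PR's code: `HighlightContext` and `withHighlight` are hypothetical names, and the config is reduced to a `String`. The helper is meant to be called from inside the task that runs on the worker thread.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the per-thread highlight state described in the walkthrough.
final class HighlightContext {
  private static final ThreadLocal<String> CONFIG = new ThreadLocal<>();

  // Returns the config visible to the current thread, or null when none is set.
  static String get() {
    return CONFIG.get();
  }

  // Makes the config visible for the duration of body, then always removes it.
  static <T> T withHighlight(String config, Supplier<T> body) {
    CONFIG.set(config);
    try {
      return body.get();
    } finally {
      CONFIG.remove(); // pooled threads are reused, so no state may be left behind
    }
  }
}
```

The `finally` block is the design point: if planning or execution throws, the next query scheduled on the same pooled thread still starts without a stale config.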
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Force-pushed from 333a24b to a4d156e
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java`:
- Around line 411-418: When highlight is enabled
(CalcitePlanContext.getHighlightConfig() != null) the highlight preservation
block in CalciteRelNodeVisitor can add HIGHLIGHT_FIELD twice into expandedFields
(duplicate of expandProjectFields' AllFields expansion); modify the logic around
currentFields/index hlIndex and expandedFields to first scan expandedFields for
an existing RexInputRef that references the same input index (or any existing
projection that already corresponds to HIGHLIGHT_FIELD) and only add
context.relBuilder.field(hlIndex) if no matching RexInputRef exists, so the
projection is not duplicated and downstream tuple.get(HIGHLIGHT_FIELD) continues
to work.
// When highlight is active, include _highlight in the projection so it survives
// through the Calcite pipeline. This matches DSL behavior where _source filtering
// does not affect highlights.
if (CalcitePlanContext.getHighlightConfig() != null) {
  int hlIndex = currentFields.indexOf(HIGHLIGHT_FIELD);
  if (hlIndex >= 0) {
    expandedFields.add(context.relBuilder.field(hlIndex));
  }
🧩 Analysis chain
🏁 Script executed:
# First, let's see the full context of the visitProject method and the code around lines 411-418
cd core/src/main/java/org/opensearch/sql/calcite && wc -l CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 97
🏁 Script executed:
# Read the visitProject method context (expanded range to understand the logic)
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '380,430p' CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 1792
🏁 Script executed:
# Search for how _highlight is used downstream (extraction/handling)
rg "HIGHLIGHT_FIELD|_highlight|getHighlight" -n -A 3 -B 1 --type java
Repository: opensearch-project/sql
Length of output: 50378
🏁 Script executed:
# Check if there are similar guards against duplicate projection in the codebase
rg "expandedFields|alreadyProjected|anyMatch.*RexInputRef" -n --type java
Repository: opensearch-project/sql
Length of output: 1744
🏁 Script executed:
# Search for tests related to highlight projection
rg "highlight.*[Pp]roject|visitProject.*highlight" -n --type java
Repository: opensearch-project/sql
Length of output: 365
🏁 Script executed:
# Check if _highlight can be explicitly selected in user queries
# Look for test cases with explicit _highlight selection
rg "SELECT.*_highlight|project.*_highlight" -n --type java -B 2 -A 2 | head -50
Repository: opensearch-project/sql
Length of output: 1538
🏁 Script executed:
# Check expandProjectFields to see if it could include _highlight from user input
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '440,495p' CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 2319
🏁 Script executed:
# Search for how currentFields is populated and whether _highlight would be present
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '393,425p' CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 1328
🏁 Script executed:
# Check if there are tests specifically covering _highlight in projections with other fields
rg "highlight.*project|project.*highlight" -n --type java -A 5 | head -60
Repository: opensearch-project/sql
Length of output: 3601
🏁 Script executed:
# Check if _highlight is classified as a metadata field
rg "isMetadataField|METADATA|_highlight" -n --type java -B 2 -A 2 | grep -A 5 "isMetadataField"
Repository: opensearch-project/sql
Length of output: 4271
🏁 Script executed:
# Get the isMetadataField implementation
cd core/src/main/java/org/opensearch/sql/calcite && rg "isMetadataField" -A 5 CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 1952
🏁 Script executed:
# Check if _highlight is in METADATAFIELD_TYPE_MAP
rg "METADATAFIELD_TYPE_MAP|_highlight" -n --type java | grep -B 5 -A 5 "METADATAFIELD_TYPE_MAP"
Repository: opensearch-project/sql
Length of output: 7296
🏁 Script executed:
# Find where METADATAFIELD_TYPE_MAP is defined
fd "OpenSearchConstants.java" --type f | head -1 | xargs cat | head -100
Repository: opensearch-project/sql
Length of output: 1137
🏁 Script executed:
# Verify that _highlight is NOT in the metadata field map and would be included in wildcard projections
cd core/src/main/java/org/opensearch/sql/opensearch/storage && cat -n OpenSearchIndex.java | sed -n '65,75p'
Repository: opensearch-project/sql
Length of output: 172
🏁 Script executed:
# Check if there are any existing tests that combine SELECT * with highlight
rg "SELECT \*|AllFields" -n --type java integ-test/src/test/java/org/opensearch/sql/sql/HighlightFunctionIT.java
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Find the correct path for OpenSearchIndex.java
fd "OpenSearchIndex.java" --type f
Repository: opensearch-project/sql
Length of output: 243
🏁 Script executed:
# Check the METADATAFIELD_TYPE_MAP definition to confirm _highlight is NOT in it
rg "METADATAFIELD_TYPE_MAP\|_highlight" opensearch/src/main/java/org/opensearch/sql/opensearch/storage/OpenSearchIndex.java -A 10 -B 2
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Check if tests verify SELECT * with highlight scenarios
rg "SELECT.*\*.*highlight|highlight.*SELECT" integ-test/ -n --type java -A 3 -B 1
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Look for expandWildcardPattern to understand if _highlight gets included
rg "expandWildcardPattern|WildcardUtils" -n --type java -A 5 | head -50
Repository: opensearch-project/sql
Length of output: 5426
🏁 Script executed:
# Verify the exact scenario: when ALL fields are selected, does _highlight get added to expandedFields?
# Trace through: AllFields case adds all non-metadata fields
# _highlight is in currentFields (added by CalciteLogicalIndexScan when highlight config is set)
# _highlight is NOT in METADATAFIELD_TYPE_MAP, so !isMetadataField returns true
# So _highlight WOULD be added to expandedFields via the AllFields expansion
# Then the highlight logic adds it AGAIN at line 417
# Confirm _highlight is added to schema when highlight is active
rg "buildSchema|HIGHLIGHT_FIELD" opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/CalciteLogicalIndexScan.java -B 3 -A 3
Repository: opensearch-project/sql
Length of output: 1710
🏁 Script executed:
# Check if there are integration tests that cover SELECT * with highlight
fd "HighlightFunctionIT.java" --type f -exec cat {} \; | head -150
Repository: opensearch-project/sql
Length of output: 253
🏁 Script executed:
# Read the HighlightFunctionIT.java test file
cat integ-test/src/test/java/org/opensearch/sql/sql/HighlightFunctionIT.java | head -200
Repository: opensearch-project/sql
Length of output: 7756
🏁 Script executed:
# Let me verify the exact duplicate scenario one more time by checking both code paths:
# Path 1: AllFields expansion (lines 479-482)
# Path 2: Explicit highlight logic (lines 414-417)
# Confirm that expandProjectFields is called BEFORE the highlight logic
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '393,425p' CalciteRelNodeVisitor.java | cat -n
Repository: opensearch-project/sql
Length of output: 1559
🏁 Script executed:
# Now trace through a hypothetical SELECT * query with highlight:
# 1. Project node has AllFields as projectList
# 2. visitProject is called
# 3. isSingleAllFieldsProject returns true
# 4. handleAllFieldsProject is called (not visitProject logic at 411-420)
# Check handleAllFieldsProject
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '424,445p' CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 951
🏁 Script executed:
# Check what happens in the non-AllFields-only case
# If someone does: SELECT name, * with highlight
# Then isSingleAllFieldsProject returns FALSE
# Then expandProjectFields is called
# Then highlight logic runs
# Verify: when is expandProjectFields called?
cd core/src/main/java/org/opensearch/sql/calcite && sed -n '395,404p' CalciteRelNodeVisitor.java
Repository: opensearch-project/sql
Length of output: 397
🏁 Script executed:
# Final verification: construct the exact scenario that would trigger the bug
# SELECT name, * with highlight enabled
# Trace:
# 1. Project has [Field("name"), AllFields()]
# 2. isSingleAllFieldsProject returns FALSE (size=2, not 1)
# 3. expandProjectFields processes both:
# - Field "name" → added to expandedFields
# - AllFields → expands to all currentFields (which includes _highlight)
# 4. Then highlight logic adds _highlight AGAIN
# Check if there are tests covering SELECT <field>, * scenario
rg "SELECT.*,.*\*" integ-test/ core/src/test -n --type java | head -20
Repository: opensearch-project/sql
Length of output: 3138
Guard against duplicate _highlight projection when using SELECT with mixed explicit fields and wildcards.
When highlight is enabled and a user queries with mixed explicit fields and AllFields (e.g., SELECT name, *), the code can add _highlight twice to the projection:
- expandProjectFields includes _highlight via AllFields expansion (since _highlight is not in METADATAFIELD_TYPE_MAP)
- The highlight preservation logic at lines 414-417 adds it again unconditionally
Calcite may rename duplicate columns (e.g., _highlight0, _highlight1), which breaks downstream extraction that uses exact name matching via tuple.get(HIGHLIGHT_FIELD). The suggested guard checking for existing RexInputRef matches prevents this duplication.
🔧 Suggested guard
if (CalcitePlanContext.getHighlightConfig() != null) {
int hlIndex = currentFields.indexOf(HIGHLIGHT_FIELD);
- if (hlIndex >= 0) {
- expandedFields.add(context.relBuilder.field(hlIndex));
+ if (hlIndex >= 0) {
+ boolean alreadyProjected =
+ expandedFields.stream()
+ .anyMatch(
+ rex ->
+ (rex instanceof RexInputRef ref && ref.getIndex() == hlIndex)
+ || (rex instanceof RexCall call
+ && call.getKind() == SqlKind.AS
+ && call.getOperands().get(0) instanceof RexInputRef ref
+ && ref.getIndex() == hlIndex));
+ if (!alreadyProjected) {
+ expandedFields.add(context.relBuilder.field(hlIndex));
+ }
}
 }
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs/user/ppl/interfaces/endpoint.md (2)
107-116: Clarify dual text and keyword highlighting behavior.
Line 108 shows both "firstname": ["<em>Holmes</em>"] and "firstname.keyword": ["<em>Holmes</em>"] in the highlights. While line 76 mentions that wildcard matching includes .keyword subfields, users may find it surprising to see both the analyzed text field and the keyword subfield highlighted for the same match.
Consider adding a brief note in the example explanation (after line 116) to clarify why both appear and when users might want to filter this (e.g., by specifying explicit fields instead of "*").
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/ppl/interfaces/endpoint.md` around lines 107 - 116, Add a brief explanatory note after the example that clarifies why both analyzed and keyword subfield highlights appear: explain that using a wildcard ("*") matches both the analyzed field (e.g., "firstname") and its unmapped keyword subfield (e.g., "firstname.keyword"), causing duplicate highlight entries for the same token, and advise that users can avoid this by specifying explicit fields to highlight (instead of "*") or by excluding ".keyword" fields when they only want analyzed-text highlights; reference the "highlights" example showing "firstname" and "firstname.keyword" and mention the wildcard behavior.
136-138: Consider expanding abbreviated example output.
Using ... for brevity in Example 2 is acceptable, but it may be clearer to either:
- Show the minimal complete schema (just the fields that matter for this example), or
- Add a note indicating what's omitted (e.g., // schema and datarows same as Example 1).
This helps readers understand the complete structure without needing to reference Example 1.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/ppl/interfaces/endpoint.md` around lines 136 - 138, Example 2 in endpoint.md uses "schema" and "datarows" with "..." which hides structure; update the Example 2 JSON so readers can understand it by either (a) replacing the "..." under "schema" and "datarows" with a minimal complete schema and one or two representative datarows (include the actual field names used in Example 1), or (b) add an explicit comment/note beneath the JSON such as "// schema and datarows same as Example 1" that points to the source of the omitted content; locate the block containing "schema" and "datarows" in Example 2 and apply one of these two changes to make the example self-contained.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/ppl/interfaces/endpoint.md`:
- Around line 102-106: Several response examples in the docs are inconsistent
about including a "status" field (the snippet with "datarows" and the very first
POST example lack it while other examples include "status": 200); update all
example response JSON blocks to include a "status": 200 field so every example
(including the initial POST example and the block containing "datarows" and the
examples around lines referenced in the review) follows the same format,
ensuring the inserted key is placed at the same level as other top-level
response fields for uniformity.
- Around line 97-106: Example 1's query ("search source=accounts \"Holmes\"") is
missing an explicit fields clause, leaving the returned columns ambiguous;
update that example to include a fields clause listing the schema fields
(firstname, lastname, address) so the expected output is deterministic and
matches the earlier POST example pattern.
- Line 163: Update the sentence to explicitly state that highlighting is
supported only for single-source queries and that its behavior with joins,
subqueries, or multi-source queries (e.g., using the operator "| join") is
unsupported and untested: clarify that the system does not guarantee an error or
consistent output in those cases and that highlighting may be silently omitted
or produce unexpected results, so users should avoid combining highlighting with
joins, subqueries, or multi-source queries.
"schema": [
  { "name": "firstname", "type": "string" },
  { "name": "lastname", "type": "string" },
  { "name": "address", "type": "string" }
],
"datarows": [
  ["Holmes", "Morgan", "123 Main St"],
  ["Jane", "Holmes", "456 Oak Ave"],
  ["John", "Smith", "880 Holmes Lane"]
],
🧩 Analysis chain
🏁 Script executed:
cat -n docs/user/ppl/interfaces/endpoint.md | head -150
Repository: opensearch-project/sql
Length of output: 5348
Add explicit field selection to make the query output consistent.
The query search source=accounts "Holmes" omits a fields clause but the schema shows only three fields (firstname, lastname, address). This creates ambiguity about what fields an unrestricted query returns. Add the fields clause:
- "query": "search source=accounts \"Holmes\"",
+ "query": "search source=accounts \"Holmes\" | fields firstname, lastname, address",
This aligns Example 1 with the explicit field selection pattern shown in the earlier POST example at line 16 and makes the expected output complete and unambiguous.
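For reference, a request body carrying the revised query might be assembled as in the sketch below. Several things here are assumptions: the top-level "highlight" key is taken from the walkthrough's sequence diagram, the contents of the highlight object (`fields` with a `"*"` wildcard) follow the OpenSearch highlighting DSL that the docs say is forwarded as-is, and `PplRequestSketch`, `jsonEscape`, and `buildBody` are hypothetical names. Nothing is sent over the network.

```java
// Builds a JSON body for POST /_plugins/_ppl by hand; a real client would use a JSON library.
final class PplRequestSketch {
  // Escapes backslashes and double quotes only, which is enough for this query text.
  static String jsonEscape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // Returns the request body; highlightJson may be null to omit the optional section.
  static String buildBody(String pplQuery, String highlightJson) {
    StringBuilder body = new StringBuilder();
    body.append("{\"query\": \"").append(jsonEscape(pplQuery)).append("\"");
    if (highlightJson != null) {
      body.append(", \"highlight\": ").append(highlightJson);
    }
    return body.append("}").toString();
  }

  public static void main(String[] args) {
    String query = "search source=accounts \"Holmes\" | fields firstname, lastname, address";
    // Assumed highlight shape: wildcard over all fields with default tags.
    System.out.println(buildBody(query, "{\"fields\": {\"*\": {}}}"));
  }
}
```

Because the `fields` clause pins the projection, the three-column schema in the documented response follows directly from the query rather than from the index mapping.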
"datarows": [
  ["Holmes", "Morgan", "123 Main St"],
  ["Jane", "Holmes", "456 Oak Ave"],
  ["John", "Smith", "880 Holmes Lane"]
],
Response format inconsistency: status field.
Example 1 (line 114) includes "status": 200 in the response, but the initial POST example (lines 51-53) and Example 2 (line 146) also include "status": 200, while the very first example (lines 21-53) does not show a status field at all.
For consistency and to avoid user confusion, ensure all example responses follow the same format. Either include status in all examples or document when it appears.
As per coding guidelines, documentation should follow existing patterns and structure consistently.
Also applies to: 139-143
- Highlighting is supported only in the Calcite engine.
- The backend forwards the highlight config as-is to OpenSearch. The same highlighting behavior and limitations as [OpenSearch's highlighting API](https://opensearch.org/docs/latest/search-plugins/searching-data/highlight/) apply.
- Piped commands (`where`, `sort`, `head`, `dedup`) narrow or reorder the result set but do not affect which terms are highlighted.
- Highlighting works with **single-source queries only**, consistent with DSL where highlighting is inherently single-index per request. Behavior with joins (`| join`), subqueries, and multi-source queries is not validated.
🧩 Analysis chain
🏁 Script executed:
find . -type f -name "*.java" -o -name "*.kt" | head -30
Repository: opensearch-project/sql
Length of output: 2727
🏁 Script executed:
# Search for test files related to highlighting
fd -e java -path "*/test/*" "highlight" 2>/dev/null | head -20
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Look for highlighting implementation and how it handles joins/multi-source
rg -l "highlight" --type=java -i | head -20
Repository: opensearch-project/sql
Length of output: 1766
🏁 Script executed:
# Search for join-related tests
fd -e java -path "*/test/*" "join" 2>/dev/null | head -10
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
cat ./sql/src/test/java/org/opensearch/sql/sql/antlr/HighlightTest.java
Repository: opensearch-project/sql
Length of output: 1162
🏁 Script executed:
# Search for highlighting implementation in PPL/SQL parsers
rg -C5 "highlight" --type=java ./ppl ./sql | head -100
Repository: opensearch-project/sql
Length of output: 10904
🏁 Script executed:
# Search for join-related highlighting or multi-source validation
rg -C3 "highlight" --type=java | rg -C3 "join|multi.*source|subquery" | head -50
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Look at the PPL parser/domain to see highlight handling
cat ./ppl/src/main/java/org/opensearch/sql/ppl/domain/PPLQueryRequest.java 2>/dev/null | head -200
Repository: opensearch-project/sql
Length of output: 3849
🏁 Script executed:
# Search for join implementation in SQL/PPL execution
rg -C5 "join" --type=java -i | grep -C3 "highlight" | head -50
Repository: opensearch-project/sql
Length of output: 638
🏁 Script executed:
# Look for any validation or error handling when processing highlights
rg -C3 "highlight" --type=java | grep -C2 "validate|error|fail|check" | head -50
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Search for how multi-source or subquery scenarios handle highlighting
fd -e java | xargs rg "getHighlight|highlight" | head -30
Repository: opensearch-project/sql
Length of output: 4339
🏁 Script executed:
# Check if there's any join-related test with highlight
find . -name "*JoinTest.java" -o -name "*MultiSourceTest.java" | head -10
Repository: opensearch-project/sql
Length of output: 140
🏁 Script executed:
# Examine the join test file to see if highlight is tested with joins
cat ./ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLJoinTest.java | head -200
Repository: opensearch-project/sql
Length of output: 9103
🏁 Script executed:
# Look for the highlight integration test to see what it covers
cat ./integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLHighlightIT.java | head -300
Repository: opensearch-project/sql
Length of output: 6838
🏁 Script executed:
# Search for any restrictions or validation related to highlight and joins
rg "highlight" --type=java -C3 | grep -C3 "restrict|join|single" | head -40
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Check the documentation file mentioned in the review
cat ./docs/user/ppl/interfaces/endpoint.md | sed -n '155,175p'
Repository: opensearch-project/sql
Length of output: 1699
🏁 Script executed:
# Search for any validation that rejects or restricts multi-source highlights
rg "highlight" --type=java -C5 | grep -C3 "SemanticCheck|validate|exception|error|throw" | head -40
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Check if there's any multi-source query validation for highlight
rg "multi.*source|subquery" --type=java -C3 | grep -C2 "highlight" | head -30
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Check if there's validation in query execution or planner that prevents multi-source with highlight
rg "pushDownHighlight|highlight" --type=java -C5 | grep -C3 "source|join|table" | head -50
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Look for any semantic checks or validation rules for highlight
fd -e java -path "*/semantic*" -o -path "*/validator*" | head -10
Repository: opensearch-project/sql
Length of output: 234
🏁 Script executed:
# Search for where highlight is actually processed/validated
rg "class.*Highlight" --type=java | head -20
Repository: opensearch-project/sql
Length of output: 1525
🏁 Script executed:
# Examine HighlightAnalyzer to see if there's validation for multi-source scenarios
cat ./core/src/main/java/org/opensearch/sql/analysis/HighlightAnalyzer.java
Repository: opensearch-project/sql
Length of output: 1586
🏁 Script executed:
# Check the test that throws SemanticCheckException for highlight
cat ./opensearch/src/test/java/org/opensearch/sql/opensearch/request/OpenSearchRequestBuilderTest.java | grep -C15 "SemanticCheckException.*highlight"
Repository: opensearch-project/sql
Length of output: 48
🏁 Script executed:
# Look at SQL highlight integration tests to see what's tested
cat ./integ-test/src/test/java/org/opensearch/sql/sql/HighlightFunctionIT.java
Repository: opensearch-project/sql
Length of output: 7756
🏁 Script executed:
# Get more context on the SemanticCheckException test for highlight
cat ./opensearch/src/test/java/org/opensearch/sql/opensearch/request/OpenSearchRequestBuilderTest.java | grep -B20 -A5 "SemanticCheckException" | grep -B20 -A5 "highlight"
Repository: opensearch-project/sql
Length of output: 2170
🏁 Script executed:
# Search for any validation that explicitly prevents multi-source highlight
rg "pushDownHighlight" --type=java -B5 -A10 | head -80
Repository: opensearch-project/sql
Length of output: 9815
🏁 Script executed:
# Search for how logical plans are executed when joins are involved with highlight
rg "LogicalHighlight|LogicalJoin" --type=java -C3 | head -60
Repository: opensearch-project/sql
Length of output: 7701
🏁 Script executed:
# Check if there's any rule or optimizer that handles highlight with joins
fd -e java -path "*/optimizer/rule*" | xargs rg "highlight" 2>/dev/null | head -20
Repository: opensearch-project/sql
Length of output: 2768
Clarify what happens when highlighting is used with joins or multi-source queries.
The phrase "Behavior with joins (| join), subqueries, and multi-source queries is not validated" is ambiguous. Based on the codebase, highlighting with multi-source queries is genuinely untested. Clarify whether:
- The system rejects these queries with an error,
- Highlighting is silently omitted from results, or
- Behavior is undefined and may produce unexpected results.
This helps users avoid confusion when combining highlight with joins or subqueries.
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Force-pushed from 0b4e450 to 583b870
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Although we only expose highlight via the PPL API today, internally can we make it generic:
- Make the highlight part of the PPL query as early as possible behind the API, as discussed in https://github.com//issues/5156#issuecomment-3924993361 (highlight function/command or search command parameter?)
- If not possible, can we accept arbitrary DSL and merge them with the SearchSourceBuilder already constructed in the index scan operator? That way future extensions don't each need their own end-to-end pipeline.
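The second option, merging caller-supplied DSL into the request the scan operator has already built, could look roughly like the sketch below. It models the request as a plain `Map` rather than OpenSearch's `SearchSourceBuilder`, and the allow-list of mergeable sections and the names `DslMergeSketch` and `mergeExtras` are assumptions for illustration; the thread does not say whether the PR restricts sections this way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Map-based analogue of merging extra DSL sections into a planner-built search request.
final class DslMergeSketch {
  // Hypothetical allow-list: sections a caller may add without overriding the planner.
  static final Set<String> MERGEABLE = new HashSet<>(Arrays.asList("highlight"));

  // Returns a copy of the planned request with the allowed extra sections added.
  static Map<String, Object> mergeExtras(Map<String, Object> planned, Map<String, Object> extras) {
    Map<String, Object> merged = new LinkedHashMap<>(planned);
    for (Map.Entry<String, Object> e : extras.entrySet()) {
      if (!MERGEABLE.contains(e.getKey())) {
        throw new IllegalArgumentException("unsupported section: " + e.getKey());
      }
      merged.putIfAbsent(e.getKey(), e.getValue()); // sections built by the planner win
    }
    return merged;
  }
}
```

Under this shape, a later extension would only need a new allow-list entry plus whatever response handling it requires, rather than a dedicated pipeline.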
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Hi @dai-chen. Thanks for the suggestion. After checking, I went with option 2: accept arbitrary DSL and merge it with the [...]
Why not option 1 (make highlight part of PPL syntax)?
The main issue is the response format. The V2 highlight() function returns highlights inline as columns in datarows:
{
  "schema": [
    { "name": "Tags", "type": "text" },
    { "name": "highlight('Tags')", "type": "nested" }
  ],
  "datarows": [
    ["yeast home-brew", ["<em>yeast</em> home-brew"]]
  ]
}
OSD Explore already knows how to consume highlights from DSL, where they come back as a separate metadata array parallel to the hits. Our current approach matches that shape:
{
  "schema": [{ "name": "Tags", "type": "text" }],
  "datarows": [["yeast home-brew"]],
  "highlights": [{ "Tags": ["<em>yeast</em> home-brew"] }]
}
If we went with the [...]
Current design:
The plumbing is now generic. Instead of a highlight-specific pipeline ([...]
To add a future extension (e.g. suggest, rescore, post_filter), you would: [...]
No new end-to-end plumbing needed. So please take a look again, thanks!
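To illustrate why the parallel-array shape is convenient for a consumer, here is a minimal sketch that pairs each row with its highlight entry by position. It assumes `highlights[i]` corresponds to `datarows[i]` and that a missing or null entry means "no highlights for this row"; both are assumptions drawn from the example response above, and `HighlightZipSketch` and `describe` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Pairs datarows with the parallel highlights array by index.
final class HighlightZipSketch {
  // Returns one line per row: the row values followed by that row's highlight fragments.
  static List<String> describe(
      List<List<Object>> datarows, List<Map<String, List<String>>> highlights) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < datarows.size(); i++) {
      boolean present = highlights != null && i < highlights.size() && highlights.get(i) != null;
      // Rows without highlight data fall back to an empty map.
      Map<String, List<String>> hl =
          present ? highlights.get(i) : Collections.<String, List<String>>emptyMap();
      out.add(datarows.get(i) + " -> " + hl);
    }
    return out;
  }
}
```

The schema and datarows stay untouched by highlighting, so a client that ignores the extra array keeps working.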
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
Description.
Related Issues
Check List
--signoff or -s.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.