Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion modules/ROOT/nav.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -361,7 +361,7 @@
*** xref:sql:manage/manage-access.adoc[Manage access]
** xref:sql:troubleshoot/index.adoc[Troubleshoot]
*** xref:sql:troubleshoot/degraded-state-handling.adoc[]
*** xref:sql:troubleshoot/memory-management.adoc[Memory Management]
*** xref:sql:troubleshoot/query-out-of-memory.adoc[Query Out-of-Memory Errors]

* xref:develop:index.adoc[Develop]
** xref:develop:kafka-clients.adoc[]
Expand Down
69 changes: 69 additions & 0 deletions modules/sql/pages/troubleshoot/query-out-of-memory.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
= Troubleshoot Query Out-of-Memory Errors
:description: Recover from query out-of-memory errors in Redpanda SQL and understand the memory limits that govern query execution.
:page-topic-type: troubleshooting
:personas: platform_admin, data_engineer
:learning-objective-1: Identify when a query was canceled because it ran out of memory
:learning-objective-2: Recover from a query out-of-memory error and reduce its frequency
:learning-objective-3: Monitor node memory usage to anticipate memory pressure

If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client:

[source,sql]
----
ERROR: Query Out of Memory!
----

Canceling the query frees its memory and allows the engine to continue serving other queries. This is a normal protection mechanism and is not a sign of cluster failure.

Use this page to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}
* [ ] {learning-objective-3}

== How Redpanda SQL uses memory

Redpanda SQL queries can read very large input sources (many terabytes), and simple operations such as filtering or projection process input incrementally with a small, roughly constant memory footprint. In those cases the engine can process far more data than fits in RAM.

Memory pressure comes from operations that materialize intermediate state: hash tables for `JOIN` and `GROUP BY`, heaps for `ORDER BY` and top-K, and network buffers between nodes. All of this intermediate state, along with the final result set, must fit into the aggregate memory available across the cluster. The engine does not spill intermediate state to disk, so a query that builds intermediate structures larger than available memory is canceled rather than slowed.

All concurrently running queries contribute to total memory consumption, so a single query can hit the node memory limit because of pressure from other queries running at the same time.

== Recover from the error

When a single query fails with `Query out of Memory`, retry it. The error frees the query's memory, so the next attempt often succeeds, especially if other concurrent queries have completed in the meantime.

If the same query keeps failing, the query itself is too memory-hungry for the current cluster size, or too many other queries are competing for memory at the same time. Reduce the query's memory footprint or reduce concurrent load:

* Reduce concurrency.
+
Run fewer queries in parallel against the cluster. Other queries running at the same time contribute to the total memory pressure.
* Simplify the query.
+
Narrow the scan range with tighter `WHERE` filters, reduce the number of joins, or break a large aggregation into smaller ones. Operations that materialize wide intermediate results (joins, sorts, distinct aggregations) drive memory consumption the most.
* Scale the cluster.
+
Add SQL nodes to increase the aggregate memory available to queries. See xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL].

== Monitor memory usage

Use the following Prometheus gauge to track memory consumed by the SQL workload and watch for sustained growth toward the node's limit:

[cols="1,3"]
|===
| Metric | Description

| `query_memory_consumption_total`
| Memory consumed by queries on the node, in bytes. Use this metric to monitor workload memory usage. Unlike `oxla_process_memory_total`, which includes cached files and can sit near the limit even when the node is idle, `query_memory_consumption_total` reflects only memory attributed to query execution.
|===

== OOM prevention cancellations

The `cancelled due to OOM prevention` error is a separate case. Redpanda SQL's engine includes an overseer that monitors overall node memory independently of per-query accounting. When the overseer detects that the untracked memory pool has grown unexpectedly, it cancels running queries on the affected node to keep the engine operational.

This condition is rare and almost always indicates a bug in memory accounting or an unexpected workload pattern. Collect the cluster logs from around the time of the error and contact https://support.redpanda.com/hc/en-us/requests/new[Redpanda Support^].

== Suggested reading

* xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]
* xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]