Conversation
✅ Deploy Preview for rp-cloud ready!
fc9e91c to
7e0ff93
Compare
|
|
||
| [source,text] | ||
| ---- | ||
| cancelled due to OOM prevention |
There was a problem hiding this comment.
The `cancelled due to OOM prevention` error is a sibling of the primary user-facing error, `Query Out of Memory`.
`Query Out of Memory` is reported when a particular query exhausts all memory resources and has to be cancelled. This is normal behavior: the engine counts allocated memory and cancels the query rather than entering an unexpected state or a deadlock. With this error, the advice is to retry the query, or to cancel (or wait for) other concurrently running tasks before retrying. I feel like this page describes this case, but with the wrong error message.
The thing is, the engine doesn't track all allocations, so it doesn't have full control over allocated memory. This is where the `cancelled due to OOM prevention` error comes in.
The OOM prevention mechanism is an overseer: it monitors overall memory usage in an external, independent way. It's more of an emergency handler that quickly frees reclaimable resources to remain operational. However, triggering it means either that the untracked pool grew unexpectedly or that there is a serious problem with memory tracking, and it should almost always result in a bug report from the client, with access to the logs. This, I suspect, is more of a runbook/customer support scenario.
I don't know exactly what should be visible in the public documentation, but I feel like this page blends two problems, so there are two parts to describe. The first should definitely be visible to the user, with an explanation of why it happens. The second (the emergency one) is more of an issue/emergency. Maybe it should be in the docs too, but on a different page.
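To make the retry advice above concrete, here is a minimal client-side sketch. It assumes the client surfaces `Query Out of Memory` as a catchable exception; `QueryOutOfMemory` and `run_with_retry` are hypothetical names for illustration, not part of any Redpanda SQL client.

```python
import time


class QueryOutOfMemory(Exception):
    """Hypothetical stand-in for the client-side 'Query Out of Memory' error."""


def run_with_retry(run_query, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call run_query(); on QueryOutOfMemory, back off and try again.

    The delay doubles after each failed attempt so that other concurrently
    running queries get time to finish and release memory.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except QueryOutOfMemory:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

This only fits the `Query Out of Memory` case. Per the comment above, `cancelled due to OOM prevention` points at a tracking problem and is better reported than retried.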
There was a problem hiding this comment.
Right, let's focus on the user-facing error @kbatuigas
|
|
||
| [source,text] | ||
| ---- | ||
| cancelled due to OOM prevention |
There was a problem hiding this comment.
@kbatuigas where exactly does the user see this error? This doesn't appear to be a complete example.
What does that actually look like in the psql client? Let's show an actual real-world example.
Is this a standard Postgres error code?
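On the error-code question, a small sketch of how a client could tell a memory-class SQLSTATE from a generic one. The three codes are the ones named elsewhere in this thread (XX000, 53200, 53400) with their standard PostgreSQL condition names; which code Redpanda SQL actually returns is still an open question here, so treat the mapping as illustrative only.

```python
# SQLSTATE codes mentioned in this review thread, with the standard
# PostgreSQL condition names. Not a statement of what the engine returns.
KNOWN_CODES = {
    "XX000": "internal_error",
    "53200": "out_of_memory",
    "53400": "configuration_limit_exceeded",
}


def classify_sqlstate(code):
    """Return (class, condition name or None) for a five-character SQLSTATE."""
    if len(code) != 5:
        raise ValueError("SQLSTATE must be exactly five characters")
    return code[:2], KNOWN_CODES.get(code)


def is_memory_class(code):
    """Class 53 is 'insufficient resources' in PostgreSQL's error code table."""
    return classify_sqlstate(code)[0] == "53"
```

With psycopg2, the code would come from the exception's `pgcode` attribute. In psql, `\set VERBOSITY verbose` prints the SQLSTATE next to the message, which would be one way to capture the real-world example requested above.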
There was a problem hiding this comment.
I tried some digging with Claude and updated with what I found: https://github.com/redpanda-data/cloud-docs/pull/584/changes#diff-05dde76066e0f81b1b4af9298c747919bd040631aa9a72704836954902d8a59fR19 does that look right @Greketrotny
There was a problem hiding this comment.
A wrong metric (https://github.com/redpanda-data/cloud-docs/pull/584/changes#r3279031995), but everything else looks good.
| // "Recover from OOM cancellation" (concise; uses internal term) | ||
| // Keep "Memory management" (matches current nav label but doesn't signal action) | ||
|
|
||
| Redpanda SQL automatically cancels running queries on a node when the node's memory usage approaches its memory limit. If your application sees the following error, your queries have hit this protection: |
There was a problem hiding this comment.
@kbatuigas here I think we need to explain the overall memory limit principles of RP SQL so users understand why they might be seeing this kind of error (especially if we don't have a separate scale guide, let's give them a hint here).
Something like: "While RP SQL queries can process very large input sources (many TB), RP SQL query results (and intermediate results created by operations like joins and aggregations) must fit into the aggregate memory available to all nodes in the cluster (as reported by this metric ...._). All concurrently running queries contribute to total memory consumption, and any one query can cause the node memory limits to be hit, depending on other concurrent queries ..."
I think @Greketrotny should draft/update this ^^^
|
@Greketrotny @mattschumpert I restructured based on your comments: changed the page title to Troubleshoot Query Out-of-Memory Errors, put in a placeholder section for "How Redpanda SQL uses memory" based on Matt's suggestion (Grzegorz please edit), and kept a short "cancelled due to OOM prevention" section at the end: https://deploy-preview-584--rp-cloud.netlify.app/redpanda-cloud/sql/troubleshoot/query-out-of-memory/ Not sure whether we'd start a new doc entirely. Do we have enough end-user-facing content for it?
|
|
||
| // TODO: SME — confirm the recovery order above and whether a heuristic | ||
| // exists for choosing among them (for example, watching | ||
| // `oxla_process_memory_total` over time before deciding to scale). |
There was a problem hiding this comment.
No, oxla_process_memory_total can constantly be just below the limit, even when idle, as this metric also includes the memory allocated for cached files. To specifically monitor memory usage for the workload/queries, use the query_memory_consumption_total.
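A minimal sketch of reading the two metrics from a scrape, to show the distinction above. It assumes both are exposed in the standard Prometheus text format; the sample values and the `node` label are made up, and summing across label sets is an assumption about how the metric is broken down.

```python
def read_metric(exposition_text, name):
    """Sum all samples of `name` in Prometheus text-format output.

    Returns None if the metric does not appear at all.
    """
    total = 0.0
    found = False
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE/comment lines
        if "{" in line:
            # name{label="value",...} value [timestamp]
            metric_name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # name value [timestamp]
            metric_name, _, rest = line.partition(" ")
        if metric_name != name:
            continue
        total += float(rest.split()[0])
        found = True
    return total if found else None


# Hypothetical scrape output: values and labels are invented for illustration.
SAMPLE = """\
# hypothetical sample values
oxla_process_memory_total 15500000000
query_memory_consumption_total{node="0"} 2000000000
query_memory_consumption_total{node="1"} 1000000000
"""

process_bytes = read_metric(SAMPLE, "oxla_process_memory_total")
query_bytes = read_metric(SAMPLE, "query_memory_consumption_total")
```

In this made-up sample the process total sits near a limit while query memory is small, which is the idle-but-cached situation described above.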
| // (InternalError, the fall-through default in session.cpp). That's a generic | ||
| // "internal error" class, not a memory-class code like 53200 (out_of_memory) | ||
| // or 53400 (configuration_limit_exceeded). Confirm whether this is intentional | ||
| // or a bug to be fixed — if the SQLSTATE changes, add the |
|
The description of the errors, distinctions, consequences, and mitigations looks good to me. What's maybe missing is why all of this exists. The engine currently must fit all intermediate data/calculations in RAM (hashmaps for JOIN and GROUP BY, ORDER BY/TOP K heaps, network buffers), and there is no spilling implemented. Also, as @mattschumpert said, this doesn't mean that the whole data set must fit into the engine: simple operations have a small, constant footprint, and the engine can process an amount of data vastly greater than the available RAM. I'm sure an LLM will compose a nice description from the threads here. I hope that level of detail is sufficient for the public docs.
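A toy illustration of the footprint difference described above, in plain Python rather than engine code. A filter-and-count keeps one integer of state however many rows stream through, while a hash aggregation holds one entry per distinct key. `make_rows` and the column names are hypothetical.

```python
def make_rows(n, distinct_users):
    """Lazily yield n rows; nothing is materialized up front."""
    return ({"user": i % distinct_users, "value": i} for i in range(n))


def count_matching(rows, predicate):
    """Filter + COUNT(*): state is a single integer, whatever the input size."""
    count = 0
    for row in rows:
        if predicate(row):
            count += 1
    return count


def group_count(rows, key):
    """GROUP BY key + COUNT(*): state grows with the number of distinct keys."""
    groups = {}
    for row in rows:
        k = key(row)
        groups[k] = groups.get(k, 0) + 1
    return groups


even_values = count_matching(make_rows(100_000, 1_000), lambda r: r["value"] % 2 == 0)
per_user = group_count(make_rows(100_000, 1_000), lambda r: r["user"])
```

Here `per_user` ends up with 1,000 entries because there are 1,000 distinct keys. With high-cardinality keys and no spilling, that table is what has to fit in RAM.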
micheleRP
left a comment
There was a problem hiding this comment.
docs-team-standards review
Critical (must fix before merging to main)
- [query-out-of-memory.adoc:60, Monitor memory usage table] Wrong metric.
  - The page recommends `oxla_process_memory_total` to monitor query memory pressure. SME @Greketrotny flagged this in the review thread: "`oxla_process_memory_total` can constantly be just below the limit, even when idle, as this metric also includes the memory allocated for cached files. To specifically monitor memory usage for the workload/queries, use the `query_memory_consumption_total`."
  - Fix: swap the metric to `query_memory_consumption_total` and update the description so it reflects query-memory consumption, not RSS. If both metrics are worth showing, list both with clear "use this for X" guidance.
- [query-out-of-memory.adoc] Multiple unresolved `// TODO: SME` comments still in the source.
  - Three SME TODOs remain in the page: (a) the SQLSTATE XX000 question (@Greketrotny confirmed in the review thread: "yes, this is probably a bug"), (b) the "How Redpanda SQL uses memory" placeholder TODO that still says "rewrite this section per the PR 584 review thread", and (c) the recovery-order heuristic TODO.
  - These are fine on the `rp-sql` integration branch, but flag explicitly so they're resolved (or converted to a follow-up ticket) before the branch merges to `main`.
Suggestions (should consider)
- Learning objectives and `:description:`: backticks in attribute values don't render as monospace.
  - Current:

        :description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution.
        :learning-objective-2: Recover from a `Query out of memory` error and reduce its frequency

  - Why: Attribute values are stored as plain strings. When `{learning-objective-2}` is substituted into the checkbox list, AsciiDoc's quote substitution doesn't reliably re-process backticks inside attribute-substituted text, so the backticks render literally in HTML in many contexts.
  - Suggested: drop the backticks from these attributes. They're abstract outcomes / page summaries, not literal code references. For example:

        :description: Recover from query out-of-memory errors in Redpanda SQL and understand the memory limits that govern query execution.
        :learning-objective-2: Recover from a query out-of-memory error and reduce its frequency
- [query-out-of-memory.adoc, "Simplify the query" bullet] `JOIN`s won't render as monospace JOIN + plain s.
  - Current:

        Narrow the scan range with tighter `WHERE` filters, reduce the number of `JOIN`s, or break a large aggregation into smaller ones.

  - Why: AsciiDoc's constrained inline syntax (`` `text` ``) requires the closing backtick to be followed by a non-word character. A trailing letter like `s` breaks the close, so the markup renders literally as `` `JOIN`s `` instead of `JOIN`s.
  - Suggested: use unconstrained inline (double backticks) for pluralized monospace:

        ...reduce the number of ``JOIN``s, or break a large aggregation into smaller ones.

    Same fix applies anywhere else in the page where a backticked term is immediately followed by a letter.
- [query-out-of-memory.adoc:75] Heading "If you see `cancelled due to OOM prevention` instead" reads awkwardly.
  - Three problems: (a) inline code in a heading isn't standard for Redpanda docs and breaks heading typography; (b) "If you see ... instead" is conversational and conditional, which is the voice of a callout, not a section title; (c) it's long.
  - Suggested: rename to something declarative, e.g. `== Cancellations from OOM prevention` or `== OOM prevention cancellations`. Then the body sentence ("The `cancelled due to OOM prevention` error is a separate case…") carries the literal error string, which is the right place for it.
- PR title vs page title.
  - PR title "SQL: OOM" is very terse for a customer-facing doc change. Consider tightening to something like "SQL: troubleshooting query out-of-memory errors" to make the changelog/PR list more scannable. (Suggestion only; no functional impact.)
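The constrained-inline problem flagged in the suggestions above (a single-backticked term immediately followed by a letter) could be caught mechanically. A rough sketch, not an existing docs-team tool: the regex is a heuristic that only looks at single-backtick spans on one line, and `find_broken_monospace` is a hypothetical helper name.

```python
import re

# A single-backtick span that opens after a non-word, non-backtick character
# and whose closing backtick is immediately followed by a word character.
# Double-backtick (unconstrained) spans are deliberately not matched.
CONSTRAINED_THEN_WORD = re.compile(r"(?<![\w`])`([^`\s](?:[^`\n]*[^`\s])?)`(?=\w)")


def find_broken_monospace(text):
    """Return (line number, offending snippet) pairs, 1-indexed."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CONSTRAINED_THEN_WORD.finditer(line):
            # Include the trailing character that breaks the close.
            hits.append((lineno, match.group(0) + line[match.end()]))
    return hits
```

Running it over the "Simplify the query" bullet would report the pluralized `JOIN` and leave `WHERE` alone. After the double-backtick fix it should report nothing.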
Feediver1
left a comment
There was a problem hiding this comment.
Two supplementary findings (in addition to @micheleRP's review above)
Michele's docs-team-standards review above covers the metric correctness, TODO markers, attribute backticks, constrained-inline rendering, and heading style. Two cross-cutting items she didn't flag:
Critical — broken xref to a sibling-PR target
- `query-out-of-memory.adoc:56` uses `xref:sql:get-started/deploy-sql-cluster.adoc#scale-redpanda-sql[Scale Redpanda SQL]`. That target page lives in PR #571 (still OPEN) and isn't on `rp-sql` yet. The build will surface `target of xref not found` until #571 lands on `rp-sql`. Same merge-sequencing pattern as the rest of the SQL GA series.
  - The other two xrefs in this page (`show-execs.adoc`, `show-nodes.adoc`) resolve fine.
Suggestion — missing What's New entry (recurring across the SQL GA series)
- No entry for the SQL GA launch (or this troubleshooting page) in `modules/get-started/pages/whats-new-cloud.adoc`. This is the fifth PR in the SQL GA series with the same gap; see the same note on #571 / #574 / #575 / #580. A single coordinated "Redpanda SQL: General availability" entry should cover the whole release, with this troubleshooting page linked alongside the get-started / query / auth pages.
Final-pass review via /docs-team-standards:pr-review.
| @@ -0,0 +1,69 @@ | |||
| = Troubleshoot Query Out-of-Memory Errors | |||
| :description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. | |||
There was a problem hiding this comment.
| :description: Recover from `Query Out of Memory` errors in Redpanda SQL and understand the memory limits that govern query execution. | |
| :description: Recover from query out-of-memory errors in Redpanda SQL and understand the memory limits that govern query execution. |
suggest removing the backticks because they don't render on the index page
|
|
||
| If a Redpanda SQL query exhausts the memory available to it, the engine cancels the query and returns an error to the client: | ||
|
|
||
| [source,bash] |
There was a problem hiding this comment.
| [source,bash] | |
| [source,sql] |
| * xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. | ||
| * xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. |
There was a problem hiding this comment.
| * xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS]: inspect currently running queries on the cluster. | |
| * xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES]: list the SQL engine's nodes and their state. | |
| * xref:reference:sql/sql-statements/show-execs.adoc[SHOW EXECS] | |
| * xref:reference:sql/sql-statements/show-nodes.adoc[SHOW NODES] |
micheleRP
left a comment
There was a problem hiding this comment.
left a couple little style suggestions
Description
This pull request adds a new troubleshooting guide focused on handling memory-related query cancellations in Redpanda SQL. The page explains the automatic out-of-memory (OOM) protection mechanism, describes the client-facing error, and gives actionable steps for users to recover from or prevent repeated cancellations. It also provides guidance on monitoring memory usage and includes several TODOs for subject matter expert (SME) validation.
Key additions:
Troubleshooting documentation:
- `memory-management.adoc`, that explains how Redpanda SQL cancels queries when a node approaches its memory limit and how users can recover from or prevent these cancellations.
- `oxla_process_memory_total` Prometheus metric.

Guidance for further validation:
Resolves https://github.com/redpanda-data/documentation-private/issues/
Review deadline: 21 May
Page previews
Checks