[SPARK-57183][SS][4.2] Close LRUCache on RocksDB.close() in unbounded memory mode#56281
[SPARK-57183][SS][4.2] Close LRUCache on RocksDB.close() in unbounded memory mode#56281kete1987 wants to merge 1 commit into
Conversation
…ry mode ### What changes were proposed in this pull request? In unbounded memory mode (the default, `boundedMemoryUsage = false`), `RocksDBMemoryManager` creates a new `LRUCache` per instance: https://github.com/apache/spark/blob/d7df1920cfb8577c17a9b1555592d3772f061417/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBMemoryManager.scala#L185 but `RocksDB.close()` never calls `lruCache.close()`: https://github.com/apache/spark/blob/d7df1920cfb8577c17a9b1555592d3772f061417/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDB.scala#L2125-L2135 The Java `LRUCache` wrapper holds a C++ `shared_ptr<Cache>`, so the native object is only freed when the JVM GC finalizes the wrapper — which rarely happens under low heap pressure. This causes native memory to accumulate until GC eventually runs, leading to OOM kills in long-running processes or CI runs with many RocksDB-heavy test suites. The fix adds an explicit `lruCache.close()` call in `RocksDB.close()` for unbounded mode. In bounded mode the cache is a shared singleton managed by `RocksDBMemoryManager` and must not be closed per instance. This is a separate issue from SPARK-56523 (Statistics native memory leak), which was already fixed. ### Why are the changes needed? Without explicit `close()`, each `RocksDB` instance in unbounded mode leaks one `LRUCache` worth of native memory (`blockCacheSizeMB`, default 8 MB) for as long as GC does not run. The memory is never reclaimed deterministically. A standalone reproducer tool confirms ~8.5 MB of native memory growth per open/close cycle in leak mode vs flat memory in fixed mode: https://github.com/kete1987/rocksdb-leak-tool ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test in `RocksDBSuite` (`SPARK-57183: LRUCache is closed on RocksDB.close() in unbounded memory mode`) that verifies the native handle is released after `close()` via `LRUCache.isOwningHandle()`. ### Was this patch authored or co-authored using generative AI tooling? Co-authored with Claude (Anthropic), used for analysis, code generation and review assistance. Generated-by: Claude Sonnet 4.6 I affirm that the contribution is my original work and that I license the work to the project under the project's open source license. Closes apache#56234 from kete1987/SPARK-57183-rocksdb-lrucache-leak. Authored-by: Ivan Morales <kete1987@gmail.com> Signed-off-by: Anish Shrigondekar <anish.shrigondekar@databricks.com>
|
@anishshri-db this is a backport of #56234 to branch-4.2. Please merge whenever convenient. |
Do we need the backport for 4.2 ? Usually we only merge critical fixes to cut branches |
Yes, I think it's worth including in 4.2.0 — the release is still in preview (preview5, no final RC yet), and this is a resource leak in the default configuration ( |
|
I'm not sure this is really a leak as such. maybe we can solicit some other opinions here ? cc - @cloud-fan / @HeartSaVioR - any thoughts here around the backport policy we should apply in this case ? |
|
I think the change is safe (not sure if it's critical fix or not), and 4.2 is not released yet, so +1 to backport. |
|
the CI failure is unrelated, merging to 4.2! |
… memory mode ### What changes were proposed in this pull request? Backport of #56234 to branch-4.2. ### Why are the changes needed? See #56234. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? See #56234. ### Was this patch authored or co-authored using generative AI tooling? Co-authored with Claude (Anthropic), used for analysis, code generation and review assistance. Generated-by: Claude Sonnet 4.6 Closes #56281 from kete1987/SPARK-57183-rocksdb-lrucache-leak-branch-4.2. Authored-by: Iván Morales <kete1987@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Backport of #56234 to branch-4.2.
Why are the changes needed?
See #56234.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
See #56234.
Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude (Anthropic), used for analysis, code generation and review assistance.
Generated-by: Claude Sonnet 4.6