SDSTOR-22200 #883

Open
shosseinimotlagh wants to merge 10 commits into eBay:stable/v3.8.x from shosseinimotlagh:SDSTOR-22200

Conversation

@shosseinimotlagh (Contributor) commented May 8, 2026

Prevent OOM during volume recovery by releasing cached btree nodes

Problem Statement

During multi-volume recovery, HomeStore experiences OOM (Out of Memory) crashes when the total cached btree nodes across all volumes exceed available memory. This occurs because:

  1. Cross-volume cache accumulation: During vol_recovery_start_phase2(), each volume's verify_tree() loads all btree nodes into cache sequentially. Without eviction between volumes, total cache size = sum(all volumes).

  2. Clean node waste: After recovery completes, cached nodes from verify_tree() provide zero value but continue consuming cache quota, leaving no room for dirty buffers that actually need writeback.

  3. Peak memory formula: With N volumes and a per-node cost of buffer_size plus ~320B of overhead (MemVector + MemPiece + BtreeNode) — 832 bytes for a 512B buffer — peak memory during recovery reaches:
    Total = N × nodes_per_volume × (buffer_size + overhead)
    Example: 13 volumes × 6.86M nodes × 832 bytes ≈ 77.3GB, exceeding typical 60GB limits.

Root Cause Analysis (GDB Session Evidence)

Extensive GDB investigation on production pod revealed:

  • RSS breakdown: 53.14GB total, with btree cache consuming ~40GB
  • Zero dirty buffers: All 13 volumes showed dirty_buf_cnt=0, ruling out dirty buffer accumulation
  • Zero LRU eviction: No evidence of cache eviction during recovery
  • Verified accumulation: Cache = sum(all 13 volumes) instead of max(single volume)
  • Per-node cost: 832 bytes confirmed (buffer + 320 bytes overhead for MemVector/MemPiece/BtreeNode structures)
  • Primary allocator: FixedBlkAllocator MPMC queue consumed 10.06GB

Solution

This PR implements a two-pronged approach to prevent OOM:

1. Iterative Btree Verification (Performance Improvement)

  • Add verify_node_fowarding() with iterative BFS traversal using std::queue
  • Releases BtreeNodePtr immediately after processing each node
  • Benefit: Reduces lock hold time on ancestor nodes during single-tree traversal
  • Important: This is NOT a memory leak fix, only improves lock contention

2. Safe Cache Release After Recovery (OOM Prevention)

  • Add release_cached_tree() to safely evict btree node buffers after each volume's recovery
  • Add free_node_from_cache() in ssd_btree.hpp to evict buffers with proper safety checks:
    • Not locked (try_lock succeeds)
    • Not dirty (no pending writeback)
    • Reference count permits eviction (accounting for wb_cache permanent reference)
  • Key mechanism: Cache::erase(cache_only=true) removes buffer from hash_set/eviction list while wb_cache retains reference for checkpoint flush

3. Integration & Configuration

  • Call release_cached_tree() in vol_recovery_start_phase2() AFTER recovery_start_phase2() completes for each volume
  • Add release_cache_after_recovery config flag (default: true) with hotswap support
  • Add recovery_cache_release_latency metric to track performance overhead
  • Update HomeBlksRecoveryStats to log cache release time

Technical Details

Cache Eviction Safety Checks:

bool free_node_from_cache(bnodeid_t node_id, uint32_t node_size) {
    homestore::BlkId bid(node_id);
    auto req = writeback_req_t::make_request();
    req->isSyncCall = true;  // Critical: prevents null deref on wb_req->bcp->cp_id
    auto buf = m_blkstore->read(bid, 0, node_size, req, true /* cache_only */);
    if (!buf) { return true; }               // Already evicted
    if (!buf->try_lock()) { return false; }  // Locked by another thread; skip
    buf->unlock();
    buf.reset();  // Drop our reference before attempting eviction
    m_blkstore->free_blk(bid, boost::none, boost::none, true);  // cache_only=true
    // Verify eviction succeeded: a cache-only re-read should now miss
    auto verify_req = writeback_req_t::make_request();
    verify_req->isSyncCall = true;
    auto verify_buf = m_blkstore->read(bid, 0, node_size, verify_req, true);
    return (verify_buf == nullptr);
}

Delegation Chain:
- HomeBlks::vol_recovery_start_phase2() → Volume::release_cached_tree() → mapping::release_cached_tree() → Btree::release_cached_tree() → ssd_btree::free_node_from_cache()

Benefits

1. Limits peak memory: max(single volume) instead of sum(all volumes)
  - Example: 6.86GB peak instead of 77.3GB total
2. Frees cache space for dirty buffers: Clean nodes from verify_tree() waste quota; evicting them makes room for dirty buffers needing writeback
3. Defensive against future bugs: Reduces OOM risk from ANY dirty buffer accumulation scenario
4. Zero data loss: cache_only=true ensures wb_cache retains reference for checkpoint flush
5. Configurable: Runtime hotswap via release_cache_after_recovery flag
6. Observable: recovery_cache_release_latency metric tracks performance cost

Changes Summary

Modified Files:
- src/engine/homeds/btree/btree.hpp: Add iterative verify + cache release methods
- src/engine/homeds/btree/ssd_btree.hpp: Add free_node_from_cache() primitive
- src/homeblks/volume/mapping.{hpp,cpp}: Delegation layer updates
- src/homeblks/volume/volume.{hpp,cpp}: Top-level volume interface updates
- src/homeblks/home_blks.{hpp,cpp}: Recovery orchestration + metrics + stats
- src/homeblks/homeblks_config.fbs: Config flag definition
- src/engine/blkalloc/*: Memory tracking improvements (prerequisite commits)

Metrics & Configuration:
- New gauge: recovery_cache_release_latency
- New config: release_cache_after_recovery (default: true, hotswap enabled)
- Updated: HomeBlksRecoveryStats to include m_cache_release_ms

Code Style:
- Applied clang-format across entire codebase for consistency

Testing Notes

- Requires testing on multi-volume recovery scenario (13+ volumes)
- Monitor recovery_cache_release_latency metric for performance overhead
- Verify peak RSS stays within bounds: max(single_volume) instead of sum(all_volumes)
- Confirm dirty_buf_cnt can increase without OOM after cache release
- Test config flag: verify behavior with release_cache_after_recovery=false

The class was named BlkAllocMetrics but is used exclusively by
VarsizeBlkAllocator and its slab subsystem (SlabMetrics registers it as
its parent). The new name is accurate and leaves room for a separate
FixedBlkAllocMetrics class.
FixedBlkAllocMetrics: new metrics class with num_alloc, num_alloc_failure
counters and a blk_alloc_memory_size gauge. Wired into FixedBlkAllocator
constructor; gauge is set once at construction since MPMC queue capacity
is fixed (total_blks x 12B per slot: 6B BlkId + 6B folly sequence field,
confirmed via GDB on 461GB index device = 10.06GB).

VarsizeBlkAllocMetrics: adds the same blk_alloc_memory_size gauge, updated
once after FreeBlkCacheQueue construction via the new total_slab_capacity()
helper (sum of all slab entry_capacity() x 12B per slot = 1.60GB for the
production data blkstore configuration).
The cache evictor's get_size() callback was returning only m_cache_size (the
raw node data buffer), but each cached entry allocates additional heap memory:

  actual_cost = m_cache_size + sizeof(CacheBufferType)
                             + sizeof(homeds::MemVector)
                             + sizeof(homeds::MemPiece)

This missing overhead caused two interconnected bugs:

1. cache_size metric underreported actual RSS by factor of:
   (m_cache_size + overhead) / m_cache_size

2. Evictor's m_cur_size tracked only m_cache_size, so when m_cur_size
   approached m_max_size, actual RSS was already at:
   m_max_size × (m_cache_size + overhead) / m_cache_size

   This prevented eviction from triggering before OOM.

The fix adds the overhead using compile-time sizeof() for each component,
ensuring both the cache_size gauge and eviction threshold (m_cur_size vs
m_max_size) accurately reflect true memory consumption.

For typical 512B btree nodes: overhead ≈ 320B → multiplier ≈ 1.625×
For larger 4KB nodes: same absolute overhead → multiplier ≈ 1.078×

The formula scales correctly across different node sizes without hardcoding.
Add verify_node_fowarding() using std::queue for iterative tree traversal.
Benefits over recursive verify_node_recursive():
- Reduces lock contention: ancestor nodes unlocked while processing descendants
- Enables cache eviction during traversal: ref_count drops to 1 immediately per node
- Lower peak stack depth for deep trees

This does NOT solve cross-volume cache accumulation during recovery. The actual
OOM fix is force_evict() after each volume completes (separate commit).

Add recursive parameter (default false) to verify_tree() propagated through:
- btree.hpp: verify_tree(update_debug_bm, recursive)
- mapping.hpp/cpp: verify_tree delegation
- volume.hpp/cpp: verify_tree delegation

Dispatcher verify_node() routes to verify_node_recursive() when recursive=true,
verify_node_fowarding() when recursive=false (default).
Add release_cached_tree() to evict cached btree nodes when safe:
- Buffer not locked (try_lock succeeds)
- Buffer not dirty (no pending writeback)
- ref_count permits eviction (only cache/wb_cache hold refs)

Implementation:
- btree.hpp: EvictionStats struct + release_cached_tree() + release_cached_nodes_recursive()
- ssd_btree.hpp: free_node_from_cache() with isSyncCall=true to avoid null deref on wb_req->bcp
- mapping.hpp: delegation wrapper
- volume.hpp: delegation wrapper

Safety: Cache::erase (cache_only=true) correctly handles wb_cache refs - buffer removed
from hash_set/eviction list while wb_cache retains ref for checkpoint flush.

Use case: Call after verify_tree() during recovery to release per-volume cache before
processing next volume, preventing cross-volume memory accumulation.
In vol_recovery_start_phase2(), call release_cached_tree() after recovery_start_phase2()
finishes for each volume. This proactively evicts clean cached btree nodes, providing:

1. Limits peak memory: max(single volume) instead of sum(all volumes)
   Without this fix, cache accumulates across volumes during sequential recovery.

2. Frees cache space for dirty buffers: Clean nodes from verify_tree() consume cache
   quota but provide zero value after recovery. Evicting them makes room for dirty
   buffers that need cache space for writeback/checkpoint.

3. Defensive against future bugs: Reduces OOM risk from any dirty buffer accumulation
   by not wasting cache on stale clean buffers.

Critical ordering per volume:
1. verify_tree() - loads nodes into cache for verification
2. recovery_start_phase2() - io_replay needs nodes in cache
3. release_cached_tree() - safe to evict after io_replay completes
Add release_cache_after_recovery boolean flag to GeneralConfig in homeblks_config.fbs
with default value true. This allows runtime control of cache release behavior during
volume recovery phase 2.

When enabled (default), cached btree nodes are safely evicted after each volume's
recovery to limit peak memory usage and free cache space for dirty buffers.

The flag is marked as hotswap, allowing dynamic configuration updates without restart.
Add recovery_cache_release_latency gauge to HomeBlksMetrics and m_cache_release_ms
field to HomeBlksRecoveryStats to track time spent releasing cached btree nodes
during volume recovery phase 2.

The metric accumulates time across all volumes and logs per-volume cache release
duration. This enables monitoring the performance overhead of cache eviction and
helps identify if the cache release operation becomes a bottleneck during recovery.
Run ./apply-clang-format.sh to format all C++ source files according to
src/.clang-format style configuration. This ensures consistent code style
across the codebase.
homestore::BlkId bid(node_id);

auto req = writeback_req_t::make_request();
req->isSyncCall = true;
@szmyd (Collaborator) commented May 8, 2026:
isSyncCall = true prevents a null deref on wb_req->bcp->cp_id - but that's a call-site workaround. The null check belongs in the wb layer.

Comment thread src/engine/cache/cache.h
// This formula ensures cache_size metric and eviction threshold reflect true RSS cost
static constexpr uint32_t k_overhead{sizeof(CacheBufferType) + sizeof(homeds::MemVector) +
                                     sizeof(homeds::MemPiece)};
return cbuf->get_cache_size() + k_overhead;
@szmyd (Collaborator) commented May 8, 2026:
get_size() feeds Evictor::add_record directly - this changes m_cur_size accounting for all IO paths, not just recovery. The accounting fix is correct (old code undercounted RSS), but at any configured m_max_size the cache holds ~38% fewer nodes than before. Tuned configs may need revisiting.

*
* @return : true if this node including all its children are not corrupted; false if not
*/
bool verify_node_fowarding(bnodeid_t bnodeid, BtreeNodePtr parent_node, uint32_t indx, bool update_debug_bm) {
A Collaborator commented:
Typo: verify_node_fowarding → verify_node_forwarding

return true;
}

void release_cached_nodes_recursive(bnodeid_t root_bnodeid, EvictionStats& stats) {
A Collaborator commented:
Name says recursive; implementation is iterative BFS (std::queue, no self-call). Rename to release_cached_nodes_bfs or just release_cached_nodes.
