Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion conanfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@

class HomeObjectConan(ConanFile):
name = "homeobject"
version = "4.1.10"
version = "4.1.11"

homepage = "https://github.com/eBay/HomeObject"
description = "Blob Store built on HomeStore"
Expand Down
315 changes: 315 additions & 0 deletions docs/adr/blob-index-analyze.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,315 @@
# HomeObject Blob Index Analyze

**Date:** 2026-05-06
**Purpose:** Reference for blob index sizing and capacity planning

---
## Execution Summary

### Scenario 1: 20TB Data with 2GB Memory Budget

**Configuration:**
- Total data: 20TB
- `io_mem_size`: 2GB
- **Fixed cache pool:** 2GB × 65% = **1.3 GB** (always allocated)

| Blob Size | Blob Count | Index Disk | Hard Memory | Cache Pool | Cache Coverage |
|-----------|------------|------------|-------------|------------|----------------|
| **8 KB** | 2.56B | 60.1 GB | 117 MB | 1.3 GB | **2.16%** ⚠️ |
| **256 KB**| 80M | 1.88 GB | 3.67 MB | 1.3 GB | **69%** ✓ |


---

### Scenario 2: 128TB Data with 2GB Memory Budget

**Configuration:**
- Total data: 128TB
- `io_mem_size`: 2GB
- **Fixed cache pool:** 2GB × 65% = **1.3 GB** (always allocated)

| Blob Size | Blob Count | Index Disk | Hard Memory | Cache Pool | Cache Coverage |
|-----------|------------|------------|-------------|------------|----------------|
| **8 KB** | 16.4B | 385 GB | 770 MB | 1.3 GB | **0.34%** ⚠️⚠️ |
| **256 KB**| 512M | 12.0 GB | 23.4 MB | 1.3 GB | **10.8%** ✓ |


---

## Action Items
### Shrink down the size of Index_vdev
#### [HDD SKU]
Now we use 45% of META drive (200GB*45% =90GB) on HDD SKUs. The number is generally correct as worst case we can consume up to 60GB.
Shrinking it down from 90GB to 60GB saves 60MB memory which is negligible.
#### [QLC SKU]
QLC node total data capacity: ~128 TB.
5% of QLC size (114 TB × 5% = 5721.96 GB) consumes ~11 GB memory for allocator.
Changing it to 0.5% based on below calculation, as a result, Index size would be ~500GB with Hard Memory ~1GB.
Also, we probably should tune down the log pct from 10% to 0.5% (570GB) as well, but this tuning should be done via config
to avoid failing the (vdev too small)

```
Worst case assuming blob size is 8KB, the Index size should be ~0.3% of DataSize

(DataSize/8K)*(4KB/167)*1.01 ==> DataSize * (4K/8K)/167 * 1.01 = DataSize * 0.5/167 * 1.01 ≈ DataSize * 0.3%.
```
### Configure mem_size for QLC nodes.

For QLC nodes we have ~35GB+ memory for a SM, bumping the app_mem_size allows more index be cached in memory, however,
the performance gap is still subject to performance evaluation.

## Quick Estimation Formulas

**For any blob size and data volume:**

```
1. Blob count (N) = Data size / Avg blob size

2. Index disk space ≈ (N / 167) × 4KB × 1.01
(1.01 factor accounts for internal nodes)

3. Hard memory = Index disk / 4096 × 8 bytes
(Allocator) (Always resident, non-negotiable)

4. Soft memory = min(io_mem_size × 65%, Index disk × 10%)
(Working set) (Capped by global memory budget; "Index disk × 10%" in tables
is a conservative working-set estimate, not the actual cap)

5. Dirty buffer limit = io_mem_size × 10%
(Transient peak) (Shared across all writes, not index-specific)
```

**Memory Components:**
- **Hard (Allocator):** Always resident in `folly::MPMCQueue`, holds ALL free blocks
- Independent of `io_mem_size`, scales with index vdev size
- Formula: `(index_disk / 4096) × 8 bytes` (4B `blk_num_t` + 4B `atomic<uint32_t>` sequence per slot)

- **Soft (Working Set):** LRU-managed btree node cache
- **Capped by:** `io_mem_size × cache_size_percent / 100` (default 65%)
- **Shared by:** All indexes + data service caches
- **Not capped by:** Index vdev size

- **Transient (Dirty):** Temporary write buffers
- **Capped by:** `io_mem_size × dirty_buf_percent / 100` (default 10%)
- Freed after checkpoint flush

---
## Quick Reference

| Node Type | Fanout | Entry Size | Calculation |
|-----------|--------|------------|-------------|
| **Leaf** | **167** | 24 bytes | (4096 - 72) / 24 = 167.67 → 167 |
| **Internal** | **125** | 32 bytes | (4096 - 72) / 32 = 125.75 → 125 |

---

## Detailed Breakdown

### Node Structure (4KB nodes)

```
Total node size: 4096 bytes
Header size: 72 bytes (persistent_hdr_t)
Data area: 4024 bytes
```

### Leaf Nodes (Store blob → physical address mappings)

**Entry Structure:**
```cpp
Key: BlobRoute (shard_id + blob_id) = 16 bytes
Value: MultiBlkId (physical block addr) = 8 bytes
Total: = 24 bytes
```

**Capacity:**
```
Fanout = floor(4024 / 24) = 167 entries per leaf
Wasted space = 4024 - (167 × 24) = 16 bytes (99.6% utilization)
```

### Internal Nodes (Store routing keys → child pointers)

**Entry Structure:**
```cpp
Key: BlobRoute (separator key) = 16 bytes
Value: BtreeLinkInfo (child pointer) = 16 bytes
- bnodeid_t (child node ID) = 8 bytes
- link_version (concurrency ctrl) = 8 bytes
Total: = 32 bytes
```

**Capacity:**
```
Fanout = floor(4024 / 32) = 125 children per internal node
Wasted space = 4024 - (125 × 32) = 24 bytes (99.4% utilization)
```

---

## Key Details

### Type Definitions
```cpp
// From HomeObject
using shard_id_t = uint64_t; // 8 bytes
using blob_id_t = uint64_t; // 8 bytes
using bnodeid_t = uint64_t; // 8 bytes (HomeStore)

#pragma pack(1)
struct BlobRoute {
shard_id_t shard; // 8 bytes
blob_id_t blob; // 8 bytes
}; // Total: 16 bytes (packed)
```

### Internal Node Keys
- Internal nodes store **actual BlobRoute keys** (same as leaf keys)
- Keys are **separator keys**: the **last key** from the left child subtree
- NOT synthetic/aggregate keys - real blob identifiers

### Index Sizing Formula

For N blobs:
```
Leaf nodes = ceil(N / 167)
Level 1 = ceil(Leaf nodes / 125)
Level 2 = ceil(Level 1 / 125)
...
Total size = Total nodes × 4096 bytes
```

**Example:** 1 million blobs
```
Leaf: 5,989 nodes
Level 1: 48 nodes
Level 2: 1 node (root)
Total: 6,038 nodes = 23.59 MB
```

---

## Node Implementation

- **Node Type:** `SimpleNode` (fixed-size key/value pairs)
- **Packing:** Sequential, no per-entry overhead
- **Layout:** `[K₀V₀][K₁V₁][K₂V₂]...` (tightly packed)
- **Alignment:** None - entries packed contiguously

---

## Memory Overhead

For index vdev size S:
- **Allocator overhead:** ~0.2% of S
- Uses `FixedBlkAllocator` (4KB fixed block size)
- Maintains all free blocks in memory via `folly::MPMCQueue`
- Each slot: 8 bytes (4B `blk_num_t` + 4B `atomic<uint32_t>` sequence)
- For 10GB index: ~20 MB allocator memory

**Total overhead per blob:** ~24-25 bytes (including tree structure overhead)

---

## Production Scaling Analysis

### Memory Cap Model

**Global Memory Budget:** `io_mem_size` (configuration parameter)
```cpp
// From homestore_decl.hpp
uint64_t io_mem_size() const {
return (hugepage_size != 0) ? hugepage_size : app_mem_size;
}
```

**Index Working Set Cap:**
```cpp
// From resource_mgr.cpp:179
max_index_cache = io_mem_size × cache_size_percent / 100
= io_mem_size × 65% (default from homestore_config.fbs)
```

**Dirty Buffer Cap:**
```cpp
// From resource_mgr.cpp:233
max_dirty_buffers = io_mem_size × dirty_buf_percent / 100
= io_mem_size × 10% (default)
```

**Key Points:**
1. **Global cap trumps index size:** Working set is limited by `io_mem_size × 65%`, NOT index vdev size
- Example: 10GB index but `io_mem_size = 1GB` → max cache = 650MB (not 6.5GB)
- LRU evictor (created in homestore.cpp:280) enforces this limit

2. **Allocator memory is separate:** Not counted in cache budget, always required
- Hard memory = ~0.2% of index disk (8 bytes per 4KB block)

3. **Shared budget:** `io_mem_size` serves ALL HomeStore components (index, data, log)
- Index cache competes with data cache for this budget
- Multi-index deployments share the same `cache_size_percent` pool

**In the tables below:**
- "Soft Memory" shows 65% of index disk (theoretical maximum)
- Actual working set = `min(index_disk × 65%, io_mem_size × 65%)`
- Configure `io_mem_size` large enough to avoid excessive cache thrashing

---

### Case 1: 256KB Blobs

| Data Size | Blob Count | Index Disk | Hard Memory (Allocator) | Soft Memory (Working Set)* |
|-----------|------------|------------|-------------------------|----------------------------|
| **1 TB** | 4.0M | 95.9 MB | 187 KB | 10 MB (10% of index) |
| **20 TB** | 80.0M | 1.88 GB | 3.67 MB | 188 MB (10% of index) |
| **128 TB** | 512.0M | 12.0 GB | 23.4 MB | 1.2 GB (10% of index) |

\* *Actual = min(shown value, `io_mem_size × 65%`). Configure `io_mem_size` accordingly.*

**Calculation:**
```
Blob count = Data size / 256KB
Index disk = ceil(Blob count / 167) × 4KB + overhead (3-level tree)
Hard memory = Index disk / 4KB blocks × 12 bytes (allocator queue)
Soft memory = min(Index disk × 10%, io_mem_size × 65%)
```

---

### Case 2: 8KB Blobs (Worst Case)

| Data Size | Blob Count | Index Disk | Hard Memory (Allocator) | Soft Memory (Working Set)* |
|-----------|------------|------------|-------------------------|----------------------------|
| **1 TB** | 128.0M | 3.01 GB | 5.88 MB | 0.3 GB (10% of index) |
| **20 TB** | 2.56B | 60.1 GB | 117 MB | 6 GB (10% of index) |
| **128 TB** | 16.4B | 385 GB | 770 MB | 38 GB (10% of index) |

\* *Actual = min(shown value, `io_mem_size × 65%`). Large deployments need substantial `io_mem_size`.*

**Calculation:**
```
Blob count = Data size / 8KB
Index disk = ceil(Blob count / 167) × 4KB + overhead (4-5 level tree)
Hard memory = Index disk / 4KB blocks × 12 bytes (allocator queue)
Soft memory = min(Index disk × 10%, io_mem_size × 65%)
```

---



## References

- **HomeStore:** `/Users/xiaoxchen/Code/HomeStore`
- Btree: `src/include/homestore/btree/detail/simple_node.hpp`
- Block allocator: `src/lib/blkalloc/fixed_blk_allocator.{h,cpp}`

- **HomeObject:** `/Users/xiaoxchen/Code/HomeObject`
- Index KV: `src/lib/homestore_backend/index_kv.hpp`
- Blob route: `src/lib/blob_route.hpp`

---

**Last Updated:** 2026-05-06
**Validated Against:** HomeStore commit d9cf5c10, HomeObject current
**Includes:** Production scaling for 1TB-128TB with 256KB and 8KB blob sizes
18 changes: 18 additions & 0 deletions src/lib/homestore_backend/hs_backend_config.fbs
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,9 @@ namespace homeobjectcfg;
attribute "hotswap";
attribute "deprecated";

// Per-service device type override for format_and_start.
// AUTO (default) defers to the built-in hybrid/single mode logic.
enum HSDevTypeOverride : uint8 { AUTO = 0, FAST = 1, DATA = 2 }

table HSBackendSettings {
// timer thread freq in us
Expand Down Expand Up @@ -50,6 +53,21 @@ table HSBackendSettings {
// Check traffic ready before get, default true, set false for 1-replica workaround mode to allow read
check_traffic_ready_before_get: bool = true;

// Per-service device type and size_pct overrides for format_and_start.
// dev_type defaults to AUTO (0 = defer to hybrid/single code logic).
// size_pct defaults to -1.0 (not-set); any value >= 0 overrides the built-in default.
meta_dev_type: HSDevTypeOverride = AUTO;
meta_size_pct: float = -1.0;

log_dev_type: HSDevTypeOverride = AUTO;
log_size_pct: float = -1.0;

index_dev_type: HSDevTypeOverride = AUTO;
index_size_pct: float = -1.0;

replication_dev_type: HSDevTypeOverride = AUTO;
replication_size_pct: float = -1.0;

}

root_type HSBackendSettings;
Loading
Loading