LogDev Recovery CRC Mismatch

The issue happen in my local build SM 0.5.5 image, lets put here to see if it need more discussion or not.

## Environment

| Parameter | Value                      |
|-----------|----------------------------|
| **Cluster** | 908                        |
| **Namespace** | nuobject2sh-dev            |
| **Deployment** | sm-long-running1-1015      |
| **HomeObjVer** | homeobject/4.1.0@oss/v4    |
| **HomeStoreVer** | homestore/7.4.4@oss/master |

## Test Scenario

1. Start write 1 million blobs in 2 PGs
   - `blob_size`: 150K
   - `replica_count`: 3
   - `write:delete:read` ratio: 87:13
2. Randomly select one member to kill every 3 minutes
3. After SM1 was killed and restarted, it crashed during recovery

## Problem

LogDev recovery crashes when reading stale data from a previous journal cycle that coincidentally has matching `prev_grp_crc`.

```
[03/07/26 18:23:11.242] [storage_mgr] [error] [11] [log_stream.cpp:140:next_group] crc doesn't match 268939264 log_dev=1
[03/07/26 18:23:11.242] [storage_mgr] [critical] [11] data is corrupted 1
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] [stacktrace.cpp:133:crash_handler]
 * ****Received fatal SIGNAL : SIGABRT(6)	PID : 11
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] Thread num: 123956172627968 entered exit handler
```

## Crash Scenario

### Before Crash
- NuRaft lsn=65651 fully flushed → LogDev log_idx=65650 was written
- NuRaft lsn=65652 proposing → LogDev log_idx=65651 NOT written yet
- **SIGKILL** → offset 268939264 still contains old data from previous journal round

### After Restart (Recovery)

**log_dev=1 - Crashes:**

By adding more logging and restarting again, the detailed log is:

```
[03/10/26 03:44:51.544] [error] CRC mismatch: computed=3101365257 expected=511191246
    offset=268939264 is_at_last_chunk=true start_log_idx=65651
    prev_grp_crc=4037465583 m_prev_crc=4037465583 log_dev=1
[03/10/26 03:44:51.545] [critical] data is corrupted 1
```

Recovery reads:
1. ✓ log_idx=65650 successfully → sets m_prev_crc=4037465583
2. ✗ Reads offset 268939264 (old stale data):
   - Header shows prev_grp_crc=4037465583 → **matches** m_prev_crc ✓
   - Passes prev_crc check, footer check
   - CRC verification fails → **CRASH**

## Root Cause

This appears to be a coincidence issue where the data PVC is reused across multiple rounds of storage hammer testing. After discussion with AI, no additional evidence was found to confirm this is a code bug.

**The coincidence:** Old data's `prev_grp_crc` happens to match the newly written group's `cur_grp_crc` (both 4037465583).

```
SH test Round N-1 (long ago):
  offset 268939264 written with log_idx=M, prev_crc=4037465583

SH test Round N (now):
  offset 268935168: log_idx=65650, cur_crc=4037465583  ← just written
  offset 268939264: NOT written yet → still has old data from Round N-1
    → Old header: prev_crc=4037465583 (coincidentally matches current!)
    → But data content is stale → CRC mismatch
```


### GDB Evidence

```gdb
(gdb) p (homestore::log_group_header)(*0x6006709aba00)
magic = 15731998          ✓ Valid
logdev_id = 1             ✓ Matches
start_log_idx = 65651     ✓ Looks correct
prev_grp_crc = 4037465583 ✓ Matches m_prev_crc (coincidence!)
cur_grp_crc = 511191246   ✗ But data CRC doesn't match

(gdb) p (homestore::log_group_footer)(*105580775258883)
magic = 11537694          ✓ Valid
start_log_idx = 65651     ✓ Consistent with header

(gdb) p (homestore::serialized_log_record)(*105580775258672)
$9 = {size = 111, offset = 148, is_inlined = 1, store_seq_num = 65651, store_id = 0}
```


## Code Bug Discussion

**File:** `src/lib/logstore/log_stream.cpp`, line 138-148

CRC mismatch handling is **inconsistent** with other validation checks. What is the background for this difference? Can we add last_chunk tolerance here as well?

```cpp
// prev_crc check (line 87-100) - has last_chunk tolerance ✓
if (m_prev_crc != 0 && m_prev_crc != header->prev_grp_crc) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// Footer check (line 118-129) - has last_chunk tolerance ✓
if (footer->magic != LOG_GROUP_FOOTER_MAGIC || ...) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// CRC check (line 138-148) - NO last_chunk tolerance ✗
if (cur_crc != header->cur_grp_crc) {
    LOGERROR("CRC mismatch: ... is_at_last_chunk={}", is_at_last_chunk, ...);
    HS_REL_ASSERT(0, "data is corrupted");  // Always crashes!
}
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LogDev Recovery CRC Mismatch #869

Environment

Test Scenario

Problem

Crash Scenario

Before Crash

After Restart (Recovery)

Root Cause

GDB Evidence

Code Bug Discussion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Parameter	Value
Cluster	908
Namespace	nuobject2sh-dev
Deployment	sm-long-running1-1015
HomeObjVer	homeobject/4.1.0@oss/v4
HomeStoreVer	homestore/7.4.4@oss/master

LogDev Recovery CRC Mismatch #869

Description

Environment

Test Scenario

Problem

Crash Scenario

Before Crash

After Restart (Recovery)

Root Cause

GDB Evidence

Code Bug Discussion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions