Skip to content

LogDev Recovery CRC Mismatch #869

@Besroy

Description

@Besroy

The issue happen in my local build SM 0.5.5 image, lets put here to see if it need more discussion or not.

Environment

Parameter Value
Cluster 908
Namespace nuobject2sh-dev
Deployment sm-long-running1-1015
HomeObjVer homeobject/4.1.0@oss/v4
HomeStoreVer homestore/7.4.4@oss/master

Test Scenario

  1. Start write 1 million blobs in 2 PGs
    • blob_size: 150K
    • replica_count: 3
    • write:delete:read ratio: 87:13
  2. Randomly select one member to kill every 3 minutes
  3. After SM1 was killed and restarted, it crashed during recovery

Problem

LogDev recovery crashes when reading stale data from a previous journal cycle that coincidentally has matching prev_grp_crc.

[03/07/26 18:23:11.242] [storage_mgr] [error] [11] [log_stream.cpp:140:next_group] crc doesn't match 268939264 log_dev=1
[03/07/26 18:23:11.242] [storage_mgr] [critical] [11] data is corrupted 1
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] [stacktrace.cpp:133:crash_handler]
 * ****Received fatal SIGNAL : SIGABRT(6)	PID : 11
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] Thread num: 123956172627968 entered exit handler

Crash Scenario

Before Crash

  • NuRaft lsn=65651 fully flushed → LogDev log_idx=65650 was written
  • NuRaft lsn=65652 proposing → LogDev log_idx=65651 NOT written yet
  • SIGKILL → offset 268939264 still contains old data from previous journal round

After Restart (Recovery)

log_dev=1 - Crashes:

By adding more logging and restarting again, the detailed log is:

[03/10/26 03:44:51.544] [error] CRC mismatch: computed=3101365257 expected=511191246
    offset=268939264 is_at_last_chunk=true start_log_idx=65651
    prev_grp_crc=4037465583 m_prev_crc=4037465583 log_dev=1
[03/10/26 03:44:51.545] [critical] data is corrupted 1

Recovery reads:

  1. ✓ log_idx=65650 successfully → sets m_prev_crc=4037465583
  2. ✗ Reads offset 268939264 (old stale data):
    • Header shows prev_grp_crc=4037465583 → matches m_prev_crc ✓
    • Passes prev_crc check, footer check
    • CRC verification fails → CRASH

Root Cause

This appears to be a coincidence issue where the data PVC is reused across multiple rounds of storage hammer testing. After discussion with AI, no additional evidence was found to confirm this is a code bug.

The coincidence: Old data's prev_grp_crc happens to match the newly written group's cur_grp_crc (both 4037465583).

SH test Round N-1 (long ago):
  offset 268939264 written with log_idx=M, prev_crc=4037465583

SH test Round N (now):
  offset 268935168: log_idx=65650, cur_crc=4037465583  ← just written
  offset 268939264: NOT written yet → still has old data from Round N-1
    → Old header: prev_crc=4037465583 (coincidentally matches current!)
    → But data content is stale → CRC mismatch

GDB Evidence

(gdb) p (homestore::log_group_header)(*0x6006709aba00)
magic = 15731998          ✓ Valid
logdev_id = 1             ✓ Matches
start_log_idx = 65651     ✓ Looks correct
prev_grp_crc = 4037465583 ✓ Matches m_prev_crc (coincidence!)
cur_grp_crc = 511191246   ✗ But data CRC doesn't match

(gdb) p (homestore::log_group_footer)(*105580775258883)
magic = 11537694          ✓ Valid
start_log_idx = 65651     ✓ Consistent with header

(gdb) p (homestore::serialized_log_record)(*105580775258672)
$9 = {size = 111, offset = 148, is_inlined = 1, store_seq_num = 65651, store_id = 0}

Code Bug Discussion

File: src/lib/logstore/log_stream.cpp, line 138-148

CRC mismatch handling is inconsistent with other validation checks. What is the background for this difference? Can we add last_chunk tolerance here as well?

// prev_crc check (line 87-100) - has last_chunk tolerance ✓
if (m_prev_crc != 0 && m_prev_crc != header->prev_grp_crc) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// Footer check (line 118-129) - has last_chunk tolerance ✓
if (footer->magic != LOG_GROUP_FOOTER_MAGIC || ...) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf;  // Last chunk → graceful end
}

// CRC check (line 138-148) - NO last_chunk tolerance ✗
if (cur_crc != header->cur_grp_crc) {
    LOGERROR("CRC mismatch: ... is_at_last_chunk={}", is_at_last_chunk, ...);
    HS_REL_ASSERT(0, "data is corrupted");  // Always crashes!
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions