This issue occurred in my locally built SM 0.5.5 image; I'm filing it here to see whether it needs more discussion.
## Environment

| Parameter | Value |
| --- | --- |
| Cluster | 908 |
| Namespace | nuobject2sh-dev |
| Deployment | sm-long-running1-1015 |
| HomeObjVer | homeobject/4.1.0@oss/v4 |
| HomeStoreVer | homestore/7.4.4@oss/master |
## Test Scenario

- Write 1 million blobs across 2 PGs:
  - blob_size: 150K
  - replica_count: 3
  - write:delete:read ratio: 87:13
- Randomly select one member to kill every 3 minutes
- After SM1 was killed and restarted, it crashed during recovery
## Problem

LogDev recovery crashes when reading stale data from a previous journal cycle that coincidentally has a matching `prev_grp_crc`.

```
[03/07/26 18:23:11.242] [storage_mgr] [error] [11] [log_stream.cpp:140:next_group] crc doesn't match 268939264 log_dev=1
[03/07/26 18:23:11.242] [storage_mgr] [critical] [11] data is corrupted 1
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] [stacktrace.cpp:133:crash_handler]
* ****Received fatal SIGNAL : SIGABRT(6) PID : 11
[03/07/26 18:23:11.309] [storage_mgr] [critical] [11] Thread num: 123956172627968 entered exit handler
```
## Crash Scenario

### Before Crash

- NuRaft lsn=65651 fully flushed → LogDev log_idx=65650 was written
- NuRaft lsn=65652 proposing → LogDev log_idx=65651 NOT written yet
- SIGKILL → offset 268939264 still contains old data from a previous journal round

### After Restart (Recovery)

log_dev=1 crashes during recovery. After adding more logging and restarting again, the detailed log is:

```
[03/10/26 03:44:51.544] [error] CRC mismatch: computed=3101365257 expected=511191246
offset=268939264 is_at_last_chunk=true start_log_idx=65651
prev_grp_crc=4037465583 m_prev_crc=4037465583 log_dev=1
[03/10/26 03:44:51.545] [critical] data is corrupted 1
```
Recovery reads:

- ✓ log_idx=65650 read successfully → sets m_prev_crc=4037465583
- ✗ Reads offset 268939264 (old stale data):
  - Header shows prev_grp_crc=4037465583 → matches m_prev_crc ✓
  - Passes the prev_crc check and the footer check
  - CRC verification fails → CRASH
## Root Cause

This appears to be a coincidence issue where the data PVC is reused across multiple rounds of storage hammer testing. After discussion with AI, no additional evidence was found to confirm this is a code bug.

The coincidence: the old data's `prev_grp_crc` happens to match the newly written group's `cur_grp_crc` (both 4037465583).

```
SH test Round N-1 (long ago):
  offset 268939264 written with log_idx=M, prev_crc=4037465583

SH test Round N (now):
  offset 268935168: log_idx=65650, cur_crc=4037465583  ← just written
  offset 268939264: NOT written yet → still has old data from Round N-1
    → Old header: prev_crc=4037465583 (coincidentally matches current!)
    → But data content is stale → CRC mismatch
```
## GDB Evidence

```
(gdb) p (homestore::log_group_header)(*0x6006709aba00)
magic         = 15731998    ✓ Valid
logdev_id     = 1           ✓ Matches
start_log_idx = 65651       ✓ Looks correct
prev_grp_crc  = 4037465583  ✓ Matches m_prev_crc (coincidence!)
cur_grp_crc   = 511191246   ✗ But data CRC doesn't match

(gdb) p (homestore::log_group_footer)(*105580775258883)
magic         = 11537694    ✓ Valid
start_log_idx = 65651       ✓ Consistent with header

(gdb) p (homestore::serialized_log_record)(*105580775258672)
$9 = {size = 111, offset = 148, is_inlined = 1, store_seq_num = 65651, store_id = 0}
```
## Code Bug Discussion

File: `src/lib/logstore/log_stream.cpp`, lines 138-148

The CRC-mismatch handling is inconsistent with the other validation checks. What is the background for this difference? Can we add the last_chunk tolerance here as well?

```cpp
// prev_crc check (lines 87-100) - has last_chunk tolerance ✓
if (m_prev_crc != 0 && m_prev_crc != header->prev_grp_crc) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf; // Last chunk → graceful end
}

// Footer check (lines 118-129) - has last_chunk tolerance ✓
if (footer->magic != LOG_GROUP_FOOTER_MAGIC || ...) {
    if (!m_vdev_jd->is_offset_at_last_chunk(*out_dev_offset)) {
        HS_REL_ASSERT(0, "data is corrupted");
    }
    return ret_buf; // Last chunk → graceful end
}

// CRC check (lines 138-148) - NO last_chunk tolerance ✗
if (cur_crc != header->cur_grp_crc) {
    LOGERROR("CRC mismatch: ... is_at_last_chunk={}", is_at_last_chunk, ...);
    HS_REL_ASSERT(0, "data is corrupted"); // Always crashes!
}
```