During the sddz storage hammer test with a 4M blob size, a corner case was identified on the follower side. It is unclear whether this issue needs to be handled in the future. The details are as follows:
[09/23/25 01:40:32.992] [storage_mgr] [trace] [66] [raft_state_machine.cpp:59:localize_journal_entry_prepare] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Localizing Raft log_entry: server_id=1424293988, term=1, journal_entry=[version=1.1, code=HS_DATA_LINKED, server_id=1424293988, dsn=77473, header_size=48, key_size=8, value_size=8]
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [hs_blob_manager.cpp:431:blob_put_get_blk_alloc_hints] [traceID=14355508641916858133,shardID=0x1f4,pg=0,shard=0x1f4,blob=76470] Picked p_chunk_id=309, reserved_blks=16384
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [raft_repl_dev.cpp:1052:applier_create_req] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a]
[09/23/25 01:40:32.995] [storage_mgr] [trace] [66] [raft_repl_dev.cpp:1222:fetch_data_from_remote] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Fetching data from originator=1424293988, remote: rreq=[repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]], remote_blkid=[{blk#=239853 count=1025 chunk=309},], my server_id=556276154
[09/23/25 01:40:33.305] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1405:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Handling fetched data for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.315] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1423:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: the sha256 value of the data received in handle_fetch_data_response for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]] is 9974290253907391527
[09/23/25 01:40:33.315] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1453:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data fetched from remote: rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.445] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [common.cpp:176:set_lsn] [traceID=14355508641916858133] Setting lsn=77477 for request=repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=77477 state=[BLK_ALLOCATED | DATA_RECEIVED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [repl_log_store.cpp:22:append] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Received append log entry rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [raft_state_machine.cpp:192:pre_commit_ext] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Precommit rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:34.205] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
...
[09/25/25 20:51:42.562] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/25/25 20:51:42.736] [storage_mgr] [trace] [74] [raft_repl_dev.cpp:2121:flush_durable_commit_lsn] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Flushing durable commit lsn to 77420
[09/25/25 20:51:42.736] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1113:operator()] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] 64 pending reqs's data are written
[09/25/25 20:51:42.736] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1447:operator()] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data Write completed rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | ]], data_write_latency_us=241869429741, total_write_latency_us=241869171509, write_num_pieces=1
[09/25/25 20:51:42.737] [storage_mgr] [trace] [66] [home_raft_log_store.cpp:212] [replstore=0] end_of_append_batch flushed upto start=77421 cnt=64 lsn=77484
[09/25/25 20:51:42.737] [storage_mgr] [warning] [66] [handle_append_entries.cxx:1101:handle_append_entries] appending entries from peer 1424293988 took long time (241869745 ms)
The follower became stuck in the handle_fetch_data_response phase due to repeated EAGAIN errors, which prevented it from completing end_of_append_batch and committing subsequent logs. The root cause of the repeated EAGAIN errors remains unknown; the issue persisted for over two days before resolving itself (the logged data_write_latency_us of ~241,869 seconds matches the roughly 2.8-day gap between the 09/23 and 09/25 timestamps). The relevant logs are shown above.