During the sddz storage hammer test with a 4M blob size, a corner case was identified on the follower side. It is unclear whether this issue needs to be handled in the future. The details are as follows:
[09/23/25 01:40:32.992] [storage_mgr] [trace] [66] [raft_state_machine.cpp:59:localize_journal_entry_prepare] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Localizing Raft log_entry: server_id=1424293988, term=1, journal_entry=[version=1.1, code=HS_DATA_LINKED, server_id=1424293988, dsn=77473, header_size=48, key_size=8, value_size=8]
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [hs_blob_manager.cpp:431:blob_put_get_blk_alloc_hints] [traceID=14355508641916858133,shardID=0x1f4,pg=0,shard=0x1f4,blob=76470] Picked p_chunk_id=309, reserved_blks=16384
[09/23/25 01:40:32.992] [storage_mgr] [debug] [66] [raft_repl_dev.cpp:1052:applier_create_req] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a]
[09/23/25 01:40:32.995] [storage_mgr] [trace] [66] [raft_repl_dev.cpp:1222:fetch_data_from_remote] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Fetching data from originator=1424293988, remote: rreq=[repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=-1 state=[BLK_ALLOCATED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]], remote_blkid=[{blk#=239853 count=1025 chunk=309},], my server_id=556276154
[09/23/25 01:40:33.305] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1405:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Handling fetched data for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.315] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1423:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: the sha256 value of the data received in handle_fetch_data_response for rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]] is 9974290253907391527
[09/23/25 01:40:33.315] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1453:handle_fetch_data_response] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data fetched from remote: rreq=[dsn=77473 term=1 lsn=-1 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | ]], data_size: 4198400, total_size: 4198400, local_blkid: [{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.445] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [common.cpp:176:set_lsn] [traceID=14355508641916858133] Setting lsn=77477 for request=repl_key=[server=1424293988, term=1, dsn=77473, hash=1424240324], lsn=77477 state=[BLK_ALLOCATED | DATA_RECEIVED | ] m_headersize=48 m_keysize=8 is_proposer=false local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] remote_blkid=[{blk#=239853 count=1025 chunk=309},]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [repl_log_store.cpp:22:append] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Raft Channel: Received append log entry rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:33.564] [storage_mgr] [trace] [66] [raft_state_machine.cpp:192:pre_commit_ext] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Precommit rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | LOG_RECEIVED | ]]
[09/23/25 01:40:34.205] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
...
[09/25/25 20:51:42.562] [storage_mgr] [error] [19] [uring_drive_interface.cpp:514:handle_completions] Error in completion of io, iocb=0x7181e3430310, result=-11, retry=0
[09/25/25 20:51:42.736] [storage_mgr] [trace] [74] [raft_repl_dev.cpp:2121:flush_durable_commit_lsn] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Flushing durable commit lsn to 77420
[09/25/25 20:51:42.736] [storage_mgr] [trace] [19] [raft_repl_dev.cpp:1113:operator()] [traceID=n/a] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] 64 pending reqs's data are written
[09/25/25 20:51:42.736] [storage_mgr] [debug] [19] [raft_repl_dev.cpp:1447:operator()] [traceID=14355508641916858133] [rdev0:2d1b59fa-0a5e-43f9-9377-3933cae3af4a] Data Channel: Data Write completed rreq=[dsn=77473 term=1 lsn=77477 op=HS_DATA_LINKED local_blkid=[[{blk#=239853 count=1025 chunk=309},] ] state=[BLK_ALLOCATED | DATA_RECEIVED | DATA_WRITTEN | LOG_RECEIVED | ]], data_write_latency_us=241869429741, total_write_latency_us=241869171509, write_num_pieces=1
[09/25/25 20:51:42.737] [storage_mgr] [trace] [66] [home_raft_log_store.cpp:212] [replstore=0] end_of_append_batch flushed upto start=77421 cnt=64 lsn=77484
[09/25/25 20:51:42.737] [storage_mgr] [warning] [66] [handle_append_entries.cxx:1101:handle_append_entries] appending entries from peer 1424293988 took long time (241869745 ms)
The follower became stuck in the handle_fetch_data_response phase due to repeated EAGAIN errors, which prevented it from completing end_of_append_batch and committing subsequent logs. The root cause of the repeated EAGAIN errors remains unknown; the issue persisted for over two days before resolving itself (the logged data_write_latency_us of ~241,869 seconds matches the roughly 2.8-day gap between the 09/23 and 09/25 timestamps). The relevant logs are shown above.