Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures by linmajia · Pull Request #261 · microsoft/rDSN

linmajia · 2026-06-30T12:59:59Z

Summary

This branch hardens the NFS file-copy plugin (used by replica learning) and
several core disk-engine / AIO / filesystem paths so that recoverable I/O
failures and corrupt or untrusted RPC input no longer abort the host process.
Each fix converts a fatal dassert/uncaught-exception/OOB-access into a
graceful error return through a channel the surrounding code already has
(NFS completion callback, aio callback, or a bool/error_code return).
Happy-path behavior is unchanged; the new code runs only on failure paths.

It also fixes two empty-file (0-byte) copy regressions and advances the
rDSN.dist.service submodule pointer to its merged master
(the replication/meta-server hardening, merged separately as
rDSN.dist.service#30).

Motivation

The NFS service and the shared disk/AIO engine run on every replica node inside
one process. A single recoverable disk error (short write on a full disk,
fsync/close/stat failure, io_destroy during shutdown) or one malformed
NFS RPC from a peer could dassert/throw and abort the whole host — taking down
every healthy replica with it. Several of these were also memory-safety bugs
(OOB reads, uninitialized-heap disclosure into replicated data).

Changes

NFS — untrusted-input hardening

end_get_file_size mismatched list lengths — the response's parallel
size_list/file_list arrays from an untrusted peer were indexed in
lockstep without checking equal length → OOB std::vector::operator[] read.
Now validated up front; rejected with ERR_INVALID_DATA.
end_copy response-size validation — the client wrote
response.file_content.data() for response.size bytes without checking
size against the buffer length → past-the-end read corrupting the
destination with heap memory (disclosure into replicated data) or crash.
Now cross-checked before the local write; rejected with ERR_INVALID_DATA.
(Client-side counterpart of the validation on_copy already does on the
request.)
on_get_file_size source_dir prefix — the server stripped the
source_dir prefix via path.substr(request.source_dir.length()), using the
raw untrusted length as a position into an already-normalized path. A
non-normalized source_dir (e.g. redundant slashes) makes the normalized
path shorter → std::out_of_range thrown off an RPC worker thread (no
try/catch in dispatch) → NFS service abort. Both loops now guard the length
and reject with ERR_INVALID_PARAMETERS.

NFS — recoverable-failure & empty-file handling

File-copy I/O aborts → error channel — continue_write,
handle_completion, close_file GC, and the all-files on_get_file_size
branch now route create_directory/dsn_file_open/dsn_file_close/
file_size failures through the completion channel (or dwarn) instead of
aborting. The local-write pump now iterates instead of recurses, so a
persistent error can't grow the stack.
Zero-byte file copy — empty source files (one size == 0 copy request)
previously failed the whole transfer: first because on_copy rejected
size <= 0 (relaxed to size < 0, keeping the oversized-request overflow
guard), and then because a zero-length aio read/write completes with
ERR_HANDLE_EOF, not ERR_OK. The server now short-circuits a size == 0
request with an empty ERR_OK reply and the client finalizes the segment
inline as success — no zero-length aio is issued.
Short-read uninitialized data — internal_read_callback set resp.size
to the requested size while shipping the full block buffer, so a short read
wrote uninitialized heap into the destination (corruption + disclosure). Now
fails the block with ERR_FILE_OPERATION_FAILED when fewer bytes than
requested are read.

Core — disk engine, AIO, filesystem

disk_file::on_write_completed short write — asserted the provider always
wrote the full requested size (false on a full disk). Now reports
ERR_FILE_OPERATION_FAILED for the affected and remaining batched tasks via
the aio channel and preserves the trailing size == 0 invariant.
~native_linux_aio_provider io_destroy — a destructor asserted
io_destroy() == 0, aborting at shutdown if the kernel AIO context was
already released (EINVAL). Now logs derror and lets shutdown proceed.
remove_directory nftw walk — the nftw callback aborted on any
typeflag other than FTW_F/FTW_DP, but FTW_NS (per-entry stat failure)
and FTW_DNR (unreadable dir) are recoverable. Now dwarn + FTW_STOP so
the caller returns false instead of crashing (no-op on the Windows walk).

Submodule

rDSN.dist.service pointer → master — advanced to the merged-master
commit containing the replication-log / meta-server / mutation::read_from
hardening (merged as rDSN.dist.service#30).

The NFS file-copy plugin (used by replica learning) aborted the whole host via dassert on recoverable file I/O failures. Convert these to use the existing error/completion channels, and make the local-write pump iterate instead of recurse so a persistent error cannot grow the stack. nfs_server_impl.cpp: - on_get_file_size (all-files branch): on file_size() failure, set ERR_FILE_OPERATION_FAILED and break instead of aborting; resp.error already carries the failure to the client. - close_file GC: on dsn_file_close failure, dwarn instead of aborting; erase the handle from the map before closing. nfs_client_impl.cpp: - continue_write: on create_directory / dsn_file_open failure, finalize the user request via handle_completion(ERR_FILE_OPERATION_FAILED) instead of aborting. - continue_write: iterate instead of recurse on per-file failures to keep stack depth O(1) under persistent errors. - handle_completion: on dsn_file_close failure, dwarn instead of aborting.

end_get_file_size() iterates resp.size_list but indexes the parallel resp.file_list with the same index. The response is deserialized from a remote, untrusted peer; if it arrives with size_list longer than file_list, resp.file_list[i] is an out-of-bounds std::vector::operator[] read (constructing a std::string from garbage), corrupting memory or crashing the host. Validate that the two parallel arrays have equal length right after the existing resp.error check, and fail the request through the existing nfs_task channel with ERR_INVALID_DATA, matching the adjacent early-return error paths. A well-behaved server fills both lists in lockstep, so normal behavior is unchanged.

remove_directory()'s nftw callback aborted the whole process via dassert when nftw reported a typeflag other than FTW_F/FTW_DP. nftw delivers FTW_NS when a per-entry stat() fails (transient I/O error, permission problem, or a concurrent removal race) and FTW_DNR for an unreadable directory; both are recoverable and remove_directory()/remove_path() already return a bool. Replace the dassert with a dwarn + FTW_STOP so the walk stops and the caller returns false instead of crashing the host. The Windows custom walk only ever passes FTW_F/FTW_DP, so this is a no-op there.

on_copy rejected request.size <= 0, but size == 0 is legitimate: the client emits exactly one copy request per file, and an empty source file yields a single request with size 0. Empty files are listed by on_get_file_size without filtering, so any directory containing even one empty file failed the entire multi-file transfer / replica-learn with ERR_INVALID_PARAMETERS. Reject only request.size < 0 so a zero length flows through the original open + zero-length file::read path (restoring pre-hardening behavior) while keeping the oversized-request heap-overflow guard intact.

disk_file::on_write_completed asserted that the provider always wrote at least the requested number of bytes, aborting the process on a short write (which can happen on a full disk). Report ERR_FILE_OPERATION_FAILED for the affected task and the remaining batched tasks through the aio callback channel instead, and stop consuming the (now invalid) written size so the trailing size==0 invariant still holds.

~native_linux_aio_provider asserted io_destroy() returned 0, which aborts during shutdown if the kernel AIO context was already released or is otherwise invalid (e.g. EINVAL). A destructor must not abort: log the failure with derror (matching the other I/O error sites in this file) and let shutdown proceed.

Empty (0-byte) source files made the whole file transfer fail. A zero-length aio read/write completes with ERR_HANDLE_EOF (not ERR_OK), so the single size==0 copy request an empty file produces was reported as a failure -- the prior size-guard relaxation alone was not enough. Fix both sides to avoid issuing zero-length aio: the server short-circuits a size==0 copy request with an empty ERR_OK reply (no open/read), and the client finalizes the segment inline as success (the destination is already created by its O_CREAT open) via the loop, not recursion. Short reads also shipped uninitialized memory: internal_read_callback ignored the actual bytes read and set resp.size to the requested size while sending the full make_shared_array block buffer, so on a short read the client wrote uninitialized heap into the destination -- file corruption and a heap-memory disclosure into replicated data. Fail the copy block via ERR_FILE_OPERATION_FAILED when fewer bytes than requested are read. Add core.aio_zero_length pinning the zero-length read/write == ERR_HANDLE_EOF contract these fixes depend on.

Point rDSN.dist.service at the error-handling-follow-up branch (bf631e7), which hardens replica log replay and checkpoint copy against corrupt/untrusted input.

Advance the rDSN.dist.service submodule pointer from bf631e7 to 9f490ec, which converts the log_file read-mode constructor's file_size() dassert abort into a graceful ERR_FILE_OPERATION_FAILED rejection via the existing log_file::open_read() error channel.

Point rDSN.dist.service to 18ad52f, which converts the abort-on-recoverable- log-file-creation-failure in mutation_log::mark_new_offset into the existing _io_error_callback failover path and guards write_pending_mutations against a null current log file.

Advance the rDSN.dist.service pointer to 9d2d9dc, which makes log_file::flush() and log_file::close() report recoverable disk-syscall failures through their existing error channels instead of aborting the replica server via dassert.

Point to rDSN.dist.service b93df50, which makes mutation::read_from return nullptr instead of throwing and converts its callers to nullptr checks.

The NFS client issued file::write(response.file_content.data(), response.size, ...) using the size and buffer from the remote server's copy_response without cross-checking them. A response whose declared size is negative or larger than the actual file_content buffer would read past the end of that buffer, corrupting the destination file with adjacent heap memory (disclosed into the replicated data) or crashing. This is the client-side counterpart of the size validation on_copy already performs on the request. end_copy now rejects such responses through the existing handle_completion(ERR_INVALID_DATA) channel before the local write.

on_get_file_size returned each file path relative to source_dir via path.substr(request.source_dir.length()), using the raw untrusted source_dir length as the position into a path that get_subfiles/path_combine had already normalized (consecutive separators collapsed, trailing separator stripped). A peer sending a non-normalized source_dir (e.g. with redundant slashes) makes the normalized path shorter than the prefix length, so substr() throws std::out_of_range. The RPC dispatch chain has no try/catch, so the exception unwinds off the worker thread and aborts the NFS service, which runs on every replica node. Guard both loops: if request.source_dir.length() exceeds the path length the source_dir cannot be a valid prefix, so reject the request with ERR_INVALID_PARAMETERS instead of throwing. The guard only triggers for a non-normalized source_dir; valid requests are unaffected.

The error-handling-follow-up changes in the rDSN.dist.service submodule were squash-merged into its master branch (microsoft#30). Advance the submodule pointer from the now-deleted branch commit to the merged master commit.

HX Lin added 17 commits June 30, 2026 10:37

Sync dist service submodule

5dc32fb

Point rDSN.dist.service at the error-handling-follow-up branch (bf631e7), which hardens replica log replay and checkpoint copy against corrupt/untrusted input.

Sync dist service submodule (mutation::read_from returns nullptr)

e7edd19

Point to rDSN.dist.service b93df50, which makes mutation::read_from return nullptr instead of throwing and converts its callers to nullptr checks.

Sync dist service submodule (harden restore_from_local_storage)

a3c25cf

Sync dist service submodule (harden dump_file::read_next_buffer)

314d5c2

linmajia merged commit b04ec67 into microsoft:master Jun 30, 2026
2 checks passed

linmajia deleted the error-handling-follow-up branch June 30, 2026 13:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures#261

Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures#261
linmajia merged 17 commits into
microsoft:masterfrom
linmajia:error-handling-follow-up

linmajia commented Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

linmajia commented Jun 30, 2026

Summary

Motivation

Changes

NFS — untrusted-input hardening

NFS — recoverable-failure & empty-file handling

Core — disk engine, AIO, filesystem

Submodule

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant