Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures#261
Merged
Conversation
added 17 commits
June 30, 2026 10:37
The NFS file-copy plugin (used by replica learning) aborted the whole host via dassert on recoverable file I/O failures. Convert these to use the existing error/completion channels, and make the local-write pump iterate instead of recurse so a persistent error cannot grow the stack. nfs_server_impl.cpp: - on_get_file_size (all-files branch): on file_size() failure, set ERR_FILE_OPERATION_FAILED and break instead of aborting; resp.error already carries the failure to the client. - close_file GC: on dsn_file_close failure, dwarn instead of aborting; erase the handle from the map before closing. nfs_client_impl.cpp: - continue_write: on create_directory / dsn_file_open failure, finalize the user request via handle_completion(ERR_FILE_OPERATION_FAILED) instead of aborting. - continue_write: iterate instead of recurse on per-file failures to keep stack depth O(1) under persistent errors. - handle_completion: on dsn_file_close failure, dwarn instead of aborting.
end_get_file_size() iterates resp.size_list but indexes the parallel resp.file_list with the same index. The response is deserialized from a remote, untrusted peer; if it arrives with size_list longer than file_list, resp.file_list[i] is an out-of-bounds std::vector::operator[] read (constructing a std::string from garbage), corrupting memory or crashing the host. Validate that the two parallel arrays have equal length right after the existing resp.error check, and fail the request through the existing nfs_task channel with ERR_INVALID_DATA, matching the adjacent early-return error paths. A well-behaved server fills both lists in lockstep, so normal behavior is unchanged.
remove_directory()'s nftw callback aborted the whole process via dassert when nftw reported a typeflag other than FTW_F/FTW_DP. nftw delivers FTW_NS when a per-entry stat() fails (transient I/O error, permission problem, or a concurrent removal race) and FTW_DNR for an unreadable directory; both are recoverable and remove_directory()/remove_path() already return a bool. Replace the dassert with a dwarn + FTW_STOP so the walk stops and the caller returns false instead of crashing the host. The Windows custom walk only ever passes FTW_F/FTW_DP, so this is a no-op there.
on_copy rejected request.size <= 0, but size == 0 is legitimate: the client emits exactly one copy request per file, and an empty source file yields a single request with size 0. Empty files are listed by on_get_file_size without filtering, so any directory containing even one empty file failed the entire multi-file transfer / replica-learn with ERR_INVALID_PARAMETERS. Reject only request.size < 0 so a zero length flows through the original open + zero-length file::read path (restoring pre-hardening behavior) while keeping the oversized-request heap-overflow guard intact.
disk_file::on_write_completed asserted that the provider always wrote at least the requested number of bytes, aborting the process on a short write (which can happen on a full disk). Report ERR_FILE_OPERATION_FAILED for the affected task and the remaining batched tasks through the aio callback channel instead, and stop consuming the (now invalid) written size so the trailing size==0 invariant still holds.
~native_linux_aio_provider asserted io_destroy() returned 0, which aborts during shutdown if the kernel AIO context was already released or is otherwise invalid (e.g. EINVAL). A destructor must not abort: log the failure with derror (matching the other I/O error sites in this file) and let shutdown proceed.
Empty (0-byte) source files made the whole file transfer fail. A zero-length aio read/write completes with ERR_HANDLE_EOF (not ERR_OK), so the single size==0 copy request an empty file produces was reported as a failure -- the prior size-guard relaxation alone was not enough. Fix both sides to avoid issuing zero-length aio: the server short-circuits a size==0 copy request with an empty ERR_OK reply (no open/read), and the client finalizes the segment inline as success (the destination is already created by its O_CREAT open) via the loop, not recursion. Short reads also shipped uninitialized memory: internal_read_callback ignored the actual bytes read and set resp.size to the requested size while sending the full make_shared_array block buffer, so on a short read the client wrote uninitialized heap into the destination -- file corruption and a heap-memory disclosure into replicated data. Fail the copy block via ERR_FILE_OPERATION_FAILED when fewer bytes than requested are read. Add core.aio_zero_length pinning the zero-length read/write == ERR_HANDLE_EOF contract these fixes depend on.
Point rDSN.dist.service at the error-handling-follow-up branch (bf631e7), which hardens replica log replay and checkpoint copy against corrupt/untrusted input.
Advance the rDSN.dist.service submodule pointer from bf631e7 to 9f490ec, which converts the log_file read-mode constructor's file_size() dassert abort into a graceful ERR_FILE_OPERATION_FAILED rejection via the existing log_file::open_read() error channel.
Point rDSN.dist.service to 18ad52f, which converts the abort-on-recoverable- log-file-creation-failure in mutation_log::mark_new_offset into the existing _io_error_callback failover path and guards write_pending_mutations against a null current log file.
Advance the rDSN.dist.service pointer to 9d2d9dc, which makes log_file::flush() and log_file::close() report recoverable disk-syscall failures through their existing error channels instead of aborting the replica server via dassert.
Point to rDSN.dist.service b93df50, which makes mutation::read_from return nullptr instead of throwing and converts its callers to nullptr checks.
The NFS client issued file::write(response.file_content.data(), response.size, ...) using the size and buffer from the remote server's copy_response without cross-checking them. A response whose declared size is negative or larger than the actual file_content buffer would read past the end of that buffer, corrupting the destination file with adjacent heap memory (disclosed into the replicated data) or crashing. This is the client-side counterpart of the size validation on_copy already performs on the request. end_copy now rejects such responses through the existing handle_completion(ERR_INVALID_DATA) channel before the local write.
on_get_file_size returned each file path relative to source_dir via path.substr(request.source_dir.length()), using the raw untrusted source_dir length as the position into a path that get_subfiles/path_combine had already normalized (consecutive separators collapsed, trailing separator stripped). A peer sending a non-normalized source_dir (e.g. with redundant slashes) makes the normalized path shorter than the prefix length, so substr() throws std::out_of_range. The RPC dispatch chain has no try/catch, so the exception unwinds off the worker thread and aborts the NFS service, which runs on every replica node. Guard both loops: if request.source_dir.length() exceeds the path length the source_dir cannot be a valid prefix, so reject the request with ERR_INVALID_PARAMETERS instead of throwing. The guard only triggers for a non-normalized source_dir; valid requests are unaffected.
The error-handling-follow-up changes in the rDSN.dist.service submodule were squash-merged into its master branch (microsoft#30). Advance the submodule pointer from the now-deleted branch commit to the merged master commit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This branch hardens the NFS file-copy plugin (used by replica learning) and
several core disk-engine / AIO / filesystem paths so that recoverable I/O
failures and corrupt or untrusted RPC input no longer abort the host process.
Each fix converts a fatal
dassert/uncaught-exception/OOB-access into agraceful error return through a channel the surrounding code already has
(NFS completion callback, aio callback, or a
bool/error_codereturn).Happy-path behavior is unchanged; the new code runs only on failure paths.
It also fixes two empty-file (0-byte) copy regressions and advances the
rDSN.dist.servicesubmodule pointer to its mergedmaster(the replication/meta-server hardening, merged separately as
rDSN.dist.service#30).Motivation
The NFS service and the shared disk/AIO engine run on every replica node inside
one process. A single recoverable disk error (short write on a full disk,
fsync/close/statfailure,io_destroyduring shutdown) or one malformedNFS RPC from a peer could
dassert/throw and abort the whole host — taking downevery healthy replica with it. Several of these were also memory-safety bugs
(OOB reads, uninitialized-heap disclosure into replicated data).
Changes
NFS — untrusted-input hardening
end_get_file_sizemismatched list lengths — the response's parallelsize_list/file_listarrays from an untrusted peer were indexed inlockstep without checking equal length → OOB
std::vector::operator[]read.Now validated up front; rejected with
ERR_INVALID_DATA.end_copyresponse-size validation — the client wroteresponse.file_content.data()forresponse.sizebytes without checkingsizeagainst the buffer length → past-the-end read corrupting thedestination with heap memory (disclosure into replicated data) or crash.
Now cross-checked before the local write; rejected with
ERR_INVALID_DATA.(Client-side counterpart of the validation
on_copyalready does on therequest.)
on_get_file_sizesource_dirprefix — the server stripped thesource_dirprefix viapath.substr(request.source_dir.length()), using theraw untrusted length as a position into an already-normalized path. A
non-normalized
source_dir(e.g. redundant slashes) makes the normalizedpath shorter →
std::out_of_rangethrown off an RPC worker thread (notry/catch in dispatch) → NFS service abort. Both loops now guard the length
and reject with
ERR_INVALID_PARAMETERS.NFS — recoverable-failure & empty-file handling
continue_write,handle_completion,close_fileGC, and the all-fileson_get_file_sizebranch now route
create_directory/dsn_file_open/dsn_file_close/file_sizefailures through the completion channel (ordwarn) instead ofaborting. The local-write pump now iterates instead of recurses, so a
persistent error can't grow the stack.
size == 0copy request)previously failed the whole transfer: first because
on_copyrejectedsize <= 0(relaxed tosize < 0, keeping the oversized-request overflowguard), and then because a zero-length aio read/write completes with
ERR_HANDLE_EOF, notERR_OK. The server now short-circuits asize == 0request with an empty
ERR_OKreply and the client finalizes the segmentinline as success — no zero-length aio is issued.
internal_read_callbacksetresp.sizeto the requested size while shipping the full block buffer, so a short read
wrote uninitialized heap into the destination (corruption + disclosure). Now
fails the block with
ERR_FILE_OPERATION_FAILEDwhen fewer bytes thanrequested are read.
Core — disk engine, AIO, filesystem
disk_file::on_write_completedshort write — asserted the provider alwayswrote the full requested size (false on a full disk). Now reports
ERR_FILE_OPERATION_FAILEDfor the affected and remaining batched tasks viathe aio channel and preserves the trailing
size == 0invariant.~native_linux_aio_providerio_destroy— a destructor assertedio_destroy() == 0, aborting at shutdown if the kernel AIO context wasalready released (
EINVAL). Now logsderrorand lets shutdown proceed.remove_directorynftw walk — thenftwcallback aborted on anytypeflag other than
FTW_F/FTW_DP, butFTW_NS(per-entrystatfailure)and
FTW_DNR(unreadable dir) are recoverable. Nowdwarn+FTW_STOPsothe caller returns
falseinstead of crashing (no-op on the Windows walk).Submodule
rDSN.dist.servicepointer → master — advanced to the merged-mastercommit containing the replication-log / meta-server /
mutation::read_fromhardening (merged as
rDSN.dist.service#30).