Skip to content

Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures#261

Merged
linmajia merged 17 commits into
microsoft:masterfrom
linmajia:error-handling-follow-up
Jun 30, 2026
Merged

Harden NFS file-copy plugin and core disk/AIO paths against corrupt input and recoverable failures#261
linmajia merged 17 commits into
microsoft:masterfrom
linmajia:error-handling-follow-up

Conversation

@linmajia

Copy link
Copy Markdown
Contributor

Summary

This branch hardens the NFS file-copy plugin (used by replica learning) and
several core disk-engine / AIO / filesystem paths so that recoverable I/O
failures and corrupt or untrusted RPC input no longer abort the host process.
Each fix converts a fatal dassert/uncaught-exception/OOB-access into a
graceful error return through a channel the surrounding code already has
(NFS completion callback, aio callback, or a bool/error_code return).
Happy-path behavior is unchanged; the new code runs only on failure paths.

It also fixes two empty-file (0-byte) copy regressions and advances the
rDSN.dist.service submodule pointer to its merged master
(the replication/meta-server hardening, merged separately as
rDSN.dist.service#30).

Motivation

The NFS service and the shared disk/AIO engine run on every replica node inside
one process. A single recoverable disk error (short write on a full disk,
fsync/close/stat failure, io_destroy during shutdown) or one malformed
NFS RPC from a peer could dassert/throw and abort the whole host — taking down
every healthy replica with it. Several of these were also memory-safety bugs
(OOB reads, uninitialized-heap disclosure into replicated data).

Changes

NFS — untrusted-input hardening

  • end_get_file_size mismatched list lengths — the response's parallel
    size_list/file_list arrays from an untrusted peer were indexed in
    lockstep without checking equal length → OOB std::vector::operator[] read.
    Now validated up front; rejected with ERR_INVALID_DATA.
  • end_copy response-size validation — the client wrote
    response.file_content.data() for response.size bytes without checking
    size against the buffer length → past-the-end read corrupting the
    destination with heap memory (disclosure into replicated data) or crash.
    Now cross-checked before the local write; rejected with ERR_INVALID_DATA.
    (Client-side counterpart of the validation on_copy already does on the
    request.)
  • on_get_file_size source_dir prefix — the server stripped the
    source_dir prefix via path.substr(request.source_dir.length()), using the
    raw untrusted length as a position into an already-normalized path. A
    non-normalized source_dir (e.g. redundant slashes) makes the normalized
    path shorter → std::out_of_range thrown off an RPC worker thread (no
    try/catch in dispatch) → NFS service abort. Both loops now guard the length
    and reject with ERR_INVALID_PARAMETERS.

NFS — recoverable-failure & empty-file handling

  • File-copy I/O aborts → error channelcontinue_write,
    handle_completion, close_file GC, and the all-files on_get_file_size
    branch now route create_directory/dsn_file_open/dsn_file_close/
    file_size failures through the completion channel (or dwarn) instead of
    aborting. The local-write pump now iterates instead of recurses, so a
    persistent error can't grow the stack.
  • Zero-byte file copy — empty source files (one size == 0 copy request)
    previously failed the whole transfer: first because on_copy rejected
    size <= 0 (relaxed to size < 0, keeping the oversized-request overflow
    guard), and then because a zero-length aio read/write completes with
    ERR_HANDLE_EOF, not ERR_OK. The server now short-circuits a size == 0
    request with an empty ERR_OK reply and the client finalizes the segment
    inline as success — no zero-length aio is issued.
  • Short-read uninitialized datainternal_read_callback set resp.size
    to the requested size while shipping the full block buffer, so a short read
    wrote uninitialized heap into the destination (corruption + disclosure). Now
    fails the block with ERR_FILE_OPERATION_FAILED when fewer bytes than
    requested are read.

Core — disk engine, AIO, filesystem

  • disk_file::on_write_completed short write — asserted the provider always
    wrote the full requested size (false on a full disk). Now reports
    ERR_FILE_OPERATION_FAILED for the affected and remaining batched tasks via
    the aio channel and preserves the trailing size == 0 invariant.
  • ~native_linux_aio_provider io_destroy — a destructor asserted
    io_destroy() == 0, aborting at shutdown if the kernel AIO context was
    already released (EINVAL). Now logs derror and lets shutdown proceed.
  • remove_directory nftw walk — the nftw callback aborted on any
    typeflag other than FTW_F/FTW_DP, but FTW_NS (per-entry stat failure)
    and FTW_DNR (unreadable dir) are recoverable. Now dwarn + FTW_STOP so
    the caller returns false instead of crashing (no-op on the Windows walk).

Submodule

  • rDSN.dist.service pointer → master — advanced to the merged-master
    commit containing the replication-log / meta-server / mutation::read_from
    hardening (merged as rDSN.dist.service#30).

HX Lin added 17 commits June 30, 2026 10:37
The NFS file-copy plugin (used by replica learning) aborted the whole
host via dassert on recoverable file I/O failures. Convert these to use
the existing error/completion channels, and make the local-write pump
iterate instead of recurse so a persistent error cannot grow the stack.

nfs_server_impl.cpp:
- on_get_file_size (all-files branch): on file_size() failure, set
  ERR_FILE_OPERATION_FAILED and break instead of aborting; resp.error
  already carries the failure to the client.
- close_file GC: on dsn_file_close failure, dwarn instead of aborting;
  erase the handle from the map before closing.

nfs_client_impl.cpp:
- continue_write: on create_directory / dsn_file_open failure, finalize
  the user request via handle_completion(ERR_FILE_OPERATION_FAILED)
  instead of aborting.
- continue_write: iterate instead of recurse on per-file failures to
  keep stack depth O(1) under persistent errors.
- handle_completion: on dsn_file_close failure, dwarn instead of aborting.
end_get_file_size() iterates resp.size_list but indexes the parallel
resp.file_list with the same index. The response is deserialized from a
remote, untrusted peer; if it arrives with size_list longer than
file_list, resp.file_list[i] is an out-of-bounds std::vector::operator[]
read (constructing a std::string from garbage), corrupting memory or
crashing the host.

Validate that the two parallel arrays have equal length right after the
existing resp.error check, and fail the request through the existing
nfs_task channel with ERR_INVALID_DATA, matching the adjacent
early-return error paths. A well-behaved server fills both lists in
lockstep, so normal behavior is unchanged.
remove_directory()'s nftw callback aborted the whole process via dassert
when nftw reported a typeflag other than FTW_F/FTW_DP. nftw delivers
FTW_NS when a per-entry stat() fails (transient I/O error, permission
problem, or a concurrent removal race) and FTW_DNR for an unreadable
directory; both are recoverable and remove_directory()/remove_path()
already return a bool. Replace the dassert with a dwarn + FTW_STOP so the
walk stops and the caller returns false instead of crashing the host. The
Windows custom walk only ever passes FTW_F/FTW_DP, so this is a no-op
there.
on_copy rejected request.size <= 0, but size == 0 is legitimate: the
client emits exactly one copy request per file, and an empty source file
yields a single request with size 0. Empty files are listed by
on_get_file_size without filtering, so any directory containing even one
empty file failed the entire multi-file transfer / replica-learn with
ERR_INVALID_PARAMETERS. Reject only request.size < 0 so a zero length
flows through the original open + zero-length file::read path (restoring
pre-hardening behavior) while keeping the oversized-request heap-overflow
guard intact.
disk_file::on_write_completed asserted that the provider always wrote at
least the requested number of bytes, aborting the process on a short
write (which can happen on a full disk). Report ERR_FILE_OPERATION_FAILED
for the affected task and the remaining batched tasks through the aio
callback channel instead, and stop consuming the (now invalid) written
size so the trailing size==0 invariant still holds.
~native_linux_aio_provider asserted io_destroy() returned 0, which aborts
during shutdown if the kernel AIO context was already released or is
otherwise invalid (e.g. EINVAL). A destructor must not abort: log the
failure with derror (matching the other I/O error sites in this file) and
let shutdown proceed.
Empty (0-byte) source files made the whole file transfer fail. A
zero-length aio read/write completes with ERR_HANDLE_EOF (not ERR_OK),
so the single size==0 copy request an empty file produces was reported
as a failure -- the prior size-guard relaxation alone was not enough.
Fix both sides to avoid issuing zero-length aio: the server short-circuits
a size==0 copy request with an empty ERR_OK reply (no open/read), and the
client finalizes the segment inline as success (the destination is already
created by its O_CREAT open) via the loop, not recursion.

Short reads also shipped uninitialized memory: internal_read_callback
ignored the actual bytes read and set resp.size to the requested size
while sending the full make_shared_array block buffer, so on a short read
the client wrote uninitialized heap into the destination -- file
corruption and a heap-memory disclosure into replicated data. Fail the
copy block via ERR_FILE_OPERATION_FAILED when fewer bytes than requested
are read.

Add core.aio_zero_length pinning the zero-length read/write ==
ERR_HANDLE_EOF contract these fixes depend on.
Point rDSN.dist.service at the error-handling-follow-up branch
(bf631e7), which hardens replica log replay and checkpoint copy against
corrupt/untrusted input.
Advance the rDSN.dist.service submodule pointer from bf631e7 to 9f490ec,
which converts the log_file read-mode constructor's file_size() dassert
abort into a graceful ERR_FILE_OPERATION_FAILED rejection via the existing
log_file::open_read() error channel.
Point rDSN.dist.service to 18ad52f, which converts the abort-on-recoverable-
log-file-creation-failure in mutation_log::mark_new_offset into the existing
_io_error_callback failover path and guards write_pending_mutations against a
null current log file.
Advance the rDSN.dist.service pointer to 9d2d9dc, which makes
log_file::flush() and log_file::close() report recoverable disk-syscall
failures through their existing error channels instead of aborting the
replica server via dassert.
Point to rDSN.dist.service b93df50, which makes mutation::read_from return
nullptr instead of throwing and converts its callers to nullptr checks.
The NFS client issued file::write(response.file_content.data(),
response.size, ...) using the size and buffer from the remote server's
copy_response without cross-checking them. A response whose declared size
is negative or larger than the actual file_content buffer would read past
the end of that buffer, corrupting the destination file with adjacent heap
memory (disclosed into the replicated data) or crashing. This is the
client-side counterpart of the size validation on_copy already performs on
the request. end_copy now rejects such responses through the existing
handle_completion(ERR_INVALID_DATA) channel before the local write.
on_get_file_size returned each file path relative to source_dir via
path.substr(request.source_dir.length()), using the raw untrusted source_dir
length as the position into a path that get_subfiles/path_combine had already
normalized (consecutive separators collapsed, trailing separator stripped).
A peer sending a non-normalized source_dir (e.g. with redundant slashes) makes
the normalized path shorter than the prefix length, so substr() throws
std::out_of_range. The RPC dispatch chain has no try/catch, so the exception
unwinds off the worker thread and aborts the NFS service, which runs on every
replica node.

Guard both loops: if request.source_dir.length() exceeds the path length the
source_dir cannot be a valid prefix, so reject the request with
ERR_INVALID_PARAMETERS instead of throwing. The guard only triggers for a
non-normalized source_dir; valid requests are unaffected.
The error-handling-follow-up changes in the rDSN.dist.service submodule were
squash-merged into its master branch (microsoft#30). Advance the submodule pointer from
the now-deleted branch commit to the merged master commit.
@linmajia linmajia merged commit b04ec67 into microsoft:master Jun 30, 2026
2 checks passed
@linmajia linmajia deleted the error-handling-follow-up branch June 30, 2026 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant