Skip to content

Harden C API and RPC boundaries against fault-injected failures#259

Merged
linmajia merged 15 commits into
microsoft:masterfrom
linmajia:c-api-error-handling
Jun 29, 2026
Merged

Harden C API and RPC boundaries against fault-injected failures#259
linmajia merged 15 commits into
microsoft:masterfrom
linmajia:c-api-error-handling

Conversation

@linmajia

Copy link
Copy Markdown
Contributor

Summary

  • Add no-throw guards and safer error returns for C API boundaries with real return channels.
  • Harden message read/write, receive-message creation, receive-buffer preparation, and serverlet reply marshalling paths.
  • Convert recoverable disk AIO, config key enumeration, sync-provider construction, and parser/buffer failures into existing error/null/bool channels.
  • Update fault-injection coverage and submodule pointers for merged rDSN.dist.service and rDSN.tools.hpc hardening.

Validation

  • Built rDSN with plugins on Ubuntu/libfiu.
  • Ran focused libfiu/simple_kv runtime sweeps with allocation, file I/O, and network faults.
  • Ran focused corrupted-message and oversized-config checks without assertions.
  • Synced submodules to merged master commits:
    • rDSN.dist.service 82ec2cd
    • rDSN.tools.hpc 758c7b5

HX Lin added 15 commits June 28, 2026 11:08
The rDSN public API (DSN_API functions) is a C ABI, but the
implementations are C++ and freely call code that can throw (heap
allocation, STL/blob/string, serialization, engine calls). An exception
unwinding across the C ABI is undefined behavior. Under fault injection
(libfiu malloc faults), those throws become reachable on paths that
previously "never failed", so an escaping std::bad_alloc can crash the
process in a way the C caller cannot observe or handle.

Phase 1 - no-throw boundary:
- Add src/core/src/c_api_guard.h: DSN_C_GUARD_BEGIN / DSN_C_GUARD_END /
  DSN_C_GUARD_END_VOID. The guard wraps a throw-capable C entry point in
  try/catch and translates any escaping exception into that API's
  documented failure value, always logging the diagnostic (function name
  plus what()) so a translated failure is never silent. Two catch
  clauses: std::exception (for what()) and catch(...) to also stop
  non-std exceptions. The value variant ends with a trailing
  return (fail_ret) so -Wreturn-type stays quiet under -Wall -Werror.
- Apply guards to 68 functions across 8 files (service_api_c.cpp,
  rpc_message.cpp, main.cpp, app_manager.cpp, address.cpp,
  command_manager.cpp, perf_counters.cpp, global_checkers.cpp).
- fail_ret conventions: pointer -> nullptr, bool -> false, dsn_error_t ->
  ERR_UNKNOWN, count APIs -> -1, *_from_string / config getters -> the
  caller-supplied default.
- Deliberately not guarded: logging.cpp (catch handler calls derror, so
  guarding the logger risks recursion), dsn_run / dsn_run_config (already
  self-catch), and startup-only type registration (no runtime sentinel).

Phase 2 - abort -> error returns where a channel already exists:
- message_ex::write_next / write_commit / read_commit changed from void
  to bool; the "pending write/read not committed" and "no buffer under
  read" sequencing dasserts become derror + return false. read_next had
  its leading dassert converted the same way. The dsn_msg_* C wrappers
  propagate the bool. write_append (no callers) keeps void + dassert.
- task::wait() self-wait (a guaranteed deadlock) becomes derror +
  return false; wait() now returns bool. dsn_task_wait_timeout (bool)
  propagates it; dsn_task_wait (void) keeps its dassert since it has no
  error channel.

Boundary hardening for the message read/write out-params:
- write_next / read_next validate the ptr and size out-parameters for
  null before use.
- write_next validates *ptr after tls_trans_mem_next (re-committing 0
  bytes to keep the next/commit pairing consistent) before doing pointer
  arithmetic on it.
- read_next validates the produced *ptr (null for an empty blob, where
  _rw_offset is 0) after assigning it and before mutating read state, so
  it never returns a null *ptr as a successful read.

Internal invariants with no caller-recoverable cause (e.g. the write
buffer-count post-condition) are intentionally kept as dassert.

Built on Ubuntu 16.04 with -Wall -Werror: Build succeed.
write_next/write_commit and read_next/read_commit were converted from
process-aborting dassert to recoverable false returns when called out of
order. Add TEST(core, message_ex_rw_pairing) to lock in that contract:
- write_commit with nothing pending, and a double write_commit, return false
- write_next with an uncommitted write pending returns false and zeroes the
  ptr/size out-params
- read_commit with nothing under read, and a double read_commit, return false
- read_next with an uncommitted read pending returns false and zeroes the
  ptr/size out-params

Built on Ubuntu 16.04 with -Wall -Werror: Build succeed.
@linmajia linmajia merged commit 51eaf5b into microsoft:master Jun 29, 2026
2 checks passed
@linmajia linmajia deleted the c-api-error-handling branch June 30, 2026 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant