
Harden rDSN fault handling under fault injection #255

Merged
linmajia merged 10 commits into microsoft:master from linmajia:libfiu-robustness
Jun 23, 2026


@linmajia

Contributor

Summary

Improve rDSN robustness under fault injection by converting several assertion/crash/no-progress paths into controlled failures, and by hardening simple_kv checkpoint persistence against partial writes.

Changes

  • Improve network startup fault handling:

    • Propagate RPC provider startup failures instead of aborting through core assertions.
    • Handle socket/bind/listen/provider initialization failures more gracefully.
  • Harden process startup under injected exceptions:

    • Catch startup-time exceptions in main().
    • Return controlled startup failure instead of crashing unexpectedly.
  • Harden task-code and threadpool registration:

    • Prevent failed task-code registration from dereferencing invalid IDs.
    • Make registration paths tolerate allocation/container failures.
    • Avoid aborting during registration failure logging.
  • Harden allocator and binary-reader fault paths:

    • Make callocator throw on allocation failure instead of returning null to STL containers.
    • Handle corrupt replica app-info/binary-reader input without assertion crashes.
  • Harden module loading:

    • Return controlled errors for module/symbol load failures instead of asserting.
  • Harden simple_kv checkpoints:

    • Write checkpoints through a temporary file and publish via rename only after successful flush/close.
    • Validate checkpoint files during recovery.
    • Recover from the newest valid checkpoint instead of blindly trusting the newest file.
    • Return checkpoint corruption errors instead of asserting on malformed checkpoint data.
    • Validate learned checkpoint state before accessing state.files[0].
  • Sync external submodules to merged master:

    • rDSN.dist.service: 4757232
    • rDSN.tools.hpc: 842ba7c
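
The simple_kv checkpoint items above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the `checkpoint.<version>` file naming, the `[magic][size][payload][checksum]` layout, the FNV-1a checksum, and the function names `write_checkpoint`, `read_checkpoint`, and `recover_newest_valid` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical layout: [magic:u32][size:u32][payload][checksum:u32], host byte order.
constexpr std::uint32_t kMagic = 0x434B5054u;

// FNV-1a, used here only as a stand-in for whatever integrity check the real code uses.
std::uint32_t checksum(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Write to "<final>.tmp" and rename onto the final name only if every step succeeded.
// Note: flush() does not force data to disk; a durable version would also fsync.
bool write_checkpoint(const fs::path& final_path, const std::string& payload) {
    fs::path tmp = final_path;
    tmp += ".tmp";
    std::error_code ec;
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        const std::uint32_t size = static_cast<std::uint32_t>(payload.size());
        const std::uint32_t sum = checksum(payload);
        out.write(reinterpret_cast<const char*>(&kMagic), sizeof kMagic);
        out.write(reinterpret_cast<const char*>(&size), sizeof size);
        out.write(payload.data(), size);
        out.write(reinterpret_cast<const char*>(&sum), sizeof sum);
        out.flush();
        out.close();
        if (!out) { fs::remove(tmp, ec); return false; }  // partial write: never publish
    }
    fs::rename(tmp, final_path, ec);
    if (ec) { fs::remove(tmp, ec); return false; }
    return true;
}

// Return the payload only if magic, length, and checksum all agree; never assert.
std::optional<std::string> read_checkpoint(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::uint32_t magic = 0, size = 0, sum = 0;
    if (!in.read(reinterpret_cast<char*>(&magic), sizeof magic) || magic != kMagic)
        return std::nullopt;
    if (!in.read(reinterpret_cast<char*>(&size), sizeof size)) return std::nullopt;
    std::error_code ec;
    const auto on_disk = fs::file_size(p, ec);
    if (ec || on_disk != 12ull + size) return std::nullopt;  // rejects bogus sizes
    std::string payload(size, '\0');
    if (!in.read(payload.data(), size)) return std::nullopt;
    if (!in.read(reinterpret_cast<char*>(&sum), sizeof sum) || sum != checksum(payload))
        return std::nullopt;
    return payload;
}

// Scan "checkpoint.<version>" files from newest to oldest; return the first valid one.
std::optional<std::pair<std::uint64_t, std::string>>
recover_newest_valid(const fs::path& dir) {
    const std::string prefix = "checkpoint.";
    std::vector<std::pair<std::uint64_t, fs::path>> found;
    std::error_code ec;
    fs::directory_iterator it(dir, ec);
    if (ec) return std::nullopt;
    for (const auto& entry : it) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(prefix, 0) != 0) continue;
        const std::string digits = name.substr(prefix.size());
        const bool numeric = !digits.empty() && digits.size() <= 18 &&
            std::all_of(digits.begin(), digits.end(),
                        [](unsigned char c) { return std::isdigit(c) != 0; });
        if (!numeric) continue;  // also skips leftover ".tmp" files
        found.emplace_back(std::stoull(digits), entry.path());
    }
    std::sort(found.begin(), found.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    for (const auto& [version, path] : found) {
        if (auto payload = read_checkpoint(path)) return std::make_pair(version, *payload);
    }
    return std::nullopt;
}
```

With this shape, a crash or injected I/O failure during a write leaves at most a stray `.tmp` file, and a damaged newest checkpoint causes recovery to fall back to an older valid one rather than abort.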

Validation

  • Ran focused libfiu probes for the discovered failure classes.
  • Confirmed that the previously observed assertion/crash/no-progress cases no longer reproduce in the focused probes.
  • The full test suite passed.
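
The probes themselves are not shown in this description. As a rough illustration of what a focused probe checks, the sketch below injects one fault at a time into a startup path and verifies that the result is a controlled error code rather than an escaped exception. It deliberately does not use libfiu's API: `enabled_faults`, `fault_enabled`, the fault-point names, `start_network`, `register_task_codes`, `guarded_startup`, `probe`, and the exit-code values are all hypothetical stand-ins.

```cpp
#include <exception>
#include <new>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a fault-injection library: names of enabled fault points.
std::unordered_set<std::string>& enabled_faults() {
    static std::unordered_set<std::string> faults;
    return faults;
}

bool fault_enabled(const std::string& name) {
    return enabled_faults().count(name) != 0;
}

// Hypothetical startup steps that fail when their fault point is enabled.
void start_network() {
    if (fault_enabled("net/bind")) throw std::runtime_error("bind failed (injected)");
}

void register_task_codes() {
    if (fault_enabled("alloc/registry")) throw std::bad_alloc();
}

// Maps every startup exception to a nonzero exit code so nothing escapes the caller.
// bad_alloc is caught first because it derives from std::exception.
int guarded_startup() {
    try {
        start_network();
        register_task_codes();
        return 0;
    } catch (const std::bad_alloc&) {
        return 2;
    } catch (const std::exception&) {
        return 1;
    } catch (...) {
        return 3;
    }
}

// A probe enables exactly one fault, runs the startup path, then clears the fault.
int probe(const std::string& fault) {
    enabled_faults().insert(fault);
    const int rc = guarded_startup();
    enabled_faults().erase(fault);
    return rc;
}
```

A probe passes when the call returns the expected nonzero code; a crash, abort, or hang under the injected fault is the failure class this PR targets.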

@linmajia linmajia merged commit e5c9abd into microsoft:master Jun 23, 2026
2 checks passed
@linmajia linmajia deleted the libfiu-robustness branch June 23, 2026 13:07
