Harden rDSN fault handling under fault injection by linmajia · Pull Request #255 · microsoft/rDSN

linmajia · 2026-06-23T12:54:30Z

Summary

Improve rDSN robustness under fault injection by converting several assertion/crash/no-progress paths into controlled failures, and by hardening simple_kv checkpoint persistence against partial writes.

Changes

Improve network startup fault handling:
- Propagate RPC provider startup failures instead of aborting through core assertions.
- Report socket/bind/listen/provider initialization failures as errors instead of aborting.
Harden process startup under injected exceptions:
- Catch startup-time exceptions in main().
- Return a controlled startup failure instead of crashing.
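
The startup hardening above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `run_guarded`, the exit-code constants, and the log wording are hypothetical names chosen for the example.

```cpp
#include <cstdio>
#include <exception>
#include <functional>
#include <new>
#include <stdexcept>

// Hypothetical exit codes; the PR does not state which values rDSN returns.
constexpr int kExitOk = 0;
constexpr int kExitStartupFailed = 1;

// Runs a startup routine and maps any escaping exception to an exit code.
// `startup` stands in for whatever initialization main() performs.
int run_guarded(const std::function<void()>& startup) noexcept {
    try {
        startup();
        return kExitOk;
    } catch (const std::bad_alloc&) {
        // Use a fixed string so that reporting does not allocate again.
        std::fputs("startup failed: out of memory\n", stderr);
    } catch (const std::exception& e) {
        std::fprintf(stderr, "startup failed: %s\n", e.what());
    } catch (...) {
        std::fputs("startup failed: unknown exception\n", stderr);
    }
    return kExitStartupFailed;
}
```

A `main()` built this way would simply `return run_guarded(...)`, so an injected fault during startup ends in a nonzero exit code rather than an unhandled exception.
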
Harden task-code and threadpool registration:
- Prevent failed task-code registration from dereferencing invalid IDs.
- Make registration paths tolerate allocation/container failures.
- Avoid aborting during registration failure logging.
Harden allocator and binary-reader fault paths:
- Make callocator throw on allocation failure instead of returning null to STL containers.
- Handle corrupt replica app-info/binary-reader input without assertion crashes.
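
The allocator change matters because standard containers assume `allocate()` either returns usable storage or throws; a null return is dereferenced as if it were valid. A minimal sketch of an allocator that follows that contract is shown below. `throwing_allocator` is a hypothetical name and `malloc`/`free` are assumed as the backing functions; this is not rDSN's `callocator`.

```cpp
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <new>
#include <vector>

// Minimal C++11-style allocator that reports failure by throwing.
template <typename T>
struct throwing_allocator {
    using value_type = T;

    throwing_allocator() noexcept = default;
    template <typename U>
    throwing_allocator(const throwing_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Reject requests whose byte size would overflow size_t.
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_alloc();
        const std::size_t bytes = n * sizeof(T);
        // malloc(0) may legally return null, so request at least one byte.
        void* p = std::malloc(bytes != 0 ? bytes : 1);
        if (p == nullptr)
            throw std::bad_alloc();  // never hand null back to a container
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// Stateless allocators always compare equal.
template <typename T, typename U>
bool operator==(const throwing_allocator<T>&, const throwing_allocator<U>&) noexcept {
    return true;
}
template <typename T, typename U>
bool operator!=(const throwing_allocator<T>&, const throwing_allocator<U>&) noexcept {
    return false;
}
```

With this shape, a `std::vector<int, throwing_allocator<int>>` that runs out of memory surfaces `std::bad_alloc` to the caller, which can then be handled as a controlled failure.
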
Harden module loading:
- Return controlled errors for module/symbol load failures instead of asserting.
Harden simple_kv checkpoints:
- Write checkpoints through a temporary file and publish via rename only after successful flush/close.
- Validate checkpoint files during recovery.
- Recover from the newest valid checkpoint instead of blindly trusting the newest file.
- Return checkpoint corruption errors instead of asserting on malformed checkpoint data.
- Validate learned checkpoint state before accessing state.files[0].
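
The write-then-rename and newest-valid-checkpoint steps above can be sketched as follows. This is an illustration under stated assumptions, not the simple_kv code: `publish_checkpoint`, `newest_valid_checkpoint`, the `.tmp` suffix, and the caller-supplied `is_valid` check are all hypothetical.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Writes `data` to `final_path` by way of a temporary file.
// Returns true only if write, flush, close, and rename all succeed.
bool publish_checkpoint(const std::string& final_path, const std::string& data) {
    const std::string tmp_path = final_path + ".tmp";  // hypothetical naming scheme

    std::FILE* f = std::fopen(tmp_path.c_str(), "wb");
    if (f == nullptr) return false;

    bool ok = std::fwrite(data.data(), 1, data.size(), f) == data.size();
    ok = (std::fflush(f) == 0) && ok;
    ok = (std::fclose(f) == 0) && ok;  // always close, even after a failed write

    if (!ok) {
        std::remove(tmp_path.c_str());  // never leave a partial file behind
        return false;
    }
    // On POSIX, rename() atomically replaces an existing destination;
    // other platforms may behave differently.
    if (std::rename(tmp_path.c_str(), final_path.c_str()) != 0) {
        std::remove(tmp_path.c_str());
        return false;
    }
    return true;
}

// Walks candidates from newest to oldest and returns the first one that
// passes `is_valid`. `candidates` must be sorted oldest to newest.
// Returns an empty string when no candidate is usable.
std::string newest_valid_checkpoint(
    const std::vector<std::string>& candidates,
    const std::function<bool(const std::string&)>& is_valid) {
    for (auto it = candidates.rbegin(); it != candidates.rend(); ++it) {
        if (is_valid(*it)) return *it;
    }
    return std::string();
}
```

Because readers only ever see the final name after a successful rename, a fault injected mid-write leaves at worst a stray temporary file. The sketch flushes stdio buffers only; surviving power loss would additionally require an `fsync`-style call, which is omitted here.
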
Sync external submodules to merged master:
- rDSN.dist.service: 4757232
- rDSN.tools.hpc: 842ba7c

Validation

Ran focused libfiu probes for each discovered failure class.
Confirmed that the previously observed assertion, crash, and no-progress cases no longer reproduce under the focused probes.
The full test suite passed.

HX Lin added 10 commits June 23, 2026 15:00

- 858c031 Improve network startup fault handling
- e8c2ebe Handle startup exceptions under fault injection
- 965b8a1 Handle task code registration exceptions
- 4e6c9b3 Harden allocator and binary reader faults
- a812afb Handle registration and app info faults
- 2db26d4 Harden task spec registration
- 151c9f0 Use nullptr for pointer nulls
- 4f74b5a Handle module load failures
- 10ed47a Harden simple_kv checkpoints against partial writes
- fd7921a Sync external submodules to master

linmajia merged commit e5c9abd into microsoft:master Jun 23, 2026
2 checks passed

linmajia deleted the libfiu-robustness branch June 23, 2026 13:07
