Skip to content

Fix data race in TopicPayloadPool payload acquisition#6339

Open
PavelGuzenfeld wants to merge 1 commit intoeProsima:masterfrom
PavelGuzenfeld:fix/topic-payload-pool-tsan
Open

Fix data race in TopicPayloadPool payload acquisition#6339
PavelGuzenfeld wants to merge 1 commit intoeProsima:masterfrom
PavelGuzenfeld:fix/topic-payload-pool-tsan

Conversation

@PavelGuzenfeld
Copy link
Copy Markdown

@PavelGuzenfeld PavelGuzenfeld commented Mar 21, 2026

Summary

Fixes a ThreadSanitizer-detected data race in TopicPayloadPool that manifests during DDS intra-process communication when publisher serialization and subscriber deserialization access the same payload buffer concurrently.

Reported in: ros2/rclcpp#2941
Issue filed: #6340

Changes

1. Hold mutex through reference increment in do_get_payload()

Previously, the pool mutex was released before payload_node->reference() was called. In this window, a concurrent release_payload() could see ref_count==0 and return the node to the free list, allowing reuse by another thread.

-    lock.unlock();
-    payload_node->reference();
-    payload.data = payload_node->data();
-    payload.max_size = payload_node->data_size();
-    payload.payload_owner = this;
+    payload_node->reference();
+    payload.data = payload_node->data();
+    payload.max_size = payload_node->data_size();
+    payload.payload_owner = this;
+    lock.unlock();

2. Strengthen memory ordering in PayloadNode::reference()

Changed from memory_order_relaxed to memory_order_acq_rel. When the sharing path (get_payload(const&)) passes a payload buffer from publisher to subscriber, relaxed ordering does not guarantee the subscriber sees the publisher's writes to the buffer. acq_rel establishes the necessary happens-before relationship.

Test Results

AddressSanitizer (ASan)

[==========] 2563 tests from 3 test suites ran. (6393 ms total)
[  PASSED  ] 2563 tests.

ThreadSanitizer (TSan)

[==========] 2563 tests from 3 test suites ran. (17573 ms total)
[  PASSED  ] 2563 tests.

Environment: Ubuntu 24.04, GCC 13, foonathan_memory 0.7.4, fastcdr from thirdparty submodule.

Test plan

  • TopicPayloadPoolTests pass with ASan (2563/2563, zero memory errors)
  • TopicPayloadPoolTests pass with TSan (2563/2563, zero data races, halt_on_error=1)
  • Build clean (no warnings in modified files)

Signed-off-by: Pavel Guzenfeld me@pavelguzenfeld.com

Two thread-safety issues in TopicPayloadPool:

1. do_get_payload() released the pool mutex before calling
   payload_node->reference(). In the window between releasing the
   lock and incrementing the reference count, a concurrent
   release_payload() could see ref_count==0 and return the node to
   the free list, allowing another thread to reuse it.

   Fix: hold the lock until after reference() and metadata reads
   are complete.

2. PayloadNode::reference() used memory_order_relaxed for the
   atomic increment. When the sharing path (get_payload(const&))
   passes a payload buffer from publisher to subscriber, relaxed
   ordering does not guarantee the subscriber sees the publisher's
   writes to the buffer data. This can cause TSan-detected races
   between serialize_array and deserialize_array on the same buffer.

   Fix: use memory_order_acq_rel to establish happens-before
   ordering between the writing thread and the reading thread.

Ref: ros2/rclcpp#2941

Signed-off-by: Pavel Guzenfeld <me@pavelguzenfeld.com>
@PavelGuzenfeld PavelGuzenfeld force-pushed the fix/topic-payload-pool-tsan branch from bf6d2b9 to c34dead Compare March 21, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant