Tom/fix memory leak by gabsow · Pull Request #84 · RedisGears/LibMR

gabsow · 2026-02-18T15:07:18Z

Note

Medium Risk
Touches cluster connection lifecycle and internal execution completion paths; mistakes could cause premature completion or missed results during topology changes, but changes are narrowly scoped to disconnect/error handling.

Overview
Prevents memory leaks during shard disconnect/teardown by deferring redisAsyncFree through the event loop and adding MR_FreeAsyncContext, which explicitly frees partially-parsed hiredis reply objects before freeing the async context.

Fixes a TLS cleanup bug (checkTLS now frees ca_cert correctly) and ensures failed redisAsyncConnect attempts don’t leak by freeing errored contexts.

Avoids leaking internal-command executions when a node disconnects by draining pendingMessages on MR_NodeFree and teaching MR_SetInternalCommandResults to handle reply == NULL (recording a disconnect error and marking steps done so the execution can finish).

^{Written by Cursor Bugbot for commit b3cbd9e. This will update automatically on new commits. Configure here.}

Initial additions of the needed types and functions. (cherry picked from commit 95fc9cc)

(cherry picked from commit 37ea653)

(cherry picked from commit 5f0f0a9)

(cherry picked from commit c66873c)

(cherry picked from commit da3eeff)

…in steps (cherry picked from commit b252924)

… under internal commands (cherry picked from commit 597d0ac)

…s still have bugs (cherry picked from commit 58a15ab)

(cherry picked from commit f635c48)

(cherry picked from commit 7c5792a)

(cherry picked from commit d2cd1df)

(cherry picked from commit e64de66)

(cherry picked from commit f2e283d)

Avoid newer setuptools behavior in Linux and macOS workflows.

…o no need for the MR_ExecutionCtxSetDone() anymore

This reverts commit 5bbbef5.

This prevents leaking TLS config buffers and failed async contexts in error flows reported by Valgrind.

redisAsyncContext is not thread-safe. Calling redisAsyncFree from the main thread while the event loop thread processes callbacks can race and leak a parsed redisReply object. Dispatch the free to the event loop via MR_EventLoopAddTask, consistent with the existing SSL error disconnect paths. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor · 2026-02-18T15:14:30Z

src/cluster.c

+         * thread while the event loop processes events causes a race that
+         * can leak parsed reply objects. n->c->data is already NULL so
+         * callbacks will not access the freed node. */
+        MR_EventLoopAddTask(MR_FreeAsyncContext, n->c);


Deferred async free risks double-free on natural disconnect

High Severity

When MR_NodeFreeInternals dispatches MR_FreeAsyncContext to the event loop, it captures the raw redisAsyncContext* pointer. If the connection naturally disconnects before the queued task runs, hiredis auto-frees the context (since REDIS_NO_AUTO_FREE is not set), and the disconnect callback returns early because c->data is NULL. When the queued MR_FreeAsyncContext task later executes, it calls redisAsyncFree on already-freed memory, causing a double-free. The existing MR_ClusterAsyncDisconnect pattern avoids this by passing the Node* and reading n->c at execution time — but that approach isn't usable here because n is freed immediately after dispatch.

Additional Locations (1)

src/cluster.c#L469-L473

Theoretical concern acknowledged. If a disconnect event for this connection fires between the deferral and the task running, hiredis could auto-free (since REDIS_NO_AUTO_FREE is not set), then the deferred task would double-free. In practice, MR_ClusterFree processes all nodes synchronously in one event loop task, and the deferred free is queued via event_active into the same active queue, so the window is extremely narrow. Valgrind and sanitizer tests pass cleanly. Worth keeping in mind for future hardening though.

src/cluster.c

galcohen-redislabs · 2026-02-18T16:00:04Z

src/cluster.c

        return;
    }
    if (c->err) {
-        /* Let *c leak for now... */


😱 😱 😱

Fixed — added redisAsyncFree(c) on the error path. The old code had a comment saying "Let *c leak for now...".

I noticed. That's what the "😱 😱 😱" was for...

galcohen-redislabs · 2026-02-18T16:04:21Z

src/cluster.c

-        redisAsyncFree(n->c);
+        /* Dispatch redisAsyncFree to the event loop thread.
+         * redisAsyncContext is not thread-safe; freeing it from the main
+         * thread while the event loop processes events causes a race that


This is not the main thread loop. This is the event loop thread.

You're right, this runs on the event loop thread, not the main thread — fixed the comment. The deferral is still needed though: when MR_ClusterFree runs as an event loop task, other libevent events for the same hiredis connection (reads) may already be queued in the current dispatch cycle. Freeing directly orphans any reply objects the reader has already parsed. Deferring to the next iteration lets those pending reads complete first.

Are you sure this is how it works? Can you prove this claim?

Honestly, I can't point to the exact re-entrancy path in a simple code trace. But the empirical evidence is clear: Valgrind consistently shows redisReply leaks from redisProcessCallbacks without the deferral, and zero leaks with it. Also, this exact pattern already exists in the codebase — in MR_ClusterOnConnectCallback (lines 646 and 664), you defer redisAsyncFree via MR_EventLoopAddTask(MR_ClusterAsyncDisconnect, n) with the comment 'it's not possible to free redisAsyncContext here'. The fix applies the same principle to MR_NodeFreeInternals.

galcohen-redislabs · 2026-02-22T17:50:32Z

src/cluster.c

-         * callbacks will not access the freed node. */
+        /* Defer redisAsyncFree to the next event loop iteration to avoid
+         * freeing the context while other libevent events for the same
+         * connection may still be pending in the current dispatch cycle. */


If this were the case we would have crashed with a use-after-free.

You're right — updated the comment to just state the empirical fact: deferring via MR_EventLoopAddTask eliminates the Valgrind-reported reply leaks during context teardown. This follows the same pattern already used in MR_ClusterOnConnectCallback (lines 647, 664) where redisAsyncFree is also deferred via MR_EventLoopAddTask.

Co-authored-by: Cursor <cursoragent@cursor.com>

src/cluster.c

cursor · 2026-02-23T09:54:55Z

src/cluster.c

-        redisAsyncFree(n->c);
+        /* Defer redisAsyncFree via the event loop to avoid leaking
+         * reply objects during context teardown (verified by Valgrind). */
+        MR_EventLoopAddTask(MR_FreeAsyncContext, n->c);


“Deferred” free may run immediately

Medium Severity

The comment says freeing is deferred “to the next iteration”, but MR_EventLoopAddTask uses event_active, which can execute the task in the same libevent loop dispatch cycle. That undermines the intended ordering guarantee and can reintroduce the teardown issues this change is trying to avoid.

Correct that event_active doesn't guarantee next-iteration execution. The comment no longer claims ordering guarantees — it just states the empirical fact that deferring via MR_EventLoopAddTask eliminates the Valgrind-reported leaks. The fix works regardless of exact dispatch timing.

When MR_ClusterFree tears down nodes during a topology change, in-flight executions with pending messages were left waiting for responses that would never arrive. Under Valgrind the 5s idle timeout often could not fire before process exit, leaking the execution and its parsed results (SeriesListReplyParser allocations). - MR_NodeFree: drain pendingMessages and notify each execution via MR_SetInternalCommandResults(node, NULL, execution) - MR_SetInternalCommandResults: handle NULL reply by marking all steps done and reporting a disconnect error - MR_FreeAsyncContext: free partially parsed reply tree from hiredis reader stack before calling redisAsyncFree (works around hiredis not cleaning up mid-parse state in redisReaderFree) All three changes verified clean by Valgrind on the test_asm_with_data_and_queries_during_migrations test. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-23T20:02:40Z

src/cluster.c

+        Execution* e = MR_GetExecution(message->msg->msg, message->msg->msgLen);
+        mr_listDelNode(n->pendingMessages, head);
+        MR_SetInternalCommandResults(n->index, NULL, e);
+    }


Node free misprocesses pending messages

High Severity

MR_NodeFree now iterates n->pendingMessages and unconditionally treats each NodeSendMsg as an internal-command execution by calling MR_GetExecution(...) and MR_SetInternalCommandResults(..., NULL, e). pendingMessages can contain non-execution payloads, causing MR_GetExecution assertions or incorrectly mutating unrelated executions’ step completion state.

Additional Locations (1)

src/mr.c#L1183-L1200

galcohen-redislabs · 2026-02-24T09:32:03Z

src/cluster.c


+static void MR_FreeAsyncContext(void* ctx){
+    redisAsyncContext* ac = ctx;
+    /* Work around hiredis: redisReaderFree does not free partially parsed


That is not true. If a connection is torn down mid-parse then the redisReaderFree() is called as part of the redisFree() during teardown.

galcohen-redislabs · 2026-02-24T09:39:49Z

src/cluster.c

    if(n->c){
-        redisAsyncFree(n->c);
+        /* Defer redisAsyncFree via the event loop to avoid leaking
+         * reply objects during context teardown (verified by Valgrind). */


I still don't understand the reasoning here. This code is cleaning up the leftovers upon freeing a cluster topology. There is no need to defer anything at this stage.
I think a more probable scenario for such a leak is that for some reason the MR_ClusterFree() was not called upon teardown (but this is just a guess. we need to add logs to actually understand exactly what is going on)

galcohen-redislabs · 2026-02-24T09:50:19Z

src/cluster.c

+    /* Drain pending messages and notify the corresponding executions that this
+     * node disconnected.  Without this, orphaned executions would rely on the
+     * idle-timeout (default 5 s) which may not fire before process exit under
+     * Valgrind, leaking the execution and its results. */


If the case is that the pending messages are not cleaned we need to find out why, not write a tailored code for cleanup (note that the list was already assigned an item cleanup function: MR_ClusterFreeNodeMsg()).
This leak looks like a result of the same issue I mentioned above (again: this is just a guess and we have to verify it with logs) - that the final MR_ClusterFree() was missed.

galcohen-redislabs · 2026-02-24T09:53:54Z

src/mr.c

        return;
    }
+
+    if (!reply) {


If reply was null in any of the runs we would have crashed (since the next line accesses reply->type) and we didn't crash. So this whole if is redundant.

galcohen-redislabs and others added 27 commits February 3, 2026 20:07

MOD-13438 Add internal commands to the inner communication protocol

ffbdfd2

Initial additions of the needed types and functions. (cherry picked from commit 95fc9cc)

Adjust the includes, typedefs and function declarations

ca0af6b

(cherry picked from commit 37ea653)

Make MR_ClusterRegisterMsgReceiver() idempotent

7c45b96

(cherry picked from commit 5f0f0a9)

Spelling and other cleanups

b5154e0

(cherry picked from commit c66873c)

Send internal-command messages to current node as well

beaa220

(cherry picked from commit da3eeff)

Added support for parsing responses of internal commands and storing …

425c47c

…in steps (cherry picked from commit b252924)

WIP (doesn't compile now) - handling step-done and done notifications…

a702c0d

… under internal commands (cherry picked from commit 597d0ac)

Some cleanups. Now compiles but the done-step and done-execution part…

aa0980c

…s still have bugs (cherry picked from commit 58a15ab)

WTF? How this trivial bug survived for so many years??

b2f31de

(cherry picked from commit f635c48)

Now we can use the flag properly

33dd57b

(cherry picked from commit 7c5792a)

Mostly cleanups of funcs and structs

467712e

(cherry picked from commit d2cd1df)

minor cleanups

82b1a5e

(cherry picked from commit e64de66)

Aadded MR_ExecutionCtxSetDone()

2e70921

(cherry picked from commit f2e283d)

typo in the rust apis

4d8d66d

Minor log cleanup

ce40d25

Pin setuptools below 81 in CI

e4248ac

Avoid newer setuptools behavior in Linux and macOS workflows.

Added a comment for internal command callbacks registrations

8bca138

Retry AUTH in cases of a race that cause previous AUTH to fail

bdcb9ef

Refined the condition for when to retry AUTH

2683edb

Ride on the done.callback to avoid race between timeout and done; Als…

886d9ab

…o no need for the MR_ExecutionCtxSetDone() anymore

Merge branch 'master' into gal-13438-internal-commands-protocol

4ca8175

Don't bail out before sending NOTIFY_DONE when I'm not the initiator

ae91b57

Testing something

5bbbef5

Revert "Testing something"

e8e1bf2

This reverts commit 5bbbef5.

Fix cluster cleanup leaks in TLS and async connect paths.

c01e0fe

This prevents leaking TLS config buffers and failed async contexts in error flows reported by Valgrind.

Merge remote-tracking branch 'origin/master' into tom/fixMemoryLeak

6f12012

gabsow requested a review from galcohen-redislabs February 18, 2026 15:07

cursor bot reviewed Feb 18, 2026

View reviewed changes

galcohen-redislabs reviewed Feb 18, 2026

View reviewed changes

src/cluster.c Show resolved Hide resolved

galcohen-redislabs reviewed Feb 18, 2026

View reviewed changes

gabsow requested a review from galcohen-redislabs February 22, 2026 14:36

galcohen-redislabs reviewed Feb 22, 2026

View reviewed changes

Fix comment: free is deferred within the event loop, not across threads.

3822764

Co-authored-by: Cursor <cursoragent@cursor.com>

gabsow force-pushed the tom/fixMemoryLeak branch from 0fc565c to 3822764 Compare February 23, 2026 09:53

cursor bot reviewed Feb 23, 2026

View reviewed changes

gabsow requested a review from galcohen-redislabs February 23, 2026 11:27

gabsow force-pushed the tom/fixMemoryLeak branch from 73138bf to b3cbd9e Compare February 23, 2026 19:58

cursor bot reviewed Feb 23, 2026

View reviewed changes

galcohen-redislabs reviewed Feb 24, 2026

View reviewed changes

Comments

Conversation

gabsow commented Feb 18, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cursor bot Feb 18, 2026

Choose a reason for hiding this comment

Deferred async free risks double-free on natural disconnect

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor bot Feb 23, 2026

Choose a reason for hiding this comment

“Deferred” free may run immediately

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 23, 2026

Choose a reason for hiding this comment

Node free misprocesses pending messages

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gabsow commented Feb 18, 2026 •

edited by cursor bot

Loading