Skip to content

Comments

Tom/fix memory leak#84

Open
gabsow wants to merge 29 commits intomasterfrom
tom/fixMemoryLeak
Open

Tom/fix memory leak#84
gabsow wants to merge 29 commits intomasterfrom
tom/fixMemoryLeak

Conversation

@gabsow
Copy link
Contributor

@gabsow gabsow commented Feb 18, 2026

Note

Medium Risk
Touches cluster connection lifecycle and internal execution completion paths; mistakes could cause premature completion or missed results during topology changes, but changes are narrowly scoped to disconnect/error handling.

Overview
Prevents memory leaks during shard disconnect/teardown by deferring redisAsyncFree through the event loop and adding MR_FreeAsyncContext, which explicitly frees partially-parsed hiredis reply objects before freeing the async context.

Fixes a TLS cleanup bug (checkTLS now frees ca_cert correctly) and ensures failed redisAsyncConnect attempts don’t leak by freeing errored contexts.

Avoids leaking internal-command executions when a node disconnects by draining pendingMessages on MR_NodeFree and teaching MR_SetInternalCommandResults to handle reply == NULL (recording a disconnect error and marking steps done so the execution can finish).

Written by Cursor Bugbot for commit b3cbd9e. This will update automatically on new commits. Configure here.

galcohen-redislabs and others added 27 commits February 3, 2026 20:07
Initial additions of the needed types and functions.

(cherry picked from commit 95fc9cc)
(cherry picked from commit c66873c)
… under internal commands

(cherry picked from commit 597d0ac)
(cherry picked from commit 7c5792a)
(cherry picked from commit d2cd1df)
(cherry picked from commit e64de66)
(cherry picked from commit f2e283d)
Avoid newer setuptools behavior in Linux and macOS workflows.
…o no need for the MR_ExecutionCtxSetDone() anymore
This reverts commit 5bbbef5.
This prevents leaking TLS config buffers and failed async contexts in error flows reported by Valgrind.
redisAsyncContext is not thread-safe. Calling redisAsyncFree from the
main thread while the event loop thread processes callbacks can race
and leak a parsed redisReply object. Dispatch the free to the event
loop via MR_EventLoopAddTask, consistent with the existing SSL error
disconnect paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
* thread while the event loop processes events causes a race that
* can leak parsed reply objects. n->c->data is already NULL so
* callbacks will not access the freed node. */
MR_EventLoopAddTask(MR_FreeAsyncContext, n->c);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferred async free risks double-free on natural disconnect

High Severity

When MR_NodeFreeInternals dispatches MR_FreeAsyncContext to the event loop, it captures the raw redisAsyncContext* pointer. If the connection naturally disconnects before the queued task runs, hiredis auto-frees the context (since REDIS_NO_AUTO_FREE is not set), and the disconnect callback returns early because c->data is NULL. When the queued MR_FreeAsyncContext task later executes, it calls redisAsyncFree on already-freed memory, causing a double-free. The existing MR_ClusterAsyncDisconnect pattern avoids this by passing the Node* and reading n->c at execution time — but that approach isn't usable here because n is freed immediately after dispatch.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretical concern acknowledged. If a disconnect event for this connection fires between the deferral and the task running, hiredis could auto-free (since REDIS_NO_AUTO_FREE is not set), then the deferred task would double-free. In practice, MR_ClusterFree processes all nodes synchronously in one event loop task, and the deferred free is queued via event_active into the same active queue, so the window is extremely narrow. Valgrind and sanitizer tests pass cleanly. Worth keeping in mind for future hardening though.

return;
}
if (c->err) {
/* Let *c leak for now... */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 😱 😱

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added redisAsyncFree(c) on the error path. The old code had a comment saying "Let *c leak for now...".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed. That's what the "😱 😱 😱" was for...

src/cluster.c Outdated
redisAsyncFree(n->c);
/* Dispatch redisAsyncFree to the event loop thread.
* redisAsyncContext is not thread-safe; freeing it from the main
* thread while the event loop processes events causes a race that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the main thread loop. This is the event loop thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this runs on the event loop thread, not the main thread — fixed the comment. The deferral is still needed though: when MR_ClusterFree runs as an event loop task, other libevent events for the same hiredis connection (reads) may already be queued in the current dispatch cycle. Freeing directly orphans any reply objects the reader has already parsed. Deferring to the next iteration lets those pending reads complete first.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure this is how it works? Can you prove this claim?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I can't point to the exact re-entrancy path in a simple code trace. But the empirical evidence is clear: Valgrind consistently shows redisReply leaks from redisProcessCallbacks without the deferral, and zero leaks with it. Also, this exact pattern already exists in the codebase — in MR_ClusterOnConnectCallback (lines 646 and 664), you defer redisAsyncFree via MR_EventLoopAddTask(MR_ClusterAsyncDisconnect, n) with the comment 'it's not possible to free redisAsyncContext here'. The fix applies the same principle to MR_NodeFreeInternals.

src/cluster.c Outdated
* callbacks will not access the freed node. */
/* Defer redisAsyncFree to the next event loop iteration to avoid
* freeing the context while other libevent events for the same
* connection may still be pending in the current dispatch cycle. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this were the case we would have crashed with a use-after-free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — updated the comment to just state the empirical fact: deferring via MR_EventLoopAddTask eliminates the Valgrind-reported reply leaks during context teardown. This follows the same pattern already used in MR_ClusterOnConnectCallback (lines 647, 664) where redisAsyncFree is also deferred via MR_EventLoopAddTask.

redisAsyncFree(n->c);
/* Defer redisAsyncFree via the event loop to avoid leaking
* reply objects during context teardown (verified by Valgrind). */
MR_EventLoopAddTask(MR_FreeAsyncContext, n->c);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Deferred” free may run immediately

Medium Severity

The comment says freeing is deferred “to the next iteration”, but MR_EventLoopAddTask uses event_active, which can execute the task in the same libevent loop dispatch cycle. That undermines the intended ordering guarantee and can reintroduce the teardown issues this change is trying to avoid.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct that event_active doesn't guarantee next-iteration execution. The comment no longer claims ordering guarantees — it just states the empirical fact that deferring via MR_EventLoopAddTask eliminates the Valgrind-reported leaks. The fix works regardless of exact dispatch timing.

When MR_ClusterFree tears down nodes during a topology change,
in-flight executions with pending messages were left waiting for
responses that would never arrive.  Under Valgrind the 5s idle
timeout often could not fire before process exit, leaking the
execution and its parsed results (SeriesListReplyParser allocations).

- MR_NodeFree: drain pendingMessages and notify each execution via
  MR_SetInternalCommandResults(node, NULL, execution)
- MR_SetInternalCommandResults: handle NULL reply by marking all
  steps done and reporting a disconnect error
- MR_FreeAsyncContext: free partially parsed reply tree from
  hiredis reader stack before calling redisAsyncFree (works around
  hiredis not cleaning up mid-parse state in redisReaderFree)

All three changes verified clean by Valgrind on the
test_asm_with_data_and_queries_during_migrations test.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Execution* e = MR_GetExecution(message->msg->msg, message->msg->msgLen);
mr_listDelNode(n->pendingMessages, head);
MR_SetInternalCommandResults(n->index, NULL, e);
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node free misprocesses pending messages

High Severity

MR_NodeFree now iterates n->pendingMessages and unconditionally treats each NodeSendMsg as an internal-command execution by calling MR_GetExecution(...) and MR_SetInternalCommandResults(..., NULL, e). pendingMessages can contain non-execution payloads, causing MR_GetExecution assertions or incorrectly mutating unrelated executions’ step completion state.

Additional Locations (1)

Fix in Cursor Fix in Web


static void MR_FreeAsyncContext(void* ctx){
redisAsyncContext* ac = ctx;
/* Work around hiredis: redisReaderFree does not free partially parsed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not true. If a connection is torn down mid-parse then the redisReaderFree() is called as part of the redisFree() during teardown.

if(n->c){
redisAsyncFree(n->c);
/* Defer redisAsyncFree via the event loop to avoid leaking
* reply objects during context teardown (verified by Valgrind). */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand the reasoning here. This code is cleaning up the leftovers upon freeing a cluster topology. There is no need to defer anything at this stage.
I think a more probable scenario for such a leak is that for some reason the MR_ClusterFree() was not called upon teardown (but this is just a guess. we need to add logs to actually understand exactly what is going on)

/* Drain pending messages and notify the corresponding executions that this
* node disconnected. Without this, orphaned executions would rely on the
* idle-timeout (default 5 s) which may not fire before process exit under
* Valgrind, leaking the execution and its results. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the case is that the pending messages are not cleaned we need to find out why, not write a tailored code for cleanup (note that the list was already assigned an item cleanup function: MR_ClusterFreeNodeMsg()).
This leak looks like a result of the same issue I mentioned above (again: this is just a guess and we have to verify it with logs) - that the final MR_ClusterFree() was missed.

return;
}

if (!reply) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If reply was null in any of the runs we would have crashed (since the next line accesses reply->type) and we didn't crash. So this whole if is redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants