Skip to content

Commit ebefb9f

Browse files
committed
fix(distributed): safe-pinning fail-closed + deterministic 2PC phase timeouts (issues 01, 02)
Combines the two prior-session 2PC corrections that share distributed_txn.h. Issue 01 (P0) — distributed 2PC must require safe session pinning Pooled executors that cannot pin a physical connection across the DML and PREPARE/COMMIT phases would silently corrupt 2PC: phase 1 PREPARE could land on one backend session, phase 2 COMMIT on another, and the participant would never see the commit. Old code fell back to the unpinned path whenever checkout_session() returned nullptr -- which is exactly what every pooled executor returned, with no warning. Fix: - New virtual bool RemoteExecutor::allows_unpinned_distributed_2pc() defaults to false. Executors must explicitly opt in to the unpinned path. - MySQLRemoteExecutor and PgSQLRemoteExecutor opt in (they keep one durable physical connection per backend for the executor's lifetime). - MultiRemoteExecutor opts in (it composes the two single-conn executors). - ThreadSafeMultiRemoteExecutor (the pooled, production path) does NOT opt in and so DistributedTransactionManager fails closed when asked to enlist a backend without a pinned session. Issue 02 (P1) — make 2PC phase timeouts deterministic The previous code tried to set max_execution_time on MySQL XA control statements, which silently does nothing (MySQL ignores it for XA). PostgreSQL got a separate SET statement_timeout that ran some indeterminate time before PREPARE / COMMIT PREPARED. Fix: - PostgreSQL: SET statement_timeout = N is issued on the same pinned session immediately before each phase command, so the timeout governs that exact statement. - MySQL: dropped the misleading per-phase SQL timeout. XA control statements are bounded only by connection-level read/write timeouts (which the connection pool already configures). Doctrine is now stated explicitly in the comments instead of pretending there's a per-statement bound. - distributed_txn.h is ~111 lines smaller thanks to the cleanup. Tests: - tests/test_distributed_txn.cpp updated to: * exercise the fail-closed path via a mock executor that does not opt in to unpinned 2PC, * verify the per-phase PostgreSQL timeout SET is issued on the pinned session, * keep coverage for happy paths (XA + PREPARE TRANSACTION) and in-doubt WAL recovery. - tests/test_session.cpp gets a small update for the new flag. Verification: - make test: 1208 passed, 37 skipped (live-backend integration), 0 failed. Plan: - docs/superpowers/plans/2026-04-15-distributed-2pc-safe-pinning.md is the implementation plan that drove issue 01. Committed alongside the change.
1 parent 31eb6ef commit ebefb9f

9 files changed

Lines changed: 265 additions & 112 deletions
Lines changed: 161 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,161 @@
1+
# Distributed 2PC Safe Pinning Implementation Plan
2+
3+
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
4+
5+
**Goal:** Prevent distributed 2PC from silently using unsafe unpinned fallback paths unless the executor explicitly declares that fallback safe.
6+
7+
**Architecture:** Extend the `RemoteExecutor` contract with an explicit capability for legacy-safe unpinned 2PC fallback. `DistributedTransactionManager` will keep using pinned sessions when available, but it will reject `nullptr` session fallback unless the executor opts in. Single-connection executors and test mocks can opt in; pooled executors will remain pinned-session-only.
8+
9+
**Tech Stack:** C++17, GoogleTest, existing `RemoteExecutor`/`DistributedTransactionManager` interfaces
10+
11+
---
12+
13+
### Task 1: Add the executor capability flag
14+
15+
**Files:**
16+
- Modify: `include/sql_engine/remote_executor.h`
17+
- Modify: `include/sql_engine/mysql_remote_executor.h`
18+
- Modify: `include/sql_engine/pgsql_remote_executor.h`
19+
- Modify: `include/sql_engine/multi_remote_executor.h`
20+
- Modify: `src/sql_engine/multi_remote_executor.cpp`
21+
22+
- [ ] **Step 1: Add the failing tests first**
23+
24+
Add tests that assert distributed 2PC fallback is rejected for an executor that returns `nullptr` from `checkout_session()` and does not explicitly opt in.
25+
26+
- [ ] **Step 2: Run targeted tests to verify failure**
27+
28+
Run: `./run_tests --gtest_filter="DistributedTxnOverrides.*:DistributedTxn.*" --gtest_brief=1`
29+
30+
Expected: at least one new test fails because the current code still routes through the legacy fallback.
31+
32+
- [ ] **Step 3: Add the API hook**
33+
34+
Update `RemoteExecutor` with a default capability method:
35+
36+
```cpp
37+
virtual bool allows_unpinned_distributed_2pc() const { return false; }
38+
```
39+
40+
Override it in single-connection executors and wrappers that are safe without pinning:
41+
42+
```cpp
43+
bool allows_unpinned_distributed_2pc() const override { return true; }
44+
```
45+
46+
- [ ] **Step 4: Run compile-targeted tests**
47+
48+
Run: `make test`
49+
50+
Expected: build succeeds even if the new regression tests still fail.
51+
52+
### Task 2: Harden distributed transaction fallback
53+
54+
**Files:**
55+
- Modify: `include/sql_engine/distributed_txn.h`
56+
- Test: `tests/test_distributed_txn.cpp`
57+
- Test: `tests/test_session.cpp`
58+
59+
- [ ] **Step 1: Update the fallback contract**
60+
61+
Change the `nullptr` session path from implicit fallback to guarded fallback:
62+
63+
```cpp
64+
if (!session) {
65+
if (!executor_.allows_unpinned_distributed_2pc()) {
66+
return false;
67+
}
68+
// legacy-safe single-connection fallback
69+
}
70+
```
71+
72+
Apply the same rule in `execute_participant_dml()` so `route_dml()` cannot bypass the safety check.
73+
74+
- [ ] **Step 2: Add or update regression tests**
75+
76+
Cover:
77+
78+
- executor with no sessions and no opt-in => enlist fails
79+
- executor with no sessions and explicit opt-in => legacy fallback still works
80+
- pinned-session executor path remains valid
81+
82+
- [ ] **Step 3: Run targeted tests**
83+
84+
Run: `./run_tests --gtest_filter="DistributedTxn.*:DistributedTxnOverrides.*" --gtest_brief=1`
85+
86+
Expected: all targeted distributed transaction tests pass.
87+
88+
### Task 3: Align comments and route expectations
89+
90+
**Files:**
91+
- Modify: `include/sql_engine/remote_executor.h`
92+
- Modify: `include/sql_engine/distributed_txn.h`
93+
- Modify: `tests/test_session.cpp`
94+
95+
- [ ] **Step 1: Update comments to match the new contract**
96+
97+
Replace comments that describe `nullptr` as an acceptable generic fallback with language that distinguishes:
98+
99+
- pinned-session path
100+
- explicit legacy-safe fallback
101+
- rejected unsafe fallback
102+
103+
- [ ] **Step 2: Update tests that currently assume every `nullptr` path is allowed**
104+
105+
For example, tests using `TrackingRemoteExecutor` should explicitly opt in if they intend to validate the legacy-safe fallback.
106+
107+
- [ ] **Step 3: Run focused session tests**
108+
109+
Run: `./run_tests --gtest_filter="DistributedTxnOverrides.*:Session*" --gtest_brief=1`
110+
111+
Expected: updated transaction and session routing tests pass.
112+
113+
### Task 4: Final verification
114+
115+
**Files:**
116+
- Modify: `include/sql_engine/remote_executor.h`
117+
- Modify: `include/sql_engine/distributed_txn.h`
118+
- Modify: `include/sql_engine/mysql_remote_executor.h`
119+
- Modify: `include/sql_engine/pgsql_remote_executor.h`
120+
- Modify: `include/sql_engine/multi_remote_executor.h`
121+
- Modify: `src/sql_engine/multi_remote_executor.cpp`
122+
- Modify: `tests/test_distributed_txn.cpp`
123+
- Modify: `tests/test_session.cpp`
124+
125+
- [ ] **Step 1: Run the full targeted verification set**
126+
127+
Run:
128+
129+
```bash
130+
./run_tests --gtest_filter="DistributedTxn.*:DistributedTxnOverrides.*:Session*" --gtest_brief=1
131+
```
132+
133+
Expected: all matching tests pass.
134+
135+
- [ ] **Step 2: Run the full suite**
136+
137+
Run:
138+
139+
```bash
140+
./run_tests --gtest_brief=1
141+
```
142+
143+
Expected: no regressions; backend-dependent tests may still skip if local services are unavailable.
144+
145+
- [ ] **Step 3: Commit**
146+
147+
```bash
148+
git add include/sql_engine/remote_executor.h \
149+
include/sql_engine/distributed_txn.h \
150+
include/sql_engine/mysql_remote_executor.h \
151+
include/sql_engine/pgsql_remote_executor.h \
152+
include/sql_engine/multi_remote_executor.h \
153+
src/sql_engine/multi_remote_executor.cpp \
154+
tests/test_distributed_txn.cpp \
155+
tests/test_session.cpp \
156+
docs/issues/README.md \
157+
docs/issues/01-distributed-2pc-safe-session-pinning.md \
158+
docs/superpowers/specs/2026-04-15-implementation-gap-backlog-design.md \
159+
docs/superpowers/plans/2026-04-15-distributed-2pc-safe-pinning.md
160+
git commit -m "fix: fail closed on unsafe distributed 2pc fallback"
161+
```

include/sql_engine/distributed_txn.h

Lines changed: 31 additions & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -56,24 +56,23 @@ class DistributedTransactionManager : public TransactionManager {
5656
// continues (caller might prefer availability over durability).
5757
void set_require_durable_log(bool required) { require_durable_log_ = required; }
5858

59-
// Set a tight per-phase statement timeout (in milliseconds). When > 0,
60-
// the manager issues a SET SESSION max_execution_time (MySQL) or
61-
// SET LOCAL statement_timeout (PostgreSQL) on each participant BEFORE
62-
// phase 1 (XA PREPARE / PREPARE TRANSACTION) and again before phase 2
63-
// (XA COMMIT / COMMIT PREPARED). This is independent of the backend
64-
// connection's default read/write timeout -- use it to bound 2PC
65-
// specifically without affecting other queries.
59+
// Set a tight per-phase statement timeout (in milliseconds).
6660
//
67-
// 0 (default) means "don't override", fall back to whatever the
68-
// backend connection's read/write timeout provides.
61+
// PostgreSQL:
62+
// Before PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED, the manager
63+
// issues SET statement_timeout = <ms> on the participant session
64+
// immediately before the phase SQL. This gives a deterministic
65+
// session-level timeout on the same connection used for the phase
66+
// statement.
6967
//
70-
// NOTE: because ThreadSafeMultiRemoteExecutor may hand us a different
71-
// pooled connection for each execute_dml call, the SET must be issued
72-
// immediately before the statement whose timeout it bounds. We do
73-
// that by concatenating them with "; " when multi-statement is
74-
// supported, OR by issuing two separate execute_dml calls and
75-
// tolerating that the second one may not actually have the timeout
76-
// in effect (best-effort). For the MVP we use the two-call approach.
68+
// MySQL:
69+
// XA control statements are not bounded by max_execution_time, so we do
70+
// not emit a misleading per-phase SQL timeout. MySQL protection comes
71+
// from the connection read/write timeouts configured on the executor.
72+
//
73+
// 0 (default) means "don't override". PostgreSQL then falls back to the
74+
// backend/session defaults, and MySQL continues to rely on connection
75+
// read/write timeouts only.
7776
void set_phase_statement_timeout_ms(uint32_t ms) {
7877
phase_statement_timeout_ms_ = ms;
7978
}
@@ -108,14 +107,8 @@ class DistributedTransactionManager : public TransactionManager {
108107
if (started_.count(name)) return true; // already enlisted
109108

110109
// Acquire a pinned session. If the executor doesn't support
111-
// pinning (returns nullptr), fall back to the unpinned path:
112-
// BEGIN/XA START goes through execute_dml and subsequent phase
113-
// calls will also use execute_dml. That mode is only correct
114-
// for executors that aren't pool-based (e.g. the mock used in
115-
// tests OR a non-thread-safe MySQLRemoteExecutor that holds one
116-
// connection per backend). ThreadSafeMultiRemoteExecutor DOES
117-
// implement checkout_session so the pinning path is taken for
118-
// real workloads.
110+
// pinning (returns nullptr), only executors that explicitly
111+
// declare unpinned 2PC safe are allowed to use the legacy path.
119112
std::unique_ptr<RemoteSession> session = executor_.checkout_session(backend_name);
120113

121114
bool ok = false;
@@ -127,9 +120,6 @@ class DistributedTransactionManager : public TransactionManager {
127120
} else {
128121
sql = "BEGIN";
129122
}
130-
if (phase_statement_timeout_ms_ > 0) {
131-
session->execute_dml(make_timeout_stringref());
132-
}
133123
DmlResult r = session->execute_dml(
134124
sql_parser::StringRef{sql.c_str(),
135125
static_cast<uint32_t>(sql.size())});
@@ -138,10 +128,11 @@ class DistributedTransactionManager : public TransactionManager {
138128
sessions_[name] = std::move(session);
139129
}
140130
} else {
141-
// Unpinned fallback: executor doesn't support sessions.
142-
// Use the legacy path, which works with mock executors and
143-
// single-connection executors but is broken for pooled
144-
// real-backend 2PC.
131+
// Unpinned fallback is valid only for executors that
132+
// explicitly guarantee one durable connection per backend.
133+
if (!executor_.allows_unpinned_distributed_2pc()) {
134+
return false;
135+
}
145136
if (dialect_ == BackendDialect::MYSQL) {
146137
std::string sql = "XA START '" + txn_id_ + "'";
147138
ok = send_sql(backend_name, sql);
@@ -183,7 +174,7 @@ class DistributedTransactionManager : public TransactionManager {
183174
if (it != sessions_.end() && it->second) {
184175
return it->second->execute_dml(sql);
185176
}
186-
// Fallback: unpinned mode. Use the legacy path.
177+
// Legacy-safe single-connection fallback.
187178
return executor_.execute_dml(backend_name, sql);
188179
}
189180

@@ -289,28 +280,6 @@ class DistributedTransactionManager : public TransactionManager {
289280
uint32_t auto_compact_threshold_ = 0;
290281
uint32_t completions_since_compact_ = 0;
291282

292-
// Best-effort: set a per-session statement timeout on a backend before
293-
// issuing a phase-1 or phase-2 SQL. Returns true if the SET succeeded
294-
// OR if no timeout is configured; false only if the SET itself fails
295-
// and the caller asked for a real timeout (in which case the caller
296-
// may want to abort rather than risk an unbounded hang).
297-
bool maybe_set_statement_timeout(const char* backend) {
298-
if (phase_statement_timeout_ms_ == 0) return true;
299-
std::string sql;
300-
if (dialect_ == BackendDialect::MYSQL) {
301-
// MySQL 5.7.4+: max_execution_time is in milliseconds and
302-
// only bounds SELECTs. For DML and XA commands, the client
303-
// read_timeout is our real protection. We still set this for
304-
// SELECTs that might be issued between phases.
305-
sql = "SET SESSION max_execution_time = " +
306-
std::to_string(phase_statement_timeout_ms_);
307-
} else {
308-
sql = "SET LOCAL statement_timeout = " +
309-
std::to_string(phase_statement_timeout_ms_);
310-
}
311-
return send_sql(backend, sql);
312-
}
313-
314283
// Write the phase-2 decision to the durable log before dispatching.
315284
// Returns true if the commit/rollback can proceed:
316285
// - log not configured: true (log-less mode preserves legacy behavior)
@@ -475,36 +444,18 @@ class DistributedTransactionManager : public TransactionManager {
475444
return send_sql(participant.c_str(), sql);
476445
}
477446

478-
// Same as maybe_set_statement_timeout but for a participant -- routes
479-
// through the pinned session when available.
447+
// Set a participant-scoped phase timeout immediately before the phase SQL.
448+
// PostgreSQL gets a real session-level statement_timeout on the same
449+
// connection. MySQL returns true without emitting SQL because XA control
450+
// statements are bounded by connection read/write timeouts, not by
451+
// max_execution_time.
480452
bool maybe_set_statement_timeout_participant(const std::string& participant) {
481453
if (phase_statement_timeout_ms_ == 0) return true;
482-
std::string sql;
483-
if (dialect_ == BackendDialect::MYSQL) {
484-
sql = "SET SESSION max_execution_time = " +
485-
std::to_string(phase_statement_timeout_ms_);
486-
} else {
487-
sql = "SET LOCAL statement_timeout = " +
488-
std::to_string(phase_statement_timeout_ms_);
489-
}
454+
if (dialect_ == BackendDialect::MYSQL) return true;
455+
std::string sql = "SET statement_timeout = " +
456+
std::to_string(phase_statement_timeout_ms_);
490457
return send_sql_participant(participant, sql);
491458
}
492-
493-
// For passing a StringRef to session->execute_dml in enlist_backend.
494-
// We only need this temporarily inside enlist_backend, so a thread-
495-
// local buffer is fine.
496-
sql_parser::StringRef make_timeout_stringref() {
497-
static thread_local std::string buf;
498-
if (dialect_ == BackendDialect::MYSQL) {
499-
buf = "SET SESSION max_execution_time = " +
500-
std::to_string(phase_statement_timeout_ms_);
501-
} else {
502-
buf = "SET LOCAL statement_timeout = " +
503-
std::to_string(phase_statement_timeout_ms_);
504-
}
505-
return sql_parser::StringRef{buf.c_str(),
506-
static_cast<uint32_t>(buf.size())};
507-
}
508459
};
509460

510461
} // namespace sql_engine

include/sql_engine/multi_remote_executor.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ class MultiRemoteExecutor : public RemoteExecutor {
2020
void add_backend(const BackendConfig& config);
2121
ResultSet execute(const char* backend_name, sql_parser::StringRef sql) override;
2222
DmlResult execute_dml(const char* backend_name, sql_parser::StringRef sql) override;
23+
bool allows_unpinned_distributed_2pc() const override;
2324
void disconnect_all();
2425

2526
private:

include/sql_engine/mysql_remote_executor.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ class MySQLRemoteExecutor : public RemoteExecutor {
2121
void add_backend(const BackendConfig& config);
2222
ResultSet execute(const char* backend_name, sql_parser::StringRef sql) override;
2323
DmlResult execute_dml(const char* backend_name, sql_parser::StringRef sql) override;
24+
bool allows_unpinned_distributed_2pc() const override { return true; }
2425
void disconnect_all();
2526

2627
private:

include/sql_engine/pgsql_remote_executor.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@ class PgSQLRemoteExecutor : public RemoteExecutor {
2121
void add_backend(const BackendConfig& config);
2222
ResultSet execute(const char* backend_name, sql_parser::StringRef sql) override;
2323
DmlResult execute_dml(const char* backend_name, sql_parser::StringRef sql) override;
24+
bool allows_unpinned_distributed_2pc() const override { return true; }
2425
void disconnect_all();
2526

2627
private:

include/sql_engine/remote_executor.h

Lines changed: 12 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -22,18 +22,23 @@ class RemoteExecutor {
2222
return r;
2323
}
2424

25+
// Returns true only when this executor can safely run distributed 2PC
26+
// without pinned sessions. This is valid for executors that keep one
27+
// durable physical connection per backend for the full lifetime of the
28+
// executor. Pooled executors must return false and rely on
29+
// checkout_session() instead.
30+
virtual bool allows_unpinned_distributed_2pc() const { return false; }
31+
2532
// Check out a pinned session for `backend_name`. The returned session
2633
// owns a specific physical connection for its lifetime; every call on
2734
// it goes to that same connection. On destruction the connection is
2835
// returned to the pool (or closed if the session was poisoned).
2936
//
30-
// REQUIRED for correct 2PC behavior: MySQL XA and PostgreSQL PREPARE
31-
// TRANSACTION both need the transaction's DML and PREPARE to share a
32-
// session. Executors that cannot honor pinning (or don't need to --
33-
// e.g. a single-connection-per-backend executor) may return nullptr;
34-
// DistributedTransactionManager detects that and falls back to the
35-
// unpinned path, which works for simple single-backend transactions
36-
// but silently corrupts real multi-backend 2PC.
37+
// REQUIRED for correct pooled 2PC behavior: MySQL XA and PostgreSQL
38+
// PREPARE TRANSACTION both need the transaction's DML and PREPARE to
39+
// share a session. Executors may return nullptr only if they also
40+
// override allows_unpinned_distributed_2pc() to declare the legacy
41+
// single-connection fallback safe for their execution model.
3742
virtual std::unique_ptr<RemoteSession> checkout_session(const char* backend_name) {
3843
(void)backend_name;
3944
return nullptr;

src/sql_engine/multi_remote_executor.cpp

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,10 @@ DmlResult MultiRemoteExecutor::execute_dml(const char* backend_name, sql_parser:
4343
}
4444
}
4545

46+
bool MultiRemoteExecutor::allows_unpinned_distributed_2pc() const {
47+
return true;
48+
}
49+
4650
void MultiRemoteExecutor::disconnect_all() {
4751
mysql_exec_.disconnect_all();
4852
pgsql_exec_.disconnect_all();

0 commit comments

Comments
 (0)