
Add control plane / data plane architecture #146

Merged
fuziontech merged 3 commits into main from control-plane-architecture
Feb 6, 2026

Conversation

@fuziontech

Summary

  • Adds a multi-process control plane / data plane architecture for zero-downtime deployments and cross-session DuckDB cache reuse
  • Control plane accepts TCP connections and routes them to long-lived worker processes via Unix socket FD passing (SCM_RIGHTS)
  • Workers manage shared DuckDB instances, handling multiple client sessions as goroutines with full PG wire protocol support
  • Graceful control plane handover transfers the TCP listener FD to a new process, keeping workers (and active queries) untouched
  • Rolling worker updates via SIGUSR2 drain and replace workers one at a time

Architecture

  PG Client ──TLS──> Control Plane (TCP listener, rate limiting, routing)
                         │
                    FD pass (SCM_RIGHTS) + gRPC management
                         │
                    ┌────┴────┐
                    ▼         ▼
                 Worker 1  Worker 2  ...  (long-lived, shared DuckDB)
                  ├─ Session 1          
                  ├─ Session 2          
                  └─ Session N          
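
A minimal sketch of the routing step, with pickWorker and sendFD as hypothetical stand-ins for the pool's least-connections choice and the fdpass helper (the PR's actual code may differ):

    import (
        "net"
        "os"
    )

    // route accepts TCP connections and hands each one's file descriptor
    // to a worker. The *os.File returned by File() is a dup, so the
    // control plane can close its copies once the descriptor is sent.
    func route(ln *net.TCPListener, pickWorker func() *net.UnixConn,
        sendFD func(*net.UnixConn, *os.File) error) error {
        for {
            conn, err := ln.AcceptTCP()
            if err != nil {
                return err
            }
            f, err := conn.File() // dup the connection's descriptor
            _ = conn.Close()      // control plane's copy is no longer needed
            if err != nil {
                continue
            }
            err = sendFD(pickWorker(), f) // SCM_RIGHTS under the hood
            _ = f.Close()                 // the worker now owns its own dup
            if err != nil {
                return err
            }
        }
    }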

New packages

  • controlplane/ - Control plane, worker pool, handover, worker process
  • controlplane/proto/ - gRPC service definition (6 RPCs: Configure, AcceptConnection, CancelQuery, Drain, Health, Shutdown)
  • controlplane/fdpass/ - Unix socket FD passing via SCM_RIGHTS
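
The FD-passing mechanics boil down to a single SCM_RIGHTS control message per connection. A minimal sketch using golang.org/x/sys/unix (illustrative; not necessarily the fdpass package's actual API):

    package fdpass // illustrative only

    import (
        "net"
        "os"

        "golang.org/x/sys/unix"
    )

    // Send writes one file descriptor over a connected Unix domain socket
    // as an SCM_RIGHTS control message. A single data byte accompanies it
    // so the receiver has something to read alongside the ancillary data.
    func Send(conn *net.UnixConn, f *os.File) error {
        rights := unix.UnixRights(int(f.Fd()))
        _, _, err := conn.WriteMsgUnix([]byte{0}, rights, nil)
        return err
    }

    // Receive reads one file descriptor and wraps it in an *os.File.
    // The kernel dups the descriptor into the receiving process.
    func Receive(conn *net.UnixConn) (*os.File, error) {
        buf := make([]byte, 1)
        oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit fd
        _, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
        if err != nil {
            return nil, err
        }
        msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
        if err != nil {
            return nil, err
        }
        fds, err := unix.ParseUnixRights(&msgs[0])
        if err != nil {
            return nil, err
        }
        return os.NewFile(uintptr(fds[0]), "passed-fd"), nil
    }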

Modified packages

  • server/ - Extracted CreateDBConnection, LoadExtensions, AttachDuckLake as standalone functions; added exports.go for cross-package access to protocol functions
  • main.go - Added --mode flag routing (standalone/control-plane/worker)
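
The mode dispatch amounts to a simple switch. A hedged sketch, with runStandalone, runControlPlane, and runWorker as stand-ins for the real entry points:

    import (
        "flag"
        "fmt"
        "log"
    )

    func main() {
        mode := flag.String("mode", "standalone", "standalone | control-plane | worker")
        flag.Parse()

        var err error
        switch *mode {
        case "standalone":
            err = runStandalone() // existing single-process path, unchanged
        case "control-plane":
            err = runControlPlane() // TCP listener + worker pool
        case "worker":
            err = runWorker() // gRPC-managed DuckDB worker process
        default:
            err = fmt.Errorf("unknown --mode %q", *mode)
        }
        if err != nil {
            log.Fatal(err)
        }
    }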

Usage

# Standalone (default, unchanged behavior)
./duckgres --port 5432

# Control plane mode
./duckgres --mode control-plane --port 5432 --worker-count 4

# Zero-downtime deployment
./duckgres-v2 --mode control-plane --port 5432 --handover-socket /var/run/duckgres/handover.sock

# Rolling worker update
kill -USR2 <control-plane-pid>
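
The zero-downtime handover relies on the fact that a listening socket's descriptor can be duplicated and shipped like any other FD. A sketch of the receiving side, assuming the new process dials the handover socket and a recvFD helper wraps the fdpass receive (direction and names are assumptions, not confirmed by the PR):

    import (
        "net"
        "os"
    )

    // adoptListener obtains the old control plane's listening socket over
    // the handover socket and rebuilds a net.Listener from it, so the port
    // is never closed and no accept gap occurs.
    func adoptListener(handoverSocket string,
        recvFD func(*net.UnixConn) (*os.File, error)) (net.Listener, error) {
        conn, err := net.DialUnix("unix", nil,
            &net.UnixAddr{Name: handoverSocket, Net: "unix"})
        if err != nil {
            return nil, err
        }
        defer conn.Close()

        f, err := recvFD(conn) // dup'd listener fd from the old process
        if err != nil {
            return nil, err
        }
        defer f.Close() // net.FileListener dups again; this copy can go

        return net.FileListener(f) // same port, no downtime
    }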

Test plan

  • Verify go build succeeds
  • Verify existing unit tests pass (go test ./...)
  • Verify FD passing tests pass (go test ./controlplane/fdpass/)
  • Verify standalone mode is completely unchanged (default behavior)
  • Test control plane mode with psql: \dt, \d, queries, COPY, prepared statements
  • Test multiple concurrent psql sessions distribute across workers
  • Test graceful deployment with handover socket (long query survives)
  • Test rolling update via SIGUSR2

🤖 Generated with Claude Code

…ents

Implement a multi-process architecture that splits duckgres into a control
plane (connection management, routing) and data plane (pool of long-lived
DuckDB worker processes). This enables zero-downtime deployments, cross-session
DuckDB cache reuse, and rolling worker updates.

Key components:
- gRPC-based worker management (Configure, Health, Drain, Shutdown)
- Unix socket FD passing via SCM_RIGHTS for TCP connection handoff
- Least-connections load balancing across worker pool
- Graceful control plane handover via listener FD transfer
- Rolling worker updates triggered by SIGUSR2
- Health check loop with automatic worker restart

New CLI modes: --mode control-plane | worker | standalone (default)
Standalone mode (existing behavior) is completely unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
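
The least-connections pick mentioned above might look roughly like this (ActiveConnections is an assumed per-worker counter; the PR's actual field names may differ):

    // leastLoaded returns the worker currently handling the fewest client
    // sessions. ActiveConnections stands in for whatever per-worker load
    // counter the pool actually tracks.
    func (p *WorkerPool) leastLoaded() *ManagedWorker {
        p.mu.Lock()
        defer p.mu.Unlock()
        var best *ManagedWorker
        for _, w := range p.workers {
            if best == nil || w.ActiveConnections < best.ActiveConnections {
                best = w
            }
        }
        return best
    }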
@fuziontech fuziontech requested review from a team and EDsCODE February 6, 2026 01:23
@fuziontech

Code review

Found 2 issues:

  1. done channel never closed for handed-over workers, causing shutdown/rolling-update hangs. ConnectExistingWorker creates done: make(chan struct{}) but, unlike SpawnWorker (which has a goroutine calling close(worker.done) on process exit at line 135), there is no goroutine monitoring the handed-over worker process. ShutdownAll and RollingUpdate both wait on <-w.done, which will never unblock for these workers -- the code falls through to the timeout every time, causing unnecessary delays and preventing clean shutdown of handed-over workers.

// ConnectExistingWorker connects to a worker that was handed over from a previous control plane.
// The worker process is already running - we just need to establish gRPC connection.
func (p *WorkerPool) ConnectExistingWorker(id int, grpcSocket, fdSocket string) error {
    // Connect gRPC
    conn, err := grpc.NewClient(
        "unix://"+grpcSocket,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        return fmt.Errorf("connect gRPC to worker %d: %w", id, err)
    }
    client := pb.NewWorkerControlClient(conn)

    // Verify the worker is healthy
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    health, err := client.Health(ctx, &pb.HealthRequest{})
    cancel()
    if err != nil {
        conn.Close()
        return fmt.Errorf("health check worker %d: %w", id, err)
    }
    if !health.Healthy {
        conn.Close()
        return fmt.Errorf("worker %d is not healthy", id)
    }

    worker := &ManagedWorker{
        ID:         id,
        GRPCSocket: grpcSocket,
        FDSocket:   fdSocket,
        GRPCConn:   conn,
        Client:     client,
        StartTime:  time.Now(),
        done:       make(chan struct{}),
    }

    p.mu.Lock()
    p.workers[id] = worker
    p.mu.Unlock()

    slog.Info("Connected to existing worker.", "id", id,
        "active_connections", health.ActiveConnections)
    return nil
}

Contrast with the monitoring goroutine in SpawnWorker:

    if err != nil {
        slog.Error("Worker process exited.", "id", id, "pid", worker.PID, "error", err)
    } else {
        slog.Info("Worker process exited cleanly.", "id", id, "pid", worker.PID)
    }
    close(worker.done)

    // Remove from pool
    p.mu.Lock()
    delete(p.workers, id)
    p.mu.Unlock()
}()

Consumers that block on done:

    select {
    case <-w.done:
    case <-deadline:
        slog.Warn("Worker shutdown timeout, killing.", "id", w.ID, "pid", w.PID)
        if w.Cmd.Process != nil {
            _ = w.Cmd.Process.Kill()
        }
    }
}

select {
case <-old.done:
case <-time.After(30 * time.Second):
    if old.Cmd.Process != nil {
        _ = old.Cmd.Process.Kill()
    }
}

  2. CancelQuery in the worker kills the entire session instead of cancelling just the running query. The RPC calls s.cancel(), which cancels the session context and tears down the connection. In standalone mode, Server.CancelQuery cancels only the in-flight query (via the activeQueries map keyed by BackendKey) while keeping the connection alive. A client issuing Ctrl+C during a long query in control-plane mode will lose its connection instead of just cancelling the query.

func (w *Worker) CancelQuery(_ context.Context, req *pb.CancelQueryRequest) (*pb.CancelQueryResponse, error) {
    w.sessionsMu.RLock()
    defer w.sessionsMu.RUnlock()
    for _, s := range w.sessions {
        if s.pid == req.BackendPid && s.secretKey == req.SecretKey {
            s.cancel()
            slog.Info("Query cancelled via gRPC.", "pid", req.BackendPid)
            return &pb.CancelQueryResponse{Cancelled: true}, nil
        }
    }
    return &pb.CancelQueryResponse{Cancelled: false}, nil
}
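
A sketch of the fix described in the follow-up commit: delegate to the session's underlying server so only the in-flight query is cancelled and the connection survives. The exact minServer.CancelQuery signature is an assumption:

    for _, s := range w.sessions {
        if s.pid == req.BackendPid && s.secretKey == req.SecretKey {
            // Cancel only the running query, matching standalone-mode
            // behavior; the session context (and connection) stay alive.
            s.minServer.CancelQuery(req.BackendPid, req.SecretKey)
            slog.Info("Query cancelled via gRPC.", "pid", req.BackendPid)
            return &pb.CancelQueryResponse{Cancelled: true}, nil
        }
    }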

🤖 Generated with Claude Code


ConnectExistingWorker never closed the done channel, causing
ShutdownAll and RollingUpdate to always hit their timeout for
handed-over workers. Add a health-check monitoring goroutine.

CancelQuery killed the entire session instead of just the running
query. Use the per-session minServer.CancelQuery() to cancel only
the in-flight query, matching standalone mode behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fuziontech fuziontech force-pushed the control-plane-architecture branch 2 times, most recently from 247e114 to 55830ca on February 6, 2026 03:57
- Add _ = prefix to all unchecked .Close() return values (errcheck)
- Remove unused nextWorker field from WorkerPool (unused)
- Remove unused activeQueriesMu field from Worker (unused)
- Remove unused loadExtensions/attachDuckLake method wrappers (unused)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fuziontech fuziontech force-pushed the control-plane-architecture branch from 55830ca to 9b4c408 on February 6, 2026 03:59
@fuziontech fuziontech merged commit d06cab3 into main Feb 6, 2026
10 checks passed
@fuziontech fuziontech deleted the control-plane-architecture branch February 6, 2026 05:42