Motivation
When a TCP connection breaks without a RST (e.g., an intermediate network device silently drops packets, or a process is killed with kill -9), the gRPC client must rely on HTTP/2 PING (keepalive) to detect the dead connection. Until keepalive detects the failure, every request on that connection hangs until its deadline expires and then fails.
Currently, the gRPC server in the Proxy does not configure permitKeepAliveTime or permitKeepAliveWithoutCalls, so the gRPC Netty server defaults apply:
- permitKeepAliveTime = 5 minutes: clients cannot send keepalive pings more frequently than every 5 minutes, or the server will send a GOAWAY
- permitKeepAliveWithoutCalls = false: keepalive pings on idle connections (no active RPCs) are rejected by the server
This causes two problems:
- Slow dead-connection detection: The rocketmq-clients Java SDK sets keepAliveTime = 300s (5 min), so worst-case detection time is 5 min + 30s (the keepAliveTimeout) = 5.5 minutes. During this window, all sends to the affected endpoint fail.
- Idle connection keepalive ineffective: Although rocketmq-clients sets keepAliveWithoutCalls(true), the server's default permitKeepAliveWithoutCalls = false silently rejects these pings, making idle connection health detection impossible.
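The 5.5-minute figure comes from adding the client's ping interval to its ping timeout: in the worst case the connection dies right after a successful ping, so the client waits a full keepAliveTime before the next PING and then keepAliveTimeout for the missing ACK. A minimal sketch of that arithmetic follows; the class and method names are illustrative and not part of rocketmq-clients or gRPC.

```java
import java.time.Duration;

/** Illustrative helper: worst-case time to notice a silently dropped connection. */
final class KeepAliveDetection {

    /**
     * Worst case: the connection dies just after a successful ping, so the client
     * idles for keepAliveTime, sends a PING, and gives up after keepAliveTimeout.
     */
    static Duration worstCase(Duration keepAliveTime, Duration keepAliveTimeout) {
        return keepAliveTime.plus(keepAliveTimeout);
    }

    public static void main(String[] args) {
        // Current client settings quoted in this issue: 300s interval, 30s timeout.
        Duration current = worstCase(Duration.ofSeconds(300), Duration.ofSeconds(30));
        System.out.println("worst case: " + current.getSeconds() + "s"); // prints "worst case: 330s"
    }
}
```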
Proposed Changes
Add the following configurable parameters to ProxyConfig:
| Parameter | Default Value | Description |
| --- | --- | --- |
| grpcServerPermitKeepAliveTimeMillis | 10000 (10s) | Minimum time a client should wait before sending each keepalive ping |
| grpcServerPermitKeepAliveWithoutCalls | true | Whether to allow keepalive pings when there are no outstanding RPCs |
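As a rough sketch, the two parameters could be carried as plain fields with accessors. The class below is a hypothetical stand-in, not the real ProxyConfig (which has many more settings); the field names and defaults come from the table, and the accessor names are chosen to match the calls used in the GrpcServerBuilder snippet.

```java
/** Hypothetical excerpt standing in for ProxyConfig; only the two proposed fields are shown. */
final class ProxyConfigSketch {
    // Minimum interval, in milliseconds, the server permits between client keepalive pings.
    private long grpcServerPermitKeepAliveTimeMillis = 10_000L;

    // Whether keepalive pings are allowed while no RPCs are outstanding.
    private boolean grpcServerPermitKeepAliveWithoutCalls = true;

    public long getGrpcServerPermitKeepAliveTimeMillis() {
        return grpcServerPermitKeepAliveTimeMillis;
    }

    public void setGrpcServerPermitKeepAliveTimeMillis(long millis) {
        this.grpcServerPermitKeepAliveTimeMillis = millis;
    }

    public boolean isGrpcServerPermitKeepAliveWithoutCalls() {
        return grpcServerPermitKeepAliveWithoutCalls;
    }

    public void setGrpcServerPermitKeepAliveWithoutCalls(boolean permit) {
        this.grpcServerPermitKeepAliveWithoutCalls = permit;
    }
}
```

Keeping the interval in milliseconds matches the TimeUnit.MILLISECONDS argument passed to permitKeepAliveTime.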
Apply these in GrpcServerBuilder:
    serverBuilder
        .permitKeepAliveTime(config.getGrpcServerPermitKeepAliveTimeMillis(), TimeUnit.MILLISECONDS)
        .permitKeepAliveWithoutCalls(config.isGrpcServerPermitKeepAliveWithoutCalls());
Impact
- Overhead is minimal: Each keepalive PING frame is only ~70 bytes (including TCP/IP headers). At a 30s interval, that is 120 pings per hour, or ~8KB/hour per connection.
- Backward compatible: Default values are more permissive than the current implicit defaults, but won't break existing clients. Clients with longer keepalive intervals (e.g., current 300s) will continue to work fine.
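- Once the server permits shorter keepalive intervals, client-side improvements (reducing keepAliveTime from 300s to 30s) can bring dead-connection detection time from 5.5 minutes down to ~40 seconds. Note that 30s plus the current 30s keepAliveTimeout gives 60s, so the ~40s figure presumably assumes the timeout is also shortened to about 10s.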
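The bandwidth estimate in this list can be checked with simple arithmetic. The sketch below uses the ~70-byte frame size and 30s interval quoted in this section and counts only the client-to-server PING; the ACK in the opposite direction would roughly double the total. The class and method names are illustrative.

```java
/** Illustrative helper: one-directional keepalive traffic per connection. */
final class KeepAliveOverhead {

    /** Bytes sent per hour when one ping of bytesPerPing is sent every pingIntervalSeconds. */
    static long bytesPerHour(long pingIntervalSeconds, long bytesPerPing) {
        long pingsPerHour = 3600L / pingIntervalSeconds;
        return pingsPerHour * bytesPerPing;
    }

    public static void main(String[] args) {
        // 30s interval and ~70 bytes per PING, as quoted in the issue.
        System.out.println(bytesPerHour(30, 70) + " bytes/hour"); // prints "8400 bytes/hour"
    }
}
```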
Related Code
- GrpcServerBuilder: proxy/src/main/java/org/apache/rocketmq/proxy/grpc/GrpcServerBuilder.java
- ProxyConfig: proxy/src/main/java/org/apache/rocketmq/proxy/config/ProxyConfig.java
- Client keepalive settings: keepAliveTime=300s, keepAliveTimeout=30s, keepAliveWithoutCalls=true in rocketmq-clients Java SDK (RpcClientImpl.java)