[Enhancement] Make gRPC server keepalive parameters configurable in ProxyConfig #10510

@qianye1001

Motivation

When a TCP connection is broken without a RST (e.g., an intermediate network device silently drops packets, or the peer host crashes or loses power), the gRPC client has to rely on HTTP/2 PING (keepalive) to detect the dead connection. Until the keepalive mechanism detects the failure, every request on that connection fails, and each one only fails once its deadline expires.

Currently, the gRPC server in the Proxy does not configure permitKeepAliveTime or permitKeepAliveWithoutCalls, so the gRPC Netty server defaults apply:

  • permitKeepAliveTime = 5 minutes: clients must not send keepalive pings more often than once every 5 minutes; a client that keeps doing so is sent a GOAWAY (ENHANCE_YOUR_CALM, too_many_pings) and the connection is closed
  • permitKeepAliveWithoutCalls = false: keepalive pings on idle connections (no active RPCs) are not permitted and count against the client in the same way
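
To make the effect of permitKeepAliveTime concrete, here is a minimal sketch of the enforcement idea. It is a simplified model, not the grpc-java source: the class name, the `survives` helper, and the strike limit of 2 are assumptions chosen for illustration.

```java
import java.util.concurrent.TimeUnit;

/**
 * Simplified model of server-side keepalive enforcement (NOT the grpc-java code).
 * Assumption: a ping that arrives sooner than the permitted interval is a "strike",
 * and the connection receives a GOAWAY once strikes exceed maxStrikes.
 */
final class KeepAlivePolicyModel {
    private final long permitNanos;
    private final int maxStrikes;
    private long lastValidPingNanos;
    private int strikes;

    KeepAlivePolicyModel(long permitTime, TimeUnit unit, int maxStrikes, long startNanos) {
        this.permitNanos = unit.toNanos(permitTime);
        this.maxStrikes = maxStrikes;
        this.lastValidPingNanos = startNanos;
    }

    /** Returns true if the connection survives this ping, false if it would get a GOAWAY. */
    boolean onPing(long nowNanos) {
        if (nowNanos - lastValidPingNanos >= permitNanos) {
            lastValidPingNanos = nowNanos; // ping respected the permitted interval
            return true;
        }
        strikes++; // ping arrived too early
        return strikes <= maxStrikes;
    }

    /** Sends `pings` pings at a fixed interval and reports whether all were tolerated. */
    static boolean survives(long permitSeconds, long pingIntervalSeconds, int pings) {
        KeepAlivePolicyModel model =
            new KeepAlivePolicyModel(permitSeconds, TimeUnit.SECONDS, 2, 0L);
        for (int i = 1; i <= pings; i++) {
            if (!model.onPing(TimeUnit.SECONDS.toNanos(i * pingIntervalSeconds))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 30s client pings against the 5-minute default: the third early ping trips the limit.
        System.out.println(survives(300, 30, 10)); // false
        // 30s client pings against the proposed 10s permit time: every ping is valid.
        System.out.println(survives(10, 30, 10));  // true
    }
}
```

Under this model, a client cannot simply lower its keepAliveTime to 30s while the server keeps the 5-minute default, which is why the server-side setting has to change first.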

This causes two problems:

  1. Slow dead-connection detection: The rocketmq-clients Java SDK sets keepAliveTime = 300s (5 min) and keepAliveTimeout = 30s, so the worst-case detection time is 300s + 30s = 330s (5.5 minutes). During this window, all sends to the affected endpoint fail.
  2. Idle connection keepalive ineffective: Although rocketmq-clients sets keepAliveWithoutCalls(true), the server's default permitKeepAliveWithoutCalls = false does not permit these pings, so idle connections cannot be health-checked reliably.
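
The worst-case figure in item 1 can be reproduced with a small calculation. This is only a sketch: the class and method names are hypothetical, and it assumes the connection dies immediately after a successful ping, so the client waits a full keepAliveTime before the next ping and then a full keepAliveTimeout for the missing ACK.

```java
import java.time.Duration;

/** Hypothetical helper for the worst-case dead-connection detection window. */
final class DetectionWindow {
    /** Worst case: one full keepAliveTime of silence plus one full keepAliveTimeout. */
    static Duration worstCase(Duration keepAliveTime, Duration keepAliveTimeout) {
        return keepAliveTime.plus(keepAliveTimeout);
    }

    public static void main(String[] args) {
        // Current rocketmq-clients settings quoted in this issue: 300s and 30s.
        Duration current = worstCase(Duration.ofSeconds(300), Duration.ofSeconds(30));
        System.out.println(current.getSeconds() + "s"); // 330s
    }
}
```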

Proposed Changes

Add the following configurable parameters to ProxyConfig:

| Parameter | Default Value | Description |
| --- | --- | --- |
| grpcServerPermitKeepAliveTimeMillis | 10000 (10s) | Minimum time a client should wait before sending each keepalive ping |
| grpcServerPermitKeepAliveWithoutCalls | true | Whether to allow keepalive pings when there are no outstanding RPCs |
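
As a rough illustration, the two settings could be added to the config class as plain fields with accessors. The class below is a hypothetical stand-in (`ProxyConfigSketch`), not the real ProxyConfig; only the two field names, their defaults, and the getter names used later in this issue are taken from the proposal.

```java
/** Hypothetical excerpt showing only the two proposed settings. */
final class ProxyConfigSketch {
    // Minimum interval between client keepalive pings that the server tolerates.
    private long grpcServerPermitKeepAliveTimeMillis = 10_000L;
    // Whether clients may send keepalive pings while no RPC is in flight.
    private boolean grpcServerPermitKeepAliveWithoutCalls = true;

    public long getGrpcServerPermitKeepAliveTimeMillis() {
        return grpcServerPermitKeepAliveTimeMillis;
    }

    public void setGrpcServerPermitKeepAliveTimeMillis(long millis) {
        this.grpcServerPermitKeepAliveTimeMillis = millis;
    }

    public boolean isGrpcServerPermitKeepAliveWithoutCalls() {
        return grpcServerPermitKeepAliveWithoutCalls;
    }

    public void setGrpcServerPermitKeepAliveWithoutCalls(boolean permit) {
        this.grpcServerPermitKeepAliveWithoutCalls = permit;
    }
}
```

Operators who prefer the previous behavior could restore it by setting the values back to 300000 and false.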

Apply these in GrpcServerBuilder:

serverBuilder
    .permitKeepAliveTime(config.getGrpcServerPermitKeepAliveTimeMillis(), TimeUnit.MILLISECONDS)
    .permitKeepAliveWithoutCalls(config.isGrpcServerPermitKeepAliveWithoutCalls());

Impact

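  • Overhead is minimal: Each keepalive PING frame is only ~70 bytes on the wire (including TCP/IP headers). At a 30s interval that is 120 pings per hour, or roughly 8 KB/hour per connection in each direction (every PING is answered by a PING ACK frame of the same size).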
  • Backward compatible: Default values are more permissive than the current implicit defaults, but won't break existing clients. Clients with longer keepalive intervals (e.g., current 300s) will continue to work fine.
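  • Once the server permits shorter keepalive intervals, client-side improvements (reducing keepAliveTime from 300s to 30s) can bring the worst-case dead-connection detection time down from 5.5 minutes to about one minute with the current keepAliveTimeout of 30s (30s + 30s), or to ~40 seconds if keepAliveTimeout is also reduced to 10s.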
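
The bandwidth estimate above is simple arithmetic, sketched here for one direction only. The helper name is hypothetical, and the ~70-byte per-ping size is the estimate from this issue, not a measured value.

```java
/** Hypothetical helper for estimating keepalive traffic on one connection. */
final class KeepAliveOverhead {
    /** Bytes sent per hour in one direction for a given ping interval and per-ping size. */
    static long bytesPerHour(long pingIntervalSeconds, long bytesPerPing) {
        long pingsPerHour = 3600 / pingIntervalSeconds;
        return pingsPerHour * bytesPerPing;
    }

    public static void main(String[] args) {
        // 30s interval, ~70 bytes per ping (estimate from this issue).
        System.out.println(bytesPerHour(30, 70)); // 8400
    }
}
```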

Related Code

  • GrpcServerBuilder: proxy/src/main/java/org/apache/rocketmq/proxy/grpc/GrpcServerBuilder.java
  • ProxyConfig: proxy/src/main/java/org/apache/rocketmq/proxy/config/ProxyConfig.java
  • Client keepalive settings: keepAliveTime=300s, keepAliveTimeout=30s, keepAliveWithoutCalls=true in rocketmq-clients Java SDK (RpcClientImpl.java)
