feat(spanner): add shared endpoint cooldowns for location-aware rerouting by rahul2393 · Pull Request #12845 · googleapis/google-cloud-java

rahul2393 · 2026-04-17T22:34:26Z

Summary

This PR improves Java Spanner's location-aware bypass routing when routed replicas are overloaded or unavailable, and extends score-based replica selection to strong (preferLeader=true) reads and queries.

The client now:

avoids recently overloaded routed endpoints using shared cooldowns
records RESOURCE_EXHAUSTED / UNAVAILABLE as EWMA error penalties
uses EWMA-based selection for both preferLeader=false and strong preferLeader=true read/query routing when operation_uid is available

It also keeps the location-aware read path lock-free via immutable group snapshots.

What changed

Added shared channel-level cooldown tracking for routed endpoints that return RESOURCE_EXHAUSTED / UNAVAILABLE, while still keeping request-scoped exclusions for same-logical-request retries.
Updated bypass retry behavior so eligible reads/queries can reroute to another replica instead of immediately returning to the same failed endpoint.
Recorded RESOURCE_EXHAUSTED / UNAVAILABLE as EWMA error penalties for routed replicas, so unhealthy endpoints are deprioritized even after the immediate retry/cooldown window.
Extended score-based routing to strong preferLeader=true read/query traffic when operation_uid is present, using leader preference as a bias instead of a hard override.
Kept preferLeader=true behavior unchanged for paths without operation_uid such as mutation/commit routing.
Refactored KeyRangeCache group state to immutable snapshots and removed per-group synchronization from the routing hot path.
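
The immutable-snapshot refactor in the last item can be pictured with a minimal sketch. It assumes an AtomicReference holder that readers load without locking and that writers replace wholesale. The names SnapshotGroup and Snapshot, and the use of String for tablets, are illustrative and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names; the PR's actual KeyRangeCache classes are not reproduced here.
final class SnapshotGroup {
  /** Immutable view of a group's state; never mutated after construction. */
  static final class Snapshot {
    final int leaderIndex;
    final List<String> tablets;

    Snapshot(int leaderIndex, List<String> tablets) {
      this.leaderIndex = leaderIndex;
      // Defensive copy so later changes to the caller's list cannot leak in.
      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
    }
  }

  private final AtomicReference<Snapshot> current =
      new AtomicReference<>(new Snapshot(-1, Collections.<String>emptyList()));

  /** Hot path: a single volatile read, no lock. */
  Snapshot read() {
    return current.get();
  }

  /** Writers build a complete new snapshot and publish it atomically. */
  void update(int leaderIndex, List<String> tablets) {
    current.set(new Snapshot(leaderIndex, tablets));
  }
}
```

A reader that calls read() once and works only with the returned object sees a consistent leader index and tablet list, even if update() runs concurrently.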


gemini-code-assist

Code Review

This pull request introduces an endpoint cooldown mechanism to handle RESOURCE_EXHAUSTED errors and refactors the KeyRangeCache to use immutable snapshots, replacing per-group locking to improve read performance. The new EndpointOverloadCooldownTracker manages short-lived cooldowns with exponential backoff and jitter, while KeyAwareChannel is updated to exclude endpoints on both RESOURCE_EXHAUSTED and UNAVAILABLE status codes. Feedback is provided to optimize the GroupSnapshot constructor by removing a redundant list copy.
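
As a rough illustration of a cooldown with exponential backoff and jitter, the sketch below doubles a ceiling on each consecutive failure and picks a random duration in the upper half of that ceiling. The class name CooldownSketch, the base and maximum durations, and the jitter range are all assumptions made for this example; the PR's EndpointOverloadCooldownTracker may differ in each of them.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; constants and names are assumptions, not the PR's values.
final class CooldownSketch {
  private static final long BASE_MILLIS = 100L;
  private static final long MAX_MILLIS = 5_000L;

  private static final class State {
    final int failures;
    final long untilMillis;

    State(int failures, long untilMillis) {
      this.failures = failures;
      this.untilMillis = untilMillis;
    }
  }

  private final ConcurrentHashMap<String, State> states = new ConcurrentHashMap<>();

  /** Records a failure and returns the length of the cooldown that was chosen. */
  long recordFailure(String address, long nowMillis) {
    State next =
        states.compute(
            address,
            (key, prev) -> {
              int failures = prev == null ? 1 : prev.failures + 1;
              // Ceiling doubles per consecutive failure, capped at MAX_MILLIS.
              long ceiling = Math.min(MAX_MILLIS, BASE_MILLIS << Math.min(failures - 1, 20));
              // Jitter: uniform in [ceiling / 2, ceiling].
              long cooldown =
                  ceiling / 2 + ThreadLocalRandom.current().nextLong(ceiling / 2 + 1);
              return new State(failures, nowMillis + cooldown);
            });
    return next.untilMillis - nowMillis;
  }

  /** True while the endpoint should be skipped by routing. */
  boolean isCoolingDown(String address, long nowMillis) {
    State state = states.get(address);
    return state != null && nowMillis < state.untilMillis;
  }

  /** A successful call resets the backoff for that endpoint. */
  void recordSuccess(String address) {
    states.remove(address);
  }
}
```

Passing the current time in as a parameter keeps the sketch deterministic to test; a production tracker would more likely read a monotonic clock itself.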

gemini-code-assist · 2026-04-17T22:36:21Z

+    private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
+      this.generation = generation;
+      this.leaderIndex = leaderIndex;
+      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
+    }


The GroupSnapshot constructor performs a redundant copy of the tablets list. Since the only caller (CachedGroup.update) already creates a new ArrayList, we can wrap it directly in an unmodifiable list to avoid unnecessary allocations.

Suggested change

-    private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
-      this.generation = generation;
-      this.leaderIndex = leaderIndex;
-      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
-    }
+    private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
+      this.generation = generation;
+      this.leaderIndex = leaderIndex;
+      this.tablets = Collections.unmodifiableList(tablets);
+    }
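
One point worth keeping in mind when weighing this suggestion: Collections.unmodifiableList returns a read-only view, not a copy, so dropping the copy is safe only while no caller keeps and mutates the list it passed in. The small demo below shows the difference; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative helper, not part of the PR.
final class UnmodifiableViewDemo {
  /** Wraps without copying: cheap, but the result is a live view of the source list. */
  static List<String> wrapOnly(List<String> source) {
    return Collections.unmodifiableList(source);
  }

  /** Copies, then wraps: one extra allocation, fully detached from the source list. */
  static List<String> copyAndWrap(List<String> source) {
    return Collections.unmodifiableList(new ArrayList<>(source));
  }
}
```

If the source list gains an element after both calls, the wrapOnly result reflects it and the copyAndWrap result does not.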

rahul2393 · 2026-04-19T21:23:06Z

/gemini review

gemini-code-assist

Code Review

This pull request implements latency-aware routing for Spanner endpoints by introducing a score-based replica selection mechanism using time-decayed EWMA. Key additions include registries for tracking endpoint latency and inflight requests, a cooldown tracker for overloaded endpoints, and updates to the KeyRangeCache to support score-aware selection. Feedback identifies several high-priority issues in the new static registries, including a memory leak in the latency tracker map due to accumulating operation identifiers, potential key collisions between different client instances sharing a JVM, and a race condition when updating inflight request counts. There is also a recommendation to reduce the maximum size of the request ID cache to prevent excessive memory consumption.
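
For readers unfamiliar with time-decayed EWMA, here is a minimal sketch of one common formulation, in which the weight given to a new sample grows with the time elapsed since the previous one: alpha = 1 - exp(-elapsed / tau). The class name DecayingEwma, the microsecond unit, and the fixed error penalty are assumptions for illustration; the PR's actual scoring formula is not shown in this conversation.

```java
// Hypothetical sketch of a time-decayed EWMA score; not the PR's implementation.
final class DecayingEwma {
  /** Assumed penalty sample, in microseconds, recorded for a failed call. */
  static final double ERROR_PENALTY_MICROS = 1_000_000.0;

  private final double tauNanos;
  private double value;
  private long lastNanos;
  private boolean initialized;

  DecayingEwma(double tauNanos) {
    this.tauNanos = tauNanos;
  }

  /** Folds a latency sample into the average, weighting it by elapsed time. */
  synchronized void record(double sampleMicros, long nowNanos) {
    if (!initialized) {
      value = sampleMicros;
      lastNanos = nowNanos;
      initialized = true;
      return;
    }
    long elapsed = Math.max(0L, nowNanos - lastNanos);
    // The longer since the last sample, the more the old value has decayed.
    double alpha = 1.0 - Math.exp(-elapsed / tauNanos);
    value += alpha * (sampleMicros - value);
    lastNanos = nowNanos;
  }

  /** Treats an error such as RESOURCE_EXHAUSTED or UNAVAILABLE as a very slow sample. */
  synchronized void recordError(long nowNanos) {
    record(ERROR_PENALTY_MICROS, nowNanos);
  }

  /** Lower is better when comparing replicas. */
  synchronized double score() {
    return value;
  }
}
```

With this formulation a sample that arrives at the same instant as the previous one has zero weight, which is one reason real implementations often combine the average with an inflight-request count.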

gemini-code-assist · 2026-04-19T21:26:25Z

+  private static final ConcurrentHashMap<TrackerKey, LatencyTracker> TRACKERS =
+      new ConcurrentHashMap<>();


The TRACKERS map is a static ConcurrentHashMap that is never cleared. Since it is keyed by operationUid (which is per-range/group and can change over time) and address, this will lead to a memory leak in long-running applications as old operationUid entries accumulate indefinitely. Consider using a Cache with a maximum size and/or time-based expiration (e.g., com.google.common.cache.Cache with expireAfterAccess) to bound memory usage.

References

For safety, use bounded resources with a defined maximum size to prevent resource exhaustion.
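
The comment recommends a Guava Cache with a size limit and expireAfterAccess. As a dependency-free stand-in that shows only the size bound (not time-based expiration), the sketch below uses a JDK LinkedHashMap in access order that evicts its least recently used entry. BoundedRegistry and lruMap are hypothetical names.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// JDK-only stand-in for a bounded cache; it does not expire entries by time.
final class BoundedRegistry {
  /** Returns a thread-safe map that keeps at most maxEntries, evicting the least recently used. */
  static <K, V> Map<K, V> lruMap(final int maxEntries) {
    return Collections.synchronizedMap(
        new LinkedHashMap<K, V>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
          }
        });
  }
}
```

A synchronized map serializes all access, so for a routing hot path a concurrent cache such as the one the reviewer names would likely be the better fit.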

gemini-code-assist · 2026-04-19T21:26:25Z

+  static final class TrackerKey {
+    private final long operationUid;
+    private final String address;


Using only operationUid and address as a key in a static registry poses a collision risk if multiple Spanner client instances (e.g., targeting different projects or instances) are used within the same JVM. While operationUid is intended to be unique, it is likely only scoped to a specific database or instance context. Consider including a more globally unique identifier, such as a databaseId or an instance-specific prefix, in the TrackerKey to ensure isolation between different clients.
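
A minimal sketch of the suggested fix follows, assuming the extra scope is a database identifier string. The field name databaseId and the class name ScopedTrackerKey are hypothetical; the PR's TrackerKey has only the two fields quoted above.

```java
import java.util.Objects;

// Hypothetical variant of TrackerKey with an added scope field.
final class ScopedTrackerKey {
  private final String databaseId; // assumed extra scope, not in the PR's key
  private final long operationUid;
  private final String address;

  ScopedTrackerKey(String databaseId, long operationUid, String address) {
    this.databaseId = Objects.requireNonNull(databaseId);
    this.operationUid = operationUid;
    this.address = Objects.requireNonNull(address);
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ScopedTrackerKey)) {
      return false;
    }
    ScopedTrackerKey that = (ScopedTrackerKey) other;
    return operationUid == that.operationUid
        && databaseId.equals(that.databaseId)
        && address.equals(that.address);
  }

  @Override
  public int hashCode() {
    return Objects.hash(databaseId, operationUid, address);
  }
}
```

Two clients that happen to share an operationUid and endpoint address then map to distinct entries as long as their database identifiers differ.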

gemini-code-assist · 2026-04-19T21:26:25Z

+    int updated = counter.decrementAndGet();
+    if (updated <= 0) {
+      INFLIGHT_REQUESTS.remove(address, counter);
+    }


There is a race condition in finishRequest. INFLIGHT_REQUESTS.remove(address, counter) can remove the AtomicInteger object even if another thread has just incremented it via beginRequest (since beginRequest uses computeIfAbsent which returns the existing object). This leads to lost inflight counts. Since the number of Spanner endpoints is relatively small and stable, it is safer to keep the AtomicInteger in the map indefinitely rather than attempting to remove it when it reaches zero.

Suggested change

-    int updated = counter.decrementAndGet();
-    if (updated <= 0) {
-      INFLIGHT_REQUESTS.remove(address, counter);
-    }
+    counter.decrementAndGet();
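
The approach the reviewer describes, keeping each counter in the map for the life of the process, can be sketched as below. The class name InflightCounters is hypothetical, and clamping the count at zero is an extra safeguard added in this sketch, not something the review asks for.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; counters are created on first use and never removed.
final class InflightCounters {
  private final ConcurrentHashMap<String, AtomicInteger> counts = new ConcurrentHashMap<>();

  void beginRequest(String address) {
    counts.computeIfAbsent(address, key -> new AtomicInteger()).incrementAndGet();
  }

  void finishRequest(String address) {
    AtomicInteger counter = counts.get(address);
    if (counter != null) {
      // No removal at zero, so a concurrent beginRequest always increments
      // the same AtomicInteger that readers see. Clamp to avoid negatives.
      counter.updateAndGet(value -> value > 0 ? value - 1 : 0);
    }
  }

  int inflight(String address) {
    AtomicInteger counter = counts.get(address);
    return counter == null ? 0 : counter.get();
  }
}
```

The trade-off is one small AtomicInteger retained per endpoint address ever seen, which the review argues is acceptable because the set of endpoints is small and stable.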

gemini-code-assist · 2026-04-19T21:26:25Z

+
+  private static final Cache<String, RoutingTarget> TARGETS =
+      CacheBuilder.newBuilder()
+          .maximumSize(100_000_000L)


The maximumSize of 100,000,000 for TARGETS is excessively large and could lead to significant memory consumption if the cache fills up. Since entries are typically removed upon call closure in HeaderInterceptor and have a 10-minute expiration, a much smaller limit (e.g., 1,000,000) should be more than sufficient for even the most high-throughput applications while providing better safety against memory exhaustion.

Suggested change

-          .maximumSize(100_000_000L)
+          .maximumSize(1_000_000L)

References

For safety, use a defined maximum size for bounded resources to prevent memory exhaustion.

rahul2393 added 3 commits April 17, 2026 21:58

feat(spanner): add shared endpoint cooldowns for location-aware rerou…

ea55529

…ting

retry unavailable errors on different replica

2598291

address comments

2463cbf

rahul2393 requested review from a team as code owners April 17, 2026 22:34

gemini-code-assist bot reviewed Apr 17, 2026


rahul2393 added 3 commits April 20, 2026 02:12

add EWMA support for stale_reads

7c4bb2e

support for strong reads

1b00d27

fix tests

da00277

gemini-code-assist bot reviewed Apr 19, 2026

rahul2393 wants to merge 6 commits into googleapis:main from rahul2393:endpoint-cooldown-re
