Fix CachedResource virtual-storage discovery on multi-replica/embedded-VW shards#4190
Conversation
|
/test pull-kcp-build-image |
|
/build-image |
3 similar comments
|
/build-image |
|
/build-image |
|
/build-image |
|
/build-image |
e0763e5 to
3b3206f
Compare
|
/build-image |
3b3206f to
4370ab9
Compare
| // The DynamicRESTMapper is per-process, in-memory state that THIS shard replica's own | ||
| // API server reads (the aggregating CRD-version discovery server and the virtual resources | ||
| // server) to resolve GroupKinds to resources. Both controllers only mutate that in-memory | ||
| // mapper and never write to storage, so — unlike the leader-elected controllers started by | ||
| // kcp-start-controllers — they MUST run on every replica. If they only ran on the leader, a | ||
| // non-leader replica would serve requests against an empty mapper, making built-in/system | ||
| // types such as cache.kcp.io/CachedResourceEndpointSlice unresolvable and breaking discovery | ||
| // and serving of virtual-storage (CachedResource-backed) resources on that replica. Start | ||
| // them from a post-start hook that is not gated by leader election. | ||
| return s.AddPostStartHook("kcp-start-dynamicrestmapper-controllers", func(hookContext genericapiserver.PostStartHookContext) error { | ||
| controllerCtx := klog.NewContext(ctx, klog.FromContext(ctx).WithValues("postStartHook", "kcp-start-dynamicrestmapper-controllers")) | ||
| for _, controller := range controllers { | ||
| go s.runController(controllerCtx, controller) | ||
| } | ||
| return nil | ||
| }) | ||
| } |
There was a problem hiding this comment.
Would it make sense to move the dynrest controller out to separate between authoritative and information controllers? If we have more controllers with the same error pattern it would make sense to separate the two.
There was a problem hiding this comment.
Agreed it would be cleaner to formalize the "authoritative (leader-elected) vs informational (per-replica, in-memory)" split once there is more than one such controller. To keep this PR scoped to the bug I left it as a single dedicated post-start hook for now; happy to file a follow-up issue for the abstraction (or fold it in here if you'd prefer).
There was a problem hiding this comment.
Now non-ai comment:
will do in follow-up :D
| return s.AddPostStartHook("kcp-start-dynamicrestmapper-controllers", func(hookContext genericapiserver.PostStartHookContext) error { | ||
| controllerCtx := klog.NewContext(ctx, klog.FromContext(ctx).WithValues("postStartHook", "kcp-start-dynamicrestmapper-controllers")) | ||
| for _, controller := range controllers { | ||
| go s.runController(controllerCtx, controller) |
There was a problem hiding this comment.
| go s.runController(controllerCtx, controller) | |
| go s.runController(hookContext, controller) |
There was a problem hiding this comment.
This was correct, no? Just the parent of controllerCtx was wrong and should be hookCtx.
There was a problem hiding this comment.
Done — controllerCtx now uses hookContext as its parent (and derives the logger from it), and we keep passing controllerCtx to runController, matching @gman0's point.
The DynamicRESTMapper is per-process, in-memory state that each shard replica's own API server reads — the aggregating CRD-version discovery server and the virtual resources server use it to resolve GroupKinds (e.g. cache.kcp.io/CachedResourceEndpointSlice) to resources. Its builtin and dynamic controllers were registered through the leader-elected kcp-start-controllers hook, so on a multi-replica shard only the leader populated its mapper. Non-leader replicas served API traffic against an empty mapper, so the REST mapping for system types failed there and virtual-storage (CachedResource-backed) resources became undiscoverable and unservable whenever a request landed on a non-leader replica — observed as intermittent "Some resources are temporarily unavailable" and "no matches for kind ... in group cache.kcp.io". Both controllers only mutate the in-memory mapper and never write to storage, so they are safe to run everywhere. Start them from a dedicated post-start hook that is not gated by leader election, guarded by a sync.Once so the wiring happens exactly once per process even though installControllers can run again on leadership re-acquisition.
The aggregating CRD-version discovery server and the virtual resources server forward
requests to the virtual workspace endpoint URLs advertised in CachedResourceEndpointSlices.
Each such URL is the serving shard's own external virtual-workspace URL
(Shard.Spec.VirtualWorkspaceURL), reached through the cluster ingress that routes by TLS SNI
directly to that shard — even when virtual workspaces run embedded in the shard.
vwClientConfig was only given a usable TLS config (shard client cert + VW CA) when virtual
workspaces were NOT embedded. With embedded virtual workspaces (the default) it kept the
loopback client config, whose ServerName is the loopback name and whose bearer token is only
valid on the loopback listener. The forwarded request's TLS SNI was therefore the loopback
name, the ingress could not route it, and the connection was reset ("failed to perform API
discovery: ... read: connection reset by peer"), so every virtual-storage (CachedResource-
backed) resource was dropped from discovery and could not be served.
When the shard is configured with a client certificate, authenticate the forward client with
it, leave ServerName empty so the SNI is derived from the target host, and drop the loopback
bearer token; fall back to the root CA (which signs the shard serving cert) when
--shard-virtual-workspace-ca-file is unset. Without a shard client certificate (e.g. an
in-process test server) the endpoint URL points back at this same server, so the loopback
config is kept as-is.
4370ab9 to
41a22b9
Compare
|
LGTM label has been added. DetailsGit tree hash: 5c73f0ab08327c24e6a1206bfaf5d9052a633632 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ntnn The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
CachedResource-backed (virtual-storage) resources were intermittently undiscoverable and unservable on a kcp shard running with multiple replicas and/or embedded virtual workspaces.
kubectl get <resource>returned:and the list/get failed. Whether it failed flapped depending on which shard replica served the request. There are two independent root causes.
1. DynamicRESTMapper controllers only ran on the leader replica
The builtin and dynamic REST mapper controllers populate a per-process, in-memory
DynamicRESTMapperthat every shard replica's own API server reads — the aggregating CRD-version discovery server and the virtual resources server use it to resolve GroupKinds (e.g.cache.kcp.io/CachedResourceEndpointSlice) to resources. They were registered under the leader-electedkcp-start-controllershook, so on a shard with more than one replica only the leader populated its mapper. Non-leader replicas served requests against an empty mapper, soRESTMappingfor system types failed (no matches for kind "CachedResourceEndpointSlice" in group "cache.kcp.io") and every virtual-storage resource was dropped from discovery there.Both controllers only mutate the in-memory mapper and never write to storage, so they are safe to run on every replica. They are now started from a dedicated post-start hook that is not gated by leader election, guarded by a
sync.Onceso a leadership re-acquisition does not re-wire them.2. Virtual workspace forward client used the loopback SNI/identity with embedded virtual workspaces
The aggregating discovery and virtual resources servers forward requests to the virtual workspace endpoint URLs advertised in
CachedResourceEndpointSlice. Each such URL is the serving shard's own external, SNI-routed virtual-workspace URL (Shard.Spec.VirtualWorkspaceURL, e.g.https://<shard>.example:8443/services/replication/...) — reached through the ingress/front-proxy, not the in-cluster Service address and not loopback, even when the virtual workspace runs embedded in that shard.vwClientConfigwas only given a usable TLS config (shard client cert + VW CA) when virtual workspaces were not embedded. With embedded virtual workspaces (the default), it kept the loopback client config, whoseServerNameis the loopback name and whose bearer token is only valid on the loopback listener. The forwarded request's TLS SNI was therefore the loopback name, the ingress could not SNI-route the connection, and it was reset:The forward client now always authenticates with the shard client certificate, drops the loopback bearer token, and leaves
ServerNameempty so the SNI is derived from the target host. When--shard-virtual-workspace-ca-fileis unset (typical with embedded virtual workspaces) it falls back to the root CA, which signs the shard serving certificate.What Type of PR Is This?
/kind bug
Related Issue(s)
Fixes #
Release Notes