Skip to content

route bake targets across pods with random loadbalance#3861

Open
areebahmeddd wants to merge 1 commit into
docker:masterfrom
areebahmeddd:fix/random-pods
Open

route bake targets across pods with random loadbalance#3861
areebahmeddd wants to merge 1 commit into
docker:masterfrom
areebahmeddd:fix/random-pods

Conversation

@areebahmeddd

Copy link
Copy Markdown
Contributor

loadbalance=random on the kubernetes driver was sending all bake targets to the same pod.
the client connection was cached after the first dial, so pod selection only happened once.

fix this by creating a new connection per target when loadbalance=random is enabled.
also disable the shared local mount session optimization in this mode.

Closes #3382

@AkihiroSuda

Copy link
Copy Markdown
Collaborator

loadbalance=random on the kubernetes driver was sending all bake targets to the same pod. the client connection was cached after the first dial, so pod selection only happened once.

fix this by creating a new connection per target when loadbalance=random is enabled.

What about loadbalance=sticky ?

@AkihiroSuda

Copy link
Copy Markdown
Collaborator

Also, please squash the commits

@areebahmeddd

areebahmeddd commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

What about loadbalance=sticky ?

no change. it reuses the same shared connection path

Also, please squash the commits

sure 🙂

With loadbalance=random, all bake targets were landing on the same pod.
The client connection was cached after the first dial, so pod selection
only happened once.

Fix by dialing a fresh connection per target when loadbalance=random.
Also skip the shared local-mount session optimisation in this mode since
targets connect to different pods and cannot share a session.

Fixes docker#3382

Signed-off-by: Areeb Ahmed <areebahmed0709@gmail.com>
@areebahmeddd

areebahmeddd commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

gentle ping! @AkihiroSuda

the ci failure is not related to our changes.

crazy-max
crazy-max previously approved these changes Jun 9, 2026
@crazy-max crazy-max dismissed their stale review June 9, 2026 14:11

second review

@crazy-max crazy-max left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at

func (pc *RandomPodChooser) ChoosePod(ctx context.Context) (*corev1.Pod, error) {
pods, err := ListRunningPods(ctx, pc.PodClient, pc.Deployment, pc.StatefulSet)
if err != nil {
return nil, err
}
if len(pods) == 0 {
return nil, errors.New("no running buildkit pods found")
}
randSource := pc.RandSource
if randSource == nil {
randSource = rand.NewSource(time.Now().Unix())
}
rnd := rand.New(randSource) // #nosec G404 -- no strong seeding required
n := rnd.Int() % len(pods)
logrus.Debugf("RandomPodChooser.ChoosePod(): len(pods)=%d, n=%d", len(pods), n)
return pods[n], nil
}

Seems to still looks capable of picking the same pod for all targets in one bake no? ChoosePod() creates a new RNG on every call and seeds it with time.Now().Unix(). With this PR, multiple targets can create fresh clients and dial within the same second, so each call can get the same seed and produce the same first random value iiuc.

Maybe we should seed the RNG once at the package level, with concurrency handled, or use another safe random selection path? Otherwise this PR may still reproduce the original issue.

Comment thread driver/manager.go
Comment on lines +131 to +143
func (d *DriverHandle) NewClient(ctx context.Context) (*client.Client, error) {
return d.Driver.Client(ctx, d.getClientOptions()...)
}

func (d *DriverHandle) NeedsNewClientPerBuild() bool {
type perBuildClientDriver interface {
NeedsNewClientPerBuild() bool
}
if p, ok := d.Driver.(perBuildClientDriver); ok {
return p.NeedsNewClientPerBuild()
}
return false
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually after new review this doesn't feel like the right shape for the driver API. NeedsNewClientPerBuild() is added as an optional hidden interface inside DriverHandle, but the behavior is broader than the name says. The build code also uses it to disable shared local mount sessions.

Can we make that contract explicit near the Driver interface instead of hiding it in manager.go? Something like an optional RequiresUncachedClient() interface would make the build-side behavior easier to reason about. I would also rename NewClient() to something like UncachedClient() so it is clear this is deliberately bypassing DriverHandle.Client() caching, not just another constructor.

@areebahmeddd

Copy link
Copy Markdown
Contributor Author

Hm, let me take a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bake targets are not distributed across buildkit pods

4 participants