
fix: preallocate network #1571

Open
luke-lombardi wants to merge 2 commits into main from ll/network-prealloc

@luke-lombardi (Contributor) commented May 25, 2026

Summary by cubic

Preallocated container network slots per worker and added node-scoped IP pools with atomic Redis ownership to cut startup latency and stabilize IP assignment. Also batched scheduler requests and expanded startup/network telemetry for clearer bottleneck analysis.

  • New Features

    • Network preallocation with a slot pool per worker, configurable via worker.pools.*.networkPreallocation and networkSlotPoolSize (the default compute pool enables it with a slot pool size of 32).
    • Node-scoped network prefixes with Redis-backed atomic IP reservation (owner/refcounts) and a MoveContainerIp RPC.
    • Richer network timing with microsecond precision, a worker-network event source, and deeper network drilldowns in the startup report.
    • Scheduler processes request batches and breaks ties by free capacity to avoid overscheduling; adds a fast retry path.
    • Worker builds and ships goproc (no external download) with faster readiness probing; sandbox exec retries until ready.
    • Container start concurrency scales with worker CPU; compute pool default set to 32; runtime can update CPU quotas post-start when supported.
    • Defaults: compute pool CPU baseline increased to 2000m; event APIs auto-fill duration_us; stub ID inferred for stub-scoped container IDs.
  • Bug Fixes

    • Fixed a race when registering Prometheus counters under load.
    • More robust transient error detection for sandbox connections and gRPC retries.
    • IP assignment consistency: prevents duplicate or stolen addresses during rapid container lifecycle changes.
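
The owner/refcount reservation described above can be illustrated with a minimal sketch. This is not the PR's code: the PR keeps this state in Redis, while the sketch below uses an in-memory map guarded by a mutex as a stand-in for the atomic step, and every name in it (`ipPool`, `Reserve`, `Release`, `ErrIPOwned`) is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrIPOwned is returned when an address is already held by another owner.
// Hypothetical name, used only for this sketch.
var ErrIPOwned = errors.New("ip already owned by another container")

// lease records who holds an address and how many references they hold.
type lease struct {
	owner string
	refs  int
}

// ipPool is an in-memory stand-in for a Redis-backed pool. The mutex plays
// the role of the atomic check-and-set that the real store would provide.
type ipPool struct {
	mu     sync.Mutex
	leases map[string]*lease
}

func newIPPool() *ipPool {
	return &ipPool{leases: make(map[string]*lease)}
}

// Reserve claims ip for owner. A repeated claim by the same owner bumps the
// refcount; a claim by a different owner is rejected.
func (p *ipPool) Reserve(ip, owner string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.leases[ip]
	if !ok {
		p.leases[ip] = &lease{owner: owner, refs: 1}
		return nil
	}
	if l.owner != owner {
		return ErrIPOwned
	}
	l.refs++
	return nil
}

// Release drops one reference held by owner and frees the address when the
// count reaches zero. It reports false if owner does not hold the address.
func (p *ipPool) Release(ip, owner string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.leases[ip]
	if !ok || l.owner != owner {
		return false
	}
	l.refs--
	if l.refs == 0 {
		delete(p.leases, ip)
	}
	return true
}

func main() {
	pool := newIPPool()
	fmt.Println(pool.Reserve("10.0.0.2", "container-a")) // first claim
	fmt.Println(pool.Reserve("10.0.0.2", "container-b")) // different owner
	fmt.Println(pool.Release("10.0.0.2", "container-b")) // non-owner release
	fmt.Println(pool.Release("10.0.0.2", "container-a")) // owner release
}
```

Checking the owner on both reserve and release is what keeps one container from taking or freeing an address another container still holds. The refcount lets the same owner hold an address across overlapping lifecycle steps.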

Written for commit 898f509. Summary will update on new commits.

@cubic-dev-ai (Bot, Contributor) left a comment

6 issues found across 39 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="pkg/repository/worker_redis_test.go">

<violation number="1" location="pkg/repository/worker_redis_test.go:493">
P2: Custom agent: **Prevent Redundant Code Duplication**

Duplicated concurrent-reservation test harness in CPU and GPU test functions should be extracted into a shared helper.</violation>
</file>
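
One possible shape for the shared helper suggested in this violation is sketched below. It assumes the CPU and GPU tests differ only in the reservation call they make. The names `runConcurrent` and `capacity` are hypothetical and do not come from `worker_redis_test.go`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runConcurrent starts n goroutines, releases them together, and returns how
// many calls to reserve reported success. Hypothetical helper for this sketch.
func runConcurrent(n int, reserve func(worker int) bool) int {
	var wg sync.WaitGroup
	var successes atomic.Int64
	start := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			<-start // wait so that all goroutines contend at once
			if reserve(worker) {
				successes.Add(1)
			}
		}(i)
	}
	close(start)
	wg.Wait()
	return int(successes.Load())
}

// capacity is a toy resource standing in for a worker's free CPU or GPU.
type capacity struct {
	mu   sync.Mutex
	free int
}

// take reserves amount if enough capacity remains.
func (c *capacity) take(amount int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.free < amount {
		return false
	}
	c.free -= amount
	return true
}

func main() {
	cpu := &capacity{free: 4000}
	gpu := &capacity{free: 2}
	// Both cases reuse the same harness and differ only in the closure.
	fmt.Println(runConcurrent(10, func(int) bool { return cpu.take(1000) }))
	fmt.Println(runConcurrent(10, func(int) bool { return gpu.take(1) }))
}
```

With a helper of this kind, each test keeps only its own setup and its assertion on the number of successful reservations.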

<file name="pkg/worker/network.go">

<violation number="1" location="pkg/worker/network.go:308">
P2: Per-container lock entries are never removed from `containerLocks`, causing unbounded map growth as new container IDs are processed.</violation>
</file>
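
One common way to bound a per-key lock map is to reference-count each entry and delete it when the last holder or waiter is done. The sketch below shows that technique in isolation. It is not the code in `pkg/worker/network.go`, and the names `keyedLocks`, `lockEntry`, and `hammer` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// lockEntry is one per-key mutex plus the number of goroutines that hold it
// or are waiting for it.
type lockEntry struct {
	mu   sync.Mutex
	refs int
}

// keyedLocks hands out per-key locks and removes entries that are unused.
type keyedLocks struct {
	mu      sync.Mutex
	entries map[string]*lockEntry
}

func newKeyedLocks() *keyedLocks {
	return &keyedLocks{entries: make(map[string]*lockEntry)}
}

// Lock acquires the lock for key and returns the function that releases it.
// The refcount is raised before blocking, so an entry with waiters is never
// deleted from under them.
func (k *keyedLocks) Lock(key string) (unlock func()) {
	k.mu.Lock()
	e, ok := k.entries[key]
	if !ok {
		e = &lockEntry{}
		k.entries[key] = e
	}
	e.refs++
	k.mu.Unlock()

	e.mu.Lock()
	return func() {
		e.mu.Unlock()
		k.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(k.entries, key) // last user gone: drop the entry
		}
		k.mu.Unlock()
	}
}

// Len reports how many entries are currently tracked.
func (k *keyedLocks) Len() int {
	k.mu.Lock()
	defer k.mu.Unlock()
	return len(k.entries)
}

// hammer increments one counter per ID from many goroutines, each increment
// guarded by that ID's lock, and returns the final counts.
func hammer(k *keyedLocks, ids []string, perID int) []int {
	counts := make([]int, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		for j := 0; j < perID; j++ {
			wg.Add(1)
			go func(i int, id string) {
				defer wg.Done()
				unlock := k.Lock(id)
				defer unlock()
				counts[i]++
			}(i, id)
		}
	}
	wg.Wait()
	return counts
}

func main() {
	locks := newKeyedLocks()
	counts := hammer(locks, []string{"container-a", "container-b"}, 100)
	fmt.Println(counts, locks.Len())
}
```

The map then holds entries only for container IDs that are in use at that moment, at the cost of a short critical section on the outer mutex for every lock and unlock.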


Comment thread pkg/repository/worker_redis.go
Comment thread pkg/worker/network.go
Comment thread pkg/repository/worker_redis_test.go
Comment thread pkg/repository/usage/usage_prometheus.go
Comment thread pkg/scheduler/reserve.go
Comment thread pkg/worker/network.go
2 participants