Skip to content

feat(platform): CBM_WORKERS env override for cbm_default_worker_count#364

Closed
yangsec888 wants to merge 1 commit into
DeusData:mainfrom
yangsec888:feat/cbm-workers-env-override
Closed

feat(platform): CBM_WORKERS env override for cbm_default_worker_count#364
yangsec888 wants to merge 1 commit into
DeusData:mainfrom
yangsec888:feat/cbm-workers-env-override

Conversation

@yangsec888
Copy link
Copy Markdown
Contributor

What

Adds a single env-knob, CBM_WORKERS, that explicitly sets the
worker count returned by cbm_default_worker_count(). When unset,
behaviour is unchanged.

Why

In containerized deployments, cbm_default_worker_count(initial=true)
returns sysconf(_SC_NPROCESSORS_ONLN) (host-CPU count, not the
container's CPU quota). On a 1-vCPU pod scheduled onto a 16-core
node cbm spawns ~16 indexing workers, each with its own per-worker
buffers — the dominant OOMKill driver in container deployments
(see companion issue #363 for the broader cgroup-awareness ask).

A small env knob solves the immediate operator pain without taking
on the scope of full cgroup detection, and it follows the same
shape as the existing CBM_* env overrides: explicit override >
implicit detection.

Patch

src/foundation/system_info.c:

int cbm_default_worker_count(bool initial) {
    /* CBM_WORKERS env override (clamped to [1, CBM_WORKERS_MAX]).
     * Useful inside containers where sysconf(_SC_NPROCESSORS_ONLN)
     * reports host CPUs rather than the cgroup's effective CPU quota.
     * Same precedence shape as other CBM_* env overrides:
     * explicit override > implicit detection. */
    char buf[CBM_SZ_32];
    if (cbm_safe_getenv("CBM_WORKERS", buf, sizeof(buf), NULL) != NULL) {
        long n = strtol(buf, NULL, CBM_DECIMAL_BASE);
        if (n >= MIN_WORKERS && n <= CBM_WORKERS_MAX) {
            return (int)n;
        }
        cbm_log_warn("workers.env.invalid", "value", buf,
                     "fallback", "sysconf");
    }

    cbm_system_info_t info = cbm_system_info();
    // ... unchanged from today ...
}

CBM_WORKERS_MAX = 256 is added to the file-local enum; happy to
drop the upper clamp if you'd rather trust the operator, or move
the constant somewhere shared. cbm_safe_getenv, CBM_DECIMAL_BASE,
CBM_SZ_32, SKIP_ONE, MIN_WORKERS, and cbm_log_warn all
already exist (see src/foundation/platform.h, constants.h, and
log.h).

Tests

tests/test_platform.c gains three cases inside the existing
SUITE(platform):

  1. platform_default_workers_env_overrideCBM_WORKERS=4 is honored for both initial=true and initial=false.
  2. platform_default_workers_env_invalid0, -1, 9999, and not-a-number all fall back to the sysconf-derived default (and emit workers.env.invalid at WARN).
  3. platform_default_workers_env_unset — when unset, cbm_default_worker_count(true) == cbm_system_info().total_cores, preserving today's contract.

Docs

One-line addition to the Environment Variables table in
README.md:

CBM_WORKERS — Override the parallel-indexing worker count
returned by cbm_default_worker_count. Useful inside containers
where sysconf(_SC_NPROCESSORS_ONLN) reports host CPUs rather
than the cgroup's effective quota. Range 1–256; invalid values
are ignored with a warning.

Behaviour when unset

Identical to today. No deployment that doesn't set CBM_WORKERS
sees any difference.

Verification on my side

I have not run scripts/build.sh / scripts/test.sh /
scripts/lint.sh locally because the full ASan + UBSan setup
isn't installed on my machine. Happy to iterate quickly on any
CI feedback (lint, clang-tidy, cppcheck, format) — the patch is
small enough that a turnaround should be fast.

Related

Adds a single env-knob, CBM_WORKERS, that explicitly sets the worker
count returned by cbm_default_worker_count(). When unset, behaviour
is unchanged.

In containerized deployments, cbm_default_worker_count(initial=true)
returns sysconf(_SC_NPROCESSORS_ONLN) (host-CPU count, not the
container's CPU quota). On a 1-vCPU pod scheduled onto a 16-core node
cbm spawns ~16 indexing workers, each with its own per-worker buffers
— the dominant OOMKill driver in container deployments. The broader
ask (full cgroup awareness for cbm_system_info / mem.init budget
logs) lives in a sibling issue; this knob is the smaller, lower-risk
path that ships independently.

Same precedence shape as the existing CBM_* env overrides: explicit
override > implicit detection. Range clamped to [1, 256]; invalid
values are ignored and the function logs `workers.env.invalid` at
WARN before falling through to the sysconf-derived default.

Files touched:
- src/foundation/system_info.c — env probe + clamp + warn-and-fall-through
- tests/test_platform.c — 3 new TEST cases covering override, invalid
  fallback, and unset baseline
- README.md — CBM_WORKERS row in the Environment Variables table

Related: cgroup-aware detection issue (filed alongside this PR).
@DeusData
Copy link
Copy Markdown
Owner

Thank you, @yangsec888! 🙏 The CBM_WORKERS escape hatch is a clean, well-scoped fix for the container over-provisioning problem — cbm_safe_getenv + strtol + clamp to [1,256] with a warn-and-fall-back on invalid values is exactly the right shape, and it matches the precedence of the other CBM_* overrides. Your three tests (override / invalid / unset) cover it well.

I landed it as d952238, crediting you as the commit author (it sequences right before your cgroup-detection work). Verified locally: build clean, all tests pass and the workers.env.invalid warnings fire as expected. Thanks for tackling the container OOMKill story! 🙏

DeusData pushed a commit that referenced this pull request May 30, 2026
Adds a CBM_WORKERS env knob (clamped to [1, 256], invalid values warned
and ignored) that explicitly sets the parallel-indexing worker count.
In containers, sysconf(_SC_NPROCESSORS_ONLN) reports host CPUs rather
than the cgroup's effective quota, so a 1-vCPU pod on a 16-core node
otherwise spawns ~16 workers — the dominant OOMKill driver. When unset,
behaviour is unchanged.

From #364.
DeusData pushed a commit that referenced this pull request May 30, 2026
Inside a container, sysconf(_SC_NPROCESSORS_ONLN) and sysinfo() report
the host's CPU count and RAM, not the cgroup's effective quota — so
cbm_default_worker_count over-provisions workers and the mmap budget can
exceed the cgroup memory cap, driving OOMKills (#363).

detect_system_linux now reads the cgroup limits:
- cbm_detect_cgroup_cpus: cgroup v2 cpu.max, then v1 cpu.cfs_quota_us /
  cpu.cfs_period_us → ceil(quota/period); -1 when unlimited.
- cbm_detect_cgroup_mem: cgroup v2 memory.max, then v1
  memory.limit_in_bytes; 0 when unlimited or the v1 near-ULLONG_MAX
  sentinel.
- Effective value is min(cgroup, host), guarding against mis-mounted
  cgroups that report more than the host.

Helpers read via a bounded read_small_file (fopen "re", capped fread)
and are exposed through system_info_internal.h for unit tests that drive
them against a fake cgroup tree. The test teardown uses opendir/unlink/
rmdir (no shell spawn).

Distilled from #365 onto current main (unioned the test additions with
the CBM_WORKERS tests from #364, and replaced the test-cleanup
system("rm -rf") with a shell-free recursive remove). Closes #363
together with #364.
@DeusData DeusData closed this May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants