feat(platform): cgroup-aware CPU/memory detection in detect_system_linux #365
yangsec888 wants to merge 1 commit into
Detect the effective CPU quota and memory limit from cgroup v2 or v1
files rather than always reporting host totals. Inside a container,
`sysconf(_SC_NPROCESSORS_ONLN)` and `sysinfo()` return the host's
numbers — which makes downstream consumers (e.g. cbm_default_worker_count)
over-provision workers, exhaust the cgroup's memory cap, and trigger
OOMKills.
The new Linux path:
1. Reads `<root>/cpu.max` (v2) or `<root>/cpu/cpu.cfs_{quota,period}_us`
(v1) and computes effective CPUs as ceil(quota/period).
2. Reads `<root>/memory.max` (v2) or
`<root>/memory/memory.limit_in_bytes` (v1) and treats "max" /
near-ULLONG_MAX as "unlimited".
3. Takes min(cgroup, host) for both, so a mis-mounted cgroup that
reports something larger than the host can't push us above true
hardware. Falls back cleanly to host when no cgroup files exist.
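
The CPU half of the steps above can be sketched as follows. This is a minimal illustration, not the PR's code: the `sketch_*` names are hypothetical, the input is the already-read text of a v2 `cpu.max` file, and treating malformed input as "no limit" is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not the PR's implementation): parse the text of a
 * cgroup v2 cpu.max file ("<quota> <period>" or "max <period>") and return
 * ceil(quota / period), or -1 when there is no limit. */
static int sketch_cpus_from_cpu_max(const char *text) {
    long long quota = 0;
    long long period = 0;

    if (strncmp(text, "max", 3) == 0) {
        return -1; /* unlimited: the caller falls back to the host count */
    }
    if (sscanf(text, "%lld %lld", &quota, &period) != 2 || quota <= 0 || period <= 0) {
        return -1; /* assumption: malformed input is treated like "no limit" */
    }
    /* Integer ceiling division, so 150000/100000 becomes 2 without floats. */
    return (int)((quota + period - 1) / period);
}

/* Step 3: never report more CPUs than the host has. */
static int sketch_effective_cpus(int cgroup_cpus, int host_cpus) {
    if (cgroup_cpus <= 0) {
        return host_cpus; /* no cgroup limit detected */
    }
    return cgroup_cpus < host_cpus ? cgroup_cpus : host_cpus;
}
```

With a quota of `"150000 100000"` on a 32-core host, the first helper yields 2 and the clamp leaves it at 2; a cgroup value of 64 on the same host would be clamped down to 32.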
Helpers are exposed via `src/foundation/system_info_internal.h` (an
internal-only header, alongside the existing pipeline_internal.h
precedent) so tests can drive them against a fake `/sys/fs/cgroup`
tree without depending on the runtime environment.
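
A fake cgroup tree of the kind described above might be built like this. It is a rough sketch under stated assumptions: the `sketch_*` helper names are hypothetical and are not the fixture helpers in `tests/test_platform.c`; only `mkdtemp`, `fopen`, `fputs`, and `fclose` are real library calls.

```c
#define _XOPEN_SOURCE 700 /* for mkdtemp() under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical fixture helper: write `contents` to <root>/<name>.
 * Returns 0 on success, -1 on failure. */
static int sketch_write_fixture(const char *root, const char *name, const char *contents) {
    char path[512];
    FILE *f;
    int n = snprintf(path, sizeof(path), "%s/%s", root, name);

    if (n < 0 || (size_t)n >= sizeof(path)) {
        return -1; /* path would not fit in the buffer */
    }
    f = fopen(path, "w");
    if (f == NULL) {
        return -1;
    }
    if (fputs(contents, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical: create a fresh directory that looks like a cgroup v2 root.
 * `tmpl` must end in "XXXXXX"; mkdtemp rewrites it in place with the real
 * directory name. Cleanup of a partially built tree is left to the caller. */
static int sketch_make_fake_cgroup_v2(char *tmpl, const char *cpu_max, const char *memory_max) {
    if (mkdtemp(tmpl) == NULL) {
        return -1;
    }
    if (sketch_write_fixture(tmpl, "cpu.max", cpu_max) != 0) {
        return -1;
    }
    return sketch_write_fixture(tmpl, "memory.max", memory_max);
}
```

A test would then pass the resulting directory as `<root>` to the detection helper under test and remove the files and directory afterwards.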
Adds 11 Linux-only tests covering:
- v2 cpu.max integer quota + ceil rounding + "max" unlimited
- v1 cfs_quota_us/cfs_period_us + -1 unlimited
- v2 memory.max integer + "max"
- v1 memory.limit_in_bytes + near-ULLONG_MAX unlimited sentinel
- Missing cgroup files (host-fallback path)
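
The memory cases in the list above could be handled along these lines. This is a sketch only: the function name is hypothetical, and the 2^62-byte cutoff used to recognise the "unlimited" sentinel is an assumption chosen for illustration, not the threshold the PR uses.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Assumption for this sketch: any limit of 2^62 bytes or more cannot be a
 * real cap and is treated as the v1 "unlimited" sentinel. */
#define SKETCH_UNLIMITED_CUTOFF (1ULL << 62)

/* Hypothetical helper: parse the text of memory.max (v2) or
 * memory.limit_in_bytes (v1). Returns the limit in bytes, or 0 when the
 * cgroup imposes no limit or the text is not a plain decimal number. */
static unsigned long long sketch_mem_limit_from_text(const char *text) {
    char *end = NULL;
    unsigned long long value;

    if (strncmp(text, "max", 3) == 0) {
        return 0; /* cgroup v2 spelling of "unlimited" */
    }
    if (text[0] < '0' || text[0] > '9') {
        return 0; /* reject signs, blanks, and empty input */
    }
    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno != 0 || end == text) {
        return 0; /* overflow or no digits consumed */
    }
    if (value >= SKETCH_UNLIMITED_CUTOFF) {
        return 0; /* huge sentinel value: treat as unlimited */
    }
    return value;
}
```

Returning 0 for "unlimited" mirrors the convention in the test list, where both `"max"` and the sentinel map to 0 and the caller then falls back to the host total.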
macOS and BSD detection are unchanged. Windows is unaffected.
Refs DeusData#363
Inside a container, sysconf(_SC_NPROCESSORS_ONLN) and sysinfo() report the host's CPU count and RAM, not the cgroup's effective quota, so cbm_default_worker_count over-provisions workers and the mmap budget can exceed the cgroup memory cap, driving OOMKills (#363).
detect_system_linux now reads the cgroup limits:
- cbm_detect_cgroup_cpus: cgroup v2 cpu.max, then v1 cpu.cfs_quota_us / cpu.cfs_period_us → ceil(quota/period); -1 when unlimited.
- cbm_detect_cgroup_mem: cgroup v2 memory.max, then v1 memory.limit_in_bytes; 0 when unlimited or the v1 near-ULLONG_MAX sentinel.
- Effective value is min(cgroup, host), guarding against mis-mounted cgroups that report more than the host.
Helpers read via a bounded read_small_file (fopen "re", capped fread) and are exposed through system_info_internal.h for unit tests that drive them against a fake cgroup tree. The test teardown uses opendir/unlink/rmdir (no shell spawn).
Distilled from #365 onto current main (unioned the test additions with the CBM_WORKERS tests from #364, and replaced the test-cleanup system("rm -rf") with a shell-free recursive remove). Closes #363 together with #364.
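
A bounded read of the kind this commit message describes (fopen "re", capped fread) might look roughly like the following. The signature and return convention here are assumptions for illustration; the real `read_small_file` may differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a bounded small-file reader. Reads at most
 * cap - 1 bytes from `path` into `buf`, NUL-terminates the result, and
 * returns the number of bytes read, or -1 on error. */
static long sketch_read_small_file(const char *path, char *buf, size_t cap) {
    FILE *f;
    size_t n;

    if (buf == NULL || cap == 0) {
        return -1;
    }
    /* "e" is a glibc/musl extension that opens the file with O_CLOEXEC,
     * so the descriptor is not inherited across exec. */
    f = fopen(path, "re");
    if (f == NULL) {
        return -1; /* a missing file is the normal "no cgroup here" case */
    }
    n = fread(buf, 1, cap - 1, f); /* capped: never reads past the buffer */
    if (ferror(f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

Capping the read keeps a corrupt or unexpectedly large file from overrunning the buffer; cgroup control files are a few dozen bytes at most, so a small fixed buffer is enough.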
Thank you, @yangsec888! 🙏 This is excellent, thorough work — the cgroup-awareness is exactly what containerized deployments need, and I appreciated the care in the details:
Two small things I adjusted while landing (both noted for transparency): the test additions were unioned with the Landed in a5a3d1d, crediting you as author. Together with #364 this closes #363. Verified locally: build clean, all 3,622 tests pass (cgroup tests are |
Summary
Closes #363 in conjunction with #364: this PR is the auto-detect half of cgroup-awareness; #364 is the `CBM_WORKERS` env-override escape hatch. They're independent and can land in either order.
Today `detect_system_linux()` reports host CPU count via `sysconf(_SC_NPROCESSORS_ONLN)` and host RAM via `sysinfo()`. Inside a container, neither reflects the cgroup's effective quota, so `cbm_default_worker_count` over-provisions workers and the SQLite mmap budget can exceed the cgroup memory cap; see #363 for the OOMKill story.
This PR makes the Linux detection path cgroup-aware:
- Reads `<root>/cpu.max` (v2) or `<root>/cpu/cpu.cfs_quota_us` + `cpu.cfs_period_us` (v1) and computes `ceil(quota / period)`. `"max"` and `-1` quotas mean "no limit, fall back to sysconf".
- Reads `<root>/memory.max` (v2) or `<root>/memory/memory.limit_in_bytes` (v1). `"max"` is unlimited; cgroup v1's near-`ULLONG_MAX` sentinel (`PAGE_COUNTER_MAX`) is also treated as unlimited.
- Takes `min(cgroup, host)` for both, so a mis-mounted cgroup that reports something bigger than the host can't push us above true hardware.
The helpers live in a new internal header `src/foundation/system_info_internal.h` (same pattern as the existing `pipeline_internal.h`) so tests can drive them against a fake `/sys/fs/cgroup` tree without depending on the runtime environment.
Why this matters
Symptoms in the downstream consumer (sast-ai-app) before the workaround landed:
- `limits.memory: 2Gi` on a 32-core node spawned 32 indexing workers.
- `OOMKilled` mid-index, watcher restart loop, 20Gi PVC grew anyway.
Operationally this was patched downstream by quadrupling pod memory and pre-creating PVCs at 4× the size CBM "thought" it needed. With this PR, the cgroup limits flow through `cbm_default_worker_count` naturally and the over-provisioning disappears.
Test plan
11 new Linux-only tests in `tests/test_platform.c`, each creating a fresh `mkdtemp` tree and exercising one detection path:
- `cgroup_v2_cpu_quota`: `cpu.max = "200000 100000"` → 2
- `cgroup_v2_cpu_quota_rounds_up`: `cpu.max = "150000 100000"` → ceil(1.5) = 2
- `cgroup_v2_cpu_unlimited`: `cpu.max = "max 100000"` → -1
- `cgroup_v1_cpu_quota`: cfs_quota_us/cfs_period_us = 200000/100000 → 2
- `cgroup_v1_cpu_unlimited`: cfs_quota_us = -1 → -1
- `cgroup_no_cpu_files`: empty tmp dir → -1
- `cgroup_v2_mem`: `memory.max = "2147483648"` → 2 GiB
- `cgroup_v2_mem_unlimited`: `memory.max = "max"` → 0
- `cgroup_v1_mem`: `memory.limit_in_bytes = "1073741824"` → 1 GiB
- `cgroup_v1_mem_unlimited_sentinel`: `memory.limit_in_bytes = "9223372036854775807"` → 0
- `cgroup_no_mem_files`: empty tmp dir → 0
Local verification
Verified on macOS (Apple Silicon, clang 17). The 11 Linux tests are `#ifdef __linux__`-guarded and skipped on macOS, so end-to-end Linux validation lives in upstream CI.
- `scripts/build.sh` (`-Wall -Wextra -Werror`): clean, 47s.
- `scripts/test.sh`: 3553 passed, 1 failed. The single failure is `search_code_multi_word` in `tests/test_mcp.c:694`, already failing on plain `upstream/main` at the same SHA, unrelated to this PR, and also showing up in recent nightly soak failures.
- `clang-format` + `cppcheck`: clean on all three touched files. (The remaining `clang-format` diff at `system_info.c:97/99` is in BSD code I didn't touch; it reflects an Apple clang-format 17 vs. CI clang-format disagreement that exists on `upstream/main` already.)
Files
- `src/foundation/system_info.c`: new cgroup helpers; `detect_system_linux` rewritten with the safety clamps. macOS/BSD/Windows paths untouched. (+117/-6)
- `src/foundation/system_info_internal.h`: new internal header declaring the cgroup helpers for tests. (+44)
- `tests/test_platform.c`: 11 new Linux-only tests + small tmp-dir/fixture helpers. (+179)
Relationship to #364
These two are deliberately independent:
- #364 (`CBM_WORKERS` env override) lets operators force a specific value when they want to leave additional headroom below the cgroup cap (or above it, on bare metal).
If both land, the precedence is: `CBM_WORKERS` env > cgroup auto-detect > host fallback, which matches the precedence shape we use for other `CBM_*` knobs.
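
The precedence order could be expressed as in the sketch below. Everything here is hypothetical: how #364 actually parses `CBM_WORKERS`, the 4096 upper bound, and the idea that the worker count equals the CPU count are all assumptions made to keep the example small.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical parser for a CBM_WORKERS-style value. Returns a positive
 * worker count, or 0 when the value is unset or not a sane integer. */
static int sketch_parse_workers_env(const char *value) {
    char *end = NULL;
    long n;

    if (value == NULL || *value == '\0') {
        return 0;
    }
    errno = 0;
    n = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0' || n <= 0 || n > 4096) {
        return 0; /* assumption: invalid overrides are ignored, not fatal */
    }
    return (int)n;
}

/* Precedence: env override > cgroup auto-detect > host fallback. */
static int sketch_pick_worker_count(const char *env_value, int cgroup_cpus, int host_cpus) {
    int forced = sketch_parse_workers_env(env_value);

    if (forced > 0) {
        return forced; /* the operator's explicit choice wins */
    }
    if (cgroup_cpus > 0 && cgroup_cpus < host_cpus) {
        return cgroup_cpus; /* cgroup quota, clamped by the host count */
    }
    return host_cpus > 0 ? host_cpus : 1; /* host fallback, at least 1 */
}
```

A caller would pass `getenv("CBM_WORKERS")` as the first argument; with no override, a 2-CPU cgroup on a 32-core host yields 2, and with no cgroup limit the host count of 32 is used.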