fix: raise RLIMIT_NOFILE soft limit at startup#422
Conversation
98abc4a to
1f2611d
Compare
🤖 Kimi Code ReviewThe code is generally correct and follows the expected pattern for raising file descriptor limits in systems programming. This is a straightforward port of well-tested code from Parity/ethrex. Minor issues:
Correctness acknowledgments:
Security/Performance: Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code ReviewNo findings. The PR is operational only; it does not touch fork choice, attestation handling, justification/finalization, state transition, XMSS, or SSZ paths. The startup change in main.rs is consistent with the existing RocksDB configuration in rocksdb.rs: since the backend already uses Residual risk is low and mostly around validation coverage, not design. I could not complete Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewHere is the review output: PR #422 —
|
Greptile SummaryThis PR ports the
Confidence Score: 4/5Safe to merge — failure paths are non-fatal and the fix directly addresses the documented production panic. The change is a narrow, well-understood system call sequence ported from a proven sibling codebase. The Linux path has a silent degradation when the hard limit is RLIM_INFINITY (setrlimit returns EINVAL, only a warning is logged), and the LimitRaised variant fires even when no actual change was made — both are cosmetic/edge-case issues that do not affect the primary fix for the container ulimit-1024 panic. bin/ethlambda/src/fd_limit.rs — Linux branch edge case with RLIM_INFINITY hard limits and the always-fires LimitRaised variant.
|
| Filename | Overview |
|---|---|
| bin/ethlambda/src/fd_limit.rs | New module ported from ethrex that raises RLIMIT_NOFILE soft limit to the hard limit on Linux, and to min(kern.maxfilesperproc, hard) on macOS/iOS; no-op elsewhere. The Linux path does not guard against RLIM_INFINITY hard limits, which causes setrlimit to fail silently (only a warn is emitted). |
| bin/ethlambda/src/main.rs | Calls raise_fd_limit() right after the startup banner, logs from→to on success, and only warns on failure — safe and non-fatal. |
| Cargo.toml | Adds libc = "0.2" to workspace dependencies; straightforward. |
| bin/ethlambda/Cargo.toml | Wires libc.workspace = true into the binary crate; thiserror was already present so all derives compile. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[main: startup banner] --> B[fd_limit::raise_fd_limit]
B --> C{Platform?}
C -- Linux --> D[getrlimit RLIMIT_NOFILE]
C -- macOS/iOS --> E[sysctl kern.maxfilesperproc]
C -- Other --> F[return Unsupported]
D --> G[rlim_cur = rlim_max]
E --> H[getrlimit RLIMIT_NOFILE]
H --> I["rlim_cur = min(maxfilesperproc, rlim_max)"]
G --> J[setrlimit]
I --> J
J -- success --> K["Ok(LimitRaised { from, to })"]
J -- error --> L["Err(FailedToSetLimit)"]
K --> M["info! Raised RLIMIT_NOFILE from=X to=Y"]
L --> N["warn! Failed to raise RLIMIT_NOFILE"]
F --> O["info! Not supported on this platform"]
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
bin/ethlambda/src/fd_limit.rs:144
**RLIM_INFINITY hard limit causes setrlimit to fail**
When the process hard limit for `RLIMIT_NOFILE` is `RLIM_INFINITY` (e.g., `ulimit -H -n unlimited`), `rlim_cur` is set to `RLIM_INFINITY` (`u64::MAX`). Linux rejects this with `EINVAL` because `RLIMIT_NOFILE` is silently capped by `/proc/sys/fs/nr_open` (default 1048576). The call lands in the `Err` branch in `main.rs` and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against `kern.maxfilesperproc`; the same approach — reading `/proc/sys/fs/nr_open` and using `min(rlim_max, nr_open)` — would make the Linux path robust in this scenario.
### Issue 2 of 2
bin/ethlambda/src/fd_limit.rs:155-158
**`LimitRaised` fires even when the limit is unchanged**
If the soft limit is already equal to the hard limit, the code calls `setrlimit` with an identical value (which succeeds), and returns `LimitRaised { from: N, to: N }`. The caller in `main.rs` then logs `"Raised RLIMIT_NOFILE soft limit from=N to=N"`, which may confuse operators inspecting startup logs. Adding a short-circuit before `setrlimit` — returning `Unsupported` or a dedicated `AlreadyAtMax` variant when `old_value == rlim.rlim_cur` — would make the log accurate.
Reviews (1): Last reviewed commit: "fix: raise RLIMIT_NOFILE soft limit at s..." | Re-trigger Greptile
| let old_value = rlim.rlim_cur; | ||
|
|
||
| // Set soft limit to hard limit | ||
| rlim.rlim_cur = rlim.rlim_max; |
There was a problem hiding this comment.
RLIM_INFINITY hard limit causes setrlimit to fail
When the process hard limit for RLIMIT_NOFILE is RLIM_INFINITY (e.g., ulimit -H -n unlimited), rlim_cur is set to RLIM_INFINITY (u64::MAX). Linux rejects this with EINVAL because RLIMIT_NOFILE is silently capped by /proc/sys/fs/nr_open (default 1048576). The call lands in the Err branch in main.rs and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against kern.maxfilesperproc; the same approach — reading /proc/sys/fs/nr_open and using min(rlim_max, nr_open) — would make the Linux path robust in this scenario.
Prompt To Fix With AI
This is a comment left during a code review.
Path: bin/ethlambda/src/fd_limit.rs
Line: 144
Comment:
**RLIM_INFINITY hard limit causes setrlimit to fail**
When the process hard limit for `RLIMIT_NOFILE` is `RLIM_INFINITY` (e.g., `ulimit -H -n unlimited`), `rlim_cur` is set to `RLIM_INFINITY` (`u64::MAX`). Linux rejects this with `EINVAL` because `RLIMIT_NOFILE` is silently capped by `/proc/sys/fs/nr_open` (default 1048576). The call lands in the `Err` branch in `main.rs` and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against `kern.maxfilesperproc`; the same approach — reading `/proc/sys/fs/nr_open` and using `min(rlim_max, nr_open)` — would make the Linux path robust in this scenario.
How can I resolve this? If you propose a fix, please make it concise.| Ok(Outcome::LimitRaised { | ||
| from: old_value.into(), | ||
| to: rlim.rlim_cur.into(), | ||
| }) |
There was a problem hiding this comment.
LimitRaised fires even when the limit is unchanged
If the soft limit is already equal to the hard limit, the code calls setrlimit with an identical value (which succeeds), and returns LimitRaised { from: N, to: N }. The caller in main.rs then logs "Raised RLIMIT_NOFILE soft limit from=N to=N", which may confuse operators inspecting startup logs. Adding a short-circuit before setrlimit — returning Unsupported or a dedicated AlreadyAtMax variant when old_value == rlim.rlim_cur — would make the log accurate.
Prompt To Fix With AI
This is a comment left during a code review.
Path: bin/ethlambda/src/fd_limit.rs
Line: 155-158
Comment:
**`LimitRaised` fires even when the limit is unchanged**
If the soft limit is already equal to the hard limit, the code calls `setrlimit` with an identical value (which succeeds), and returns `LimitRaised { from: N, to: N }`. The caller in `main.rs` then logs `"Raised RLIMIT_NOFILE soft limit from=N to=N"`, which may confuse operators inspecting startup logs. Adding a short-circuit before `setrlimit` — returning `Unsupported` or a dedicated `AlreadyAtMax` variant when `old_value == rlim.rlim_cur` — would make the log accurate.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
1f2611d to
f80aba1
Compare
The storage layer opens RocksDB with `set_max_open_files(-1)`, which
keeps an unbounded number of SST files in the table cache. On
containerized hosts the soft `ulimit -n` is 1024 by default, so on a
long-running devnet the table cache eventually grows past that and the
next `open(2)` returns `EMFILE`. RocksDB surfaces this as
`commit: IO error: ... Too many open files`, the `.expect("commit")`
in `Store` panics, the BlockChain actor dies, and the node stops
making progress while the container still reports "Up".
Mirror ethrex's approach: at startup, bump the soft `RLIMIT_NOFILE`
to the hard limit (Linux) or `kern.maxfilesperproc` (macOS). The
ported file is `fd_limit.rs`, originally from Parity Tech via ethrex
under Apache-2.0; copyright header is preserved.
Observed in a 4-day 16-node Kurtosis devnet where every node panicked
within a few minutes of each other once the table cache grew past
1024 FDs.
f80aba1 to
0ca9d2a
Compare
Summary
Bump the soft open-file-descriptor limit to the hard limit at process start, mirroring ethrex's
raise_fd_limit.Why
The storage layer opens RocksDB with
set_max_open_files(-1)(crates/storage/src/backend/rocksdb.rs:38), so the table cache holds every SST file open. On containerized hosts the softulimit -nis 1024 by default. After a few days the cache exceeds that, the nextopen(2)returnsEMFILE, RocksDB surfaces it asand the
.expect("commit")inStore(crates/storage/src/store.rs:686and:1046) panics on the tokio worker.spawned_concurrencylogsPanic in message handler, the BlockChain actor stops, P2P keeps receiving messages it can't forward (Failed to forward ... err=Actor stopped), and the container still reportsUpwhile the chain stalls.Observed in a 4-day, 16-node Kurtosis devnet where every node panicked within minutes of each other once the cache grew past 1024 FDs:
ethrex uses the same RocksDB
-1policy and never hits this because it callsraise_fd_limit()fromcmd/ethrex/initializers.rs:543at startup. This PR ports the same module to ethlambda.Changes
bin/ethlambda/src/fd_limit.rs— new, ported from ethrex (Apache-2.0, copyright header preserved). Linux:setrlimit(NOFILE, soft = hard). macOS/iOS: bump tomin(kern.maxfilesperproc, hard). Other platforms: no-op.bin/ethlambda/src/main.rs— callfd_limit::raise_fd_limit()right after the startup banner, log the from→to values, warn on failure instead of aborting.Cargo.toml,bin/ethlambda/Cargo.toml— addlibcdep.Test plan
cargo build --bin ethlambdacargo clippy --bin ethlambda -- -D warningscargo fmt --all -- --checkRaised RLIMIT_NOFILE soft limit from=1024 to=524288appearsToo many open filesFollow-ups (not in this PR)
.expect("commit")calls inStoreare still load-bearing for any transient IO error — worth converting toResultand letting the actor retry/escalate cleanly rather than panic.set_max_open_files(N)would also work but trades cache-miss latency for FD safety; raising the limit is the change with no perf cost.