fix: raise RLIMIT_NOFILE soft limit at startup by MegaRedHand · Pull Request #422 · lambdaclass/ethlambda

MegaRedHand · 2026-06-08T14:54:28Z

Summary

Bump the soft open-file-descriptor limit to the hard limit at process start, mirroring ethrex's raise_fd_limit.

Why

The storage layer opens RocksDB with set_max_open_files(-1) (crates/storage/src/backend/rocksdb.rs:38), so the table cache holds every SST file open. On containerized hosts the soft ulimit -n is 1024 by default. After a few days the cache exceeds that, the next open(2) returns EMFILE, RocksDB surfaces it as

commit: Error { message: "IO error: While open a file for random read: /data/157678.sst: Too many open files" }

and the .expect("commit") in Store (crates/storage/src/store.rs:686 and :1046) panics on the tokio worker. spawned_concurrency logs Panic in message handler, the BlockChain actor stops, P2P keeps receiving messages it can't forward (Failed to forward ... err=Actor stopped), and the container still reports Up while the chain stalls.

Observed in a 4-day, 16-node Kurtosis devnet where every node panicked within minutes of each other once the cache grew past 1024 FDs:

node-0 : panicked at crates/storage/src/store.rs:686:24
node-1 : panicked at crates/storage/src/store.rs:686:24
...
node-15: panicked at crates/storage/src/store.rs:686:24

ethrex uses the same RocksDB -1 policy and never hits this because it calls raise_fd_limit() from cmd/ethrex/initializers.rs:543 at startup. This PR ports the same module to ethlambda.

Changes

bin/ethlambda/src/fd_limit.rs — new, ported from ethrex (Apache-2.0, copyright header preserved). Linux: setrlimit(NOFILE, soft = hard). macOS/iOS: bump to min(kern.maxfilesperproc, hard). Other platforms: no-op.
bin/ethlambda/src/main.rs — call fd_limit::raise_fd_limit() right after the startup banner, log the from→to values, warn on failure instead of aborting.
Cargo.toml, bin/ethlambda/Cargo.toml — add libc dep.

Test plan

cargo build --bin ethlambda
cargo clippy --bin ethlambda -- -D warnings
cargo fmt --all -- --check
Boot a node and confirm the log line Raised RLIMIT_NOFILE soft limit from=1024 to=524288 appears
Run a multi-day devnet without nodes panicking on Too many open files

Follow-ups (not in this PR)

The .expect("commit") calls in Store are still load-bearing for any transient IO error — worth converting to Result and letting the actor retry/escalate cleanly rather than panic.
A bounded set_max_open_files(N) would also work but trades cache-miss latency for FD safety; raising the limit is the change with no perf cost.

github-actions · 2026-06-08T14:55:34Z

🤖 Kimi Code Review

The code is generally correct and follows the expected pattern for raising file descriptor limits in systems programming. This is a straightforward port of well-tested code from Parity/ethrex.

Minor issues:

Missing SAFETY comments (fd_limit.rs, lines 56, 117, 120, 129):
The unsafe blocks lack // SAFETY: documentation explaining why the raw pointer operations and libc calls are sound. While the usage is correct (properly sized buffers, valid pointers), Rust best practice requires documenting the safety invariants, e.g.:
```
// SAFETY: `rlim` is a valid, properly aligned stack-allocated struct.
if libc::getrlimit(libc::RLIMIT_NOFILE, &mut rlim) != 0 {
```
Redundant clippy allow (fd_limit.rs, line 24):
#[allow(clippy::enum_variant_names)] appears unnecessary—the variants don't end with "Error" (they use "FailedTo..." prefixes), so this is likely copy-paste residue.
macOS integer conversion edge case (fd_limit.rs, line 70):
maxfiles is libc::c_int (signed). While sysctl should never return negative values, casting a negative c_int to libc::rlim_t (unsigned) would produce a massive limit value. Consider adding debug_assert!(maxfiles > 0) after the sysctl call for defense-in-depth.

Correctness acknowledgments:

The platform-specific #[cfg] gates are correct: macOS/iOS use KERN_MAXFILESPERPROC sysctl, Linux uses hard limit.
Error handling properly uses io::Error::last_os_error() after checking syscall return values.
The Outcome/Error enums properly abstract platform differences.
main.rs correctly treats fd limit failures as non-fatal (warning only), which is appropriate for a consensus client that should attempt to start even with constrained resources.
Integer conversions from rlim_t to u64 use .into(), which is safe (widening conversion on all platforms).

Security/Performance:
No issues. The code executes once at startup, uses stack-allocated buffers, and doesn't introduce new attack surfaces. The RocksDB justification comment in main.rs (lines 147-149) accurately describes the motivation—unbounded table cache with default ulimits causes EMFILE panics.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-06-08T14:56:12Z

🤖 Codex Code Review

No findings.

The PR is operational only; it does not touch fork choice, attestation handling, justification/finalization, state transition, XMSS, or SSZ paths. The startup change in main.rs is consistent with the existing RocksDB configuration in rocksdb.rs: since the backend already uses set_max_open_files(-1), raising RLIMIT_NOFILE at boot is the right mitigation for EMFILE failures on low-ulimit hosts. The fd_limit helper in fd_limit.rs handles failures by logging and continuing, which is the correct failure mode for availability.

Residual risk is low and mostly around validation coverage, not design. I could not complete cargo check in this environment because the pinned toolchain/dependency flow needs writable rustup/cargo state and a git fetch for leanVM, both blocked here. The macOS-specific branch in fd_limit.rs is therefore unverified in practice from this sandbox.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

github-actions · 2026-06-08T14:56:22Z

🤖 Claude Code Review

Here is the review output:

PR #422 — `fix: raise RLIMIT_NOFILE soft limit at startup`

Author: Tomás Grüner | +186 / -0 | 4 files changed

Overview

Ports raise_fd_limit() from ethrex to prevent the node from panicking with EMFILE after a few days of uptime. The root cause is set_max_open_files(-1) in crates/storage/src/backend/rocksdb.rs:38 keeping every SST file open indefinitely against a container default soft limit of 1024. The fix is correct, well-motivated, and focused.

Correctness

RLIM_INFINITY on Linux (minor)

bin/ethlambda/src/fd_limit.rs:144:

rlim.rlim_cur = rlim.rlim_max;

If the hard limit is RLIM_INFINITY (e.g. a container launched with --ulimit nofile=-1:-1), this sets rlim_cur to u64::MAX. setrlimit will then either:

Succeed (with CAP_SYS_RESOURCE) → the to field logged is 18446744073709551615, which looks alarming.
Fail with EPERM → the warning path fires, which is fine.

In practice, Kurtosis/Docker containers cap the hard limit at 524288 or similar, so this won't bite in the target environment. The ethrex original has the same behavior. Still worth a brief note, or capping with a practical sentinel like 1 << 20 the way the macOS branch caps at kern.maxfilesperproc.

from == to produces a misleading log (cosmetic)

If the process already runs with soft == hard (e.g. a privileged container or a dev machine that already bumped the limit), the code still logs:

Raised RLIMIT_NOFILE soft limit from=524288 to=524288

from and to being identical suggests the limit was raised when nothing changed. Simple fix at the call site in main.rs:

Ok(fd_limit::Outcome::LimitRaised { from, to }) if from == to => {
    debug!(from, "RLIMIT_NOFILE soft limit already at hard limit, no change needed");
}
Ok(fd_limit::Outcome::LimitRaised { from, to }) => {
    info!(from, to, "Raised RLIMIT_NOFILE soft limit");
}

Or handle it inside raise_fd_limit itself by returning Outcome::Unsupported / a new NoChangeNeeded variant when already at max.

Code Quality

The #[allow(clippy::enum_variant_names)] on Error is correct — all three variants share the Failed prefix, which is inherent to the error domain, not a naming mistake.
#[allow(dead_code)] on Outcome::Unsupported is appropriate: the variant is only reachable on platforms other than Linux/macOS/iOS, so it would be dead in a typical CI build.
thiserror is correctly pulled from the workspace (bin/ethlambda/Cargo.toml:31), no new dep entry required.
The Apache-2.0 attribution header from Parity Technologies is preserved — correct.
#[allow(clippy::useless_conversion, non_camel_case_types)] inherited from ethrex; both are needed for the libc FFI types and .into() conversions on platforms where the underlying integer widths already match.

Convention Adherence

info!(from, to, "Raised RLIMIT_NOFILE soft limit") uses tracing shorthand correctly per project style.
warn!(%err, "Failed to raise ...") matches the inspect_err/warn! pattern used elsewhere.
Module placement (mod fd_limit; before mod version;) follows alphabetical ordering.
The startup comment explaining why the limit matters (unbounded table cache, EMFILE, panic chain) is the right level of context — the "why is non-obvious" bar from the project guide is clearly met here.

Security / Safety

The unsafe blocks are minimal and correct:

Both getrlimit/setrlimit calls check return codes.
No pointer aliasing or lifetime issues; all buffers are stack-allocated and passed by address.
The macOS sysctl call initialises maxfiles to 0 before use and checks the return value.

One theoretical issue on macOS: if sysctl succeeds but writes a negative value into maxfiles: libc::c_int, the maxfiles as libc::rlim_t cast sign-extends to a very large u64, and cmp::min(..., rlim.rlim_max) picks rlim_max instead — benign in practice since kern.maxfilesperproc is always positive.

Summary

This is a clean, focused fix for a real production incident. The code is a faithful port of a proven module, handles errors gracefully (warn, don't crash), and is correctly gated per platform. The two items above (RLIM_INFINITY edge case and from == to logging) are low-severity and inherited from upstream — the PR is ready to merge as-is, with the logging cosmetic as an optional follow-up.

The PR description's own follow-ups (converting expect("commit") in Store to Result, considering a bounded set_max_open_files) are the right longer-term mitigations and don't block this change.

Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps · 2026-06-08T14:57:34Z

Greptile Summary

This PR ports the raise_fd_limit module from ethrex to ethlambda, bumping the process soft RLIMIT_NOFILE to the hard limit at startup so RocksDB's unbounded table cache no longer exhausts the default container ulimit of 1024 and causes panics.

fd_limit.rs is a clean port with platform-specific branches: Linux sets soft = hard, macOS/iOS caps at min(kern.maxfilesperproc, hard), and other platforms are a no-op.
main.rs calls the function immediately after the banner, logs the from→to values on success, and only warns on failure — no change can abort the node.
Cargo.toml and bin/ethlambda/Cargo.toml add libc = "0.2" to the workspace and binary crate respectively; thiserror was already present.

Confidence Score: 4/5

Safe to merge — failure paths are non-fatal and the fix directly addresses the documented production panic.

The change is a narrow, well-understood system call sequence ported from a proven sibling codebase. The Linux path has a silent degradation when the hard limit is RLIM_INFINITY (setrlimit returns EINVAL, only a warning is logged), and the LimitRaised variant fires even when no actual change was made — both are cosmetic/edge-case issues that do not affect the primary fix for the container ulimit-1024 panic.

bin/ethlambda/src/fd_limit.rs — Linux branch edge case with RLIM_INFINITY hard limits and the always-fires LimitRaised variant.

Important Files Changed

Filename	Overview
bin/ethlambda/src/fd_limit.rs	New module ported from ethrex that raises RLIMIT_NOFILE soft limit to the hard limit on Linux, and to min(kern.maxfilesperproc, hard) on macOS/iOS; no-op elsewhere. The Linux path does not guard against RLIM_INFINITY hard limits, which causes setrlimit to fail silently (only a warn is emitted).
bin/ethlambda/src/main.rs	Calls raise_fd_limit() right after the startup banner, logs from→to on success, and only warns on failure — safe and non-fatal.
Cargo.toml	Adds libc = "0.2" to workspace dependencies; straightforward.
bin/ethlambda/Cargo.toml	Wires libc.workspace = true into the binary crate; thiserror was already present so all derives compile.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[main: startup banner] --> B[fd_limit::raise_fd_limit]
    B --> C{Platform?}
    C -- Linux --> D[getrlimit RLIMIT_NOFILE]
    C -- macOS/iOS --> E[sysctl kern.maxfilesperproc]
    C -- Other --> F[return Unsupported]
    D --> G[rlim_cur = rlim_max]
    E --> H[getrlimit RLIMIT_NOFILE]
    H --> I["rlim_cur = min(maxfilesperproc, rlim_max)"]
    G --> J[setrlimit]
    I --> J
    J -- success --> K["Ok(LimitRaised { from, to })"]
    J -- error --> L["Err(FailedToSetLimit)"]
    K --> M["info! Raised RLIMIT_NOFILE from=X to=Y"]
    L --> N["warn! Failed to raise RLIMIT_NOFILE"]
    F --> O["info! Not supported on this platform"]

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
bin/ethlambda/src/fd_limit.rs:144
**RLIM_INFINITY hard limit causes setrlimit to fail**

When the process hard limit for `RLIMIT_NOFILE` is `RLIM_INFINITY` (e.g., `ulimit -H -n unlimited`), `rlim_cur` is set to `RLIM_INFINITY` (`u64::MAX`). Linux rejects this with `EINVAL` because `RLIMIT_NOFILE` is silently capped by `/proc/sys/fs/nr_open` (default 1048576). The call lands in the `Err` branch in `main.rs` and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against `kern.maxfilesperproc`; the same approach — reading `/proc/sys/fs/nr_open` and using `min(rlim_max, nr_open)` — would make the Linux path robust in this scenario.

### Issue 2 of 2
bin/ethlambda/src/fd_limit.rs:155-158
**`LimitRaised` fires even when the limit is unchanged**

If the soft limit is already equal to the hard limit, the code calls `setrlimit` with an identical value (which succeeds), and returns `LimitRaised { from: N, to: N }`. The caller in `main.rs` then logs `"Raised RLIMIT_NOFILE soft limit from=N to=N"`, which may confuse operators inspecting startup logs. Adding a short-circuit before `setrlimit` — returning `Unsupported` or a dedicated `AlreadyAtMax` variant when `old_value == rlim.rlim_cur` — would make the log accurate.

_{Reviews (1): Last reviewed commit: "fix: raise RLIMIT_NOFILE soft limit at s..." | Re-trigger Greptile}

greptile-apps · 2026-06-08T14:57:37Z

+        let old_value = rlim.rlim_cur;
+
+        // Set soft limit to hard limit
+        rlim.rlim_cur = rlim.rlim_max;


RLIM_INFINITY hard limit causes setrlimit to fail

When the process hard limit for RLIMIT_NOFILE is RLIM_INFINITY (e.g., ulimit -H -n unlimited), rlim_cur is set to RLIM_INFINITY (u64::MAX). Linux rejects this with EINVAL because RLIMIT_NOFILE is silently capped by /proc/sys/fs/nr_open (default 1048576). The call lands in the Err branch in main.rs and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against kern.maxfilesperproc; the same approach — reading /proc/sys/fs/nr_open and using min(rlim_max, nr_open) — would make the Linux path robust in this scenario.

Prompt To Fix With AI

This is a comment left during a code review. Path: bin/ethlambda/src/fd_limit.rs Line: 144 Comment: **RLIM_INFINITY hard limit causes setrlimit to fail** When the process hard limit for `RLIMIT_NOFILE` is `RLIM_INFINITY` (e.g., `ulimit -H -n unlimited`), `rlim_cur` is set to `RLIM_INFINITY` (`u64::MAX`). Linux rejects this with `EINVAL` because `RLIMIT_NOFILE` is silently capped by `/proc/sys/fs/nr_open` (default 1048576). The call lands in the `Err` branch in `main.rs` and only a warning is logged, so the fix has no effect on those hosts. The macOS path avoids this by capping against `kern.maxfilesperproc`; the same approach — reading `/proc/sys/fs/nr_open` and using `min(rlim_max, nr_open)` — would make the Linux path robust in this scenario. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-08T14:57:38Z

+        Ok(Outcome::LimitRaised {
+            from: old_value.into(),
+            to: rlim.rlim_cur.into(),
+        })


LimitRaised fires even when the limit is unchanged

If the soft limit is already equal to the hard limit, the code calls setrlimit with an identical value (which succeeds), and returns LimitRaised { from: N, to: N }. The caller in main.rs then logs "Raised RLIMIT_NOFILE soft limit from=N to=N", which may confuse operators inspecting startup logs. Adding a short-circuit before setrlimit — returning Unsupported or a dedicated AlreadyAtMax variant when old_value == rlim.rlim_cur — would make the log accurate.

Prompt To Fix With AI

This is a comment left during a code review. Path: bin/ethlambda/src/fd_limit.rs Line: 155-158 Comment: **`LimitRaised` fires even when the limit is unchanged** If the soft limit is already equal to the hard limit, the code calls `setrlimit` with an identical value (which succeeds), and returns `LimitRaised { from: N, to: N }`. The caller in `main.rs` then logs `"Raised RLIMIT_NOFILE soft limit from=N to=N"`, which may confuse operators inspecting startup logs. Adding a short-circuit before `setrlimit` — returning `Unsupported` or a dedicated `AlreadyAtMax` variant when `old_value == rlim.rlim_cur` — would make the log accurate. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

The storage layer opens RocksDB with `set_max_open_files(-1)`, which keeps an unbounded number of SST files in the table cache. On containerized hosts the soft `ulimit -n` is 1024 by default, so on a long-running devnet the table cache eventually grows past that and the next `open(2)` returns `EMFILE`. RocksDB surfaces this as `commit: IO error: ... Too many open files`, the `.expect("commit")` in `Store` panics, the BlockChain actor dies, and the node stops making progress while the container still reports "Up". Mirror ethrex's approach: at startup, bump the soft `RLIMIT_NOFILE` to the hard limit (Linux) or `kern.maxfilesperproc` (macOS). The ported file is `fd_limit.rs`, originally from Parity Tech via ethrex under Apache-2.0; copyright header is preserved. Observed in a 4-day 16-node Kurtosis devnet where every node panicked within a few minutes of each other once the table cache grew past 1024 FDs.

MegaRedHand force-pushed the fix/raise-fd-limit branch from 98abc4a to 1f2611d Compare June 8, 2026 14:55

greptile-apps Bot reviewed Jun 8, 2026

View reviewed changes

MegaRedHand force-pushed the fix/raise-fd-limit branch from 1f2611d to f80aba1 Compare June 8, 2026 15:13

MegaRedHand force-pushed the fix/raise-fd-limit branch from f80aba1 to 0ca9d2a Compare June 8, 2026 15:14

pablodeymo approved these changes Jun 8, 2026

View reviewed changes

MegaRedHand merged commit 899babe into main Jun 8, 2026
2 of 3 checks passed

MegaRedHand deleted the fix/raise-fd-limit branch June 8, 2026 21:25

MegaRedHand mentioned this pull request Jun 10, 2026

Tune RocksDB options #266

Closed

MegaRedHand linked an issue Jun 10, 2026 that may be closed by this pull request

Tune RocksDB options #266

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: raise RLIMIT_NOFILE soft limit at startup#422

fix: raise RLIMIT_NOFILE soft limit at startup#422
MegaRedHand merged 1 commit into
mainfrom
fix/raise-fd-limit

MegaRedHand commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

github-actions Bot commented Jun 8, 2026

Uh oh!

greptile-apps Bot commented Jun 8, 2026

Important Files Changed

Uh oh!

greptile-apps Bot Jun 8, 2026

Uh oh!

greptile-apps Bot Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MegaRedHand commented Jun 8, 2026

Summary

Why

Changes

Test plan

Follow-ups (not in this PR)

Uh oh!

github-actions Bot commented Jun 8, 2026

🤖 Kimi Code Review

Uh oh!

github-actions Bot commented Jun 8, 2026

🤖 Codex Code Review

Uh oh!

github-actions Bot commented Jun 8, 2026

🤖 Claude Code Review

PR #422 — fix: raise RLIMIT_NOFILE soft limit at startup

Overview

Correctness

Code Quality

Convention Adherence

Security / Safety

Summary

Uh oh!

greptile-apps Bot commented Jun 8, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PR #422 — `fix: raise RLIMIT_NOFILE soft limit at startup`