Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
122 changes: 122 additions & 0 deletions docs/plans/2026-05-03-patchelf-order-bug-analysis.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@
# patchelf 0.18.0 op-order corruption on compact ELFs

**Date**: 2026-05-03
**Affects**: `mcpplibs-xpkg` ≤ v0.0.36 — any consumer using `elfpatch` on small ELFs (~280 KB or smaller, e.g. ninja 1.12.1 at 273 KB).
**Reported by**: mcpp sandbox bring-up — `g++` segfault diagnosis traced back to `xlings install ninja`.
**Fix**: ship in v0.0.37 — reverse op order in `_patch_elf_executables` and `_patch_elf` fallback so `--set-rpath` runs before `--set-interpreter`.

## Symptom

After `xlings install ninja` on a fresh sandbox using `xlings 0.4.12` (predicate-driven elfpatch enabled by default), the resulting ninja binary segfaults at execve+1:

```
$ ninja --version
Segmentation fault (core dumped)
```

`readelf -a` on the broken binary shows:

- ✅ `PT_INTERP` correct (points at sandbox `xim-x-glibc/2.39/lib64/ld-linux-x86-64.so.2`)
- ❌ `RUNPATH` empty — but xlings's elfpatch DID call `--set-rpath`
- ❌ `DT_NEEDED` entries missing — dynamic section corrupted
- File size 273 KB → 282 KB (patchelf grew it ~8 KB)

## Reproduction

Use the patchelf shipped via `xim:patchelf@0.18.0` and a fresh ninja 1.12.1 binary from upstream GitHub release. No xlings code involved — this is a pure patchelf characterization.

| # | patchelf invocation(s) | Result |
|---|---|---|
| T1 | `--set-interpreter <loader>` only | ✅ binary OK |
| T2 | `--set-rpath <dir>` only | ✅ binary OK |
| T3 | `--set-rpath <dir>` then `--set-interpreter <loader>` (two invocations) | ✅ binary OK |
| **T4** | `--set-interpreter <loader>` then `--set-rpath <dir>` (two invocations) | ❌ **segfault @ execve+1** |
| **T5** | `--set-interpreter <loader> --set-rpath <dir>` (single command) | ❌ **segfault @ execve+1** |
| **T6** | `--set-rpath <dir> --set-interpreter <loader>` (single command, reversed CLI args) | ❌ **segfault @ execve+1** |
| T8 | After T4 corruption, run `--shrink-rpath` or `--set-rpath <dir>` again to "fix" | ❌ unrecoverable |

Key observations:

- **T5 vs T6**: putting `--set-rpath` first in CLI args does NOT help. `patchelf` processes interp before rpath internally regardless of the CLI arg order, so combined-command form is equivalent to T4.
- **T7** (not in matrix): the same wrong order on a larger binary like `openssl` (~700 KB) leaves the binary intact. The bug only triggers on compact ELFs whose program-header padding is too tight to absorb the `.interp` section growth.
- **T8**: once corrupted, no patchelf operation recovers the binary. DT_NEEDED is gone; the loader can't resolve symbols at all.

## Why this only surfaces under `xlings 0.4.12+`

The 0.4.10-era xlings did **not** auto-patchelf consumer binaries. Migration to predicate-driven elfpatch (xlings 0.4.11, then fixed in 0.4.12) means xlings now runs patchelf on **every** ELF in install_dir whenever a runtime dep declares `exports.runtime.loader` (e.g. ninja → glibc). Compact binaries that were previously left alone now trip the order bug.

The contributor reporting this (mcpp sandbox) had been independently running their own `patchelf_walk` as a second pass for years — when the second pass ran on bins that 0.4.10 had left untouched, only their pass hit ninja and they used the same wrong order. So the bug was reachable before too, just less visible because 0.4.10 + their patch_walk hit fewer binaries.

## Root cause (patchelf internal)

`patchelf 0.18.0` processes `--set-interpreter` first within its operation pipeline:

1. `--set-interpreter` extends the `.interp` section (longer absolute path → larger string). To fit, patchelf appends a new `PT_LOAD` segment at file tail and rewrites program headers.
2. `--set-rpath` then attempts to add a `DT_RUNPATH` entry to the dynamic section. It computes the dynamic section's file offset based on assumptions stale after step 1's program-header rewrite.
3. The rpath write lands at a wrong offset, overwriting `DT_NEEDED` entries — causing the loader to crash trying to walk a malformed dynamic table.

For binaries with comfortable padding (most release binaries > 500 KB), step 1's PT_LOAD addition fits without disturbing later structures, and step 2's stale offset still happens to be correct. For compact binaries like ninja, the padding is exhausted and offsets shift.

The order-sensitive nature is patchelf's robustness gap — see [NixOS/patchelf](https://github.com/NixOS/patchelf) issue tracker for similar reports across versions. We don't fix patchelf here; we work around it.

## Fix

In `src/lua-stdlib/xim/libxpkg/elfpatch.lua`, reverse op order in two places:

### `_patch_elf_executables` (declarative bins/libs mode)

```diff
-if loader and _has_pt_interp(filepath, patch_tool) then
- ok = _exec_ok(... " --set-interpreter " ...)
-end
-if ok and rpath and rpath ~= "" then
- ok = _exec_ok(... " --set-rpath " ...)
+if rpath and rpath ~= "" then
+ ok = _exec_ok(... " --set-rpath " ...)
+end
+if ok and loader and _has_pt_interp(filepath, patch_tool) then
+ ok = _exec_ok(... " --set-interpreter " ...)
end
```

### `_patch_elf` fallback (full-scan mode)

Same reversal applied to the fallback branch.

### Why reverse-order avoids the bug

When `--set-rpath` runs first:

1. rpath modification touches only the dynamic section (DT_RUNPATH entry). Program headers stay intact.
2. `--set-interpreter` then extends `.interp` and appends a PT_LOAD. Even if program-header layout shifts, no further patchelf op runs after this — so no stale-offset write occurs.

Verified empirically (T3 in the matrix): rpath-first survives ninja 1.12.1 cleanly.

## Regression test

`tests/test_executor.cpp::ApplyElfpatchAuto_LinuxRpathBeforeInterpreter`:

- Stub `patchelf` shell script logs every invocation
- Run `apply_elfpatch_auto` against a minimal ELF
- Read log, assert `--set-rpath` line index < `--set-interpreter` line index

If a future change accidentally reverts the order, this test fails immediately with a clear message.

## Cross-platform scope

| Platform | Affected | Reason |
|---|---|---|
| Linux | YES (this fix) | patchelf-based path |
| macOS | NO | uses `install_name_tool` which has no equivalent op-order interaction |
| Windows | NO | no patchelf path; `M._apply` early-bails |

## Ship plan

1. Land this fix as libxpkg PR (this branch)
2. Tag `v0.0.37` after merge
3. Register in `mcpplibs-index`
4. Bump xlings's `add_requires("mcpplibs-xpkg X")` in a follow-up PR (no xlings release yet — defer to next 0.4.x release window)

## Upstream patchelf

This is patchelf's robustness gap. Filing an upstream bug is out-of-scope for this PR but worth doing — note that the trigger isn't just "compact binary" but also "two sequential ops where one mutates program headers". A defensive patchelf would re-resolve dynamic-section offsets after every PT_LOAD modification.
41 changes: 29 additions & 12 deletions src/lua-stdlib/xim/libxpkg/elfpatch.lua
Original file line number Diff line number Diff line change
Expand Up @@ -377,21 +377,33 @@ end
-- Patch directories as executables (interpreter + rpath). Files without
-- PT_INTERP (shared libs that happened to land in a bin dir, static
-- binaries) get rpath-only treatment instead of failing the whole entry.
--
-- Op order: --set-rpath FIRST, then --set-interpreter. patchelf 0.18.0
-- has an order-sensitive corruption bug on compact ELFs (≤ ~280 KB,
-- e.g. ninja 1.12.1 at 273 KB): when --set-interpreter runs first, it
-- extends PT_LOAD and shifts the dynamic section; the subsequent
-- --set-rpath operates on stale offsets and writes through DT_NEEDED,
-- corrupting the binary irreversibly (loader segfaults at execve+1).
-- Note this is patchelf's INTERNAL processing order — passing both
-- flags in a single command also fails because patchelf processes
-- interp before rpath internally regardless of CLI order. The
-- workaround is two separate invocations in reverse order.
-- See docs/plans/2026-05-03-patchelf-order-bug-analysis.md.
local function _patch_elf_executables(patch_tool, dirs, install_dir, loader, rpath, shrink, result)
for _, dir in ipairs(dirs) do
local full = path.is_absolute(dir) and dir or path.join(install_dir, dir)
local targets = _collect_targets(full, { include_shared_libs = true })
for _, filepath in ipairs(targets) do
result.scanned = result.scanned + 1
local ok = true
if loader and _has_pt_interp(filepath, patch_tool) then
if rpath and rpath ~= "" then
ok = _exec_ok(_shell_quote(patch_tool.program)
.. " --set-interpreter " .. _shell_quote(loader)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath))
end
if ok and rpath and rpath ~= "" then
if ok and loader and _has_pt_interp(filepath, patch_tool) then
ok = _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " --set-interpreter " .. _shell_quote(loader)
.. " " .. _shell_quote(filepath))
end
if ok then
Expand Down Expand Up @@ -473,13 +485,25 @@ local function _patch_elf(target, opts, result)
-- legitimately have no INTERP segment, causing patchelf to exit 1
-- and log noise). Files with INTERP get loader + rpath; files
-- without get rpath only.
--
-- Op order: --set-rpath FIRST, then --set-interpreter. See
-- _patch_elf_executables comment for why (patchelf 0.18.0 small-
-- ELF corruption bug) and docs/plans/2026-05-03-patchelf-order-
-- bug-analysis.md for full analysis.
_info("fallback scan mode, loader=" .. tostring(loader))
local targets = _collect_targets(target, opts)
for _, filepath in ipairs(targets) do
result.scanned = result.scanned + 1
local any_ok = false
local has_interp = _has_pt_interp(filepath, patch_tool)

if rpath and rpath ~= "" then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath)) then
any_ok = true
end
end
if loader and has_interp then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-interpreter " .. _shell_quote(loader)
Expand All @@ -489,16 +513,9 @@ local function _patch_elf(target, opts, result)
elseif loader and not has_interp then
-- Shared library / static binary: skip interp set silently;
-- still consider it for rpath. Don't penalize the patched
-- count if rpath also succeeds below.
-- count if rpath alone succeeded above.
any_ok = true
end
if rpath and rpath ~= "" then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath)) then
any_ok = true
end
end

if any_ok then
result.patched = result.patched + 1
Expand Down
57 changes: 37 additions & 20 deletions src/xpkg-lua-stdlib.cppm
Original file line number Diff line number Diff line change
Expand Up @@ -1356,21 +1356,33 @@ end
-- Patch directories as executables (interpreter + rpath). Files without
-- PT_INTERP (shared libs that happened to land in a bin dir, static
-- binaries) get rpath-only treatment instead of failing the whole entry.
--
-- Op order: --set-rpath FIRST, then --set-interpreter. patchelf 0.18.0
-- has an order-sensitive corruption bug on compact ELFs (≤ ~280 KB,
-- e.g. ninja 1.12.1 at 273 KB): when --set-interpreter runs first, it
-- extends PT_LOAD and shifts the dynamic section; the subsequent
-- --set-rpath operates on stale offsets and writes through DT_NEEDED,
-- corrupting the binary irreversibly (loader segfaults at execve+1).
-- Note this is patchelf's INTERNAL processing order — passing both
-- flags in a single command also fails because patchelf processes
-- interp before rpath internally regardless of CLI order. The
-- workaround is two separate invocations in reverse order.
-- See docs/plans/2026-05-03-patchelf-order-bug-analysis.md.
local function _patch_elf_executables(patch_tool, dirs, install_dir, loader, rpath, shrink, result)
for _, dir in ipairs(dirs) do
local full = path.is_absolute(dir) and dir or path.join(install_dir, dir)
local targets = _collect_targets(full, { include_shared_libs = true })
for _, filepath in ipairs(targets) do
result.scanned = result.scanned + 1
local ok = true
if loader and _has_pt_interp(filepath, patch_tool) then
if rpath and rpath ~= "" then
ok = _exec_ok(_shell_quote(patch_tool.program)
.. " --set-interpreter " .. _shell_quote(loader)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath))
end
if ok and rpath and rpath ~= "" then
if ok and loader and _has_pt_interp(filepath, patch_tool) then
ok = _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " --set-interpreter " .. _shell_quote(loader)
.. " " .. _shell_quote(filepath))
end
if ok then
Expand All @@ -1393,6 +1405,10 @@ local function _patch_elf_libraries(patch_tool, dirs, install_dir, rpath, shrink
local ok = true
if rpath and rpath ~= "" then
ok = _exec_ok(_shell_quote(patch_tool.program)
)__LUA__";

inline constexpr std::string_view elfpatch_lua_1 = R"__LUA__(

.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath))
end
Expand All @@ -1416,10 +1432,6 @@ local function _patch_elf(target, opts, result)
local loader = _resolve_loader(opts.loader)
local rpath = _normalize_rpath(opts.rpath)
if opts.loader and not loader then
)__LUA__";

inline constexpr std::string_view elfpatch_lua_1 = R"__LUA__(

local msg = "cannot resolve loader: " .. tostring(opts.loader)
if opts.strict then
error(msg)
Expand Down Expand Up @@ -1456,13 +1468,25 @@ inline constexpr std::string_view elfpatch_lua_1 = R"__LUA__(
-- legitimately have no INTERP segment, causing patchelf to exit 1
-- and log noise). Files with INTERP get loader + rpath; files
-- without get rpath only.
--
-- Op order: --set-rpath FIRST, then --set-interpreter. See
-- _patch_elf_executables comment for why (patchelf 0.18.0 small-
-- ELF corruption bug) and docs/plans/2026-05-03-patchelf-order-
-- bug-analysis.md for full analysis.
_info("fallback scan mode, loader=" .. tostring(loader))
local targets = _collect_targets(target, opts)
for _, filepath in ipairs(targets) do
result.scanned = result.scanned + 1
local any_ok = false
local has_interp = _has_pt_interp(filepath, patch_tool)

if rpath and rpath ~= "" then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath)) then
any_ok = true
end
end
if loader and has_interp then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-interpreter " .. _shell_quote(loader)
Expand All @@ -1472,16 +1496,9 @@ inline constexpr std::string_view elfpatch_lua_1 = R"__LUA__(
elseif loader and not has_interp then
-- Shared library / static binary: skip interp set silently;
-- still consider it for rpath. Don't penalize the patched
-- count if rpath also succeeds below.
-- count if rpath alone succeeded above.
any_ok = true
end
if rpath and rpath ~= "" then
if _exec_ok(_shell_quote(patch_tool.program)
.. " --set-rpath " .. _shell_quote(rpath)
.. " " .. _shell_quote(filepath)) then
any_ok = true
end
end

if any_ok then
result.patched = result.patched + 1
Expand Down Expand Up @@ -1752,6 +1769,10 @@ function M.auto(enable_or_opts)
else
_RUNTIME.elfpatch_legacy_auto = (enable_or_opts == true)
end
)__LUA__";

inline constexpr std::string_view elfpatch_lua_2 = R"__LUA__(

return _RUNTIME.elfpatch_legacy_auto
end

Expand All @@ -1772,10 +1793,6 @@ function M._apply()
-- macosx → Mach-O + install_name_tool: RPATH only; INTERP irrelevant
-- (dyld is the kernel's responsibility, no per-binary loader).
-- Predicate currently keys off `loader` so it's a no-op on
)__LUA__";

inline constexpr std::string_view elfpatch_lua_2 = R"__LUA__(

-- macosx unless a dep declares one — which is correct since
-- macOS deps shouldn't declare `loader`. Use elfpatch.set({
-- rpath = {...} }) explicitly if rpath-only patching needed.
Expand Down
Loading
Loading