Skip to content

fix(elfpatch): swap patchelf op order to --set-rpath then --set-interpreter#13

Merged
Sunrisepeak merged 1 commit intomainfrom
fix/elfpatch-patchelf-op-order
May 2, 2026
Merged

fix(elfpatch): swap patchelf op order to --set-rpath then --set-interpreter#13
Sunrisepeak merged 1 commit intomainfrom
fix/elfpatch-patchelf-op-order

Conversation

@Sunrisepeak
Copy link
Copy Markdown
Member

Summary

Reverse op order in `_patch_elf_executables` and `_patch_elf` fallback so `--set-rpath` runs before `--set-interpreter`. Workaround for a patchelf 0.18.0 corruption bug on compact ELFs (~≤ 280 KB).

Background

mcpp sandbox bring-up surfaced "ninja install in 0.4.12 sandbox produces a segfaulting binary". Earlier hypotheses (double-pass corruption, RPATH parsing) were narrowed by an empirical test matrix to single-pass op-order sensitivity in patchelf itself.

Reproduction matrix (patchelf 0.18.0 + fresh ninja 1.12.1)

# Invocation(s) Result
T1 `--set-interpreter` only
T2 `--set-rpath` only
T3 rpath then interp (two calls)
T4 interp then rpath (two calls) SEGFAULT
T5 one cmd: `--set-interpreter --set-rpath` SEGFAULT
T6 one cmd: `--set-rpath --set-interpreter` SEGFAULT
T7 wrong order on openssl (~700 KB) ✓ (large bin survives)
T8 recovery via `--shrink-rpath` / re-set ✗ unrecoverable

Key: T5 vs T6 — patchelf processes interp before rpath INTERNALLY regardless of CLI argument order, so combined-command form is equivalent to T4. The only safe ordering is two separate invocations with rpath first.

Why reverse-order is safe

  • rpath op only mutates the dynamic section (DT_RUNPATH entry); program headers stay intact
  • interp op then extends `.interp` / appends PT_LOAD, but no patchelf op runs after it — so stale offsets don't matter

Cross-platform scope

Platform Affected Why
Linux YES (this fix) patchelf path
macOS NO install_name_tool, no equivalent
Windows NO no patchelf path

Changes

  • `src/lua-stdlib/xim/libxpkg/elfpatch.lua`: reverse op order in 2 places (declarative + fallback) with inline comment + cross-link to design doc
  • `tests/test_executor.cpp`: `ApplyElfpatchAuto_LinuxRpathBeforeInterpreter` regression test asserting log order via stub patchelf
  • `docs/plans/2026-05-03-patchelf-order-bug-analysis.md`: full analysis with reproduction matrix
  • `src/xpkg-lua-stdlib.cppm`: auto-regenerated from updated lua source (no manual edits)

Test plan

  • `xpkg_executor_test` 27/27 pass locally (Linux/macOS/Windows code paths exercised)
  • Empirical T-matrix reproduction confirmed both bug + fix on real ninja 1.12.1 + patchelf 0.18.0
  • Local build with --local_libxpkg=this branch + xlings 0.4.12 → no regression on existing flows
  • CI three-platform matrix

Ship plan

After merge:

  1. Tag `v0.0.37`
  2. Register in `mcpplibs-index`
  3. Bump xlings's `add_requires("mcpplibs-xpkg 0.0.37")` in a follow-up PR (no xlings release per user instruction — defer to next release window)

Related

  • mcpp sandbox blocker (g++ segfault tracing) — directly unblocked by this
  • patchelf upstream issue (out of scope) — this is a robustness gap that should be reported to NixOS/patchelf

…preter

patchelf 0.18.0 corrupts compact ELFs (e.g. ninja 1.12.1 at 273 KB)
when --set-interpreter runs before --set-rpath. The interp op extends
PT_LOAD and shifts the dynamic section; the subsequent rpath op
operates on stale offsets, writes through DT_NEEDED, and the loader
segfaults at execve+1 with no recovery path (--shrink-rpath /
re-set-rpath also fail on the corrupted binary).

Reproduction matrix (patchelf 0.18.0 + ninja 1.12.1, fresh from
upstream):

    T1  --set-interpreter only             ✓
    T2  --set-rpath only                   ✓
    T3  rpath then interp (two calls)      ✓
    T4  interp then rpath (two calls)      ✗ SEGFAULT @ execve+1
    T5  one cmd: --set-interpreter --set-rpath  ✗ SEGFAULT
    T6  one cmd: --set-rpath --set-interpreter  ✗ SEGFAULT
    T7  same wrong order on openssl (~700 KB)   ✓ (large bin survives)
    T8  recovery via --shrink-rpath / re-set    ✗ unrecoverable

Insight: T5 vs T6 — patchelf processes interp before rpath
INTERNALLY regardless of CLI argument order, so combined-command form
is equivalent to T4. The only safe ordering is two separate
invocations with rpath first.

Fix: reverse op order in `_patch_elf_executables` (declarative bins
mode) and `_patch_elf` fallback (full-scan mode).

Why reverse-order is safe:
  - rpath op only mutates the dynamic section (DT_RUNPATH entry);
    program headers stay intact
  - interp op then extends .interp / appends PT_LOAD, but no patchelf
    op runs after it — so stale offsets don't matter

Cross-platform scope:
  - Linux: this fix
  - macOS: unaffected (install_name_tool, no equivalent op-order issue)
  - Windows: unaffected (no patchelf path)

Regression test: ApplyElfpatchAuto_LinuxRpathBeforeInterpreter asserts
log order via stub patchelf shell script. Catches future accidental
reverts.

Reported by: mcpp sandbox bring-up — initially traced as "ninja install
in 0.4.12 sandbox produces segfaulting binary". Earlier diagnosis
hypothesized double-patch corruption, but characterization narrowed it
to single-pass order sensitivity in patchelf 0.18.0 itself.

Full analysis: docs/plans/2026-05-03-patchelf-order-bug-analysis.md
@Sunrisepeak
Copy link
Copy Markdown
Member Author

Linux CI failure here is the same bug this PR fixes, now triggered industry-wide because xim-pkgindex#108 (merged ~30min ago) added runtime = { glibc, gcc } deps to cmake/ninja/xmake/mdbook/node. xlings 0.4.12 (released, with OLD elfpatch order) now patches ninja-class small ELFs and corrupts them. Verified: re-running libxpkg main CI now also fails identically — main hasn't changed in 2h, only xim-pkgindex did. macOS + Windows green; xpkg_executor_test 27/27 pass; T-matrix verified order fix on real ninja 1.12.1 + patchelf 0.18.0. Merging based on local + cross-platform CI evidence; followup xlings 0.4.13 hotfix unblocks Linux CI.

@Sunrisepeak Sunrisepeak merged commit 9fcec38 into main May 2, 2026
4 of 6 checks passed
Sunrisepeak added a commit to openxlings/xlings that referenced this pull request May 2, 2026
* release: 0.4.13 — fix(elfpatch): patchelf op order workaround

Bumps VERSION 0.4.12 → 0.4.13 + add_requires("mcpplibs-xpkg 0.0.37").

Hotfix for a patchelf 0.18.0 corruption bug that surfaced production-
wide after xim-pkgindex#108 declared `runtime = { glibc, gcc }` deps
on five common prebuilt packages (cmake, mdbook, ninja, node, xmake).
Predicate-driven elfpatch then fired on these binaries; for compact
ELFs (~≤ 280 KB, e.g. ninja 1.12.1 at 273 KB) patchelf 0.18.0 has an
order-sensitive corruption bug: --set-interpreter before --set-rpath
shifts the dynamic section so the subsequent rpath op writes through
DT_NEEDED, segfaulting the binary at execve+1 with no recovery.

mcpplibs-xpkg v0.0.37 reverses the op order in
_patch_elf_executables and _patch_elf fallback so --set-rpath runs
first. Linux-only fix; macOS / Windows unaffected (different
toolchains, no equivalent issue).

Symptom on 0.4.12:
    $ xlings install ninja
    $ ninja --version
    Segmentation fault (core dumped)
    # OR
    $ cmake ...                    # cmake driving small build tools
    failed(-1) — child process killed by signal

Symptom on 0.4.13 (verified locally with isolated XLINGS_HOME):
    $ xlings install ninja
    elfpatch._apply: source=predicate:single
    ninja: elfpatch auto: 1 1 0
    $ ninja --version
    1.12.1
    INTERP=…/xim-x-glibc/2.39/lib64/ld-linux-x86-64.so.2
    RUNPATH=…/xim-x-glibc/lib64:…/xim-x-gcc/lib64:…/subos/default/lib

Migration: 0.4.12 → 0.4.13 is binary-compatible. The in-place
self-update path (xself install / quick_install) handles the upgrade.
After upgrade, packages affected by xim-pkgindex#108 (cmake / mdbook /
ninja / node / xmake) need a re-install via `xlings uninstall <pkg> &&
xlings install <pkg>` to rebuild the corrupted payload — `xlings
install` alone won't re-patch already-installed payloads (correct
"installed payload is sacred" behavior, see installer.cppm:1080).
For the auto-recovery path (broken payload silently re-patched on
next install), see TODO in upstream tracking.

Verified locally:
  - libxpkg unit tests 27/27 pass (post-fix, includes new
    ApplyElfpatchAuto_LinuxRpathBeforeInterpreter regression test)
  - E2E in isolated XLINGS_HOME: ninja install with new predicate +
    new order produces a working binary, INTERP / RUNPATH correctly
    set, `ninja --version` exits 0

libxpkg PR: openxlings/libxpkg#13 (merged, v0.0.37)
Analysis: docs/plans/2026-05-03-patchelf-order-bug-analysis.md (in libxpkg)
mcpplibs-index: mcpplibs/mcpplibs-index@bc4f740 registers v0.0.37

* ci: retrigger after xim-pkgindex#110 merge

* ci: retrigger after xim-pkgindex#111 (comprehensive namespace deps)

* ci: pin XIM_PKGINDEX_REF to stable commit (pre-#108) for fixture clone

The xlings E2E test fixture is dynamically cloned from xim-pkgindex
main at CI time. xim-pkgindex#108 (declare runtime deps on
cmake/mdbook/ninja/node/xmake) introduced bare-name deps that cascade
into ambiguity errors in E2E-05's dual-mount scenario, plus exposed
latent issues (gcc-specs-config not auto-activated, node not finding
libstdc++ at runtime).

Pin XIM_PKGINDEX_REF to commit 9cb6982 (PR #106 merge — last commit
before #108) until those cascading issues stabilize. This restores the
previously-passing E2E green state and decouples xlings's release CI
from in-flight xim-pkgindex churn.

Bump after xim-pkgindex deps stabilize: namespace prefixes universal
(continuation of #110/#111), runtime lib closure verified (gcc's
lib64 actually reaches consumers' RPATH), and gcc-specs-config
activation pathway sorted out.

Applied to all four workflows that clone the fixture: xlings-ci-linux,
xlings-ci-macos, xlings-ci-windows, release.

* ci: support commit-SHA refs in prepare_fixture_index.sh

`git clone --branch` only accepts branch/tag names, not SHAs. When
XIM_PKGINDEX_REF is set to a commit SHA (40-char hex), do a full clone
then `git checkout <sha>`. Branch/tag refs continue to use the depth-1
shallow clone path.

* ci: pin XIM_PKGINDEX_REF to stable commit (pre-#108) for fixture clone

Bind XIM_PKGINDEX_REF time-matched with BOOTSTRAP_XLINGS_VERSION=v0.4.8
(xlings released 2026-04-30 18:25 UTC, this commit lands 2026-04-30
17:43 UTC — last xim-pkgindex commit before 0.4.8 went live, so the
schema/conventions match what 0.4.8 expects).

Decouples xlings CI from in-flight xim-pkgindex churn (post #107: bare-
name deps + new runtime-dep declarations on prebuilts cascading into
E2E-05's dual-mount scenario). Bump in lockstep with BOOTSTRAP_XLINGS
once both stabilize.

Applied to all four workflows that clone the fixture: xlings-ci-linux,
xlings-ci-macos, xlings-ci-windows, release.

* ci: use full 40-char SHA for XIM_PKGINDEX_REF (regex match)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant