Fix: Fork+execve of dynamically-linked ELFs crashed the child in the dynamic-linker bring-up at a small absolute address#63

Open
jserv wants to merge 2 commits into main from fileops

@jserv jserv commented May 30, 2026

Summary by cubic

Fixes a crash in fork+execve of dynamically linked ELFs by fully zeroing the PT_LOAD page tail and ignoring zero-sized PT_LOADs. Adds full openat2 RESOLVE_NO_XDEV enforcement with a pre-walk and post-open check to block cross-mount paths, including symlink and transient crossings.

  • Bug Fixes

    • Zero the whole PT_LOAD page tail using PAGE_ALIGN_UP(gpa + memsz) - gpa; skip PT_LOAD with memsz == 0 to match Linux and prevent stale interpreter state across execve.
    • Remove the dynamic-linker limitation from docs.
  • New Features

    • Enforce RESOLVE_NO_XDEV in sc_openat2 using a component walker and a post-open verifier; classify mounts as root, /proc, /dev, /sys, /tmp, /dev/shm, and per-FUSE mount; respect absolute vs dirfd anchors, clamp .. under RESOLVE_IN_ROOT, and extend tests for transient crosses, symlink targets, bare /proc, and /dev/dev/shm.

Written for commit 9b9a2e8. Summary will update on new commits.

elf_map_segments computed the BSS clear extent as PAGE_ALIGN_UP(memsz)
bytes from gpa. When gpa is not page-aligned, that left the tail of the
last page covered by the segment untouched. A fresh bootstrap saw zero
bytes there because the primary slab was MAP_ANON; after execve into the
same interpreter at the same address, the host slab still held the
previous incarnation's bytes, and glibc ld.so allocated the new main
link_map into that tail through dl_minimal_malloc, picking up a stale
l_ld value and crashing in dl_main at a small absolute address.

ld.so's RW LOAD sits at vaddr 0x2f650 in the cross-toolchain build,
which made fork+execve and any dyn-to-dyn execve under --sysroot a
reliable reproducer; the matrix in tests/test-fork-exec exercises the
fork case end to end.

Compute the extent as PAGE_ALIGN_UP(gpa + memsz) - gpa so the trailing
page is fully zeroed regardless of gpa alignment. Skip PT_LOAD entries
with memsz == 0 entirely; the unaligned-gpa rounding above would
otherwise let a crafted ELF splat zeros across the tail of an earlier
segment in the same page, or trip the infra-overlap check with no live
mapping behind it. Linux's loader ignores zero-memsz PT_LOADs and
elfuse mirrors that.
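
The difference between the two extents can be checked with a little arithmetic. The following is a minimal sketch, not the elfuse code: it assumes 4 KiB pages (the real page size may differ), the helper names `clear_extent_old` and `clear_extent_new` are hypothetical, and the `memsz` value used in the note below is invented for illustration. Only `PAGE_ALIGN_UP` and the vaddr 0x2f650 come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. SKETCH_PAGE_SIZE is a hypothetical name. */
#define SKETCH_PAGE_SIZE 0x1000ULL
#define PAGE_ALIGN_UP(x) \
    (((x) + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1))

/* Old extent: rounds memsz alone and ignores gpa's offset within its page. */
static uint64_t clear_extent_old(uint64_t gpa, uint64_t memsz)
{
    (void) gpa;
    return PAGE_ALIGN_UP(memsz);
}

/* New extent: rounds the segment's end address, then measures back to gpa. */
static uint64_t clear_extent_new(uint64_t gpa, uint64_t memsz)
{
    /* Zero-memsz PT_LOADs are assumed to be skipped before this point;
     * otherwise an unaligned gpa would still yield a non-zero extent. */
    assert(memsz != 0);
    return PAGE_ALIGN_UP(gpa + memsz) - gpa;
}
```

With gpa = 0x2f650 and an illustrative memsz of 0x1a00, the segment ends at 0x31050 and its last page ends at 0x32000. The old extent is 0x2000 bytes, which stops at 0x31650 and leaves 0x9b0 bytes of the last page untouched; the new extent is 0x29b0 bytes and reaches the page boundary.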
@jserv jserv requested a review from Max042004 May 30, 2026 22:12
cubic-dev-ai[bot]

This comment was marked as resolved.

@jserv jserv force-pushed the fileops branch 2 times, most recently from 470846d to edde64c Compare May 31, 2026 05:59
sc_openat2 previously accepted RESOLVE_NO_XDEV and let the open through
without enforcement, which left it as the only unimplemented RESOLVE_*
flag from include/uapi/linux/openat2.h. The replacement is a
left-to-right component walker in path_openat2_crosses_mount that
classifies each running prefix against a mount-class taxonomy and
returns -EXDEV the first time the class changes.

The taxonomy distinguishes the root filesystem, /proc, /dev, /sys,
/tmp, /dev/shm, and each live or tombstoned FUSE mount keyed by its
mount_id. /tmp and /dev/shm are split out because Linux mounts them as
separate tmpfs filesystems, and treating them as DEV or ROOT would
under-reject. FUSE classes live above PATH_MOUNT_FUSE_BASE = 0x10000000
so mount_id growth never collides with the named classes.
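
One way to picture the taxonomy is a small enum for the named classes plus an offset for FUSE mounts. This is a sketch under assumptions: only `PATH_MOUNT_FUSE_BASE` and its value come from the text above; the enum, its member names, and `sketch_fuse_class` are hypothetical, and mapping a mount to `base + mount_id` is a guess at how "above the base" is realised.

```c
#include <assert.h>
#include <stdint.h>

/* Value taken from the commit message. */
#define PATH_MOUNT_FUSE_BASE 0x10000000

/* Hypothetical names for the named mount classes. */
enum sketch_mount_class {
    SKETCH_MOUNT_ROOT = 0,
    SKETCH_MOUNT_PROC,
    SKETCH_MOUNT_DEV,
    SKETCH_MOUNT_SYS,
    SKETCH_MOUNT_TMP,     /* separate tmpfs on Linux */
    SKETCH_MOUNT_DEV_SHM, /* separate tmpfs on Linux */
};

/* Map a FUSE mount_id to a class value that cannot equal a named class. */
static int32_t sketch_fuse_class(int32_t mount_id)
{
    assert(mount_id >= 0);
    return PATH_MOUNT_FUSE_BASE + mount_id;
}
```

Because the named classes occupy small integers and every FUSE class is at least 0x10000000, a plain integer comparison is enough to detect a class change.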

The walk anchor matches kernel semantics: absolute paths under !in_root
begin at /, anything else begins at the dirfd's tracked guest path.
dirfd_guest_base_path pulls that from proc_path for /proc dirfds, from
fuse_resolve_at_path(".") for FUSE dirfds, and from F_GETPATH stripped
of the configured sysroot for regular dirfds. Components advance the
running path; . is skipped; .. pops the trailing component but clamps
at a floor (1 for non-IN_ROOT walks so a /proc/1 -> /proc -> / cross
still surfaces, dirfd-base length for IN_ROOT walks so the precheck
never out-rejects what path_openat2_normalize_in_root applies later in
the open).
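
The `..` handling with a floor can be sketched as an in-place pop that never truncates below a given length. This is an illustration only: `sketch_pop_component` is a hypothetical helper, and it assumes the running path is absolute, has no trailing slash, and that the floor falls on a component boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pop the trailing component of `path` in place, never truncating the
 * string below `floor` bytes. Returns the new length.
 */
static size_t sketch_pop_component(char *path, size_t floor)
{
    size_t len = strlen(path);
    assert(floor >= 1 && floor <= len);

    /* Scan back to the slash that starts the trailing component. */
    while (len > floor && path[len - 1] != '/')
        len--;
    /* Drop that slash too, unless it belongs to the floor (e.g. "/"). */
    if (len > floor)
        len--;
    path[len] = '\0';
    return len;
}
```

With a floor of 1, "/proc/1" pops to "/proc" and then to "/", so the walker still sees the change of class on the way up. With a floor equal to the length of a dirfd base such as the invented "/srv/jail", further pops leave the path at that base.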

Component-by-component classification is required because lexical
collapse hides transient mount visits: /proc/self/../../tmp/foo
normalizes to /tmp/foo even though the walk passes through /proc, and
Linux NO_XDEV catches that. The walker classifies after every step so
the transient PROC excursion surfaces as EXDEV before the upward
components apply.
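
A stripped-down version of such a walk shows why classifying after every step matters. This sketch is not path_openat2_crosses_mount: it handles absolute paths only, ignores dirfd anchors, symlinks, FUSE mounts, and RESOLVE_IN_ROOT, and every name in it is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum { CLS_ROOT, CLS_PROC, CLS_DEV, CLS_SYS, CLS_TMP, CLS_DEV_SHM };

/* True when `path` equals `mnt` or lies beneath it. */
static int under(const char *path, const char *mnt)
{
    size_t n = strlen(mnt);
    return strncmp(path, mnt, n) == 0 && (path[n] == '\0' || path[n] == '/');
}

static int classify(const char *path)
{
    if (under(path, "/dev/shm")) return CLS_DEV_SHM; /* must precede /dev */
    if (under(path, "/proc")) return CLS_PROC;
    if (under(path, "/dev")) return CLS_DEV;
    if (under(path, "/sys")) return CLS_SYS;
    if (under(path, "/tmp")) return CLS_TMP;
    return CLS_ROOT;
}

/* Walk an absolute path; return -EXDEV on the first class change, else 0. */
static int sketch_crosses_mount(const char *abs_path)
{
    char run[256] = "/"; /* running prefix, rebuilt one component at a time */
    char comp[256];
    assert(abs_path != NULL && abs_path[0] == '/');
    int start = classify(run);
    const char *p = abs_path;

    while (*p) {
        while (*p == '/') p++;           /* skip separators */
        size_t n = strcspn(p, "/");      /* length of the next component */
        if (n == 0) break;
        if (n >= sizeof(comp)) return -ENAMETOOLONG;
        memcpy(comp, p, n);
        comp[n] = '\0';
        p += n;

        if (strcmp(comp, ".") == 0) continue;
        size_t len = strlen(run);
        if (strcmp(comp, "..") == 0) {
            /* Pop the trailing component, clamping at "/" (floor of 1). */
            while (len > 1 && run[len - 1] != '/') len--;
            if (len > 1) len--;
            run[len] = '\0';
        } else {
            if (len + 1 + n >= sizeof(run)) return -ENAMETOOLONG;
            if (len > 1) run[len++] = '/';
            memcpy(run + len, comp, n + 1);
        }
        /* Classify after every step so transient visits are caught. */
        if (classify(run) != start) return -EXDEV;
    }
    return 0;
}
```

In this sketch "/proc/self/../../etc/passwd" returns -EXDEV even though it collapses lexically to "/etc/passwd", which stays on the root class and returns 0 when passed directly.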

fuse_path_mount_id is a new helper in src/syscall/fuse.c that looks up
the mount_id for a path under fuse_lock, returning -1 outside any FUSE
mount. The walker calls it for FUSE classification, sized so distinct
mounts compare unequal.
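
The contract of the lookup (a mount_id inside a FUSE mount, -1 outside) can be sketched with a static table. Everything here is hypothetical: the table, the mountpoints, and the name `sketch_fuse_path_mount_id`. The sketch also omits locking, whereas the real helper is described as running under fuse_lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mount table; the real data structure is not shown in the PR. */
struct sketch_fuse_mount {
    const char *mountpoint;
    int mount_id;
};

static const struct sketch_fuse_mount sketch_mounts[] = {
    { "/mnt/a", 1 },
    { "/mnt/b", 2 },
};

/* Return the mount_id covering `path`, or -1 if no FUSE mount contains it. */
static int sketch_fuse_path_mount_id(const char *path)
{
    assert(path != NULL);
    size_t count = sizeof(sketch_mounts) / sizeof(sketch_mounts[0]);
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(sketch_mounts[i].mountpoint);
        /* Match whole components only, so "/mnt/ab" is not under "/mnt/a". */
        if (strncmp(path, sketch_mounts[i].mountpoint, n) == 0 &&
            (path[n] == '\0' || path[n] == '/'))
            return sketch_mounts[i].mount_id;
    }
    return -1;
}
```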

path_openat2_crosses_mount gains an out_start_class parameter; the
walker populates it whenever it returns non-error so the caller can
pass it straight into the post-open check. The signature change is
contained: sc_openat2 is the only caller.