From a221500a85a6dea92de1aa9793b0088800de7523 Mon Sep 17 00:00:00 2001 From: Jim Huang Date: Sun, 31 May 2026 06:00:59 +0800 Subject: [PATCH 1/2] Zero ELF segment page tail across execve elf_map_segments computed the BSS clear extent as PAGE_ALIGN_UP(memsz) bytes from gpa. When gpa is not page-aligned, that left the tail of the last page covered by the segment untouched. A fresh bootstrap saw zero bytes there because the primary slab was MAP_ANON; after execve into the same interpreter at the same address, the host slab still held the previous incarnation's bytes, and glibc ld.so allocated the new main link_map into that tail through dl_minimal_malloc, picking up a stale l_ld value and crashing in dl_main at a small absolute address. ld.so's RW LOAD sits at vaddr 0x2f650 in the cross-toolchain build, which made fork+execve and any dyn-to-dyn execve under --sysroot a reliable reproducer; the matrix in tests/test-fork-exec exercises the fork case end to end. Compute the extent as PAGE_ALIGN_UP(gpa + memsz) - gpa so the trailing page is fully zeroed regardless of gpa alignment. Skip PT_LOAD entries with memsz == 0 entirely; the unaligned-gpa rounding above would otherwise let a crafted ELF splat zeros across the tail of an earlier segment in the same page, or trip the infra-overlap check with no live mapping behind it. Linux's loader ignores zero-memsz PT_LOADs and elfuse mirrors that. --- docs/internals.md | 4 +--- src/core/elf.c | 29 +++++++++++++++++++++++------ 2 files changed, 24 insertions(+), 9 deletions(-) diff --git a/docs/internals.md b/docs/internals.md index e9ca130..d29ba5d 100644 --- a/docs/internals.md +++ b/docs/internals.md @@ -593,9 +593,7 @@ correctly. `elf_resolve_interp()` in `src/core/elf.c` is shared between ### Known Limitations -- `timeout` fails: it uses `fork`/`clone` to create a child process with a - timer, and the forked child inherits the dynamic-linker state but the - fork+exec path has issues in interpreter space. 
+None currently tracked for the aarch64-linux dynamic-linker path. ## GDB Stub diff --git a/src/core/elf.c b/src/core/elf.c index c20195c..7da7c80 100644 --- a/src/core/elf.c +++ b/src/core/elf.c @@ -327,13 +327,30 @@ int elf_map_segments(const elf_info_t *info, return -1; } - /* The host memset zeros PAGE_ALIGN_UP(memsz) bytes, not just memsz, - * so the infra-overlap check has to use the same rounded extent. - * Without the rounding here, a segment that ends just below - * infra_lo passes the check and still spills up to PAGE_SIZE-1 - * bytes of zero into the infra reserve via the page tail. + /* PT_LOAD with memsz == 0 maps no bytes, but the page-tail zero + * extent below still rounds up to the next page boundary. For an + * unaligned gpa that means a crafted ELF could splat zeros across + * the tail of a previously loaded segment in the same page, or + * trip the infra-overlap check with no live mapping behind it. + * Linux ignores zero-memsz PT_LOADs; mirror that here. */ - uint64_t zero_len = PAGE_ALIGN_UP(memsz); + if (memsz == 0) { + seg_idx++; + continue; + } + + /* The host memset zeros up to the next page boundary AFTER the + * segment ends, so the infra-overlap check has to use the same + * rounded extent. The end is PAGE_ALIGN_UP(gpa + memsz) rather + * than gpa + PAGE_ALIGN_UP(memsz) because gpa is not always + * page-aligned (e.g. ld.so's RW segment at vaddr 0x2f650): with + * the older bytes-from-gpa formula the page covering the last + * memsz byte kept its mid-page tail untouched, and execve into a + * dynamic-linked target then read stale state from the prior + * incarnation of the same interpreter at offsets ld.so allocates + * from beyond memsz (e.g. the first link_map in _dl_new_object). 
+ */ + uint64_t zero_len = PAGE_ALIGN_UP(gpa + memsz) - gpa; if (gpa + zero_len > guest_size) zero_len = guest_size - gpa; if (infra_active && gpa < infra_hi && gpa + zero_len > infra_lo) { From 9b9a2e84acaabdcdf209acb6745755461b0d7b9f Mon Sep 17 00:00:00 2001 From: Jim Huang Date: Sun, 31 May 2026 06:01:27 +0800 Subject: [PATCH 2/2] Enforce openat2 RESOLVE_NO_XDEV sc_openat2 previously accepted RESOLVE_NO_XDEV and let the open through without enforcement, leaving the only RESOLVE_* flag in include/uapi/linux/openat2.h unimplemented. The replacement is a left-to-right component walker in path_openat2_crosses_mount that classifies each running prefix against a mount-class taxonomy and returns -EXDEV the first time the class changes. The taxonomy distinguishes the root filesystem, /proc, /dev, /sys, /tmp, /dev/shm, and each live or tombstoned FUSE mount keyed by its mount_id. /tmp and /dev/shm are split out because Linux mounts them as separate tmpfs filesystems, and treating them as DEV or ROOT would under-reject. FUSE classes live above PATH_MOUNT_FUSE_BASE = 0x10000000 so mount_id growth never collides with the named classes. The walk anchor matches kernel semantics: absolute paths under !in_root begin at /, anything else begins at the dirfd's tracked guest path. dirfd_guest_base_path pulls that from proc_path for /proc dirfds, from fuse_resolve_at_path(".") for FUSE dirfds, and from F_GETPATH stripped of the configured sysroot for regular dirfds. Components advance the running path; . is skipped; .. pops the trailing component but clamps at a floor (1 for non-IN_ROOT walks so a /proc/1 -> /proc -> / cross still surfaces, dirfd-base length for IN_ROOT walks so the precheck never out-rejects what path_openat2_normalize_in_root applies later in the open). 
Component-by-component classification is required because lexical collapse hides transient mount visits: /proc/self/../../tmp/foo normalizes to /tmp/foo even though the walk passes through /proc, and Linux NO_XDEV catches that. The walker classifies after every step so the transient PROC excursion surfaces as EXDEV before the upward components apply. fuse_path_mount_id is a new helper in src/syscall/fuse.c that looks up the mount_id for a path under fuse_lock, returning -1 outside any FUSE mount. The walker calls it for FUSE classification, sized so distinct mounts compare unequal. path_openat2_crosses_mount gains an out_start_class parameter; the walker populates it whenever it returns non-error so the caller can pass it straight into the post-open check. The signature change is contained: sc_openat2 is the only caller. --- src/syscall/fuse.c | 14 + src/syscall/fuse.h | 5 + src/syscall/path.c | 495 ++++++++++++++++++++++++++++++++++ src/syscall/path.h | 49 ++++ src/syscall/syscall.c | 47 +++- tests/test-syscall-fidelity.c | 351 +++++++++++++++++++++++- 6 files changed, 949 insertions(+), 12 deletions(-) diff --git a/src/syscall/fuse.c b/src/syscall/fuse.c index 157191a..adab4a8 100644 --- a/src/syscall/fuse.c +++ b/src/syscall/fuse.c @@ -1488,6 +1488,20 @@ bool fuse_path_matches_mount(const char *path) return matched; } +int fuse_path_mount_id(const char *path) +{ + if (!path || path[0] != '/') + return -1; + char canon[LINUX_PATH_MAX]; + if (fuse_canonical_abs(path, canon, sizeof(canon)) < 0) + return -1; + pthread_mutex_lock(&fuse_lock); + fuse_mount_t *m = fuse_mount_for_path_locked(canon, NULL); + int id = m ? m->mount_id : -1; + pthread_mutex_unlock(&fuse_lock); + return id; +} + /* Resolve a guest-absolute path to a (session, mount_id, nodeid, attr). 
* retain_final_lookup controls whether the terminal LOOKUP's nlookup is kept * alive for a later open/release cycle, or forgotten before return for diff --git a/src/syscall/fuse.h b/src/syscall/fuse.h index fd5e58b..f7ee5a8 100644 --- a/src/syscall/fuse.h +++ b/src/syscall/fuse.h @@ -20,6 +20,11 @@ int fuse_proc_stat(struct stat *st); int64_t fuse_open_path(guest_t *g, const char *path, int linux_flags, int mode); bool fuse_path_matches_mount(const char *path); +/* Returns the mount_id of the FUSE mount containing path, or -1 if path is + * not inside any live or tombstoned FUSE mount. Distinct FUSE mounts have + * distinct mount_ids; used by RESOLVE_NO_XDEV to detect cross-mount paths. + */ +int fuse_path_mount_id(const char *path); /* Stat a FUSE-mounted path. at_flags carries the Linux AT_* mask from the * caller; only LINUX_AT_SYMLINK_NOFOLLOW is consulted today. When the * daemon returns S_IFLNK for the final component and the caller did not diff --git a/src/syscall/path.c b/src/syscall/path.c index c82b46f..5d49892 100644 --- a/src/syscall/path.c +++ b/src/syscall/path.c @@ -774,3 +774,498 @@ int path_openat2_resolved_within_root(guest_fd_t dirfd, return 0; } + +/* Mount-class taxonomy used by RESOLVE_NO_XDEV. Distinct return values + * mean distinct logical filesystems from the guest's perspective. FUSE + * mounts encode mount_id into the high bits so two distinct FUSE mounts + * compare unequal. + */ +#define PATH_MOUNT_ROOT 0 +#define PATH_MOUNT_PROC 1 +#define PATH_MOUNT_DEV 2 +#define PATH_MOUNT_SYS 3 +#define PATH_MOUNT_TMP 4 +#define PATH_MOUNT_DEV_SHM 5 +/* fuse_next_mount_id is a monotonic int starting at 100 (see fuse.c). + * The base is sized well clear of any realistic mount_id so the six + * non-FUSE classes never collide with the FUSE class numbers even after + * hundreds of millions of mount cycles. mount_id values that ever do + * approach this bound would represent a runtime that long outlived + * elfuse's intended lifetime. 
+ */ +#define PATH_MOUNT_FUSE_BASE 0x10000000 + +static int classify_guest_path_mount(const char *guest_path) +{ + if (!guest_path || guest_path[0] != '/') + return -1; + + int fuse_id = fuse_path_mount_id(guest_path); + if (fuse_id >= 0) + return PATH_MOUNT_FUSE_BASE + fuse_id; + + if (path_prefix_match(guest_path, "/proc", 5)) + return PATH_MOUNT_PROC; + if (path_prefix_match(guest_path, "/tmp", 4)) + return PATH_MOUNT_TMP; + if (path_prefix_match(guest_path, "/dev/shm", 8)) + return PATH_MOUNT_DEV_SHM; + if (path_prefix_match(guest_path, "/dev", 4)) + return PATH_MOUNT_DEV; + if (path_prefix_match(guest_path, "/sys", 4)) + return PATH_MOUNT_SYS; + + return PATH_MOUNT_ROOT; +} + +static int host_path_to_guest_path(const char *host_path, + char *out, + size_t outsz) +{ + char sysroot[LINUX_PATH_MAX]; + const char *guest_path = host_path; + + if (proc_sysroot_snapshot(sysroot, sizeof(sysroot))) { + size_t sysroot_len = strlen(sysroot); + if (!strncmp(host_path, sysroot, sysroot_len) && + (host_path[sysroot_len] == '\0' || host_path[sysroot_len] == '/')) { + guest_path = host_path + sysroot_len; + if (*guest_path == '\0') + guest_path = "/"; + } + } + + size_t len = str_copy_trunc(out, guest_path, outsz); + if (len >= outsz) { + errno = ENAMETOOLONG; + return -1; + } + return 0; +} + +static int dirfd_guest_base_path(guest_fd_t dirfd, char *out, size_t outsz) +{ + if (dirfd == LINUX_AT_FDCWD) { + proc_cwd_view_t view; + if (proc_acquire_cwd_view(&view) < 0) { + errno = EBADF; + return -1; + } + size_t len = str_copy_trunc(out, view.path, outsz); + proc_release_cwd_view(&view); + if (len >= outsz) { + errno = ENAMETOOLONG; + return -1; + } + return 0; + } + + fd_entry_t snap; + if (!fd_snapshot(dirfd, &snap)) { + errno = EBADF; + return -1; + } + if (snap.proc_path[0] != '\0') { + size_t len = str_copy_trunc(out, snap.proc_path, outsz); + if (len >= outsz) { + errno = ENAMETOOLONG; + return -1; + } + return 0; + } + + if (snap.type == FD_FUSE_DIR) { + int rc = 
fuse_resolve_at_path(dirfd, ".", out, outsz); + if (rc < 0) + return -1; + if (rc > 0) + return 0; + } + + char host_path[LINUX_PATH_MAX]; + if (path_openat2_dirfd_host_path(dirfd, host_path, sizeof(host_path)) == 0) + return host_path_to_guest_path(host_path, out, outsz); + + if (snap.type != FD_DIR) { + errno = EBADF; + return -1; + } + + /* Some host-backed directory handles cannot be named back through + * F_GETPATH. Keep a root-class fallback for those rare cases so regular + * relative paths can still proceed. + */ + out[0] = '/'; + out[1] = '\0'; + return 0; +} + +/* Pop one trailing component from an absolute path, refusing to drop + * below the supplied floor length. floor_len is strlen of the walk root + * (1 == "/" for the bare-absolute case, dirfd-base length for IN_ROOT + * resolution). At the floor the path is left unchanged, matching Linux's + * ".." at "/" semantics and RESOLVE_IN_ROOT's clamp-at-dirfd rule. + */ +static void guest_path_pop(char *current, size_t floor_len) +{ + size_t len = strlen(current); + if (len <= floor_len) + return; + char *slash = strrchr(current, '/'); + if (!slash || slash == current) { + current[0] = '/'; + current[1] = '\0'; + return; + } + if ((size_t) (slash - current) < floor_len) + return; + *slash = '\0'; +} + +static int guest_path_append(char *current, + size_t currentsz, + const char *comp, + size_t len) +{ + size_t cur_len = strlen(current); + bool need_slash = (cur_len == 0 || current[cur_len - 1] != '/'); + size_t want = cur_len + (need_slash ? 
1 : 0) + len + 1; + if (want > currentsz) { + errno = ENAMETOOLONG; + return -1; + } + if (need_slash) + current[cur_len++] = '/'; + memcpy(current + cur_len, comp, len); + current[cur_len + len] = '\0'; + return 0; +} + +static int open_guest_walk_root_fd(guest_fd_t dirfd, + bool absolute, + host_fd_t *out) +{ + if (absolute) { + char sysroot[LINUX_PATH_MAX]; + const char *root = "/"; + if (proc_sysroot_snapshot(sysroot, sizeof(sysroot))) + root = sysroot; + *out = open(root, O_RDONLY | O_DIRECTORY | O_CLOEXEC); + return *out < 0 ? -1 : 0; + } + + host_fd_ref_t dir_ref; + if (host_dirfd_ref_open(dirfd, &dir_ref) < 0) { + errno = EBADF; + return -1; + } + + if (dir_ref.fd == AT_FDCWD) + *out = open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC); + else + *out = dup(dir_ref.fd); + host_fd_ref_close(&dir_ref); + return *out < 0 ? -1 : 0; +} + +static int replace_walk_fd(host_fd_t *current_fd, host_fd_t next_fd) +{ + if (next_fd < 0) + return -1; + if (*current_fd >= 0) + close(*current_fd); + *current_fd = next_fd; + return 0; +} + +static int reset_walk_fd(host_fd_t *current_fd, host_fd_t root_fd) +{ + host_fd_t next_fd = dup(root_fd); + if (next_fd < 0) + return -1; + return replace_walk_fd(current_fd, next_fd); +} + +int path_openat2_crosses_mount(guest_fd_t dirfd, + const char *path, + bool in_root, + int *out_start_class) +{ + if (out_start_class) + *out_start_class = -1; + if (!path) { + errno = EINVAL; + return -1; + } + + char current[LINUX_PATH_MAX]; + const char *walk = path; + char pending[LINUX_PATH_MAX]; + host_fd_t current_fd = -1; + host_fd_t root_fd = -1; + host_fd_t absolute_root_fd = -1; + bool host_walk = true; + int symlink_count = 0; + int rc = -1; + + /* The walk has to track every intermediate prefix because lexical + * collapsing of ".." would erase a transient mount crossing (e.g. + * "/proc/self/../../tmp" passes through /proc before the upward + * components apply, and Linux NO_XDEV detects that). 
The start frame + * matches how the kernel anchors resolution: absolute paths begin at + * "/" regardless of dirfd; relative paths and RESOLVE_IN_ROOT begin at + * the dirfd's tracked guest path. + */ + if (path[0] == '/' && !in_root) { + current[0] = '/'; + current[1] = '\0'; + } else if (dirfd_guest_base_path(dirfd, current, sizeof(current)) < 0) { + goto out; + } + + /* IN_ROOT clamps ".." at dirfd; outside IN_ROOT the walker can + * traverse up to "/" so a transition like /proc/1 -> /proc -> / + * surfaces as the expected cross. The floor matches whichever rule + * applies so the precheck never out-rejects the actual resolution + * that follows in path_openat2_normalize_in_root. + */ + size_t floor_len = in_root ? strlen(current) : 1; + + int start_class = classify_guest_path_mount(current); + if (start_class < 0) { + errno = EINVAL; + goto out; + } + if (out_start_class) + *out_start_class = start_class; + + if (open_guest_walk_root_fd(dirfd, path[0] == '/' && !in_root, + &current_fd) < 0) { + if (path[0] == '/' || errno != EBADF) + goto out; + host_walk = false; + errno = 0; + } + if (host_walk) { + root_fd = dup(current_fd); + if (root_fd < 0) + goto out; + if (open_guest_walk_root_fd(LINUX_AT_FDCWD, true, &absolute_root_fd) < + 0) + goto out; + } + + while (*walk) { + while (*walk == '/') + walk++; + if (!*walk) + break; + + const char *comp = walk; + while (*walk && *walk != '/') + walk++; + size_t len = (size_t) (walk - comp); + + if (len == 1 && comp[0] == '.') + continue; + + if (len == 2 && comp[0] == '.' 
&& comp[1] == '.') { + size_t before_len = strlen(current); + guest_path_pop(current, floor_len); + if (host_walk && strlen(current) < before_len) { + host_fd_t parent_fd = openat( + current_fd, "..", O_RDONLY | O_DIRECTORY | O_CLOEXEC); + if (replace_walk_fd(&current_fd, parent_fd) < 0) + goto out; + } + } else { + char name[NAME_MAX + 1]; + char parent[LINUX_PATH_MAX]; + if (len > NAME_MAX) { + errno = ENAMETOOLONG; + goto out; + } + memcpy(name, comp, len); + name[len] = '\0'; + if (str_copy_trunc(parent, current, sizeof(parent)) >= + sizeof(parent)) { + errno = ENAMETOOLONG; + goto out; + } + + struct stat st; + if (host_walk && + fstatat(current_fd, name, &st, AT_SYMLINK_NOFOLLOW) == 0) { + if (S_ISLNK(st.st_mode)) { + if (guest_path_append(current, sizeof(current), comp, len) < + 0) + goto out; + + int cls = classify_guest_path_mount(current); + if (cls < 0) { + errno = EINVAL; + goto out; + } + if (cls != start_class) { + rc = 1; + goto out; + } + str_copy_trunc(current, parent, sizeof(current)); + + char target[LINUX_PATH_MAX]; + ssize_t target_len = readlinkat(current_fd, name, target, + sizeof(target) - 1); + if (target_len < 0) + goto out; + if (++symlink_count > MAXSYMLINKS) { + errno = ELOOP; + goto out; + } + target[target_len] = '\0'; + + char rest_buf[LINUX_PATH_MAX]; + const char *rest = walk; + while (*rest == '/') + rest++; + if (str_copy_trunc(rest_buf, rest, sizeof(rest_buf)) >= + sizeof(rest_buf)) { + errno = ENAMETOOLONG; + goto out; + } + if (snprintf(pending, sizeof(pending), "%s%s%s", target, + rest_buf[0] ? "/" : "", + rest_buf) >= (int) sizeof(pending)) { + errno = ENAMETOOLONG; + goto out; + } + walk = pending; + + if (target[0] == '/') { + host_fd_t reset_fd = + in_root ? 
root_fd : absolute_root_fd; + if (reset_walk_fd(&current_fd, reset_fd) < 0) + goto out; + if (in_root) { + if (dirfd_guest_base_path(dirfd, current, + sizeof(current)) < 0) + goto out; + } else { + current[0] = '/'; + current[1] = '\0'; + } + } + continue; + } + } else if (host_walk && errno != ENOENT) { + goto out; + } + + if (guest_path_append(current, sizeof(current), comp, len) < 0) + goto out; + } + + int cls = classify_guest_path_mount(current); + if (cls < 0) { + errno = EINVAL; + goto out; + } + if (cls != start_class) { + rc = 1; + goto out; + } + + const char *rest = walk; + while (*rest == '/') + rest++; + if (host_walk && *rest != '\0' && + !(len == 2 && comp[0] == '.' && comp[1] == '.')) { + char name[NAME_MAX + 1]; + if (len > NAME_MAX) { + errno = ENAMETOOLONG; + goto out; + } + memcpy(name, comp, len); + name[len] = '\0'; + host_fd_t next_fd = + openat(current_fd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC); + if (replace_walk_fd(&current_fd, next_fd) < 0) + goto out; + } + } + + rc = 0; + +out: + if (current_fd >= 0) + close(current_fd); + if (root_fd >= 0) + close(root_fd); + if (absolute_root_fd >= 0) + close(absolute_root_fd); + return rc; +} + +int path_openat2_check_fd_xdev(int guest_fd, int start_class) +{ + if (start_class < 0) { + errno = EINVAL; + return -1; + } + + fd_entry_t snap; + if (!fd_snapshot(guest_fd, &snap)) { + errno = EBADF; + return -1; + } + + /* Synthetic /dev fds (FD_URANDOM) and FUSE fds have no resolvable + * host path, but their semantic class is fixed by the fd type; + * classify those without F_GETPATH so a NO_XDEV resolution that + * intended to land outside /dev or outside the originating FUSE + * mount catches them. + */ + /* The post-check is only meaningful for resolutions that started in + * the root class. 
For PROC/DEV/SYS/TMP/DEV_SHM/FUSE the precheck's + * walker already classified the dirfd against the right intercept, + * and any successful open went through the intercept layer (procfs + * emulation backs FD_REGULAR with a /tmp/elfuse-proc-XXXXXX temp + * file whose F_GETPATH would mis-classify as /tmp). Trust the + * precheck in those cases and only re-derive the class when the + * resolution started at root: that is precisely the window where a + * symlink can escape into an intercept class without the walker + * seeing it (sidecar shadows hide the link node from fstatat). + */ + if (start_class != PATH_MOUNT_ROOT) + return 0; + + char guest_path[LINUX_PATH_MAX]; + int end_class; + if (snap.proc_path[0] != '\0') { + end_class = classify_guest_path_mount(snap.proc_path); + } else if (snap.type == FD_URANDOM) { + end_class = PATH_MOUNT_DEV; + } else if (snap.type == FD_FUSE_DIR || snap.type == FD_FUSE_FILE || + snap.type == FD_FUSE_DEV) { + int mnt_id; + if (fuse_fd_mnt_id(guest_fd, &mnt_id) < 0) + return -1; + end_class = PATH_MOUNT_FUSE_BASE + mnt_id; + } else if (snap.host_fd >= 0) { + char host_path[LINUX_PATH_MAX]; + if (fcntl(snap.host_fd, F_GETPATH, host_path) < 0) + return -1; + if (host_path_to_guest_path(host_path, guest_path, sizeof(guest_path)) < + 0) + return -1; + end_class = classify_guest_path_mount(guest_path); + } else { + errno = EBADF; + return -1; + } + if (end_class < 0) { + errno = EINVAL; + return -1; + } + + return (end_class != start_class) ? 1 : 0; +} diff --git a/src/syscall/path.h b/src/syscall/path.h index 765a6e3..fba5250 100644 --- a/src/syscall/path.h +++ b/src/syscall/path.h @@ -70,3 +70,52 @@ int path_openat2_resolved_within_root(guest_fd_t dirfd, const char *path, uint64_t oflags, bool in_root); +/* Returns 1 if resolving path against dirfd would cross a mount boundary + * from the guest's perspective, 0 if it stays inside the same logical + * filesystem, and -1 with errno set on dirfd lookup failures. 
Mount + * classes are: regular guest filesystem, /proc, /dev, /sys, /tmp, + * /dev/shm, and each live or tombstoned FUSE mount (keyed by mount_id). + * The walker classifies every intermediate prefix as it advances, so + * transient excursions through /proc that lexically resolve back into + * the root class still surface as a crossing. Symlink components are + * expanded inline against the host-walk fd when possible so a link + * whose target lives in a different class is caught at the precheck. + * + * When out_start_class is non-NULL it is populated with the dirfd's + * mount class on every non-error return so the caller can re-run the + * check against the actually opened fd via path_openat2_check_fd_xdev. + * The post-open check is what closes the symlink bypass for callers + * that do not also set RESOLVE_NO_SYMLINKS: the precheck's fstatat + * walk cannot see symlinks that live in a sidecar shadow directory + * (case-fold sysroot), so the kernel may follow a link the walker did + * not, and only F_GETPATH on the resulting fd reveals the real + * landing site. + * + * Known gaps (best-effort by design): + * - host_path_to_guest_path strips the configured sysroot prefix with + * a case-sensitive strncmp; on case-insensitive macOS volumes a + * differently-cased F_GETPATH could fail to strip and the dirfd is + * then classified as the root class. Sysroots that happen to live + * under /proc, /dev, or /sys on the host are not supported. + * - A sibling vCPU that chdir(2)s, dup3(2)s over dirfd, or mounts / + * unmounts a FUSE filesystem between this check and the subsequent + * sys_openat may shift the resolution into a different mount class + * without the cross being detected. The race window is narrow and + * the guest is in elfuse's trust domain. + */ +int path_openat2_crosses_mount(guest_fd_t dirfd, + const char *path, + bool in_root, + int *out_start_class); + +/* Post-open verification for RESOLVE_NO_XDEV. 
Reads the host-side + * canonical path of the just-opened guest fd via fcntl(F_GETPATH), + * strips the sysroot prefix, and classifies the result against the + * start class captured by path_openat2_crosses_mount. Returns 1 if + * the resolved fd sits in a different mount class than the resolution + * started in, 0 if it stays in the same class, -1 with errno set on + * lookup failures (e.g. fd closed, F_GETPATH refused). Catches the + * symlink-driven crossings that the string-only precheck misses by + * design. + */ +int path_openat2_check_fd_xdev(int guest_fd, int start_class); diff --git a/src/syscall/syscall.c b/src/syscall/syscall.c index be97787..b9bb4f9 100644 --- a/src/syscall/syscall.c +++ b/src/syscall/syscall.c @@ -1392,11 +1392,11 @@ static int64_t sc_openat2(guest_t *g, return -LINUX_EAGAIN; /* For RESOLVE_NO_SYMLINKS, RESOLVE_NO_MAGICLINKS, RESOLVE_BENEATH, - * RESOLVE_IN_ROOT: read the guest path and enforce constraints before - * opening. + * RESOLVE_IN_ROOT, RESOLVE_NO_XDEV: read the guest path and enforce + * constraints before opening. 
*/ if (resolve & (RESOLVE_NO_SYMLINKS | RESOLVE_NO_MAGICLINKS | - RESOLVE_BENEATH | RESOLVE_IN_ROOT)) { + RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NO_XDEV)) { char path[LINUX_PATH_MAX]; if (guest_read_str(g, x1, path, sizeof(path)) < 0) return -LINUX_EFAULT; @@ -1427,6 +1427,17 @@ static int64_t sc_openat2(guest_t *g, path_openat2_is_proc_magiclink((int) x0, path)) return -LINUX_ELOOP; + int no_xdev_start_class = -1; + if (resolve & RESOLVE_NO_XDEV) { + int crossed = path_openat2_crosses_mount( + (int) x0, path, (resolve & RESOLVE_IN_ROOT) != 0, + &no_xdev_start_class); + if (crossed < 0) + return linux_errno(); + if (crossed > 0) + return -LINUX_EXDEV; + } + if (resolve & (RESOLVE_BENEATH | RESOLVE_IN_ROOT)) { if (path_openat2_resolved_within_root( (int) x0, path, oflags, (resolve & RESOLVE_IN_ROOT) != 0) < @@ -1437,21 +1448,39 @@ static int64_t sc_openat2(guest_t *g, } } + int64_t opened; if (resolve & RESOLVE_IN_ROOT) { char rooted[LINUX_PATH_MAX]; if (path_openat2_normalize_in_root(path, rooted, sizeof(rooted)) < 0) { return -LINUX_ENAMETOOLONG; } - return sys_openat_path(g, (int) x0, rooted, (int) oflags, - (int) mode); + opened = + sys_openat_path(g, (int) x0, rooted, (int) oflags, (int) mode); + } else { + opened = sys_openat(g, (int) x0, x1, (int) oflags, (int) mode); + } + if (opened >= 0 && (resolve & RESOLVE_NO_XDEV) && + no_xdev_start_class >= 0) { + /* The string walker cannot see symlinks that the kernel + * followed during the actual open (sysroot case-fold sidecar + * shadows hide the link node from the precheck's fstatat + * walk). Re-classify the opened fd's resolved host path; if + * it landed in a different mount class, drop the fd and + * return EXDEV. This also tightens the precheck-vs-open + * TOCTOU window since the post-check sees the exact path + * the kernel resolved. 
+ */ + int crossed = + path_openat2_check_fd_xdev((int) opened, no_xdev_start_class); + if (crossed > 0) { + sys_close((int) opened); + return -LINUX_EXDEV; + } } + return opened; } - /* RESOLVE_NO_XDEV is not enforced yet. elfuse currently resolves all - * guest paths within one host-backed filesystem view. - */ - return sys_openat(g, (int) x0, x1, (int) oflags, (int) mode); } diff --git a/tests/test-syscall-fidelity.c b/tests/test-syscall-fidelity.c index a4e1001..f813f12 100644 --- a/tests/test-syscall-fidelity.c +++ b/tests/test-syscall-fidelity.c @@ -6,7 +6,7 @@ * Covers Linux syscalls whose semantics elfuse must emulate exactly: * fchmodat2 (SYS 452) including AT_SYMLINK_NOFOLLOW, getcpu (SYS 168), * openat2 (SYS 437) with each RESOLVE_* flag variant (BENEATH, - * IN_ROOT, NO_SYMLINKS, NO_MAGICLINKS), O_PATH descriptor enforcement + * IN_ROOT, NO_SYMLINKS, NO_MAGICLINKS, NO_XDEV), O_PATH descriptor enforcement * for read/write/fstat, madvise corner cases (MADV_COLD acceptance and * MADV_DONTNEED across an unmapped hole), and the low-address mmap * hint preservation that ET_EXEC layout depends on. 
@@ -165,10 +165,11 @@ struct open_how { unsigned long long flags, mode, resolve; }; -#define RESOLVE_BENEATH 0x08 -#define RESOLVE_IN_ROOT 0x10 +#define RESOLVE_NO_XDEV 0x01 #define RESOLVE_NO_MAGICLINKS 0x02 #define RESOLVE_NO_SYMLINKS 0x04 +#define RESOLVE_BENEATH 0x08 +#define RESOLVE_IN_ROOT 0x10 static void test_openat2_basic(void) { @@ -456,6 +457,337 @@ static void test_openat2_resolve_no_magiclinks_proc_cwd(void) EXPECT_TRUE(errno == ELOOP, "wrong errno"); } +static void test_openat2_resolve_no_xdev_rejects_proc(void) +{ + TEST("openat2 RESOLVE_NO_XDEV rejects crossing into /proc"); + int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY); + if (dirfd < 0) { + FAIL("open /tmp"); + return; + } + struct open_how how = { + .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV}; + long fd = + syscall(SYS_openat2, dirfd, "/proc/self/status", &how, sizeof(how)); + close(dirfd); + if (fd >= 0) { + close((int) fd); + FAIL("expected EXDEV"); + return; + } + EXPECT_TRUE(errno == EXDEV, "wrong errno"); +} + +static void test_openat2_resolve_no_xdev_allows_same_mount(void) +{ + TEST("openat2 RESOLVE_NO_XDEV allows same-mount path"); + int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY); + if (dirfd < 0) { + FAIL("open /tmp"); + return; + } + struct open_how how = { + .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV}; + /* /tmp is a regular dir; another regular path stays in the same class. */ + long fd = syscall(SYS_openat2, dirfd, "/etc/passwd", &how, sizeof(how)); + close(dirfd); + if (fd >= 0) { + close((int) fd); + PASS(); + return; + } + /* Acceptable if /etc/passwd doesn't exist on the host running the test, + * as long as the error is not EXDEV. 
+     */
+    EXPECT_TRUE(errno != EXDEV, "should not return EXDEV for same-class path");
+}
+
+static void test_openat2_resolve_no_xdev_absolute_ignores_dirfd_mount(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV absolute path ignores dirfd mount");
+    int dirfd = open("/proc", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /proc");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "/etc/passwd", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        PASS();
+        return;
+    }
+    EXPECT_TRUE(errno != EXDEV, "absolute /etc should start from root mount");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_relative_proc(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects relative crossing into /proc");
+    int dirfd = open("/", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd =
+        syscall(SYS_openat2, dirfd, "proc/self/status", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_relative_escape(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects relative escape from /proc");
+    int dirfd = open("/proc", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /proc");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "../etc/passwd", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_allows_regular_proc_name(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV allows regular dir named proc");
+    char dir_template[] = "/tmp/elfuse-openat2-xdev-XXXXXX";
+    char *dir = mkdtemp(dir_template);
+    if (!dir) {
+        FAIL("mkdtemp");
+        return;
+    }
+    char proc_dir[PATH_MAX];
+    char file_path[PATH_MAX];
+    if (snprintf(proc_dir, sizeof(proc_dir), "%s/proc", dir) >=
+            (int) sizeof(proc_dir) ||
+        snprintf(file_path, sizeof(file_path), "%s/status", proc_dir) >=
+            (int) sizeof(file_path)) {
+        FAIL("path too long");
+        rmdir(dir);
+        return;
+    }
+    if (mkdir(proc_dir, 0700) < 0) {
+        FAIL("mkdir proc");
+        rmdir(dir);
+        return;
+    }
+    int fd = open(file_path, O_CREAT | O_WRONLY, 0600);
+    if (fd < 0) {
+        FAIL("create status");
+        rmdir(proc_dir);
+        rmdir(dir);
+        return;
+    }
+    close(fd);
+
+    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open temp dir");
+        unlink(file_path);
+        rmdir(proc_dir);
+        rmdir(dir);
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long opened = syscall(SYS_openat2, dirfd, "proc/status", &how, sizeof(how));
+    close(dirfd);
+    unlink(file_path);
+    rmdir(proc_dir);
+    rmdir(dir);
+    if (opened >= 0) {
+        close((int) opened);
+        PASS();
+        return;
+    }
+    EXPECT_TRUE(errno != EXDEV, "regular proc name is not a mount crossing");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_dev(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects crossing into /dev");
+    int dirfd = open("/tmp", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /tmp");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "/dev/null", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_dev_shm_from_dev(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects /dev to /dev/shm");
+    int dirfd = open("/dev", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /dev");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "shm/missing", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_relative_tmp(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects relative crossing into /tmp");
+    int dirfd = open("/", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /");
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "tmp/elfuse-no-xdev-missing", &how,
+                      sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_transient_proc(void)
+{
+    /* A naive endpoint-only check would accept /proc/self/../../tmp/foo
+     * because the lexical normalization collapses /proc/self/../.. to /
+     * and the final classifier sees only /tmp/foo. Linux walks the path
+     * component by component and catches the transient crossing into
+     * /proc. The component walker must do the same.
+     */
+    TEST("openat2 RESOLVE_NO_XDEV catches transient /proc visit");
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd =
+        syscall(SYS_openat2, AT_FDCWD,
+                "/proc/self/../../tmp/elfuse-no-xdev-probe", &how, sizeof(how));
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("walker accepted a path that traversed /proc");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_bare_proc(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects bare /proc");
+    struct open_how how = {
+        .flags = O_RDONLY | O_DIRECTORY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, AT_FDCWD, "/proc", &how, sizeof(how));
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV opening bare /proc");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_rejects_symlink_to_proc(void)
+{
+    TEST("openat2 RESOLVE_NO_XDEV rejects symlink crossing into /proc");
+    char dir_template[] = "/tmp/elfuse-openat2-xdev-link-XXXXXX";
+    char *dir = mkdtemp(dir_template);
+    if (!dir) {
+        FAIL("mkdtemp");
+        return;
+    }
+
+    char link_path[PATH_MAX];
+    if (snprintf(link_path, sizeof(link_path), "%s/link", dir) >=
+        (int) sizeof(link_path)) {
+        FAIL("path too long");
+        rmdir(dir);
+        return;
+    }
+    if (symlink("/proc/self", link_path) < 0) {
+        FAIL("symlink");
+        rmdir(dir);
+        return;
+    }
+
+    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open temp dir");
+        unlink(link_path);
+        rmdir(dir);
+        return;
+    }
+    struct open_how how = {
+        .flags = O_RDONLY, .mode = 0, .resolve = RESOLVE_NO_XDEV};
+    long fd = syscall(SYS_openat2, dirfd, "link/status", &how, sizeof(how));
+    close(dirfd);
+    unlink(link_path);
+    rmdir(dir);
+    if (fd >= 0) {
+        close((int) fd);
+        FAIL("expected EXDEV");
+        return;
+    }
+    EXPECT_TRUE(errno == EXDEV, "wrong errno");
+}
+
+static void test_openat2_resolve_no_xdev_in_root_clamps_dotdot(void)
+{
+    /* IN_ROOT clamps ".." at the dirfd. The combined NO_XDEV precheck must
+     * use the same floor, otherwise a path like /../../tmp from a /proc/1
+     * dirfd lexically pops above /proc and the walker would falsely report
+     * EXDEV even though the actual resolution clamps and stays inside the
+     * proc class.
+     */
+    TEST("openat2 RESOLVE_IN_ROOT | NO_XDEV clamps .. at dirfd");
+    int dirfd = open("/proc/self", O_RDONLY | O_DIRECTORY);
+    if (dirfd < 0) {
+        FAIL("open /proc/self");
+        return;
+    }
+    struct open_how how = {.flags = O_RDONLY,
+                           .mode = 0,
+                           .resolve = RESOLVE_NO_XDEV | RESOLVE_IN_ROOT};
+    long fd = syscall(SYS_openat2, dirfd, "/../../status", &how, sizeof(how));
+    close(dirfd);
+    if (fd >= 0) {
+        close((int) fd);
+        PASS();
+        return;
+    }
+    EXPECT_TRUE(errno != EXDEV,
+                "IN_ROOT clamp should keep the walk inside /proc/self");
+}
+
 /* O_PATH enforcement. */
 
 #ifndef O_PATH
@@ -595,6 +927,19 @@ int main(void)
     test_openat2_resolve_beneath_rejects_symlink_escape();
     test_openat2_resolve_no_magiclinks_proc_fd();
     test_openat2_resolve_no_magiclinks_proc_cwd();
+    test_openat2_resolve_no_xdev_rejects_proc();
+    test_openat2_resolve_no_xdev_rejects_dev();
+    test_openat2_resolve_no_xdev_allows_same_mount();
+    test_openat2_resolve_no_xdev_absolute_ignores_dirfd_mount();
+    test_openat2_resolve_no_xdev_rejects_relative_proc();
+    test_openat2_resolve_no_xdev_rejects_relative_escape();
+    test_openat2_resolve_no_xdev_allows_regular_proc_name();
+    test_openat2_resolve_no_xdev_rejects_dev_shm_from_dev();
+    test_openat2_resolve_no_xdev_rejects_relative_tmp();
+    test_openat2_resolve_no_xdev_rejects_transient_proc();
+    test_openat2_resolve_no_xdev_rejects_bare_proc();
+    test_openat2_resolve_no_xdev_rejects_symlink_to_proc();
+    test_openat2_resolve_no_xdev_in_root_clamps_dotdot();
 
     /* O_PATH */
     test_opath_read_fails();