Skip to content

smp: always wake up core when ready#2468

Open
zyuiop wants to merge 1 commit into
hermit-os:mainfrom
zyuiop:feat/smp-race-cond
Open

smp: always wake up core when ready#2468
zyuiop wants to merge 1 commit into
hermit-os:mainfrom
zyuiop:feat/smp-race-cond

Conversation

@zyuiop

@zyuiop zyuiop commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

This is a partial revert of commit f0bb0a0.

This fixes a very tricky race condition that caused some cores to not be woken up, causing tests to fail (and weird bugs in the scheduling also)

Context

In some PRs, cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 4 qemu --accel --sudo --devices virtio-fs-pci fails due to timeout.

This bug is reproducible locally, and is even more frequent with higher core counts (e.g., try it with cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 32 qemu --accel --sudo --devices virtio-fs-pci on main and you're almost certain to get the bug. You can similarly confirm that this patch solves the bug.

The symptom is that the "Laplace" computation will get stuck. You can confirm this is not just the loop taking a long time by adding a small log in compute in laplace.rs

	for i in 0..iterations {
		iteration(current, next, size_x, size_y);
		mem::swap(&mut current, &mut next);
		println!("iteration {i}")
	}

You will observe the counter go up, then suddenly stop and nothing happens. This is the bug.

I initially believed this was a futex problem (so I rewrote them partially, another PR is incoming), but it was actually something different.

The Bug (i think)

The thing that tipped me was that the bug was absent with the idle-loop flag. This greatly reduce the area that needed to be searched, since this flag is only present in a couple of places.

The bug is caused by a race condition when a core B wants to wake up a core A, which is going into halt.

The relevant call points are these:

  • PerCoreScheduler::run. This is the idle task. It uses a backoff, checking periodically the core inbox to see if there is work. If there is none and the backoff is exhausted, it goes to sleep, calling enable_and_wait.
  • wakeup_core. This is called by another CPU, after adding a task to the inbox. It sends an interrupt to the other core, to wake it up.

f0bb0a0 introduced a boolean for each core that indicates if the core is halting. If this boolean is set to false, wakeup_core skips the interrupt and returns immediately.
This works most of the time because PerCoreScheduler::run will always pick up tasks when some are available.

However, there is a possible race condition. Imagine the two following cores:

Core 1                                     Core 2

                                           run:
                                              check_inbox // inbox is empty
                                              enable_and_wait // go to enable interrupts

                                           enable_and_wait:
custom_wakeup:
  add_task_to_inbox
  wakeup_core

wakeup_core:
  check_hlt_state // hlt state is currently false
  return          // do nothing

                                              set_hlt_state(true)
                                              interrupts_enable
                                              hlt

I hope you appreciate my artistic talents, but that's the gist.

The fix

Simply a partial revert of f0bb0a0.
I was considering changing the boolean into a state machine, with a "don't sleep please" variant that could be sent by wakeup_core if the halt state was false, to indicate to a core that was about to go to sleep that it should not.
I'm not sure it was working well because I was getting weird values in my compare_and_exchange calls, but this may be a viable alternative to reduce the number of interrupts we send.

Notes

Increasing the core count increases the likelihood of the race condition happening. Laplace appears to be a good test-case because all threads are waiting on each other, so if ONE thread does not wake up, the program remains stuck.

It is possible that this patch will also solve other issues related to the scheduler suddenly becoming unresponsive until you cause an interrupt, but I have not tested that yet.

Fixes #2317

This is a partial revert of commit f0bb0a0
@zyuiop zyuiop force-pushed the feat/smp-race-cond branch from 1b60684 to f4942c9 Compare June 6, 2026 14:26

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark Results

Details
Benchmark Current: f4942c9 Previous: 28c7089 Performance Ratio
startup_benchmark Build Time 113.60 s 115.28 s 0.99
startup_benchmark File Size 0.77 MB 0.77 MB 1.00
Startup Time - 1 core 1.00 s (±0.02 s) 1.00 s (±0.03 s) 1.01
Startup Time - 2 cores 1.03 s (±0.06 s) 1.04 s (±0.05 s) 0.99
Startup Time - 4 cores 0.85 s (±0.10 s) 1.00 s (±0.06 s) 0.86
multithreaded_benchmark Build Time 113.78 s 118.33 s 0.96
multithreaded_benchmark File Size 0.86 MB 0.87 MB 0.99
Multithreaded Pi Efficiency - 2 Threads 89.84 % (±15.27 %) 92.34 % (±11.70 %) 0.97
Multithreaded Pi Efficiency - 4 Threads 44.71 % (±7.25 %) 45.83 % (±5.87 %) 0.98
Multithreaded Pi Efficiency - 8 Threads 25.10 % (±3.77 %) 25.71 % (±3.50 %) 0.98
micro_benchmarks Build Time 91.92 s 96.08 s 0.96
micro_benchmarks File Size 0.86 MB 0.88 MB 0.99
Scheduling time - 1 thread 84.39 ticks (±3.50 ticks) 73.85 ticks (±3.74 ticks) 1.14
Scheduling time - 2 threads 46.44 ticks (±4.08 ticks) 41.46 ticks (±4.37 ticks) 1.12
Micro - Time for syscall (getpid) 3.11 ticks (±0.43 ticks) 3.08 ticks (±0.19 ticks) 1.01
Memcpy speed - (built_in) block size 4096 74528.23 MByte/s (±51550.28 MByte/s) 73208.48 MByte/s (±50616.82 MByte/s) 1.02
Memcpy speed - (built_in) block size 1048576 29435.39 MByte/s (±24156.07 MByte/s) 29392.18 MByte/s (±24125.47 MByte/s) 1.00
Memcpy speed - (built_in) block size 16777216 26248.27 MByte/s (±21612.65 MByte/s) 23751.74 MByte/s (±19605.56 MByte/s) 1.11
Memset speed - (built_in) block size 4096 74689.57 MByte/s (±51661.39 MByte/s) 73313.38 MByte/s (±50689.78 MByte/s) 1.02
Memset speed - (built_in) block size 1048576 30180.99 MByte/s (±24573.26 MByte/s) 30169.43 MByte/s (±24587.38 MByte/s) 1.00
Memset speed - (built_in) block size 16777216 26959.34 MByte/s (±22039.42 MByte/s) 24402.29 MByte/s (±20008.02 MByte/s) 1.10
Memcpy speed - (rust) block size 4096 66129.21 MByte/s (±46109.14 MByte/s) 66427.37 MByte/s (±46490.49 MByte/s) 1.00
Memcpy speed - (rust) block size 1048576 29519.67 MByte/s (±24267.87 MByte/s) 29425.11 MByte/s (±24175.79 MByte/s) 1.00
Memcpy speed - (rust) block size 16777216 26484.97 MByte/s (±21804.24 MByte/s) 24403.77 MByte/s (±20179.96 MByte/s) 1.09
Memset speed - (rust) block size 4096 66520.04 MByte/s (±46379.04 MByte/s) 66262.32 MByte/s (±46403.75 MByte/s) 1.00
Memset speed - (rust) block size 1048576 30279.23 MByte/s (±24693.83 MByte/s) 30172.11 MByte/s (±24599.55 MByte/s) 1.00
Memset speed - (rust) block size 16777216 27221.49 MByte/s (±22251.69 MByte/s) 25069.95 MByte/s (±20590.63 MByte/s) 1.09
alloc_benchmarks Build Time 91.97 s 91.43 s 1.01
alloc_benchmarks File Size 0.85 MB 0.85 MB 1.00
Allocations - Allocation success 100.00 % 100.00 % 1
Allocations - Deallocation success 100.00 % 100.00 % 1
Allocations - Pre-fail Allocations 100.00 % 100.00 % 1
Allocations - Average Allocation time 4647.33 Ticks (±57.72 Ticks) 4633.96 Ticks (±62.19 Ticks) 1.00
Allocations - Average Allocation time (no fail) 4647.33 Ticks (±57.72 Ticks) 4633.96 Ticks (±62.19 Ticks) 1.00
Allocations - Average Deallocation time 1198.13 Ticks (±109.06 Ticks) 683.28 Ticks (±99.25 Ticks) 1.75
mutex_benchmark Build Time 95.02 s 90.68 s 1.05
mutex_benchmark File Size 0.87 MB 0.88 MB 0.99
Mutex Stress Test Average Time per Iteration - 1 Threads 13.80 ns (±0.75 ns) 13.82 ns (±0.84 ns) 1.00
Mutex Stress Test Average Time per Iteration - 2 Threads 19.48 ns (±4.29 ns) 14.48 ns (±0.75 ns) 1.35

This comment was automatically generated by workflow using github-action-benchmark.

@zyuiop

zyuiop commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

I will note that after running a couple of times

cargo xtask ci rs --arch x86_64 --profile release  --package rusty_demo --features fs --features hermit/idle-poll --smp 4 qemu --accel

and

cargo xtask ci rs --arch x86_64 --profile release  --package rusty_demo --features fs --smp 4 qemu --accel

idle-poll seems to outperform the non idle-poll version by a factor around 2x. Maybe this flag should become the default again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants