smp: always wake up core when ready#2468
Open
zyuiop wants to merge 1 commit into
Open
Conversation
This is a partial revert of commit f0bb0a0
1b60684 to
f4942c9
Compare
There was a problem hiding this comment.
Benchmark Results
Details
| Benchmark | Current: f4942c9 | Previous: 28c7089 | Performance Ratio |
|---|---|---|---|
| startup_benchmark Build Time | 113.60 s |
115.28 s |
0.99 ❗ |
| startup_benchmark File Size | 0.77 MB |
0.77 MB |
1.00 ❗ |
| Startup Time - 1 core | 1.00 s (±0.02 s) |
1.00 s (±0.03 s) |
1.01 |
| Startup Time - 2 cores | 1.03 s (±0.06 s) |
1.04 s (±0.05 s) |
0.99 |
| Startup Time - 4 cores | 0.85 s (±0.10 s) |
1.00 s (±0.06 s) |
0.86 ❗ |
| multithreaded_benchmark Build Time | 113.78 s |
118.33 s |
0.96 ❗ |
| multithreaded_benchmark File Size | 0.86 MB |
0.87 MB |
0.99 ❗ |
| Multithreaded Pi Efficiency - 2 Threads | 89.84 % (±15.27 %) |
92.34 % (±11.70 %) |
0.97 |
| Multithreaded Pi Efficiency - 4 Threads | 44.71 % (±7.25 %) |
45.83 % (±5.87 %) |
0.98 |
| Multithreaded Pi Efficiency - 8 Threads | 25.10 % (±3.77 %) |
25.71 % (±3.50 %) |
0.98 |
| micro_benchmarks Build Time | 91.92 s |
96.08 s |
0.96 ❗ |
| micro_benchmarks File Size | 0.86 MB |
0.88 MB |
0.99 ❗ |
| Scheduling time - 1 thread | 84.39 ticks (±3.50 ticks) |
73.85 ticks (±3.74 ticks) |
1.14 ❗ |
| Scheduling time - 2 threads | 46.44 ticks (±4.08 ticks) |
41.46 ticks (±4.37 ticks) |
1.12 |
| Micro - Time for syscall (getpid) | 3.11 ticks (±0.43 ticks) |
3.08 ticks (±0.19 ticks) |
1.01 |
| Memcpy speed - (built_in) block size 4096 | 74528.23 MByte/s (±51550.28 MByte/s) |
73208.48 MByte/s (±50616.82 MByte/s) |
1.02 |
| Memcpy speed - (built_in) block size 1048576 | 29435.39 MByte/s (±24156.07 MByte/s) |
29392.18 MByte/s (±24125.47 MByte/s) |
1.00 |
| Memcpy speed - (built_in) block size 16777216 | 26248.27 MByte/s (±21612.65 MByte/s) |
23751.74 MByte/s (±19605.56 MByte/s) |
1.11 |
| Memset speed - (built_in) block size 4096 | 74689.57 MByte/s (±51661.39 MByte/s) |
73313.38 MByte/s (±50689.78 MByte/s) |
1.02 |
| Memset speed - (built_in) block size 1048576 | 30180.99 MByte/s (±24573.26 MByte/s) |
30169.43 MByte/s (±24587.38 MByte/s) |
1.00 |
| Memset speed - (built_in) block size 16777216 | 26959.34 MByte/s (±22039.42 MByte/s) |
24402.29 MByte/s (±20008.02 MByte/s) |
1.10 |
| Memcpy speed - (rust) block size 4096 | 66129.21 MByte/s (±46109.14 MByte/s) |
66427.37 MByte/s (±46490.49 MByte/s) |
1.00 |
| Memcpy speed - (rust) block size 1048576 | 29519.67 MByte/s (±24267.87 MByte/s) |
29425.11 MByte/s (±24175.79 MByte/s) |
1.00 |
| Memcpy speed - (rust) block size 16777216 | 26484.97 MByte/s (±21804.24 MByte/s) |
24403.77 MByte/s (±20179.96 MByte/s) |
1.09 |
| Memset speed - (rust) block size 4096 | 66520.04 MByte/s (±46379.04 MByte/s) |
66262.32 MByte/s (±46403.75 MByte/s) |
1.00 |
| Memset speed - (rust) block size 1048576 | 30279.23 MByte/s (±24693.83 MByte/s) |
30172.11 MByte/s (±24599.55 MByte/s) |
1.00 |
| Memset speed - (rust) block size 16777216 | 27221.49 MByte/s (±22251.69 MByte/s) |
25069.95 MByte/s (±20590.63 MByte/s) |
1.09 |
| alloc_benchmarks Build Time | 91.97 s |
91.43 s |
1.01 ❗ |
| alloc_benchmarks File Size | 0.85 MB |
0.85 MB |
1.00 ❗ |
| Allocations - Allocation success | 100.00 % |
100.00 % |
1 |
| Allocations - Deallocation success | 100.00 % |
100.00 % |
1 |
| Allocations - Pre-fail Allocations | 100.00 % |
100.00 % |
1 |
| Allocations - Average Allocation time | 4647.33 Ticks (±57.72 Ticks) |
4633.96 Ticks (±62.19 Ticks) |
1.00 |
| Allocations - Average Allocation time (no fail) | 4647.33 Ticks (±57.72 Ticks) |
4633.96 Ticks (±62.19 Ticks) |
1.00 |
| Allocations - Average Deallocation time | 1198.13 Ticks (±109.06 Ticks) |
683.28 Ticks (±99.25 Ticks) |
1.75 ❗ |
| mutex_benchmark Build Time | 95.02 s |
90.68 s |
1.05 ❗ |
| mutex_benchmark File Size | 0.87 MB |
0.88 MB |
0.99 ❗ |
| Mutex Stress Test Average Time per Iteration - 1 Threads | 13.80 ns (±0.75 ns) |
13.82 ns (±0.84 ns) |
1.00 |
| Mutex Stress Test Average Time per Iteration - 2 Threads | 19.48 ns (±4.29 ns) |
14.48 ns (±0.75 ns) |
1.35 ❗ |
This comment was automatically generated by workflow using github-action-benchmark.
Contributor
Author
|
I will note that after running a couple of times and
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a partial revert of commit f0bb0a0.
This fixes a very tricky race condition that caused some cores to not be woken up, causing tests to fail (and weird bugs in the scheduling also)
Context
In some PRs,
cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 4 qemu --accel --sudo --devices virtio-fs-pcifails due to timeout.This bug is reproducible locally, and is even more frequent with higher core counts (e.g., try it with
cargo xtask ci rs --arch x86_64 --profile dev --package rusty_demo --features fs --smp 32 qemu --accel --sudo --devices virtio-fs-pcion main and you're almost certain to get the bug. You can similarly confirm that this patch solves the bug.The symptom is that the "Laplace" computation will get stuck. You can confirm this is not just the loop taking a long time by adding a small log in
computeinlaplace.rsYou will observe the counter go up, then suddenly stop and nothing happens. This is the bug.
I initially believed this was a futex problem (so I rewrote them partially, another PR is incoming), but it was actually something different.
The Bug (i think)
The thing that tipped me was that the bug was absent with the
idle-loopflag. This greatly reduce the area that needed to be searched, since this flag is only present in a couple of places.The bug is caused by a race condition when a core B wants to wake up a core A, which is going into halt.
The relevant call points are these:
PerCoreScheduler::run. This is the idle task. It uses a backoff, checking periodically the core inbox to see if there is work. If there is none and the backoff is exhausted, it goes to sleep, callingenable_and_wait.wakeup_core. This is called by another CPU, after adding a task to the inbox. It sends an interrupt to the other core, to wake it up.f0bb0a0 introduced a boolean for each core that indicates if the core is halting. If this boolean is set to
false,wakeup_coreskips the interrupt and returns immediately.This works most of the time because
PerCoreScheduler::runwill always pick up tasks when some are available.However, there is a possible race condition. Imagine the two following cores:
I hope you appreciate my artistic talents, but that's the gist.
The fix
Simply a partial revert of f0bb0a0.
I was considering changing the boolean into a state machine, with a "don't sleep please" variant that could be sent by
wakeup_coreif the halt state was false, to indicate to a core that was about to go to sleep that it should not.I'm not sure it was working well because I was getting weird values in my
compare_and_exchangecalls, but this may be a viable alternative to reduce the number of interrupts we send.Notes
Increasing the core count increases the likelihood of the race condition happening. Laplace appears to be a good test-case because all threads are waiting on each other, so if ONE thread does not wake up, the program remains stuck.
It is possible that this patch will also solve other issues related to the scheduler suddenly becoming unresponsive until you cause an interrupt, but I have not tested that yet.
Fixes #2317