
Fix KA global linear index #720

Open: christiangnrd wants to merge 2 commits into main from fixglob

@christiangnrd christiangnrd commented Jul 1, 2026

Member

(do not squash)
It's like this in CUDA, and I fixed it in Metal. I don't know why I got it wrong in the KI PR.

github-actions Bot commented Jul 1, 2026

Contributor

Benchmark Results

| Benchmark | main | 541f205... | main / 541f205... |
|---|---|---|---|
| saxpy/default/Float32/1024 | 0.0503 ± 0.024 ms | 0.0483 ± 0.024 ms | 1.04 ± 0.73 |
| saxpy/default/Float32/1048576 | 0.193 ± 0.017 ms | 0.201 ± 0.02 ms | 0.96 ± 0.13 |
| saxpy/default/Float32/16384 | 0.0531 ± 0.022 ms | 0.0518 ± 0.021 ms | 1.03 ± 0.59 |
| saxpy/default/Float32/2048 | 0.0516 ± 0.02 ms | 0.0467 ± 0.023 ms | 1.1 ± 0.69 |
| saxpy/default/Float32/256 | 0.0539 ± 0.022 ms | 0.0457 ± 0.027 ms | 1.18 ± 0.84 |
| saxpy/default/Float32/262144 | 0.0916 ± 0.022 ms | 0.0903 ± 0.024 ms | 1.01 ± 0.37 |
| saxpy/default/Float32/32768 | 0.0578 ± 0.023 ms | 0.0535 ± 0.023 ms | 1.08 ± 0.63 |
| saxpy/default/Float32/4096 | 0.0523 ± 0.02 ms | 0.0511 ± 0.021 ms | 1.02 ± 0.56 |
| saxpy/default/Float32/512 | 0.0538 ± 0.022 ms | 0.0479 ± 0.026 ms | 1.12 ± 0.76 |
| saxpy/default/Float32/64 | 0.0497 ± 0.027 ms | 0.0452 ± 0.027 ms | 1.1 ± 0.89 |
| saxpy/default/Float32/65536 | 0.0633 ± 0.023 ms | 0.0573 ± 0.023 ms | 1.1 ± 0.6 |
| saxpy/default/Float64/1024 | 0.0497 ± 0.022 ms | 0.0432 ± 0.026 ms | 1.15 ± 0.85 |
| saxpy/default/Float64/1048576 | 0.277 ± 0.026 ms | 0.268 ± 0.022 ms | 1.03 ± 0.13 |
| saxpy/default/Float64/16384 | 0.0535 ± 0.022 ms | 0.0518 ± 0.022 ms | 1.03 ± 0.61 |
| saxpy/default/Float64/2048 | 0.054 ± 0.018 ms | 0.0489 ± 0.023 ms | 1.11 ± 0.64 |
| saxpy/default/Float64/256 | 0.0533 ± 0.023 ms | 0.0471 ± 0.026 ms | 1.13 ± 0.79 |
| saxpy/default/Float64/262144 | 0.104 ± 0.018 ms | 0.102 ± 0.019 ms | 1.02 ± 0.26 |
| saxpy/default/Float64/32768 | 0.0588 ± 0.022 ms | 0.0556 ± 0.022 ms | 1.06 ± 0.58 |
| saxpy/default/Float64/4096 | 0.0482 ± 0.021 ms | 0.0467 ± 0.022 ms | 1.03 ± 0.67 |
| saxpy/default/Float64/512 | 0.0535 ± 0.02 ms | 0.0472 ± 0.027 ms | 1.13 ± 0.77 |
| saxpy/default/Float64/64 | 0.0513 ± 0.026 ms | 0.0442 ± 0.027 ms | 1.16 ± 0.92 |
| saxpy/default/Float64/65536 | 0.0671 ± 0.023 ms | 0.062 ± 0.022 ms | 1.08 ± 0.53 |
| saxpy/static workgroup=(1024,)/Float32/1024 | 0.0509 ± 0.023 ms | 0.0465 ± 0.025 ms | 1.09 ± 0.77 |
| saxpy/static workgroup=(1024,)/Float32/1048576 | 0.186 ± 0.016 ms | 0.706 ± 0.015 ms | 0.264 ± 0.023 |
| saxpy/static workgroup=(1024,)/Float32/16384 | 0.0514 ± 0.022 ms | 0.0607 ± 0.022 ms | 0.847 ± 0.47 |
| saxpy/static workgroup=(1024,)/Float32/2048 | 0.0495 ± 0.02 ms | 0.0497 ± 0.024 ms | 0.996 ± 0.63 |
| saxpy/static workgroup=(1024,)/Float32/256 | 0.0507 ± 0.023 ms | 0.0473 ± 0.026 ms | 1.07 ± 0.76 |
| saxpy/static workgroup=(1024,)/Float32/262144 | 0.0885 ± 0.022 ms | 0.219 ± 0.018 ms | 0.405 ± 0.11 |
| saxpy/static workgroup=(1024,)/Float32/32768 | 0.0549 ± 0.022 ms | 0.0686 ± 0.023 ms | 0.799 ± 0.42 |
| saxpy/static workgroup=(1024,)/Float32/4096 | 0.0534 ± 0.015 ms | 0.0528 ± 0.021 ms | 1.01 ± 0.49 |
| saxpy/static workgroup=(1024,)/Float32/512 | 0.0503 ± 0.024 ms | 0.0472 ± 0.025 ms | 1.06 ± 0.75 |
| saxpy/static workgroup=(1024,)/Float32/64 | 0.0495 ± 0.026 ms | 0.0476 ± 0.024 ms | 1.04 ± 0.75 |
| saxpy/static workgroup=(1024,)/Float32/65536 | 0.0595 ± 0.023 ms | 0.091 ± 0.022 ms | 0.654 ± 0.3 |
| saxpy/static workgroup=(1024,)/Float64/1024 | 0.0464 ± 0.024 ms | 0.0459 ± 0.024 ms | 1.01 ± 0.73 |
| saxpy/static workgroup=(1024,)/Float64/1048576 | 0.272 ± 0.024 ms | 0.746 ± 0.055 ms | 0.365 ± 0.042 |
| saxpy/static workgroup=(1024,)/Float64/16384 | 0.0526 ± 0.022 ms | 0.0614 ± 0.022 ms | 0.856 ± 0.47 |
| saxpy/static workgroup=(1024,)/Float64/2048 | 0.05 ± 0.021 ms | 0.0512 ± 0.024 ms | 0.977 ± 0.61 |
| saxpy/static workgroup=(1024,)/Float64/256 | 0.0486 ± 0.024 ms | 0.0482 ± 0.024 ms | 1.01 ± 0.72 |
| saxpy/static workgroup=(1024,)/Float64/262144 | 0.103 ± 0.018 ms | 0.225 ± 0.019 ms | 0.458 ± 0.09 |
| saxpy/static workgroup=(1024,)/Float64/32768 | 0.0567 ± 0.021 ms | 0.0733 ± 0.022 ms | 0.773 ± 0.37 |
| saxpy/static workgroup=(1024,)/Float64/4096 | 0.0486 ± 0.021 ms | 0.052 ± 0.022 ms | 0.935 ± 0.57 |
| saxpy/static workgroup=(1024,)/Float64/512 | 0.0503 ± 0.02 ms | 0.0484 ± 0.025 ms | 1.04 ± 0.68 |
| saxpy/static workgroup=(1024,)/Float64/64 | 0.0489 ± 0.025 ms | 0.0481 ± 0.024 ms | 1.02 ± 0.73 |
| saxpy/static workgroup=(1024,)/Float64/65536 | 0.0638 ± 0.022 ms | 0.0958 ± 0.021 ms | 0.666 ± 0.27 |
| time_to_load | 0.938 ± 0.015 s | 0.951 ± 0.031 s | 0.986 ± 0.036 |
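
For reading the last column: the values look consistent with a plain ratio of the two timings, with the uncertainty propagated to first order from the two ± values. That is an inference from the numbers, not something the bot states, and the helper name `ratio_with_uncertainty` below is hypothetical. A minimal Python sketch under that assumption, checked against the `saxpy/static workgroup=(1024,)/Float32/1048576` row:

```python
import math


def ratio_with_uncertainty(a, sigma_a, b, sigma_b):
    """Return (a / b, propagated uncertainty).

    Assumes independent errors and first-order propagation:
    relative errors add in quadrature. This is a guess at how the
    ratio column could be computed, not the bot's confirmed method.
    """
    ratio = a / b
    rel = math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return ratio, ratio * rel


# Timings in ms taken from the row named above: main vs. 541f205...
r, s = ratio_with_uncertainty(0.186, 0.016, 0.706, 0.015)
print(f"{r:.3f} ± {s:.3f}")
```

Under this reading, a ratio below 1 means the PR commit took longer than main for that benchmark. Rows whose uncertainty is close to the ratio itself (most of the small problem sizes) cannot distinguish the two builds.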

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@christiangnrd force-pushed the fixglob branch 3 times, most recently from e6c7304 to 2aac22a (July 1, 2026 22:36)
@christiangnrd requested a review from vchuravy (July 1, 2026 23:28)
