- Dependencies
  - PyTorch >= 2.0
  - CUDA >= 12.0
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
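As a quick sanity check (a minimal sketch; it assumes only the exports above), you can confirm the environment is set before continuing:

```bash
# Confirm ACCEL_PROF_HOME points at the checkout and bin/ exists
echo "ACCEL_PROF_HOME=${ACCEL_PROF_HOME}"
test -d "${ACCEL_PROF_HOME}/bin" && echo "bin/ found" || echo "bin/ missing"
```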
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
```bash
bash ./bin/run_figure_7.sh
```

The result is in `results/figure_7/figure7.pdf`.
We expect to see that some kernels (the top 20 are shown) are invoked repeatedly.
```bash
bash ./bin/run_table_v.sh
```

The result is in `results/table_v/table_v.log`.
We expect the memory footprint to be larger than the working set sizes, i.e., more memory is allocated than the kernels actually touch.
For your reference, we provide the traces collected during our evaluation under the pre-results folder.
```bash
# Plot the result, using the pre-generated log
bash ./bin/plot_figure_9.sh pre-results/overhead-result
```

Result is in `results/figure_9/`.
We expect the GPU-accelerated version (CS-GPU) to be much faster than the CPU version (CS-CPU) and the NVBit version (NVBIT-CPU).
```bash
# Plot the result, using the pre-generated log
bash ./bin/plot_figure_10.sh pre-results/overhead-breakdown
```

Result is in `results/figure_10/`.
We expect that the CPU versions (CS-CPU and NVBIT-CPU) spend a significant portion of execution time on analysis, whereas the GPU version (CS-GPU) incurs relatively little analysis overhead.
Figures 9 and 10 require substantial time to complete. In our evaluation, for the CPU versions (CS-CPU and NVBIT-CPU) we set the sample rates ("30" "30" "50" "10" "10" "10") for ("alexnet" "resnet18" "resnet34" "bert" "gpt2" "whisper"), which means we profile one kernel per sample-rate window; for example, we profile one kernel out of every 30 consecutive kernels in alexnet. Note that for GPU-accelerated profiling we do not set a sample rate, so the GPU profiles every kernel.
Even so, the experiments take days to complete, and NVBIT-CPU on whisper did not finish within one week.
In this script we therefore set the sample rates to ("60" "60" "100" "20" "20" "20"), twice as high as in our evaluation. We still expect CS-CPU and NVBIT-CPU to need substantially more profiling time than GPU-accelerated profiling.
We collect the Figure 10 data for both Figures 9 and 10, because the Figure 10 trace includes both the total profiling time and the profiling-time breakdown. The sampling scheme is illustrated in the sketch below.
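To make the windowing concrete, here is a minimal bash sketch of the per-benchmark sample rates used by this script (illustrative only; it does not invoke the actual AccelProf sampling logic):

```bash
# Illustrative only: pair each benchmark with its sample-rate window.
benchmarks=("alexnet" "resnet18" "resnet34" "bert" "gpt2" "whisper")
rates=(60 60 100 20 20 20)
for i in "${!benchmarks[@]}"; do
  # With rate r, one kernel out of every r consecutive launches is profiled.
  echo "${benchmarks[$i]}: profile 1 of every ${rates[$i]} consecutive kernels"
done
```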
- Check out the specific branch and rebuild the tool

```bash
# go to the AccelProf folder
cd ${ACCEL_PROF_HOME}
cd nv-nvbit && git checkout oh-breakdown && cd ..
cd nv-compute && git checkout oh-breakdown && cd ..
# compile the code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
```
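As an optional check (plain git commands, nothing AccelProf-specific), you can confirm both submodules are on the expected branch:

```bash
# Both commands should print "oh-breakdown"
git -C ${ACCEL_PROF_HOME}/nv-nvbit rev-parse --abbrev-ref HEAD
git -C ${ACCEL_PROF_HOME}/nv-compute rev-parse --abbrev-ref HEAD
```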
- Collect the overhead data

```bash
bash ./bin/run_figure_10.sh
```

Result is in `results/figure_10/`.
Figure 9 reuses the trace generated for Figure 10, which includes both the total profiling time and the profiling-time breakdown.
- Collect the overhead data

```bash
bash ./bin/run_figure_9.sh
```

Result is in `results/figure_9/`.
Figures 11 and 12 show the case study that demonstrates PASTA's capability to capture tensor-level information. We use this information to guide UVM prefetching. To avoid the UVM driver's internal prefetch behavior, we disable it by modifying the NVIDIA open GPU kernel modules (https://github.com/NVIDIA/open-gpu-kernel-modules).
```bash
# https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2af9f1f0f7de4988432d4ae875b5858ffdb09cc2/kernel-open/nvidia-uvm/uvm_va_space.c#L235C5-L235C49
# change this line to false like the following:
#   va_space->test.page_prefetch_enabled = false;
# compile
make -j
# replace the uvm module
sudo rmmod nvidia_uvm
sudo insmod nvidia_uvm.ko
```
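A quick way to confirm the patched module is in place (standard Linux commands; output varies by driver version):

```bash
# nvidia_uvm should be listed after insmod succeeds
lsmod | grep nvidia_uvm
```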
- Figure 11

```bash
# you can specify the number of runs; we run 5 times and take the average
bash bin/run_figure_11.sh 5
```

Result is in `results/figure_11/`.
We expect object-level and tensor-level prefetching to show little difference in UVM prefetch performance.
- Figure 12

```bash
# you can specify the number of runs; we run 5 times and take the average
bash bin/run_figure_12.sh 5
```

Result is in `results/figure_12/`.
We expect object-level prefetching to perform even worse than tensor-level prefetching, highlighting the importance of tensor-aware prefetch. A sketch of the run averaging follows below.
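Both scripts average over the requested number of runs. As a minimal, illustrative sketch of such averaging, assuming each run appends its elapsed seconds to a plain text file (the file name times.txt is hypothetical):

```bash
# Average one timing value per line; prints nothing for an empty file.
awk '{ sum += $1 } END { if (NR) printf "avg over %d runs: %.3f s\n", NR, sum / NR }' times.txt
```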
- Figure 13

```bash
bash bin/run_figure_13.sh
```

Result is in `results/figure_13/`.
We expect a heatmap of memory accesses with characteristics similar to the figure in the paper.
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_ROCM=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
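If the build or the runs misbehave, it can help to confirm the ROCm stack sees the GPU (rocminfo ships with ROCm; the grep pattern is just a convenience):

```bash
# List the detected AMD GPU architecture(s), e.g. gfx90a
rocminfo | grep -i gfx
```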
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
- Run the experiment

```bash
bash ./bin/run_figure_14_amd.sh
```

The result is in `results/figure_14/out_amd.log`.
Please copy this log file to the NVIDIA GPU server and place it in the `results/figure_14` folder (a copy sketch follows below).
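A minimal sketch of that copy step (the hostname and remote path are placeholders for your own setup):

```bash
# Run on the AMD server, from the cgo26-ae directory
scp results/figure_14/out_amd.log user@nvidia-server:/path/to/cgo26-ae/results/figure_14/
```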
- Run the experiment

```bash
bash ./bin/run_figure_14_nvidia.sh
```

The result is in `results/figure_14/out_nvidia.log`.
- Plot Figure 14 (please copy the out_amd.log file generated on the AMD server into `results/figure_14` before running the following script)

```bash
bash ./bin/plot_figure_14.sh results/figure_14/
```
We expect a figure similar to Figure 14.
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
- Run the experiment

```bash
# Megatron is installed in path-to-megatron
bash bin/run_figure_15.sh path-to-megatron
```

Results are in `results/figure_15/`.
We expect figures similar to Figure 15.
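For example, if Megatron-LM were checked out under /opt/Megatron-LM (a hypothetical location), the invocation would be:

```bash
# /opt/Megatron-LM is a placeholder; substitute your own Megatron-LM checkout
bash bin/run_figure_15.sh /opt/Megatron-LM
```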