
AccelProf/cgo26-ae

AE Instructions for PASTA (CGO'26)

Installation

  • Dependencies

    • PyTorch >= 2.0
    • CUDA >= 12.0
  • Download the codebase

    git clone --recursive https://github.com/AccelProf/AccelProf.git
    cd AccelProf && git checkout cgo26
    git submodule update --init --recursive
    
    # compile code
    make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
    
    # Set env, under AccelProf directory
    export ACCEL_PROF_HOME=$(pwd)
    export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
    
    # download AE package
    bash bin/setup_ae
  • Setup Artifact

    cd cgo26-ae
    # download benchmark
    bash ./bin/setup_artifact.sh
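Before running the experiments, it may help to confirm the dependency versions listed above (PyTorch >= 2.0, CUDA >= 12.0). The following is a minimal illustrative sketch, not part of the artifact; the helper name `meets_minimum` is ours.

```python
# Minimal dependency sanity check for the versions listed above
# (PyTorch >= 2.0, CUDA >= 12.0). Illustrative only; not part of the artifact.

def meets_minimum(version_string, minimum):
    """Compare a dotted version string against a minimum (major, minor) tuple."""
    parts = []
    for token in version_string.split("+")[0].split("."):  # drop local tags like "+cu121"
        if token.isdigit():
            parts.append(int(token))
        else:
            break
    return tuple(parts) >= minimum

assert meets_minimum("2.1.0+cu121", (2, 0))
assert not meets_minimum("1.13.1", (2, 0))

# On the evaluation machine one could check, e.g.:
# import torch
# assert meets_minimum(torch.__version__, (2, 0))
# assert meets_minimum(torch.version.cuda, (12, 0))
```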

Experiments

Figure 7

bash ./bin/run_figure_7.sh

The result is in results/figure_7/figure7.pdf.

We expect to see that some kernels (the top 20 are shown) are invoked repeatedly.

Table V

bash ./bin/run_table_v.sh

The result is in results/table_v/table_v.log.

We expect the memory footprint to be larger than the working set sizes.
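The distinction above can be made concrete with a toy computation (names and sizes below are hypothetical, not taken from the artifact): the memory footprint counts all bytes ever allocated, while the working set counts only the bytes a kernel actually touches.

```python
# Toy illustration of why the memory footprint (all bytes allocated) can
# exceed the working set size (bytes actually touched). Hypothetical data.

allocations = {"tensor_a": 4096, "tensor_b": 8192, "tensor_c": 2048}  # bytes
touched = {"tensor_a", "tensor_c"}  # tensors a kernel actually reads/writes

footprint = sum(allocations.values())
working_set = sum(size for name, size in allocations.items() if name in touched)

assert footprint == 14336
assert working_set == 6144
assert footprint > working_set  # the relationship Table V is expected to show
```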

Figure 9 & 10

Reproduction with pre-generated trace

We provide the traces we collected during our evaluation for your reference, under the pre-results folder.

Plot Figure 9
# Plot the result, use the pre-generate log
bash ./bin/plot_figure_9.sh pre-results/overhead-result

Result is in results/figure_9/.

We expect the GPU-accelerated version (CS-GPU) to be much faster than the CPU version (CS-CPU) and the NVBit version (NVBIT-CPU).

Plot Figure 10
# Plot the result, use the pre-generate log
bash ./bin/plot_figure_10.sh pre-results/overhead-breakdown

Result is in results/figure_10/.

We expect that the CPU versions (CS-CPU and NVBIT-CPU) spend a significant portion of execution time on analysis, whereas the GPU version (CS-GPU) incurs relatively little analysis overhead.

Reproduction with real run

Figures 9 and 10 require substantial time to complete. In our evaluation, for the CPU tools (CS-CPU and NVBIT-CPU) we set the sample rates ("30" "30" "50" "10" "10" "10") for ("alexnet" "resnet18" "resnet34" "bert" "gpt2" "whisper"), meaning we profile one kernel in each sample-rate window; for example, we profile one kernel for every 30 consecutive kernels in alexnet. Note that for GPU-accelerated profiling we do not set a sample rate, so the GPU profiles every kernel.

Even so, the experiments take days to complete, and NVBIT-CPU on whisper did not finish within one week.

In this script, we set the sample rates to ("60" "60" "100" "20" "20" "20"), twice as high as in our evaluation. We still expect CS-CPU and NVBIT-CPU to need substantial profiling time compared to GPU-accelerated profiling.
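The window-based sampling policy described above can be sketched as follows. This is our own illustrative reconstruction; the function name `should_profile` does not appear in the artifact.

```python
# Sketch of the window-based sampling policy described above: profile one
# kernel in every window of `rate` consecutive kernel launches. A rate of
# None (the GPU-accelerated mode) profiles every kernel.

def should_profile(kernel_index, rate=None):
    if rate is None:                    # GPU-accelerated profiling: no sampling
        return True
    return kernel_index % rate == 0     # first kernel of each window

# e.g. alexnet with rate 30 profiles kernels 0, 30, 60, ...
sampled = [i for i in range(90) if should_profile(i, rate=30)]
assert sampled == [0, 30, 60]
assert all(should_profile(i) for i in range(5))  # no rate set: every kernel
```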

Figure 10

We collect the data of Figure 10 first and reuse it for Figure 9, because the Figure 10 trace includes both the total profiling time and the profiling-time breakdown.

  • Checkout to the specific branch and rebuild the tool

    # go to the AccelProf folder
    cd ${ACCEL_PROF_HOME}
    cd nv-nvbit && git checkout oh-breakdown && cd ..
    cd nv-compute && git checkout oh-breakdown && cd ..
    # compile the code 
    make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
  • Collect the overhead data

    bash ./bin/run_figure_10.sh

Result is in results/figure_10/.

Figure 9

Figure 9 reuses the trace generated for Figure 10, which includes both the total profiling time and the profiling-time breakdown.

  • Collect the overhead data

    bash ./bin/run_figure_9.sh

Result is in results/figure_9/.

Figure 11 & 12

Figures 11 and 12 show a case study that demonstrates the capability of PASTA to capture tensor-level information, which we use to guide UVM prefetching. To avoid the UVM driver's internal prefetch behavior, we disable it by modifying the NVIDIA open GPU kernel modules (https://github.com/NVIDIA/open-gpu-kernel-modules).

Setup the machine

# https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2af9f1f0f7de4988432d4ae875b5858ffdb09cc2/kernel-open/nvidia-uvm/uvm_va_space.c#L235C5-L235C49
# change this line to false like the following
  va_space->test.page_prefetch_enabled = false;
# compile
make -j
# replace the uvm module
sudo rmmod nvidia_uvm
sudo insmod nvidia_uvm.ko
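After the rmmod/insmod step, one may want to confirm the rebuilt module is actually loaded. A small hedged helper (our own, not part of the artifact scripts) can parse the Linux /proc/modules listing:

```python
# Hypothetical helper to confirm the rebuilt nvidia_uvm module is loaded,
# by scanning /proc/modules (Linux). Not part of the artifact scripts.
import os

def module_loaded(name, modules_text):
    # Each /proc/modules line starts with the module name.
    return any(line.split()[0] == name
               for line in modules_text.splitlines() if line.strip())

assert module_loaded("nvidia_uvm", "nvidia_uvm 1234 2 -, Live\n")
assert not module_loaded("nvidia_uvm", "nvidia_drm 99 0\n")

if os.path.exists("/proc/modules"):      # only meaningful on the GPU machine
    with open("/proc/modules") as f:
        loaded = module_loaded("nvidia_uvm", f.read())
    # `loaded` should be True after the insmod step above
```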

Run experiments

  • Figure 11
# you can specify number of runs, we run 5 times and get the average
bash bin/run_figure_11.sh 5

Result is in results/figure_11/

We expect object-level and tensor-level prefetching to show little difference in UVM prefetch performance here.

  • Figure 12
# you can specify number of runs, we run 5 times and get the average
bash bin/run_figure_12.sh 5

Result is in results/figure_12/.

We expect object-level prefetching to perform even worse than tensor-level prefetching, highlighting the importance of tensor-aware prefetch.
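The intuition behind this expectation can be sketched with a toy byte count (numbers and names below are hypothetical, not artifact data): an allocation object may hold several tensors, but a kernel may need only some of them, so object-level prefetch can move far more data than necessary.

```python
# Toy comparison of prefetch granularity. An allocation object holds several
# tensors; the next kernel touches only some of them. Object-level prefetch
# moves the whole object; tensor-level prefetch moves only what is needed.
# Hypothetical numbers, not taken from the artifact.

object_tensors = {"t0": 1 << 20, "t1": 4 << 20, "t2": 2 << 20}  # bytes
needed = {"t1"}  # tensor the next kernel will actually touch

object_level_bytes = sum(object_tensors.values())
tensor_level_bytes = sum(object_tensors[t] for t in needed)

assert object_level_bytes == 7 << 20
assert tensor_level_bytes == 4 << 20
assert tensor_level_bytes <= object_level_bytes  # why tensor-awareness helps
```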

Figure 13

bash bin/run_figure_13.sh

Result is in results/figure_13/

We expect a heatmap of memory accesses with characteristics similar to the figure in the paper.

Figure 14

On AMD Server

  • Download the codebase

    git clone --recursive https://github.com/AccelProf/AccelProf.git
    cd AccelProf && git checkout cgo26
    git submodule update --init --recursive
    
    # compile code
    make ENABLE_ROCM=1
    
    # Set env, under AccelProf directory
    export ACCEL_PROF_HOME=$(pwd)
    export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
    
    # download AE package
    bash bin/setup_ae
  • Setup Artifact

    cd cgo26-ae
    # download benchmark
    bash ./bin/setup_artifact.sh
  • Run the experiment

    bash ./bin/run_figure_14_amd.sh

Result is in file results/figure_14/out_amd.log.

Please copy this log file to the NVIDIA GPU server and place it in the results/figure_14 folder.

On NVIDIA Server

  • Run the experiment

    bash ./bin/run_figure_14_nvidia.sh

    Result is in file results/figure_14/out_nvidia.log.

  • Plot Figure 14 (please copy the out_amd.log file generated on the AMD server to results/figure_14 before running the following script)

    bash ./bin/plot_figure_14.sh results/figure_14/

We expect a figure similar to Figure 14.

Figure 15

  • Download the codebase

    git clone --recursive https://github.com/AccelProf/AccelProf.git
    cd AccelProf && git checkout cgo26
    git submodule update --init --recursive
    
    # compile code
    make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
    
    # Set env, under AccelProf directory
    export ACCEL_PROF_HOME=$(pwd)
    export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
    
    # download AE package
    bash bin/setup_ae
  • Setup Artifact

    cd cgo26-ae
    # download benchmark
    bash ./bin/setup_artifact.sh
  • Run the experiment

    # Megatron is installed in path-to-megatron
    bash bin/run_figure_15.sh path-to-megatron

Result is in file results/figure_15/.

We expect figures similar to Figure 15.
