- Dependencies
  - PyTorch >= 2.0
  - CUDA >= 12.0
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
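As a quick sanity check (a minimal sketch; it assumes only the exports above), you can confirm the environment is set before continuing:

```bash
# Confirm ACCEL_PROF_HOME points at the checkout and bin/ exists
echo "ACCEL_PROF_HOME=${ACCEL_PROF_HOME}"
test -d "${ACCEL_PROF_HOME}/bin" && echo "bin/ found" || echo "bin/ missing"
```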
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
```bash
bash ./bin/run_figure_7.sh
```

The result is in `results/figure_7/figure7.pdf`.
We expect to see that some kernels (the top 20 are shown) are invoked repeatedly.
```bash
bash ./bin/run_table_v.sh
```

The result is in `results/table_v/table_v.log`.
We expect the memory footprint to be larger than the working set sizes, i.e., more memory is allocated than the kernels actually touch.
For your reference, we provide the traces collected during our evaluation under the pre-results folder.
```bash
# Plot the result, using the pre-generated log
bash ./bin/plot_figure_9.sh pre-results/overhead-result
```

Result is in `results/figure_9/`.
We expect the GPU-accelerated version (CS-GPU) to be much faster than the CPU version (CS-CPU) and the NVBit version (NVBIT-CPU).
```bash
# Plot the result, using the pre-generated log
bash ./bin/plot_figure_10.sh pre-results/overhead-breakdown
```

Result is in `results/figure_10/`.
We expect that the CPU versions (CS-CPU and NVBIT-CPU) spend a significant portion of execution time on analysis, whereas the GPU version (CS-GPU) incurs relatively little analysis overhead.
Figures 9 and 10 require substantial time to complete. In our evaluation, for the CPU versions (CS-CPU and NVBIT-CPU) we set the sample rates ("30" "30" "50" "10" "10" "10") for ("alexnet" "resnet18" "resnet34" "bert" "gpt2" "whisper"), which means we profile one kernel per sample-rate window; for example, we profile one kernel out of every 30 consecutive kernels in alexnet. Note that for GPU-accelerated profiling we do not set a sample rate, so the GPU profiles every kernel.
Even so, the experiments take days to complete, and NVBIT-CPU on whisper did not finish within one week.
In this script we therefore set the sample rates to ("60" "60" "100" "20" "20" "20"), twice as high as in our evaluation. We still expect CS-CPU and NVBIT-CPU to need substantially more profiling time than GPU-accelerated profiling.
We collect the Figure 10 data for both Figures 9 and 10, because the Figure 10 trace includes both the total profiling time and the profiling-time breakdown. The sampling scheme is illustrated in the sketch below.
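To make the windowing concrete, here is a minimal bash sketch of the per-benchmark sample rates used by this script (illustrative only; it does not invoke the actual AccelProf sampling logic):

```bash
# Illustrative only: pair each benchmark with its sample-rate window.
benchmarks=("alexnet" "resnet18" "resnet34" "bert" "gpt2" "whisper")
rates=(60 60 100 20 20 20)
for i in "${!benchmarks[@]}"; do
  # With rate r, one kernel out of every r consecutive launches is profiled.
  echo "${benchmarks[$i]}: profile 1 of every ${rates[$i]} consecutive kernels"
done
```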
- Check out the specific branch and rebuild the tool

```bash
# go to the AccelProf folder
cd ${ACCEL_PROF_HOME}
cd nv-nvbit && git checkout oh-breakdown && cd ..
cd nv-compute && git checkout oh-breakdown && cd ..
# compile the code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
```
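As an optional check (plain git commands, nothing AccelProf-specific), you can confirm both submodules are on the expected branch:

```bash
# Both commands should print "oh-breakdown"
git -C ${ACCEL_PROF_HOME}/nv-nvbit rev-parse --abbrev-ref HEAD
git -C ${ACCEL_PROF_HOME}/nv-compute rev-parse --abbrev-ref HEAD
```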
- Collect the overhead data

```bash
bash ./bin/run_figure_10.sh
```

Result is in `results/figure_10/`.
Figure 9 reuses the trace generated for Figure 10, which includes both the total profiling time and the profiling-time breakdown.
- Collect the overhead data

```bash
bash ./bin/run_figure_9.sh
```

Result is in `results/figure_9/`.
Figures 11 and 12 show the case study that demonstrates PASTA's capability to capture tensor-level information. We use this information to guide UVM prefetching. To avoid the UVM driver's internal prefetch behavior, we disable it by modifying the NVIDIA open GPU kernel modules (https://github.com/NVIDIA/open-gpu-kernel-modules).
```bash
# https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2af9f1f0f7de4988432d4ae875b5858ffdb09cc2/kernel-open/nvidia-uvm/uvm_va_space.c#L235C5-L235C49
# change this line to false like the following:
#   va_space->test.page_prefetch_enabled = false;
# compile
make -j
# replace the uvm module
sudo rmmod nvidia_uvm
sudo insmod nvidia_uvm.ko
```
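A quick way to confirm the patched module is in place (standard Linux commands; output varies by driver version):

```bash
# nvidia_uvm should be listed after insmod succeeds
lsmod | grep nvidia_uvm
```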
- Figure 11

```bash
# you can specify the number of runs; we run 5 times and take the average
bash bin/run_figure_11.sh 5
```

Result is in `results/figure_11/`.
We expect object-level and tensor-level prefetching to show little difference in UVM prefetch performance.
- Figure 12

```bash
# you can specify the number of runs; we run 5 times and take the average
bash bin/run_figure_12.sh 5
```

Result is in `results/figure_12/`.
We expect object-level prefetching to perform even worse than tensor-level prefetching, highlighting the importance of tensor-aware prefetch. A sketch of the run averaging follows below.
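Both scripts average over the requested number of runs. As a minimal, illustrative sketch of such averaging, assuming each run appends its elapsed seconds to a plain text file (the file name times.txt is hypothetical):

```bash
# Average one timing value per line; prints nothing for an empty file.
awk '{ sum += $1 } END { if (NR) printf "avg over %d runs: %.3f s\n", NR, sum / NR }' times.txt
```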
- Figure 13

```bash
bash bin/run_figure_13.sh
```

Result is in `results/figure_13/`.
We expect a heatmap of memory accesses with characteristics similar to the figure in the paper.
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_ROCM=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
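If the build or the runs misbehave, it can help to confirm the ROCm stack sees the GPU (rocminfo ships with ROCm; the grep pattern is just a convenience):

```bash
# List the detected AMD GPU architecture(s), e.g. gfx90a
rocminfo | grep -i gfx
```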
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
- Run the experiment

```bash
bash ./bin/run_figure_14_amd.sh
```

The result is in `results/figure_14/out_amd.log`.
Please copy this log file to the NVIDIA GPU server and place it in the `results/figure_14` folder (a copy sketch follows below).
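A minimal sketch of that copy step (the hostname and remote path are placeholders for your own setup):

```bash
# Run on the AMD server, from the cgo26-ae directory
scp results/figure_14/out_amd.log user@nvidia-server:/path/to/cgo26-ae/results/figure_14/
```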
- Run the experiment

```bash
bash ./bin/run_figure_14_nvidia.sh
```

The result is in `results/figure_14/out_nvidia.log`.
- Plot Figure 14 (please copy the out_amd.log file generated on the AMD server into `results/figure_14` before running the following script)

```bash
bash ./bin/plot_figure_14.sh results/figure_14/
```
We expect a figure similar to Figure 14.
- Download the codebase

```bash
git clone --recursive https://github.com/AccelProf/AccelProf.git
cd AccelProf && git checkout cgo26
git submodule update --init --recursive
# compile code
make ENABLE_CS=1 ENABLE_NVBIT=1 ENABLE_TORCH=1
# Set env, under AccelProf directory
export ACCEL_PROF_HOME=$(pwd)
export PATH=${ACCEL_PROF_HOME}/bin:${PATH}
# download AE package
bash bin/setup_ae
```
- Setup Artifact

```bash
cd cgo26-ae
# download benchmark
bash ./bin/setup_artifact.sh
```
- Run the experiment

```bash
# Megatron is installed in path-to-megatron
bash bin/run_figure_15.sh path-to-megatron
```

Results are in `results/figure_15/`.
We expect figures similar to Figure 15.
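For example, if Megatron-LM were checked out under /opt/Megatron-LM (a hypothetical location), the invocation would be:

```bash
# /opt/Megatron-LM is a placeholder; substitute your own Megatron-LM checkout
bash bin/run_figure_15.sh /opt/Megatron-LM
```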