Scripts to run DeepSeek-R1 (distilled versions: Qwen-32B or Llama-70B) with vLLM using 4 GPUs on Puhti, Mahti, or LUMI. The Roihu examples use Qwen3-32B instead, since the installed vLLM version had compatibility issues with the DeepSeek model. There is also a script to run on Roihu using two full nodes (8 GPUs), and finally a script to run the full DeepSeek-R1-0528 model on two full LUMI nodes (16 GPUs).
run-vllm-puhti4.sh (deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
run-vllm-mahti4.sh (deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
run-vllm-roihu4.sh (Qwen/Qwen3-32B)
run-vllm-roihu8.sh (Qwen/Qwen3-32B)
run-vllm-lumi4.sh (deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
run-vllm-lumi16.sh (deepseek-ai/DeepSeek-R1-0528)
Note: all scripts are Slurm batch job scripts and need to be submitted with sbatch, for example:
sbatch run-vllm-lumi4.sh

The LUMI and Roihu scripts start the vLLM server listening on a Unix Domain Socket, represented by a file on the filesystem (by default $TMPDIR/vllm-<slurm_job_id>.sock), rather than opening a network port on the node. This is done for security reasons, and it also has the advantage that we cannot conflict with other processes that might already be using the same port.
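Because vLLM first has to load the model, the socket file typically appears only once the server is ready to accept requests. If you script against the server, a small helper like the following can be used to wait for the socket before connecting. This is only a sketch: it assumes it is run inside the job allocation (e.g. via srun --overlap, as shown below) and that the socket path matches the default used by the scripts.

import os
import sys
import time

def wait_for_socket(path, timeout=1800.0, poll=10.0):
    # Block until the vLLM Unix Domain Socket file appears, or raise on timeout
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() > deadline:
            raise TimeoutError(f"vLLM socket {path} did not appear within {timeout} s")
        time.sleep(poll)

if __name__ == "__main__":
    # Assumed default socket path: $TMPDIR/vllm-<slurm_job_id>.sock
    sock = sys.argv[1] if len(sys.argv) > 1 else os.path.join(
        os.environ.get("TMPDIR", "/tmp"),
        f"vllm-{os.environ.get('SLURM_JOB_ID', '')}.sock")
    wait_for_socket(sock)
    print(f"Socket {sock} is available")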
While the job is running, you can connect to the vLLM server from a process running on the same node. For example, the following opens a terminal on the node running vLLM and sends a request via the cURL command line tool:
On LUMI with DeepSeek-R1-Distill-Qwen-32B
username@login-node$ srun --overlap --jobid <slurm-job-id> --pty bash
username@compute-node$ curl --unix-socket $TMPDIR/vllm-$SLURM_JOB_ID.sock http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"prompt": "Running vLLM on a supercomputer is",
"max_tokens": 100,
"temperature": 0.5,
"stream": false
}'

On Roihu with Qwen3-32B
username@login-node$ srun --overlap --jobid <slurm-job-id> --pty bash
username@compute-node$ curl --unix-socket $TMPDIR/vllm-$SLURM_JOB_ID.sock http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"prompt": "Running vLLM on a supercomputer is",
"max_tokens": 100,
"temperature": 0.5,
"stream": false
}'

You can also use, e.g., the OpenAI client to interact programmatically with the vLLM server from Python:
import argparse

import httpx
import openai

# The client talks to the vLLM server over a Unix Domain Socket instead of TCP,
# so we give the OpenAI client an httpx transport bound to the socket file.
parser = argparse.ArgumentParser()
parser.add_argument("socket_file", type=str)
args = parser.parse_args()

transport = httpx.HTTPTransport(uds=args.socket_file)
httpx_client = httpx.Client(transport=transport)

client = openai.OpenAI(
    api_key='',                       # no API key needed when connecting over the socket
    base_url='http://localhost/v1',   # host part is ignored, requests go over the socket
    http_client=httpx_client
)

prompt = "Running vLLM on a supercomputer is "
print(prompt, end="")

# Stream the completion and print tokens as they arrive
for chunk in client.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    prompt=prompt,
    max_tokens=100,
    temperature=0.5,
    stream=True
):
    print(chunk.choices[0].text, end="")

You can run the script as follows.
On LUMI
username@login-node$ srun --overlap --jobid <slurm-job-id> --pty bash
username@compute-node$ singularity run -B /pfs,/scratch,/projappl /appl/local/laifs/containers/lumi-multitorch-latest.sif python vllm_client.py $TMPDIR/vllm-$SLURM_JOB_ID.sock

On Roihu
username@login-node$ srun --overlap --jobid <slurm-job-id> --pty bash
username@compute-node$ module load python-vllm/0.19
username@compute-node$ python vllm_client.py $TMPDIR/vllm-$SLURM_JOB_ID.sock

Note: the script uses the DeepSeek model by default; on Roihu you need to change the model name in the script to Qwen/Qwen3-32B.
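If you prefer not to edit vllm_client.py by hand for each system, one option is to make the model name a command-line argument. The sketch below shows one way to do this; the --model option and its default are assumptions for illustration, not part of the provided scripts.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("socket_file", type=str)
# Hypothetical extra option so the same client script works on LUMI and Roihu
parser.add_argument("--model", type=str,
                    default="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
args = parser.parse_args()

# ... then pass model=args.model to client.completions.create(...)

On Roihu you could then run, for example: python vllm_client.py $TMPDIR/vllm-$SLURM_JOB_ID.sock --model Qwen/Qwen3-32B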
The version of vLLM installed on Puhti and Mahti does not currently support Unix Domain Sockets, so instead we configure it to require authentication with an API key, which is generated in the sbatch script. You can find the key in the job log. The following opens a terminal on the node running vLLM and sends a request via the cURL command line tool:
username@login-node$ srun --overlap --jobid <slurm-job-id> --pty bash
username@compute-node$ curl http://localhost:8000/v1/completions \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
"prompt": "Running vLLM on a supercomputer is",
"max_tokens": 100,
"temperature": 0.5,
"stream": false
}'
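The OpenAI client works here as well; since Puhti and Mahti expose a local TCP port protected by an API key rather than a socket file, only the client setup changes. A minimal sketch, assuming the key copied from the job log and the port 8000 used in the cURL example above:

import openai

# On Puhti and Mahti the server listens on a local TCP port and requires the
# API key printed in the job log, instead of using a Unix Domain Socket.
client = openai.OpenAI(
    api_key="<api-key>",  # copy from the job log
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    prompt="Running vLLM on a supercomputer is",
    max_tokens=100,
    temperature=0.5,
)
print(completion.choices[0].text)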
Scripts to run with Ollama:

Note: all scripts are Slurm batch job scripts and need to be submitted with sbatch, for example:
sbatch run-ollama-puhti4.sh

Note: the Ollama scripts don't seem to use all of the reserved GPUs; the scripts are probably reserving too many.