A reusable marketplace workflow for executing scripts on HPC and cloud compute resources via SSH, PBS, or SLURM. Job Runner abstracts away the complexity of job submission, monitoring, and cleanup, allowing you to focus on your actual workload.
Note: The marketplace name is `job_runner` (with underscore) for compatibility with ACTIVATE market naming requirements.
This workflow provides a flexible way to run user-defined scripts on cloud compute and on-prem resources using SSH, PBS, or SLURM. The user supplies a script and (when applicable) configures scheduler directives directly through the workflow UI. Based on these selections, the workflow automatically generates a fully populated job script—including the shebang, run directory, scheduler options, and user script—and executes or submits it on the target system.
| Version | File | Description |
|---|---|---|
| v3.5 | v3.5.yaml | Original stable version with basic SSH/PBS/SLURM support |
| v4.0 | v4.0.yaml | Enhanced version with structured inputs, cleanup handlers, and job markers |
- Bug Fixes: Fixed scheduler_directives expansion, typo in error messages
- Structured Inputs: SLURM (account, qos, nodes, gres, mem, constraint, array) and PBS (account, queue, walltime, select)
- Cleanup Handlers: Proper job cancellation on all execution paths
- Job Markers: Optional `inject_markers` for session management coordination
- Failure Detection: Reports the final job state (COMPLETED, FAILED, TIMEOUT, etc.)
- Configurable Polling: `poll_interval` parameter for status checks
The workflow creates a simple script and executes it directly on the remote host via SSH. Best for quick tasks that don't require scheduler resource allocation.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Create │ --> │ Execute │ --> │ Stream │
│ Script │ │ via SSH │ │ Output │
└─────────────┘ └─────────────┘ └─────────────┘
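Conceptually, the SSH path behaves like the following sketch. It is illustrative only, not the workflow's actual code; `REMOTE_HOST` and `RUN_DIR` are placeholders for values supplied through the UI.

#!/bin/bash
# Illustrative sketch of the SSH execution path (not the workflow's generated code).
# REMOTE_HOST and RUN_DIR are placeholders for values supplied through the UI.

# Generate the script: shebang, run directory, then the user-supplied body
cat > run.sh <<EOF
#!/bin/bash
cd "${RUN_DIR}"
# ... user-supplied script body is inserted here ...
EOF

# Execute it on the remote host and stream the output back
ssh "${REMOTE_HOST}" 'bash -s' < run.sh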
The workflow builds a SLURM job script using user-selected options and any additional scheduler directives. The script is submitted with sbatch, and its status is monitored using squeue and sacct.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Create │ --> │ Submit │ --> │ Monitor │ --> │ Cleanup │
│ SLURM Script│ │ via sbatch │ │ via squeue │ │ │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
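The submit-and-monitor loop is conceptually similar to this sketch, assuming a generated job script named `job.sh` and the default 15-second poll interval; the actual logic lives in the workflow YAML.

#!/bin/bash
# Conceptual sketch of the SLURM path; job.sh stands in for the generated script.
jobid=$(sbatch --parsable job.sh)   # --parsable prints only the job ID
echo "${jobid}" > jobid             # saved so cleanup can scancel it later

# Poll until the job leaves the queue
while squeue -h -j "${jobid}" 2>/dev/null | grep -q .; do
    sleep 15
done

# Query the final state (COMPLETED, FAILED, TIMEOUT, ...) from accounting
state=$(sacct -n -X -j "${jobid}" -o State | head -n 1 | xargs)
echo "Job ${jobid} finished with state: ${state}"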
The workflow constructs a PBS-compatible job script using the user-provided account and scheduler directives, submits it with qsub, monitors the queue, and waits for completion.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Create │ --> │ Submit │ --> │ Monitor │ --> │ Cleanup │
│ PBS Script │ │ via qsub │ │ via qstat │ │ │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
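The PBS path follows the same pattern, sketched here with a generated script named `job.pbs` (illustrative only; details may differ from the workflow's actual logic).

#!/bin/bash
# Conceptual sketch of the PBS path; job.pbs stands in for the generated script.
jobid=$(qsub job.pbs)               # qsub prints the job ID on submission
echo "${jobid}" > jobid             # saved so cleanup can qdel it later

# Poll until the job is no longer known to the server
while qstat "${jobid}" > /dev/null 2>&1; do
    sleep 15
done
echo "Job ${jobid} is no longer in the queue"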
For PBS and SLURM, the workflow continuously monitors job status until completion or until the job is no longer found in the queue.
If the workflow run itself is canceled, the cleanup logic automatically attempts to terminate the remote job (qdel or scancel) to prevent orphaned workloads on the compute resource.
v4.0 Additional Cleanup Features:
- Runs the user's `cancel.sh` script if present in the run directory
- Cleans up temporary files (`jobid`, `CANCEL_STREAMING`, `job.started`, `HOSTNAME`)
- SSH jobs attempt to kill background processes
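Put together, the cancellation path is roughly equivalent to this sketch; the real handlers are defined in the workflow YAML, and `RUN_DIR` is a placeholder for the configured run directory.

#!/bin/bash
# Rough sketch of the cleanup behavior described above (illustrative only).
cd "${RUN_DIR}"

# Cancel the remote job if one was submitted
if [ -f jobid ]; then
    jid=$(cat jobid)
    scancel "${jid}" 2>/dev/null || qdel "${jid}" 2>/dev/null || true
fi

# Run the user's optional cleanup hook (v4.0)
[ -x ./cancel.sh ] && ./cancel.sh

# Remove temporary marker files
rm -f jobid CANCEL_STREAMING job.started HOSTNAME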
Deploy job_runner directly to run ad-hoc scripts on your resources:
# Your workflow.yaml
uses: marketplace/job_runner/v4.0

Reference job_runner from within your workflow's job steps:
jobs:
  my_job:
    ssh:
      remoteHost: ${{ inputs.resource.ip }}
    steps:
      - name: Run My Script
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: /path/to/workdir
          use_existing_script: true
          script_path: /path/to/my_script.sh
          scheduler: ${{ inputs.scheduler.enabled }}
          slurm:
            partition: ${{ inputs.slurm.partition }}
            time: ${{ inputs.slurm.time }}
            gres: "gpu:1"

| Input | Type | Default | Description |
|---|---|---|---|
| `resource` | compute-clusters | (required) | The compute resource to execute the script on |
| `shebang` | string | `#!/bin/bash` | The shell interpreter line for the script |
| `rundir` | string | `${PWD}` | The directory where the script will be executed |
| Input | Type | Default | Description |
|---|---|---|---|
| `use_existing_script` | boolean | `false` | `true` = use script at `script_path`; `false` = use inline script |
| `script` | editor | (demo script) | Inline script content (when `use_existing_script=false`) |
| `script_path` | string | - | Path to existing script on target (when `use_existing_script=true`) |
| Input | Type | Default | Description |
|---|---|---|---|
| `scheduler` | boolean | `false` | `true` = submit to scheduler; `false` = execute via SSH |
| `inject_markers` | boolean | `true` | Auto-inject `job.started` and `HOSTNAME` markers (v4.0) |
| `poll_interval` | number | `15` | How often to check job status, in seconds (v4.0) |
| Input | Type | Default | Description |
|---|---|---|---|
| `slurm.is_disabled` | boolean | (auto) | Auto-computed based on scheduler type |
| `slurm.account` | slurm-accounts | - | SLURM account (`--account`) |
| `slurm.partition` | slurm-partitions | - | SLURM partition (`--partition`) |
| `slurm.qos` | slurm-qos | - | Quality of Service (`--qos`) |
| `slurm.time` | string | `04:00:00` | Walltime limit (`--time`) |
| `slurm.nodes` | number | `1` | Number of nodes (`--nodes`) (v4.0) |
| `slurm.cpus_per_task` | number | `4` | CPUs per task (`--cpus-per-task`) (v4.0) |
| `slurm.gres` | string | - | Generic resources, e.g., `gpu:1` (`--gres`) (v4.0) |
| `slurm.mem` | string | - | Memory per node, e.g., `32G` (`--mem`) (v4.0) |
| `slurm.constraint` | string | - | Node constraint (`--constraint`) (v4.0) |
| `slurm.array` | string | - | Array job spec, e.g., `0-9` (`--array`) (v4.0) |
| `slurm.scheduler_directives` | editor | - | Additional `#SBATCH` directives |
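When the structured SLURM inputs above are populated, the generated script header carries the corresponding `#SBATCH` directives, roughly like the sketch below. The values shown are illustrative examples, not defaults, and the exact generated formatting may differ.

#!/bin/bash
# Illustrative #SBATCH header assembled from the structured SLURM inputs
# (values are examples only).
#SBATCH --account=my_account
#SBATCH --partition=gpu
#SBATCH --qos=normal
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --constraint=a100
#SBATCH --array=0-9
# ...any additional scheduler_directives are appended here, followed by the user script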
| Input | Type | Default | Description |
|---|---|---|---|
| `pbs.is_disabled` | boolean | (auto) | Auto-computed based on scheduler type |
| `pbs.account` | string | - | PBS account (`-A`) (v4.0) |
| `pbs.queue` | string | - | PBS queue (`-q`) (v4.0) |
| `pbs.walltime` | string | `04:00:00` | Walltime limit (`-l walltime=`) (v4.0) |
| `pbs.select` | string | - | Resource selection (`-l select=`) (v4.0) |
| `pbs.scheduler_directives` | editor | - | Additional `#PBS` directives |
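Similarly, the PBS inputs above map to `#PBS` directives in the generated script header, roughly as sketched here (illustrative values; the exact generated output may differ).

#!/bin/bash
# Illustrative #PBS header assembled from the structured PBS inputs
# (values are examples only).
#PBS -A my_allocation
#PBS -q normal
#PBS -l walltime=04:00:00
#PBS -l select=1:ncpus=4
# ...any additional scheduler_directives are appended here, followed by the user script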
Run a script directly on the login node without scheduler:
jobs:
  hello_world:
    steps:
      - name: Say Hello
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: ~/my_project
          scheduler: false
          script: |
            echo "Hello from $(hostname)"
            echo "Current directory: $(pwd)"
            echo "Date: $(date)"

Submit a GPU job to SLURM:
jobs:
  gpu_training:
    steps:
      - name: Train Model
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: ${{ inputs.workdir }}
          use_existing_script: true
          script_path: ${{ inputs.workdir }}/train.sh
          scheduler: true
          slurm:
            is_disabled: false
            account: ${{ inputs.slurm.account }}
            partition: gpu
            time: "08:00:00"
            nodes: 1
            cpus_per_task: 8
            gres: "gpu:a100:2"
            mem: "64G"

Submit a job to a PBS cluster:
jobs:
  simulation:
    steps:
      - name: Run Simulation
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: /scratch/$USER/simulation
          scheduler: true
          pbs:
            is_disabled: false
            account: my_allocation
            queue: normal
            walltime: "24:00:00"
            select: "4:ncpus=32:mpiprocs=32:ngpus=0"
          script: |
            module load openmpi
            mpirun -np 128 ./my_simulation --input config.yaml

Submit an array job for parameter sweeps:
jobs:
  parameter_sweep:
    steps:
      - name: Run Sweep
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: ~/experiments
          scheduler: true
          slurm:
            is_disabled: false
            partition: compute
            time: "01:00:00"
            array: "0-99%10"  # 100 tasks, max 10 concurrent
          script: |
            echo "Running task ${SLURM_ARRAY_TASK_ID}"
            python process.py --index ${SLURM_ARRAY_TASK_ID}

The vLLM/RAG workflow uses job_runner for unified job submission:
# From activate-rag-vllm/workflow.yaml
run_service:
  needs: [setup, prepare_containers, prepare_model, prepare_tiktoken]
  working-directory: ${{ needs.setup.outputs.rundir }}
  ssh:
    remoteHost: ${{ inputs.resource.ip }}
  steps:
    - name: Run Service
      uses: marketplace/job_runner/v4.0
      early-cancel: any-job-failed
      with:
        resource: ${{ inputs.resource }}
        shebang: '#!/bin/bash'
        rundir: ${{ needs.setup.outputs.rundir }}
        use_existing_script: true
        script_path: ${{ needs.setup.outputs.rundir }}/start_service.sh
        scheduler: ${{ inputs.scheduler.enabled }}
        inject_markers: true
        slurm:
          is_disabled: ${{ inputs.scheduler.slurm.is_disabled }}
          account: ${{ inputs.scheduler.slurm.account }}
          partition: ${{ inputs.scheduler.slurm.partition }}
          qos: ${{ inputs.scheduler.slurm.qos }}
          time: ${{ inputs.scheduler.slurm.time }}
          cpus_per_task: ${{ inputs.scheduler.slurm.cpus_per_task }}
          gres: ${{ inputs.scheduler.slurm.gres }}
          scheduler_directives: |
            ${{ inputs.scheduler.slurm.scheduler_directives }}
        pbs:
          is_disabled: ${{ inputs.scheduler.pbs.is_disabled }}
          account: ${{ inputs.scheduler.pbs.account }}
          queue: ${{ inputs.scheduler.pbs.queue }}
          walltime: ${{ inputs.scheduler.pbs.walltime }}
          select: ${{ inputs.scheduler.pbs.select }}
          scheduler_directives: |
            ${{ inputs.scheduler.pbs.scheduler_directives }}

For workflows like activate-medical-finetuning:
jobs:
  finetune:
    needs: [setup, download_data]
    steps:
      - name: Fine-tune Model
        uses: marketplace/job_runner/v4.0
        with:
          resource: ${{ inputs.resource }}
          rundir: ${{ needs.setup.outputs.workdir }}
          use_existing_script: true
          script_path: ${{ needs.setup.outputs.workdir }}/finetune.sh
          scheduler: ${{ inputs.use_scheduler }}
          inject_markers: true
          slurm:
            is_disabled: ${{ inputs.resource.schedulerType != 'slurm' || !inputs.use_scheduler }}
            partition: ${{ inputs.slurm.partition }}
            time: ${{ inputs.slurm.time }}
            gres: "gpu:${{ inputs.num_gpus }}"
            mem: "${{ inputs.memory }}G"
            scheduler_directives: |
              #SBATCH --exclusive

When `inject_markers: true` (default in v4.0), job_runner automatically creates:
- `job.started` - Touched when the job begins execution
- `HOSTNAME` - Contains the compute node hostname
These markers enable session management workflows to:
- Wait for the job to start
- Determine the target hostname for port forwarding
- Coordinate cleanup on job completion
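The injected prologue amounts to a couple of lines at the top of the generated job script, conceptually like this sketch (the exact generated content may differ; `RUN_DIR` is a placeholder for the configured run directory):

# Conceptual sketch of the injected marker prologue (v4.0).
cd "${RUN_DIR}"
touch job.started        # signals that the job has begun execution
hostname > HOSTNAME      # records the compute node, e.g., for port forwarding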
Example session management pattern:
create_session:
  needs: [setup]
  steps:
    - name: Wait for Job to Start
      run: |
        timeout=600; elapsed=0
        while [ ! -f ${{ needs.setup.outputs.rundir }}/job.started ]; do
          sleep 5; ((elapsed+=5))
          [[ $elapsed -ge $timeout ]] && exit 1
        done
    - name: Get Hostname
      run: |
        target_hostname=$(cat ${{ needs.setup.outputs.rundir }}/HOSTNAME | head -1)
        echo "target_hostname=${target_hostname}" >> $OUTPUTS

Create a `cancel.sh` in your run directory to handle cleanup when the workflow is cancelled:
#!/bin/bash
# cancel.sh - called by job_runner cleanup
# Stop your services
docker-compose down 2>/dev/null || true
singularity-compose down 2>/dev/null || true
# Kill background processes
pkill -f "my_app" 2>/dev/null || true
# Clean up temp files
rm -rf /tmp/my_cache_*

This workflow can serve as a template for more specialized workflows:
- Hide the script input: Set `hidden: true` on the `script` input
- Pre-define the script: Set `use_existing_script: true` and provide `script_path`
- Add custom inputs: Extend the input section with workflow-specific parameters
- Reference inputs in script: Use `${{ inputs.my_param }}` in your script (see the example below)
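For example, an inline script body can reference a workflow input directly; here `my_param` is a hypothetical input added by the specialized workflow, and the template is expanded before the script is written to the target:

# ${{ inputs.my_param }} is a hypothetical workflow input, substituted by the
# workflow engine before the script runs on the target.
echo "Running with my_param=${{ inputs.my_param }}"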
- Check that the resource has the correct `schedulerType` configured
- Verify account/partition names are valid for the cluster
- Check scheduler logs: `scontrol show job <jobid>` or `qstat -f <jobid>`
- Ensure the script writes to stdout/stderr (not just files)
- Check that `run.${PW_JOB_ID}.out` is being created in the rundir
- Verify SSH connectivity to the resource
- Check the `poll_interval` setting (default 15 seconds)
- Verify the job is actually running: `squeue -j <jobid>` or `qstat <jobid>`
- Check for scheduler-specific issues (node failures, resource unavailability)
- Ensure the workflow is being cancelled (not just the browser closed)
- Check that `scancel`/`qdel` commands are available on the resource
- Verify the `jobid` file was created in the rundir
Apache 2.0 - See LICENSE file for details.