Tutorial in Reinforcement Learning of the RL-Bootcamp Salzburg 25
- Learning Goals
- Introduction: The Problem of Locomotion
- Approaches to Solve the Control Problem
- Getting Started: The Tutorial Workflow
- Installation Guide
Welcome to the RL Bootcamp Tutorial! This tutorial guides you through fundamental reinforcement learning (RL) techniques using a classic robotics locomotion task. We will train an agent to walk, introduce a "domain shift" by changing the agent's body, observe the consequences, and then explore strategies for adaptation.
- Learn the basics of continuous control problems in a didactic and visual way.
- Understand the challenge of teaching a simulated robot to walk.
- Learn how to use off-the-shelf RL agents like PPO to solve complex locomotion tasks.
- Grasp the concept of "domain shift" and why policies often fail when the environment changes.
- Get an idea of where to start when tackling a robotics control problem.
- Learn how to get reproducible results and the importance of hyperparameter tuning.
- Be creative and explore further challenges like fine-tuning and domain randomization!
Teaching a machine to walk is a classic problem in both robotics and artificial intelligence. It requires coordinating multiple joints (motors) to produce a stable and efficient pattern of movement, known as a "gait." This is a perfect problem for Reinforcement Learning because the exact sequence of motor commands is incredibly difficult to program by hand, but it's easy to define a high-level goal: "move forward as fast as possible."
To explore this problem, we use the Ant environment, a standard benchmark in the field. Think of it as the "Hello World" for continuous control in robotics.
- The Agent: A four-legged "ant" creature simulated using the MuJoCo physics engine.
- The Goal: Learn a policy to apply torques to its eight joints to make it run forward as quickly as possible without falling over.
- State Space (S): A high-dimensional continuous space that includes the positions, orientations, and velocities of all parts of the ant's body.
- Action Space (A): A continuous space representing the torque applied to each of the eight hip and ankle joints.
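To get a concrete feel for these spaces, here is a minimal sketch that creates the environment with Gymnasium and inspects it (assuming gymnasium with MuJoCo support is installed, as in requirements.txt):

```python
import gymnasium as gym

# Create the standard Ant environment and inspect its spaces.
env = gym.make("Ant-v5")

print(env.observation_space.shape)  # high-dimensional continuous state; exact size depends on observation options
print(env.action_space)             # Box(-1.0, 1.0, (8,), float32): one torque per hip/ankle joint

obs, info = env.reset(seed=0)
action = env.action_space.sample()  # random torques, just to step the simulation once
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```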
What happens when an agent trained perfectly in one scenario is deployed in a slightly different one? To explore this, we introduce a modification: the CrippledAntEnv.
- The Change: This environment is identical to the standard AntEnv, except we have programmatically "broken" one of its legs by disabling the joints.
- The Purpose: This serves as a powerful lesson in robustness and adaptation. A policy trained on the original ant will likely fail dramatically here, as its learned gait is highly specialized for a four-legged body. This forces us to ask: how can we make an agent adapt to this new reality?
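The tutorial ships its own CrippledAnt environment, so the snippet below is only an illustration of the idea: a gymnasium.ActionWrapper that zeroes the torques of one leg. The joint indices are assumptions for demonstration, not the repository's actual implementation.

```python
import gymnasium as gym
import numpy as np


class ZeroLegTorques(gym.ActionWrapper):
    """Illustrative only: emulate a broken leg by zeroing some joint torques."""

    def __init__(self, env, disabled_joints=(0, 1)):
        # disabled_joints is a hypothetical index pair for one hip/ankle joint;
        # the real CrippledAnt environment may disable joints differently.
        super().__init__(env)
        self.disabled_joints = list(disabled_joints)

    def action(self, action):
        action = np.array(action, copy=True)
        action[self.disabled_joints] = 0.0  # the "broken" leg produces no torque
        return action


env = ZeroLegTorques(gym.make("Ant-v5"))
```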
We chose this Ant-based setup for its didactic value:
- Visually Intuitive: It's easy to see if the agent is succeeding or failing. You can visually inspect the learned gait and see how it stumbles when the environment changes.
- Real-World Parallel: This setup mimics real-world robotics challenges, where a robot might suffer hardware damage or face an environment different from its simulation.
- Demonstrates Key Concepts: It provides a clear and compelling way to understand complex RL topics like specialization, domain shift, and the need for adaptive strategies like fine-tuning or retraining.
- High FPS: The environment has been optimized to run in parallel and can sustain a training throughput on the order of 1000 FPS on modern hardware.
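That throughput comes from stepping many simulation instances in parallel. A minimal sketch with Stable-Baselines3's vectorized environments (train.py configures this via Hydra; n_envs=8 is just an illustrative value):

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Step 8 Ant simulations in parallel worker processes to raise training FPS.
    vec_env = make_vec_env("Ant-v5", n_envs=8, vec_env_cls=SubprocVecEnv)
    obs = vec_env.reset()  # batched observations, one row per environment
    print(obs.shape)
    vec_env.close()
```

The `__main__` guard matters because SubprocVecEnv spawns worker processes on some platforms.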
RL is our primary tool for this problem. It's a data-driven approach where an agent learns an optimal policy through trial-and-error by interacting with its environment.
Advantages:
- Model-Free: It doesn't require an explicit, hand-crafted mathematical model of the ant's physics, which would be incredibly complex to create. It learns directly from experience.
- Handles Complexity: RL algorithms are well-suited for high-dimensional, continuous state and action spaces like those in robotics.
- Discovers Novel Strategies: RL can discover complex and efficient gaits that a human engineer might not have imagined.
Drawbacks:
- Sample Efficiency: It can require millions of simulation steps to learn an effective policy.
- Tuning Complexity: Performance is often very sensitive to the choice of algorithm and its hyperparameters.
For a problem like ant locomotion, a purely analytical solution (e.g., deriving a set of equations that describe a perfect walking gait) is practically impossible. The system is:
- High-Dimensional: Many joints must be controlled simultaneously.
- Non-Linear: The physics of friction, contact, and momentum are highly non-linear.
- Underactuated: The agent has to use momentum and contact forces to control its overall body position.
This is where data-driven, learning-based methods like RL shine. They can learn effective control policies for complex systems where traditional engineering approaches would be intractable.
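To make the data-driven approach concrete, here is a minimal Stable-Baselines3 sketch of PPO learning a gait purely from interaction (the tutorial itself drives this through train.py and Hydra; the hyperparameters here are illustrative):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Model-free learning: the agent never sees an analytical model of the
# Ant's dynamics, only observations, actions, and rewards.
env = gym.make("Ant-v5")
model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=1)

# Illustrative budget; a competent gait typically needs millions of steps.
model.learn(total_timesteps=200_000)
model.save("ppo_ant_sketch")
```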
This tutorial is a hands-on guide to training, testing, and adapting a policy.
Before you start, make sure you have set up your Python environment correctly by following the Installation Guide at the end of this document. This involves creating a virtual environment and installing the packages from requirements.txt.
Our first goal is to train a competent agent on the standard AntEnv.
- Command:
python train.py
- What happens: Hydra will use the default configuration (config/env/ant.yaml) to train a PPO agent. After training, the best policy will be saved to a file like logs/runs/YYYY-MM-DD_HH-MM-SS/best_model.zip.
- post_training_analysis.py: a script for post-training analysis, which can render a video of the trained policy and plot training metrics.
  - Pass the --run parameter on the command line, followed by the parent directory of the trained agent to be analysed.
  - Pass the --render and --plot flags to enable video rendering and metrics plotting, respectively.
  - The renders are saved to a videos directory in the base directory of the repository.
  - Example:
python post_training_analysis.py --run logs/runs/train_save_best/2025-08-22_18-55-24/ --plot --render
Now, let's see how our pro agent handles an unexpected change.
- You can use the same script as before and simply add the --cripple flag. This will save the renders to a videos-cripple directory in the base directory of the repository. E.g.:
python post_training_analysis.py --run logs/runs/train_save_best/2025-08-22_18-55-24/ --cripple --render
- AnalyseRun.ipynb:
  - In this Jupyter notebook, you can choose the trained agent's weights (e.g. best_model.zip) that you want to evaluate.
  - You can also choose which environment type to evaluate on (e.g. Ant-v5 or CrippledAnt-v5).
  - Note that the observation and action sizes must match the trained policy's input and output sizes, respectively (i.e. you cannot directly use an agent trained on an ant with 4 legs on an ant with 6 legs).
- What happens: We load our trained agent but use a Hydra override (env=crippled_ant) to run it in the modified environment. You will likely see the ant stumble and fail, proving its policy was not robust to this change.
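In rough strokes, the evaluation boils down to the following sketch. The checkpoint path is a placeholder for your own run, and CrippledAnt-v5 must first be registered by this repository's environment code with the same observation settings used during training:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Load a saved checkpoint (placeholder path; point it at your own run directory).
model = PPO.load("logs/runs/train_save_best/<your_run>/best_model.zip")

# Compare the same policy on the original and the modified environment.
for env_id in ["Ant-v5", "CrippledAnt-v5"]:
    env = gym.make(env_id)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"{env_id}: {mean_reward:.1f} +/- {std_reward:.1f}")
    env.close()
```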
Let's train a new agent from scratch that only ever experiences the crippled environment.
- Command:
python train.py env=crippled_ant
- What happens: This will create a new agent that learns a specialized gait for the crippled body. You can test it with AnalyseRun.ipynb and see that it learns to walk effectively under its new circumstances.
Now you have two specialist agents!
- Compare them: Watch both agents in their respective environments. Do they learn different gaits?
- Explore on your own:
  - Fine-Tuning: Can you adapt the "pro" agent to the crippled environment faster than training from scratch? Try modifying the training script to load the pro agent's model and continue training it on CrippledAntEnv (see the sketch after this list).
  - Domain Randomization: Modify the environment code to cripple a random leg at the start of each episode. Can you train a single, super-robust agent that can walk with any injury?
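As a starting point for the fine-tuning idea, here is a minimal Stable-Baselines3 sketch; the environment ID, checkpoint path, and step budget are assumptions, and doing this through train.py with a Hydra override that loads the pro agent's weights is the cleaner route:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Fine-tuning sketch: reuse the "pro" agent's weights and continue training
# in the crippled environment instead of starting from scratch.
# "CrippledAnt-v5" must be registered by this repository's environment code.
crippled_env = gym.make("CrippledAnt-v5")

model = PPO.load("logs/runs/train_save_best/<pro_run>/best_model.zip", env=crippled_env)
model.learn(total_timesteps=500_000, reset_num_timesteps=False)  # illustrative budget
model.save("ppo_ant_finetuned_sketch")
```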
- Git: [Install Git](https://git-scm.com/downloads)
- Python 3.11.9 or higher: [Download Python](https://www.python.org/downloads/)
git clone https://github.com/SARL-PLUS/RL_bootcamp_tutorial.git
cd RL_bootcamp_tutorial
- Create:
python3 -m venv venv
- Activate (macOS/Linux):
source venv/bin/activate
- Activate (Windows):
venv\Scripts\activate
Install all required packages using the appropriate requirements.txt file for your system.
# Start with the general file. If it fails, use the OS-specific version.
pip install -r requirements.txt
The template uses Stable-Baselines3 along with Hydra for configuration management. Hydra is a hierarchical configuration tool that essentially takes care of the tiresome parts, like maintaining your configuration and storing it along with the training results. Although the use of Hydra might be a matter of taste, we believe it is important to demonstrate its benefits: configuration management is a highly relevant task that deserves attention similar to that given to the algorithms themselves.
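To illustrate the pattern, here is a minimal sketch of a Hydra application (not the repository's actual train.py): a decorated entry point receives the composed configuration, which Hydra also writes to the run directory.

```python
import hydra
from omegaconf import DictConfig, OmegaConf


# Minimal Hydra app sketch: Hydra composes config/train.yaml with any
# command-line overrides, stores the result in the run directory, and
# passes it to the entry point as a DictConfig.
@hydra.main(config_path="config", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # inspect the fully composed configuration
    # Objects (envs, agents, callbacks) can then be built from config nodes,
    # e.g. via hydra.utils.instantiate(node) wherever a _target_ is defined.


if __name__ == "__main__":
    main()
```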
All commands assume you are running from the repository root with the virtual environment activated.
Hydra will automatically create output directories under logs/runs or logs/multiruns, depending on whether the job is launched in single or multirun mode. Results and checkpoints are stored by the configured callbacks, and the configuration of each run is automatically stored in the .hydra directory.
To run the training code with the default configuration defined in train.yaml, just execute the following command with the virtual environment activated:
python train.py
Training runs with an entirely different configuration are launched via:
python train.py cfg=your_config
As mentioned, Hydra brings the benefits of a hierarchical configuration tool, where every key can be overridden. For example, let's run the training with a different environment configuration:
python train.py env=crippled_ant
It is very convenient that Hydra stores the configuration in the logs/runs/.../<run_dir> directory along with a list of the overridden keys.
It is good practice to take advantage of the hierarchy by using a well-defined default configuration and overriding only the necessary parts in an experiment file:
python train.py experiment=your_custom_experiment
For example, let's change the environment configuration entirely and modify some parameters, like the number of training environments used and an environment parameter that is passed to the constructor of the gym environment. Be careful not to forget # @package _global_ right before the defaults list, as this tells Hydra to merge the configuration into the global configuration space.
# @package _global_
defaults:
  - override /env: crippled_ant
train_env:
  n_envs: 6        # increase the number of training environments
  env_kwargs:
    injury: medium # disable two legs instead of one
# define a proper task name, making it easier to link results with configurations
task_name: "train_${env.id}_${env.train_env.env_kwargs.injury}"
One of the major advantages of Hydra is its multirun support. Consider e.g. the following case, where we want to run the training with three different configurations for the learning rate:
python train.py -m agent.learning_rate=1e-4,5e-4,1e-3
Hydra now creates three run directories in logs/multiruns/..., where the results and configurations are stored as in the single-run case.
By default these jobs are executed sequentially, which is not a workflow well suited to training reinforcement learning agents. Luckily, this is easy to fix, since Hydra offers several plugins for job launching. Consider e.g. the following configuration for hyperparameter tuning:
# @package _global_
defaults:
  - override /hydra/launcher: ray
  - override /hydra/sweeper: optuna
hydra:
  mode: "MULTIRUN"
  launcher:
    ray:
      remote:
        num_cpus: 4
      init:
        local_mode: false
  sweeper:
    _target_: hydra_plugins.hydra_optuna_sweeper.optuna_sweeper.OptunaSweeper
    n_trials: 20
    n_jobs: 4
    direction: minimize
    sampler:
      _target_: optuna.samplers.TPESampler
      seed: ${seed}
      n_startup_trials: 10
override /hydra/launcher: ray essentially tells Hydra to use the Ray plugin for job launching, configured in the launcher section above; in the present case we use 4 CPUs. In addition to the launcher plugin, we also take advantage of the Optuna plugin via override /hydra/sweeper: optuna, which gives us access to more elaborate hyperparameter sampling. Here we use the TPESampler, which comes with a bit of intelligence instead of brute-force grid sampling.
The next step is to define the parameters we want to optimize, which is again best done via an experiment configuration. E.g. let's create an hp_ant_baseline.yaml experiment file, essentially loading the relevant plugins via override /hparams_search: optuna and defining the parameter space for the learning rate and the clip range of PPO, which are the parameters we want to optimize in this example.
# @package _global_
defaults:
  - override /hparams_search: optuna
task_name: "hparams_search_PPO@ANT"
hydra:
  sweeper:
    params:
      agent.clip_range: interval(0.05, 0.3)
      agent.learning_rate: interval(0.0001, 0.01)
learner.total_timesteps: 1000000
# Since we optimize for minimum training time, we need early stopping defined
callbacks:
  eval_callback:
    callback_on_new_best:
      _target_: stable_baselines3.common.callbacks.StopTrainingOnRewardThreshold
      reward_threshold: 1000
      verbose: 1
Again, to run the hyperparameter search we just need to run Hydra in multirun mode with the configuration defined above.
python train.py -m experiment=hp_ant_baseline
RL_bootcamp_2025_tutorial/
├── config/ # Hydra configuration files
│ ├── agent/ # Agent-specific settings
│ ├── callbacks/ # Callbacks during training/evaluation
│ ├── env/ # Environment definitions and parameters
│ ├── experiment/ # Experiment configuration files
│ ├── hparams_search/ # Hyperparameter search configs
│ ├── learner/ # Learning wrapper configs
│ ├── policy/ # Policy architecture and parameters
│ └── train.yaml # Main training configuration
├── src/ # Core source code
│ ├── agent/ # Agent source code
│ ├── envs/ # Environment source code
│ ├── models/ # Neural net definitions for feature extractors
│ ├── utils/ # Helpers for instantiation and postprocessing
│ └── wrappers/ # Code wrappers
├── AnalyseRun.ipynb # Inference notebook for evaluating policy snapshots
├── train.py # Main training entry point
├── vanilla_train.py # A simple script to train without Hydra (for visualization only; not recommended for use)
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE # License information