Korosh-Rajaei/DRL_via_Evolution

Deep Reinforcement Learning via Evolution

Introduction

This project investigates to what extent evolutionary methods such as the Cross-Entropy Method and Evolution Strategies can be used to optimize a neural network policy, compared with the baseline REINFORCE.

Reinforcement Learning (RL) methods can be divided into gradient-based and black-box optimization methods. Evolutionary algorithms perturb the parameters of the policy network and then select or update parameters based on the total episode return of each member of a population. Policy-gradient algorithms like REINFORCE, in contrast, explicitly use state-action trajectories for credit assignment. This project compares the baseline REINFORCE against two population-based evolutionary methods: the Cross-Entropy Method (CEM) and Evolution Strategies (ES). All methods use neural network policies with similar architectures and are compared under matched environment conditions and interaction budgets. Performance is measured on the pendulum swing-up continuous-control task (Pendulum-v1, OpenAI Gym).
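To make the black-box view concrete, the following is a minimal sketch of a single CEM iteration. It is not the code used in this repository: the function names, the population size, the elite fraction, and the toy quadratic `toy_return` (a stand-in for an episode return on Pendulum-v1) are all assumptions made for illustration. Note that the update only ever sees one scalar return per candidate, never the individual state-action pairs.

```python
import numpy as np

def cem_step(mean, std, fitness_fn, rng, pop_size=50, elite_frac=0.2):
    """One Cross-Entropy Method iteration over a flat parameter vector."""
    # Sample a population of candidate parameter vectors around the current mean.
    population = mean + std * rng.standard_normal((pop_size, mean.size))
    # Score each candidate with a single scalar (its total return).
    returns = np.array([fitness_fn(theta) for theta in population])
    # Keep the top-performing fraction of the population (the elites).
    n_elite = max(1, int(pop_size * elite_frac))
    elites = population[np.argsort(returns)[-n_elite:]]
    # Refit the sampling distribution to the elites.
    new_mean = elites.mean(axis=0)
    new_std = elites.std(axis=0) + 1e-3  # small floor so sampling never collapses fully
    return new_mean, new_std, returns

def toy_return(theta):
    # Hypothetical stand-in for an episode return: largest (0) at theta == target.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
mean, std = np.zeros(3), np.ones(3)
for _ in range(30):
    mean, std, returns = cem_step(mean, std, toy_return, rng)
print("final toy return:", toy_return(mean))
```

In an actual experiment, `fitness_fn` would roll out the policy defined by `theta` for one or more episodes and return the summed reward.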

Initially, ES and CEM are implemented in a simple configuration based on a discrete action space and a multilayer perceptron (MLP) policy. In this configuration, both ES and CEM show unstable learning and fail to reliably improve performance. The project therefore aligns these evolutionary algorithms more closely with their theoretical assumptions. For instance, the policy is changed to output a continuous torque instead of choosing among discrete torques. Antithetic sampling is also used, and for the ES method the perturbed policies are ranked so that the update depends on their relative order rather than on raw return values. These changes are intended to reduce the noise that enters through fitness evaluation.
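As an illustration of the two variance-reduction ideas mentioned above, here is a hedged sketch of an ES update that combines antithetic (mirrored) perturbations with centered-rank fitness shaping. The names `centered_ranks` and `es_step`, the hyperparameters, and the toy objective are assumptions for this example and are not taken from the repository.

```python
import numpy as np

def centered_ranks(returns):
    """Map returns to evenly spaced values in [-0.5, 0.5]; only the ordering matters."""
    ranks = np.empty(len(returns), dtype=float)
    ranks[np.argsort(returns)] = np.arange(len(returns))
    return ranks / (len(returns) - 1) - 0.5

def es_step(theta, fitness_fn, rng, n_pairs=25, sigma=0.1, lr=0.05):
    """One ES update with antithetic sampling and rank-based fitness shaping."""
    eps = rng.standard_normal((n_pairs, theta.size))
    # Evaluate every perturbation together with its mirror image.
    r_pos = np.array([fitness_fn(theta + sigma * e) for e in eps])
    r_neg = np.array([fitness_fn(theta - sigma * e) for e in eps])
    # Rank all 2 * n_pairs returns jointly, then split back into the two halves.
    shaped = centered_ranks(np.concatenate([r_pos, r_neg]))
    s_pos, s_neg = shaped[:n_pairs], shaped[n_pairs:]
    # Each mirrored pair contributes (s_pos - s_neg) * eps to the gradient estimate.
    grad = ((s_pos - s_neg)[:, None] * eps).sum(axis=0) / (2 * n_pairs * sigma)
    return theta + lr * grad

def toy_return(theta):
    # Hypothetical stand-in for an episode return: largest (0) at theta == target.
    target = np.array([1.0, -2.0, 0.5])
    return -np.sum((theta - target) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(200):
    theta = es_step(theta, toy_return, rng)
print("final toy return:", toy_return(theta))
```

Because `centered_ranks` discards magnitudes, a single outlier episode with an extreme return cannot dominate the update; it only occupies the top or bottom rank.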

Results

Discrete methods

The baseline REINFORCE method learns quickly and consistently, and it performs better than the discrete evolutionary methods. Its return starts low but quickly converges to around -200. The learning curve of the discrete CEM shows that learning is somewhat stable but slow. In each iteration, the best policy outperforms the population mean, which indicates that CEM is able to find top-performing policies through sampling. However, the population mean improves very slowly compared to the baseline REINFORCE. For the discrete ES method, occasional spikes reach returns near -200, but these improvements are not consistent, so the results are highly unstable and noisy. Overall, the results indicate a marked difference between evolutionary and gradient-based methods in this setting.

Continuous evolutionary methods

The continuous CEM method performs better than the discrete version. In some iterations, the best policy in the population reaches a near-optimal return. However, the population mean shows that learning, although better than in the discrete case, is still somewhat noisy and less stable than the baseline REINFORCE.

The results for the continuous ES method show that learning is still very noisy and unstable. As with the discrete ES method, there are iterations where the return improves, but these improvements are not consistent and the return falls back to lower values.

In the initial setting, where a discrete action space was used, both CEM and ES show inconsistent, unstable learning and noisy returns across iterations, compared to the stable learning of the REINFORCE method. These findings are consistent with theory: with a discretized action space, a small change in the policy parameters can switch the selected action. This makes the resulting behavior unpredictable, so both methods receive noisy fitness feedback and cannot consistently improve the policy.
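The effect of discretization can be illustrated with a small sketch. It assumes a linear policy, a greedy argmax over a hypothetical three-torque discretization of the Pendulum-v1 torque range [-2, 2], and a tanh-squashed continuous output; none of these details are confirmed by the repository. The same tiny parameter perturbation flips the discrete action from one extreme to the other, while the continuous torque barely moves.

```python
import numpy as np

# Hypothetical discretization of the Pendulum-v1 torque range [-2, 2].
TORQUES = np.array([-2.0, 0.0, 2.0])

def discrete_action(W, obs):
    # Greedy choice among fixed torques: the output jumps when two scores swap order.
    return TORQUES[np.argmax(W @ obs)]

def continuous_action(w, obs):
    # Squashed scalar output: varies smoothly with the parameters.
    return 2.0 * np.tanh(w @ obs)

obs = np.array([1.0, 0.0, 0.5])  # [cos(angle), sin(angle), angular velocity]

# Discrete policy: one row of weights per candidate torque.
W = np.array([[0.10, 0.0, 0.0],
              [0.00, 0.0, 0.0],
              [0.09, 0.0, 0.0]])
W_perturbed = W.copy()
W_perturbed[2, 0] += 0.02  # tiny parameter change

# Continuous policy: a single weight vector.
w = np.array([0.10, 0.0, 0.0])
w_perturbed = w.copy()
w_perturbed[0] += 0.02  # same tiny parameter change

print("discrete:  ", discrete_action(W, obs), "->", discrete_action(W_perturbed, obs))
print("continuous:", continuous_action(w, obs), "->", continuous_action(w_perturbed, obs))
```

For a perturbation-based optimizer, such jumps mean that two nearly identical parameter vectors can receive very different returns, which is one plausible source of the noisy fitness signal described above.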

Adjusting the evolutionary methods to better align with the structure of the environment improves performance and learning. Changing to a continuous action space makes it possible for the CEM model to concentrate more probability on top-performing policies. Although the best-performing policies achieve near-optimal results in some iterations, learning is still unstable, noisy, and slower than with REINFORCE, which may reflect the exploratory nature of this method. The adjusted continuous ES method (with antithetic sampling and rank-based fitness normalization) shows lower variance than the initial version. However, the method still fails to exhibit consistent and stable learning: improvements remain highly noisy and fluctuate rapidly. A possible explanation is that, unlike CEM, ES relies on estimating a gradient in a high-dimensional parameter space, so a limited number of samples leads to noisy and weak gradient estimates.
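The dependence of the ES gradient estimate on dimensionality can be sketched on a toy problem. The example below assumes a linear fitness function with a known gradient, so the estimate can be compared with the truth by cosine similarity; the function names, the sample count of 50 pairs, and the two dimensions are illustrative assumptions, not settings from this project.

```python
import numpy as np

def es_gradient_estimate(grad_true, n_pairs, sigma, rng):
    """Antithetic ES estimate of the gradient of f(theta) = grad_true @ theta."""
    eps = rng.standard_normal((n_pairs, grad_true.size))
    # For a linear f: f(theta + sigma*eps) - f(theta - sigma*eps) = 2 * sigma * (eps @ grad_true).
    diffs = 2.0 * sigma * (eps @ grad_true)
    return (diffs[:, None] * eps).sum(axis=0) / (2.0 * n_pairs * sigma)

def cosine(a, b):
    # Cosine similarity: 1 means the estimate points exactly along the true gradient.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
for dim in (10, 5000):
    g = rng.standard_normal(dim)
    est = es_gradient_estimate(g, n_pairs=50, sigma=0.1, rng=rng)
    print(f"dim={dim}: cosine similarity to true gradient = {cosine(est, g):.2f}")
```

With the sample budget held fixed, the estimate aligns well with the true gradient in the low-dimensional case and poorly in the high-dimensional one, which matches the explanation given above for a neural network policy with many parameters.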
