Temporal-Difference Learning and the importance of exploration, An illustrated guide

This article has been published on Towards Data Science, read it here!

Intro

Recently, Reinforcement Learning (RL) algorithms have received a lot of traction by solving research problems such as protein folding, reaching a superhuman level in drone racing, or even integrating human feedback in your favorite chatbots.

Indeed, RL provides useful solutions to a variety of sequential decision-making problems. Temporal-Difference Learning (TD learning) methods are a popular subset of RL algorithms. TD learning methods combine key aspects of Monte Carlo and Dynamic Programming methods to accelerate learning without requiring a perfect model of the environment dynamics.

In this article, we’ll compare different kinds of TD algorithms in a custom Grid World. The design of the experiment will outline the importance of continuous exploration as well as the individual characteristics of the tested algorithms: Q-learning, Dyna-Q, and Dyna-Q+.

The outline of this post contains:

  • Description of the environment
  • Temporal-Difference (TD) Learning
  • Model-free TD methods (Q-learning) and model-based TD methods (Dyna-Q and Dyna-Q+)
  • Parameters
  • Performance comparisons
Ryan Pégoud
Ryan Pégoud
MSc Student in Computational Statistics and Machine Learning

My research interests include Reinforcement Learning in Open-Ended settings, leading to more general and robust agents.