Q-Learning

The ability to have agents automatically learn complex policies, solely through environmental rewards - A Step-by-Step Beginner's Guide / HN

Understanding (deep) Q-learning in 2min


Discretize the World

Let’s assume that we are able to discretize our environment as well as the set of actions available to us. Then we can imagine that some function exists that gives the best possible action for a given situation (when positioned on a cell in our discretized environment).

This would be our Q-Function.

If we lived in a discretized environment we could also think of this Q-function as a Q-table:
a table in which each entry is indexed by a tuple identifying the cell we are in (in our environment) and maps to a vector of available actions weighted by their score. The best action for this cell is the one with the highest value.

$[x, y, z] \Rightarrow (a_1, a_2, a_3, a_4)$
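As a minimal sketch (the cell coordinates, the four actions, and the zero initialization are just placeholders for the example), such a Q-table can be stored as a plain dictionary:

```python
import numpy as np

n_actions = 4
q_table = {}  # (x, y, z) cell -> one value per available action

def best_action(cell):
    """Best action for this cell: the one with the highest value."""
    values = q_table.setdefault(cell, np.zeros(n_actions))
    return int(np.argmax(values))
```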

In this situation there is a simple algorithm to learn that Q-function (and Q-table) by only looking at the reward of our actions (by simulating their outcome) (1).
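A sketch of that algorithm, assuming a Gymnasium-style `env` with `reset()` and `step(action)` and discrete, hashable states; the learning rate `alpha`, discount `gamma`, and exploration rate `epsilon` are illustrative values, not tuned ones:

```python
import random
from collections import defaultdict
import numpy as np

n_actions = 4
q_table = defaultdict(lambda: np.zeros(n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

def train_episode(env):
    # assumes a Gymnasium-style env with discrete, hashable states
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy: mostly exploit the current table, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # nudge Q(s, a) toward the reward plus the discounted best value of s'
        target = reward + (0.0 if done else gamma * np.max(q_table[next_state]))
        q_table[state][action] += alpha * (target - q_table[state][action])
        state = next_state
```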


This World is Too Big to Fit in a Table

Discretizing most worlds this way would require a huge table. Rather than using an explicit table, we will use a neural network to learn and approximate that Q-function without storing it in a table:

  • the input layer will match the tuple identifying our cell
  • the hidden layers are whatever the problem needs (a convolutional network, a plain fully-connected stack, ...)
  • the output layer will match the set of actions available to us and give each one its learned value.

That way we trade the memory footprint of the Q-table for the compactness of the NN; it is used the same way, but learning will take more time.

This is what Deep Q-Learning is (2).
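A minimal sketch of such a network (using PyTorch here; the 3-number cell description, the 4 actions, and the hidden sizes are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=3, n_actions=4):  # placeholder dimensions
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 64),   # input layer: the tuple identifying our cell
            nn.ReLU(),
            nn.Linear(64, 64),          # hidden layers: whatever the problem needs
            nn.ReLU(),
            nn.Linear(64, n_actions),   # output layer: one value per available action
        )

    def forward(self, state):
        return self.layers(state)

q_net = QNetwork()
values = q_net(torch.tensor([[0.0, 1.0, 2.0]]))  # shape (1, 4): one score per action
best_action = values.argmax(dim=1)
```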

neural network illustration

Notes

The update rule used to learn the Q-function is derived from the Bellman optimality equation.
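Written out, the Bellman optimality equation and the Q-learning update it leads to (with learning rate $\alpha$ and discount factor $\gamma$) are:

$$Q^*(s, a) = \mathbb{E}\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big]$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$$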

See also

Reference

Written on January 22, 2025, Last update on
NN AI