Q-Learning 🚧

The ability to have agents automatically learn complex policies, solely through environmental rewards - A Step-by-Step Beginner's Guide / HN

Understanding (deep) Q-learning in 2min


Discretize the World

Let’s assume that we are able to discretize our environment as well as the set of actions available to us. Then we can imagine that some function exists that would give the best action possible for a certain situation (when positioned on a cell in our discretized environment).

This would be our Q-function (the Q comes from “the Quality”, i.e. the value, of that action at that state).

If we lived in a discretized environment we can also think of this Q-function as a Q-table:
a table in which each entry is a tuple identifying the cell we are in (in our environment) and maps to a vector of available actions weighted by their score. The best action for this cell is the one with the highest value.

$ [ x, y, z] \Rightarrow ( a_1, a_2, a_3, a_4)$
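As a minimal sketch in Python (the 3-dimensional cell identifier and the four actions are assumptions, purely for illustration), the Q-table is just a lookup from state to one score per action:

    # Minimal Q-table sketch: a state tuple maps to one score per action.
    # The (x, y, z) cell layout and the 4 actions are assumptions for illustration.
    import numpy as np

    n_actions = 4
    q_table = {}  # {(x, y, z): array of n_actions scores}

    def best_action(state):
        scores = q_table.get(state, np.zeros(n_actions))  # unseen states default to 0
        return int(np.argmax(scores))

    q_table[(0, 1, 2)] = np.array([0.1, 0.7, 0.3, 0.0])
    print(best_action((0, 1, 2)))  # -> 1, i.e. a_2 has the highest value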

In this situation there is a simple algorithm to learn that Q-function (and fill the Q-table) by only looking at the reward of our actions (by simulating their outcomes) (1)(2).


The Q-Learning algorithm

The learning algorithm iterates over several episodes to estimate the Q-function, where an episode is a complete sequence of actions leading to a terminal state. Iterating over episodes makes the Q-values flow back to the early states through the update rule shown in the pseudocode below.

Different policies exist to make that process converge (3), ε-greedy being the simplest to implement.

Initialize Q arbitrarily (either 0, or random values)

for each episode
    t = 0
    observe S0
    repeat
        choose an action At using a policy derived from Q (e.g. ε-greedy)
        take action At and observe Rt+1, St+1
        Q(St, At) = Q(St, At) + α * (Rt+1 + γ * max_a Q(St+1, a) - Q(St, At))
        t = t + 1
    until St+1 is terminal
end
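Below is a runnable sketch of that loop in Python, assuming the Gymnasium package and its FrozenLake-v1 environment (any discrete-state, discrete-action environment would do); the hyperparameters are arbitrary choices:

    # Runnable sketch of the loop above, assuming the Gymnasium package and its
    # FrozenLake-v1 environment (any discrete-state, discrete-action env would do).
    import numpy as np
    import gymnasium as gym

    env = gym.make("FrozenLake-v1")
    n_states, n_actions = env.observation_space.n, env.action_space.n

    alpha, gamma, epsilon = 0.1, 0.99, 1.0   # learning rate, discount, exploration rate
    Q = np.zeros((n_states, n_actions))      # Q initialized to 0

    for episode in range(10_000):
        state, _ = env.reset()               # observe S0
        terminated = truncated = False
        while not (terminated or truncated):
            # ε-greedy policy derived from Q
            if np.random.random() < epsilon:
                action = env.action_space.sample()      # explore
            else:
                action = int(np.argmax(Q[state]))       # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q(St,At) = Q(St,At) + α (Rt+1 + γ max_a Q(St+1,a) - Q(St,At))
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
        epsilon = max(0.05, epsilon * 0.999)  # explore less as episodes accumulate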

Learning policy

Similarly to Monte Carlo Tree Search, the idea is to have a strategy that balances exploration (learning new things) vs exploitation (using what we already know).

The ε-greedy policy does that the following way: we start with an initial value of ε = 1.0:

  • With probability 1-ε: we do exploitation (i.e. our agent selects the action with the highest state-action value).
  • With probability ε: we do exploration (trying a random action).

As the number of episodes processed grows, ε can be decreased to explore less and exploit more (similar to the temperature (4) concept found in other algorithms).

Note that ε-greedy is the simplest to tune, but also the worst performing exploration strategy (3). It should be fine as a starting point though.
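As an illustration, here is one common way to decay ε over episodes (an exponential schedule; the bounds and rate are arbitrary assumptions), together with the ε-greedy choice itself:

    # One common decay schedule for ε (exponential), plus the ε-greedy choice itself.
    # eps_max, eps_min and decay_rate are arbitrary values for illustration.
    import numpy as np

    eps_max, eps_min, decay_rate = 1.0, 0.05, 0.005
    rng = np.random.default_rng()

    def epsilon_for(episode):
        # Start fully exploratory (ε = 1.0) and decay toward eps_min over episodes.
        return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * episode)

    def select_action(Q, state, episode, n_actions):
        if rng.random() < epsilon_for(episode):
            return int(rng.integers(n_actions))   # explore: random action
        return int(np.argmax(Q[state]))           # exploit: best known action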

This World is too Big to fit in a table 🚧

Discretizing most worlds this way would require a huge table. Rather than using an explicit table, we can use a neural network to learn an approximation of that Q-function without storing it in a table:

  • the input layer will match the tuple identifying our cell
  • the hidden layers are whatever the problem needs (convolutional network, fully connected layers, etc.)
  • the output layer will match the set of actions available to us and give the learned value of each action.

That way we trade the memory footprint of the Q-table for the compactness of the NN; usage stays the same, but learning will take more time.

This is what Deep Q-Learning is (5).

Neural network illustration
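A minimal sketch of such a network, assuming PyTorch and reusing the 3-dimensional cell tuple and four actions from the earlier example (the hidden sizes are arbitrary); the rest of the DQN machinery (replay buffer, target network, loss) is left out:

    # Minimal Q-network sketch in PyTorch: state in, one Q-value per action out.
    # state_dim, n_actions and the hidden sizes are assumptions for illustration.
    import torch
    import torch.nn as nn

    state_dim, n_actions = 3, 4

    q_net = nn.Sequential(
        nn.Linear(state_dim, 64),   # input layer matches the tuple identifying our cell
        nn.ReLU(),
        nn.Linear(64, 64),          # hidden layers: whatever the problem needs
        nn.ReLU(),
        nn.Linear(64, n_actions),   # output layer: one learned value per action
    )

    state = torch.tensor([[0.0, 1.0, 2.0]])    # a batch containing one state
    q_values = q_net(state)                    # shape (1, n_actions)
    best_action = int(q_values.argmax(dim=1))  # greedy action, same usage as the Q-table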

Notes

The update rule used to learn the Q-function is derived from the Bellman optimality equation:
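For reference, the Q-learning update above is a sampled, incremental version of this equation:

$ Q^*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a \right] $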

see also

  • Introducing Q-Learning
  • Deep Q-Learning (Space Invaders) / HN
  • Q-Learning Tic-Tac-Toe, Briefly
  • DQN vs PPO
    • DQN, and Q-learning in general, are off-policy: while learning, each update can use data collected at any point during training, so the policy used for acting (inference) can differ from the one being updated (training).
    • PPO is on-policy: it only uses data collected from the most up-to-date policy and makes decisions based on that.
    • Both algorithms are model-free, meaning that while training, the agent does not have access to a model of the environment (a function that predicts state transitions and rewards).

References

  1. Q Learning Intro/Table - Reinforcement Learning p.1 - the presentation itself takes many shortcuts that lead to a non-working Q-learning algorithm; look at Introducing Q-Learning (Hugging Face) for a better algorithm.

  2. Q Learning Explained (tutorial) - Monte Carlo vs TD. / Q-Learning

  3. Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes - Focused on the undirected strategies softmax, ε-greedy and pursuit, and compared them to the directed exploration strategy UCB-1. The results show that softmax or Boltzmann exploration outperforms the other strategies, although it is harder to tune its parameters. The easiest techniques to tune are ε-greedy and UCB-1, but ε-greedy performs worst of all exploration strategies.

  4. TBD 

  5. Deep Q Learning w/ DQN - Reinforcement Learning p.5 

Written on January 22, 2025, Last update on April 8, 2025
NN AI in-progress search algorithm