Q-Learning 🚧

The ability to have agents automatically learn complex policies, solely through environmental rewards - A Step-by-Step Beginner's Guide / HN

Understanding (deep) Q-learning in 2min


Discretize the World

Let’s assume that we are able to discretize our environment as well as the set of actions available to us. Then we can imagine that some function exists that would give the best action possible for a certain situation (when positioned on a cell in our discretized environment).

This would be our Q-function (the Q comes from “the Quality”, i.e. the value, of that action at that state).

If we lived in a discretized environment we can also think of this Q-function as a Q-table:
a table in which each entry is a tuple identifying the cell we are in (in our environment) and maps to a vector of available actions weighted by their score. The best action for this cell is the one with the highest value.

$ [ x, y, z] \Rightarrow ( a_1, a_2, a_3, a_4)$
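As a minimal sketch in Python (the 3-dimensional cell identifier and the four actions are assumptions, purely for illustration), the Q-table is just a lookup from state to one score per action:

    # Minimal Q-table sketch: a state tuple maps to one score per action.
    # The (x, y, z) cell layout and the 4 actions are assumptions for illustration.
    import numpy as np

    n_actions = 4
    q_table = {}  # {(x, y, z): array of n_actions scores}

    def best_action(state):
        scores = q_table.get(state, np.zeros(n_actions))  # unseen states default to 0
        return int(np.argmax(scores))

    q_table[(0, 1, 2)] = np.array([0.1, 0.7, 0.3, 0.0])
    print(best_action((0, 1, 2)))  # -> 1, i.e. a_2 has the highest value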

In this situation there is a simple algorithm to learn that Q-function (and fill the Q-table) by only looking at the reward of our actions (by simulating their outcomes) (1)(2).


The Q-Learning algorithm

The learning algorithm iterates over several episodes to estimate the Q-function, where an episode is a complete sequence of actions leading to a terminal state. Iterating over episodes makes the Q-values flow back to the early states through the update rule shown in the pseudocode below.

Different policies exist to make that process converge (3), ε-greedy being the simplest to implement.

Initialize Q arbitrarily (either 0, or random values)

for each episode
    t = 0
    observe S0
    repeat
        choose an action At using a policy derived from Q (e.g. ε-greedy)
        take action At and observe Rt+1, St+1
        Q(St, At) = Q(St, At) + α * (Rt+1 + γ * max_a Q(St+1, a) - Q(St, At))
        t = t + 1
    until St+1 is terminal
end
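Below is a runnable sketch of that loop in Python, assuming the Gymnasium package and its FrozenLake-v1 environment (any discrete-state, discrete-action environment would do); the hyperparameters are arbitrary choices:

    # Runnable sketch of the loop above, assuming the Gymnasium package and its
    # FrozenLake-v1 environment (any discrete-state, discrete-action env would do).
    import numpy as np
    import gymnasium as gym

    env = gym.make("FrozenLake-v1")
    n_states, n_actions = env.observation_space.n, env.action_space.n

    alpha, gamma, epsilon = 0.1, 0.99, 1.0   # learning rate, discount, exploration rate
    Q = np.zeros((n_states, n_actions))      # Q initialized to 0

    for episode in range(10_000):
        state, _ = env.reset()               # observe S0
        terminated = truncated = False
        while not (terminated or truncated):
            # ε-greedy policy derived from Q
            if np.random.random() < epsilon:
                action = env.action_space.sample()      # explore
            else:
                action = int(np.argmax(Q[state]))       # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q(St,At) = Q(St,At) + α (Rt+1 + γ max_a Q(St+1,a) - Q(St,At))
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
        epsilon = max(0.05, epsilon * 0.999)  # explore less as episodes accumulate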

Learning policy

Similarly to Monte Carlo Tree Search, the idea is to have a strategy that balances exploration (learning new things) vs exploitation (using what we already know).

The ε-greedy policy does that the following way: we start with an initial value of ε = 1.0:

  • With probability 1-ε: we do exploitation (i.e. our agent selects the action with the highest state-action value).
  • With probability ε: we do exploration (trying a random action).

As the number of episodes processed grows, ε can be decreased to explore less and exploit more (similar to the temperature (4) concept found in other algorithms).

Note that ε-greedy is the simplest to tune, but also the worst performing exploration strategy (3). It should be fine as a starting point though.
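As an illustration, here is one common way to decay ε over episodes (an exponential schedule; the bounds and rate are arbitrary assumptions), together with the ε-greedy choice itself:

    # One common decay schedule for ε (exponential), plus the ε-greedy choice itself.
    # eps_max, eps_min and decay_rate are arbitrary values for illustration.
    import numpy as np

    eps_max, eps_min, decay_rate = 1.0, 0.05, 0.005
    rng = np.random.default_rng()

    def epsilon_for(episode):
        # Start fully exploratory (ε = 1.0) and decay toward eps_min over episodes.
        return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * episode)

    def select_action(Q, state, episode, n_actions):
        if rng.random() < epsilon_for(episode):
            return int(rng.integers(n_actions))   # explore: random action
        return int(np.argmax(Q[state]))           # exploit: best known action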

This World is too Big to fit in a table 🚧

Discretizing most worlds this way would require a huge table. Rather than using an explicit table, we can use a neural network to learn an approximation of that Q-function without storing it in a table:

  • the input layer will match the tuple identifying our cell
  • the hidden layers are whatever the problem needs (convolutional network, fully connected layers, etc.)
  • the output layer will match the set of actions available to us and give the learned value of each action.

That way we trade the memory footprint of the Q-table for the compactness of the NN; usage stays the same, but learning will take more time.

This is what Deep Q-Learning is (5).

Neural network illustration
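A minimal sketch of such a network, assuming PyTorch and reusing the 3-dimensional cell tuple and four actions from the earlier example (the hidden sizes are arbitrary); the rest of the DQN machinery (replay buffer, target network, loss) is left out:

    # Minimal Q-network sketch in PyTorch: state in, one Q-value per action out.
    # state_dim, n_actions and the hidden sizes are assumptions for illustration.
    import torch
    import torch.nn as nn

    state_dim, n_actions = 3, 4

    q_net = nn.Sequential(
        nn.Linear(state_dim, 64),   # input layer matches the tuple identifying our cell
        nn.ReLU(),
        nn.Linear(64, 64),          # hidden layers: whatever the problem needs
        nn.ReLU(),
        nn.Linear(64, n_actions),   # output layer: one learned value per action
    )

    state = torch.tensor([[0.0, 1.0, 2.0]])    # a batch containing one state
    q_values = q_net(state)                    # shape (1, n_actions)
    best_action = int(q_values.argmax(dim=1))  # greedy action, same usage as the Q-table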

Notes

The update rule used to learn the Q-function is derived from the Bellman optimality equation:
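For reference, the Q-learning update above is a sampled, incremental version of this equation:

$ Q^*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a \right] $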

see also

  • Introducing Q-Learning
  • Deep Q-Learning (Space Invaders) / HN
  • Q-Learning Tic-Tac-Toe, Briefly
  • DQN vs PPO
    • DQN, and Q-learning in general, are off-policy: while learning, each update can use data collected at any point during training, so the policy used for acting (inference) can differ from the one being updated (training).
    • PPO is on-policy: it only uses data collected from the most up-to-date policy and makes decisions based on that.
    • Both algorithms are model-free, meaning that while training, the agent does not have access to a model of the environment (a function that predicts state transitions and rewards).

References

  1. Q Learning Intro/Table - Reinforcement Learning p.1 - the presentation itself takes many shortcuts that lead to a non-working Q-learning algorithm; look at Introducing Q-Learning (Hugging Face) for a better algorithm.

  2. Q Learning Explained (tutorial) - Monte Carlo vs TD. / Q-Learning

  3. Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes - Focused on the undirected strategies softmax, ε-greedy and pursuit, and compared them to the directed exploration strategy UCB-1. The results show that softmax or Boltzmann exploration outperforms the other strategies, although it is harder to tune its parameters. The easiest techniques to tune are ε-greedy and UCB-1, but ε-greedy performs worst of all exploration strategies.

  4. TBD 

  5. Deep Q Learning w/ DQN - Reinforcement Learning p.5 

Written on January 22, 2025, Last update on April 8, 2025
NN AI in-progress search algorithm