Q-Learning is an Off-Policy, Model-Free Temporal Difference (TD) learning algorithm used to find the optimal action-selection policy for a given finite Markov Decision Process (MDP).

The Core Idea

The goal of Q-learning is to learn the Optimal Action-Value Function $Q^*(s, a)$, which represents the maximum expected discounted future reward for taking action $a$ in state $s$ and acting optimally afterwards.

The Update Rule

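The Q-values are updated iteratively using the Bellman optimality equation as an update target:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Where:

  • $\alpha$: The learning rate.
  • $\gamma$: The discount factor.
  • $r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$: The TD Target (the reward plus the discounted estimate of future value).
  • $r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)$: The TD Error.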
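
As a rough illustration of the update rule, the sketch below applies a single update to a small table stored as a list of lists. The function name `q_update`, the `done` flag for terminal transitions, and the numeric values are assumptions made for this example, not part of the original description.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped value of the next state; zero if the episode has ended.
    best_next = 0.0 if done else max(Q[s_next])
    td_target = r + gamma * best_next      # reward plus discounted estimate
    td_error = td_target - Q[s][a]         # how far the current estimate is off
    Q[s][a] += alpha * td_error            # move the estimate toward the target
    return td_error


# Hypothetical table: 2 states x 2 actions, with one non-zero estimate.
Q = [[0.0, 0.0], [0.0, 1.0]]
delta = q_update(Q, s=0, a=1, r=0.5, s_next=1)
```

Here the TD Target is 0.5 + 0.9 × 1.0, so only the entry `Q[0][1]` moves, by a fraction `alpha` of the TD Error.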

Key Characteristics

Off-Policy Learning

Q-Learning is “Off-Policy” because it updates the Q-value based on the greedy action (the maximum possible Q-value in the next state), even if the agent actually took a different (e.g., exploratory) action.
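
The separation between how the agent acts and what it learns from can be sketched as two small functions. This is a minimal illustration that assumes an epsilon-greedy rule as the exploratory behavior; the names `epsilon_greedy` and `td_target` and the sample values are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])


def td_target(q_next_row, r, gamma):
    """Learning target: always uses the best next action, whatever is taken next."""
    return r + gamma * max(q_next_row)


rng = random.Random(0)
q_next = [0.2, 1.0, -0.3]          # hypothetical Q-values in the next state

# The action the agent actually takes next may be exploratory...
a_next = epsilon_greedy(q_next, epsilon=0.5, rng=rng)
# ...but the update target never looks at a_next, only at the maximum.
target = td_target(q_next, r=0.0, gamma=0.9)
```

Note that `td_target` takes no action argument at all, which is the off-policy property in miniature.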

Convergence

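Provided every state-action pair is visited infinitely often and the learning rate decays appropriately (the step sizes sum to infinity while their squares sum to a finite value), Q-Learning is guaranteed to converge to the optimal Q-values with probability 1 for finite state and action spaces.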
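
One schedule often used to satisfy the decaying-learning-rate requirement is a per-pair step size of 1/n, where n counts the visits to that state-action pair. The sketch below is only an illustration; the `visits` counter and the `learning_rate` helper are assumed names.

```python
from collections import defaultdict

visits = defaultdict(int)   # number of updates seen so far for each (state, action)


def learning_rate(s, a):
    """Return 1/n on the n-th visit to the pair (s, a)."""
    visits[(s, a)] += 1
    return 1.0 / visits[(s, a)]


# Step sizes for four successive visits to the same pair.
rates = [learning_rate(0, 1) for _ in range(4)]
```

In practice a constant or slowly decaying rate is often preferred for speed, at the cost of the formal guarantee.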

Q-Table

In its simplest form, Q-learning maintains a table (the Q-Table) where rows represent states and columns represent actions. This works well for environments with small, discrete state spaces, but the table needs one entry per state-action pair, so it becomes impractical in high-dimensional spaces and cannot represent continuous spaces without discretization.
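
To make the table concrete, here is a minimal sketch of tabular Q-learning on a made-up five-state chain where the agent starts at the left end and is rewarded for reaching the right end. The environment (`step`), the hyperparameters, and the random tie-breaking are all assumptions chosen for the example.

```python
import random

N_STATES, N_ACTIONS = 5, 2      # hypothetical chain; actions: 0 = left, 1 = right
GOAL = N_STATES - 1


def step(s, a):
    """Hypothetical deterministic environment: reward 1 on reaching the goal."""
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    done = s_next == GOAL
    return s_next, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # The Q-Table: one row per state, one column per action.
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(N_ACTIONS)                 # explore
            else:
                best = max(Q[s])                             # exploit, ties broken at random
                a = rng.choice([i for i in range(N_ACTIONS) if Q[s][i] == best])
            s_next, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q


Q = train()
# Greedy action in each non-goal state after training.
greedy = [max(range(N_ACTIONS), key=lambda i: Q[s][i]) for s in range(GOAL)]
```

With these settings the greedy policy read off the table moves right in every non-goal state, and the table itself holds only 5 × 2 entries, which is what makes this form feasible for small problems.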

Deep Q-Learning

To handle complex environments (like Atari games), the Q-table is replaced with a Neural Network (a Deep Q-Network or DQN) that approximates the Q-values.
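
The step from a table to a function approximator can be sketched with the simplest possible stand-in: a linear map from a state's feature vector to one Q-value per action, trained with a semi-gradient step on the TD Error. This is not a DQN. A real DQN uses a deep network in place of the linear map and typically adds components such as experience replay and a separate target network. All names and values below are assumptions for illustration.

```python
def q_values(weights, features):
    """One Q-value per action: dot product of that action's weights with the features."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def semi_gradient_step(weights, phi, a, r, phi_next, done, alpha=0.1, gamma=0.9):
    """Adjust the weights for action `a` along the TD error and return that error."""
    best_next = 0.0 if done else max(q_values(weights, phi_next))
    td_error = r + gamma * best_next - q_values(weights, phi)[a]
    # For a linear model, the gradient of Q(s, a) w.r.t. its weights is phi itself.
    weights[a] = [w + alpha * td_error * x for w, x in zip(weights[a], phi)]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]   # hypothetical: 2 actions x 2 features
err = semi_gradient_step(
    weights, phi=[1.0, 0.5], a=1, r=1.0, phi_next=[0.0, 1.0], done=False
)
```

Because the weights are shared across states, one update changes the estimate for every state with overlapping features, which is what lets an approximator generalize where a table cannot.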

See Also