Deep Reinforcement Learning (DRL) combines Reinforcement Learning with Deep Learning. It uses Neural Networks to approximate functions in RL, allowing agents to solve tasks with high-dimensional state spaces (like pixels or sensor readings).
Why Deep Learning?
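In many real-world problems, the state space is too large for a traditional Q-table. For example, in an Atari game, the “state” is the raw pixel buffer, and the number of distinct frames grows exponentially with the number of pixels: $n$ pixels with $k$ possible values each give $k^n$ possible states. Deep Learning lets us replace the table with a learned function that maps these high-dimensional inputs to actions or values.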
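To get a feel for the scale, the short sketch below computes the order of magnitude of $k^n$ without building the huge integer. The frame size and number of intensity levels are illustrative assumptions (a hypothetical 84 x 84 grayscale frame with 256 levels), not figures taken from this text.

```python
import math


def log10_state_count(num_pixels: int, values_per_pixel: int) -> float:
    """Return log10(values_per_pixel ** num_pixels) using logarithms only."""
    return num_pixels * math.log10(values_per_pixel)


# Assumed, illustrative frame: 84 x 84 pixels, 256 intensity levels per pixel.
exponent = log10_state_count(84 * 84, 256)
print(f"about 10^{exponent:.0f} distinct frames")
```

A Q-table would need one row per frame, which is why a function approximator with a fixed number of parameters is used instead.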
Key Architectures
1. Deep Q-Networks (DQN)
DQN was the first major DRL success (DeepMind, 2013). It uses a convolutional neural network (CNN) to estimate the Q-value of every action directly from game frames.
- Experience Replay: Stores transitions in a buffer and samples random batches for training to break data correlations.
- Target Network: Uses a separate, slowly updated copy of the Q-network to calculate the TD target, which stabilizes training because the target no longer shifts with every gradient step.
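The two mechanisms above can be sketched in a few lines of plain Python. This is a minimal illustration, not the original DQN implementation: the names `Transition`, `ReplayBuffer`, `td_targets`, and `sync_target` are hypothetical, parameters are plain lists of floats rather than network weights, and the target network is represented by any callable that returns one Q-value per action.

```python
import random
from collections import deque
from typing import Callable, List, NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]
    done: bool


class ReplayBuffer:
    """Fixed-capacity store; the oldest transitions are evicted first."""

    def __init__(self, capacity: int) -> None:
        self._storage = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement breaks temporal correlations.
        return random.sample(list(self._storage), batch_size)


def td_targets(
    batch: Sequence[Transition],
    target_q: Callable[[Sequence[float]], Sequence[float]],
    gamma: float,
) -> List[float]:
    """Compute y = r + gamma * max_a Q_target(s', a), with no bootstrap at episode end."""
    targets = []
    for t in batch:
        bootstrap = 0.0 if t.done else gamma * max(target_q(t.next_state))
        targets.append(t.reward + bootstrap)
    return targets


def sync_target(online: List[float], target: List[float], tau: float = 1.0) -> List[float]:
    """tau=1.0 is a hard copy; tau<1.0 blends slowly toward the online parameters."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]


def frozen_q(state: Sequence[float]) -> List[float]:
    # Stand-in for a target network with two actions (assumed constant here).
    return [0.5, 2.0]


buffer = ReplayBuffer(capacity=1000)
buffer.push(Transition([0.0], 0, 1.0, [1.0], False))
buffer.push(Transition([1.0], 1, 0.0, [2.0], True))
targets = td_targets(buffer.sample(2), frozen_q, gamma=0.9)
```

In a full agent the online network would be regressed toward `targets`, and `sync_target` would be called either every fixed number of steps (hard copy) or every step with a small `tau`.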
2. Policy Gradient Methods
Instead of learning a value function, these methods directly learn a policy network $\pi_\theta(a \mid s)$, which maps a state to a probability distribution over actions.
- Advantage: Can learn stochastic policies and work well in continuous action spaces.
- PPO (Proximal Policy Optimization): A widely used policy gradient algorithm that stabilizes training by clipping each update so the new policy stays close to the previous one.
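As a rough illustration of the ideas in this list, the sketch below shows a stochastic softmax policy over discrete actions and the clipped surrogate objective used by PPO. The function names and the default `clip_eps=0.2` are assumptions chosen for the example; a real implementation would compute these quantities on network outputs and maximize the objective with a gradient-based optimizer.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs: Sequence[float], rng: random.Random) -> int:
    """Draw an action index at random according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def ppo_clipped_objective(
    new_probs: Sequence[float],
    old_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r = new / old."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


probs = softmax([1.0, 2.0, 0.5])
action = sample_action(probs, random.Random(0))
objective = ppo_clipped_objective([0.9], [0.5], [1.0])
```

With a positive advantage, the probability ratio of 1.8 in the last line is capped at 1.2, so the policy gains nothing from moving further away from the old policy in a single update.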
3. Actor-Critic Methods
A hybrid approach:
- Actor: Learns the policy (how to act).
- Critic: Learns the value function (how good the current state/action is).
- The critic's value estimate provides feedback to the actor (for example, a TD error or advantage) that scales and improves the policy update.
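The interaction between the two components can be sketched as a one-step, tabular actor-critic update. This is a simplified illustration under stated assumptions: states are integer keys, the actor is a table of softmax action preferences, the critic is a table of state values, and the function name `actor_critic_step` and the learning rates are hypothetical. In DRL both tables would be replaced by neural networks.

```python
import math
from typing import Dict, List, Sequence


def softmax(prefs: Sequence[float]) -> List[float]:
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(
    values: Dict[int, float],
    prefs: Dict[int, List[float]],
    state: int,
    action: int,
    reward: float,
    next_state: int,
    done: bool,
    gamma: float = 0.99,
    actor_lr: float = 0.1,
    critic_lr: float = 0.1,
) -> float:
    """Apply one critic update and one actor update in place; return the TD error."""
    # Critic: the TD error measures how much better or worse the outcome was than expected.
    bootstrap = 0.0 if done else gamma * values[next_state]
    td_error = reward + bootstrap - values[state]

    # Actor: the gradient of log softmax w.r.t. preference a is one_hot(action)[a] - probs[a].
    probs = softmax(prefs[state])
    for a in range(len(probs)):
        grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
        prefs[state][a] += actor_lr * td_error * grad_log_prob

    values[state] += critic_lr * td_error
    return td_error


values = {0: 0.0, 1: 0.0}
prefs = {0: [0.0, 0.0], 1: [0.0, 0.0]}
delta = actor_critic_step(values, prefs, state=0, action=1, reward=1.0, next_state=1, done=True)
```

Here the TD error plays the role of the critic's feedback: a positive value raises the preference for the action just taken, and a negative value lowers it.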
Challenges in DRL
- Sample Inefficiency: DRL often requires millions of interactions to learn.
- Instability: Small changes in neural network weights can lead to huge changes in behavior.
- Reward Engineering: Designing a reward function that accurately reflects the goal without leading to “reward hacking.”