A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.
The Markov Property
A state is Markov if and only if: In other words, the future is independent of the past given the present. The current state captures all relevant information from the history.
Formal Definition
An MDP is defined by a 5-tuple :
- : A finite set of states.
- : A finite set of actions.
- : A state transition probability matrix .
- : A reward function .
- : A discount factor , which determines the importance of future rewards.
Goal of an MDP
The goal is to find a Policy that maximizes the Expected Return :
Value Functions
State-Value Function
The expected return starting from state and following policy .
Action-Value Function
The expected return starting from state , taking action , and then following policy .
The Bellman Equation
The fundamental recursive relationship for value functions: