A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

The Markov Property

A state is Markov if and only if: In other words, the future is independent of the past given the present. The current state captures all relevant information from the history.

Formal Definition

An MDP is defined by a 5-tuple :

  • : A finite set of states.
  • : A finite set of actions.
  • : A state transition probability matrix .
  • : A reward function .
  • : A discount factor , which determines the importance of future rewards.

Goal of an MDP

The goal is to find a Policy that maximizes the Expected Return :

Value Functions

State-Value Function

The expected return starting from state and following policy .

Action-Value Function

The expected return starting from state , taking action , and then following policy .

The Bellman Equation

The fundamental recursive relationship for value functions:

See Also