Markov Decision Processes

A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker.

The Markov Property

A state $S_{t}$ is Markov if and only if $P (S_{t + 1} ∣ S_{t}) = P (S_{t + 1} ∣ S_{1}, ..., S_{t})$. In other words, the future is independent of the past given the present. The current state captures all relevant information from the history.

Formal Definition

An MDP is defined by a 5-tuple $(S, A, P, R, γ)$:

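$S$: A finite set of states.
$A$: A finite set of actions.
$P$: A state transition probability function $P (s^{'} ∣ s, a)$, the probability of moving to state $s^{'}$ after taking action $a$ in state $s$ (one transition matrix per action).
$R$: A reward function $R (s, a)$, the expected immediate reward for taking action $a$ in state $s$. When the reward also depends on the next state it is written $R (s, a, s^{'})$, with $R (s, a) = \sum_{s^{'} \in S} P (s^{'} ∣ s, a) R (s, a, s^{'})$.
$γ$: A discount factor $γ \in [0, 1]$, which determines how much future rewards count relative to immediate ones: $γ = 0$ considers only the next reward, while values close to 1 weight distant rewards almost as heavily as immediate ones.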
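As a rough illustration of the tuple above, a small MDP can be written down with plain Python containers. The two-state, two-action MDP below is hypothetical: every probability and reward is made up, and the helper name `is_valid_mdp` is an assumption introduced for this sketch.

```python
# Hypothetical 2-state, 2-action MDP; every number here is made up.
states = [0, 1]
actions = [0, 1]

# P[s][a][s2] = P(s2 | s, a); each list must be a probability distribution.
P = {
    0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
    1: {0: [0.2, 0.8], 1: [0.0, 1.0]},
}

# R[s][a] = expected immediate reward for taking action a in state s.
R = {
    0: {0: 1.0, 1: 0.0},
    1: {0: 0.0, 1: 2.0},
}

gamma = 0.9  # discount factor


def is_valid_mdp(states, actions, P, R, gamma):
    """Check that gamma is in [0, 1] and each P[s][a] is a distribution over states."""
    if not 0.0 <= gamma <= 1.0:
        return False
    for s in states:
        for a in actions:
            row = P[s][a]
            if len(row) != len(states):
                return False
            if any(p < 0 for p in row):
                return False
            if abs(sum(row) - 1.0) > 1e-9:
                return False
            if a not in R[s]:
                return False
    return True
```

Calling `is_valid_mdp(states, actions, P, R, gamma)` on the example returns `True`. It returns `False` if any row of `P` fails to sum to 1.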

Goal of an MDP

The goal is to find a policy $π (a ∣ s)$ that maximizes the expected return $E_{π} [G_{t}]$, where the return $G_{t}$ is the discounted sum of future rewards: $G_{t} = R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + ... = \sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1}$
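For a finite sequence of observed rewards, the return can be computed directly from the sum above. The sketch below assumes `rewards[k]` stands for $R_{t + k + 1}$ and that the episode ends after the last listed reward. The function name `discounted_return` is chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    Works backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)  # 1.75
```

The backward loop avoids computing powers of $γ$ explicitly. It uses the same recursion that underlies the Bellman equation.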

Value Functions

State-Value Function $V_{π} (s)$

The expected return when starting from state $s$ and following policy $π$: $V_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$

Action-Value Function $Q_{π} (s, a)$

The expected return when starting from state $s$, taking action $a$, and then following policy $π$: $Q_{π} (s, a) = E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a]$
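The two value functions are linked by a one-step lookahead: $Q_{π} (s, a) = \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ V_{π} (s^{'})]$ and $V_{π} (s) = \sum_{a} π (a ∣ s) Q_{π} (s, a)$. The sketch below applies these two identities to nested lists. The helper names `q_from_v` and `v_from_q` and the example numbers are assumptions. The `V` used in the example is an arbitrary estimate, not the true value function of any policy.

```python
def q_from_v(P, R, V, gamma):
    """Q[s][a] = sum over s2 of P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])."""
    n = len(P)
    return [
        [
            sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(n))
            for a in range(len(P[s]))
        ]
        for s in range(n)
    ]


def v_from_q(Q, policy):
    """V[s] = sum over a of policy[s][a] * Q[s][a], with policy[s][a] = pi(a | s)."""
    return [
        sum(policy[s][a] * Q[s][a] for a in range(len(Q[s])))
        for s in range(len(Q))
    ]


# Hypothetical deterministic MDP: in state 0, action 0 stays and action 1
# moves to state 1 with reward 1; state 1 is absorbing with zero reward.
P_ex = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R_ex = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
V_ex = [1.0, 2.0]  # arbitrary value estimate, for arithmetic only
uniform = [[0.5, 0.5], [0.5, 0.5]]

Q_ex = q_from_v(P_ex, R_ex, V_ex, gamma=0.5)
V_back = v_from_q(Q_ex, uniform)
```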

The Bellman Equation

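The Bellman expectation equation expresses the value of a state recursively, as the expected immediate reward plus the discounted value of the successor state: $V_{π} (s) = \sum_{a \in A} π (a ∣ s) \sum_{s^{'} \in S} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ V_{π} (s^{'})]$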
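One common way to use this equation is iterative policy evaluation: start from $V = 0$ and apply the right-hand side as an update until the values stop changing. The sketch below is a minimal version under stated assumptions: states and actions are integer indices, `P[s][a][s2]` holds $P (s^{'} ∣ s, a)$, `R[s][a][s2]` holds $R (s, a, s^{'})$, and `policy[s][a]` holds $π (a ∣ s)$. The function name, the tolerance, and the example MDPs are all made up for illustration.

```python
def policy_evaluation(P, R, policy, gamma, tol=1e-10, max_iters=10_000):
    """Approximate V_pi by repeatedly applying the Bellman expectation backup."""
    n = len(P)
    V = [0.0] * n
    for _ in range(max_iters):
        V_new = [
            sum(
                policy[s][a]
                * sum(
                    P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                    for s2 in range(n)
                )
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        # Stop once the largest change across states is below the tolerance.
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


# One state, one action, self-loop with reward 1: V = 1 / (1 - gamma).
V_loop = policy_evaluation([[[1.0]]], [[[1.0]]], [[1.0]], gamma=0.5)

# State 0 moves to absorbing state 1 and earns reward 1 on the way.
P_chain = [[[0.0, 1.0]], [[0.0, 1.0]]]
R_chain = [[[0.0, 1.0]], [[0.0, 0.0]]]
V_chain = policy_evaluation(P_chain, R_chain, [[1.0], [1.0]], gamma=0.9)
```

For $γ < 1$ the update is a contraction, so the iteration converges to $V_{π}$. With $γ = 1$ convergence is not guaranteed in general, which is why the sketch also caps the number of iterations.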
