9. Vector Calculus

Vector calculus extends standard calculus to multidimensional spaces. In machine learning and deep learning, most operations involve vectors, matrices, or tensors. Understanding vector-vector derivatives is crucial for algorithms like backpropagation.

Vector-Vector Derivative (The Jacobian)

When we have a vector-valued function $f(x)$ that maps an $n$-dimensional vector $x \in \mathbb{R}^{n}$ to an $m$-dimensional vector $f(x) \in \mathbb{R}^{m}$, the derivative of $f$ with respect to $x$ is a matrix of partial derivatives called the Jacobian matrix.

Let $f(x) = \begin{bmatrix} f_{1}(x) \\ f_{2}(x) \\ \vdots \\ f_{m}(x) \end{bmatrix}$ and $x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{bmatrix}$.

The Jacobian matrix $J$ of $f$ with respect to $x$ is an $m \times n$ matrix defined as:

$$
J = \frac{\partial f}{\partial x} = \begin{bmatrix}
\frac{\partial f_{1}}{\partial x_{1}} & \frac{\partial f_{1}}{\partial x_{2}} & \dots & \frac{\partial f_{1}}{\partial x_{n}} \\
\frac{\partial f_{2}}{\partial x_{1}} & \frac{\partial f_{2}}{\partial x_{2}} & \dots & \frac{\partial f_{2}}{\partial x_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_{m}}{\partial x_{1}} & \frac{\partial f_{m}}{\partial x_{2}} & \dots & \frac{\partial f_{m}}{\partial x_{n}}
\end{bmatrix}
$$

In numerator layout (the most common convention in ML), the $(i, j)$-th entry of the Jacobian is: $J_{i, j} = \frac{\partial f_{i}}{\partial x_{j}}$

- Row $i$ is the gradient of the scalar output $f_{i}$ with respect to the input vector $x$.
- Column $j$ describes how all output components change when the input component $x_{j}$ is perturbed.
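The row and column reading above can be checked numerically. The following is a minimal sketch using only the Python standard library; the example function `f` (from $\mathbb{R}^{2}$ to $\mathbb{R}^{3}$) and the helper names `numerical_jacobian` and `analytic_jacobian` are illustrative assumptions, not part of the original notes. It approximates each column of $J$ by perturbing one input component at a time and compares the result with the hand-derived partial derivatives.

```python
def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the m x n Jacobian of f at x with central differences."""
    m = len(f(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            # Column j: how every output f_i responds to a nudge in x_j.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def f(x):
    """Hypothetical example map from R^2 to R^3."""
    x1, x2 = x
    return [x1 * x2, x1 ** 2, x1 + 3 * x2]


def analytic_jacobian(x):
    """Hand-derived Jacobian of f in numerator layout (3 rows, 2 columns)."""
    x1, x2 = x
    return [
        [x2, x1],        # row 1: gradient of f_1 = x1 * x2
        [2 * x1, 0.0],   # row 2: gradient of f_2 = x1^2
        [1.0, 3.0],      # row 3: gradient of f_3 = x1 + 3 * x2
    ]


x0 = [2.0, -1.0]
J_num = numerical_jacobian(f, x0)
J_exact = analytic_jacobian(x0)

# Largest entry-wise disagreement between the two Jacobians.
max_error = max(
    abs(J_num[i][j] - J_exact[i][j]) for i in range(3) for j in range(2)
)
print(len(J_num), "x", len(J_num[0]), "Jacobian, max error:", max_error)
```

The result has $m = 3$ rows and $n = 2$ columns, matching the $m \times n$ shape given above. Finite differences are used here only as a sanity check; automatic differentiation libraries compute the same matrix exactly.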

Common Vector Derivatives

Here are some of the most common vector derivatives encountered in machine learning:

- Linear transformation (matrix-vector multiplication): If $f(x) = A x$, where $A$ is an $m \times n$ matrix (independent of $x$), then: $\frac{\partial}{\partial x} (A x) = A$
- Dot product: If $f(x) = w^{T} x$, where $w$ is a constant vector, the output is a scalar, so the Jacobian is a $1 \times n$ matrix (a row vector): $\frac{\partial}{\partial x} (w^{T} x) = w^{T}$
- Quadratic form: If $f(x) = x^{T} A x$, where $A$ is an $n \times n$ matrix: $\frac{\partial}{\partial x} (x^{T} A x) = x^{T} (A + A^{T})$. If $A$ is symmetric, this simplifies to $2 x^{T} A$.
- Pointwise / element-wise functions: If $f(x)$ applies a scalar function $g$ element-wise to $x$ (e.g., an activation function like ReLU or sigmoid), then $f_{i}(x) = g(x_{i})$. Because $f_{i}$ depends only on $x_{i}$, all off-diagonal partial derivatives $\frac{\partial f_{i}}{\partial x_{j}}$ for $i \neq j$ are zero. The Jacobian is a diagonal matrix:

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \text{diag}(g'(\mathbf{x})) = \begin{bmatrix}
g'(x_1) & 0 & \dots & 0 \\
0 & g'(x_2) & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & g'(x_n)
\end{bmatrix}
$$
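The identities in this section can be sanity-checked with finite differences. The sketch below uses only the Python standard library. The matrices `A` and `Q`, the test point `x0`, and the helpers `numerical_jacobian`, `matvec`, and `max_abs_diff` are made-up illustrations rather than anything from the original notes. `Q` is deliberately non-symmetric so that the $x^{T} (A + A^{T})$ form is actually exercised, and the sigmoid stands in for a generic element-wise function $g$.

```python
import math


def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference approximation of the m x n Jacobian of f at x."""
    m, n = len(f(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = f(x_plus), f(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def max_abs_diff(J1, J2):
    """Largest entry-wise difference between two equally shaped matrices."""
    return max(abs(a - b) for r1, r2 in zip(J1, J2) for a, b in zip(r1, r2))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


x0 = [0.5, -1.0, 2.0]
n = len(x0)

# 1) Linear map: the Jacobian of A x should be A itself.
A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]
J_linear = numerical_jacobian(lambda x: matvec(A, x), x0)
err_linear = max_abs_diff(J_linear, A)

# 2) Quadratic form: the Jacobian of x^T Q x should be the row x^T (Q + Q^T).
Q = [[2.0, 1.0, 0.0],
     [0.0, 3.0, -1.0],
     [4.0, 0.0, 1.0]]


def quadratic_form(x):
    # Scalar output wrapped in a one-element list so the Jacobian is 1 x n.
    return [sum(x_i * q_i for x_i, q_i in zip(x, matvec(Q, x)))]


J_quadratic = numerical_jacobian(quadratic_form, x0)
expected_quadratic = [
    [sum(x0[i] * (Q[i][j] + Q[j][i]) for i in range(n)) for j in range(n)]
]
err_quadratic = max_abs_diff(J_quadratic, expected_quadratic)

# 3) Element-wise sigmoid: the Jacobian should be diag(g'(x)),
#    where g'(v) = sigmoid(v) * (1 - sigmoid(v)).
J_sigmoid = numerical_jacobian(lambda x: [sigmoid(v) for v in x], x0)
expected_sigmoid = [
    [sigmoid(x0[i]) * (1.0 - sigmoid(x0[i])) if i == j else 0.0
     for j in range(n)]
    for i in range(n)
]
err_sigmoid = max_abs_diff(J_sigmoid, expected_sigmoid)

print(err_linear, err_quadratic, err_sigmoid)
```

All three errors should be close to zero, limited only by floating-point rounding and the finite-difference step. Note that the quadratic-form Jacobian comes out as a $1 \times n$ row, consistent with the numerator layout used throughout.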

## Chain Rule for Vectors

The multivariate chain rule is the foundation of **Backpropagation** in neural networks. If we have nested vector functions $\mathbf{y} = \mathbf{g}(\mathbf{x})$ and $\mathbf{z} = \mathbf{f}(\mathbf{y})$, then the composition is $\mathbf{z} = \mathbf{f}(\mathbf{g}(\mathbf{x}))$. The Jacobian of $\mathbf{z}$ with respect to $\mathbf{x}$ is the matrix product of their individual Jacobians:

$$
\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}
$$

If $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$, and $\mathbf{z} \in \mathbb{R}^p$:

- $\frac{\partial \mathbf{z}}{\partial \mathbf{y}}$ is a $p \times m$ Jacobian matrix.
- $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is an $m \times n$ Jacobian matrix.
- The product is a $p \times n$ matrix, which matches the dimensions of $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}$.

### Application in Backpropagation

In deep learning, the final output $L$ (the loss) is a scalar. During backpropagation, we compute the gradient of the loss with respect to a layer's weights or inputs. Let $L = h(\mathbf{y})$ and $\mathbf{y} = \mathbf{f}(\mathbf{x})$. By the chain rule:

$$
\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}
$$

Here, $\frac{\partial L}{\partial \mathbf{y}}$ is a $1 \times m$ row vector (the gradient from the layer above), and $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $m \times n$ Jacobian of the current layer. The result is a $1 \times n$ row vector, which is the gradient passed down to the next layer in the backward pass (the layer that produced $\mathbf{x}$).
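The row-vector-times-Jacobian product can be sketched for a single layer. This is a minimal illustration under stated assumptions, using only the Python standard library: the layer $\mathbf{y} = \sigma(W \mathbf{x})$, the squared-error loss $L = \frac{1}{2} \lVert \mathbf{y} - \mathbf{t} \rVert^{2}$, and the values of `W`, `target`, and `x0` are all hypothetical choices, not taken from the original notes. The code forms the $1 \times m$ upstream gradient and the $m \times n$ layer Jacobian, multiplies them, and compares the result with a finite-difference gradient of the composed function.

```python
import math

# Hypothetical layer parameters: m = 2 outputs, n = 3 inputs.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
target = [1.0, 0.0]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def layer(x):
    """y = sigmoid(W x), a map from R^n to R^m."""
    return [sigmoid(sum(w * x_j for w, x_j in zip(row, x))) for row in W]


def loss(y):
    """Scalar loss L = 0.5 * ||y - target||^2."""
    return 0.5 * sum((y_i - t_i) ** 2 for y_i, t_i in zip(y, target))


def loss_gradient_row(y):
    """dL/dy as a 1 x m row vector: (y - target)^T."""
    return [y_i - t_i for y_i, t_i in zip(y, target)]


def layer_jacobian(x):
    """dy/dx as an m x n matrix: diag(sigmoid'(W x)) W."""
    y = layer(x)
    # sigmoid'(v) = sigmoid(v) * (1 - sigmoid(v)) = y_i * (1 - y_i)
    return [[y_i * (1.0 - y_i) * w for w in row] for y_i, row in zip(y, W)]


def backprop_to_input(x):
    """dL/dx = (dL/dy) (dy/dx): a 1 x m row times an m x n matrix."""
    y = layer(x)
    upstream = loss_gradient_row(y)
    J = layer_jacobian(x)
    return [
        sum(upstream[i] * J[i][j] for i in range(len(y)))
        for j in range(len(x))
    ]


def finite_difference_gradient(x, eps=1e-6):
    """Central-difference gradient of the composed function loss(layer(x))."""
    grad = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        grad.append((loss(layer(x_plus)) - loss(layer(x_minus))) / (2 * eps))
    return grad


x0 = [0.2, -0.4, 0.1]
grad_chain = backprop_to_input(x0)
grad_fd = finite_difference_gradient(x0)
max_error = max(abs(a - b) for a, b in zip(grad_chain, grad_fd))
print(grad_chain, max_error)
```

In practice, frameworks rarely materialize the full Jacobian as done here; they compute the vector-Jacobian product directly, which for an element-wise activation avoids building the diagonal matrix at all. The explicit version is shown only to mirror the dimensions in the formula above.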
