Vector calculus extends standard calculus to multidimensional spaces. In machine learning and deep learning, most operations involve vectors, matrices, or tensors. Understanding vector-vector derivatives is crucial for algorithms like backpropagation.
## Vector-Vector Derivative (The Jacobian)
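When we have a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$ that maps an $n$-dimensional vector $\mathbf{x}$ to an $m$-dimensional vector $\mathbf{f}(\mathbf{x})$, the derivative of $\mathbf{f}$ with respect to $\mathbf{x}$ is a matrix of partial derivatives called the **Jacobian matrix**.

Let $\mathbf{x} = [x_1, x_2, \dots, x_n]^T$ and $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_m(\mathbf{x})]^T$.

The Jacobian matrix of $\mathbf{f}$ with respect to $\mathbf{x}$ is an $m \times n$ matrix defined as:

$$ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \dots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \dots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} $$

In numerator layout (the most common convention in ML), the $(i, j)$-th entry of the Jacobian is:

$$ \left[ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right]_{ij} = \frac{\partial f_i}{\partial x_j} $$

- Row $i$ represents the gradient of the scalar output $f_i$ with respect to the input vector $\mathbf{x}$.
- Column $j$ represents how all output components change when the input component $x_j$ is perturbed.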
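
The definition can be checked numerically. The sketch below is a minimal illustration using only the Python standard library: the example map `f` from $\mathbb{R}^2$ to $\mathbb{R}^3$, the helper names `numerical_jacobian` and `analytic_jacobian`, and the step size `eps` are all assumptions chosen for this example, not part of any particular library. It fills the Jacobian one column at a time by perturbing a single input component, then compares the result with the hand-derived matrix of partial derivatives.

```python
import math

def f(x):
    """Illustrative map from R^2 to R^3 (chosen for this example only)."""
    x1, x2 = x
    return [x1 * x2, math.sin(x1), x2 ** 2]

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference estimate of the m x n Jacobian of func at x."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            # Column j: response of every output f_i to a perturbation of x_j.
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def analytic_jacobian(x):
    """Hand-derived Jacobian of f: entry (i, j) is d f_i / d x_j."""
    x1, x2 = x
    return [
        [x2, x1],             # gradient of f_1 = x1 * x2
        [math.cos(x1), 0.0],  # gradient of f_2 = sin(x1)
        [0.0, 2 * x2],        # gradient of f_3 = x2^2
    ]

x0 = [1.0, 2.0]
J_num = numerical_jacobian(f, x0)
J_ana = analytic_jacobian(x0)
max_err = max(abs(p - q) for rp, rq in zip(J_num, J_ana) for p, q in zip(rp, rq))

print(len(J_num), len(J_num[0]))  # number of rows (m outputs), number of columns (n inputs)
print(max_err)                    # discrepancy between the two Jacobians
```

Finite differences are used here only as an independent check; automatic differentiation frameworks compute the same quantities without the approximation error.
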
## Common Vector Derivatives
Here are some of the most common vector derivatives encountered in machine learning:
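- **Linear transformation (Matrix-Vector multiplication):** If $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$, where $\mathbf{A}$ is an $m \times n$ matrix (independent of $\mathbf{x}$), then:

  $$ \frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = \mathbf{A} $$

- **Dot Product:** If $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x}$, where $\mathbf{a}$ is a constant vector, the output is a scalar, so the Jacobian is a $1 \times n$ matrix (a row vector):

  $$ \frac{\partial (\mathbf{a}^T \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}^T $$

- **Quadratic Form:** If $f(\mathbf{x}) = \mathbf{x}^T \mathbf{A} \mathbf{x}$, where $\mathbf{A}$ is an $n \times n$ matrix:

  $$ \frac{\partial (\mathbf{x}^T \mathbf{A} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{x}^T (\mathbf{A} + \mathbf{A}^T) $$

  If $\mathbf{A}$ is symmetric, this simplifies to $2\mathbf{x}^T \mathbf{A}$.

- **Pointwise / Element-wise Functions:** If $\mathbf{f}$ applies a scalar function $g$ element-wise to $\mathbf{x}$ (e.g., an activation function like ReLU or Sigmoid), then $f_i(\mathbf{x}) = g(x_i)$. Because $f_i$ only depends on $x_i$, all off-diagonal partial derivatives $\frac{\partial f_i}{\partial x_j}$ for $i \neq j$ are zero. The Jacobian is a diagonal matrix:

  $$ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \text{diag}(g'(\mathbf{x})) = \begin{bmatrix} g'(x_1) & 0 & \dots & 0 \\ 0 & g'(x_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & g'(x_n) \end{bmatrix} $$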
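
The four identities above can be sanity-checked against finite differences. The following is a sketch under stated assumptions: the helper names (`numerical_jacobian`, `matvec`, `dot`, `max_abs_diff`), the test point `x`, and the matrices `A` and `B` are hypothetical values picked for illustration, and the sigmoid stands in for the generic element-wise function $g$. The square matrix of the quadratic form is called `B` in the code to keep it distinct from the rectangular `A` of the linear map.

```python
import math

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference estimate of the m x n Jacobian of func at x."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def max_abs_diff(P, Q):
    """Largest entry-wise difference between two equally shaped matrices."""
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

x = [0.3, -1.2, 2.0]
n = len(x)

# 1. Linear map f(x) = A x: the Jacobian should equal A itself (2 x 3 here).
A = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]
err_linear = max_abs_diff(numerical_jacobian(lambda v: matvec(A, v), x), A)

# 2. Dot product f(x) = a^T x: the Jacobian should be the 1 x n row vector a^T.
a = [2.0, -1.0, 0.5]
err_dot = max_abs_diff(numerical_jacobian(lambda v: [dot(a, v)], x), [a])

# 3. Quadratic form f(x) = x^T B x: the Jacobian should be x^T (B + B^T).
B = [[2.0, 1.0, 0.0], [-1.0, 3.0, 0.5], [4.0, 0.0, 1.0]]  # deliberately non-symmetric
B_plus_Bt = [[B[i][j] + B[j][i] for j in range(n)] for i in range(n)]
# B + B^T is symmetric, so the row vector x^T (B + B^T) has the same entries as (B + B^T) x.
expected_quad = [matvec(B_plus_Bt, x)]
err_quad = max_abs_diff(
    numerical_jacobian(lambda v: [dot(v, matvec(B, v))], x), expected_quad
)

# 4. Element-wise sigmoid: the Jacobian should be diag(g'(x)) with g'(t) = g(t) (1 - g(t)).
s = [sigmoid(t) for t in x]
expected_diag = [[s[i] * (1 - s[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
err_pointwise = max_abs_diff(
    numerical_jacobian(lambda v: [sigmoid(t) for t in v], x), expected_diag
)

print(err_linear, err_dot, err_quad, err_pointwise)  # each should be close to zero
```

Using a non-symmetric `B` matters for the third check: with a symmetric matrix, the general formula and the simplified $2\mathbf{x}^T \mathbf{A}$ form would be indistinguishable.
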
## Chain Rule for Vectors

The multivariate chain rule is the foundation of **Backpropagation** in neural networks. If we have nested vector functions $\mathbf{y} = \mathbf{g}(\mathbf{x})$ and $\mathbf{z} = \mathbf{f}(\mathbf{y})$, then the composition is $\mathbf{z} = \mathbf{f}(\mathbf{g}(\mathbf{x}))$. The Jacobian of $\mathbf{z}$ with respect to $\mathbf{x}$ is the matrix product of their individual Jacobians:

$$ \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} $$

If $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$, and $\mathbf{z} \in \mathbb{R}^p$:

- $\frac{\partial \mathbf{z}}{\partial \mathbf{y}}$ is a $p \times m$ Jacobian matrix.
- $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is an $m \times n$ Jacobian matrix.
- The product is a $p \times n$ matrix, which correctly matches the dimension of $\frac{\partial \mathbf{z}}{\partial \mathbf{x}}$.

### Application in Backpropagation

In deep learning, the final output $L$ (the loss) is a scalar. During backpropagation, we compute the gradient of the loss with respect to a layer's weights or inputs.

Let $L = h(\mathbf{y})$ and $\mathbf{y} = \mathbf{f}(\mathbf{x})$. By the chain rule:

$$ \frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} $$

Here, $\frac{\partial L}{\partial \mathbf{y}}$ is a $1 \times m$ row vector (the gradient from the layer above), and $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is the $m \times n$ Jacobian of the current layer. The result is a $1 \times n$ row vector, which is the gradient passed down to the layer below.
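
Both statements, the Jacobian product and the row-vector form used in backpropagation, can be illustrated on a tiny two-stage network. This is a minimal sketch with assumed ingredients: a linear map `g` with a hypothetical weight matrix `W`, an element-wise `tanh` as `f` (so $p = m = 3$ here), and a squared-norm loss. None of these choices come from the text above; they only make the shapes concrete. The gradient is propagated one stage at a time as a row vector and compared with a finite-difference estimate.

```python
import math

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference estimate of the Jacobian of func at x."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

def matmul(P, Q):
    """Product of list-of-rows matrices P (r x k) and Q (k x c)."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def max_abs_diff(P, Q):
    return max(abs(p - q) for rp, rq in zip(P, Q) for p, q in zip(rp, rq))

W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5]]  # hypothetical 3 x 2 weights (m = 3, n = 2)

def g(x):
    """Inner function y = W x, mapping R^2 to R^3."""
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]

def f(y):
    """Outer function z = tanh(y), applied element-wise."""
    return [math.tanh(y_i) for y_i in y]

def loss(z):
    """Scalar loss L = 0.5 * ||z||^2, returned as a length-1 list."""
    return [0.5 * sum(z_i ** 2 for z_i in z)]

x0 = [0.4, -0.2]
y0 = g(x0)
z0 = f(y0)

# Individual Jacobians in numerator layout.
J_g = W                                                # dy/dx, m x n
J_f = [[(1 - z0[i] ** 2) if i == j else 0.0            # dz/dy, p x m (diagonal)
        for j in range(3)] for i in range(3)]

# Chain rule: dz/dx = (dz/dy)(dy/dx), a p x n matrix.
J_chain = matmul(J_f, J_g)
err_chain = max_abs_diff(J_chain, numerical_jacobian(lambda v: f(g(v)), x0))

# Backpropagation: push the 1 x p row vector dL/dz back one stage at a time.
dL_dz = [z0]                   # derivative of 0.5 * ||z||^2 is z^T
dL_dy = matmul(dL_dz, J_f)     # 1 x m
dL_dx = matmul(dL_dy, J_g)     # 1 x n
err_backprop = max_abs_diff(dL_dx, numerical_jacobian(lambda v: loss(f(g(v))), x0))

print(len(J_chain), len(J_chain[0]))  # rows and columns of dz/dx
print(err_chain, err_backprop)        # each should be close to zero
```

Multiplying from the left, row vector first, means each step is a vector-matrix product, so the full $p \times n$ Jacobian never has to be formed to obtain the gradient of the loss.
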