Convolutional Neural Networks are designed to process grid-like data, such as images. They use convolution operations in place of general matrix multiplication in at least one of their layers.
The Forward Pass (1D / 2D Convolution)
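During the forward pass, a filter (kernel) $W$ slides over the input $X$ to produce an output feature map $Y$. For a simple 2D convolution with a scalar bias $b$ (no padding, stride 1):

$$Y_{i,j} = \sum_{m} \sum_{n} X_{i+m,\,j+n} \, W_{m,n} + b$$

As in most deep learning libraries, the filter is not flipped here, so strictly speaking this operation is a cross-correlation.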
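The forward pass described above can be sketched in NumPy as follows. This is a minimal illustration, assuming a single-channel 2D input, stride 1, no padding, and a scalar bias; the function name `conv2d_forward` and the sample values are made up for the example.

```python
import numpy as np

def conv2d_forward(x, w, b=0.0):
    """Valid 2D cross-correlation (no padding, stride 1) plus a scalar bias."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Multiply the filter element-wise with the current window and sum.
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w) + b
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 input
w = np.array([[1.0, 0.0], [0.0, -1.0]])        # illustrative 2x2 filter
y = conv2d_forward(x, w)
print(y.shape)  # (3, 3)
```

With a 4x4 input and a 2x2 filter, the output is 3x3 because the filter fits in three positions along each axis.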
Backpropagation in CNNs
Backpropagation in a CNN requires computing three main gradients:
- The gradient with respect to the filter weights (to update the filter).
- The gradient with respect to the biases (to update the biases).
- The gradient with respect to the input feature map (to pass the error down to the previous layer).
Let $\partial L / \partial Y$ be the gradient of the loss $L$ with respect to the output $Y$ of the convolutional layer (this is passed backward from the subsequent layer).
1. Gradient with respect to the Filter Weights ($\partial L / \partial W$)
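To update the filter weights, we need to know how the loss changes as each weight changes. Because each weight in the filter $W$ is multiplied by a different part of the input $X$ at every position of the sliding window, its gradient accumulates over all of those positions:

$$\frac{\partial L}{\partial W_{m,n}} = \sum_{i} \sum_{j} X_{i+m,\,j+n} \, \frac{\partial L}{\partial Y_{i,j}}$$

In code, this is computed as a valid cross-correlation between the input $X$ and the upstream gradient $\partial L / \partial Y$, where $Y$ is the layer output.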
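As a rough sketch of the filter gradient, the snippet below slides the upstream gradient over the input in the same way the filter slides over it in the forward pass. It assumes a single channel, stride 1, and no padding; `cross_correlate_valid`, `grad_filter`, and the random test data are hypothetical names and values chosen for illustration.

```python
import numpy as np

def cross_correlate_valid(a, k):
    """Valid 2D cross-correlation of array a with kernel k (stride 1)."""
    H, W = a.shape
    kH, kW = k.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + kH, j:j + kW] * k)
    return out

def grad_filter(x, d_out):
    """dL/dW[m, n] = sum over (i, j) of x[i + m, j + n] * d_out[i, j]."""
    return cross_correlate_valid(x, d_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))       # input
w = rng.standard_normal((3, 3))       # filter used in the forward pass
d_out = rng.standard_normal((3, 3))   # upstream gradient, same shape as the output
d_w = grad_filter(x, d_out)
print(d_w.shape)  # (3, 3)
```

The result has the same shape as the filter, which is what a weight update requires.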
2. Gradient with respect to the Bias ($\partial L / \partial b$)
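The bias $b$ is added to every element of the output feature map $Y$. Therefore, the gradient with respect to the bias is simply the sum of all gradients in $\partial L / \partial Y$:

$$\frac{\partial L}{\partial b} = \sum_{i} \sum_{j} \frac{\partial L}{\partial Y_{i,j}}$$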
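In NumPy the bias gradient reduces to a single sum. The sketch below uses made-up gradient values; the batched variant assumes a (batch, channels, height, width) layout with one bias per channel, which is a common convention but not something stated in the text above.

```python
import numpy as np

# Made-up upstream gradient for a single 2x2 output feature map.
d_out = np.array([[0.5, -1.0],
                  [2.0, 0.25]])
d_b = d_out.sum()  # every output element contributes to the shared bias
print(d_b)  # 1.75

# Assumed (batch, channels, height, width) layout: sum over everything
# except the channel axis to get one bias gradient per channel.
d_out_batch = np.ones((2, 3, 4, 4))
d_b_per_channel = d_out_batch.sum(axis=(0, 2, 3))
print(d_b_per_channel.shape)  # (3,)
```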
3. Gradient with respect to the Input ($\partial L / \partial X$)
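To propagate the error back to the previous layer, we need the gradient with respect to the input $X$. Since each input pixel contributes to multiple output pixels (depending on the filter size), the gradient with respect to a single input pixel is the sum of the gradients from all output pixels it influenced, weighted by the corresponding filter weights:

$$\frac{\partial L}{\partial X_{i,j}} = \sum_{m} \sum_{n} \frac{\partial L}{\partial Y_{i-m,\,j-n}} \, W_{m,n}$$

where terms whose indices fall outside the output $Y$ are treated as zero. Mathematically, this is a full convolution of the incoming gradient $\partial L / \partial Y$ with the filter $W$. In code, it amounts to zero-padding the incoming gradient and cross-correlating it with the 180-degree rotated (flipped) filter.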
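The input gradient can be sketched by zero-padding the upstream gradient and sliding the flipped filter over it. As before, this assumes a single channel, stride 1, and no padding in the forward pass; `grad_input`, `cross_correlate_valid`, and the random data are illustrative names and values, not part of any library.

```python
import numpy as np

def cross_correlate_valid(a, k):
    """Valid 2D cross-correlation of array a with kernel k (stride 1)."""
    H, W = a.shape
    kH, kW = k.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + kH, j:j + kW] * k)
    return out

def grad_input(d_out, w):
    """Full convolution of d_out with w: pad with zeros, then slide the flipped filter."""
    kH, kW = w.shape
    padded = np.pad(d_out, ((kH - 1, kH - 1), (kW - 1, kW - 1)))  # zero padding
    flipped = w[::-1, ::-1]  # rotate the filter by 180 degrees
    return cross_correlate_valid(padded, flipped)

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 5))       # input from the forward pass
w = rng.standard_normal((3, 3))       # filter
d_out = rng.standard_normal((3, 3))   # upstream gradient, same shape as the output
d_x = grad_input(d_out, w)
print(d_x.shape)  # (5, 5)
```

Padding by the filter size minus one on each side is what restores the gradient to the shape of the original input.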
Backpropagation Through Pooling Layers
Pooling layers have no learnable parameters, but gradients still need to flow through them to reach earlier layers.
- Max Pooling: During the forward pass, max pooling selects the maximum value in a window. During backpropagation, the gradient from the next layer is routed only to the index that had the maximum value in the forward pass. All other elements in the pooling window receive a gradient of 0.
- Average Pooling: During the forward pass, the average of the window is taken. During backpropagation, the incoming gradient is distributed equally among all elements in the pooling window (i.e., each element receives the gradient divided by the number of elements in the window).
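Both pooling rules can be sketched for the simple case of non-overlapping square windows. The function names, the window size of 2, and the sample arrays are assumptions made for illustration; when a window contains ties, this sketch routes the gradient to the first maximum that `np.argmax` finds, which is one possible convention.

```python
import numpy as np

def maxpool_backward(x, d_out, size=2):
    """Route each upstream gradient to the position of the window's maximum."""
    d_x = np.zeros_like(x)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            d_x[i * size + r, j * size + c] = d_out[i, j]
    return d_x

def avgpool_backward(x_shape, d_out, size=2):
    """Spread each upstream gradient evenly over its window."""
    d_x = np.zeros(x_shape)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            d_x[i * size:(i + 1) * size, j * size:(j + 1) * size] = d_out[i, j] / size**2
    return d_x

x = np.array([[1.0, 3.0, 2.0, 0.0],
              [4.0, 2.0, 1.0, 5.0],
              [0.0, 1.0, 7.0, 2.0],
              [6.0, 2.0, 3.0, 1.0]])   # input seen in the forward pass
d_out = np.array([[1.0, 2.0],
                  [3.0, 4.0]])         # upstream gradient, one value per window

d_x_max = maxpool_backward(x, d_out)
d_x_avg = avgpool_backward(x.shape, d_out)
```

In the max pooling result only four entries are non-zero (one per window), while in the average pooling result every entry of a window holds a quarter of that window's upstream gradient.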
For the mathematical prerequisites, including the general backpropagation chain rule, see 9. Vector Calculus and Neural Networks.