When we look at an image, even a simple photo of a dog, our brain recognizes many details at the same time: teeth, whiskers, tongue, eyes. A Convolutional Neural Network (CNN) works in a similar way, detecting many patterns at once by applying filters.
One Region, Many Patterns
A single region of an image can contain multiple visual features. To properly understand it, a CNN does not rely on just one filter, but on multiple filters operating in parallel, each designed to detect a different pattern such as edges, curves, or textures.
Each filter creates its own collection of nodes in the convolutional layer. All of the nodes produced by one filter share the same weights, and those weights differ from the weights of every other filter. In practice, modern CNNs often use dozens or even hundreds of filters in a single convolutional layer.
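As a minimal sketch, assuming TensorFlow/Keras is available (the 128×128 grayscale input size is just an illustration), we can create a convolutional layer with 32 filters of size 3×3 and inspect its weights:

```python
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

# A convolutional layer with 32 filters, each 3x3, over a single-channel input.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation="relu")
conv.build((None, 128, 128, 1))  # hypothetical 128x128 grayscale input

kernel, bias = conv.get_weights()
print(kernel.shape)  # (3, 3, 1, 32): one 3x3x1 set of weights per filter
print(bias.shape)    # (32,): one bias per filter
```

Every position in the image is processed with these same 32 sets of weights, which is exactly the weight sharing described above.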
Feature Maps and Activation Maps
As each filter slides across the image, it produces a matrix of values. These matrices are known as:
Feature maps
Activation maps
Visually, feature maps look like filtered versions of the original image. They contain less information than the original, keeping only what is relevant to the specific pattern the filter is designed to detect.
For example, if we apply four 4×4 filters to an image (such as one from a self-driving car dataset), we obtain four feature maps. Some may highlight vertical edges, while others emphasize horizontal edges.
Brighter values in a feature map indicate that the pattern defined by the filter was strongly detected in that region of the image.
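A minimal NumPy sketch of this idea, with made-up pixel values and small 2×2 filters instead of the 4×4 filters mentioned above, might look like this:

```python
import numpy as np  # sketch only; the image and filter values are invented

def convolve2d(image, kernel):
    """Slide a 2D filter over a grayscale image and return the feature map (valid region only)."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum of element-wise products between the filter and the region it covers.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(10, 10)             # toy grayscale image
filters = [
    np.array([[-1., 1.], [-1., 1.]]),      # responds to vertical edges
    np.array([[-1., -1.], [1., 1.]]),      # responds to horizontal edges
]
feature_maps = [convolve2d(image, f) for f in filters]
print(feature_maps[0].shape)               # (9, 9): one feature map per filter
```

Large values in a map mark the regions where that filter's pattern is present.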
Why Edge Detection Matters
Edges are one of the most important visual features in images. They often appear as a line of light pixels next to a line of darker pixels. Because of this, edge-detecting filters play a crucial role in CNNs, especially in the early layers.
By matching the bright regions in a feature map with the original image, we can see which parts of the image activated the filter.
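To make this concrete, here is a tiny worked example with invented pixel values: a 4×4 vertical-edge filter applied to a patch with a bright left half and a dark right half, and then to a patch with no edge at all.

```python
import numpy as np  # worked example: bright pixels next to dark pixels

patch_edge = np.array([[9., 9., 1., 1.],   # bright left half, dark right half
                       [9., 9., 1., 1.],
                       [9., 9., 1., 1.],
                       [9., 9., 1., 1.]])
patch_flat = np.full((4, 4), 5.)            # no edge: uniform brightness

vertical_edge_filter = np.array([[1., 1., -1., -1.],
                                 [1., 1., -1., -1.],
                                 [1., 1., -1., -1.],
                                 [1., 1., -1., -1.]])

print(np.sum(patch_edge * vertical_edge_filter))  # 64.0: strong response at the edge
print(np.sum(patch_flat * vertical_edge_filter))  # 0.0: nothing to detect
```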
What About Color Images?
So far, we have considered grayscale images, which computers interpret as 2D arrays (height × width).
Color images, however, are represented as 3D arrays:
height
width
depth (three channels: red, green, and blue)
An RGB image can be thought of as a stack of three 2D matrices, one for each color channel.
To convolve a color image, the filter must also be three-dimensional. Each filter contains weights for the red, green, and blue channels at every spatial location. The convolution process is the same as before, but the sum now includes values from all three channels.
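A minimal NumPy sketch of this, with random values purely for illustration, sums the products over height, width, and all three channels at every filter position:

```python
import numpy as np

def convolve_rgb(image, kernel):
    """Valid convolution of an RGB image (H, W, 3) with a 3D filter (kh, kw, 3).

    At each position the products over all three channels are summed,
    so a single 3D filter still produces a single 2D feature map.
    """
    h, w, _ = image.shape
    kh, kw, _ = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]   # (kh, kw, 3) region
            out[i, j] = np.sum(patch * kernel)     # sum over height, width, and channels
    return out

rgb = np.random.rand(8, 8, 3)           # toy 8x8 color image
filt = np.random.randn(3, 3, 3)         # one 3x3 filter with weights for R, G, and B
print(convolve_rgb(rgb, filt).shape)    # (6, 6): a single 2D feature map
```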
Multiple Filters and Increasing Depth
When we apply multiple filters to a color image, we obtain multiple feature maps. These maps can be stacked together to form a new 3D array, which then becomes the input to the next convolutional layer.
This is how CNNs gradually learn:
patterns
patterns within patterns
and increasingly complex visual structures
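As a small sketch, assuming TensorFlow/Keras (the layer sizes are illustrative), stacking two convolutional layers shows how the depth grows from 3 channels to 16 feature maps and then to 32:

```python
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                      # RGB input: depth 3
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),  # 16 feature maps -> depth 16
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # sees all 16 maps, outputs depth 32
])
model.summary()
```

Each layer treats the stacked feature maps of the previous layer as its "image", which is how patterns of patterns emerge.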
CNNs vs. Dense Layers
In many ways, convolutional layers are similar to fully connected (dense) layers:
both use weights and biases
both rely on backpropagation
both minimize a loss function
The key difference is connectivity:
dense layers are fully connected
convolutional layers are locally connected and use shared weights
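A quick back-of-the-envelope comparison, for a hypothetical 28×28 grayscale input and 32 units or filters, shows what this difference in connectivity means for the number of parameters:

```python
# Dense layer with 32 units: every unit connects to every pixel.
dense_params = (28 * 28) * 32 + 32    # weights + biases = 25,120

# Conv layer with 32 filters of size 3x3: each filter's weights are shared
# across all positions, so the count does not depend on the image size.
conv_params = (3 * 3 * 1) * 32 + 32   # weights + biases = 320

print(dense_params, conv_params)
```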
Learning Happens Automatically
When building a CNN:
filter weights are initialized randomly
we do not specify which patterns to detect
we only define a loss function (for example, categorical cross-entropy)
During training, backpropagation updates the filters to minimize the loss. As a result, the network learns on its own which patterns are useful for the task.
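A minimal training sketch, assuming TensorFlow/Keras and a hypothetical 10-class classification task (x_train and y_train are placeholders), ties these pieces together:

```python
import tensorflow as tf  # assumption: TensorFlow/Keras is installed

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),  # filter weights start random
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# We never say which patterns the filters should detect; we only pick the loss.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)  # backpropagation adjusts the filters to reduce the loss
```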