A CNN architecture is built from convolutional layers, often followed by pooling layers (max, mean, ...). The resulting feature maps are flattened and forwarded to dense layers which ultimately end
A convolutional layer in a CNN makes use of kernels (or filters), like the ones discussed in image processing. Each layer has a fixed number of filters, with the kernel values randomly initialized and then learned during training. Unlike a typical correlation operation, we can also define a stride, which is the number of pixels the kernel slides across between successive dot products. For example, a stride of 1 is simply a plain correlation operation, while a larger stride skips positions and produces a smaller feature map.
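To make the effect of the stride concrete, here is a minimal NumPy sketch of a single-channel, unpadded correlation with a configurable stride. The function name `correlate2d_strided`, the 5x5 input, and the all-ones kernel are assumptions chosen for illustration; a real layer would also handle multiple channels, padding, and a bias term.

```python
import numpy as np

def correlate2d_strided(image, kernel, stride=1):
    """Unpadded 2D cross-correlation with a configurable stride (illustrative helper)."""
    kh, kw = kernel.shape
    # Number of positions the kernel can occupy along each axis.
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current window
            window = image[r:r + kh, c:c + kw]
            out[i, j] = np.sum(window * kernel)  # dot product of window and kernel
    return out

# Hypothetical input and kernel, chosen only to show the shapes.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3))

stride1 = correlate2d_strided(image, kernel, stride=1)
stride2 = correlate2d_strided(image, kernel, stride=2)
print(stride1.shape)  # (3, 3)
print(stride2.shape)  # (2, 2)
```

With a stride of 2 the output keeps every second row and column of the stride-1 result, which is why the feature map shrinks.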
A pooling layer takes the feature maps and downscales them based on some criterion. Commonly used pooling criteria are max and mean. A max pooling kernel assigns the maximum value in the kernel's window to the corresponding pixel in the output feature map. The point of doing this is to preserve the rough location of the activations from the convolutional layer, while discarding lower activations and reducing the number of inputs for the next layer. This downscaling is often called downsampling and is a form of dimensionality reduction.
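The max pooling step described above can be sketched in a few lines of NumPy. The helper name `max_pool2d`, the 2x2 window with stride 2, and the sample values are assumptions for illustration only.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over a single 2D feature map (illustrative helper)."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Keep only the strongest activation inside the window.
            out[i, j] = feature_map[r:r + size, c:c + size].max()
    return out

# Hypothetical 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 9, 5],
                 [1, 1, 3, 7]])

pooled = max_pool2d(fmap)
print(pooled)
# [[6 2]
#  [2 9]]
```

The 16 input values are reduced to 4, and each surviving value still sits in the quadrant where the strong activation occurred.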
The number of filters in subsequent convolutional layers can be increased to compensate for the reduced spatial size of the feature maps.
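As a rough illustration of this trade-off, the sketch below traces feature map shapes through a small stack of layers. Every number here is hypothetical (a 32x32x3 input, 3x3 convolutions with 16 and then 32 filters, 2x2 pooling with stride 2, no padding); it shows the shape arithmetic, not a recommended architecture.

```python
def output_size(n, kernel, stride=1):
    """Spatial size after an unpadded convolution or pooling window."""
    return (n - kernel) // stride + 1

# Hypothetical input: 32x32 image with 3 channels.
size, channels = 32, 3

# (kind, kernel size, stride, number of filters); all values are assumptions.
layers = [
    ("conv", 3, 1, 16),
    ("pool", 2, 2, None),
    ("conv", 3, 1, 32),
    ("pool", 2, 2, None),
]

shapes = []
for kind, kernel, stride, filters in layers:
    size = output_size(size, kernel, stride)
    if kind == "conv":
        channels = filters  # one feature map per filter
    shapes.append((size, size, channels))
    print(kind, (size, size, channels), "values:", size * size * channels)
```

Under these assumptions the spatial size falls from 30x30 to 6x6 while the number of feature maps doubles from 16 to 32.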