In supervised learning, we train a model on a dataset by tuning its parameters. In essence, a model can be thought of as an approximation of a large multi-dimensional function that is described via the samples in the dataset.
While there are different types of neural networks, each with its own pros and cons, the description below aims to be as variant-agnostic as possible.
A "classical" neural network consists of multiple layers. Each layer, containing one or more units called neurons. A neuron's purpose - take in input from the previous layer, performs a few operations and outputs the result.
These operations typically consist of multiplying each input by its own weight, summing the results, adding a bias, and passing the sum through what is called an activation function to produce the final output.
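As a rough sketch of that computation in NumPy (the input values, weights, and the choice of ReLU as the activation are all illustrative assumptions):

```python
import numpy as np

def relu(z):
    # A common activation function: max(0, z).
    return np.maximum(z, 0.0)

# Illustrative values: 3 inputs coming from the previous layer.
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.8, 0.1, -0.4])   # one weight per input
bias = 0.2

# Weight the inputs, sum them, add the bias...
z = np.dot(weights, inputs) + bias
# ...and pass the result through the activation function.
output = relu(z)
```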
The purpose of an activation function on a neuron's output is to introduce non-linearity into the neural network. Without it, the network could only represent linear mappings; non-linearity allows for better fits to the given data.
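To see why, note that stacking layers without an activation function collapses into a single linear layer. A quick NumPy check with arbitrary illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)

# Two "layers" with no activation function.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
two_linear_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer...
W, b = W2 @ W1, W2 @ b1 + b2
one_linear_layer = W @ x + b
print(np.allclose(two_linear_layers, one_linear_layer))  # True

# ...whereas inserting a non-linearity (here ReLU) prevents the collapse.
relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```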
A cost function, also interchangeably called a loss function, is a function that quantifies how inaccurate a neural network's predictions are.
Examples include the mean squared error, commonly used for regression, and the cross-entropy loss, commonly used for classification.
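A minimal NumPy sketch of those two losses (the function names here are illustrative rather than taken from any particular library):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between predictions and targets.
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_pred is expected to hold probabilities in (0, 1);
    # eps guards against log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
```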
The training phase of a neural network can be described as having 3 main parts.
All the parameters in the network (weights and biases) are randomly initialized. It is usually a bad idea to initialize them all to 0, since identical neurons would then receive identical updates and never differentiate.
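A simple initialization sketch in NumPy; scaling the weights by a small constant is just one common heuristic, and the layer sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(n_inputs, n_neurons):
    # Small random weights break the symmetry between neurons;
    # biases are commonly started at zero.
    weights = rng.normal(scale=0.01, size=(n_neurons, n_inputs))
    biases = np.zeros(n_neurons)
    return weights, biases

W1, b1 = init_layer(n_inputs=4, n_neurons=8)
W2, b2 = init_layer(n_inputs=8, n_neurons=1)
```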
In the forward pass, the network takes whatever input it is given and passes it through each layer, weighting the inputs, adding a bias, and activating the result layer by layer.
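Continuing the same kind of sketch, a forward pass through a small two-layer network might look like this (the shapes and the use of ReLU on every layer are illustrative simplifications):

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    # Each layer weights its input, adds a bias and applies the activation.
    activation = x
    for W, b in layers:
        activation = relu(W @ activation + b)
    return activation

# Illustrative 4 -> 8 -> 1 network with randomly initialized parameters.
network = [
    (rng.normal(scale=0.01, size=(8, 4)), np.zeros(8)),
    (rng.normal(scale=0.01, size=(1, 8)), np.zeros(1)),
]
x = rng.normal(size=4)            # an example input with 4 features
prediction = forward(x, network)
```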
The result from the forward pass is used to calculate the loss. Using backpropagation, we go back layer by layer computing how much each parameter contributed to the loss, and an optimization algorithm (usually gradient descent) then nudges the parameters in the direction that decreases the loss.
Once we have gone through all the layers, we repeat with another forward pass followed by a backpropagation step, and continue until a reasonable accuracy is reached or a set number of iterations is completed.
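Putting the three parts together, here is a deliberately tiny training loop for a single linear neuron with a mean-squared-error loss; the gradients are derived by hand for this toy case, whereas real frameworks compute them automatically via backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + 1 with a little noise (purely illustrative).
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=100)

# 1. Random initialization.
w, b = rng.normal(), 0.0
learning_rate = 0.1

for epoch in range(200):
    # 2. Forward pass: compute predictions and the MSE loss.
    y_pred = w * X + b
    loss = np.mean((y_pred - y) ** 2)

    # 3. Backward pass: gradients of the loss w.r.t. w and b,
    #    then a gradient-descent step that decreases the loss.
    grad_w = np.mean(2 * (y_pred - y) * X)
    grad_b = np.mean(2 * (y_pred - y))
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```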
When a network with too many parameters is trained on a dataset, the network can fit the training data so well that it fails to generalize to new, unseen examples. This phenomenon is called overfitting, and the network is said to have high variance.
Conversely, when a network is too simple or poorly designed to capture the structure of the data, so that it cannot fit even the training set properly, this is called underfitting and the network is said to have high bias.
In simpler words, an underfit model is biased towards overly simple assumptions and therefore cannot grasp the full nature of the data, while an overfit model varies enough to fit the given data so closely that it fails to generalize to never-before-seen samples. [1]
Regularization introduces a penalty term into the loss function that grows with the magnitude of the model's weights (for example, their L1 or L2 norm). This encourages the weights to stay small and hence reduces the chances of overfitting.
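For example, an L2 penalty added to the mean squared error might look like this (lambda_reg is an illustrative hyperparameter name, not a standard one):

```python
import numpy as np

def l2_regularized_mse(y_true, y_pred, weights, lambda_reg=0.01):
    # The usual data-fit term...
    mse = np.mean((y_true - y_pred) ** 2)
    # ...plus a penalty proportional to the squared magnitude of the weights.
    penalty = lambda_reg * np.sum(weights ** 2)
    return mse + penalty
```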
In this method, a random subset of neurons is chosen and ignored for the entirety of a forward and backward pass, so their parameters aren't updated on that pass. This, like regularization, helps in reducing overfitting. Dropout layers are usually placed after dense or convolutional layers, and are turned off during prediction.
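A minimal sketch of the common "inverted dropout" variant applied to a layer's activations; scaling the surviving outputs by 1/(1 - rate) keeps their expected magnitude unchanged, which is why nothing special is needed at prediction time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    if not training:
        # Dropout is turned off during prediction.
        return activations
    # Randomly zero out a fraction `rate` of the neurons' outputs,
    # scaling the survivors so the expected output stays the same.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)
```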
A cross-validation dataset is evaluated periodically during training to help with hyperparameter tuning, since it provides an estimate of the model's performance on unseen data. Both the training and cross-validation sets are used only during training, while the test dataset is reserved for evaluating the final model's performance.
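One simple way to carve a dataset into the three splits (the shuffle-then-slice approach and the 20/20 fractions below are just a common convention):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_dataset(X, y, val_frac=0.2, test_frac=0.2):
    # Shuffle once, then slice into train / cross-validation / test.
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    n_test = int(len(X) * test_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```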
The simplest type is a dense neural network, where each layer is made up of neurons and every neuron in a layer is connected to every neuron in the previous layer. Such a network is said to be "densely" connected.
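For illustration, here is how such a network could be defined with Keras; the framework choice, layer sizes, and activations are assumptions for the sake of the example:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Every Dense layer connects each of its neurons
# to every neuron in the previous layer.
model = keras.Sequential([
    layers.Input(shape=(20,)),            # 20 input features
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                  # dropout after a dense layer
    layers.Dense(1, activation="sigmoid"),
])
```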
Convolutional neural networks (CNNs) are commonly used for training image classifiers because, unlike a dense network, their convolutional layers look at local groups of pixels at a time rather than treating every pixel independently.
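A comparable Keras sketch of a small CNN classifier; the 28×28 grayscale input, filter counts, and kernel sizes are again illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g. grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),  # looks at 3x3 pixel patches
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # e.g. 10 output classes
])
```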