## Why can we use batch normalization?

We normalize the input layer by adjusting and scaling the activations. For example, when some features range from 0 to 1 and others from 1 to 1000, normalizing them speeds up learning. If the input layer benefits from this, why not do the same for the values inside the hidden layers, which change all the time? Doing so can yield a tenfold or greater improvement in training speed.
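As a minimal sketch of that input normalization (using NumPy and a hypothetical toy matrix `X` whose two features live on very different scales):

```python
import numpy as np

# Hypothetical data: first feature in [0, 1], second in [1, 1000].
X = np.array([[0.2, 400.0],
              [0.9,  15.0],
              [0.5, 950.0]])

# Standardize each feature: subtract its mean, divide by its std.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Both columns now have zero mean and unit variance,
# so no single feature dominates the gradient updates.
```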

Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift). To illustrate covariate shift, suppose we have a deep network for cat detection and we train it only on images of black cats. If we then apply this network to images of colored cats, it obviously won't perform very well. The training set and the prediction set are both cat images, but they differ a bit. In other words, if an algorithm learned some X to Y mapping and the distribution of X changes, we may need to retrain the algorithm on the new distribution of X. (Deeplearning.ai: Why Does Batch Norm Work? (C2W3L06))

Batch normalization also allows each layer of the network to learn a bit more independently of the other layers.

We can use higher learning rates because batch normalization ensures that no activation goes really high or really low. As a result, things that previously couldn't train will start to train.

It reduces overfitting because it has a slight regularization effect. Similar to dropout, it adds some noise to each hidden layer's activations. Therefore, if we use batch normalization, we can use less dropout, which is a good thing because we won't lose as much information. However, we should not rely on batch normalization alone for regularization; it is better used together with dropout.

## How does batch normalization work?

To increase the stability of a neural network, batch normalization normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.
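This normalization step can be sketched as follows (a NumPy illustration, not any particular framework's implementation; the small `eps` term, standard in practice, guards against division by zero):

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Normalize activations x of shape (batch, features) per feature."""
    mu = x.mean(axis=0)          # batch mean
    var = x.var(axis=0)          # batch variance
    return (x - mu) / np.sqrt(var + eps)

# Activations with an arbitrary scale and shift.
x = np.random.randn(32, 4) * 5.0 + 3.0
x_hat = normalize_batch(x)
# x_hat now has per-feature mean ~0 and std ~1.
```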

However, after this shift and scale of the activation outputs by some randomly initialized parameters, the weights in the next layer are no longer optimal. SGD (stochastic gradient descent) undoes this normalization if that is a way for it to minimize the loss function.

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and shifted by a "mean" parameter (beta). In other words, batch normalization lets SGD do the denormalization by changing only these two weights per activation, instead of losing the stability of the network by changing all the weights.
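Extending the sketch above with the two trainable parameters (again a minimal NumPy illustration; gamma is conventionally initialized to 1 and beta to 0, so the layer starts out as a pure normalizer):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize per feature, then scale by gamma and shift by beta."""
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

x = np.random.randn(64, 3) * 2.0 + 7.0
gamma = np.ones(3)    # trainable scale, initialized to 1
beta = np.zeros(3)    # trainable shift, initialized to 0

y = batch_norm_forward(x, gamma, beta)
# Per feature, y has mean ~beta and std ~gamma, so SGD can restore
# any scale/shift it needs by adjusting just these two parameters.
```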

## Batch normalization and pre-trained networks like VGG:

VGG doesn't have a batch norm layer in it because batch normalization didn't exist before VGG. If we train it with batch normalization from the beginning, the pre-trained weights will benefit from the normalization of the activations. So adding a batch norm layer actually improves performance on ImageNet, which is cool. You can add it to dense layers, and also to convolutional layers.

If we insert a batch norm layer into a pre-trained network, it will change the pre-trained activations, because it will subtract the mean and divide by the standard deviation of the activation layers. We don't want that to happen, because we need those pre-trained weights to keep seeing the same distribution. So what we need to do is insert the batch norm layer and set gamma and beta so as to undo the change in the outputs.
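One way to sketch this "undo" initialization (a NumPy illustration with hypothetical activations standing in for a pre-trained layer's outputs): set gamma to the activations' standard deviation and beta to their mean, so the freshly inserted batch norm layer starts out approximately as the identity.

```python
import numpy as np

# Hypothetical activations from a pre-trained layer (e.g. a VGG dense layer).
acts = np.random.randn(128, 10) * 3.0 + 1.5

def batch_norm(x, gamma, beta, eps=1e-5):
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return gamma * x_hat + beta

# Initialize gamma/beta to undo the normalization on these activations.
gamma = acts.std(axis=0)
beta = acts.mean(axis=0)

out = batch_norm(acts, gamma, beta)
# out ~ acts, so the downstream pre-trained weights still receive
# the distribution they were originally trained on.
```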