In this blog we will talk about the Batch Normalization layer.
Before understanding Batch Normalization, we first need to understand what internal covariate shift is.
Internal covariate shift is defined as the change in the distribution of network activations (feature maps) caused by changes in the network parameters during training.
Because of this internal covariate shift, training deep neural networks is difficult: the input distribution of each hidden layer keeps changing, so the layers need to continuously adapt to the new distribution.
For example, if a layer computes F2(x, θ2), where x = F1(u, θ1) is the output of the previous layer, then every time θ1 changes the distribution of x changes, and θ2 needs to readjust to compensate. This slows down training by requiring lower learning rates and careful parameter initialization.
To address this problem we can normalize the layer inputs; that is where Batch Normalization comes in.
Why do we call this Batch Normalization? Because training happens over a mini-batch (e.g. 64/128/256) of training images, so we normalize a batch of feature maps at a time.
For a layer with a k-dimensional input x = (x(1), ..., x(k)), we normalize each dimension as x̂(k) = (x(k) - E[x(k)]) / √Var[x(k)], where the expectation and variance are computed over the training data set. Such normalization speeds up convergence.
Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.
To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value.
These parameters are learned along with the original model parameters, and restore the representation power of the network.
Indeed, by setting γ(k)=√Var[x(k)] and β(k)=E[x(k)], we could recover the original activations, if that were the optimal thing to do.
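As a quick illustration, here is a minimal NumPy sketch of this normalize-then-scale-and-shift transform over a mini-batch (the function name, the epsilon, and the shapes are assumptions for illustration, not from the paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each dimension of x over the mini-batch, then scale and shift.

    x     : mini-batch of shape (batch_size, k)
    gamma : learned scale parameters, shape (k,)
    beta  : learned shift parameters, shape (k,)
    """
    mean = x.mean(axis=0)                     # E[x(k)] over the mini-batch
    var = x.var(axis=0)                       # Var[x(k)] over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per dimension
    return gamma * x_hat + beta               # y(k) = gamma(k) * x_hat(k) + beta(k)

# A mini-batch of 4 samples with k = 3 dimensions
x = np.random.randn(4, 3) * 5.0 + 2.0
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))          # approximately 0 and 1 per dimension
```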
Batch Normalization During Training
- During training, the mean and variance are computed over the mini-batch; for convolutional feature maps, a separate mean and variance is computed for each dimension (channel), over the batch and the spatial locations.
- Then we normalize the input feature map using the computed mean and variance, so that the distribution has zero mean and unit variance.
- As we are changing (normalizing) the input of the layer, this may change what the layer can represent. To address this, we make sure that the transformation inserted in the network can represent the identity transform by using the γ (scale) and β (shift) parameters. A per-channel sketch of these steps follows this list.
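Here is a rough NumPy sketch of these training-time steps for a channels-last convolutional feature map (the shapes, the epsilon, and the function name are assumptions for illustration):

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Training-time batch norm for a channels-last convolutional feature map.

    x     : feature map of shape (batch, height, width, channels)
    gamma : learned scale, shape (channels,)
    beta  : learned shift, shape (channels,)
    """
    mean = x.mean(axis=(0, 1, 2))             # one mean per channel
    var = x.var(axis=(0, 1, 2))               # one variance per channel
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per channel
    return gamma * x_hat + beta               # scale and shift, per channel

x = np.random.randn(64, 14, 14, 128)          # e.g. a batch of 64 feature maps of 14x14x128
y = batch_norm_conv(x, np.ones(128), np.zeros(128))
print(y.shape)                                 # (64, 14, 14, 128)
```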
Batch Normalization During Inference
The normalization of activations that depends on the mini-batch allows efficient training, but it is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically.
Instead of the mini-batch statistics, we use population statistics, i.e. the moving average of the mean and variance over the mini-batches seen during training. The moving mean and moving variance are non-trainable parameters that are calculated (updated) and stored during training.
moving_mean(t) = momentum * moving_mean(t-1) + (1 - momentum) * batch_mean(t)
moving_var(t) = momentum * moving_var(t-1) + (1 - momentum) * batch_var(t)
where momentum = 0.99 (the Keras default).
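A small Python sketch of how these running statistics could be tracked during training (the variable and function names are illustrative; 0.99 is the Keras default momentum):

```python
import numpy as np

def update_running_stats(moving_mean, moving_var, batch_mean, batch_var, momentum=0.99):
    """Exponential moving average of the mini-batch statistics, updated every training step."""
    moving_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    moving_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return moving_mean, moving_var

channels = 128
moving_mean, moving_var = np.zeros(channels), np.ones(channels)   # non-trainable state
batch_mean, batch_var = np.random.randn(channels), np.random.rand(channels)
moving_mean, moving_var = update_running_stats(moving_mean, moving_var, batch_mean, batch_var)
```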
Then we normalize the test sample using the population mean and variance, and transform the normalized sample using the learned scale (γ) and shift (β) parameters of the respective layer, which together yield a single linear transformation.
Since these parameters are fixed, the batch normalization procedure at inference time is essentially applying a fixed linear transform to the activations.
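Because γ, β, the moving mean and the moving variance are all fixed at inference time, they can be folded into one per-channel scale and offset. A sketch of that folding (the function names and the epsilon are assumptions):

```python
import numpy as np

def fold_batch_norm(gamma, beta, moving_mean, moving_var, eps=1e-5):
    """Pre-compute the fixed linear transform applied at inference time."""
    scale = gamma / np.sqrt(moving_var + eps)
    offset = beta - moving_mean * scale
    return scale, offset

def batch_norm_inference(x, scale, offset):
    # Deterministic: depends only on the input and the stored population statistics
    return x * scale + offset
```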
One more important question for batch normalization is where it should be placed: before or after the activation layer.
We can use the batch normalization layer either before or after the activation layer.
For s-shaped functions such as the sigmoid and the hyperbolic tangent (tanh), one can use batch normalization after the activation function.
For activations that may result in non-Gaussian distributions, such as the rectified linear unit (ReLU), one can use batch normalization before the activation function.
The Batch Normalization paper suggests using batch normalization before the activation function.
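In Keras, this ordering can be expressed by keeping the convolution linear and applying the activation after the BatchNormalization layer; a minimal sketch (the input shape and filter count are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 3))
x = layers.Conv2D(128, 3, padding="same", use_bias=False)(inputs)  # no activation yet
x = layers.BatchNormalization()(x)            # normalize the pre-activations
x = layers.Activation("relu")(x)              # activation applied after batch norm
model = keras.Model(inputs, x)
```

Dropping the convolution bias (use_bias=False) is a common choice here, since the β parameter of batch normalization plays the same role as the bias.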
Let's look at the batch normalization layer (Keras implementation) placed after a convolution layer whose output feature map has dimensions (batch_size, 14, 14, 128).
Observe that the dimensions of γ, β, the moving mean and the moving variance are all equal to the number of channels of the input feature map (128 here).
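Here is a small sketch reproducing that setup; the 28x28x3 input and the stride of 2 are assumptions chosen so that the convolution output is (batch_size, 14, 14, 128):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(28, 28, 3))
x = layers.Conv2D(128, 3, strides=2, padding="same")(inputs)  # output: (batch, 14, 14, 128)
x = layers.BatchNormalization()(x)
model = keras.Model(inputs, x)

for w in model.layers[-1].weights:            # gamma, beta, moving_mean, moving_variance
    print(w.name, w.shape)                    # each has shape (128,)
```

Of the four weights printed, only γ and β are trainable; the moving mean and moving variance are updated outside of backpropagation, as described in the training section.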
That's all about Batch Normalization.
Thanks for reading the blog.
References
- https://arxiv.org/pdf/1502.03167.pdf
- https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/