
Sunday, June 20, 2021

Fully Convolutional Networks for Semantic Segmentation

In this blog I will talk about the paper Fully Convolutional Networks for Semantic Segmentation.

Let's first understand the semantic segmentation problem. Semantic segmentation classifies each pixel into one of the classes (including background) without differentiating between instances of an object.

Semantic Segmentation

In the above image we can see two cows, but semantic segmentation does not differentiate between instances of connected similar objects. One can use instance segmentation to segment the two cows separately.

The Fully Convolutional Networks for Semantic Segmentation paper mainly talks about two things:

  • First, using a fully convolutional network trained end-to-end, pixels-to-pixels, that takes input images of arbitrary size and produces correspondingly sized output.
  • Second, using a novel "skip" architecture that combines semantic information from a deep, coarse layer with local appearance information from a shallow, fine layer to produce accurate and detailed segmentations.

 

Classification Network to FCN Network for Semantic Segmentation

Let's see what we mean by a fully convolutional network. Usually, in classification networks like AlexNet, VGG and GoogLeNet, the input image goes through multiple convolution blocks followed by fully connected layers and then the output layer.

Because of the fully connected layers, these networks require input images of a fixed dimension, since a fully connected layer takes input of a fixed size.

In FCN, the authors replace these fully connected layers with convolutional layers, hence making the network fully convolutional.

In a classification network we usually have N output nodes for N classes in the output layer, but for semantic segmentation we want a pixel mask as output, with the same spatial dimensions as the input. For this, the authors remove the final output layer of the classification network and use a deconvolution (upsampling) layer to produce the output mask.

Don't be confused by the name deconvolution; it is not the reverse process of convolution. In deep learning, deconvolution (sometimes also called backwards convolution, up-convolution or transposed convolution) is only used for upsampling.

Note that the deconvolution filter in such a layer need not be fixed (e.g. to bilinear upsampling) but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
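As a minimal illustration (assuming TensorFlow/Keras, which this blog uses elsewhere), a transposed convolution with stride 2 upsamples a feature map by 2x, and its kernel is an ordinary trainable weight:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 7, 7, 21))   # a coarse score map (batch, H, W, channels)

# "Deconvolution" here is just a transposed convolution; its kernel is a normal
# trainable weight, so the upsampling filter can be learned during training.
up2x = layers.Conv2DTranspose(filters=21, kernel_size=4, strides=2, padding='same')

print(up2x(x).shape)                   # (1, 14, 14, 21): spatial size doubled
```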

So far we have seen that two changes are required in a fully connected classification network in order to make it a fully convolutional network for semantic segmentation. Let's see how to make these changes.

FCN-32

First recall the architecture of VGG network.

VGG original architecture

After the fifth max pooling layer the output feature map dimension is 7x7x512, and after that there are two fully connected layers.

In order to use only convolutional layers, we replace the original fc6 layer with a convolutional layer of filter size 7x7 (7x7x512x4096), which produces a 7x7x4096 feature map, and replace fc7 with a convolutional layer of filter size 1x1 (1x1x4096x4096), which also produces a 7x7x4096 feature map.

After that, the output feature map (7x7x4096) of the new fc7 layer passes through one more 1x1 (1x1x4096x21) convolution layer to give an individual prediction map for each of the 21 PASCAL classes (including background). Finally, a deconvolution layer is used to upsample the output 32 times, from 7x7 to 224x224.
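Here is a minimal Keras sketch of this conversion, following the description above (the layer names such as fc6_conv are my own, and details like padding and weight initialization differ from the original Caffe implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 21   # 20 PASCAL VOC classes + background

# VGG16 backbone without its fully connected head; pool5 output is 7x7x512
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=(224, 224, 3))

x = backbone.output                                                   # 7x7x512
x = layers.Conv2D(4096, 7, padding='same', activation='relu',
                  name='fc6_conv')(x)                                 # fc6 -> 7x7 conv
x = layers.Conv2D(4096, 1, activation='relu', name='fc7_conv')(x)     # fc7 -> 1x1 conv
scores = layers.Conv2D(NUM_CLASSES, 1, name='score')(x)               # 7x7x21 class scores

# Learnable "deconvolution" that upsamples 32x back to the input resolution
output = layers.Conv2DTranspose(NUM_CLASSES, kernel_size=64, strides=32,
                                padding='same', name='upsample_32x')(scores)

fcn32 = tf.keras.Model(backbone.input, output, name='fcn32')
```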
 
 
Note that for any arbitrary input dimension (e.g. 512x512), the feature map after the last pooling layer will be 1/32 of the input size (16x16), and after the deconvolution layer (upsampling by 32 times) the output mask will have the same dimensions (512x512) as the input.

This network is what we call FCN-32.

FCN-32
 
The FCN-32 network outputs a segmentation mask for the input, but the output maps are coarse because the 32-pixel stride at the final prediction layer limits the scale of detail in the upsampled output.
 
To overcome this, the authors use skip connections, which combine the semantic information from deep layers with the spatial location information from shallow layers to produce accurate and detailed segmentations.

This gives us different variants of FCN, i.e. FCN-16 and FCN-8.

FCN-16

For FCN-16, the output of pool4 is convolved with a 1x1 convolution to get class-specific predictions with 21 channels. This predicted output is fused (added) with the 2x upsampled output of conv7, and 16x upsampling is performed on the fused output to get the final segmentation mask.
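A sketch of this fusion, continuing the hypothetical FCN-32 code above (the layer name 'block4_pool' is the pool4 layer in the Keras VGG16 backbone):

```python
# Continuing the FCN-32 sketch: fuse pool4 scores with 2x upsampled conv7 scores.
pool4 = backbone.get_layer('block4_pool').output                       # 14x14x512
pool4_scores = layers.Conv2D(NUM_CLASSES, 1, name='score_pool4')(pool4)

scores_2x = layers.Conv2DTranspose(NUM_CLASSES, kernel_size=4, strides=2,
                                   padding='same', name='upsample_2x')(scores)

fused = layers.Add(name='fuse_pool4')([pool4_scores, scores_2x])       # 14x14x21
output16 = layers.Conv2DTranspose(NUM_CLASSES, kernel_size=32, strides=16,
                                  padding='same', name='upsample_16x')(fused)

fcn16 = tf.keras.Model(backbone.input, output16, name='fcn16')
```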

FCN-32, FCN-16 and FCN-8

FCN-8

Similarly, for FCN-8, pool3 is convolved with a 1x1 convolution to get class-specific predictions for the 21 classes, and this output is fused with the 2x upsampled fused feature map of FCN-16 (i.e. 2x pool4 and 4x conv7). After that, 8x upsampling is performed on the fused output to get the final segmentation mask.
 
As FCN-8 has both fine spatial information and deep semantic information, it performs better than FCN-16 and FCN-32.

Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail

Results

Pixel accuracy, mean accuracy, mean intersection over union (IU) and frequency weighted IU are reported for these three networks.

 

Comparison of skip FCNs on a subset of PASCAL VOC 2011 validation

That's all for this blog, Thanks for reading.

Reference

  • https://arxiv.org/pdf/1411.4038.pdf
  • http://cs231n.stanford.edu/slides/2016/winter1516_lecture13.pdf

Wednesday, June 16, 2021

Sigmoid vs ReLU

Sigmoid and ReLU are two of the most important activation functions used in neural networks, so we will talk about these two in detail in this blog.

In a neural network, the neurons in a hidden layer perform a linear transformation on the input using the weights and bias of the layer, i.e. adding the bias to the product of the input and the layer's weights.

After that, an activation function is applied. The output of the neuron after the activation function decides whether the neuron should be activated or not.

Activation functions can be of two types: linear and non-linear.

Linear Activation Function

Adding a linear activation function in hidden layers is like adding another linear transformation of the input: the model will behave as if it is purely linear, and the network won't be able to learn complex non-linear functions.

A neural network with N layers, all with linear activation functions, is equivalent to a neural network with a single linear layer, because a composition of linear functions is still a linear function.
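A quick NumPy check of this claim (the shapes are arbitrary, just for illustration): two stacked linear layers collapse into a single equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                   # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # layer 1, linear activation
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # layer 2, linear activation

two_layers = W2 @ (W1 @ x + b1) + b2                     # stacked linear layers
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)               # single equivalent linear layer

print(np.allclose(two_layers, one_layer))                # True
```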

Another problem with a linear activation function is that its gradient is constant and independent of the input. If there is an error in the prediction, backpropagation will make a constant change that does not depend on the change in the input.

That's why we use non-linear activation functions in neural networks.

Non-Linear Activation Function

A neural network should be able to learn a complex function to map the input (non-linear data, images) to the output. Using non-linear activation functions adds non-linear transformations, which help the network learn such complex functions.

Let's look at two of the most important non-linear activation functions: Sigmoid and ReLU.

Sigmoid

Sigmoid squashes values into the range 0 to 1. As its output is between 0 and 1, we mostly use it in the output layer when we want to predict a probability.

Sigmoid function: z = f(x) = 1/( 1 + exp(-x) ) 

Derivative: f'(x) = z * (1-z)
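A minimal NumPy sketch of the function and its derivative (the helper names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # z = 1 / (1 + exp(-x))

def sigmoid_derivative(x):
    z = sigmoid(x)
    return z * (1.0 - z)                      # f'(x) = z * (1 - z)

print(sigmoid_derivative(0.0))    # 0.25, the maximum value of the derivative
print(sigmoid_derivative(10.0))   # ~4.5e-05, the gradient saturates for large |x|
```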

Sigmoid and its derivative curve

Properties

  • It is a differentiable and monotonic function.
  • It does not blow up the activation.
  • Non-linear function.

Disadvantages

  • Sigmoid is computationally expensive because we calculate the exponential of the input.
  • For large input values (positive or negative), the output saturates and this kills the gradient.
  • Because of the small range of gradient values (max 0.25 at x = 0) and the shrinking gradient as the magnitude of the input increases, it may cause vanishing gradients and slow down convergence.
  • Output values are not zero-centered.

 

ReLU (Rectified Linear Unit)

ReLU maps the input to the range 0 to +inf.

ReLU function: z = f(x) = max(0, x) 

Derivative: f'(x) = 1 for x > 0, else 0
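A corresponding NumPy sketch (using the convention of a zero derivative at x = 0, which is discussed below):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                 # f(x) = max(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)              # 1 for x > 0, else 0 (including at x = 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))              # [0.  0.  0.  0.5 2. ]
print(relu_derivative(x))   # [0. 0. 0. 1. 1.]
```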

ReLU and its derivative curve
 

ReLU is not differentiable at 0.

Properties 

  • More computationally efficient than Sigmoid, as it doesn't perform the expensive exponential operation.
  • In practice, networks with ReLU tend to show better convergence than networks with Sigmoid.
  • Helps with the vanishing gradient problem because of its strong gradient, i.e. 1 for positive inputs.
  • As some neurons output zero for negative inputs, activations become sparse, hence a sparse network. Sparse representations seem to be more beneficial than dense representations.
  • Non-linear function.

Disadvantages

  • ReLU can blow up the activation, as there is no mechanism to constrain the output. This may cause the exploding gradient problem.
  • For negative inputs, the output of ReLU is zero. If too many activations fall below zero, most of the neurons in the network will simply output zero, in other words die, thereby preventing learning. This is known as the dying ReLU problem, and it is addressed by Leaky ReLU (f(x) = max(0.01x, x)).

One important thing about ReLU is that it is not differentiable at 0. So how do we calculate the gradient of ReLU at 0?

There are two ways to deal with it. The first is to assign a value for the derivative at 0; typically we use 0, 0.5 or 1.

The second alternative is, instead of using the actual ReLU function, to use an approximation to ReLU which is differentiable for all values of x. One such approximation is called Softplus, defined as y = ln(1 + exp(x)), whose derivative is y' = 1/(1 + exp(-x)) (the sigmoid function).
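A small NumPy comparison of the two (since the derivative of Softplus is exactly the sigmoid, it is defined everywhere, including x = 0):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))                # y = ln(1 + exp(x))

def softplus_derivative(x):
    return 1.0 / (1.0 + np.exp(-x))           # y' = sigmoid(x), defined for all x

x = np.array([-5.0, 0.0, 5.0])
print(np.maximum(0.0, x))        # ReLU:     [0.     0.     5.    ]
print(softplus(x))               # Softplus: [0.0067 0.6931 5.0067] (smooth approximation)
print(softplus_derivative(x))    # gradient: [0.0067 0.5    0.9933]
```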

ReLU vs Softplus

That's all for this blog, thanks for reading.

Reference

  • https://jamesmccaffrey.wordpress.com/2017/06/23/two-ways-to-deal-with-the-derivative-of-the-relu-function/
  • https://www.datasciencecentral.com/profiles/blogs/deep-learning-advantages-of-relu-over-sigmoid-function-in-deep

Saturday, June 12, 2021

Batch Normalization: Accelerating Deep Network Training

In this blog we will talk about Batch Normalization layer.

Before understanding Batch Normalization, we first need to understand what internal covariate shift is.

Internal covariate shift is defined as the change in the distribution of network activations (feature maps) due to the change in network parameters during training.

Because of this internal covariate shift, training deep neural networks is difficult: the input distribution of the hidden layers keeps changing, so the layers need to continuously adapt to the new distribution.

For example, if a layer computes F2(x, Θ2), where x is the output of the previous layer (with parameters Θ1), then Θ2 needs to readjust to compensate for the change in the distribution of x. This slows down training by requiring lower learning rates and careful parameter initialization.

To address this problem we can normalize the layer inputs; that's where batch normalization is useful.

Why do we call this batch normalization? Because training happens over a mini-batch (e.g. 64/128/256) of training images, so we normalize a batch of feature maps.

For a layer with k-dimensional input x = (x(1), ..., x(k)), we normalize each dimension, x̂(k) = (x(k) - E[x(k)]) / √Var[x(k)], where the expectation and variance are computed over the training data set. Such normalization speeds up convergence.

The resulting normalized activations x̂(k) have zero mean and unit variance.

Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity.

To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k) x̂(k) + β(k).

These parameters are learned along with the original model parameters, and restore the representation power of the network.  

Indeed, by setting γ(k)=√Var[x(k)] and β(k)=E[x(k)], we could recover the original activations, if that were the optimal thing to do. 

Batch Normalization During Training

  • During training, the mean and variance are calculated over the mini-batch; for convolutional feature maps, the mean and variance of each channel are calculated separately (over the batch and spatial dimensions).

     
  • Then we normalize the input feature map using the calculated mean and variance, so that the distribution has zero mean and unit variance.

     
  • As we are changing (normalizing) the input of the layer, this may change what the layer can represent. To address this, we make sure that the transformation inserted in the network can represent the identity transform by using the γ (scale) and β (shift) parameters. A minimal sketch of these steps follows this list.
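Here is a minimal NumPy sketch of the training-time computation for a convolutional feature map of shape (batch, height, width, channels); the epsilon, shapes and function name are illustrative assumptions rather than the exact implementation from the paper:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Per-channel statistics computed over the batch and spatial dimensions
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per channel
    return gamma * x_hat + beta                # learned scale and shift restore capacity

x = np.random.randn(8, 14, 14, 128)            # a mini-batch of feature maps
gamma, beta = np.ones(128), np.zeros(128)      # one scale/shift pair per channel
y = batch_norm_train(x, gamma, beta)
print(round(y.mean(), 4), round(y.std(), 4))   # approximately 0.0 and 1.0
```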

Batch Norm during training


Batch Normalization During Inference

  • The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically.

  • Instead of using mini-batch statistics, we use population statistics, i.e. the moving average of the mean and variance over the mini-batches used during training. The moving mean and moving variance are non-trainable parameters that are calculated (updated) and stored during training.

    • moving_mean(t) = moving_mean(t-1) * momentum + batch_mean(t) * (1 - momentum)
    • moving_var(t) = moving_var(t-1) * momentum + batch_var(t) * (1 - momentum)
    • momentum = 0.99 (a typical default)

  • Then we normalize the test sample using the population mean and variance. After that, we transform this normalized sample using the learned scale (γ) and shift (β) parameters of the respective layer, which together yield a single linear transformation.

  • Since the parameters are fixed in this transformation, the batch normalization procedure essentially applies a linear transform to the activation (see the sketch below).
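Continuing the NumPy sketch above, inference-time batch norm just applies a fixed linear transform using the stored moving statistics:

```python
import numpy as np

def batch_norm_inference(x, gamma, beta, moving_mean, moving_var, eps=1e-5):
    # moving_mean / moving_var are the population statistics accumulated during training
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta                # a deterministic, fixed linear transform

x = np.random.randn(1, 14, 14, 128)            # a single test sample
gamma, beta = np.ones(128), np.zeros(128)
moving_mean, moving_var = np.zeros(128), np.ones(128)
print(batch_norm_inference(x, gamma, beta, moving_mean, moving_var).shape)  # (1, 14, 14, 128)
```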

One more important question for batch normalization is where it should be used: before the activation layer or after the activation layer.

  • We can use the batch normalization layer either before or after the activation layer.

  • For s-shaped functions such as Sigmoid and hyperbolic tangent (Tanh), one can use batch normalization after the activation function.

  • For activations which may result in non-Gaussian distributions, like the rectified linear activation function, one can use batch normalization before the activation function.

  • The Batch Normalization paper suggests using batch normalization before the activation function.

Let's see the parameters of a batch normalization layer (Keras implementation) placed after a convolution layer whose output feature map dimension is (batch_size, 14, 14, 128).

Observe that the dimension of γ, β, moving mean and moving variance is equal to the number of channels of the input feature map.
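A small Keras sketch illustrating this (the convolution configuration here is just a hypothetical way to obtain a (batch_size, 14, 14, 128) feature map):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 3))
x = layers.Conv2D(128, 3, strides=2, padding='same')(inputs)   # (None, 14, 14, 128)
x = layers.BatchNormalization()(x)
model = tf.keras.Model(inputs, x)

bn = model.layers[-1]
# gamma, beta, moving_mean, moving_variance: one value per channel
print([w.shape for w in bn.get_weights()])    # [(128,), (128,), (128,), (128,)]
```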

That's all about Batch Normalization.

Thanks for reading the blog.

Reference

  • https://arxiv.org/pdf/1502.03167.pdf
  • https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/

Tuesday, June 8, 2021

Dropout

What is Dropout Layer ?

The dropout layer randomly sets input units to 0 with a frequency of drop_rate at each iteration during training time, which helps prevent overfitting.

The key idea of dropout is to randomly drop nodes, along with their connections, from the neural network during training.

The dropout layer takes a single float value between 0 and 1 as an argument. In the Keras implementation it denotes the drop probability of a unit. We will call it p_drop, so the keep probability of a unit is p_keep = 1 - p_drop.

Each unit is retained with a fixed probability (p_keep) independent of the other units. Generally we use p_drop = 0.5 for dense layers.

Why do we need Dropout ? 

To solve the problem of Overfitting.

Overfitting means our model is performing well on training data but not performing well on test data (or new data).

One reason for overfitting is that our model is quite complex (having a large number of parameters), so instead of just learning (generalizing) the patterns/features in the data it also learns the noise present in the data; it adjusts its weights to perform well on the training data, or we could say that it adjusts its weights to memorize the training data. Another reason for overfitting is that the training data is not a good representation of the overall (real) data.

If the training dataset is a good representation of the real data but is not large enough, this can also cause overfitting.

How does dropout solve the problem of overfitting?

There are multiple ways to look at this:

1. One way to look at it is that, by randomly setting layer units to zero, dropout reduces model complexity, which helps with overfitting.

2. During each training step, it drops units from the layer with probability p_drop and then trains a thinned network.

Because each training step trains a unique thinned network with fewer neurons, the neurons present in the network learn the representations (features) required for correct prediction on their own. This prevents neurons from co-adapting too much with each other.

This makes the network capable of better generalization and hence reduces overfitting.

3. ""Overfitting can also be solved by training all possible neural network for a dataset and average the prediction form all model. But this is not possible.""

Let's see how we can interpret this idea in terms of the dropout layer.

During training with dropout we train multiple sparse (thinned) neural networks, and at test time we approximate the effect of averaging the predictions from all these thinned networks by simply using the original unthinned network with scaled-down weights. This helps with the overfitting problem. Let's see this in detail.

A neural network with n units can be seen as a collection of 2^n possible thinned neural networks, all of which share the same weights.

During each training step we sample one of the 2^n networks and train it, so over the whole training process we train many thinned networks.

So training a neural network with dropout can be seen as training a collection of 2^n thinned networks with extensive weight sharing, where each thinned network gets trained very rarely, if at all.

At test time, we cannot average the predictions from all those networks. However, a simple approximate averaging method works well: during inference, the idea is to use the full network, with all units, but with a scaled-down version of the weights.

If a unit is retained with probability p_keep during training, then the outgoing weights of that unit are multiplied by p_keep at test time. This ensures that the expected output of a hidden unit during training is the same as its actual output at test time. With this scaling, the 2^n networks with shared weights can be combined into a single neural network to be used at test time.

 

Dropout during training and inference time

Let's say we want to apply dropout to the input data d = {1, 2, 3, 4, 5} with p_drop = 0.2. During training, each unit of d is dropped with probability 0.2, so d could become {1, 2, 3, 0, 5}; another way to look at this is that we keep each node with probability (p_keep) 0.8.

During inference we use all the units, as dropout does not remove units at inference time. If we use all units during inference, the expected output will be different from training time. To make sure that the distribution of the values after the transformation remains almost the same, we multiply the input by the keep probability p_keep = 1 - p_drop at inference time; for our example, d would become {0.8, 1.6, 2.4, 3.2, 4.0}.

But in general we don't want the dropout layer to do anything at inference time, so instead we scale the kept values by 1/p_keep during training only (this is called inverted dropout).

So now during training d could become {1.25, 2.5, 3.75, 0, 6.25}, and nothing happens to the input d at inference time.

That is why the Keras documentation for dropout says that dropout first sets units to 0 with the given drop probability (p) and then scales the remaining values by 1/(1 - p).
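A minimal NumPy sketch of this inverted dropout behaviour (the function name and random mask are illustrative; Keras handles this internally):

```python
import numpy as np

def inverted_dropout(d, p_drop=0.2, training=True):
    if not training:
        return d                                    # inference: dropout does nothing
    p_keep = 1.0 - p_drop
    mask = np.random.rand(*d.shape) < p_keep        # keep each unit with probability p_keep
    return d * mask / p_keep                        # surviving units scaled by 1/p_keep

d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(inverted_dropout(d, p_drop=0.2))              # e.g. [1.25 2.5  3.75 0.   6.25]
print(inverted_dropout(d, training=False))          # [1. 2. 3. 4. 5.]
```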

That's all about dropout, thanks for reading the blog!

References

1. https://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
2. https://keras.io/api/layers/regularization_layers/dropout/
3. https://leimao.github.io/blog/Dropout-Explained/