What is a Dropout Layer?
The Dropout layer randomly sets input units to 0 with a frequency of drop_rate at each step during training, which helps prevent overfitting.
The key idea of dropout is to randomly drop nodes, along with their connections, from the neural network during training.
The Dropout layer takes a single float value between 0 and 1 as input. In the Keras implementation this value denotes the drop probability of a unit. We will call it p_drop, so the keep probability of a unit is p_keep = 1 - p_drop.
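For reference, here is a minimal sketch of how the layer is typically used in Keras. The layer sizes, input shape, and the 0.2 rate are illustrative assumptions, not values from this post:

```python
import tensorflow as tf

# A small fully connected model with dropout after the hidden layers.
# Dropout(0.2) means p_drop = 0.2, i.e. each unit of the previous layer's
# output is zeroed with probability 0.2 during training only.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```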
Why do we need Dropout?
To solve the problem of Overfitting.
Overfitting means our model performs well on training data but not on test data (or new data).
One reason for overfitting is that our model is quite complex (it has a large number of parameters), so instead of just learning (generalizing) the patterns/features in the data, it also learns the noise present in the data. It adjusts its weights to perform well on the training data, or we could say it adjusts its weights to memorize the training data. Another reason for overfitting is that the training data is not a good representation of the overall (real) data.
Even if the training dataset is a good representation of the real data, having too little of it can also cause overfitting.
How does dropout solve the problem of overfitting?
There are multiple ways to look at this:
1. One way to look at it is that by randomly setting layer units to zero, dropout reduces model complexity, and a less complex model is less prone to overfitting.
2. During each training step, dropout drops units from the layer with probability p_drop and then trains a thinned network.
Because each training step trains a unique thinned network with fewer neurons, the neurons remaining in the network must learn the representations (features) required for a correct prediction on their own. This prevents neurons from co-adapting too much with each other.
This makes the network capable of better generalization and hence helps solve overfitting.
3. ""Overfitting can also be solved by training all possible neural network for a dataset and average the prediction form all model. But this is not possible.""
Let's see how we can interpret the above concept with dropout layer.
During each training step we sample one out of 2^n network and train, so during whole training process we train multiple thinned network.
So training a neural network with dropout can be seen as training a collection of 2^n thinned network with extensive weight sharing, where each thinned network get trained very rarely, if at all.
At test time, we can not take average of the prediction from all those networks. However simple approximate average method work well. So during inference time, idea is to use full network with all units with scaled-down version of weights.
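As a rough illustration of the "thinned network" idea, here is a small NumPy sketch. The layer width, weights, and mask sampling are illustrative assumptions; the point is that every training step samples a binary keep/drop mask, which effectively selects one of the 2^n sub-networks, all of which share the same weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

n_units = 5        # hidden layer width, so there are 2**n_units possible thinned networks
p_drop = 0.2
W = rng.normal(size=(n_units, n_units))   # weights shared by every thinned network

def thinned_forward(x):
    """One training-time forward pass through a randomly thinned layer."""
    keep_mask = rng.random(n_units) > p_drop   # 1 = keep unit, 0 = drop unit
    h = x @ W
    return h * keep_mask                       # dropped units contribute nothing

x = rng.normal(size=(1, n_units))
print(thinned_forward(x))   # a different sub-network is sampled on every call
```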
Dropout during training and inference time
Let's say we want to apply dropout to the input data d = {1, 2, 3, 4, 5} with p_drop = 0.2. During training, on average one unit of d will be set to zero, so d could become {1, 2, 3, 0, 5}. Another way to look at this is that we keep each node with probability p_keep = 0.8.
During inference we use all the units, because dropout does not remove units at inference time. But if we simply use all the units at inference, the expected output will differ from training time. To make sure the distribution of values after the transformation stays almost the same, we multiply the input by the keep probability p_keep (1 - p_drop) at inference time, so during inference the same d would become {0.8, 1.6, 2.4, 3.2, 4.0}.
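Here is a small NumPy sketch of this "scale at inference" variant that reproduces the numbers above. The mask is hard-coded to drop the fourth unit only to match the example; in practice it is sampled randomly:

```python
import numpy as np

d = np.array([1., 2., 3., 4., 5.])
p_drop = 0.2
p_keep = 1.0 - p_drop

# Training: drop units (the 4th unit is forced to be dropped to match the example)
mask = np.array([1., 1., 1., 0., 1.])
train_out = d * mask          # -> [1. 2. 3. 0. 5.]

# Inference: keep every unit but scale the input by p_keep
infer_out = d * p_keep        # -> [0.8 1.6 2.4 3.2 4. ]
print(train_out, infer_out)
```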
But in general we don't want the dropout layer to do anything at inference time, so instead we scale the kept values by 1/p_keep during training only.
So during training d could become {1.25, 2.5, 3.75, 0, 6.25}, and nothing has to be done to the input d during inference.
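A corresponding sketch of this training-time-scaling behavior, again with the fourth unit dropped on purpose to reproduce the numbers:

```python
import numpy as np

d = np.array([1., 2., 3., 4., 5.])
p_drop = 0.2
p_keep = 1.0 - p_drop

# Training: drop units and scale the surviving values by 1 / p_keep
mask = np.array([1., 1., 1., 0., 1.])
train_out = d * mask / p_keep   # -> [1.25 2.5  3.75 0.   6.25]

# Inference: the dropout layer is a no-op, the input passes through unchanged
infer_out = d                   # -> [1. 2. 3. 4. 5.]
print(train_out, infer_out)
```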
That is why, if you look at the Keras documentation for Dropout, it says that dropout first sets units to 0 with the given drop probability (p) and then scales the remaining values up by 1/(1 - p).
That's all about dropout, thanks for reading the blog!