
Wednesday, June 16, 2021

Sigmoid vs ReLU

Sigmoid and ReLU are two of the most important activation functions used in neural networks, so in this blog we will talk about these two in detail.

In a neural network, each neuron in a hidden layer performs a linear transformation on its input using the layer's weights and bias, i.e. it adds the bias to the product of the input and the weights.

After that, an activation function is applied. The output of the activation function decides whether the neuron should be activated or not.
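As a rough sketch of this step (using NumPy, with made-up example input, weights, and bias, not from any particular network), a single neuron could be written as:

    import numpy as np

    def neuron(x, w, b, activation):
        z = np.dot(w, x) + b       # linear transformation: weights * input + bias
        return activation(z)       # activation function applied on top

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])    # example input
    w = np.array([0.4, 0.1, -0.6])    # example weights
    b = 0.2                           # example bias
    print(neuron(x, w, b, sigmoid))   # output of the activated neuron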

Activation functions can be of two types: linear and non-linear.

Linear Activation Function

Adding a linear activation function in the hidden layers is like applying another linear transformation on top of the input and weights: the model behaves as if it were purely linear, and the network cannot learn complex non-linear functions.

A neural network with N layers, all with linear activation functions, is equivalent to a neural network with a single linear layer, because a linear combination of linear functions is still a linear function.
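Here is a quick numerical sketch of that collapse, using arbitrary random weights purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)                                # example input
    W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1 parameters
    W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2 parameters

    two_linear_layers = W2 @ (W1 @ x + b1) + b2           # two stacked linear layers
    W, b = W2 @ W1, W2 @ b1 + b2                          # folded into a single layer
    one_linear_layer = W @ x + b

    print(np.allclose(two_linear_layers, one_linear_layer))  # True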

Another problem with a linear activation function is that its gradient is constant and independent of the input. If there is an error in the prediction, backpropagation makes a constant update that does not depend on the change in the input.

That's why we use non-linear activation functions in neural networks.

Non-Linear Activation Function

A neural network should be able to learn complex functions that map inputs (non-linear data, images) to outputs. Using non-linear activation functions adds non-linear transformations, which help the network learn such complex functions.

Let's look at the two most important non-linear activation functions: Sigmoid and ReLU.

Sigmoid

Sigmoid squashes values into the range 0 to 1. Because its output is between 0 and 1, we mostly use it in the output layer when we want to predict a probability.

Sigmoid function: z = f(x) = 1/( 1 + exp(-x) ) 

Derivative: f'(x) = z * (1-z)

Sigmoid and its derivative curves
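A minimal NumPy sketch of the function and its derivative, just to make the formulas concrete:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(x):
        z = sigmoid(x)
        return z * (1.0 - z)

    print(sigmoid(0.0))             # 0.5
    print(sigmoid_derivative(0.0))  # 0.25, the maximum value of the derivative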

Properties

  • It is a differentiable and monotonic function.
  • It does not blow up the activations (its output is bounded between 0 and 1).
  • It is a non-linear function.

Disadvantages

  • Sigmoid is computationally expensive because we calculate the exponential of the input.
  • For large input values (positive or negative), the output saturates and this kills the gradient.
  • Because the gradient has a small range (maximum 0.25 at x = 0) and shrinks as the magnitude of the input increases, it can cause vanishing gradients and slow down convergence (see the sketch after this list).
  • Output values are not zero-centered.
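Here is a rough sketch of how that small gradient compounds: during backpropagation the chain rule multiplies one sigmoid derivative per layer, and each factor is at most 0.25. Assuming 10 sigmoid layers all sitting at the best case x = 0 (and ignoring the weight terms in the chain rule):

    import numpy as np

    def sigmoid_derivative(x):
        z = 1.0 / (1.0 + np.exp(-x))
        return z * (1.0 - z)

    grad = 1.0
    for _ in range(10):                  # 10 sigmoid layers, each at the best case x = 0
        grad *= sigmoid_derivative(0.0)  # multiply by at most 0.25 per layer
    print(grad)                          # ~9.5e-07, the gradient has all but vanished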

 

ReLU (Rectified Linear Unit)

ReLU maps the input to the range 0 to +inf: negative inputs become 0 and positive inputs pass through unchanged.

ReLU function: z = f(x) = max(0, x) 

Derivative: f'(x) = 1 for x > 0, else 0

ReLU and its derivative curves
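A minimal NumPy sketch of the function and its derivative, using the common convention of a 0 derivative at x = 0 (discussed at the end of this post):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_derivative(x):
        return (x > 0).astype(float)     # 1 for x > 0, else 0 (convention: 0 at x = 0)

    xs = np.array([-2.0, 0.0, 3.0])
    print(relu(xs))                      # [0. 0. 3.]
    print(relu_derivative(xs))           # [0. 0. 1.]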
 

ReLU is not differentiable at 0.

Properties 

  • More computationally efficient than Sigmoid, as it doesn't involve the expensive exponential operation.
  • In practice, networks with ReLU tend to show better convergence than networks with Sigmoid.
  • Helps solve the vanishing gradient problem because of its strong gradient, i.e. 1 for positive inputs.
  • Since neurons output zero for negative inputs, the activations, and hence the network, become sparse. Sparse representations seem to be more beneficial than dense representations.
  • It is a non-linear function.

Disadvantages

  • ReLU tends to blow up the activations, as there is no mechanism to constrain the output. This may cause the exploding gradient problem.
  • For negative inputs, the output of ReLU is zero. If too many activations fall below zero, most of the neurons in the network will simply output zero, in other words die, and thereby prevent learning. This is known as the dying ReLU problem. It is addressed by Leaky ReLU (f(x) = max(0.01x, x)); a minimal sketch follows this list.
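As a minimal sketch of Leaky ReLU, using the usual small slope of 0.01 for negative inputs as in the formula above:

    import numpy as np

    def leaky_relu(x, alpha=0.01):
        # negative inputs keep a small slope alpha instead of being zeroed out,
        # so the gradient never becomes exactly 0 and the neuron can still learn
        return np.where(x > 0, x, alpha * x)

    print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # [-0.02  0.    3.  ]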

One important thing about ReLU is that it is not differentiable at 0. So how do we calculate the gradient of ReLU at 0?

There are two ways to deal with it. The first is to assign a value to the derivative at 0; typically we use 0, 0.5, or 1.

The second alternative is, instead of using the actual ReLU function, to use an approximation to ReLU that is differentiable for all values of x. One such approximation is called Softplus, which is defined as y = ln(1 + exp(x)) and has the derivative y' = 1/(1 + exp(-x)), i.e. the sigmoid function.
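A minimal sketch of Softplus and its derivative (note that this naive form can overflow for very large x; deep learning frameworks use a numerically stable variant):

    import numpy as np

    def softplus(x):
        return np.log(1.0 + np.exp(x))   # y = ln(1 + exp(x)), smooth for all x

    def softplus_derivative(x):
        return 1.0 / (1.0 + np.exp(-x))  # y' is exactly the sigmoid function

    xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(softplus(xs))                  # close to 0 for negative x, close to x for large positive x
    print(softplus_derivative(xs))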

ReLU vs Softplus

That's all for this blog, thanks for reading.

References

  • https://jamesmccaffrey.wordpress.com/2017/06/23/two-ways-to-deal-with-the-derivative-of-the-relu-function/
  • https://www.datasciencecentral.com/profiles/blogs/deep-learning-advantages-of-relu-over-sigmoid-function-in-deep
