It can be confusing to decide which activation function is best for a particular neural network. Through this post, you will understand what an activation function is, the different types of activation functions, how to use them, and where each one fits best.

So let’s start:

Table of contents:

  • What is an activation function?
  • Activation function types:
    • Binary
    • Linear
    • Non-linear
  • Non-linear activation functions list:
    • Sigmoid/Logistic Function
    • Tanh/Hyperbolic Tangent Function
    • ReLU Function
    • Leaky ReLU
    • Parametric ReLU
    • Softmax Function
    • Swish Function

What is an activation function?

Just think: how does a neural network decide which class to predict for a particular problem?

In a neural network, data comes in from the input side, and you need to decide how much of it to let through to the output. So, how do you do that?

Activation Function helps you do that.

The activation function is a kind of gate for the neural network which decides the output of each neuron, and hence of the model.
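As a rough sketch of where that gate sits: a single artificial neuron first computes a weighted sum of its inputs and then passes that sum through the activation function. The inputs, weights, and bias below are made-up values for illustration only.

import numpy as np

def neuron(x, w, b, activation):
    # weighted sum of the inputs, then gated by the activation function
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # made-up inputs
w = np.array([0.4, 0.1, -0.6])   # made-up weights
b = 0.2
print(neuron(x, w, b, activation=lambda z: max(0.0, z)))  # ReLU-style gate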

Activation function types:

Okay! So, now we understand what an activation function means.

So, let’s check out different types of activation functions.

Binary Activation Functions:

The binary activation function is like a threshold function: it computes the output in the form of '0' or '1'.

e.g. Step function

For example, if you want to predict whether it is going to rain today or not,

or

whether a particular image is of a cat or not (binary classification).

f(x) = 1,  x >= 0
f(x) = 0,  x < 0
[Step function plot. Image Source: Varsity Tutors]
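A minimal NumPy sketch of the step function above (threshold at zero, as in the formula):

import numpy as np

def step(x):
    # 1 where x >= 0, else 0
    return np.where(x >= 0, 1, 0)

print(step(np.array([-2.0, -0.1, 0.0, 0.7])))  # [0 0 1 1]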

Limitations of Binary activation function

The most important thing in a neural network is feedback. It is only through feedback that the output can be improved and accuracy increased. To incorporate a feedback mechanism, we apply the gradient descent algorithm, and for that the derivative of the activation function is needed.

If you calculate the derivative of the binary activation function, it does not depend on the input: it is zero everywhere except at the threshold. That means you won't be able to figure out how the inputs should be adjusted to give a better prediction.

Linear

The linear activation function produces an output that is proportional to the input.

 f(x) = ax

Pros:

It is not limited to binary output but can be used for multi-class classification.

Cons:

  • The biggest issue with the linear activation function is that, with a linear activation in every layer, all layers of the neural network collapse into a single layer. So the model behaves like a linear regression function (see the sketch after this list).
  • The derivative of a linear function is a constant, so it does not depend on the input. That is why it is not possible to identify which input neurons affect the output.
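To see the first point concretely, here is a small NumPy sketch (with made-up weight matrices) showing that two stacked layers with a linear activation are equivalent to a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)         # input vector
W1 = rng.normal(size=(4, 3))   # "layer 1" weights
W2 = rng.normal(size=(2, 4))   # "layer 2" weights

two_layers = W2 @ (W1 @ x)     # two stacked layers with a linear activation
one_layer = (W2 @ W1) @ x      # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True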

Non-linear Activation Functions:

Most neural networks use non-linear activation functions.

Why?

Because non-linear activation functions are capable of mapping complex relationships between inputs and outputs.

Pros:

  • With the help of a non-linear activation function, the back-propagation algorithm (feedback mechanism) can be applied.
  • Using non-linear activation functions, multiple layers of a neural network can capture different features, so it becomes possible to build a "deep neural network".

Let’s discuss some of the non-linear activation functions which are most widely used:

Sigmoid Function

The sigmoid function transforms the output into the range between 0 and 1. It is helpful where the prediction is required in the form of a probability.

E.g. binary classification. In such cases, we just need to predict the probability that the input belongs to a class.

f(x) = 1/(1+e^(-x))

The function used in Keras:

tf.keras.activations.sigmoid(x)
[Sigmoid function plot. Image Source: Towards Data Science]
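In practice these activations are usually attached to a layer rather than called directly. A minimal sketch, assuming TensorFlow 2.x, of how the sigmoid (or any other activation listed here) plugs into a Keras model; the layer sizes are made up for illustration:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output in (0, 1), read as a probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()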

Cons:

  • If you take the derivative of this activation function and check its graph, you will see that for very high or very low values of x the output does not change much, which leads to the vanishing gradient problem (a quick numerical check follows below).
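The quick check: the derivative of the sigmoid is sigmoid(x) * (1 - sigmoid(x)), which is largest near x = 0 and close to zero for large |x|.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(x)
    print(x, s * (1 - s))   # about 0.25 at x = 0, nearly 0 for large |x|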

Tanh Function

It is similar to the sigmoid function except that its output can take both positive and negative values, because it is symmetric around the origin.

tanh(x) = 2/(1+e^(-2x)) -1

The function used in Keras:

tf.keras.activations.tanh(x)
[Tanh function plot. Image Source: Papers with Code]

Note:

Usually, the ‘Tanh’ function is preferred over ‘Sigmoid’ as it’s symmetrical around the origin.

ReLU Function

ReLU stands for Rectified Linear Unit. It activates only some of the neurons at a time: negative inputs are set to zero, while positive inputs are passed through unchanged.

f(x) = max(0,x)

The function used in Keras:

tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0)
[ReLU function plot. Image Source: Medium]

Cons:

  • For negative inputs, the output is zero. So during the feedback mechanism, the gradient for those neurons is also zero and they stop updating, which leads to the dead neuron problem. Once a neuron is dead, it can't be activated again, which is why 'Leaky ReLU' came as an alternative to ReLU.

Leaky ReLU

Though ReLU is computationally efficient, it has the problem that it sets all negative neurons to zero.

So, as a solution, "Leaky ReLU" came as its alternative.

f(x)= x,      x>=0
f(x)= 0.01x,  x<0
[Image Source: i2tutorials]
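The post does not list a Keras call for Leaky ReLU; assuming TensorFlow 2.x, here are two common ways to express it (the 0.01 slope matches the formula above):

import tensorflow as tf

x = tf.constant([-2.0, -0.5, 1.0])

# as a plain activation call with a non-zero slope for negative inputs
y = tf.keras.activations.relu(x, alpha=0.01)

# or as a dedicated layer (newer Keras releases name this argument negative_slope)
leaky = tf.keras.layers.LeakyReLU(alpha=0.01)
print(y.numpy(), leaky(x).numpy())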

Parametric Relu

To deal with the negative part of the input, parametric ReLU makes the slope 'a' of the negative side a parameter instead of fixing it at 0.01.

f(x) = max(ax,x)
[Parametric ReLU function plot. Image Source: Robofied]
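Keras provides this as a layer whose negative slope is learned during training. A minimal sketch, assuming TensorFlow 2.x; the layer sizes are made up for illustration:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16),
    tf.keras.layers.PReLU(),   # the slope 'a' in max(ax, x) is a trainable parameter
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.summary()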

Softmax Function

It normalizes the outputs for the given classes to values between 0 and 1 by exponentiating each output and dividing it by the sum of all the exponentiated outputs, so the results can be read as the probabilities of the respective classes.

The softmax function is widely used in the output layer of a neural network (usually on top of a fully connected layer), as it gives a probability for each of multiple classes.

f(x_i) = e^(x_i) / Sum_j(e^(x_j))

The function used in Keras:

tf.keras.activations.softmax(x, axis=-1)
[Softmax function plot. Image Source: Robofied]
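A tiny worked example of the formula with made-up class scores, showing that the outputs are positive and sum to 1:

import numpy as np

logits = np.array([1.0, 2.0, 3.0])               # made-up class scores
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs)        # roughly [0.09, 0.24, 0.67]
print(probs.sum())  # 1.0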

Swish Function

It is a relatively recent activation function proposed by researchers at Google; it is a self-gated activation function.

f(x) = x*sigmoid(x)
[Swish function plot. Image Source: V7 Labs]
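The post does not list a Keras call for Swish; assuming TensorFlow 2.2 or later, it is available as a built-in activation, and it can also be written out directly from the formula:

import tensorflow as tf

x = tf.constant([-2.0, 0.0, 2.0])

y_builtin = tf.keras.activations.swish(x)        # built-in (TensorFlow 2.2+)
y_manual = x * tf.keras.activations.sigmoid(x)   # same as the formula f(x) = x * sigmoid(x)

print(y_builtin.numpy())
print(y_manual.numpy())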

Conclusion

This was all about the different activation functions used in neural networks. There is no single best or worst activation function; it all depends on the use case. You need to weigh the speed, efficiency, and accuracy that each activation function offers.

Let us know through the comments if you found it useful.