Deciding which activation function is best for a particular neural network can be confusing. This post explains what activation functions are, the different types, how to use them, and where each one fits best.

So let’s start:

## Table of contents:

- Activation function definition
- Use of activation function
- Activation function types:
  - Binary
  - Linear
  - Non-linear
- Activation functions list:
  - Sigmoid/Logistic Function
  - Tanh/hyperbolic Function
  - ReLU Function
  - Leaky ReLU
  - Parametric ReLU
  - Softmax Function
  - Swish Function

## What is an activation function?

Think about how a neural network decides which class to predict for a particular problem.

In a neural network, data comes in from the input side, and each neuron has to decide how much of the signal it receives should be passed on. So, how is that done?

This task is done by “activation functions”.

An activation function is a kind of gate in a neural network: it transforms the weighted input of a neuron and so decides what that neuron, and ultimately the model, outputs.

## Activation function types:

Okay! We now understand what an activation function is and why it is needed.

So, let’s check out the different types of activation functions.

Overall, activation functions can be divided into three categories:

### Binary Activation Functions:

A binary activation function is a threshold function: it outputs either ‘0’ or ‘1’, e.g. the step function.

For example, questions such as “Is it going to rain today or not?” or “Does this particular image show a cat or not?” (binary classification) can be answered using binary activation functions.

f(x) = 1, x >= 0

f(x) = 0, x < 0
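As a minimal sketch, the step function above can be written in plain Python. The name `binary_step` and the sample inputs are only illustrative assumptions, not part of any library.

```python
def binary_step(x: float) -> int:
    """Return 1 when x is at or above the threshold 0, otherwise 0."""
    return 1 if x >= 0 else 0


# Hypothetical pre-activation values, chosen only for illustration.
samples = [-2.0, -0.1, 0.0, 0.5, 3.0]
outputs = [binary_step(x) for x in samples]
print(outputs)  # [0, 0, 1, 1, 1]
```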

#### Limitations of Binary activation function

The most important thing in a neural network is “feedback”: the output can only be improved, and accuracy increased, by feeding the error back into the network. This feedback mechanism is implemented with the gradient descent algorithm, which needs the derivative of the activation function.

The derivative of a binary activation function is zero for every input (and undefined at the threshold x = 0), so it carries no information about the input. This means gradient descent cannot figure out which weights to change to get a better prediction.
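To see the limitation numerically, here is a small sketch that estimates the slope of the step function with a central finite difference. The helper `numerical_slope` and the step size `h` are assumptions made for this illustration only.

```python
def binary_step(x: float) -> float:
    """Step function: 1.0 for x >= 0, otherwise 0.0."""
    return 1.0 if x >= 0 else 0.0


def numerical_slope(f, x: float, h: float = 1e-4) -> float:
    """Central finite-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Away from the threshold the function is flat, so every slope is zero
# and gradient descent receives no signal from these inputs.
slopes = [numerical_slope(binary_step, x) for x in (-3.0, -1.0, 0.5, 2.0)]
print(slopes)  # [0.0, 0.0, 0.0, 0.0]
```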

### Linear

A limitation of the binary step function is that it can only be used to classify between two classes. A linear activation function instead produces an output that is proportional to its input.

f(x) = ax

#### Pros:

It can be used in multiple-output scenarios; it is not limited to binary output.

#### Cons:

- The biggest issue with the linear activation function is that a composition of linear functions is itself linear, so all layers of the neural network collapse into a single layer. The model then behaves like a linear regression.

- The derivative of a linear function is constant, so it does not depend on the input. That’s why back-propagation cannot identify which input neurons affect the output the most.
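The first point can be checked with a tiny sketch: two stacked single-neuron layers with a linear activation give exactly the same result as one linear function. All names and weight values here are hypothetical and chosen only for illustration.

```python
import math


def linear_activation(x: float, a: float = 1.0) -> float:
    """Linear activation f(x) = a * x."""
    return a * x


def two_layer_network(x, w1, b1, w2, b2, a):
    """Two single-neuron layers, each followed by the linear activation."""
    hidden = linear_activation(w1 * x + b1, a)
    return linear_activation(w2 * hidden + b2, a)


def collapsed_network(x, w1, b1, w2, b2, a):
    """The same mapping written as one linear function: weight * x + bias."""
    weight = a * w2 * a * w1
    bias = a * (w2 * a * b1 + b2)
    return weight * x + bias


# Assumed example parameters (not taken from any trained model).
params = dict(w1=2.0, b1=1.0, w2=-3.0, b2=0.5, a=0.5)
for x in (-1.0, 0.0, 4.0):
    deep = two_layer_network(x, **params)
    flat = collapsed_network(x, **params)
    print(x, deep, flat, math.isclose(deep, flat))
```

However many such layers are stacked, the same algebra applies, which is why extra depth adds nothing with a linear activation.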

### Non-linear Activation Functions:

Most neural networks use non-linear activation functions.

Why?

Because non-linear activation functions are capable of mapping complex features to the output. These complex features can be extracted from any type of data, like images, audio files, text data etc.

#### Pros:

- The derivative of a non-linear activation function depends on the input, so the back-propagation algorithm (feedback mechanism) gets a useful signal about which weights to adjust.
- With non-linear activation functions, the layers of a neural network no longer collapse into one, so each layer can capture different features. This makes it possible to build a “deep neural network”.

**Let’s discuss some of the most widely used non-linear activation functions:**

## Sigmoid Function

The sigmoid function squashes its input into the range between 0 and 1. Because of this, it is widely used where a probability is needed as the prediction.

E.g. binary or multi-label classification, where we just need to predict the probability of each class independently.

f(x) = 1/(1+e^(-x))

Function used in keras:

`tf.keras.activations.sigmoid(x)`

#### Cons:

- If you plot the derivative of this activation function, you will see that for very high or very low values of x it is close to zero: the output barely changes with the input. This leads to the vanishing gradient problem.
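Here is a small sketch of the sigmoid and its derivative in plain Python, to make the vanishing gradient visible. The function names are illustrative. The derivative uses the standard identity f'(x) = f(x) * (1 - f(x)). This simple version is only meant for moderate inputs, since `math.exp` overflows for very large negative x.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid f(x) = 1 / (1 + e^(-x)); output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    """Derivative of the sigmoid, written as f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at 0.25 for x = 0 and shrinks towards zero
# as x moves far away from zero in either direction.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  derivative={sigmoid_derivative(x):.5f}")
```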

## Tanh Function

It is similar to the sigmoid function, except that its output ranges from -1 to 1, so it can take both positive and negative values, and it is symmetric around the origin.

tanh(x) = 2/(1+e^(-2x)) -1

Function used in keras:

`tf.keras.activations.tanh(x)`

Note:

Usually, the ‘Tanh’ function is preferred over ‘Sigmoid’, as its output is zero-centered (symmetric around the origin).
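As a quick check, the tanh formula given in this section can be compared with Python's built-in `math.tanh`. The helper name `tanh_from_formula` is an assumption made for this sketch.

```python
import math


def tanh_from_formula(x: float) -> float:
    """tanh written as 2 / (1 + e^(-2x)) - 1, i.e. a rescaled sigmoid."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    ours = tanh_from_formula(x)
    reference = math.tanh(x)
    # Outputs are symmetric around the origin: tanh(-x) = -tanh(x).
    print(f"x={x:5.1f}  formula={ours:.6f}  math.tanh={reference:.6f}")
```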

## ReLU Function

ReLU stands for rectified linear unit. It activates only some of the neurons at a time: negative inputs are set to zero, and for positive inputs the output equals the input.

f(x) = max(0,x)

Function used in keras:

`tf.keras.activations.relu(x, alpha=0.0, max_value=None, threshold=0)`

#### Cons:

- For negative inputs, the output is zero and so is the gradient. During the feedback mechanism, a neuron whose input is always negative therefore gets no weight updates. Such neurons are called dead neurons. Once a neuron is dead, it is unlikely to be activated again, which is why ‘Leaky ReLU’ came as an alternative to ReLU.
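The dead neuron problem can be sketched with a single hypothetical neuron whose weight and bias (assumed values, chosen for illustration) make its pre-activation negative for every input. Treating the ReLU gradient at exactly 0 as 0 is a common convention and is assumed here.

```python
def relu(x: float) -> float:
    """ReLU f(x) = max(0, x)."""
    return max(0.0, x)


def relu_gradient(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron: z = weight * x + bias is negative for all inputs below.
weight, bias = -1.0, -0.5
inputs = [0.2, 1.0, 3.0]

pre_activations = [weight * x + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]
# Gradient of the output with respect to the weight is relu'(z) * x.
weight_gradients = [relu_gradient(z) * x for z, x in zip(pre_activations, inputs)]

print(outputs)           # every output is 0.0
print(weight_gradients)  # every gradient is 0.0, so the weight never updates
```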

## Leaky ReLU

Though ReLU is computationally efficient, it still has the problem that it sets the output of every neuron with a negative input to zero.

As a solution, “Leaky ReLU” came as its alternative: it keeps a small slope for negative inputs.

f(x) = x, x >= 0

f(x) = 0.01x, x < 0
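A minimal sketch of Leaky ReLU and its gradient, following the formula above. The parameter name `negative_slope` is an assumption for this example. The point to notice is that the gradient for negative inputs is small but not zero.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: x for x >= 0, negative_slope * x otherwise."""
    return x if x >= 0 else negative_slope * x


def leaky_relu_gradient(x: float, negative_slope: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for x >= 0, negative_slope otherwise."""
    return 1.0 if x >= 0 else negative_slope


for x in (-100.0, -1.0, 0.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_gradient(x))
```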

## Parametric ReLU

To deal with negative inputs, parametric ReLU takes the slope of the negative part as an argument ‘a’, which is learned during training instead of being fixed at 0.01.

f(x) = max(ax,x)
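A rough sketch of the idea, assuming the slope `a` is at most 1 so that max(ax, x) picks ax for negative inputs. The function names are hypothetical. The second function shows the gradient with respect to `a`, which is the quantity a training loop would use to adjust the slope.

```python
def prelu(x: float, a: float) -> float:
    """Parametric ReLU f(x) = max(a * x, x), assuming a <= 1."""
    return max(a * x, x)


def prelu_gradient_wrt_slope(x: float) -> float:
    """Derivative of f with respect to a: x for negative inputs, 0 otherwise."""
    return x if x < 0 else 0.0


a = 0.25  # assumed starting slope, chosen only for illustration
for x in (-2.0, 0.0, 3.0):
    print(x, prelu(x, a), prelu_gradient_wrt_slope(x))
```

With a = 0.01 this reduces to Leaky ReLU, and with a = 0 it reduces to plain ReLU.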

## Softmax Function

The best part of the softmax function is that it can be used for multi-class classification. It exponentiates the output for each class and divides it by the sum of the exponentials over all classes, so every value lies between 0 and 1 and the values sum to 1. The result can be read as the probability of each class.

The softmax function is widely used in the output layer of a neural network (usually on top of the final fully connected layer), as it gives probabilities for multiple classes.

f(x_i) = e^(x_i)/Sum_j(e^(x_j))

Function used in keras:

`tf.keras.activations.softmax(x, axis=-1)`
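To make the computation concrete, here is a small pure-Python sketch of softmax over a list of class scores. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result. The scores themselves are made-up example values.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores (logits) for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # approximately 1.0
```

The class with the highest score gets the highest probability, and the predicted class is simply the index of the largest value.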

## Swish Function

Swish is a self-gated activation function proposed by researchers at Google: the input is multiplied by its own sigmoid, which acts as the gate.

f(x) = x*sigmoid(x)
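A minimal sketch of Swish following the formula above, written in plain Python for moderate inputs (the function name is illustrative). For large positive x it behaves like x, for large negative x it approaches 0, and in between it dips slightly below zero.

```python
import math


def swish(x: float) -> float:
    """Swish f(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))


for x in (-10.0, -5.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  swish={swish(x):.5f}")
```

Unlike ReLU, the function is not monotonic: the value at x = -1 is lower than the value at x = -5.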