Introduction:
Activation functions are among the most important components of a neural network model. They instill non-linearity into neural networks and also help suppress noisy data and strengthen signals where needed. The early idea of activation functions entered the engineering field from the biological filtering of information: neurons stay inactive below a certain signal level and fire once the signal crosses it. This phenomenon motivated early scientists to make artificial neurons behave in a similar way, emphasising or suppressing their output depending on the signal values they receive.
In current practice, however, activation functions are mostly motivated from a mathematical, optimization-related or purely technical point of view. In this blog we will explore some of the most famous activation functions, try to see why they were proposed and how/where they become important.
Summary of the article:
In this article, we will explore activation functions such as softmax, sigmoid, ReLU, GELU, ELU and others. We will talk about how these activation functions behave, why they are useful and how they are motivated. We will also mention code from one of our repositories showing how to implement a few of them using tensorflow.
A brief intro to neural networks and why activation functions are important
A neural network refers to a circuit of interconnected neurons, whether biological or artificial. In artificial neural networks, the neurons are called nodes, and the nodes are connected to each other by edges. While there are many different varieties of neural networks, we will stick to a basic feed-forward artificial neural network without any complications. If you want a proper introduction to what a neural network is, read this article first and then resume from the next section.
There are three important layers in such a neural network. These are:
(a) input layer (b) output layer and (c) hidden layer
The input layer is where we feed in the features of a data point. If you are, say, training an image classifier, then the input would be all the pixel values of the image.
The output layer is where the output of the neural network comes out. When we are dealing with multi-class classification, the output layer is more often than not a collection of probabilities. In the case of deep neural network regression, you may find a numeric output instead.
Now, the most important part and the real gem of a neural network is the hidden layer. Hidden layers are layers of nodes (neurons, if you prefer) connected either to other hidden layers or to the input/output layers. They are called hidden layers because they are separated from both the input and the output; during the running of the algorithm they are therefore "hidden" from the 'outside'.
The more important point is how the values in the hidden layers get calculated. First of all, you have to understand a small building block here. The neural network nodes are nothing but units for calculating and storing values. See the picture below to really get the idea:
Look at the picture. Each node carries a value which is a function of the values of the nodes connected to it and the weights on its incoming edges.
Now, there are three parts to this calculation. The first is that we take a weighted sum of the values of the connected nodes. The second is the bias term: while we create a weighted sum of the node values, we also need to add a constant term. The bias term is just that, a constant which we add to the weighted sum to get a better approximation or prediction from the neural net. The third is the activation function. The activation function is generally a non-linear function which we introduce into the neural network to bring non-linearity into the system. We will get into the details of activation functions shortly.
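To make the three parts concrete, here is a minimal numpy sketch of how a single node combines a weighted sum, a bias and an activation function; the helper name node_output and the sample numbers are purely illustrative, and sigmoid is used as the activation only as an example:

import numpy as np

def node_output(incoming_values, weights, bias):
    # weighted sum of the incoming node values, plus the constant bias term
    z = np.dot(weights, incoming_values) + bias
    # non-linear activation applied to the result (sigmoid used as an example)
    return 1 / (1 + np.exp(-z))

print(node_output(np.array([0.5, -1.2, 2.0]), np.array([0.3, 0.8, -0.5]), 0.1))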
But it should be pretty clear from the discussion above that a neural network built without activation functions is pretty much a stacked linear regression model. We are better off building traditional linear regression models than doing that; therefore we always add an activation function to our neural networks.
Now that we are done with the neural network basics, let's dive in to the activation functions.
Before starting with any of the formal descriptions, I want to mention, respectfully, that there are functions like the binary step function (f(x) = 1 when x > 0, -1 when x < 0) and the linear function (f(x) = x) which are also considered activation functions. Because of their simplicity and lesser popularity, I am skipping them here.
The heavenly father: sigmoid:
Sigmoid, is one of the most famous neural network activation functions.
f(x) = 1/(1+e^-x)
Sigmoid is a non-linear function which transforms any value to a value between 0 and 1. The significance of the sigmoid function is that it is:
(a) non-linear
(b) a transform that maps highly negative values to nearly 0 and highly positive values to nearly 1. Therefore sigmoid not only nullifies negative signals to almost 0, it also compresses the scale of any signal: however large the actual range of the inputs, the outputs stay within (0, 1). This is an important feature, but to the uninitiated it may take some time to sink in.
(c) essentially saturated beyond 6.5 and -6.5; i.e. sigmoid(6.5) ≈ 0.999 and sigmoid(-6.5) ≈ 0.001. Sometimes such a steep change within such a small interval can be an unsettling factor.
The function can be implemented with a few lines of numpy:

import numpy as np

def sigmoid_function(x):
    z = 1 / (1 + np.exp(-x))
    return z
or, if you are using tensorflow, just call tf.keras.activations.sigmoid. Check the tf documentation of the function for more details.
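As a quick sanity check of the numbers mentioned above, a small sketch (assuming TensorFlow 2.x, where tensors evaluate eagerly):

import tensorflow as tf

x = tf.constant([-6.5, 0.0, 6.5])
print(tf.keras.activations.sigmoid(x).numpy())   # roughly [0.0015, 0.5, 0.9985]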
Let's look at the derivative of the function. The derivative is
f'(x) = sigmoid(x)*(1-sigmoid(x))
This leads to the issues this function creates in more complicated cases. The derivative effectively vanishes beyond -3 and 3, as its value becomes very small there. For this reason, when we use sigmoid while training complex networks, the gradient sometimes vanishes effectively and the training ends up with poor quality.
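A small numpy sketch (the function names are just illustrative) makes the vanishing-gradient effect visible:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    # f'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

# the derivative peaks at 0.25 (at x = 0) and shrinks quickly away from 0
for x in [0, 3, 6, 10]:
    print(x, sigmoid_derivative(x))   # 0.25, ~0.045, ~0.0025, ~0.000045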
The other issue with sigmoid is that it is not symmetric around the centre, i.e. 0: its outputs are always positive. This leads to the neural net getting its weights pushed more in one direction than the other during training. To avoid this, we move to tanh. Let's read about tanh now.
The knight who came to rescue us: tanh:
One main issue with sigmoid was that it did not have symmetry around 0. To resolve this issue we bring the tanh function into the picture. The function has the mathematical definition below:
tanh(x) = (e^x - e^-x)/(e^x + e^-x)
We can implement this function using tf.keras.activations.tanh.
Now, observe that f(x) = -f(-x). The function therefore has f'(x) = f'(-x), which means its gradient is symmetric around 0. Therefore we can use the tanh function to clear up the issue of asymmetric training. Still, quite a few issues are observed with tanh too, most importantly of the vanishing-gradient type, since tanh also saturates for large positive and negative inputs.
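A minimal numpy sketch of the definition, just to illustrate the symmetry (np.tanh would of course do the same job):

import numpy as np

def tanh_function(x):
    # written out from the definition; equivalent to np.tanh(x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

print(tanh_function(2.0), tanh_function(-2.0))   # ~0.964 and ~-0.964, so f(-x) = -f(x)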
The old but new kid on the block: ReLU:
Now, tanh and sigmoid saturate after a certain value; i.e. they have a fixed range of outputs. But in many cases we don't want that; we want something which keeps increasing without a cap. For this activation function, scientists took motivation from a biological phenomenon: in many cases, neurons don't fire unless they are stimulated beyond some level. Imitating this, they came up with a very simple function
f(x)=max(0,x)
This function has three notable properties:
(a) it is not always activated, but only activates on a positive signal.
(b) it does not scale a signal down; it either passes it through unchanged or blocks it entirely (for negative signals).
(c) it has derivative 0 for x < 0 and 1 for x > 0; i.e. it does not suffer from the vanishing gradient issue in the positive half.
So this function got named ReLU, i.e. rectified linear unit, and became pretty famous for modelling deep neural networks. In tensorflow one can use it via tf.keras.activations.relu.
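A small sketch of the same function in numpy, with the tensorflow call alongside for comparison (assuming TensorFlow 2.x with eager execution):

import numpy as np
import tensorflow as tf

def relu_numpy(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 4.0])
print(relu_numpy(x))                                       # [0.  0.  0.  1.5 4. ]
print(tf.keras.activations.relu(tf.constant(x)).numpy())   # same values via tensorflow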
But still, there are problems with ReLU. ReLU has a 0 gradient for the whole x < 0 part. This creates the issue of some nodes not getting trained properly, or not changing their values at all. Then there are dead neurons caused by ReLU, the kind which never start getting trained because of how they were set at the beginning of training. To solve these problems, new changes were made to ReLU, and we got leaky ReLU along the way.
The father of complicated sons: LReLU:
Leaky ReLU was invented to solve the issue of dead ReLU, i.e. neurons in a network staying dead due to ReLU's zero-gradient side. It marks the start of a long line of more complicated variations; that's why I call it the father of complicated sons.
Leaky ReLU solves the problem of dead ReLU by simply leaking values in the x < 0 region. Therefore LReLU looks like:
f(x) = x, for x >= 0
f(x) = ax, for x < 0, where 0.1 <= a <= 0.3
Now, although LReLU solves these issues, there are more issues we get with LReLU. First of all, the alpha parameter of LReLU is not trained with the model, which leaves us with a partially optimized model. The second thing, which stays intact and is the biggest issue at this point, is the 'exploding gradient problem'; to solve this, new generations of ReLU corrections started to come. To implement LReLU in tensorflow, one can use tf.keras.layers.LeakyReLU.
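A minimal sketch in numpy, plus the Keras layer (assuming TensorFlow 2.x, where the slope argument is called alpha and is a fixed hyperparameter, not a trained weight):

import numpy as np
import tensorflow as tf

def leaky_relu_numpy(x, alpha=0.2):
    # leaks a fraction alpha of the signal in the negative region instead of zeroing it
    return np.where(x >= 0, x, alpha * x)

print(leaky_relu_numpy(np.array([-3.0, -1.0, 2.0])))    # [-0.6 -0.2  2. ]

layer = tf.keras.layers.LeakyReLU(alpha=0.2)
print(layer(tf.constant([-3.0, -1.0, 2.0])).numpy())    # same values as a Keras layer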
What is exploding gradient problem?
Exploding gradient means literally that. When we use ReLU, the gradient is constant and the ReLU values keep increasing in the positive part of x. In bigger and deeper neural networks this leads to the weights building up to very high values, which sends the calculations to NaN or out of proper bounds. Solving this has been a very important issue in neural networks over the last decade.
The solutions to the exploding gradient problem have been weight regularization, gradient clipping and norm-capping type ideas. Weight regularization and norm capping add weight-related norms to the neural network cost function, thereby keeping the weights of the model in check; gradient clipping simply caps the size of the gradients during training. While these may be partially working solutions, people have come up with more variations of ReLU to solve the problem at the source.
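In tensorflow these ideas are only a keyword argument away; a small sketch (the specific numbers are arbitrary, just for illustration):

import tensorflow as tf

# gradient clipping via the optimizer: clipnorm rescales each gradient so its
# L2 norm never exceeds 1.0 (clipvalue would instead cap each element)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# weight regularization: an L2 penalty on this layer's weights is added to the loss
dense = tf.keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)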
The complicated brothers: ELU, SELU, GELU and others:
ELU refers to a function which is linear in the positive part but exponentially saturating in the negative part. The function is a bit more involved:
f(x) = x, for x > 0
f(x) = a(e^x - 1), for x <= 0, where a is a positive constant (often 1)
This function reduces the exploding-gradient effect in the negative part, because the exponential term saturates instead of growing. It also introduces non-linearity back into the negative part.
On the cons side, the positive side still contributes to the exploding gradient issue. Also, the calculation of ELU is a bit more complicated and increases computation time because of the exponential term. And last but not least, leaving the alpha parameter non-trainable leaves ELU, again, sub-optimal at best.
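A minimal numpy sketch of the definition above (tf.keras.activations.elu offers the same thing in tensorflow):

import numpy as np

def elu_numpy(x, alpha=1.0):
    # linear for x > 0, exponentially saturating towards -alpha for x <= 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu_numpy(np.array([-5.0, -1.0, 0.0, 3.0])))   # [~-0.993, ~-0.632, 0., 3.]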
There are a couple of other similar but far more complicated versions of ReLU; SELU and GELU are examples. While SELU came out of a long theorem and discussion, GELU was popularized by models such as GPT-2. Both of these functions avoid vanishing-gradient issues while also trying to address the normalization issue.
I will mention the formula and the plot for GELU; as GPT-2 was the state of the art last year, GELU definitely has some merit. For learning more about SELU, follow this awesome link to learn further.
The commonly used tanh approximation of GELU is:
GELU(x) ≈ 0.5x(1 + tanh(sqrt(2/π)(x + 0.044715x^3)))
As you can see, GELU is a combination of tanh, cubic and linear functions, basically blending a bunch of known activation functions. This function looks like this when plotted:
This function again avoids the vanishing gradient problem and has a moderate gradient, which probably helps with the exploding gradient issue too. As it is a recent activation function, much is yet to be known about it.
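A numpy sketch of the tanh approximation written above (recent tensorflow versions also ship tf.nn.gelu):

import numpy as np

def gelu_tanh_approx(x):
    # tanh-based approximation of GELU: a blend of linear, cubic and tanh terms
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

print(gelu_tanh_approx(np.array([-2.0, 0.0, 2.0])))   # [~-0.045, 0., ~1.954]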
In conclusion:
In this post, we went through a variety of activation functions and described how they look, how they work, and what their pros and cons are. We also mentioned the usual ways to use them from tensorflow, generally via one of the formats tf.keras.activations.function_name, tf.keras.layers.LayerName or tf.nn.function_name. In this digit recognition project we have implemented and tried out different activation functions such as relu, leakyrelu, elu and others. Check out the code and run it on colab to learn more about these.
If you have some exciting observations about these, or want to share results or insights about any of the activation functions we mentioned here, feel free to comment.
Share this article with your peer groups and stay tuned for more content on neural networks, tensorflow and related topics. Thanks for reading!
Other interesting articles:
(1) Introduction to convolutional neural networks
(2) pose-estimation using opencv