
Neural network basic introduction and FAQ

 Introduction:

All of us, when we start out with neural networks, go through pictures of networks, try to understand complex equations, get baffled by the back-propagation equations, and take our time to eventually assimilate what it all stands for. Recently, while mentoring students for our new effort mentorbruh, one of my students asked some pretty interesting yet very basic questions. This post is an effort to write those down so that it can help other students who are going through neural networks for the first time and have the same doubts. But before that, let's quickly brush up on the basics.

Abstract of this article:

In this post, we are going to take small steps in explaining what a neural network is; what the input, output and hidden layers are; and how a node calculates its value. We will also briefly touch on the concepts of bias, activation, the number of hidden layers and related ideas. In the looser second part of this post, we will discuss questions such as why we need a bias term, how many hidden layers we need, and why we need activation functions.

What is neural network?

A neural network is a circuit of interconnected neurons, whether biological or artificial. In artificial neural networks, the neurons are called nodes, and the nodes are connected to each other by edges. While there are many varieties of neural networks, we will stick to basic feed-forward artificial neural networks without any complications. To start with the concept, consider this neural network:

[Figure: a neural network with multiple hidden layers and multi-class classification]

Now, clearly, to the naked eye, there are three important kinds of layers in such a neural network. These are:

(a) input layer (b) output layer and (c) hidden layer

The input layer is where we feed in the features of a data point. If you are, say, training an image classifier, then your input would be made up of all the pixels in the image.
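To make the image example concrete, here is a minimal sketch of turning a picture into an input vector. The 3x3 image and its pixel values are made up for illustration, and scaling to the range [0, 1] is a common convention that I am assuming here, not something the input layer requires.

```python
# A hypothetical 3x3 grayscale image; pixel intensities 0..255 (made-up values).
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]

def flatten_image(img):
    """Turn a 2-D grid of pixels into one flat feature list, scaled to [0, 1]."""
    return [pixel / 255.0 for row in img for pixel in row]

features = flatten_image(image)
print(len(features))  # one input node per pixel
```

Each entry of `features` would feed exactly one node of the input layer.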

The output layer is where the output of the neural network comes out. More often than not, when we are dealing with multi-class classification, the output layer is a collection of probabilities, one per class. In the case of deep neural net regression, the output is numeric instead.
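One common way to get such a collection of probabilities is the softmax function. The post does not name it, so treat this as an assumption on my part: a small sketch of how raw output-node scores (made-up numbers here) can be turned into class probabilities.

```python
import math

def softmax(scores):
    """Convert raw output-node scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(raw_scores)
predicted_class = probs.index(max(probs))  # the class with the highest probability
print(probs, predicted_class)
```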

Now, the most important part and the real gem of neural networks is the hidden layer. The concept of the hidden layer stands against a long trend in traditional machine learning, where you have to create the features from your raw data yourself, i.e. somehow build the actual representation of the data, instead of the algorithm finding it on its own. Hidden layers are what let neural networks do that for you.

What is a hidden layer then?

Hidden layers are layers of nodes (neurons, if you want to call them that) connected to either other hidden layers or the input/output layers. They are called hidden layers because they are separated from both the input and the output; while the algorithm runs, they are therefore "hidden" from the 'outside'. This created great confusion in my mind when I first learnt it, and therefore I will tell you a little secret.

Hidden layers have nothing hidden in them. They are just extra layers of nodes that give the neural network extra room for feature creation.

There, I said it. And yet, it sometimes takes significant time to realize and understand this fact.

Now that we are done with what we can see first, let's move to what is more important in the context of neural networks. 

The more important stuff is how the hidden layers get their values calculated. First of all, you have to understand a small building block here. The nodes of a neural network are nothing but units for calculating and storing values. See the picture below to really get the idea:

Look at the picture. Each node carries a value which is a function of all the incoming edges and nodes connected. Now, that's a lot of information; so let's take it slowly and one at a time.

Each node, in a feed-forward type neural network, receives connections from a number of nodes in the previous layer.

[We use the term feed-forward because, in this most basic structure of neural network, the values are fed from the back (the input layer) forward until they reach the output layer. So feed --> forward is the term. There are other neural network structures (too many, in fact), but let's not discuss them for now.]

So in a node, we take all the inputs, as in the picture, [x1,x2,x3], multiply each by the respective edge's weight, i.e. [w1,w2,w3], and add up the products. A bias term b is then added to this sum, and the result b + x1w1 + x2w2 + ... + xnwn is passed through a function f called the activation function. The output Y = f(b + x1w1 + x2w2 + ... + xnwn) is considered to be the value of the node we are at, i.e. the value this node feeds forward.
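The calculation for one node can be sketched in a few lines of Python. The sigmoid activation and all the numbers below are my own choices for illustration; any other activation function could be passed in instead.

```python
import math

def sigmoid(z):
    """A common non-linear activation; squashes any number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def node_value(inputs, weights, bias, activation=sigmoid):
    """Compute Y = f(b + x1*w1 + x2*w2 + ... + xn*wn) for a single node."""
    z = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(z)

x = [1.0, 2.0, 3.0]    # values of the nodes feeding into this node (made up)
w = [0.5, -0.25, 0.1]  # weights on the incoming edges (made up)
b = 0.2                # bias term (made up)

y = node_value(x, w, b)
print(y)  # the value this node feeds forward
```

Here the weighted sum plus bias works out to 0.2 + 0.5 - 0.5 + 0.3 = 0.5, so the node's value is sigmoid(0.5).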

Now, there are three parts to this calculation. The first is that we take a weighted sum of the values of the connected nodes. The second is the bias term. When we create a weighted sum of the node values, we need to add a constant term too, and the bias term is just that: a constant added to the weighted sum so that the neural net can produce a better approximation or prediction.

The third is the use of an activation function. The activation function is generally a non-linear function, which we introduce into the neural network to bring non-linearity into the system. We will get into the details of activation functions later.
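Putting the three parts together for every node in a layer, and then chaining layers, gives the feed-forward pass. Below is a rough sketch for a tiny network with 3 inputs, 2 hidden nodes and 1 output node. The layer sizes, the weights and the choice of sigmoid are all hypothetical, picked only to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute the values of all nodes in one layer.

    weights[j] holds the incoming edge weights of node j; biases[j] is its bias.
    """
    return [
        sigmoid(b + sum(x * w for x, w in zip(inputs, node_w)))
        for node_w, b in zip(weights, biases)
    ]

# Hypothetical 3-2-1 network: every number below is made up.
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[0.6, -0.3]]
output_b = [0.05]

x = [1.0, 0.5, -1.0]
hidden = layer_forward(x, hidden_w, hidden_b)       # input layer -> hidden layer
output = layer_forward(hidden, output_w, output_b)  # hidden layer -> output layer
print(hidden, output)
```

Notice that the hidden layer's values are just intermediate numbers that become the inputs of the next layer.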

Now that we know each of the elements, let's get to the more serious matters at hand.

Question 1: What is the role of the bias term?

The bias term, i.e. the b in y = f(b + x1w1 + x2w2 + ... + xnwn), is nothing but the constant term we assign to each node's equation. To understand exactly why we need a bias term, we need to realize what we are doing in a neural network: we try to approximate a complex function using a step-by-step composition of simple non-linear functions. In such a process, we need to keep a constant term for each of the simple functions we compose; otherwise there is no way to shift a node's output by a constant. For example, when all the inputs are zero, the weighted sum is zero no matter what the weights are, and only the bias can change that. This is why we need a bias term.
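A tiny experiment shows the point. To keep things simple, I am assuming a node with no activation function at all (a plain linear node); the weights are arbitrary made-up values.

```python
def linear_node(inputs, weights, bias=0.0):
    """Weighted sum plus bias, with no activation (a simplifying assumption)."""
    return bias + sum(x * w for x, w in zip(inputs, weights))

zero_input = [0.0, 0.0]

# Without a bias, no choice of weights can move the output away from 0.
for weights in ([1.0, 2.0], [-5.0, 9.0], [100.0, -3.0]):
    assert linear_node(zero_input, weights) == 0.0

# With a bias, the node can output any constant we need for the same input.
with_bias = linear_node(zero_input, [1.0, 2.0], bias=3.0)
print(with_bias)
```

The weights control how the output reacts to the inputs, while the bias sets the baseline the output starts from.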

Question 2: How many hidden layers do we need?

The hidden layers are the key to the neural network's great achievements. So it is normal to ask how many hidden layers we need when we start training a neural network. But the answer, like many things in machine learning, is "it depends on the problem."

In most cases, if you have a straightforward problem, you will start with only one hidden layer. According to Intro to neural nets by Jeff Heaton, one hidden layer allows a neural network to approximate any function involving “a continuous mapping from one finite space to another,” while with two hidden layers the network is able to “represent an arbitrary decision boundary to arbitrary accuracy.”

While at face value these theoretical statements will not help much, there is a hint of an answer in them: if you need to approximate a function accurately, two hidden layers may be a good starting point.

Now, you may object and ask why and when we use more layers then. That leads to the crux of the matter, which is to assess how complex your function actually is. I will describe three levels of complexity and suggest the number and size of hidden layers for each, based on my intuition and this article.

(a) Your problem is very straightforward: you have a single output and all the inputs directly impact the output. In such a case, one hidden layer is enough to perform the task. If the network has only one output node and the required input–output relationship is fairly straightforward, start with a hidden-layer dimensionality equal to two-thirds of the input dimensionality.

(b) If you have multiple output nodes or you believe that the required input–output relationship is complex, make the hidden-layer dimensionality equal to the input dimensionality plus the output dimensionality (but keep it less than twice the input dimensionality)

(c) If you believe that the required input–output relationship is extremely complex, set the hidden dimensionality to one less than twice the input dimensionality.

So although this is only a rule of thumb, the basic understanding you should carry from this discussion is that the number and size of hidden layers depend on how complex your problem is and how raw your data features are, and finally on trial-and-error experimentation to see at what point the hidden layers start overfitting and up to what point there is still room to fit more.

As you must have guessed, the answer to the question of how many nodes a neural network should have follows the same line: it is experimental and depends on problem complexity.
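The rules of thumb (a) to (c) above can be written down as a small helper. This is only my reading of those rules: the function name, the complexity labels and the rounding are my own choices, and the result is a starting point for experiments, not a final answer.

```python
def suggested_hidden_size(n_inputs, n_outputs, complexity):
    """Suggest a starting hidden-layer size from the rules of thumb (a)-(c).

    complexity is one of "simple", "complex" or "extreme" (hypothetical labels).
    """
    upper = 2 * n_inputs - 1  # one less than twice the input dimensionality
    if complexity == "simple":
        size = round(2 * n_inputs / 3)           # rule (a): two-thirds of the inputs
    elif complexity == "complex":
        size = min(n_inputs + n_outputs, upper)  # rule (b): inputs + outputs, capped
    elif complexity == "extreme":
        size = upper                             # rule (c)
    else:
        raise ValueError("complexity must be 'simple', 'complex' or 'extreme'")
    return max(1, size)

for level in ("simple", "complex", "extreme"):
    print(level, suggested_hidden_size(10, 3, level))
```

For 10 inputs and 3 outputs this suggests 7, 13 and 19 hidden nodes respectively.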

Question 3: Why do we need activation functions?

Activation functions are among the most important parts of a neural network's mathematical construct, and here is why. The activation functions bring non-linearity into the neural network. Without them, a stack of layers would only compose weighted linear sums, and a composition of linear functions is itself just one linear function. This non-linearity, on top of the weighted linear sums, makes neural networks universal approximators in theory, and therefore a strong choice for solving many problems in practice.
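Here is a small check of why the non-linearity matters. Two layers without any activation behave exactly like one layer whose weight matrix is the product of the two. The weights are made up, biases are left out for brevity, and ReLU is used only as one example of a non-linear activation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(vector):
    """A simple non-linear activation: negative values become 0."""
    return [max(0.0, v) for v in vector]

W1 = [[1.0, -2.0], [0.5, 1.0]]  # first layer weights (made up)
W2 = [[2.0, 1.0]]               # second layer weights (made up)
x = [1.0, 1.0]

two_linear = matvec(W2, matvec(W1, x))  # two layers, no activation
one_linear = matvec(matmul(W2, W1), x)  # a single equivalent layer
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear, one_linear, with_relu)
```

The first two results are identical, so the extra linear layer added nothing. With ReLU in between, the result differs, which is exactly the extra expressive power the activation brings.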

In conclusion:

So in this neural network basic intro and FAQ, we learned what neural networks are and what the common parts of a simple artificial neural network (ANN) are. We also resolved some questions that normally stem from first-time exposure to NNs, such as why we need activation functions, why we need a bias term, and how many hidden layers we need. Thanks for reading!

Further readings:

(a) different activation functions and their descriptions

(b) how many hidden layers you need

(c) 7 types of activation functions

(d) A peek into research of activation functions 

(e) Another research paper on impact of activation functions 
