Neural network basic introduction and FAQ

 Introduction:

All of us, when we start out with neural networks, go through pictures of networks, try to understand complex equations, get baffled by the back-propagation equations, and take our time to eventually assimilate what it all stands for. Recently, while mentoring students for our new effort mentorbruh, one of my students asked some pretty interesting yet very basic questions. This post is an effort to write those down so that it can help other students who are going through neural networks for the first time and have the same doubts. But before that, let's quickly brush up on the basics.

Abstract of this article:

In this post, we are going to take small steps in explaining what a neural network is, what the input, output and hidden layers are, and how a node calculates its value. We will also briefly touch on the concepts of bias, activation, the number of hidden layers and the related ideas. In the looser second part of this post, we will discuss questions such as why we need activation functions and which types, how we assign weights to the network's parameters, and so on.

What is a neural network?

A neural network is a circuit of interconnected neurons, whether biological or artificial. In artificial neural networks, the neurons are called nodes, and the nodes are connected to each other by edges. While there are many different varieties of neural networks, we will stick to basic feed-forward artificial neural networks without any complications. To start with the concept, consider this neural network:

neural network with multiple hidden layer and multi-class classification

Now, even at first glance, there are three important kinds of layers in such a neural network. These are:

(a) input layer (b) output layer and (c) hidden layer

The input layer is where we feed in the features of a data point. If you are, say, training an image classifier, then the input would be made up of all the pixels in the image.

The output layer is where the output of the neural network comes out. More often than not, when we are dealing with multi-class classification, the output layer is a collection of probabilities, one per class. In the case of regression with deep neural nets, you may find a numeric output instead.
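To make "a collection of probabilities" concrete, here is a minimal sketch of how raw output-layer values could be turned into class probabilities. It assumes the softmax function, which is a common choice but is not named in this post; the function name `softmax` and the example scores are purely illustrative.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1.

    Assumption: softmax is used here only as one common way to do this.
    """
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

The class with the largest raw score gets the largest probability, and the values always add up to 1, which is what lets us read them as a probability per class.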

Now, the most important part and the real gem of neural networks is the hidden layer. The concept of the hidden layer is a response to a long trend in traditional machine learning, where you have to create the features from your raw data yourself, i.e. somehow hand-craft the actual representation of the data instead of the algorithm finding it out. Hidden layers are what let neural networks do that for you.

What is a hidden layer then?

Hidden layers are layers of nodes (neurons, if you want to call them that) connected to either other hidden layers or the input/output layers. They are called hidden layers because they are separated from both the input and the output, and while the algorithm runs they are therefore "hidden" from the 'outside'. This created great confusion in my mind when I first learnt it, and therefore I will tell you a little secret.

Hidden layers have nothing hidden in them. They are just extra layers of nodes that give the neural network extra room for feature creation.

There, I said it. And yet, it sometimes takes significant time to realize and understand this fact.

Now that we are done with what we can see first, let's move to what is more important in the context of neural networks. 

The more important stuff is how the hidden layers get their values calculated. First of all, you have to understand a small building block here. The nodes of a neural network are nothing but units for calculating and storing values. See the picture below to really get the idea:

Look at the picture. Each node carries a value which is a function of all the incoming edges and the nodes connected through them. Now, that's a lot of information, so let's take it slowly, one piece at a time.

Each node in a feed-forward neural network is connected to a number of nodes from the previous layer.

[We use the term feed-forward because, in this most basic structure of neural network, the values are fed from the back, i.e. the input layer, forward until they reach the output layer. So feed --> forward is the term. There are other neural network structures (too many, in fact), but let's not discuss them for now.]

So in a node, we take all the inputs, as in the picture, [x1,x2,x3], multiply them by the respective edges' weights, i.e. [w1,w2,w3], and sum the products. A bias term b is then added to this sum, and the result b + x1w1 + x2w2 + ... + xnwn is passed through a function called the activation function. The output of that function is considered to be the value of the node we are at, i.e. Y = f(b + x1w1 + x2w2 + ... + xnwn) is the value of the node, or the value this node feeds forward.
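The calculation above can be sketched in a few lines of Python. This is only an illustration: the sigmoid activation, the function names and the example numbers are assumptions chosen for the demo, not something the post prescribes.

```python
import math


def sigmoid(z):
    """One possible activation function f (an assumption for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_value(inputs, weights, bias, activation=sigmoid):
    """Compute Y = f(b + x1*w1 + x2*w2 + ... + xn*wn) for a single node."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one edge weight")
    weighted_sum = bias + sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


# Hypothetical inputs [x1, x2, x3], edge weights [w1, w2, w3] and bias b.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

# The weighted sum here is 0.2 + 0.5 - 0.5 + 0.3 = 0.5, which the sigmoid
# then squashes into a value between 0 and 1.
y = node_value(x, w, b)
print(y)
```

Swapping `sigmoid` for another function changes only the last step; the weighted sum plus bias stays the same for every node.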

Now, there are three parts to this calculation. The first is that we take a weighted sum of the values of the connected nodes. The second is the bias term. While we create a weighted sum out of the node values, we need to add a constant term too, and the bias term is just that: a constant which we add to the weighted sum to get a better approximation or prediction from the neural net.

The third is the use of an activation function. The activation function is generally a non-linear function which we introduce to bring non-linearity into the system. We will get into the details of activation functions later.

Now that we know each of the elements, let's get to the more serious matters at hand.
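Putting the three parts together, a whole feed-forward pass is just the node calculation repeated layer by layer. The sketch below assumes a tiny network with made-up weights, ReLU in the hidden layer and no activation on the output; all names and numbers are hypothetical.

```python
def relu(z):
    """A simple non-linear activation: negative sums become 0."""
    return max(0.0, z)


def identity(z):
    """No activation, e.g. for a numeric (regression) output."""
    return z


def layer_forward(inputs, weights, biases, activation):
    """Compute the value of every node in one layer.

    weights[j] holds the incoming edge weights of node j in this layer.
    """
    return [
        activation(b + sum(x * w for x, w in zip(inputs, row)))
        for row, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Feed values forward from the input layer through each later layer."""
    values = inputs
    for weights, biases, activation in layers:
        values = layer_forward(values, weights, biases, activation)
    return values


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),  # hidden layer
    ([[2.0, 1.0]], [0.5], identity),                 # output layer
]

output = feed_forward([3.0, 1.0], layers)
print(output)
```

The hidden nodes evaluate to 2.0 and 1.0, and the output node then computes 0.5 + 2.0*2.0 + 1.0*1.0, so each layer only ever looks at the values of the layer right before it.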

Question 1: What is the role of the bias term?

The bias term, i.e. the b in y = f(b + x1w1 + x2w2 + ... + xnwn), is nothing but the constant term we assign to each node's equation. To understand exactly why we need a bias term, we need to realize what we are doing in a neural network. We try to approximate a complex function using a step-by-step composition of simple non-linear functions. In such a process, we need to keep a constant term for each of the simple functions we compose; otherwise there is no way of accounting for simple constant offsets in the approximation. Without a bias, for example, a node's weighted sum is always 0 when all its inputs are 0, no matter what the weights are. This is why we need a bias term.
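A tiny experiment shows the point about constant offsets. The sketch assumes a single node with one input and no activation, and a made-up target function y = 2x + 3; both are chosen only for illustration.

```python
def linear_node(x, weight, bias=0.0):
    """A single node with one input and no activation: b + x*w."""
    return bias + x * weight


def target(x):
    """Hypothetical function we want the node to reproduce."""
    return 2.0 * x + 3.0


# Without a bias, the node outputs 0 at x = 0 for every weight we try,
# so it can never match target(0) = 3.
outputs_at_zero = [linear_node(0.0, w) for w in (-5.0, 0.0, 2.0, 10.0)]
print(outputs_at_zero)

# With a bias of 3 and a weight of 2, the node matches the target exactly.
points = [-1.0, 0.0, 1.0, 2.5]
matches = [linear_node(x, 2.0, bias=3.0) == target(x) for x in points]
print(matches)
```

The weight controls the slope, while the bias is the only thing that can move the whole function up or down.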

Question 2: How many hidden layers do we need?

The hidden layers are the key to the neural network's great achievements. So when we start training a neural network, it is natural to ask how many hidden layers we need. But the answer, like many things in machine learning, is "it depends on the problem."

In most cases, if you have a straightforward problem, you will start with only one hidden layer. According to Intro to neural nets by Jeff Heaton, a network with one hidden layer can approximate any function involving “a continuous mapping from one finite space to another,” while a network with 2 hidden layers is able to “represent an arbitrary decision boundary to arbitrary accuracy.”

While at face value these theoretical statements will not help much, there is a hint of an answer there: if you have to approximate a function with proper accuracy, using 2 hidden layers may be a good starting point.

Now, you may object and ask why and when we use more layers then. That leads to the crux of the matter, which is to assess how complex your function actually is. I will describe three levels of complexity and suggest a hidden-layer setup for each, based on my intuition and this article.

(a) Your problem is very straightforward: you have one single output, and all the inputs directly impact the output. In such a case, one hidden layer is enough to perform the task. If the network has only one output node and the required input–output relationship is fairly straightforward, start with a hidden-layer dimensionality (the number of nodes in the hidden layer) equal to two-thirds of the input dimensionality.

(b) If you have multiple output nodes or you believe that the required input–output relationship is complex, make the hidden-layer dimensionality equal to the input dimensionality plus the output dimensionality (but keep it less than twice the input dimensionality).

(c) If you believe that the required input–output relationship is extremely complex, set the hidden dimensionality to one less than twice the input dimensionality.
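The three rules of thumb above can be written down as a small helper. This is only a sketch of a starting point: the function name, the complexity labels and the rounding of two-thirds to the nearest whole node are my own assumptions.

```python
def suggest_hidden_size(n_inputs, n_outputs, complexity):
    """Suggest a starting number of hidden nodes from rules (a), (b), (c).

    complexity is one of "simple", "complex" or "very_complex"
    (hypothetical labels for the three cases in the list above).
    """
    if complexity == "simple":
        # (a) two-thirds of the input dimensionality
        size = round(2 * n_inputs / 3)
    elif complexity == "complex":
        # (b) inputs plus outputs, kept below twice the input dimensionality
        size = min(n_inputs + n_outputs, 2 * n_inputs - 1)
    elif complexity == "very_complex":
        # (c) one less than twice the input dimensionality
        size = 2 * n_inputs - 1
    else:
        raise ValueError("unknown complexity level: " + repr(complexity))
    return max(1, size)  # a hidden layer needs at least one node


# Hypothetical problem with 12 input features.
print(suggest_hidden_size(12, 1, "simple"))
print(suggest_hidden_size(12, 3, "complex"))
print(suggest_hidden_size(12, 3, "very_complex"))
```

Treat the result as the first value to try, then adjust it up or down based on how the network fits.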

So although this is a rule of thumb, the basic understanding you should carry away from this discussion is that the number of hidden layers depends on how complex your problem is and how raw your data features are. Finally, it comes down to trial-and-error experimentation to see at what point the hidden layers start overfitting and up to what point there is still room to fit more.

As you must have guessed, the answer to the question of how many nodes there should be in a neural network follows the same line: it is experimental and depends on the complexity of the problem.

Question 3: Why do we need activation functions?

Activation functions are the most important part of a neural network's mathematical construct, and here is why. The activation functions bring non-linearity into the neural network. Without them, a stack of layers would only compute weighted linear sums of weighted linear sums, which is still just one linear function of the input. The non-linearity, on top of the weighted linear sums, is what makes neural networks universal approximators in theory, and therefore such effective algorithms for problem solving in practice.
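A small sketch makes the role of the non-linearity visible. It assumes a toy network with one input, one hidden node and one output node, with made-up weights, and compares no activation against ReLU in the hidden node.

```python
def relu(z):
    """A common non-linear activation (chosen here only as an example)."""
    return max(0.0, z)


def identity(z):
    """'No activation': the weighted sum is passed on unchanged."""
    return z


def two_layer(x, activation):
    """Tiny network: one hidden node (w=1, b=0), one output node (w=2, b=1)."""
    hidden = activation(1.0 * x + 0.0)
    return 2.0 * hidden + 1.0


def collapsed(x):
    """The single linear function the network reduces to without activation."""
    return 2.0 * x + 1.0


points = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Without a non-linearity, two layers behave exactly like one linear layer.
linear_outputs = [two_layer(x, identity) for x in points]
print(linear_outputs == [collapsed(x) for x in points])

# With ReLU the output bends at x = 0, which no single straight line can do.
relu_outputs = [two_layer(x, relu) for x in points]
print(relu_outputs)
```

However many layers we stack, the version without an activation stays a straight line; it is the bend introduced by the activation that lets more layers represent more complex functions.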

In conclusion:

So in this neural network basic intro and FAQ, we learned what a neural network is and what the common parts of a simple artificial neural network (ANN) are. We also resolved some queries which normally stem from first-time exposure to NNs, such as why we need activation functions, what the need for a bias term is, and how many hidden layers we need. Thanks for reading!

Further readings:

(a) different activation functions and their descriptions

(b) how many hidden layers you need

(c) 7 types of activation functions

(d) A peek into research of activation functions 

(e) Another research paper on impact of activation functions 
