
A Complete Guide to Understanding the Random Forest Model


Introduction:

Random forest is one of the most widely used models in classical machine learning. Its robustness to noisy data and its ability to learn irregular patterns make it a strong candidate for modeling in many fields, such as genomics. In this post, we will discuss:

  1. Definition of a random forest
  2. What is the meaning of bagging?
  3. What do you mean by an ensemble model?
  4. What is a decision tree?
  5. The mathematics behind decision trees
  6. Training a random forest
  7. Random forest packages
  8. Hyperparameter tuning for random forest
  9. How many trees should there be in a random forest?
  10. Further study links

Definition of a random forest

"Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest."
--Leo Breiman

A random forest is an ensemble model consisting of a number of decision trees, and it can be used for both regression and classification. To train a random forest, one trains several decision trees, each on a bagged sample of the training data with a randomly chosen subset of the features; during prediction, the results of the decision trees are averaged, or, in classification, put to a majority vote. The training process therefore consists of the following steps:

  1. Choose a bagged sample and a random subset of the features from the training dataset.
  2. Using the bagged sample and the feature subset, train a decision tree.
  3. Repeat this to create a number of decision trees; this number is usually referred to as the number of estimators.
During prediction, the steps are:
  1. Predict the sample with every tree.
  2. Take the average of the predictions (or the majority vote, for classification) as the prediction of the random forest.
A minimal code sketch of this train/predict cycle is given below; after that, we will go through the theory behind the random forest one piece at a time.
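The sketch assumes scikit-learn's DecisionTreeClassifier as the base tree; it is only an illustration of the steps above, not how the library implements its forest (in particular, the real RandomForestClassifier randomizes features at every split rather than once per tree, and is far more efficient).

```python
# Minimal sketch of the random forest train/predict cycle described above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_sub = max(1, int(np.sqrt(n_features)))          # feature subset size per tree
    forest = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n_samples, n_samples)  # bagged (bootstrap) sample
        cols = rng.choice(n_features, n_sub, replace=False)
        tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    # collect every tree's prediction, then take a majority vote per sample
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
```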

What is the meaning of bagging?

Bagging stands for bootstrap aggregating. Bootstrapping means repeatedly sampling with replacement from the available data to create new samples, usually of the same size as the original. Bootstrap aggregation then refers to training a number of unstable models, one on each bootstrapped sample, and aggregating their predictions.
In the case of the random forest model, each decision tree on its own is an unstable, high-variance model. Therefore, we use bagging to train the different decision trees and then aggregate their results at prediction time, as described in the previous section.
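As a quick, self-contained illustration of bootstrapping (the data values and the number of resamples below are arbitrary choices for the example):

```python
# Illustration of bootstrapping: resampling the same data with replacement.
import numpy as np

rng = np.random.default_rng(42)
data = np.array([2.3, 5.1, 3.8, 4.4, 6.0, 2.9])

# Each bootstrap sample has the same size as the original but may repeat points.
bootstrap_means = [rng.choice(data, size=len(data), replace=True).mean()
                   for _ in range(1000)]
print("original mean:", data.mean())
print("bootstrap mean of means:", np.mean(bootstrap_means))
```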
To learn more about bootstrapping and its applications, follow this Wikipedia link. We will now define what we mean by an ensemble model, as it is one of the important aspects of a random forest.

What do you mean by an ensemble model?

In machine learning, the final goal of training models is to reach good predictions, whatever metric "good" is measured in. To achieve that, we often take weak predictors, algorithms that cannot reach the desired accuracy on their own, and create multiple versions of that model, in series or in parallel; finally, the results from all of these models are combined to predict the target with much better accuracy and stability. This process of creating and combining numerous models, generated essentially from the same algorithm but from different manipulated or resampled versions of the data, is called ensembling; the final predictor built from all the model results is called an ensemble model.

In our case, the random forest is created by ensembling a large number of decision trees, which is why it is called an ensemble model. Another very famous ensemble model is XGBoost, which is beyond the scope of this discussion. Clearly, training a random forest is mostly a matter of training decision trees, so we will discuss the decision tree in the next section.

What is a decision tree?

A decision tree is a rule-based model with a tree-like structure: a data point enters at the root and is tested at each node, generally against some kind of cut-off on a variable; after finitely many steps it reaches a leaf node, which indicates a class in classification or a value in regression.

The partitioning of the data at each node is performed using metrics such as Gini impurity or entropy. The goal of a decision tree is to partition the data into homogeneous nodes, so each node searches for the cut that decreases the impurity of the child nodes the most and applies it as the test. Finally, in classification the leaf nodes ideally each contain a single class, while in regression a leaf node's prediction is the average of the target values that fall into it. Now, to get into greater detail, let's jump into the algorithm behind it.
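To make the splitting criterion concrete, here is a small illustration (my own sketch, not code from any library or paper) that computes the Gini impurity and scans one numeric feature for the cut-off giving the largest impurity decrease:

```python
# Sketch: Gini impurity and an exhaustive threshold search on one feature.
import numpy as np

def gini(labels):
    # Gini impurity = 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(feature, labels):
    best_gain, best_t = 0.0, None
    parent = gini(labels)
    for t in np.unique(feature)[:-1]:          # candidate cut-offs
        left, right = labels[feature <= t], labels[feature > t]
        w_left, w_right = len(left) / len(labels), len(right) / len(labels)
        gain = parent - (w_left * gini(left) + w_right * gini(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_threshold(x, y))   # -> (3.0, 0.5): a cut at 3.0 removes all impurity
```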

The algorithm behind the decision tree (overview)

To understand the decision tree algorithm, I went back to where it all started. Yes, you guessed it: I read the original paper on the decision tree. If you want to read the whole paper on your own, you can follow the link here.

In this paper, the author discusses the ID3 method, an iterative algorithm for training a decision tree. ID3 starts with a "window", a small part of the training data, and trains a decision tree on it. The trained tree then predicts the whole training set. If it correctly predicts every observation, training terminates; otherwise, the incorrectly predicted samples are added to the window and the tree is retrained. This process continues until the decision tree predicts the training data correctly.
The author mentions that the ID3 method usually finishes well before the window grows to the whole dataset, and is therefore generally a better approach than training a decision tree on the whole dataset at once. But it cannot be mathematically guaranteed that the procedure converges to a final tree unless one takes a window the size of the whole training dataset.
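A rough sketch of the window idea follows, with scikit-learn's DecisionTreeClassifier standing in for the original ID3 tree builder; note that this simplified version adds every misclassified sample back into the window, which may differ in detail from the procedure in the paper.

```python
# Sketch of the ID3 "window" idea: train on a subset, add the misclassified
# points back in, and repeat until the tree predicts the whole set correctly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def id3_window_fit(X, y, window_size=50, seed=0):
    rng = np.random.default_rng(seed)
    window = rng.choice(len(X), size=min(window_size, len(X)), replace=False)
    while True:
        tree = DecisionTreeClassifier().fit(X[window], y[window])
        wrong = np.flatnonzero(tree.predict(X) != y)
        if len(wrong) == 0:                  # whole training set predicted correctly
            return tree
        window = np.union1d(window, wrong)   # grow the window with the misses
        if len(window) == len(X):            # fall back to one fit on everything
            return DecisionTreeClassifier().fit(X, y)
```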

Now we will concentrate on how a node splits. In the paper under discussion, the author vividly describes the information gain procedure. For each node split, this procedure measures the increase in information contained in the data, using the entropy formula from information theory. For each available attribute, one computes the corresponding information gain, and the attribute with the highest gain is chosen. The process continues at each node split until we reach homogeneous splits, which become the leaf nodes.
I will provide snippets of the original paper, which explain the above gist in greater detail, with all the formulas and a worked example.
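Until you get to those snippets, here is the same entropy-based information gain computation written out as a small, self-contained example (my own code, not Quinlan's):

```python
# Entropy and information gain for one candidate attribute split.
import numpy as np

def entropy(labels):
    # H = -sum(p_k * log2(p_k)) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # gain = H(parent) - weighted average of H(child) over the attribute's values
    gain = entropy(labels)
    for v in np.unique(attribute_values):
        child = labels[attribute_values == v]
        gain -= (len(child) / len(labels)) * entropy(child)
    return gain

# Toy example: an attribute that separates the classes perfectly has an
# information gain equal to the parent entropy (here 1 bit).
y = np.array([0, 0, 1, 1])
a = np.array(["yes", "yes", "no", "no"])
print(information_gain(y, a))   # -> 1.0
```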

In the paper, the author, J. R. Quinlan, also focuses on building a decision tree when significant noise is present. Based on a large amount of experimentation, he concludes that a noise level of up to 5% in a single feature degrades the performance of the system by no more than 2%, while a 5% noise level in all the attributes can lead to a degradation of up to 12%. These are empirical results, so the exact numbers may be somewhat higher or lower when you conduct your own experiments.

Training for noisy data:
The problem is that if a decision tree fits the noise in the data, it becomes quite inaccurate on other, general data, and its structure becomes quite complex. That is why the author suggests using the chi-square test to check whether an attribute is relevant or irrelevant. First, consider the probabilities of each class in the sample without using the attribute; then consider what those probabilities become after splitting on the attribute.
We then run a chi-square test on these two distributions, with the null hypothesis that the samples come from the same distribution, in which case the split is not worth using; if the null hypothesis is rejected, the class probabilities do change, and we should keep the split.
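A sketch of such a relevance check using scipy's chi-square test of independence on the class counts in each branch of a candidate split; the exact test set-up in the paper may differ, and the counts below are made up for illustration:

```python
# Sketch: chi-square relevance check for a candidate split.
# Rows = branches of the split, columns = class counts in each branch.
# Under the null hypothesis the class distribution is the same in every branch,
# i.e. the attribute is irrelevant and the split is not worth keeping.
from scipy.stats import chi2_contingency

branch_class_counts = [[30, 10],    # branch "attribute = A": 30 of class 0, 10 of class 1
                       [12, 28]]    # branch "attribute = B": 12 of class 0, 28 of class 1

chi2, p_value, dof, expected = chi2_contingency(branch_class_counts)
if p_value < 0.05:
    print(f"p = {p_value:.4f}: reject the null hypothesis, keep the split")
else:
    print(f"p = {p_value:.4f}: the attribute looks irrelevant, drop the split")
```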
Later on, information gain was replaced by a different metric named the gain ratio, which solves some problems of the earlier metric. To summarize, it should now be clear how a decision tree works. So let's dive into the details of the ensemble part of a random forest.

Training a Random forest

Having gone through the early paper on the decision tree, let's follow another paper for this section: the Random Forests paper by Leo Breiman. You can either go through the paper on your own or follow the rest of this section. The paper is mostly mathematical derivation and thus outside the main scope of this blog, but it establishes the ideas behind training a random forest.
In a random forest, we use an ensemble of decision trees, each grown on a random bootstrapped sample of the main training data together with a randomly selected subset of the features. In classification, a vote is taken across the trees and the majority class is used as the prediction; in regression, the predictions of the trees are averaged. There are small variations you can implement, but they do not change the main process much. So the main algorithm to train a random forest is:

  1. Instantiate n decision trees.
  2. For each of them, create a random bootstrap sample and a random feature subset.
  3. Train the decision trees.
  4. At prediction time, aggregate the results (average them, or take the majority vote for classification).
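In practice you rarely write this loop yourself; scikit-learn's RandomForestClassifier (or RandomForestRegressor) runs this kind of procedure for you. A minimal usage example on synthetic data, with illustrative parameter values:

```python
# Off-the-shelf version of the algorithm above, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# bootstrap=True gives the bagged samples; max_features="sqrt" gives the
# random feature subset considered at each split.
rf = RandomForestClassifier(n_estimators=128, max_features="sqrt",
                            bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```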

In the next section, we will talk about hyperparameters and hyperparameter tuning for the random forest.

Hyperparameter tuning for random forest

Here, I will pick the random forest implementation from sklearn and discuss its hyperparameters and how to tune them. Hyperparameters are the settings of a model that are fixed before training: a random forest can have different values of n_estimators, max_features, min_samples_split, max_depth and other parameters like that, and changing them produces different models. Searching across the possible parameter combinations to find the model that predicts best is called hyperparameter tuning.
For hyperparameter tuning, one can use sklearn's GridSearchCV or RandomizedSearchCV. These utilities search through different combinations of parameters and report the accuracy metrics, so one can choose the best parameters from the reports.
For a random forest, the parameters to tune are: (1) number of estimators, (2) max_depth, (3) min_samples_split, (4) max_features. As the number of estimators increases, the error of the random forest slowly decreases; we will discuss later in more detail why n_estimators has little effect beyond roughly 128 trees. You can try 64, 128 and 256 trees for any tree configuration and observe the difference in performance. Both research and experience suggest that the training error does not decrease much after about 128 trees. I must note here that, theoretically, by continually adding trees you can approach the generalization error and fit the training data almost perfectly, but that is not the goal of model building: your model should learn the patterns in the data and generalize to other data. Therefore, the number of estimators is not really a parameter to tune; rather, fix it according to your sample size, e.g. for a few thousand to tens of thousands of instances keep it around 64, and for a few hundred thousand instances you can safely keep it at 128 and move on to the other parameters.
One important parameter to tune is max_depth. If you search the internet, you will find several different opinions on how, and whether, to tune it. As we discussed in the earlier sections, the underlying training is that of a decision tree; therefore, in theory you need not set a depth at all, as the decision tree is supposed to grow to the proper depth on its own.
But in practice, due to implementation details, data variance and other reasons, models tend to overfit if we don't provide max_depth. As a rule of thumb, never set the depth higher than log2(n_instances), because beyond that the trees can contain more nodes than there are samples and therefore tend to simply memorize all the samples.
Increasing the depth will generally increase the training accuracy of a random forest, so you should try 3 or 4 different depths with a gap of 2 or 3 between them, and then select a depth based on the training accuracy and the difference between training and test accuracy. Ideally, the gap between your training and test accuracy should not be more than 8-10%.
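As a quick sanity check of that rule of thumb (my own arithmetic illustration): with 100,000 training rows, log2(100,000) is roughly 16.6, so depths well below that are the safe region.

```python
# Rule-of-thumb upper bound for max_depth: about log2(number of training instances).
import math

n_instances = 100_000                       # example dataset size
print(math.floor(math.log2(n_instances)))   # -> 16; keep max_depth comfortably below this
```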
min_samples_split is the third important hyperparameter. If you first fix a good number of estimators and tune the depth properly, min_samples_split does not affect performance much; still, you should keep it reasonably high, at least above 5, and closer to 30 if you have abundant data. It denotes the minimum number of data points a node must contain before it can be split. The higher it is, the more reliable your splits are, as there is less chance that rare, statistically insignificant patterns get detected and stored in some of the nodes.
Therefore, you should tune min_samples_split over values roughly between 5 and 30 and keep the best one. There are other hyperparameters that can be tuned, but they are not necessary if you set the above four correctly.
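Putting these recommendations together, here is a sketch of tuning with GridSearchCV on synthetic data; the grid values mirror the rough ranges discussed above and are not definitive choices:

```python
# Sketch: tuning the hyperparameters discussed above with GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [128],               # fixed up front, as argued above
    "max_depth": [6, 9, 12],             # a few depths with a gap of 2-3
    "min_samples_split": [5, 15, 30],    # higher values resist rare, noisy patterns
    "max_features": ["sqrt", 0.5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```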

How many trees should there be in a random forest?

As I hinted in the hyperparameter tuning section, the number of trees should be fixed based on the number of instances before tuning anything else. The reasoning is explained well in the paper "How many trees in a random forest?". I will discuss this paper briefly and thereby address the problem of choosing the number of trees.
In this paper, the authors take a collection of 29 datasets and train random forest models with numbers of trees in powers of 2, from 16 to 4096. In the final results, they find that none of these random forests outperform their own performance at 128 trees, and in some cases 64 trees achieve a similar performance. From this empirical result, the authors argue that 128 is a sufficient number of trees to keep in a random forest.
[Figure: training error vs. number of trees in a random forest; image taken from ResearchGate]
A justification of this observation is that the OOB (out-of-bag) error goes down as the number of trees increases, but the rate of decrease slows down quickly; after a certain number of trees, the random forest stops gaining any real information, while the training error keeps decreasing as the forest starts to overfit. This is why you should fix your number of estimators, i.e. the number of trees, at around 128 for medium to large datasets.
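A quick way to see this plateau on your own data is to track the out-of-bag score while growing the forest; the sketch below uses warm_start=True so that trees are added incrementally instead of retraining from scratch each time:

```python
# Sketch: watch the out-of-bag (OOB) error flatten out as trees are added.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            bootstrap=True, random_state=0)
for n in [16, 32, 64, 128, 256]:
    rf.set_params(n_estimators=n)   # grow the same forest to n trees
    rf.fit(X, y)
    print(f"{n:4d} trees  OOB error = {1 - rf.oob_score_:.4f}")
```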
Please follow the paper linked above if you want to read more about how the research was performed, the heuristics used, and the parent papers.

Further study links

Congratulations if you have reached this far! This was a long post, and you read it all. I hope you follow the links and read more about developments in the random forest model on your own. I will soon upload a practical example of how to work with the random forest model from sklearn, along with a small theoretical description of how it works internally. Until then, here are some further links to read about the random forest model: (1) Breiman's paper on random forest, already mentioned above but worth repeating, and (2) a detailed thesis on random forest, a must-read!
For practical applications, follow this example post on house price prediction using random forest.
