
Random forest modeling on house price prediction data from Kaggle


Introduction:

Hi guys! I have recently started working on my Kaggle profile, so this is the first of many Kaggle project discussions I am going to write on this blog. One of the most popular projects on Kaggle is the house price prediction dataset. This data is often used by beginners like me for learning and demonstrating regression. But my main reason for using it is to demonstrate the usefulness of a random forest model, as a follow-up to my random forest theory post. If you have not worked with or read about random forests before, please read the random forest for beginners post. At that link, I have curated a complete description of random forest theory, the algorithm, tuning details and other specifics.

My stand:

Now, let's begin with the house price prediction data. The full Jupyter notebook can be found at this GitHub link. The achieved score is 0.15411, with which I ranked 2863 on the leaderboard, so, as you can see, it is not a particularly strong result. But the point of this post is not to rank well, but to show how to build a simple random forest model. To rank higher, you would need better feature engineering, among other things.

Prepping the data:

Now that those things are out of the way, let's jump into exploring and prepping the data before we work with the random forest. Like any typical dataset, this data contains some categorical columns and some numerical ones. As categorical data can't be used directly in the model, we will use get_dummies from pandas to create numerical, one-hot encoded versions of them. The code for the exploration and one-hot encoding is in the screenshot below:
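In plain code, that exploration and column-selection step looks roughly like the sketch below (the file path and variable names are my choices here and may differ from the notebook):

import pandas as pd

# Load the Kaggle training data (adjust the path to your setup)
train = pd.read_csv("train.csv")

# Quick exploration: size, column types and missing values
print(train.shape)
print(train.dtypes.value_counts())
print(train.isnull().sum().sort_values(ascending=False).head(10))

# Collect the names of the string (object dtype) columns; checking the element
# type's __name__ attribute, e.g. type(train[col].iloc[0]).__name__ == "str",
# is one way, but comparing the dtype to object selects the same columns
string_columns = [col for col in train.columns if train[col].dtype == object]
print(string_columns)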

Clearly, I have taken all the columns which are of data type object, i.e. string-type columns identified via the "__name__" attribute, and appended those column names to a string-column list. In the next cell, I use get_dummies to create small dataframes from the one-hot encoded vectors and then append them to the main dataframe. Look at the code snippet below for understanding:
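The helper is essentially the following (the function name is just for illustration; the notebook's version may differ in detail):

def append_one_hot(df, column):
    # One-hot encode a single column with get_dummies, producing a small
    # dataframe of indicator columns prefixed with the column name
    dummies = pd.get_dummies(df[column], prefix=column)
    # Append the indicator columns to the main dataframe and drop the original
    df = pd.concat([df, dummies], axis=1)
    return df.drop(columns=[column])

# Apply the helper to every string column collected earlier
for col in string_columns:
    train = append_one_hot(train, col)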

Note how I have written a single function to do this creation-and-appending task for a generic column, so I didn't have to write repetitive code for every column. Also note that some of these columns are actually ordinal in nature, so ideally they should not be one-hot encoded, as that throws away the ordering information. This is a potential starting point for better feature engineering, but as mentioned before, we will not go down that lane for now.

Now, the final part of this small notebook covers the train-test split, importing the necessary libraries, and then training, testing and reporting the scores. The important point here is that since only training data is provided and there is no labeled test data, you need to think about how to validate the model. One way is to hold out part of the training data as a validation set and treat its score as a stand-in for the test score. But another option, specific to random forests, is to use the model's oob_score_. If you don't know what the OOB score is or where it comes from in the algorithm, please refer to the random forest for beginners post.
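In rough code, that final cell looks something like this (SalePrice is the competition's target column; the exact parameter values differ in the notebook):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Separate the features and the target
X = train.drop(columns=["SalePrice", "Id"])
y = train["SalePrice"]

# Simple median imputation for the remaining numeric gaps
# (the notebook may treat missing values differently)
X = X.fillna(X.median())

# Hold out part of the data as a validation set for a sanity check
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# oob_score=True makes the forest compute its out-of-bag R^2 during fit
rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                           n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

print("train R^2:", rf.score(X_train, y_train))
print("OOB R^2:", rf.oob_score_)
print("validation R^2:", rf.score(X_valid, y_valid))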

So I have used the OOB score method here. I did manual hyper-parameter tuning, and the basis of that tuning was to keep the difference between the oob_score_ and the training accuracy as small as possible while pushing the oob_score_ up.
In doing so, I gradually reached the parameters mentioned in the above code snippet. Please note that although I argued in the theoretical post that there is little incremental value in increasing n_estimators, here, since mean squared error was the competition metric, I used a high number of estimators. To clarify, though, the end result was not much different from using 128 trees.
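The manual search is essentially a loop like the one below, where I watch both the OOB score and its gap from the training score (the candidate values are only illustrative, not the ones I finally settled on):

# Try a few settings by hand and compare training vs out-of-bag performance
for n_estimators in [128, 256, 512, 1024]:
    for max_features in ["sqrt", 0.5]:
        model = RandomForestRegressor(n_estimators=n_estimators,
                                      max_features=max_features,
                                      oob_score=True, n_jobs=-1,
                                      random_state=42)
        model.fit(X_train, y_train)
        train_r2 = model.score(X_train, y_train)
        print(n_estimators, max_features,
              "train:", round(train_r2, 4),
              "oob:", round(model.oob_score_, 4),
              "gap:", round(train_r2 - model.oob_score_, 4))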
One other thing we could have done here is to run a grid search for the best parameters and then use those to make the final predictions. You can definitely try that by forking my notebook and using this prediction as a baseline.
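With scikit-learn, such a grid search would look roughly like this (the grid itself is just an example):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [128, 512],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 3, 5],
}

# Cross-validated search; neg_mean_squared_error keeps it in line with
# the squared-error style of metric used in the competition
search = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)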

Finally, for Kaggle submissions, you need to be very careful to follow the sample submission format. In my case, I used the last part of the script, i.e. the following code snippet, to format and create the "vanilla_submission.csv" file, which I finally submitted. Note the use of pd.DataFrame() and the CSV writing technique below. To learn more about writing CSV files in Python, follow the link.
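The submission step boils down to something like this (Id and SalePrice are the columns the sample submission expects; aligning the test columns this way is my own simplification):

# Load the competition's test set and encode it the same way as the training data
test = pd.read_csv("test.csv")
test_ids = test["Id"]
test_X = pd.get_dummies(test.drop(columns=["Id"]))

# Align the test columns with the training features, filling anything missing
test_X = test_X.reindex(columns=X.columns, fill_value=0)
test_X = test_X.fillna(test_X.median())

# Build the dataframe in the sample-submission format and write it out
submission = pd.DataFrame({"Id": test_ids,
                           "SalePrice": rf.predict(test_X)})
submission.to_csv("vanilla_submission.csv", index=False)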

Thanks for reading this post! For more awesome but basic machine learning tutorials, please subscribe to my channel and stay tuned to my blog!
