
Random forest modeling on house price prediction data from Kaggle


Introduction:

Hi guys! I have recently started working on my Kaggle profile, so this is the first of many Kaggle project discussions I am going to write on this blog. One of the most popular beginner projects on Kaggle is the house price prediction dataset, which is commonly used for learning and demonstrating regression. My main reason for picking it, though, is to demonstrate the usefulness of a random forest model, as a follow-up to my random forest theory post. If you have not worked with or read about random forests before, please read the random forest for beginners post first; there I have covered the theory, the algorithm, tuning details and every other specific of random forests.

My stand:

Now, let's begin with the house price prediction data. The full Jupyter notebook can be found at this Github link. My submission scored 0.15411, which placed me at rank 2863 on the leaderboard, so you could say it is not a particularly strong result. But the point of this post is not to rank well; it is to show how to build a simple random forest model. Ranking higher would require better feature engineering, among other things.

Prepping the data:

Now that those things are out of the way, let's jump into exploring and prepping the data before we fit the random forest. Like most real datasets, this one contains a mix of categorical and numerical columns. Since categorical data can't be fed into the model directly, we use get_dummies from pandas to create numerical, one-hot encoded versions of those columns. The code for the exploration and one-hot encoding is shown below:
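Roughly, the exploration step boils down to listing the string-typed (object dtype) columns. Here is a minimal sketch of that step; the file name and the string_columns variable are my own choices, not necessarily what the notebook uses:

```python
import pandas as pd

# Load the Kaggle training data (file name is an assumption; adjust to your setup)
df = pd.read_csv("train.csv")

# Collect the names of every column whose dtype is object, i.e. a string-like column
string_columns = []
for col in df.columns:
    if df[col].dtype == object:
        string_columns.append(col)

print(string_columns)                 # the categorical columns to one-hot encode
print(df[string_columns].nunique())   # how many distinct levels each of them has
```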

Essentially, I have taken all the columns whose data type is object, i.e. the string-type columns, identified by checking each column's dtype, and appended those column names to a list of string columns. In the next cell I use get_dummies to create small dataframes from the one-hot encoded vectors and then keep joining them onto the main dataframe. Look at the code snippet below to see how:
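The per-column encoding and appending can be wrapped in a single reusable helper. The sketch below continues from the previous one and uses illustrative names (add_one_hot is not necessarily what the notebook calls it):

```python
import pandas as pd

def add_one_hot(df, column):
    """One-hot encode a single categorical column and join the dummies
    back onto the dataframe, dropping the original column."""
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    return df.drop(columns=[column])

# df and string_columns as built in the previous sketch
for col in string_columns:
    df = add_one_hot(df, col)
```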

Note how I have created one function to do this creation-and-appending task for a generic column, so I did not have to write repetitive code for every column. Also note that some of these columns are actually ordinal in nature, so ideally they should not be one-hot encoded, since that discards the ordering information. This is a potential starting point for better feature engineering, but as mentioned before, we will not go down that lane for now.

Now, the final part of this small notebook covers the train-test split, the import of the necessary libraries, and the training, testing and score reporting. The important point here is that Kaggle only gives us labelled training data (the provided test set has no target values), so you need to think about how to validate the model. One way is to hold out a part of the training data as a validation set and treat its score as a proxy for the test score. But another thing you can do, specifically with a random forest, is to use its oob_score_. If you don't know what the OOB score is or where it comes from in the algorithm, please refer to the random forest for beginners post.
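A sketch of what that final cell might look like is below. The hyper-parameter values are illustrative, the exact ones I settled on are in the linked notebook, and the simple NaN filling is my own shortcut for the sketch:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Separate the target from the features; Id and SalePrice are the Kaggle column names
X = df.drop(columns=["Id", "SalePrice"])
y = df["SalePrice"]

# Simple gap-filling so the forest can be fitted; the notebook may treat NaNs differently
X = X.fillna(X.median(numeric_only=True))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# oob_score=True makes the forest compute an out-of-bag R^2 after fitting,
# which acts as a built-in validation score without extra hold-out data
model = RandomForestRegressor(
    n_estimators=500,   # illustrative value; I tuned this by hand (see below)
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)
model.fit(X_train, y_train)

print("train R^2:", model.score(X_train, y_train))
print("OOB   R^2:", model.oob_score_)
print("val   R^2:", model.score(X_val, y_val))
```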

So I have used the OOB score method here. I did manual hyper-parameter tuning, and the basis of that tuning was to keep the gap between oob_score_ and the training score as small as possible while pushing the oob_score_ up.
In doing so, I gradually arrived at the final parameters used in the notebook. Please note that although I argued in the theory post that increasing n_estimators adds little incremental value, here the competition is scored on mean squared error, so I went with a high number of estimators. To be clear, though, the end result was not much different from using 128 trees.
One other thing we could have done is to run a grid search over the hyper-parameters, pick the best cross-validated combination, and use that model for the predictions. You can definitely try that by forking my notebook and using this prediction as a baseline.
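If you want to try that, a minimal sketch with scikit-learn's GridSearchCV could look like this; the parameter grid is purely illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# An illustrative grid; in practice you would refine it around the best values
param_grid = {
    "n_estimators": [128, 256, 512],
    "max_features": ["sqrt", 0.5, 1.0],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    scoring="neg_root_mean_squared_error",  # select by cross-validated RMSE
    cv=5,
)
search.fit(X_train, y_train)   # X_train, y_train as in the previous sketch

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_
```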

Finally, for Kaggle submissions you need to be very careful about matching the sample submission format. In my case, the last part of the script formats the predictions and writes the "vanilla_submission.csv" file, which I finally submitted. Note the use of pd.DataFrame() and the CSV-writing step below. To learn more about writing CSV files in Python, follow the link.
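A sketch of that formatting step, assuming the test features were preprocessed the same way as the training features (test_X and test_ids are placeholder names for the prepared test data and its Id column):

```python
import pandas as pd

# model: the fitted forest from the earlier sketch
# test_X: test features preprocessed exactly like the training features
# test_ids: the Id column kept aside from test.csv
test_preds = model.predict(test_X)

submission = pd.DataFrame({
    "Id": test_ids,
    "SalePrice": test_preds,
})
submission.to_csv("vanilla_submission.csv", index=False)  # index=False keeps the two-column format Kaggle expects
```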

Thanks for reading this post! For more awesome but beginner-friendly machine learning tutorials, please subscribe to my channel and stay tuned to my blog!
