
10 most used pandas functions in data science


Before I start:

I am a working data scientist who writes data science code all day, and I use pandas all day. The most interesting thing about pandas is the large variety of optimized functions available to process and manipulate dataframes. So here I will compile a list of the most used and necessary pandas functions, with a small example of how to use each. The goal of this article is therefore to help beginners write code faster, shorter and cleaner.

Things you will learn in this post are:
(1) Introduction
(2) pandas basics
(3) pandas read_csv
(4) slicing data using pandas
(5) pandas drop_duplicates
(6) pandas to_csv
(7) pandas DataFrame creation
(8) pandas fillna
(9) pandas dropna
(10) pandas concat
(11) pandas merge
(12) pandas groupby

Pandas introduction: 

Pandas was written by Wes McKinney, the all-time benevolent dictator for life of the open source project named pandas. The main GitHub resource is the pandas GitHub repository. Pandas is largely written in Cython, a language that blends C and Python. The reason for this choice is that, while Python is very useful and easy to write, it is quite slow for large-scale computation. That is why the pandas backend is written in Cython: it provides the speed of C while keeping the ease of Python. This technical build is an important aspect of pandas, but, all that said, you do not need to understand Cython to use pandas. Before we dive into the functions we use every day in data science, I will start with some dataframe basics.

How to start with pandas:

For starters, you need to install pandas. On Ubuntu, you can open a terminal and type pip3 install pandas; this installs pandas for Python 3. On Windows, you can install Anaconda, which bundles pandas (and Jupyter Notebook) in its package set, so pandas is available inside your Python environment right away.
If you have not installed Python yet, follow how to install python here.

pandas dataframe basics:

You will want to know how to do a few basic things with a dataframe before you use the different types of functions. I will compile a list of such simple usages in this section.

(1) how to find the shape of a dataframe:
To find the shape of a dataframe, you can just use dataframe.shape. This returns a tuple of the form (row_number, column_number).
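For example, a minimal sketch on a small made-up dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})   # toy data, just for illustration
print(df.shape)   # (3, 2): 3 rows, 2 columns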

(2) how to find the description of a dataframe:
Sometimes you start with a pre-existing dataset, which you spend sufficient time reading and understanding before you finally start working with it. In such cases, you will need to understand each and every field of the dataframe. For this purpose, you can use dataframe.describe() and dataframe[column].describe(). For a numeric column, describe gives you the count, mean, standard deviation, minimum, maximum and percentiles; for an object or string type column, it gives the count, the number of unique values, the top occurring value and its frequency.
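A quick sketch, again on a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'marks': [70, 85, 90], 'grade': ['B', 'A', 'A']})   # toy data
print(df.describe())            # count, mean, std, min, percentiles, max for 'marks'
print(df['grade'].describe())   # count, unique, top and freq for the string column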

(3) how to rename your columns:
After taking in raw data, you will often feel like changing the column names. If you want to change all the column names, write the new names in a list, in exactly the order you want them applied to your columns, and then you can write
dataframe.columns = new_column_names_list
and voila! Your column names are changed.
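A small sketch of this, with made-up column names:

import pandas as pd

df = pd.DataFrame({'c0': [1, 2], 'c1': [3, 4]})   # toy data
df.columns = ['student_id', 'marks']              # the list length must match the number of columns
print(df.columns.tolist())                        # ['student_id', 'marks']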


Description of functions:

(1) read_csv

This is the function used to load a csv file into your Python session. The general use of read_csv is as below:
import pandas as pd
dataframe_loaded = pd.read_csv(filepath)
where filepath is either an absolute path or a relative path. An absolute path is a path by which you can reach the file from the root of the filesystem, while a relative path starts from the current working directory of the Python script.

How to avoid error lines in pandas read_csv?

In general, there will often be problematic lines in a csv file which cannot be parsed by read_csv. In this case, you have to set the parameter
error_bad_lines = False (or, in newer pandas versions, on_bad_lines = 'skip', since error_bad_lines has been deprecated). This means the bad lines will be dropped and the rest of the dataframe will be read anyway.
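A minimal sketch, assuming a hypothetical file messy.csv that contains a few malformed rows:

import pandas as pd

# newer pandas versions: skip unparseable lines instead of raising an error
df = pd.read_csv('messy.csv', on_bad_lines='skip')
# older versions: df = pd.read_csv('messy.csv', error_bad_lines=False)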

How to skip rows in pandas read_csv?

There is a parameter called skiprows. Its default value is None; if you know there are some initial lines you need to skip, you can pass skiprows = (number of lines to skip from the top) and it will skip that many lines from the beginning of the file.
This is generally used when a csv contains some descriptive lines at the beginning which therefore have to be ignored.

The same kind of skipping, but from the footer of the file, is done using the skipfooter parameter in the same way.
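A minimal sketch, assuming a hypothetical report.csv with two descriptive lines on top and one summary line at the bottom:

import pandas as pd

df = pd.read_csv('report.csv',
                 skiprows=2,          # ignore the first 2 descriptive lines
                 skipfooter=1,        # ignore the last summary line
                 engine='python')     # skipfooter needs the python parser engine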

How to read high precision floats in pandas read_csv?

Sometimes, when you are training sensitive models that need to be accurate to the point of only machine error being present, you need absolute precision to be maintained while reading the dataframe. read_csv has a float_precision parameter for this; setting it to 'high' or 'round_trip' makes the parser use a slower but more precise float converter than the default.
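A minimal sketch, assuming a hypothetical measurements.csv holding long floating point values:

import pandas as pd

# 'round_trip' uses the slower converter that exactly round-trips text to float
df = pd.read_csv('measurements.csv', float_precision='round_trip')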

For further queries about read_csv, please see the pandas read_csv documentation.

(2) Slicing data: 

One of the main tasks when using a pandas dataframe is being able to slice it. In data science, slicing means creating smaller chunks of a dataframe based on some specific conditions. I will demonstrate how to slice with one condition and with multiple conditions.

How to slice a dataframe?

Let us consider a dataframe which we want to slice, containing columns named column_1, column_2, ..., column_N.
Here we assume that the columns have numeric data types. Then,
(1) equality slicing: consider slicing a dataframe for column_1 being equal to a value val. The format to do this is:
data_sliced = dataframe[dataframe[column_1] == val]

(2) interval slicing: consider slicing a dataframe for column_1 being between lower_val and upper_val. The format to do this is:
data_sliced = dataframe[(dataframe[column_1] > lower_val) & (dataframe[column_1] < upper_val)]

Note how, for multiple conditions, you have to put each condition in parentheses and then join them with the "&" sign, which basically stands for "and". Similarly, to combine conditions in an or form, i.e. either this or that, you have to put a "|" between the conditions.

(3) list based slicing: consider slicing a dataframe for column_1 values belonging to a list of values. The format to do this is:
data_sliced = dataframe[dataframe[column_1].isin(value_list)]
Note that here we have used isin to slice using a list.
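A small sketch putting the three forms together, on a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'column_1': [1, 5, 8, 12], 'column_2': [10, 20, 30, 40]})   # toy data

eq_slice = df[df['column_1'] == 5]                                  # equality slicing
interval_slice = df[(df['column_1'] > 1) & (df['column_1'] < 10)]   # interval slicing
list_slice = df[df['column_1'].isin([5, 12])]                       # list based slicing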

If you are eager to know how slicing works in more advanced ways, please see the
official documentation.

(3) drop_duplicates: 

drop_duplicates is for deleting duplicate rows in a dataframe. In a typical dataframe, there will be cases where you want exactly one copy of each record and no duplicates. In those situations, you have to use drop_duplicates. The use of this function is:
dataframe_deduplicated = dataframe[specific_columns].drop_duplicates()
This function is generally used just like this. If you want to know more details about it, please look at the pandas documentation.
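A small sketch of two useful parameters, subset and keep, on a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'student_id': [1, 1, 2], 'marks': [70, 75, 90]})   # toy data

# keep only the first occurrence of each student_id, even if the other columns differ
deduplicated = df.drop_duplicates(subset=['student_id'], keep='first')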

(4) to_csv:

to_csv is one of the handiest functions when you are processing your data and want to save a dataframe for later use. The general use of to_csv is
dataframe.to_csv(filepath_to_save_dataframe, index = False)
where, if index is set to True, an additional column is written which is nothing but the row index, a count starting from 0. If you don't have any special reason, it is best not to set index to True, as the extra column may disturb your column count and later data processing steps.
to_csv obviously has many more options; to learn them, follow the official pandas docs.
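A small sketch of the save-and-reload round trip, with a hypothetical file name:

import pandas as pd

df = pd.DataFrame({'student_id': [1, 2], 'marks': [70, 90]})   # toy data
df.to_csv('students_processed.csv', index=False)               # no extra index column written

reloaded = pd.read_csv('students_processed.csv')               # comes back with the same columns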

(5) pd.DataFrame:

Many pandas operations are best done on a dataframe. This is why you may want to turn lists of lists, lists of similarly keyed dictionaries, matrices, or multidimensional arrays into a dataframe, for ease of operation.
The general go-to function for this is pd.DataFrame. Let's say we have two lists, student ids and student names, and you want to form a table out of them. The way it works is:

student_id = [1, 2, 3]                    # example lists (made-up values)
student_name = ['Asha', 'Bob', 'Chris']

student_table = pd.DataFrame()
student_table['student_id'] = student_id
student_table['student_name'] = student_name

Also, let's consider that you have a matrix or a 2-dimensional array which you want to turn into a dataframe. In this case, let the matrix (array) be X and put the intended column names in a list column_names. Then the way it works is:
changed_dataframe = pd.DataFrame(X, columns = column_names)

Here, note that you have to use the parameter named columns. In cases where you want to create an empty dataframe with column names, you can use this same parameter to supply the column names.
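A quick sketch of both cases, with made-up values:

import numpy as np
import pandas as pd

X = np.array([[1, 90], [2, 85]])                                     # a 2-dimensional array
changed_dataframe = pd.DataFrame(X, columns=['student_id', 'marks'])

empty_dataframe = pd.DataFrame(columns=['student_id', 'marks'])      # empty frame, columns only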
pandas tricks!

Now, assume that you want to build a dataframe by putting rows together one by one. In this case, the best way to do so is to create a separate dictionary for each row, with the column names as its keys and the corresponding values as its values. Then you create a list containing all these dictionaries.
This list of dictionaries can then be turned into a dataframe using the pd.DataFrame() function again.
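A minimal sketch of this trick, with hypothetical rows:

import pandas as pd

rows = []
rows.append({'student_id': 1, 'student_name': 'Asha'})   # one dictionary per row
rows.append({'student_id': 2, 'student_name': 'Bob'})

student_table = pd.DataFrame(rows)   # the dictionary keys become the column names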

This is pretty much all you will use it for, but if you want to know more specifics, please look here for DataFrame.

(6) dataframe.fillna:

This is a general pre-processing function. A common problem for a dataframe is filling the NA values. NA values can occur due to different problems and can therefore be handled with different tricks. In cases where the NA percentage is within workable limits, fillna is used to fill the NA positions with some suitable value. The general use is:
dataframe_changed = dataframe.fillna(value_to_fill, inplace = False)
Here, consider the parameter inplace. inplace means the original object is changed, i.e. the dataframe itself is modified. If you are short on memory you may want to change the dataframe in place, but generally it is better not to use inplace = True, as it can confuse you into carrying a changed dataset into other operations. So, with proper care, inplace can be used.
Also, its default value is False. So if you call it as
dataframe.fillna(value_to_fill)
it will run, but the original dataframe stays unchanged unless you assign the result back. This is a good thing to keep in mind while using fillna.
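A small sketch on a made-up dataframe with one missing value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'marks': [70.0, np.nan, 90.0]})   # toy data

df_filled = df.fillna(0)        # returns a new dataframe; df itself stays unchanged
# df.fillna(0, inplace=True)    # would modify df directly instead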

If you are interested in knowing more about this, please follow the official doc for fillna.
This is just the tip of the iceberg. But as global warming is coming inevitably, you will soon see more parts of the iceberg float up. So please stay tuned and read more about pandas with me.

(7) dataframe.dropna:

This is another function generally used for data processing. Whenever there is abundant data, or an opportunity to discard wrong or incomplete data points, we generally decide to drop the data points with NA or missing values. For this we use the dropna function. The general use of dropna is:
dataframe = dataframe.dropna(how = 'any')
where, if how is set to 'any', any row with a missing value anywhere is dropped, while if it is set to 'all', only rows with all values missing are dropped.
There is also an option of setting the axis along which to drop. In general we don't do any column-level dropping, but if, say, you want to drop columns with all values missing, you can do it as below:
dataframe = dataframe.dropna(axis = 1, how = 'all')
   
These are the general uses of dropna. If you want more detailed uses of dropna, follow the official pandas doc.

(8) pd.concat:  

Many times you will have a number of dataframes which you want to join, either row-wise or column-wise. In this case, the general format is:
dataframe_concatenated = pd.concat([df1, df2, .., dfn], axis = 0/1)
Now, to clarify: you use axis = 0 when you concat dataframes row-wise. These must be dataframes with the same columns, and the column order should also match; if the columns are in a different order in different dataframes, that can lead to problems. So, if you are not sure about the order, the safest thing to do is:
dataframe_concatenated = pd.concat([df1, df2[list(df1.columns)], .., dfn[list(df1.columns)]], axis = 0)

Now, axis = 1 is used for column-wise joining, i.e. if two or more dataframes contain complementary features for the same data points, you can concat them to combine the features. This works similarly to the previous case, but you set axis = 1.
But here is a big caution: in such cases the order of the data points is not always the same, nor is the number of data points in the different dataframes always equal. That is why it is often best not to use concat for column-level concatenation, but to use the pandas join/merge functions instead.
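A small sketch of both directions, with made-up dataframes:

import pandas as pd

df1 = pd.DataFrame({'student_id': [1, 2], 'marks': [70, 90]})   # toy data
df2 = pd.DataFrame({'student_id': [3], 'marks': [85]})
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)      # row-wise: same columns

extra = pd.DataFrame({'grade': ['B', 'A']})
side_by_side = pd.concat([df1, extra], axis=1)                  # column-wise: rows must line up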
To know more details about concat, follow here.

(9) pd.merge:

Sometimes you will have two dataframes which you want to merge based on some key columns. In such cases the pandas merge function comes into action. merge, and also pandas join (a similar function, so I won't discuss it here), are analogous to an SQL table join. The general use of pandas merge is:

data_merged = pd.merge(data_left, data_right, on = [col1, col2, ..., col_n])

But there are a few parameters to note when you use pandas merge. The first of them is "how". how refers to the type of join you are going to do, i.e. left, right, inner or outer; this is similar to the left, right and inner joins in SQL. If you do not have any idea what a table join is, please refer here.
The other ones are left_on and right_on.
left_on and right_on are used when the key columns, based on which we do the matches, do not have the same name in both tables. In that case, you provide the column names using left_on for the left dataframe and right_on for the right dataframe.

One caution: when you merge on differently named keys this way, both key columns get added to the new merged dataframe. While that should not create problems in general, if you are going to train a model on the result, you should not keep duplicate columns.
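A small sketch, assuming two made-up tables whose key columns are named differently:

import pandas as pd

students = pd.DataFrame({'student_id': [1, 2], 'student_name': ['Asha', 'Bob']})   # toy data
marks = pd.DataFrame({'id': [1, 2], 'marks': [70, 90]})

merged = pd.merge(students, marks,
                  how='left',             # keep every row of the left dataframe
                  left_on='student_id',   # key column in the left dataframe
                  right_on='id')          # key column in the right dataframe
merged = merged.drop(columns=['id'])      # both key columns come back; drop the duplicate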

Similar other functions are DataFrame.join and DataFrame.merge. I will not discuss these functions, as I personally prefer pd.merge among them, and the idea is directly the same as the SQL join in each case.

(10) dataframe.groupby:

groupby is a pandas method to group data based on some columns. This is one of the main operations in plain SQL as well. In general, you will have data at the most granular level and you will want to roll it up to a coarser level; e.g. you may have university-level data on students and marks, and then you may want to group the students by class, club, year or something else. To group data with pandas, you generally use groupby in the following way:
data_grouped = dataframe.groupby(by = [col_1, col_2, ..., col_n]).aggregate_func()
In this case the aggregate function is necessary; groupby by itself only creates a grouped object, and nothing is actually computed until you apply an aggregation such as sum(), mean() or count(). Also note that you should not leave non-numeric columns in the frame when applying a numeric aggregation, as the aggregate function will not work on them and can throw an error.
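A minimal sketch, on a made-up student table:

import pandas as pd

df = pd.DataFrame({'class': ['A', 'A', 'B'], 'marks': [70, 90, 85]})   # toy data

# average marks per class; the aggregation is what actually computes the roll-up
data_grouped = df.groupby(by=['class'])['marks'].mean().reset_index()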
For more use, please follow the official docs.


