Skip to main content

pandas groupby functions usage and examples

 Introduction:

pandas groupby function images

 

Pandas is one of the most basic data processing libraries data enthusiasts learn and use frequently. We have discussed 10 most basic functions to know from pandas in a previous post. Now, although I have known and used groupby for quite a bit of time now, there are a lot of tricky things and actions around the groupby functions we need to learn, so that one can utilize groupby functions most.

The basics:

Now, if you are new to pandas, let's gloss over the pandas groupby basics first. groupby() is a method to group the data with respect to one or more columns and aggregate some other columns based on that. The normal syntax of using groupby is:
pandas.DataFrame.groupby(columns).aggregate_functions()

For example, you have a credit card transaction data for customers, each transaction for each day. Now, you want to know how much transaction is being done on a day level. Then in such a case, to know the transaction on a day level, you will want to group the data at a day level and then sum up the transactions.

Let's say our hypothetical dataset has the columns customer_id, date, and transaction_value. Now, to see the transaction on a daily value, we have to groupby using customer_id, date and sum the transaction_value. i.e. 

data = data.groupby(['customer_id','date']).sum('transaction_value')

will provide the transaction value summed at a day level. Now that you know how groupby works normally let's see what are the different functions we can use in a groupby, and how each of them work.

aggregate functions:

On using groupby, you can apply a number of different aggregate functions on the column or columns on which you can apply them. Few of them are:

(1) mean(): take the average of all the values

(2) sum(): take the sum of all the values

(3) first(): take the first entry of all the values

(4) last(): take the last entry of all the values

Here is a more comprehensive list of all the groupby aggregate terms; which you can read about. Such aggregate functions follow the simple syntax which is:

dataframe.groupby([columns_to_group_on]).aggregate_function()

here, in place of aggregate_function we can use the aggregate function like mean, sum, first and others. Now, once you do this, the columns you use to group becomes an index. In many cases, you will not want that; as you can't anymore reference those columns from your dataframe as normal columns. This can be solved very easily using reset_index(); as that resets the index as columns. Therefore, the safe syntax is:

df = dataframe.groupby([columns_to_group_on]).aggregate_function().reset_index()

where df is the new, returned grouped dataframe. 

We will discuss more difficult grouping styles in the later part of this blog post tomorrow.

Comments

Popular posts from this blog

Mastering SQL for Data Science: Top SQL Interview Questions by Experience Level

Introduction: SQL (Structured Query Language) is a cornerstone of data manipulation and querying in data science. SQL technical rounds are designed to assess a candidate’s ability to work with databases, retrieve, and manipulate data efficiently. This guide provides a comprehensive list of SQL interview questions segmented by experience level—beginner, intermediate, and experienced. For each level, you'll find key questions designed to evaluate the candidate’s proficiency in SQL and their ability to solve data-related problems. The difficulty increases as the experience level rises, and the final section will guide you on how to prepare effectively for these rounds. Beginner (0-2 Years of Experience) At this stage, candidates are expected to know the basics of SQL, common commands, and elementary data manipulation. What is SQL? Explain its importance in data science. Hint: Think about querying, relational databases, and data manipulation. What is the difference between WHERE ...

Spacy errors and their solutions

 Introduction: There are a bunch of errors in spacy, which never makes sense until you get to the depth of it. In this post, we will analyze the attribute error E046 and why it occurs. (1) AttributeError: [E046] Can't retrieve unregistered extension attribute 'tag_name'. Did you forget to call the set_extension method? Let's first understand what the error means on superficial level. There is a tag_name extension in your code. i.e. from a doc object, probably you are calling doc._.tag_name. But spacy suggests to you that probably you forgot to call the set_extension method. So what to do from here? The problem in hand is that your extension is not created where it should have been created. Now in general this means that your pipeline is incorrect at some level.  So how should you solve it? Look into the pipeline of your spacy language object. Chances are that the pipeline component which creates the extension is not included in the pipeline. To check the pipe eleme...

fundamentals of LLM: A story from history of GPTs to the future

Introduction: So there has been a lot of developments in LLM and I have not gone through any of it. In the coming few parts, I will talk about LLM and its related eco-system that has developed and will try to reach to the cutting or more like bleeding edge. Lets go through the main concepts first. What is LLM? LLM[1] refers to large language models; that refer to mainly deep learning based big transformer models that can perform the natural language understanding and natural language generation tasks much better than the previous versions of the models generated in NLP history. LLM models are generally quite big, in terms of 10-100GBs and they can't fit in even in one machine's ram. So, most LLMs are inferenced using bigger GPU cluster systems and are quite computationally exhaustive. What was the first true LLM? The BERTs Transformers were invented on 2017 by vaswani et al in their revolutionary paper called "attention is all you need". After that we had the BER...