
How to do lemmatization using spacy?

 Introduction:


Lemmatization is the process of reducing a word down to its nearest root word; i.e. removing prefixes and suffixes and replacing the word with the nearest base form, e.g. checking --> check, girls --> girl, etc.

Lemmatization is one of the most important steps in information retrieval as well as natural language processing. Mapping different words with the same root to a single lemma reduces noise and improves the information concentration in both natural language processing and information retrieval exercises.
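As a tiny illustration of that noise reduction, you can compare the number of distinct surface forms in a text with the number of distinct lemmas. This is just a sketch, assuming the "en_core_web_sm" model is installed and using a made-up sentence:

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The girls checked their bags while the girl was checking hers.")

# distinct raw token strings vs. distinct lemmas
surface_forms = {token.text for token in doc}
lemmas = {token.lemma_ for token in doc}

print(len(surface_forms), len(lemmas))

Since "girls"/"girl" and "checked"/"checking" each collapse to a single lemma, the lemma set comes out smaller than the surface-form set, which is exactly the noise reduction we care about.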

Now, this post is not about explaining how lemmatization or stemming works. For that, refer to this comprehensive post about stemming and lemmatization, which should give you a proper idea of what these processes are and how they work.

Now let's talk about spacy. If you don't know what spacy is, start here with an introduction to spacy. spacy is one of the best production-level natural language processing libraries, and it lets you perform different nlp tasks like part-of-speech tagging, dependency parsing, text classification modeling, and many other small and big tasks.

In this post, we will briefly discuss how to perform simple lemmatization using spacy. To lemmatize English or any other language, find and load the pretrained, stable pipeline for your language. For example, in the case of English, you can load the "en_core_web_sm" model. If you get stuck at this step, refer to the official spacy documentation on models.

Now, consider that you are working with English and want to perform lemmatization. In that case, your code will follow this template:

The code for spacy lemmatization:

import spacy

# load the small english pipeline
# (run "python -m spacy download en_core_web_sm" first if it is not installed)
nlp = spacy.load("en_core_web_sm")

text = "I went to the bank today for checking my bank balance."

doc = nlp(text)

text_lemmatized_list = []

for token in doc:
    if token.lemma_ != "-PRON-":
        text_lemmatized_list.append(token.lemma_)
    else:
        # keep the pronoun's original text instead of the '-PRON-' placeholder;
        # note that we append token.text, not the Token object itself,
        # so that ' '.join below receives strings
        text_lemmatized_list.append(token.text)

text_lemmatized = ' '.join(text_lemmatized_list)

print("the lemmatized version is:", text_lemmatized)

Now, first things first: observe that once nlp is called on the text, the lemmatized version of each token is already computed and stored in the token.lemma_ attribute of the token.
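You can verify this by printing each token next to its lemma. This is a minimal sketch, assuming the "en_core_web_sm" model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I went to the bank today for checking my bank balance.")

# every token already carries its lemma after the pipeline has run
for token in doc:
    print(token.text, "->", token.lemma_)

You should see inflected forms like "went" mapped back to "go", while words that are already in base form pass through unchanged.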

Why does spacy return '-PRON-' on lemmatization, and how do you resolve it?

You are probably also wondering why we check whether the lemmatized version of a token is equal to '-PRON-'. The issue is that, according to the creators of spacy, there is no single correct way to lemmatize a pronoun; i.e. turning 'me' into 'I' just isn't sensible. While one can debate that point, in spacy v2 every pronoun returns '-PRON-' on lemmatization. Therefore, we check for this specific return value, and when it occurs we keep the token's original text instead of token.lemma_, which doesn't make sense in the case of pronouns. (Note that in spacy v3 and later, the '-PRON-' placeholder was removed and pronouns receive ordinary lemmas, so this check is only needed on v2.)

Conclusion and speed discussion: 

In this short, sweet post, we learned how to use spacy lemmatization. Once you read about spacy's data structures and pipelines, you will understand that spacy runs lemmatization very fast along with the other pipeline steps, which enables fast responses in your nlp applications and scripts. Still, there have been enough complaints about spacy's speed in the case of lemmatization.

In such a case, you can follow two simple things mentioned in the git issue linked here. One is to disable some of the components in the nlp pipeline. If you don't need to run the pos tagger, dependency parser, and other components, then disable them from the pipeline while loading. Read more about disabling components from the official link. The simple format for disabling items is:

nlp = spacy.load("en_core_web_sm", disable=['tagger', 'parser', 'ner'])

which leaves you with essentially just tokenization and lemmatization.

Also, another thing to note: if you are processing multiple documents, run nlp.pipe on the list of all the documents. Instead of calling nlp in a loop, nlp.pipe batches the documents internally, which is much faster for a large list. If you really want to nerd out about the speed issues and how people optimized them further in version 2, read this very exciting git issue about spacy lemmatization being 10x slower.
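Batching with nlp.pipe can be sketched as follows. This is just an illustration, assuming the "en_core_web_sm" model is installed; the texts are made-up examples:

import spacy

nlp = spacy.load("en_core_web_sm")

texts = [
    "I went to the bank today.",
    "The girls were checking their results.",
    "He runs faster than the others.",
]

# nlp.pipe streams the documents through the pipeline in batches,
# which is faster than calling nlp(text) once per document in a loop
lemmatized_texts = []
for doc in nlp.pipe(texts):
    lemmatized_texts.append(" ".join(token.lemma_ for token in doc))

for line in lemmatized_texts:
    print(line)

The output order matches the input order, so you can zip the lemmatized strings back against the original texts if you need to keep them paired.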

Thanks for reading! Please ask any questions about spacy, or about topics I have not discussed that you would like me to clarify or talk about.
