How to do lemmatization using spacy?

Introduction:

Lemmatization is the concept of reducing a word to its nearest root word, i.e. removing prefixes and suffixes and replacing the word with the nearest word of that base form: for example, checking --> check, girls --> girl, etc.

Lemmatization is one of the most important steps in information retrieval as well as natural language processing. Reducing different words to the same root word cuts down the noise and improves the information concentration in both natural language processing and information retrieval exercises.

Now, this post is not about explaining how lemmatization or stemming works internally. For that, refer to this comprehensive post about stemming and lemmatization, which should give you a proper idea of what these processes are and how they work.

Now let's talk about spacy. If you don't know what spacy is, start here with an introduction to spacy. spacy is one of the best production-grade natural language processing libraries; it lets you perform NLP tasks such as part-of-speech tagging, dependency parsing, text classification modeling, and many other small and big tasks.

In this post, we will briefly discuss how to perform simple lemmatization using spacy. To use lemmatization in English or any other language, find and load the pretrained, stable pipeline for your language. For example, in the case of English, you can load the "en_core_web_sm" model. If you get stuck at this step, read .
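
If the model fails to load, the most common reason is that it has not been downloaded yet. Assuming spacy was installed with pip, a minimal fix is to download the model from the command line:

python -m spacy download en_core_web_sm

Once the download finishes, spacy.load("en_core_web_sm") should succeed.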

Now, suppose you are working with English and want to perform lemmatization. In that case, your code will follow this template:

The code for spacy lemmatization:

import spacy

# load the small English pipeline (make sure it is downloaded first)
nlp = spacy.load("en_core_web_sm")

text = "I went to the bank today for checking my bank balance."

doc = nlp(text)

text_lemmatized_list = []

for token in doc:
    if token.lemma_ != "-PRON-":
        # store the string lemma of the token
        text_lemmatized_list.append(token.lemma_)
    else:
        # pronouns lemmatize to '-PRON-' in spacy v2, so keep the original
        # text instead; note that join() needs strings, hence token.text
        text_lemmatized_list.append(token.text)

text_lemmatized = ' '.join(text_lemmatized_list)

print("the lemmatized version is:", text_lemmatized)

Now, first things first: observe that once nlp is called on the text, the lemmatized version of each token is already computed and stored in the token.lemma_ attribute (token.lemma, without the underscore, holds the integer hash of the same string).
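
A quick illustrative check (the sentence here is made up, and the exact lemmas depend on the loaded model):

doc = nlp("The girls were checking their phones.")
token = doc[1]            # the token "girls"
print(token.text)         # girls
print(token.lemma_)       # girl  (human-readable string form)
print(token.lemma)        # integer hash of that string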

Why does spacy return '-PRON-' on lemmatization, and how do you resolve it?

You are probably also wondering why we check whether the lemmatized version of a token equals '-PRON-'. The issue is that, according to the creators of spacy, there is no single correct way to lemmatize a pronoun; turning 'me' into 'I' just isn't sensible. While that decision can be debated, in spacy v2 every pronoun returns '-PRON-' on lemmatization. Therefore, we check whether this specific value comes back, and if it does, we keep the token's own text instead of token.lemma_, which doesn't make sense in the case of pronouns. (In spacy v3, the '-PRON-' convention has been removed entirely.)
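
You can see this behavior by running a spacy v2 pipeline on a pronoun-heavy sentence (a made-up example):

doc = nlp("She gave me her book.")
for token in doc:
    print(token.text, "->", token.lemma_)
# with a spacy v2 model, "She", "me" and "her" all print '-PRON-'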

Conclusion and speed discussion: 

In this short and sweet post, we learned how to use spacy lemmatization. Once you read about spacy's data structures and pipelines, you will understand that spacy runs lemmatization very fast along with the other pipeline procedures, which enables fast responses in your NLP applications and scripts. Still, there are enough complaints about spacy's speed in the case of lemmatization.

In such a case, you can follow two simple tips mentioned in the git issue here. One is to disable some of the components in the nlp pipeline. If you don't need the POS tagger, dependency parser, and other components, disable them while loading the model. Read more about disabling components in the official documentation. The simple format for disabling components is:

nlp = spacy.load("en_core_web_sm", disable=['tagger', 'parser', 'ner'])

which effectively leaves you with just tokenization and lemmatization.
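
You can confirm what is left by inspecting the pipeline after loading:

# with tagger, parser and ner disabled, the list of remaining
# components should be empty (or much shorter) for this model
print(nlp.pipe_names)

One caveat: in spacy v2 the English lemmatizer uses part-of-speech information from the tagger, so with the tagger disabled, lemmatization falls back to the lookup tables; faster, but potentially less accurate.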

Also, another thing to note: if you are processing multiple documents, run nlp.pipe on the list of all the documents. Instead of calling nlp on each document in a loop, nlp.pipe streams the texts through the pipeline in batches, which is considerably faster. If you really want to nerd out on the speed issues and how they were optimized in version 2, read this very exciting git issue about spacy lemmatization being 10x slower.
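
A minimal sketch of batching with nlp.pipe (the texts list below is invented for illustration):

texts = [
    "I went to the bank today.",
    "She was checking her balance.",
    "The girls opened new accounts.",
]

lemmatized_docs = []
for doc in nlp.pipe(texts, batch_size=50):
    # reuse the same '-PRON-' guard as above for each token
    lemmas = [t.lemma_ if t.lemma_ != "-PRON-" else t.text for t in doc]
    lemmatized_docs.append(' '.join(lemmas))

print(lemmatized_docs)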

Thanks for reading! Please ask any questions about spacy, or about topics I have not discussed that you would like me to clarify or write about.
