
How to do lemmatization using spacy?

 Introduction:


Lemmatization is the process of reducing a word to its nearest root word, i.e. removing the prefixes and suffixes and replacing the word with the nearest word of that root form: e.g. checking --> check, girls --> girl, etc.

Lemmatization is one of the most important steps in information retrieval as well as natural language processing. Mapping different words with the same root to a single lemma reduces noise and improves the information concentration in both natural language processing and information retrieval exercises.

Now, this post is not about explaining how lemmatization or stemming works internally. For that, refer to this comprehensive post about stemming and lemmatization, which should give you a proper idea of what these processes are and how they work.

Now let's talk about spacy. If you don't know what spacy is, start here with the introduction to spacy. spacy is one of the best production-level natural language processing libraries; it lets you perform different nlp tasks like part-of-speech tagging, dependency parsing, text classification modeling and many other small and big tasks.

In this post, we will briefly discuss how one can perform simple lemmatization using spacy. To use lemmatization in English or another language, find and load the pretrained, stable pipeline for your language. For example, in the case of English, you can load the "en_core_web_sm" model. If you get stuck at this step, the snippet below shows one way to handle it.
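Here is a minimal sketch of that loading step (spacy.cli.download is the programmatic equivalent of running "python -m spacy download en_core_web_sm" on the command line):

import spacy

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # the model is not installed yet: download it once, then load it
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")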

Now, suppose you are working with English and want to perform lemmatization. In that case, your code will follow this template:

The code for spacy lemmatization:

import spacy

# load the small English pipeline (a spacy v2 model here)
nlp = spacy.load("en_core_web_sm")

text = "I went to the bank today for checking my bank balance."

doc = nlp(text)

text_lemmatized_list = []

for token in doc:
    if token.lemma_ != "-PRON-":
        # for ordinary tokens, keep the lemma
        text_lemmatized_list.append(token.lemma_)
    else:
        # pronouns lemmatize to the placeholder '-PRON-' in spacy v2,
        # so keep the original text instead (token.text rather than the
        # Token object itself, so that ' '.join below works)
        text_lemmatized_list.append(token.text)

text_lemmatized = ' '.join(text_lemmatized_list)

print("the lemmatized version is:", text_lemmatized)
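If everything is set up correctly, the script should print something close to the following (the exact lemmas depend on your model version; this assumes a spacy v2 English model):

the lemmatized version is: I go to the bank today for check my bank balance .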

Now, first things first: observe that once nlp is called on the text, the lemmatized version of each token is already created and stored in the token.lemma_ attribute of the token (token.lemma, without the underscore, stores the integer hash of that lemma).

Why does spacy return '-PRON-' on lemmatization, and how do you resolve it?

You are probably also wondering why we check whether the lemmatized version of a token is equal to '-PRON-'. The issue is that, according to the creators of spacy, there is no correct way to lemmatize a pronoun; turning 'me' into 'I' just isn't sensible enough. While there can be debates on this aspect, in spacy v2 every pronoun returns '-PRON-' on lemmatization. Therefore, we check whether this specific value is returned, and if it is, we keep the token's original text instead of token.lemma_, which doesn't make sense in the case of pronouns. (Newer versions of spacy have since removed the '-PRON-' placeholder, so this check matters for v2 models.)
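As a minimal sketch of this behavior (assuming a spacy v2 model; the example sentence is mine), you can print the raw lemmas of a sentence full of pronouns:

import spacy

nlp = spacy.load("en_core_web_sm")

# under spacy v2 models, every pronoun lemmatizes to the placeholder '-PRON-'
for token in nlp("She gave me her book."):
    print(token.text, "->", token.lemma_)

# expected output (spacy v2):
# She -> -PRON-
# gave -> give
# me -> -PRON-
# her -> -PRON-
# book -> book
# . -> .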

Conclusion and speed discussion: 

In this short and sweet post, we learnt how to use spacy for lemmatization. Once you read about spacy's data structures and pipelines, you will understand that spacy runs lemmatization very fast alongside its other procedures, which enables fast responses in your nlp applications and scripts. Still, there are enough complaints about spacy's speed in the case of lemmatization.

In such a case, you can follow two simple suggestions mentioned in the GitHub issue here. One is to disable some of the components in the nlp pipeline. If you don't need to run the pos tagger, dependency parser and other components, then disable them from the pipeline while loading. Read more about disabling components at the official link. The simple format for disabling components is:

nlp = spacy.load("en_core_web_sm", disable=['tagger', 'parser', 'ner'])

which leaves you with little more than tokenization and lemmatization.

Also, another thing to note: if you are processing multiple documents, run nlp.pipe on the list of all the documents. Instead of calling nlp on each document in a loop, running nlp.pipe on the whole list batches the work and is much faster, as sketched below. If you really want to nerd out about the speed issues and how they were further optimized in version 2, read this very exciting GitHub issue about spacy lemmatization being 10x slower.
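Here is a minimal sketch of that batching pattern (the texts and batch size are hypothetical; the tagger is kept enabled here because the spacy v2 lemmatizer uses part-of-speech information):

import spacy

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# hypothetical list of documents; in practice this could be thousands of texts
texts = [
    "I went to the bank today for checking my bank balance.",
    "The girls were checking their phones.",
]

# nlp.pipe streams the texts through the pipeline in batches,
# which is much faster than calling nlp(text) in a python loop
for doc in nlp.pipe(texts, batch_size=50):
    lemmas = [t.lemma_ if t.lemma_ != "-PRON-" else t.text for t in doc]
    print(' '.join(lemmas))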

Thanks for reading! Please ask any questions about spacy, or about topics which I have not discussed or which you would like me to clarify or talk about.
