Introduction:
Lemmatization is the process of reducing a word to its nearest root word (its lemma), i.e. removing prefixes and suffixes and replacing the word with the nearest dictionary word of that form. For example, checking --> check, girls --> girl, etc.
Lemmatization is one of the most important steps in information retrieval as well as natural language processing. Mapping different words with the same root to a single lemma reduces noise and concentrates information in both natural language processing and information retrieval tasks.
This post is not about explaining how lemmatization or stemming works. For that, refer to this comprehensive post about stemming and lemmatization, which should give you a proper idea of what these processes are and how they work.
Now let's talk about spaCy. If you don't know what spaCy is, start here with the introduction to spaCy. spaCy is one of the best production-level natural language processing libraries; it lets you perform different NLP tasks such as part-of-speech tagging, dependency parsing, text classification and many other tasks, small and big.
In this post, we will briefly discuss how to perform simple lemmatization using spaCy. To lemmatize English or another language, find and load the pretrained, stable pipeline for your language. For example, for English you can load the "en_core_web_sm" model. If you get stuck at this step, read .
Now, suppose you are using English and want to perform lemmatization. In that case, your code will follow this template:
The code for spaCy lemmatization:
import spacy

# Load the small pretrained English pipeline.
nlp = spacy.load("en_core_web_sm")
text = "I went to the bank today for checking my bank balance."
doc = nlp(text)

text_lemmatized_list = []
for token in doc:
    if token.lemma_ != "-PRON-":
        text_lemmatized_list.append(token.lemma_)
    else:
        # Keep the original word for pronouns; append the string, not the Token object.
        text_lemmatized_list.append(token.text)

text_lemmatized = ' '.join(text_lemmatized_list)
print("the lemmatized version is:", text_lemmatized)
First things first: observe that once nlp is called on the text, the lemmatized version of each token is already created and stored in the token.lemma_ attribute of that token.
Why does spaCy return '-PRON-' on lemmatization, and how do you resolve it?
You are probably also wondering why we check whether the lemmatized version of a token is equal to '-PRON-'. The reason is that, according to the creators of spaCy, there is no clearly correct lemma for a pronoun; i.e. turning 'me' into 'I' just isn't sensible enough. While this is debatable, in spaCy version 2 every pronoun returns '-PRON-' on lemmatization (version 3 dropped this placeholder, so there the check is simply never triggered). Therefore, we check for this specific return value and, when it appears, keep the token's own text instead of token.lemma_, which doesn't make sense for pronouns.
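The pronoun check can also be pulled out into a small helper that works on plain (text, lemma) pairs, which makes the logic easy to try without loading a model. This is only a minimal sketch: the names PRON_PLACEHOLDER, lemma_or_text and join_lemmas are made up for illustration and are not part of spaCy, and the pairs are hand-written stand-ins for real token values.

```python
# Placeholder lemma that spaCy returns for pronouns, as discussed in this post.
PRON_PLACEHOLDER = "-PRON-"


def lemma_or_text(text, lemma):
    """Return the lemma, or the original text when the lemma is the pronoun placeholder."""
    return text if lemma == PRON_PLACEHOLDER else lemma


def join_lemmas(pairs):
    """Join (text, lemma) pairs into one lemmatized string."""
    return " ".join(lemma_or_text(text, lemma) for text, lemma in pairs)


# Hand-written pairs standing in for (token.text, token.lemma_) values.
pairs = [("I", "-PRON-"), ("went", "go"), ("to", "to"), ("the", "the"), ("bank", "bank")]
print(join_lemmas(pairs))  # I go to the bank
```

With a real document, the pairs would come from something like [(token.text, token.lemma_) for token in doc].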
Conclusion and speed discussion:
In this short and sweet post, we learnt how to use spaCy lemmatization. Once you read about spaCy's data structures and pipelines, you will understand that spaCy runs lemmatization very fast alongside its other procedures, which lets you build fast-responding NLP applications and scripts. Still, there are plenty of complaints about spaCy's speed when it comes to lemmatization.
In such a case, you can follow two simple suggestions mentioned in the git issue here. One is to disable some of the components in the nlp pipeline. If you don't need the POS tagger, dependency parser and other components, disable them in the pipeline while loading. Read more about disabling from the official link. The simple format for disabling components is:
nlp = spacy.load("en_core_web_sm", disable=['tagger', 'parser', 'ner'])
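which really leaves you with just lemmatization. (In spaCy version 3 the rule-based English lemmatizer relies on the part-of-speech output of the tagger, so keep 'tagger' enabled there.)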
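Instead of typing component names by hand, where a misspelled name is easy to miss, the disable list can be derived from the names that the loaded pipeline reports through nlp.pipe_names. The helper below is a hypothetical sketch and not a spaCy function; the component names in the example are illustrative, and which components lemmatization depends on differs between spaCy versions, so check your own pipeline.

```python
def components_to_disable(pipe_names, keep):
    """Return the names in pipe_names that are not listed in keep, preserving order."""
    unknown = [name for name in keep if name not in pipe_names]
    if unknown:
        # Fail loudly on typos instead of silently disabling everything.
        raise ValueError(f"not in pipeline: {unknown}")
    return [name for name in pipe_names if name not in keep]


# Illustrative component names; with a loaded model, pass nlp.pipe_names instead.
example_pipe_names = ["tagger", "parser", "ner"]
print(components_to_disable(example_pipe_names, keep=["tagger"]))  # ['parser', 'ner']
```

After nlp = spacy.load("en_core_web_sm"), the resulting names could be passed on as nlp.disable_pipes(*names).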
Another thing to note: if you are processing multiple documents, run nlp.pipe on the list of all the documents. Instead of calling nlp on each one in a loop, using nlp.pipe on the whole list is more efficient in this case. If you really want to nerd out about what the speed issues were and how people optimized them in version 2, read this very exciting git issue about spaCy lemmatization being 10x slower.
Thanks for reading! Please ask any questions about spaCy, or about topics I have not discussed or that you would like me to clarify or talk about.
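The nlp.pipe suggestion can be sketched roughly as follows. The function name lemmatize_many and the batch size of 50 are arbitrary choices made for this illustration; the function only assumes that the object passed in behaves like a loaded spaCy pipeline, with a pipe method that yields documents whose tokens carry lemma_.

```python
def lemmatize_many(nlp, texts, batch_size=50):
    """Lemmatize many texts in batches with nlp.pipe; returns one string per text."""
    results = []
    # nlp.pipe yields one Doc per input text, in the same order as the input.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append(" ".join(token.lemma_ for token in doc))
    return results


# Usage, assuming spaCy and the en_core_web_sm model are installed:
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(lemmatize_many(nlp, ["The girls were checking.", "I went to the bank."]))
```

This sketch leaves out the '-PRON-' check for brevity; add it inside the join if your spaCy version returns that placeholder.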