
How to download and use different spaCy pipelines

Introduction:

(Photo by CHUTTERSNAP on Unsplash)

spaCy is a modern, industrial-strength and computationally economical NLP library. It ships full pipelines for NLP work and attempts to democratize natural language processing for smaller companies by publishing high-end work as open-source software. For the last two months, I have been reading and using spaCy heavily and blogging about it.

We have seen in detail how to use spaCy in part 1, part 2, part 3 and part 4 of the spaCy series. But since we have worked mostly with one model, we have never dealt with the different pipelines the spaCy library offers. So in this post, I am going to give a short summary of the different spaCy models and the pipelines that come with them. Let's dig in.

How to download spacy models and use them:

All spaCy pipelines are downloaded and used with the following lines:

python -m spacy download <pipeline_name>

import spacy
nlp = spacy.load("<pipeline_name>")

The basics:

spaCy's basic English models are CNN-based and trained on OntoNotes. The models in this category are:

(a) en_core_web_sm

(b) en_core_web_md

(c) en_core_web_lg

Here sm, md and lg stand for small, medium and large, referring to the size of the model objects. Each model object comes with a specific default pipeline and contains the pretrained weights for its components.

If you want a model for tagging, parsing and similar work, but are concerned about the speed and size of the model, then you should use en_core_web_sm. If you rely on similarity scores from spaCy, however, you shouldn't: the small model ships without any pretrained word vectors and instead approximates similarity from other features, i.e. the tagging- and parsing-related values.

So if you want to use similarity scores, it is better to use the large model, or at least the medium one. I have discussed this in detail in my post on calculating similarity.
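For context, when real word vectors are available (md/lg models), the similarity spaCy reports is the cosine of the angle between the two vectors. A minimal sketch of that computation in plain NumPy (the 3-d "word vectors" below are made up for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: what spaCy's .similarity() computes
    when pretrained word vectors are available."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy vectors, purely illustrative
cat = np.array([0.2, 0.9, 0.1])
dog = np.array([0.3, 0.8, 0.2])

print(round(float(cosine_similarity(cat, dog)), 3))
```

A vector compared with itself scores 1.0, and unrelated directions score near 0, which is why the lack of real vectors in the small model makes its similarity numbers unreliable.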

Now, you may say that CNNs are pretty old, and ask what to do if you want to use transformers with spaCy, say BERT models. There are third-party projects built on spaCy, as well as the spaCy nightly version 3.0, which provide exactly that. Let's discuss them in the next section.

The advanced:

(1) A wrapper for the transformers package:

spacy-transformers is a package written by Explosion AI, the creators of spaCy, which wraps the Hugging Face transformers library for use in a spaCy environment. You can read a detailed description of the wrapper in their official blog. I plan to write about it in detail later this month; comment below and I will notify you once I do. Anyway, the way to use it is:

pip install spacy-transformers
python -m spacy download en_trf_bertbaseuncased_lg

import spacy

nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

# context-sensitive similarity between the two occurrences of "Apple"
print(doc[0].similarity(doc[7]))

# the transformer's last hidden state, exposed as a custom attribute
print(doc._.trf_last_hidden_state.shape)

 

(2) spaCy nightly version 3.0:

Now, you may say, that's just a wrapper; you want pipelines built natively on recent (2018-19) models like RoBERTa, ALBERT and others. The good news is that spaCy is about to release version 3.0: the nightly release was launched and announced on October 14, 2020. I am pretty excited about it myself, as the new pipelines are built on transformers, with RoBERTa at the core.

But although that is a promise for the future, it is not easily usable at this point: the nightly release is experimental and therefore can't be used in production right now. You can read about the detailed developments and changes 3.0 is going to bring in their official blog.

As you can see, spaCy has built five more transformer-based pipelines in this version: one for English, and the others for German, Spanish, French and Chinese. The English model is built on RoBERTa. It will be pretty interesting to use these blazing-fast spaCy pipelines once the version is out of nightly mode.

One other kind of model is sometimes pretty useful and deserves a mention when discussing the different types of spaCy pipelines: blank models. If you want to train a model on custom data while using spaCy's structure, you can initiate one with the spacy.blank() method. For example, to initiate a blank English model you just have to write:

nlp = spacy.blank("en")

This is pretty useful when building entity taggers or text classification models on custom data while keeping spaCy's pipeline structure.
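A minimal sketch of what a blank model actually gives you: just a tokenizer for the language, with no trained components in the pipeline (the sample sentence below is arbitrary):

```python
import spacy

# a blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("spaCy makes building custom pipelines easy.")

print([token.text for token in doc])
print(nlp.pipe_names)  # empty list: you add and train components yourself
```

Because nothing is pretrained here, any tagger, parser or entity recognizer you add on top must be trained on your own data.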

Other than these, useful pipelines are generally available under the spaCy project pipelines. Being a spaCy amateur myself, I have obviously not tried most of them out, but some of the promising projects are:

(1) contextual spell correction: corrects spelling from context alone, i.e. spell correction even for OOV (out-of-vocabulary) words, or outright non-words.

(2) PyATE: Python-based automatic term extraction. This project provides an automatic term extraction pipeline built in Python.

(3) pytextrank: this project performs summarization as well as phrase extraction based on the TextRank method, using a spaCy-based structure. Read my detailed post about this project here.

(4) spaczz: a fuzzy matching process built on spaCy. It performs fuzzy string matching within the spaCy framework.

That's all for today! I am going to write more about some of these spaCy-based projects this month; so if you are interested, do comment and I will notify you with a short mail whenever I publish a new post on any of these.

Thanks for reading!
