
How to download and use different spaCy pipelines?

Introduction:


spaCy is a modern, industrial-strength, computationally efficient NLP library. It ships complete pipelines for NLP work and, by publishing high-end research as open-source software, helps democratize natural language processing for smaller companies. For the last 2 months, I have been reading about and using spaCy heavily, and blogging about it.

We have seen in detail how to use spaCy in part 1, part 2, part 3 and part 4 of the spaCy series. But since we have worked mostly with one model, we have never dealt with the different pipelines the spaCy library offers. So in this post, I am going to give a short summary of the different models we can use with spaCy and the pipelines that come with them. Let's dig in.

How to download spaCy models and use them:

All spaCy pipelines are downloaded and used with the following lines (the shell command first, then Python):

python -m spacy download <pipeline_name>

import spacy
nlp = spacy.load("<pipeline_name>")
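For example, here is a minimal, runnable sketch with the small English model discussed below (it assumes you have already run python -m spacy download en_core_web_sm; the example sentence is made up):

import spacy

# load the small English pipeline trained on web text
nlp = spacy.load("en_core_web_sm")

# the components that run on every call to nlp()
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']

doc = nlp("spaCy ships complete NLP pipelines.")
for token in doc:
    print(token.text, token.pos_, token.dep_)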

The basics:

spaCy's basic English models are CNN-based pipelines trained on the OntoNotes corpus. The models in this category are:

(a) en_core_web_sm

(b) en_core_web_md

(c) en_core_web_lg

Here sm, md and lg stand for small, medium and large, referring to the size of the model objects. Each model object comes with a specific default pipeline and contains the pretrained weights for its components.

If you want a model for tagging, parsing and similar tasks, but are also concerned about speed and model size, then you should use the en_core_web_sm model. If you rely on similarity scores computed with spaCy, however, you should not use it: the small model does not ship with pretrained word vectors and instead approximates similarity from other representations, i.e. the context tensors used for tagging and parsing.

So if you want to use similarity scores, it is better to use the large model, or at least the medium one. In the calculating-similarity post, I have discussed this in detail.
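As a quick illustration, here is a minimal sketch with the medium model, which ships with pretrained vectors (the example sentences are made up, and the exact similarity numbers will vary by model version):

import spacy

# the md/lg models include pretrained word vectors
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like cats.")
doc2 = nlp("I like dogs.")

print(doc1[2].has_vector)     # True: "cats" has a pretrained vector
print(doc1.similarity(doc2))  # vector-based similarity, close to 1 for similar texts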

Now, you may say that CNNs are pretty old, and ask what if you want to use transformers, say BERT models, with spaCy. There are both third-party packages built on spaCy and the spaCy nightly version 3.0 that provide this. Let's discuss them in the next section.

The advanced:

(1) A wrapper for the transformers package:

spacy-transformers is a package written by Explosion AI, the creators of spaCy, which provides a wrapper around the Hugging Face transformers library for use in a spaCy environment. You can read a detailed description of the wrapper in their official blog. I plan to write about it in detail later this month; comment below and I will notify you once I write it. Anyway, the way to use it is:

pip install spacy-transformers
python -m spacy download en_trf_bertbaseuncased_lg

import spacy

nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

# compare the two occurrences of "Apple" in context
print(doc[0].similarity(doc[7]))

# the raw transformer activations for the last layer
print(doc._.trf_last_hidden_state.shape)
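A short note on that last line: doc._.trf_last_hidden_state holds the transformer's final-layer activations, so its shape is (number of wordpiece tokens, hidden width); for the BERT base model the hidden width is 768.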

 

(2) spaCy nightly version 3.0:

Now, you may say that is just a wrapper, and that you want something where the real pipelines are built on the advanced 2018-19 models like RoBERTa, ALBERT and others. The good news is that spaCy is launching version 3.0: they published the nightly release and wrote about it on October 14, 2020. I am pretty excited about it myself, as they have built everything on transformers, with the English pipeline based on RoBERTa.

But although that is an assurance for the future, it is not easily usable at this point in time, as the nightly release is experimental and therefore can't be used in production right now. You can read about the detailed developments and changes that 3.0 is going to bring in their official blog.

As you can see, spaCy has built five transformer-based pipelines for this version: one for English, and the others for German, Spanish, French and Chinese. The English model is built on RoBERTa. It will be pretty interesting to use these blazing-fast pipelines once the version is out of nightly mode.
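If you want to experiment anyway, a minimal sketch would look like the following (the package and pipeline names, spacy-nightly and en_core_web_trf, are taken from the v3 announcement; treat them as assumptions that may change until the stable release):

pip install spacy-nightly --pre
python -m spacy download en_core_web_trf

import spacy

# en_core_web_trf is the RoBERTa-based English pipeline in v3
nlp = spacy.load("en_core_web_trf")
doc = nlp("spaCy v3 pipelines are built on transformers.")
print(nlp.pipe_names)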

One other kind of model is pretty useful sometimes and needs a mention when discussing different types of spaCy pipelines: the blank models. If you want to build models on custom data and train them within a spaCy structure, you can initialize a model using the spacy.blank() method. For example, to initialize a blank English model you just have to write:

nlp = spacy.blank("en")

This is pretty useful for building entity taggers or text-classification models along spaCy's pipeline structures but on custom data; a minimal sketch follows.
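Here is a minimal sketch of how a blank model is typically extended before training (the label GADGET is made up for illustration, and the API shown is the spaCy 2.x one):

import spacy

# a blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# add an empty NER component and register a custom label
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("GADGET")

# initialize the weights before training on your own annotated data
nlp.begin_training()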

Other than these, useful pipelines are generally available among the spaCy-based projects. Being a spaCy amateur myself, I have obviously not tried most of them out, but some of the promising projects are:

(1) Contextual spell correction: corrects spelling without actually knowing the correct spelling, i.e. spell correction for OOV (out-of-vocabulary) words, or even non-words.

(2) pyATE: Python-based automatic term extraction. This project aims to create an automatic term-extraction pipeline in Python.

(3) pytextrank: this project performs summarization as well as key-phrase extraction based on the TextRank method, using a spaCy-based structure. Read my post about the details of this project here.

(4) spaczz: a fuzzy-matching package built on spaCy. It performs fuzzy matching on strings within the spaCy framework; see the sketch after this list.
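As a taste of how such a project plugs into spaCy, here is a minimal sketch of spaczz's fuzzy matcher, adapted from its README (the example sentence is illustrative; check the spaczz docs for the exact match-tuple format in your version):

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
doc = nlp("Grint Anderson visited Nashvile.")

# match approximately, tolerating the misspellings above
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matcher.add("GPE", [nlp("Nashville")])

for match in matcher(doc):
    print(match)  # label, start, end, and a fuzzy-match ratio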

That's all for today! I am going to write more about some of these spaCy-based projects this month; so if you are interested, do comment and I will notify you with a short mail whenever I publish a new post on any of these.

Thanks for reading!
