How to download and use different spaCy pipelines

Introduction:


spaCy is a modern, industrial-strength and computationally economical NLP library. It is built around full processing pipelines and, by releasing high-end work as open source software, attempts to democratize natural language processing for smaller companies. For the last 2 months, I have been reading about and using spaCy heavily, and blogging about it.

We have seen in detail how to use spaCy in part1, part2, part3 and part4 of the spaCy series. But since we have worked mostly with one model, we have never dealt with the different pipelines the spaCy library offers. So in this post, I am going to give a short summary of the different models we can use with spaCy and the pipelines that come with them. Let's dig in.

How to download spacy models and use them:

All spaCy pipelines are downloaded from the command line with:

python -m spacy download <pipeline_name>

and then loaded in Python with:

import spacy

nlp = spacy.load("<pipeline_name>")
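As a minimal sketch of these two steps, the helper below builds the download command and wraps spacy.load so that a missing pipeline produces a readable hint. The names download_command and load_pipeline are my own and not part of spaCy; spaCy is imported inside the function so the snippet can be defined even before spaCy is installed.

```python
import sys


def download_command(pipeline_name):
    """Return the download command for a pipeline as an argument list."""
    # sys.executable keeps the download in the same environment as this script
    return [sys.executable, "-m", "spacy", "download", pipeline_name]


def load_pipeline(pipeline_name):
    """Load an installed pipeline, or fail with a hint on how to download it."""
    import spacy  # imported lazily: only needed when a pipeline is loaded

    try:
        return spacy.load(pipeline_name)
    except OSError as err:
        # spacy.load raises OSError when the pipeline package is not installed
        hint = "python -m spacy download " + pipeline_name
        raise OSError(
            "Pipeline %r not found; try: %s" % (pipeline_name, hint)
        ) from err
```

With the pipeline installed, load_pipeline("en_core_web_sm") returns the usual nlp object. The list from download_command can be passed to subprocess.run if you prefer to trigger the download from Python.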

The basics:

spaCy's basic models are CNN-based pipelines trained on OntoNotes. The models in this category are:

(a) en_core_web_sm

(b) en_core_web_md

(c) en_core_web_lg

Here sm, md and lg stand for small, medium and large, referring to the size of the model objects. Each model object comes with a specific default pipeline and contains the pretrained models for the components in that pipeline.
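The naming scheme can be unpacked with a few lines of Python. This is only a sketch: parse_pipeline_name is a hypothetical helper of mine, and it assumes the four-part language_type_genre_size names used by the pipelines listed above.

```python
SIZE_NAMES = {"sm": "small", "md": "medium", "lg": "large"}


def parse_pipeline_name(name):
    """Split a name such as 'en_core_web_sm' into its four parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError("expected a name like 'en_core_web_sm', got %r" % name)
    lang, kind, genre, size = parts
    return {
        "lang": lang,
        "type": kind,
        "genre": genre,
        # Unknown size suffixes are passed through unchanged
        "size": SIZE_NAMES.get(size, size),
    }


for name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    print(name, "->", parse_pipeline_name(name)["size"])
```

Once a pipeline is loaded, nlp.pipe_names lists the components in its default pipeline.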

If you want a model for tagging, parsing and similar tasks, but are also concerned about speed and model size, then you should use the en_core_web_sm model. If you rely on similarity values computed by spaCy, you shouldn't use it: the small model doesn't ship with pretrained word vectors, so it has to estimate similarity from its other values, i.e. the tagging- and parsing-related ones.

So if you want to use similarity values, it is better to use the large model, or at least the medium model. I have discussed this in detail in the calculating similarity post.
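For intuition, spaCy's default similarity is the cosine similarity between vectors. The sketch below reimplements that measure in plain Python and wraps the spaCy call. Both cosine_similarity and compare_texts are illustrative names of my own, and compare_texts assumes you pass in a pipeline that has word vectors, such as en_core_web_md.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat the similarity as 0
        return 0.0
    return dot / (norm_u * norm_v)


def compare_texts(nlp, text_a, text_b):
    """Similarity of two texts using the vectors of a loaded pipeline."""
    doc_a = nlp(text_a)
    doc_b = nlp(text_b)
    return doc_a.similarity(doc_b)
```

With nlp = spacy.load("en_core_web_md"), compare_texts(nlp, "I like cats", "I like dogs") returns a float. Running the same call with the small model is a quick way to see how much the missing vectors change the result.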

Now, you may say that CNNs are pretty old; what if you want to use transformers, say BERT models, with spaCy? Both separate packages built on spaCy and the spaCy 3.0 nightly version provide that. Let's discuss them in the next section.

The advanced:

(1) A wrapper for the transformers package:

spacy-transformers is a package written by Explosion AI, the creators of spaCy, which wraps the Hugging Face transformers library for use in a spaCy environment. You can read a detailed description of the wrapper in their official blog. I hope to write about it in detail later this month; comment below and I will notify you once I do. Anyway, the way to use it is:

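pip install spacy-transformers
python -m spacy download en_trf_bertbaseuncased_lg

import spacy

# Model name as used by spacy-transformers with spaCy 2.x
nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")
# Compare the two "Apple" tokens (index 0 and index 7)
print(doc[0].similarity(doc[7]))
# Shape of the transformer's last hidden state for the whole doc
print(doc._.trf_last_hidden_state.shape)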

(2) spaCy nightly version 3.0:

Now, you may say that's only a wrapper, and you want pipelines that are actually built on advanced 2018-19 models like RoBERTa, ALBERT and others. The good news is that spaCy is launching version 3.0: they have published the nightly release and wrote about it on Oct 14, 2020. I am pretty excited about it myself, as the new pipelines are built on transformers, with RoBERTa behind the English one.

But although that is promising for the future, it is not easily usable at this point, as the nightly release is experimental and therefore shouldn't be used in production right now. You can read about the detailed developments and changes that 3.0 is going to bring on their official blog.

As the announcement shows, spaCy has built 5 transformer-based pipelines in this version: one for English, and the others for German, Spanish, French and Chinese. The English model is built on the transformer model RoBERTa. It will be pretty interesting to use spaCy's blazing fast pipelines once the version is out of nightly mode.

One other kind of model is sometimes very useful and deserves a mention when discussing the different types of spaCy pipelines: blank models. If you want to train models on custom data within the spaCy structure, you can initialize a spaCy model using the spacy.blank() method. For example, to initialize a blank English model you just have to write:

nlp = spacy.blank("en")

This is pretty useful when building entity taggers or text classification models on custom data within the spaCy pipeline structure.
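As a rough sketch of that workflow, the function below creates a blank pipeline and adds an untrained entity recognizer with custom labels. The name blank_ner_pipeline and the PRODUCT label are hypothetical; the version check is there because adding a component works differently in spaCy 2.x and in 3.x.

```python
def blank_ner_pipeline(labels, lang="en"):
    """Create a blank pipeline with an untrained NER component."""
    import spacy  # imported lazily so the function can be defined without spaCy

    nlp = spacy.blank(lang)  # tokenizer only, no trained components
    if spacy.__version__.startswith("2."):
        # spaCy 2.x: create the component first, then add it
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    else:
        # spaCy 3.x: add_pipe takes the component name and returns the component
        ner = nlp.add_pipe("ner")
    for label in labels:
        ner.add_label(label)
    return nlp
```

Calling blank_ner_pipeline(["PRODUCT"]) gives a pipeline whose only component is "ner". That component still has to be trained on your own annotated data before it predicts anything.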

Other than these, useful pipelines are generally available under the spaCy project pipelines. Being a spaCy amateur myself, I have obviously not tried most of them out, but some of the promising projects are:

(1) contextual spell correction: corrects spelling without actually knowing the spelling, i.e. spell correction for OOV (out-of-vocabulary) words or even non-words.

(2) pyATE: Python-based automatic term extraction. This project aims to create an automatic term extraction pipeline in Python.

(3) pytextrank: this project performs summarization as well as phrase extraction based on the TextRank method, using a spaCy-based structure. Read my post about the details of this project here.

(4) spaczz: fuzzy matching built on spaCy. It performs fuzzy matching on strings using the spaCy framework.

That's all for today! I am going to write more about some of these spaCy-based projects this month, so if you are interested, do comment and I will notify you by email whenever I publish a new post on any of them.

Thanks for reading!
