
NLP using spaCy: spaCy exploration part 1

 Introduction:

spaCy is an open-source software library for advanced natural language processing, first released in 2015 by Explosion AI founders Matthew Honnibal and Ines Montani. While NLTK is mainly used for teaching NLP concepts and for research, spaCy is one of the most widely used packages in production at companies worldwide. Before spaCy, the market lacked a great production-level package that people could integrate into their services to get the best NLP capabilities available, and spaCy did exactly that. To quote Honnibal from 2015,
 
"spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology."

spaCy is an industrial-strength library written in Python and Cython, and it interoperates with TensorFlow, PyTorch, MXNet and other deep learning frameworks. In this post, we will explore the different things we can try with spaCy, including named entity recognition. In upcoming parts, we will explore other areas of spaCy; links to those parts are provided below.

Before you proceed, if you want an even better introduction, read it from Honnibal directly.

What can you do with spaCy?

Almost all common NLP tasks can be done using spaCy. The supported tasks are:

(1) Tokenization: splitting sentences into words or word-like tokens.

(2) POS tagging: POS tagging refers to part-of-speech tagging. spaCy's default language pipelines include a POS tagger, and POS tagging is supported in many languages, including English.

(3) Dependency parsing: dependency parsing is one of the more advanced grammatical tasks in NLP; it parses a sentence to decode the structure of dependencies between its words. spaCy not only excels at dependency parsing, it can also create a visualization of the dependency tree.

(4) Lemmatization: lemmatization reduces a word to its base form, i.e. turning "working", "worked" and "workable" into "work". spaCy can perform lemmatization out of the box.

(5) Sentence boundary detection: finding and segmenting the sentences in a paragraph.

(6) Entity recognition and linking: labeling words in a text as real-world objects such as "person", "location" or "organization", as well as linking the entities present in the text against a reference knowledge base.

(7) Text classification: spaCy can be used for text labeling and classification tasks.

(8) Rule-based matching: spaCy can find specific part-of-speech-based patterns. This is also called chunking.

(9) Training: spaCy can also be trained (one of its main objectives) to create custom models on your own data, as well as to create new model objects. spaCy also supports serialization for saving custom trained models.

These are the different tasks spaCy performs very well. In the next section, we will go through the basic spaCy training available on the spaCy website and summarize the brief, important points you need to get well trained in spaCy.

Basic NLP training walk-through:

For NLP basics, there is an amazing multi-chapter course by Ines Montani on the spaCy site. I will summarize the main points you need to start using spaCy and, in later parts, explore different details taught in that course.

(1) How does spaCy process basic text?

spaCy has a language class called English for English, German for German, Spanish for Spanish, and so on. These language objects contain the language-specific tokenization rules and vocabulary needed for basic processing (pre-trained models and weights are loaded separately). So, to do basic processing:
from spacy.lang.en import English   # blank English language class

nlp = English()                     # create the nlp object
text = "I want to learn spacy and nlp."
doc = nlp(text)                     # process the text; doc is a Doc object

Here, English is the language class for processing text in English. If you want to do any kind of work with spaCy in English, you will most probably need this step of importing English and loading an instance into nlp.

Although in practice this nlp(text) line completes in milliseconds even for moderately long texts, the step is actually quite elaborate. The text is tokenized and, when a full pipeline is loaded, POS tagged and NER tagged, and the result is stored in doc. spaCy's highly optimized algorithms are what make this so fast. In part 2, we will look at how these processes work in detail.

For now, let's see some code snippets showing how the same thing works for other languages too.
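
The original code snippet is not reproduced here; below is a minimal sketch of the same idea for German and Spanish (the blank language classes need no pre-trained models, and the sample sentences are just placeholders):

from spacy.lang.de import German
from spacy.lang.es import Spanish

# Create blank pipelines for German and Spanish, analogous to English()
nlp_de = German()
nlp_es = Spanish()

doc_de = nlp_de("Ich möchte spaCy und NLP lernen.")
doc_es = nlp_es("Quiero aprender spaCy y NLP.")

print([token.text for token in doc_de])
print([token.text for token in doc_es])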



So you get the idea of how to import the language class for your language and use it to process your text. That being the first step, next we will look at how to use this doc object.
 
Basically, a doc is a collection of tokens; we could even go so far as to say it is a list of tokens. That is why you can access the tokens with the doc[index] notation. See the example below of accessing tokens by their list indices:
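
The original screenshot is not reproduced here; a minimal sketch of token indexing, reusing the nlp object and text from above:

doc = nlp("I want to learn spacy and nlp.")

first_token = doc[0]          # Token object for "I"
print(first_token.text)       # -> I
print(doc[3].text)            # -> learn

span = doc[3:6]               # slicing a doc gives a Span, not a list
print(span.text)              # -> learn spacy and
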
Although it may now look like a doc is just a list of tokens, it is much more than that. We will see later, and also in part 2, how a doc stores other attributes from the processing the language object performs, and the useful things you can do with those attributes.

As you can see, there are three main things in spaCy: the doc, tokens and a language object. A token is a piece of text representing a word, punctuation mark or symbol, and it is ideally the smallest unit of the text.

During processing of the text, each token is filled with a number of attributes, such as its text, index, POS tag, entity tag and dependency label. There are other attributes too, but in this post we will look at these ones and their usefulness one by one. For example, look at the following:
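
The screenshot is not reproduced here; a sketch along those lines (the sample sentence is a stand-in, not necessarily the one used in the original):

text = "In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are."
doc = nlp(text)

for token in doc:
    # like_num is True for tokens that look like numbers ("60", "ten", ...)
    if token.like_num:
        # check whether the next token is the percent sign
        if token.i + 1 < len(doc) and doc[token.i + 1].text == "%":
            print("Percentage found:", token.text)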

In the above example, after processing the text with the nlp object, we loop through the doc, treating it as an iterable, and then use the like_num and text attributes of the tokens to find percentages.

You could achieve something similar using a regex pattern (\d+%), but as we go deeper you will see how spaCy offers much more than simple regex alternatives.

Let's see how spaCy provides more information about each token: POS, dependency and more.
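
The screenshot is not reproduced here; a minimal sketch, assuming the small English pipeline en_core_web_sm is installed (a blank English() object would not have the tagger and parser):

import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained pipeline with tagger and parser
doc = nlp("She ate the pizza.")

for token in doc:
    # pos_ is the part-of-speech tag, dep_ the dependency relation,
    # head is the parent token in the dependency tree
    print(token.text, token.pos_, token.dep_, token.head.text)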

As you can see, using the pos_ and dep_ attributes you can find out, respectively, the POS tag spaCy assigns and the token's position in the dependency tree of the sentence. We will discuss the dependency tree and the basics of dependency parsing in another post, so don't worry about that for now.

Let's also see how to access the entities identified in the text by spaCy's NER.

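The screenshot is not reproduced here; a minimal sketch of reading the entities, again assuming en_core_web_sm and reusing the loaded nlp object:

doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

for ent in doc.ents:
    # ent is a Span; label_ is the predicted entity type
    print(ent.text, ent.label_)
# typically prints something like:
#   Apple ORG
#   U.K. GPE
#   $1 billion MONEY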

As discussed earlier, besides being an iterable of tokens, the doc object also has other attributes, such as .ents. doc.ents stores the entities, which are contiguous slices of the doc (called Span objects in spaCy terminology), and each entity's label_ attribute gives its detected entity type.

It is important to note that the entity tagger, POS tagger and some of the other models spaCy uses for these tasks are statistical models. Therefore, you will also see some amount of wrong or missed entity tagging. Those cases you have to handle manually, as shown below:
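
The screenshot is not reproduced here; a hedged sketch of patching a missed entity by hand, using a hypothetical sentence where "iPhone X" is not tagged:

from spacy.tokens import Span

doc = nlp("Upcoming iPhone X release date leaked")   # hypothetical example text

# Manually build a Span over tokens 1 and 2 ("iPhone X") and give it a label;
# this assumes the new span does not overlap an entity the model already found.
iphone_x = Span(doc, 1, 3, label="PRODUCT")
doc.ents = list(doc.ents) + [iphone_x]

print([(ent.text, ent.label_) for ent in doc.ents])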

One more point I would like to note: people often land on spaCy for entity tagging for the first time, read about NER and spaCy, try out some code, and things go south because they don't understand the flow in which spaCy works.

I got this error when I tried spaCy NER without following that flow:

ValueError: [E109] Model for component 'ner' not initialized. Did you forget to load a model, or forget to call begin_training()?

If you follow the simple procedure above of (a) processing the text with a loaded model and (b) using the ents attribute to get the entities, such value errors will not occur.

Now, let's discuss the final portion of part 1 of our spaCy exploration: the usage of spaCy's Matcher API.

Basically, the matcher works by defining patterns of phrases, which it then runs through the text to find the corresponding matches. This type of matching is really useful in information extraction and data science applications, for many reasons.

We will first show a basic example of calling the Matcher and how it works, and then go on to discuss the specifics.

 

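The screenshot is not reproduced here; a minimal sketch of the whole flow with a hypothetical pattern (the spaCy 2.x add() signature is used, matching the rest of this post):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)          # the matcher is initialized with the shared vocab

# Hypothetical pattern: the exact token "iPhone" followed by the exact token "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_X", None, pattern)   # spaCy 2.x style; in v3 it is matcher.add("IPHONE_X", [pattern])

doc = nlp("Upcoming iPhone X release date leaked")
matches = matcher(doc)                # list of (match_id, start, end) tuples

for match_id, start, end in matches:
    print(doc[start:end].text)        # -> iPhone X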

So, to use the matcher, there are four things you need to do:

(1) Import the Matcher class from spacy.matcher.

(2) Provide the vocab from the nlp object to Matcher and initialize the matcher instance.

(3) Write the pattern using the following structure:

pattern = [{attribute_of_token: equate_to_this}, {attribute_of_next_token: equate_to_this}...]

Now, this is the trickiest thing about matchers. Let's say you want to match the word "amazon" case-insensitively. Then which attribute of a token should be matched against what, so that your match is always covered?

The answer is that the lowercased form of the token should be equal to "amazon". The portion of the pattern related to this will then be:

pattern_name = [{"LOWER": "amazon"}]

which will look at the lowercased form of each token and, if it matches "amazon", record that match.

This part is tricky and needs practice as well as examples; we will provide examples shortly.

(4) Once you have written the pattern, you have to add it to the matcher object you initialized. That is done in the following way:

matcher.add(match_name, None, pattern_name)

And voila! Your work is done. Now you can apply the matcher directly to the doc and you will get the matches. The loop to see/print the matched texts after that is very straightforward, as in the example above.

As I said, matching is tricky, which is why we will provide some examples. You will notice that you can also match on POS tags instead of exact text or its lowercased version, as well as on the lemmatized version of a word, so that you can catch all the forms in which the root is present. Now, without further ado:

(1) Example of matching complex patterns:

Here we are trying to find all matches that start with a number, are followed by "fifa world cup", and end in a punctuation mark.
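
The screenshot is not reproduced here; a sketch of that pattern, reusing the nlp and matcher objects from the example above (the sample sentence is a stand-in):

pattern = [
    {"IS_DIGIT": True},       # a number, e.g. "2018"
    {"LOWER": "fifa"},        # case-insensitive "fifa"
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},       # any punctuation token, e.g. ":"
]
matcher.add("FIFA_PATTERN", None, pattern)

doc = nlp("2018 FIFA World Cup: France won!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)        # -> 2018 FIFA World Cup: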

Note that you cannot describe more than one token inside a single {} when writing a pattern for the Matcher class. With the PhraseMatcher, which we will describe in part 2, you will be able to match whole multi-token phrases directly.


[The instructor of the spaCy course pictured in the original screenshots is Ines Montani, co-founder of Explosion and a core developer of spaCy as well as of the annotation tool Prodigy.]

(2) Example of matching POS-based patterns:

In this example, we are trying to match different forms of the root "love", occurring as a verb, followed by a noun.
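
A sketch of that pattern (reusing nlp and matcher from above; the sample sentence is a stand-in):

pattern = [
    {"LEMMA": "love", "POS": "VERB"},  # any form of the verb "love"
    {"POS": "NOUN"},                   # followed by a noun
]
matcher.add("LOVE_PATTERN", None, pattern)

doc = nlp("I loved dogs but now I love cats more.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)         # -> loved dogs, love cats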

Observe that you can actually provide multiple attributes for a single token in a pattern. This helps to find very specific patterns, something that cannot be achieved by regex or string searches, POS searches or lemma searches on their own. For phrase searches of higher complexity, spaCy clearly dominates.


(3) Using "OP" for optional tokens in search patterns:

In this case, we want to search for all forms of the root "buy", followed by a noun, with or without a determiner between them. Since the presence of the determiner is optional, we need to state that explicitly. In such a case the operator key "OP" comes in handy.
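
A sketch of that pattern with the "OP" key (reusing nlp and matcher from above; the sample sentence is a stand-in):

pattern = [
    {"LEMMA": "buy"},               # any form of "buy"
    {"POS": "DET", "OP": "?"},      # an optional determiner: matched 0 or 1 times
    {"POS": "NOUN"},                # followed by a noun
]
matcher.add("BUY_PATTERN", None, pattern)

doc = nlp("I bought a smartphone. Now I'm buying apps.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)      # -> bought a smartphone, buying apps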


Now, you may wonder what values you can set for "OP", and whether they always mean an optional, single occurrence. Obviously not. It turns out you can use the "OP" key to exclude a token entirely, make it optional, or allow it to repeat one or more times, and so on. Check out the OP rules below.
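
The OP rules snippet from the course is not reproduced here; for reference, the four values the "OP" key accepts are:

"!" : negation, match the token 0 times
"?" : match the token 0 or 1 times (optional)
"+" : match the token 1 or more times
"*" : match the token 0 or more times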



Now you are ready to try out your own patterns and matches using spaCy too. We will end this part of the spaCy exploration with two more examples of complex matching.
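
The original screenshots are not reproduced here; two hypothetical sketches in the same spirit, again reusing the nlp and matcher objects from above:

# Example 1: the literal token "iOS" followed by a digit, to catch iOS version mentions
ios_pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION", None, ios_pattern)

# Example 2: any form of the verb "download" followed by a proper noun
download_pattern = [{"LEMMA": "download", "POS": "VERB"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_THING", None, download_pattern)

doc = nlp("I just downloaded Fortnite after updating to iOS 11.")
for match_id, start, end in matcher(doc):
    # look up the string name of the matched pattern from the vocab
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)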


Conclusion:

In this article, we discussed performing basic NLP tasks using spaCy. We will discuss spaCy's data structures, details of matching and more in the upcoming part 2.

Further reading and questions:

Part (2) What is dependency parsing?

Part (3) How to manipulate and create spaCy's pipelines and custom pipelines

Part (4) How to train neural network models using spaCy

Generic open questions:

what is a simple turning method?
