NLP using spacy: spacy exploration part 1

 Introduction:

spaCy is an open-source software library for advanced natural language processing, written in 2015 by Explosion AI founders Matthew Honnibal and Ines Montani. While NLTK is mainly used for teaching NLP concepts and for research, spaCy is one of the best-known packages used in production by companies worldwide. Before spaCy, the market lacked a solid production-level package that people could integrate into their services. spaCy filled exactly that gap. To quote Mr. Honnibal from '15,
 
"spaCy is a new library for text processing in Python and Cython. I wrote it because I think small companies are terrible at natural language processing (NLP). Or rather: small companies are using terrible NLP technology."

spaCy is an industrial-strength library written in Python and Cython, and it provides support for TensorFlow, PyTorch, MXNet and other deep learning platforms. In this post, we will explore the different things we can try with spaCy and also try out named entity recognition. In upcoming parts, we will explore other areas of spaCy too. Links for the upcoming parts are provided below.

Before you proceed, if you want to read an even better introduction, read it from Honnibal directly.

What can you do with spaCy?

Almost all common NLP tasks can be done using spaCy. The supported tasks are:

(1) Tokenization: splitting sentences into words or word-like tokens.

(2) POS tagging: POS tagging refers to part-of-speech tagging. spaCy's default language pipelines contain a POS tagger, and it is supported for many languages including English.

(3) Dependency parsing: one of the more advanced grammatical tasks in NLP, it parses a sentence to uncover the structure of dependencies between its words. spaCy not only does dependency parsing well, it can also visualize the dependency tree.

(4) Lemmatization: reducing a word to its base form, i.e. turning "working", "worked" and "works" into "work".

(5) Sentence boundary detection: finding and segmenting the sentences in a paragraph.

(6) Entity recognition and linking: labeling words in the text as real-world objects such as "person", "location", "organization" etc., as well as linking the entities found in the text to entries in a knowledge base.

(7) Text classification: spaCy can be used to run text labeling and classification.

(8) Rule-based matching: spaCy can find token sequences that follow specific patterns, for example patterns based on parts of speech. This is loosely related to what is often called chunking.

(9) Training: spaCy can also be trained (this is one of its main purposes) to create custom models from your own data. spaCy also supports serialization for saving the custom-trained models.

These are the tasks spaCy performs well. In the next section, we will go through the basic training course available on the spaCy website and summarize the important points you need to get started.

Basic NLP training walk-through:

For NLP basics, there is an amazing 5-chapter course by Ines Montani on the spaCy site. I will summarize the main points you need to start using spaCy, and in later parts explore further details taught in that course.

(1) How does spaCy process basic text?

spaCy has a language class for each supported language: English for English, German for German, Spanish for Spanish and so on. A language class carries the language-specific rules and vocabulary, such as the tokenization rules; trained models for tagging, parsing and NER come from separately loaded pipelines. So, to do basic processing:
from spacy.lang.en import English
nlp = English()
text = "I want to learn spacy and nlp."
doc = nlp(text)

Here English is the language class for processing English text. For any work in spaCy on English text, you will most likely start by creating an nlp object like this, either from the language class or by loading a trained English pipeline.

Although in practice, even for moderately long texts, the nlp(text) call completes in milliseconds, this step can be quite elaborate. With the bare English() object the text is only tokenized; with a trained pipeline it is also POS tagged, parsed and NER tagged before being stored in doc. It is all this fast because of spaCy's highly optimized algorithms. In part 2, we will learn properly how these processes work.

For now, let's see some code snips about how the same thing works for other languages too.
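
As a minimal sketch of the same step for other languages, the snippet below uses the German and Spanish language classes. The sample sentences are only illustrative, and since these are bare language classes, only tokenization is performed.

```python
from spacy.lang.de import German
from spacy.lang.es import Spanish

# Bare language classes: tokenizer rules only, no trained components.
nlp_de = German()
nlp_es = Spanish()

# Illustrative sample sentences (not from the original post).
doc_de = nlp_de("Liebe Grüße!")
doc_es = nlp_es("¿Cómo estás?")

tokens_de = [token.text for token in doc_de]
tokens_es = [token.text for token in doc_es]
print(tokens_de)
print(tokens_es)
```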



So I guess you get the point: import the language object for your language and use it to process your text. With that first step done, next we will see how to use the doc object.
 
A doc is basically a sequence of tokens; you can think of it as a token list. That is why you can access tokens with the doc[index] notation: doc[0], for instance, is the first token.
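
A small sketch of index access, reusing the sentence from the earlier snippet. The slice at the end is an extra illustration: slicing a doc gives a span rather than a plain list.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I want to learn spacy and nlp.")

first_token = doc[0]   # a single Token
last_token = doc[-1]   # negative indices work like on a list
span = doc[3:5]        # a slice of the doc is a Span

print(first_token.text)
print(last_token.text)
print(span.text)
```
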
Although it now looks like doc is just a list of tokens, it is much more than that. We will see later, and also in part 2, that doc stores other attributes produced by the language object's processing, and how useful those attributes are.
As you can see, there are three main things in spaCy: the doc, its tokens and a language model. A token is a piece of text representing a word, punctuation mark or symbol; it is the smallest unit of the text.
During processing, each token is filled with attributes such as its text, index, POS tag, entity tag and dependency label. There are other attributes too, but in this post we will look at these and their usefulness one by one. For example, look at the following:
[Image: spaCy token lexical attribute example with written description]

In the above example, after processing the text with the nlp object, we loop through the doc, treating it as an iterable, and use the like_num and text attributes of the tokens to find percentages.
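
The loop described above can be sketched roughly as follows. The sample sentence is an assumption for illustration; the idea is to look for a number-like token that is directly followed by a "%" token.

```python
from spacy.lang.en import English

nlp = English()
# Illustrative sample text (an assumption, not taken from this post).
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

percentages = []
for token in doc:
    # like_num is a lexical attribute: True for tokens that resemble a number.
    if token.like_num and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        if next_token.text == "%":
            percentages.append(token.text)

print(percentages)
```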

You could achieve something similar with a regex pattern such as (\d+)%, but once we get into depth you will see that spaCy offers much more than a simple regex alternative.

Let's see how spaCy provides more information about each token, such as its POS tag and dependency label.
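
For comparison, here is a sketch of the regex route using only the standard library, on the same kind of illustrative sample text. It finds digits followed by a percent sign but knows nothing about tokens.

```python
import re

# Illustrative sample text (an assumption, not taken from this post).
text = (
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

# Capture the digits that are immediately followed by a percent sign.
regex_percentages = re.findall(r"(\d+)%", text)
print(regex_percentages)
```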

As you can see, the pos_ and dep_ attributes give you, respectively, the POS tag spaCy assigns and the token's relation in the dependency tree of the sentence. We will discuss the dependency tree and the basics of dependency parsing in another post, so there is no need to worry about that for now.
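
Here is a rough sketch of reading these attributes. Normally a trained pipeline fills them in (for example one loaded with spacy.load("en_core_web_sm"), assuming that model is installed). To keep the snippet runnable without any model download, the annotations below are written by hand using the spaCy v3 Doc constructor, so the tags are an assumption and not model output.

```python
from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

# Hand-written annotations standing in for what a trained pipeline would predict.
doc = Doc(
    nlp.vocab,
    words=["She", "ate", "the", "pizza"],
    pos=["PRON", "VERB", "DET", "NOUN"],
    heads=[1, 1, 3, 1],                      # absolute index of each token's head
    deps=["nsubj", "ROOT", "det", "dobj"],
)

for token in doc:
    # pos_ is the part-of-speech tag, dep_ the dependency label,
    # head is the token this one is attached to in the tree.
    print(token.text, token.pos_, token.dep_, token.head.text)
```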

Let's also see how to parse the entities identified in the text by spacy ner. 

[Image: spaCy NER entity tagging usage example]

As I mentioned earlier, besides being an iterable of tokens, the doc object has other attributes, like .ents. doc.ents stores the entities, each of which is a span (a slice of the doc covering one or more tokens, in spaCy terminology); the label_ attribute of each span gives the detected entity type.

It is important to note that the entity tagger, POS tagger and some other models spaCy uses for these tasks are statistical models. Therefore, you will occasionally see wrong or missed entity tags. Such cases have to be handled manually, for example by creating the span yourself.
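
A minimal sketch of handling entities by hand. It uses a bare English() object, so doc.ents starts out empty, and the spans and labels are then created manually. The sentence and the chosen labels are assumptions for illustration.

```python
from spacy.lang.en import English
from spacy.tokens import Span

nlp = English()  # no trained ner component, so nothing is detected automatically
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")

# Create the spans ourselves: Span(doc, start, end, label), end is exclusive.
iphone_x = Span(doc, 1, 3, label="PRODUCT")  # tokens 1 and 2: "iPhone X"
apple = Span(doc, 7, 8, label="ORG")         # token 7: "Apple"
doc.ents = [iphone_x, apple]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, the same loop over doc.ents works without setting the entities yourself.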

One more point worth noting: people often come to spaCy for entity tagging first. They read about NER and spaCy, try some code, and things go south because they have not understood the flow in which spaCy works.

I got this error when I tried spacy ner without using the flow:

ValueError: [E109] Model for component 'ner' not initialized. Did you forget to load a model, or forget to call begin_training()?

If you follow the simple procedure above, (a) process the text with a pipeline that contains a trained ner component and (b) read the entities from the ents attribute, such value errors will not occur.

Now, let's discuss the final portion of part 1 of our spaCy exploration: the Matcher API.

The matcher works by defining token patterns, which are then run over the text to find the matching phrases. This kind of matching is really useful in information extraction and data science applications.

We will first show a basic example of calling the Matcher and how it works, and then discuss the specifics.
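
One quick way to check the flow is to look at which components the nlp object actually contains. The sketch below does this for the bare English() object; the sample sentence is illustrative.

```python
from spacy.lang.en import English

nlp = English()

# A bare language class has no pipeline components at all.
pipe_names = list(nlp.pipe_names)
print(pipe_names)

doc = nlp("Apple was founded by Steve Jobs.")
# Without a trained ner component, no entities are set on the doc.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

With a trained pipeline, for example spacy.load("en_core_web_sm") (assuming that model is installed), "ner" would appear in the pipe names and doc.ents would be filled.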

 

[Image: simple spaCy Matcher example]

So for using matcher; there are four things you need to do:

(1) import the Matcher class from spacy.matcher

(2) provide the vocab from the nlp object to matcher and initialize the matcher instance

(3) Write the pattern using the following structure:

pattern = [{attribute_of_token: equate_to_this}, {attribute_of_next_token: equate_to_this}...]

Now, this is the trickiest thing about matchers. Let's say you want to match the word "amazon" case-insensitively. Which token attribute should be compared against what, so that your match is always covered?

The answer is that the lowercase form of the token (token.lower_, written as the "LOWER" key in a pattern) should equal "amazon". The pattern for this is then:

pattern_name = [{"LOWER": "amazon"}]

This looks at the lowercase form of each token and records a match whenever it equals "amazon".

This part is tricky and needs practice as well as examples, which we will get to shortly.

(4) Once the pattern is written, add it to the matcher object you initialized. In spaCy v2 this is done as follows:

matcher.add(match_name, None, pattern_name)

In spaCy v3, the patterns are passed as a list instead: matcher.add(match_name, [pattern_name]).

And voila! Your work is done. You can now apply the matcher directly to a doc to get the matches. The loop to print the matched texts after that is straightforward.
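
The four steps above can be put together roughly as follows. This sketch uses the spaCy v3 form of matcher.add (patterns passed as a list); the match name and the sample sentence are made up for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher          # step 1: import the Matcher class

nlp = English()
matcher = Matcher(nlp.vocab)               # step 2: initialize with the shared vocab

pattern = [{"LOWER": "amazon"}]            # step 3: one dict per token
matcher.add("AMAZON_PATTERN", [pattern])   # step 4: register the pattern (v3 signature)

doc = nlp("Amazon and AMAZON both match, but Amazonian does not.")

# Calling the matcher on a doc returns (match_id, start, end) tuples.
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```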

As I said, matching is tricky, which is why we will go through some examples. You will notice that instead of the exact text or its lowercase form, you can also match on the POS tag or on the lemma of a token, so that you catch every form in which the root word appears. Now without further ado:

(1) Example of matching complex patterns:

Here we are trying to find all sequences that start with a number, followed by "fifa world cup", and end in a punctuation mark.

Note that in a Matcher pattern, each {} describes exactly one token; you cannot put more than one token inside it. With the PhraseMatcher, which we will describe in part 2, you can match whole phrases instead.
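
A sketch of such a pattern, with one dict per token. It relies only on lexical attributes (IS_DIGIT, LOWER, IS_PUNCT), so a bare English() object is enough. The sample sentence is illustrative and the v3 form of matcher.add is assumed.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# A digit token, then "fifa", "world", "cup" in any casing, then punctuation.
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
matcher.add("FIFA_PATTERN", [pattern])

doc = nlp("2018 FIFA World Cup: France won!")
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```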


[The woman in the picture is the instructor of the spaCy course: Ines Montani, co-founder of spaCy and core developer of spaCy as well as Prodigy, the annotation tool.]

(2) Example of matching POS-based patterns:

In this example, we are trying to match different forms of the root "love", occurring as a verb and followed by a noun.

Observe that you can provide multiple attributes in a single {} when they all describe the same token. This makes the pattern very specific, which regex, plain string search, POS search or lemma search cannot achieve on their own. This is where spaCy shines for more complex phrase searches.
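
A sketch of this pattern follows. POS tags and lemmas normally come from a trained pipeline; to keep the snippet runnable without a model download, they are written by hand here using the spaCy v3 Doc constructor, so treat the annotations as an assumption and not as model output.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = English()

# Hand-annotated doc standing in for nlp("I loved dogs but now I love cats more.")
doc = Doc(
    nlp.vocab,
    words=["I", "loved", "dogs", "but", "now", "I", "love", "cats", "more", "."],
    pos=["PRON", "VERB", "NOUN", "CCONJ", "ADV", "PRON", "VERB", "NOUN", "ADV", "PUNCT"],
    lemmas=["I", "love", "dog", "but", "now", "I", "love", "cat", "more", "."],
)

matcher = Matcher(nlp.vocab)
# First token: lemma "love" AND tagged as a verb; second token: any noun.
pattern = [{"LEMMA": "love", "POS": "VERB"}, {"POS": "NOUN"}]
matcher.add("LOVE_NOUN", [pattern])

matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```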


(3) Using optional tokens to extend search patterns:

In this case, we want to find all forms of the root "buy" followed by a noun, with or without a determiner between them. As the determiner is optional, we need to state that explicitly.
In such a case, the operator key "OP" comes in handy.
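
A sketch of the optional determiner, again on a hand-annotated doc (spaCy v3 Doc constructor) so that it runs without a trained model. The sentence and its annotations are assumptions for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = English()

# Hand-annotated stand-in for the output of a trained pipeline.
doc = Doc(
    nlp.vocab,
    words=["I", "bought", "a", "smartphone", ".", "Now", "I", "buy", "apps", "."],
    pos=["PRON", "VERB", "DET", "NOUN", "PUNCT", "ADV", "PRON", "VERB", "NOUN", "PUNCT"],
    lemmas=["I", "buy", "a", "smartphone", ".", "now", "I", "buy", "app", "."],
)

matcher = Matcher(nlp.vocab)
# Lemma "buy", then an optional determiner ("OP": "?"), then a noun.
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("BUY_NOUN", [pattern])

matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```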


Now, you may wonder which values you can set for "OP", and whether it always means optional presence once. It does not. You can use the "OP" key to negate a token pattern, to make it optional (zero or one time), or to require it one or more times or allow it zero or more times.
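
The basic operator values can be summarized as below, followed by a small sketch of the "+" operator. The OP_RULES dict is just a note-taking helper made up for this post, not part of spaCy, and the sample sentence is illustrative.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

# Hypothetical helper dict: the basic "OP" values and what they mean.
OP_RULES = {
    "!": "negate the token pattern (match it 0 times)",
    "?": "optional (match 0 or 1 times)",
    "+": "match 1 or more times",
    "*": "match 0 or more times",
}

nlp = English()
matcher = Matcher(nlp.vocab)

# One or more "very" tokens, followed by "good".
pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "good"}]
matcher.add("VERY_GOOD", [pattern])

doc = nlp("This is very very good")
matched = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched)
```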



Now you are ready to try out your own patterns and matches with spaCy. We will end this exploration with two more examples of complex matching.

[Images: two complex spaCy Matcher examples]

Conclusion:

In this article, we discussed performing basic NLP tasks using spaCy. We will discuss data structures, further details of matching and more in the upcoming part 2.

Further readings and questions:

part (2) what is dependency parsing?

part (3) How to manipulate and create spacy's pipeline and custom pipelines

Part (4) How to train neural network models using spacy

Generic open questions:

what is a simple turning method?
