
spaCy NER: introduction and usage

Introduction:

One of the main applications of spaCy is performing named entity recognition, or NER. Named entity recognition is the NLP task of detecting named entities, such as organizations, persons, and locations, in text. spaCy provides built-in support for named entity recognition. In this article, we will go through the basics of detecting the different entity types and using spaCy's NER features.

What is NER and what are the different labels?

NER, or named entity recognition, is the procedure of detecting named entities using natural language processing algorithms. spaCy's English models use the OntoNotes 5 label scheme by default, which contains the following labels:

spaCy NER label list

(1) PERSON: this represents any kind of person, real or fictional. Fictional basically refers to cartoon, movie, or book characters, etc.

(2) NORP: nationalities or religious or political groups. This again refers to a type of community of people. It is sometimes confusing, as nationalities and names of countries often look similar; but this label doesn't refer to a geographical, inanimate object, rather to a community or group of people.

(3) FAC: this refers to special types of locations or places which are mainly man-made; for example, buildings, airports, highways, bridges, etc. This can seem to overlap with ordinary locations, but the main difference is that FAC entities are man-made while locations are geographical and natural.

(4) ORG: this refers to different types of organizations. For example, companies, agencies, institutions, and academies fall into this category.

(5) GPE: this stands for geo-political entities. These are entities which have some sort of authoritative figure and governing body along with a geographical existence. These can also apparently be seen as locations, but the presence of an authoritative figure or an administrative political body is the defining difference. Examples include states, cities, and countries.

(6) LOC: this stands for locations which are non-GPE. This includes places, mountain ranges, forests, water bodies, etc.

(7) PRODUCT: this almost follows the plain English meaning of the word product. It includes different objects, vehicles, foods, etc. This doesn't include any sellable services, though.

(8) EVENT: this also follows the simple English interpretation. Examples include named hurricanes, game tournaments, battles, wars, etc.

(9) WORK_OF_ART: this refers to any object (physical or virtual) which constitutes an artistic work. This can include titles of books, songs, famous named paintings, etc.

(10) LAW: this quite simply refers to laws, which are basically named documents that have been made into law.

(11) LANGUAGE: any spoken or communicated language falls in this category.

The rest are also pretty straightforward: (12) DATE, (13) TIME, (14) PERCENT, (15) MONEY, (16) QUANTITY, (17) ORDINAL, and (18) CARDINAL.

So this is the most detailed and granular level of NER tags. One can reduce this down by clubbing multiple tags together; for example, ORDINAL, CARDINAL, PERCENT, and QUANTITY back into numbers, and DATE and TIME into times, and so on. The most basic format of NER tagging includes just persons, locations, and numbers.
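To make this concrete, below is a minimal sketch of such a grouping; the coarse bucket names ('NUMBER', 'TIME') and the mapping itself are my own choices for illustration, not an official spaCy mapping:

import spacy

# Hypothetical coarse grouping of the fine-grained labels;
# anything not listed here keeps its original label.
COARSE = {
    'ORDINAL': 'NUMBER', 'CARDINAL': 'NUMBER',
    'PERCENT': 'NUMBER', 'QUANTITY': 'NUMBER', 'MONEY': 'NUMBER',
    'DATE': 'TIME', 'TIME': 'TIME',
}

model = spacy.load('en_core_web_sm')
doc = model("Apple paid $2 billion on 3 January for the first time.")
for ent in doc.ents:
    # Fall back to the original label for tags we didn't group.
    print(COARSE.get(ent.label_, ent.label_), ent.text)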

Can I create custom entities?

Obviously, one can create custom entities and train a spaCy NER model to detect those custom entities too. This is what we call blank entity model training. How to train spaCy NER models with your custom data is discussed extensively in part 4 of our spaCy training series blogs.

When creating your own custom entities, you should take care that the entities are non-overlapping; i.e., one shouldn't tag the same thing with two entities. Another piece of advice: when training custom entities, make sure they aren't too niche. For example, if you are training to detect clothes in unstructured text data, then instead of training for bell-bottom pants or capri pants, maybe train for only pants to begin with.

This also relates to the idea that you may have a small amount of data; therefore it is always good to train with a top-down approach instead of a bottom-up one, i.e. start with broader umbrella entities and a smaller number of them, and only if that performs very well, try going one step further down.
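As a taste of what such training looks like, here is a minimal sketch of teaching a blank English model a single hypothetical CLOTHING label; the label, the two toy sentences, and their character offsets are all made up for illustration:

import spacy
from spacy.training import Example

# Start from a blank English pipeline and add a fresh NER component.
nlp = spacy.blank('en')
ner = nlp.add_pipe('ner')
ner.add_label('CLOTHING')  # hypothetical custom label

# Toy training data: (text, {'entities': [(start_char, end_char, label)]})
TRAIN_DATA = [
    ('I bought new pants yesterday', {'entities': [(13, 18, 'CLOTHING')]}),
    ('She wore a red shirt', {'entities': [(15, 20, 'CLOTHING')]}),
]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the toy data
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)
print(losses)

In practice you would use many more examples (and spaCy's config-driven training workflow), but the idea of non-overlapping character spans per label stays the same.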

How to extract names or other entities with spaCy?

Let's assume for the article's sake that we are mainly talking about English spaCy models. In English models, such as spaCy's small model en_core_web_sm or the larger model en_core_web_lg, we get NER tagging by default in the standard pipeline. So once you make the doc by passing the text through the model, you only have to fetch the entities using the attribute doc.ents. A sample code will look like this:

import spacy

# Load the small English pipeline, which includes an NER component.
model = spacy.load('en_core_web_sm')

english_text = "Trump, the president of America, gave a horrific speech."

# Passing text through the model returns a Doc with entities attached.
eng_doc = model(english_text)

# Each entity carries a label and the matched text span.
for ent in eng_doc.ents:
    print(ent.label_, ent.text)

Clearly, the ents attribute stores all the detected entities, including names, and looping through it reveals which tokens or spans of tokens are detected as each entity.
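Often you only care about one label; here is a small follow-up sketch that filters the same doc for person names (the exact spans detected depend on the model version):

# Keep only the entities labeled PERSON, e.g. to extract names.
people = [ent.text for ent in eng_doc.ents if ent.label_ == 'PERSON']
print(people)  # e.g. ['Trump'] with en_core_web_sm

Each entity also exposes ent.start_char and ent.end_char if you need its character offsets back in the original text.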

Comparison with other libraries:

When we talk about packages or services which provide NER tagging and annotation, stanfordnlp (now stanza), nltk, and flair are a few alternatives worth naming. But generally, if you are not doing a task where the last few percentage points of accuracy matter too much, spaCy is pretty good. In some cases (not statistically or experimentally proven, just an empirical observation), flair and stanfordnlp may perform better.

But with its speed, the option to train on top of the baseline models, and the nominal performance difference with other packages, it is clear that spaCy is the winner for this task as well.

How do spaCy NER models work?

spaCy hasn't officially published a paper on exactly what their model architecture is. From version 3.0 onwards, spaCy ships transformer-based pipelines trained with a transformer architecture; before that, spaCy used deep convolutional neural networks with specific speed optimization techniques. For further details, try reading the spaCy blogs. We have also embedded this video from spaCy's author Matthew Honnibal, where he talks about how spaCy's models work.
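If you want to try a transformer-based pipeline yourself, here is a hedged sketch; it assumes you have installed the spacy-transformers extra and downloaded the en_core_web_trf model:

import spacy

# The transformer pipeline is slower but generally more accurate
# than the small CNN-based models like en_core_web_sm.
trf_model = spacy.load('en_core_web_trf')
doc = trf_model('Apple is opening a new office in Paris.')
print([(ent.text, ent.label_) for ent in doc.ents])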

Conclusion:

So in this article, we discussed what NER is, which different entities spaCy models are trained to detect, and what they stand for. We also brushed over questions such as whether spaCy NER is fast enough and whether you can train custom entities with spaCy. We also touched on how the models work and embedded a video where spaCy's author discusses how the models are trained, with a focus on the NER model's structure. For further reading, check out our spaCy introduction and its following parts.

Thanks for reading and stay tuned for more articles!
