
Spacy NER introduction and usage

Introduction:

One of the main applications of spacy is named entity recognition, or NER. Named entity recognition is the NLP task of detecting named entities such as organizations, persons, locations and other entity types in text. Spacy provides built-in support for detecting named entities. In this article, we will go through the basics of the different NER labels and how to use spacy's NER features.

What is NER and what are the different labels?

NER, or named entity recognition, is the procedure of detecting named entities in text using natural language processing algorithms. Spacy's pretrained English models use the OntoNotes label scheme by default, which gives the following NER label list:

spacy ner label list

(1) PERSON: this represents any kind of person, both real and fictional. Fictional here basically refers to cartoon, movie or book characters and so on.

(2) NORP: nationalities, religious groups or political groups. This again refers to a type of community of people. It is sometimes confusing because nationalities and country names often look alike; but NORP doesn't refer to a geographical, inanimate object, it refers to a community or group of people.

(3) FAC: this refers to special types of places that are mainly man-made, for example buildings, airports, highways, bridges etc. There can seem to be an overlap between this and a normal location; the main difference is that FAC entities are man-made, while locations are more geographical and natural.

(4) ORG: this refers to different types of organizations. For example, companies, agencies, institutions, academies etc. fall in this category.

(5) GPE: this stands for geo-political entities. These are entities that have some sort of governing body or administrative authority along with a geographical existence. They can also appear to be plain locations; but the presence of an administrative or political body is the defining difference. Examples include states, cities, countries etc.

(6) LOC: this stands for non-GPE locations. This can include places, mountain ranges, forests, water bodies etc.

(7) PRODUCT: this mostly follows the everyday English meaning of the word product. It includes different objects, vehicles, foods, etc. It doesn't include sellable services, though.

(8) EVENT: this also follows the plain English interpretation. Examples include named hurricanes, game tournaments, battles, wars, etc.

(9) WORK_OF_ART: this refers to any object (physical or virtual) that constitutes an artistic work. This can include titles of books, songs, famous named paintings etc.

(10) LAW: this quite simply refers to laws, i.e. named documents that have been turned into law.

(11) LANGUAGE: any spoken or communicated language falls in this category.

The rest are also pretty straightforward: (12) DATE (13) TIME (14) PERCENT (15) MONEY (16) QUANTITY (17) ORDINAL (18) CARDINAL.
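By the way, if you ever forget what one of these labels stands for, spacy can tell you itself. Below is a small sketch (assuming the small English model en_core_web_sm is installed) that lists the labels of the loaded NER component and uses spacy.explain to print a short description of each:

import spacy

nlp = spacy.load('en_core_web_sm')
ner = nlp.get_pipe('ner')

# the labels this particular pipeline was trained to detect
print(ner.labels)

# spacy.explain returns a short human-readable description of a label
for label in ner.labels:
    print(label, '->', spacy.explain(label))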

So this is the most detailed and granular level of NER tags. One can reduce this down by clubbing multiple tags together; for example, ORDINAL, CARDINAL, PERCENT and QUANTITY can all be collapsed back into numbers, DATE and TIME into time expressions, and so on (a small sketch of such a mapping follows below). The most basic format of NER tagging includes just person, location and numbers.
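For example, such a coarse grouping can be done with a simple mapping like the sketch below; the group names NUMBER, TIME_EXPR and LOCATION are just my own convention here, not something spacy provides:

# collapse the fine-grained labels into a few coarse buckets
COARSE_MAP = {
    'ORDINAL': 'NUMBER', 'CARDINAL': 'NUMBER',
    'PERCENT': 'NUMBER', 'QUANTITY': 'NUMBER', 'MONEY': 'NUMBER',
    'DATE': 'TIME_EXPR', 'TIME': 'TIME_EXPR',
    'GPE': 'LOCATION', 'LOC': 'LOCATION', 'FAC': 'LOCATION',
}

def coarse_label(label):
    # anything not in the map keeps its original fine-grained label
    return COARSE_MAP.get(label, label)

print(coarse_label('PERCENT'))   # NUMBER
print(coarse_label('PERSON'))    # PERSON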

Can I create custom entities?

One can certainly create custom entities and train a spacy NER model to detect those custom entities too. This is what we call blank entity model training. How to train spacy NER models with your own custom data is discussed extensively in part 4 of our spacy training series blogs.
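To give a rough idea of what blank entity model training looks like, here is a minimal sketch assuming spacy v3; the CLOTHING label and the two tiny training sentences are purely hypothetical, and for anything serious you should follow the full workflow from the training series:

import random
import spacy
from spacy.training import Example

# two hypothetical annotated sentences with a made-up CLOTHING entity
TRAIN_DATA = [
    ('I bought a pair of pants yesterday', {'entities': [(19, 24, 'CLOTHING')]}),
    ('She wore a red jacket to the party', {'entities': [(15, 21, 'CLOTHING')]}),
]

nlp = spacy.blank('en')        # blank English pipeline, no pretrained weights
ner = nlp.add_pipe('ner')      # add an empty NER component
ner.add_label('CLOTHING')

optimizer = nlp.initialize()   # initialize the model weights
for epoch in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)   # one gradient step per example

doc = nlp('He needs new pants')
print([(ent.text, ent.label_) for ent in doc.ents])   # hopefully tags 'pants' as CLOTHING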

When creating your own custom entities, you should take care that the entities are non-overlapping, i.e. you shouldn't tag the same span with two different entities. Another piece of advice when training custom entities is to make sure they aren't too niche. For example, if you are training to detect clothes in unstructured text data, then instead of training for bell bottom pants or capri pants, maybe train for just pants to begin with.

This also ties into the idea that you may have only a small amount of data; therefore it is always better to take a top-down approach rather than a bottom-up one, i.e. start with a smaller number of broader umbrella entities, and only if that performs very well, try going one step more granular.

How to extract names or other entities from spacy?

Let's assume for the article's sake that we are mainly talking about the English spacy models. In the English models, such as spacy's small model en_core_web_sm or the larger model en_core_web_lg, NER tagging is included in the default pipeline. So once you create the doc by passing the text through the model, you only have to fetch the entities from the doc.ents attribute. A sample piece of code looks like this:

import spacy

# load the small English pipeline; its default pipeline already includes an 'ner' component
model = spacy.load('en_core_web_sm')

english_text = "Trump, the president of America, gave a horrific speech."

# passing the text through the model returns a Doc with the entities attached
eng_doc = model(english_text)

# doc.ents holds the detected entity spans; each span carries a label and its text
for ent in eng_doc.ents:
    print(ent.label_, ent.text)

Clearly, the ents attribute stores all the detected entities, including names, and looping through it reveals which tokens or combinations of tokens were detected as which entity.
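For instance, since the heading promised extracting names, here is a small follow-up on the same eng_doc; the exact output depends on the model version, but the small English model typically tags Trump as PERSON and America as GPE:

# keep only the person names the model found
person_names = [ent.text for ent in eng_doc.ents if ent.label_ == 'PERSON']
print(person_names)            # typically ['Trump']

# each entity also carries its character offsets into the original text
for ent in eng_doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)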

Comparison with other libraries:

When we talk about packages or services that provide NER tagging and annotation, stanfordnlp, nltk and flair are a few alternatives worth naming. But generally, if you are not working on a task where the last few points of accuracy matter a great deal, spacy is pretty good. In some cases (not statistically or experimentally proven, just empirical experience), flair and stanfordnlp may perform better.

But given spacy's speed, the option to train on top of its baseline models, and the nominal performance difference compared with the other packages, spacy is arguably the winner for this task as well.

How do spacy NER models work?

Spacy hasn't officially published every detail of its exact model architecture. From version 3 onwards, spacy also ships transformer-based pipelines (such as en_core_web_trf) that are trained on top of a transformer model, while the standard small, medium and large pipelines use a convolutional tok2vec architecture with specific speed optimization techniques. For further details, try reading the spacy blogs. We have also embedded a video from spacy's author Matthew Honnibal, where he talks about how spacy models work.
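If you are curious which components a given pipeline actually contains, you can inspect it directly, as in the small sketch below; note that the transformer pipeline en_core_web_trf requires the spacy-transformers extra and a separate download, so that part is left commented out:

import spacy

nlp_sm = spacy.load('en_core_web_sm')
# e.g. ['tok2vec', 'tagger', 'parser', ..., 'ner'] for the CNN-based pipeline
print(nlp_sm.pipe_names)

# nlp_trf = spacy.load('en_core_web_trf')
# print(nlp_trf.pipe_names)    # would start with a 'transformer' component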

Conclusion:

So in this article, we discussed what NER is, which entities the spacy models are trained to detect, and what those labels stand for. We also touched on questions such as whether spacy NER is fast enough and whether you can train custom entities with spacy. We also looked at how the models work and embedded a video where spacy's author discusses how the models are trained, with a focus on the NER model's structure. For further reading, check out our spacy introduction and the following parts of the series.

Thanks for reading and stay tuned for more articles!
