
spaCy NER: introduction and usage

 Introduction:

One of the main applications of spaCy is performing named entity recognition, or NER. Named entity recognition is the NLP task of detecting named entities in text, such as organizations, persons, locations and other entity types. spaCy provides built-in support for named entity recognition. In this article, we will go through the basics of the different entity labels and how to use spaCy's NER features.

What is NER and what are the different labels?

NER, or named entity recognition, is the procedure of detecting named entities using natural language processing algorithms. spaCy's pretrained English models detect the following set of labels by default:

spaCy NER label list

(1) PERSON: any kind of person, real or fictional. Fictional here covers cartoon, movie or book characters, etc.

(2) NORP: nationalities, religious or political groups. This again refers to a type of community of people. It can be confusing because nationality names and country names are often similar, but NORP does not refer to a geographical, inanimate object; it refers to a community or group of people.

(3) FAC: special types of locations or places that are mainly man-made, for example buildings, airports, highways, bridges, etc. There can seem to be an overlap between FAC and ordinary locations, but the main difference is that facilities are man-made while locations are geographical and natural.

(4) ORG: different types of organizations, for example companies, agencies, institutions, academies, etc.

(5) GPE: geo-political entities. These are entities that have some sort of governing body or administrative authority along with a geographical existence. They can also look like plain locations, but the presence of an administrative political body is the defining difference. Examples include states, cities and countries.

(6) LOC: non-GPE locations, such as mountain ranges, forests and bodies of water.

(7) PRODUCT: this mostly follows the everyday English meaning of the word and includes objects, vehicles, foods, etc. It does not include sellable services, though.

(8) EVENT: this also follows the plain-English interpretation. Examples include named hurricanes, sports tournaments, battles, wars, etc.

(9) WORK_OF_ART: any object (physical or virtual) that constitutes an artistic work. This includes titles of books and songs, famous named paintings, etc.

(10) LAW: quite simply, named documents that have been made into law.

(11) LANGUAGE: any spoken or communicated language falls in this category.

The rest are also pretty straightforward: (12) DATE, (13) TIME, (14) PERCENT, (15) MONEY, (16) QUANTITY, (17) ORDINAL and (18) CARDINAL. The snippet below shows how to list these labels, and a short description of each, straight from a loaded model.
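
Here is a minimal sketch, assuming the en_core_web_sm model is installed, that lists the labels a loaded pipeline actually supports and uses spacy.explain to print a short gloss for each:

import spacy

nlp = spacy.load("en_core_web_sm")

# The trained NER component knows exactly which labels it can predict.
for label in nlp.get_pipe("ner").labels:
    # spacy.explain returns a short human-readable description of the label.
    print(label, "->", spacy.explain(label))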

So this is the most detailed, granular level of NER tags. One can reduce the granularity by clubbing multiple tags together; for example, ORDINAL, CARDINAL, PERCENT and QUANTITY can all be collapsed into a generic number tag, DATE and TIME into a generic time tag, and so on. The most basic NER tagging schemes include just persons, locations and numbers; a small sketch of such a coarse mapping follows this paragraph.
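
Here is one way to do that kind of coarsening; the COARSE dictionary and its bucket names are purely illustrative choices, not anything built into spaCy:

import spacy

# Hypothetical mapping from spaCy's fine-grained labels to coarser buckets.
COARSE = {
    "ORDINAL": "NUMBER", "CARDINAL": "NUMBER",
    "PERCENT": "NUMBER", "QUANTITY": "NUMBER", "MONEY": "NUMBER",
    "DATE": "TIME", "TIME": "TIME",
    "GPE": "LOCATION", "LOC": "LOCATION", "FAC": "LOCATION",
}

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple paid $2 billion for the startup on 3 March 2020 in California.")

for ent in doc.ents:
    # Fall back to the original label if no coarser bucket was defined.
    print(ent.text, "->", COARSE.get(ent.label_, ent.label_))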

Can I create custom entities?

Obviously, one can create custom entities and train a spaCy NER model to detect those custom entity types too. This is what we call blank-model entity training. How to train spaCy NER models with your custom data is discussed extensively in part 4 of our spaCy training series.

When creating your own custom entities, you should take care that the entities are non-overlapping, i.e. the same span of text should not be tagged with two different entities. Another piece of advice is that, when training custom entities, you should make sure they are not too niche. For example, if you are training to detect clothes in unstructured text data, then instead of training for bell-bottom pants or capri pants, maybe train for just pants to begin with.

This ties into the idea that you may have only a small amount of data, so it is always better to start training with a top-down approach instead of a bottom-up one, i.e. start with fewer, more umbrella-like entities and only go one level more specific if that performs well. A minimal training sketch is shown below.
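
The following is a minimal sketch of programmatic training in spaCy v3 with a single hypothetical CLOTHING label and a tiny toy training set; a real project would use spaCy's config-driven training workflow and far more data:

import random
import spacy
from spacy.training import Example

# Toy training data with a hypothetical CLOTHING label (character offsets).
TRAIN_DATA = [
    ("She bought a new pair of pants yesterday.", {"entities": [(25, 30, "CLOTHING")]}),
    ("His jacket was soaked in the rain.", {"entities": [(4, 10, "CLOTHING")]}),
]

nlp = spacy.blank("en")           # start from a blank English pipeline
ner = nlp.add_pipe("ner")         # add an empty NER component
ner.add_label("CLOTHING")

# Build Example objects and run a few update steps.
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in TRAIN_DATA
]
nlp.initialize(lambda: examples)

for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, drop=0.3, losses=losses)
    print(epoch, losses)

# Quick check on unseen text (results will be rough with such a tiny toy set).
doc = nlp("I love these pants.")
print([(ent.text, ent.label_) for ent in doc.ents])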

How to extract names or other entities with spaCy?

Let's assume, for the sake of this article, that we are mainly talking about English spaCy models. In English models such as the small en_core_web_sm or the larger en_core_web_lg, NER tagging comes by default in the standard pipeline. So once you create the doc by passing the text through the model, you only have to fetch the entities via the doc.ents attribute. A sample piece of code looks like this:

import spacy

# Load the small English model (download it first with: python -m spacy download en_core_web_sm).
model = spacy.load('en_core_web_sm')

english_text = "Trump, the president of America, gave a horrific speech."

# Passing the text through the model returns a Doc with entities attached.
eng_doc = model(english_text)

for ent in eng_doc.ents:
    # label_ is the entity type (e.g. PERSON, GPE); text is the matched span.
    print(ent.label_, ent.text)

Clearly, the ents attribute stores all the detected entities, including names, and looping through it reveals which tokens or combinations of tokens were detected as each entity.
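
If you need character offsets for downstream use, or a quick visual check of the detected spans, the same Doc exposes both; here is a small sketch using ent.start_char / ent.end_char and spaCy's built-in displacy visualizer (reusing eng_doc from the snippet above):

from spacy import displacy

# Character offsets are handy for annotating the original string.
for ent in eng_doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

# Returns HTML markup highlighting the entities; in a notebook it renders inline,
# and displacy.serve(eng_doc, style="ent") starts a small local web server instead.
html = displacy.render(eng_doc, style="ent", page=True)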

Comparison with other libraries:

When we talk about packages or services that provide NER tagging and annotation, Stanford NLP (Stanza), NLTK and Flair are a few alternatives worth naming. But generally, unless you are working on a task where small accuracy differences above the 90% mark matter a great deal, spaCy is pretty good. In some cases (not statistically or experimentally proven, just empirical experience), Flair and Stanford NLP may perform better.

But given its speed, the option to train on top of the baseline models, and the nominal performance gap with the other packages, spaCy is clearly the winner for this task as well.

How do spaCy NER models work?

spaCy has not officially published every detail of its model architecture. From version 3 onwards, spaCy offers transformer-based pipelines (such as en_core_web_trf) trained with transformer components, while earlier versions relied on deep neural networks with specific speed optimizations. For further details, the spaCy blog is worth reading. We have also embedded a video from spaCy's author Matthew Honnibal, where he talks about how spaCy's models work.
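
To see which components a loaded pipeline actually contains (the exact component names vary by model and spaCy version), you can inspect it like this:

import spacy

nlp = spacy.load("en_core_web_sm")

# Names of the components in execution order, e.g. tok2vec, tagger, parser, ner.
print(nlp.pipe_names)

# Metadata about the installed model package (language, name, version).
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])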

Conclusion:

So in this article, we discussed what NER is, which entities spaCy models are trained to detect, and what those labels stand for. We also brushed over questions such as whether spaCy NER is fast enough and whether you can train custom entities with spaCy. Finally, we touched on how the models work and embedded a video where spaCy's author discusses how the models are trained, with a focus on the NER model's structure. For further reading, check out our spaCy introduction and the following parts of that series.

Thanks for reading and stay tuned for more articles!
