spacy exploration part 4: neural network model training using spacy

 Introduction:

We discussed different aspects of spacy in part 1, part 2 and part 3. Up to this point, we have only used the pre-trained models. In many cases, though, you may need to tweak or improve a model, for example to add new categories to the tagger or the entity recognizer for a specific project or task. In this part, we will discuss how to update an existing neural network model or train our own, as well as the technical issues that arise along the way.

How does training work:

1. Initialize the model with random weights, using nlp.begin_training.

2. Predict a batch of samples with the current model, via nlp.update.

3. Compare the predictions with the true labels, calculate how the weights should change based on the difference, and update the weights.

4. Go back to step 2 and repeat.
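
To make the four steps above concrete, here is a toy sketch in plain Python. It is not spacy's implementation: the tiny perceptron, the `train` and `predict` helpers and the `OR_DATA` samples are all made up for illustration. The shape of the loop is the same, though: start from random weights, predict, compare with the true label, update, repeat.

```python
import random


def predict(weights, bias, features):
    """Step 2: predict a label (0 or 1) with the current weights."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(samples, n_iter=100, learning_rate=0.1, seed=0):
    """Toy perceptron training loop (illustrative only, not spacy code)."""
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is not shuffled
    # Step 1: initialize the model with random weights.
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = 0.0
    for _ in range(n_iter):  # Step 4: repeat from step 2.
        rng.shuffle(samples)
        for features, label in samples:
            guess = predict(weights, bias, features)
            # Step 3: compare with the true label and update the weights.
            error = label - guess
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical data: the logical OR of two binary features.
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train(OR_DATA)
print([predict(weights, bias, features) for features, _ in OR_DATA])
```

In spacy, all of this is hidden behind nlp.begin_training and nlp.update; the sketch only shows the idea.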

Refer to the following diagram for a better understanding:


In practice, the goal is to create annotated data, either with a tool such as prodigy or brat, or more simply with spacy's PhraseMatcher. Using the PhraseMatcher, we can quickly tag text with new labels, or correct mislabeled data, and then use the result to update the pos tagger or the entity recognizer. We will run through a small example of updating an entity recognizer to include a gadget label. In this example, we create the training data using the phrase matcher and then train using the algorithm described above.


If you look closely, each training example follows a common pattern, where type names the kind of annotation (for entities, the key is "entities"):

(text, {type: [(start_char, end_char, target_label), (start_char2, end_char2, target_label), ...]})
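
As a rough illustration of this pattern, the sketch below builds one training example in plain Python. It uses `re` as a stand-in for spacy's PhraseMatcher, and the `make_example` helper, the sample sentence and the "GADGET" label are assumptions chosen for the example. It also assumes the phrases do not overlap each other in the text.

```python
import re


def make_example(text, phrases, label):
    """Return (text, {"entities": [(start_char, end_char, label), ...]}).

    Hypothetical helper: finds every exact occurrence of each phrase and
    records its character offsets, the way the training format expects.
    """
    entities = []
    for phrase in phrases:
        for match in re.finditer(re.escape(phrase), text):
            entities.append((match.start(), match.end(), label))
    return (text, {"entities": sorted(entities)})


example = make_example("How to preorder the iPhone X", ["iPhone X"], "GADGET")
print(example)
```

The offsets are character positions, so text[start_char:end_char] should give back exactly the labelled phrase.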

Now we can train and update the entity model with the examples in this dataset, using the steps above.

We can also train completely empty models from the ground up, with the same structure as the ones already in action, using spacy.blank.

Using spacy.blank("en"), we can create an empty pipeline object. After that, you can create a pipe element using nlp.create_pipe(task_name), where task_name is the name of one of the ready-made pipeline components. Then add that element to the pipeline using nlp.add_pipe, which leaves the pipe ready for training. To understand how pipelines are structured, and what the different pipeline elements and their names are, you may want to refer to part 3, where we discussed spacy pipelines and how they work. Finally, add the new label to that specific element using add_label.

Now we call nlp.begin_training() to start the training. In each iteration, we take the specially formatted training data and split it into texts and their annotations, i.e. the start and end characters together with the actual labels to train against. We call spacy's minibatch util to split the training data into batches, and build the texts and annotations lists from each batch.
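
The batching step can be pictured with the plain Python sketch below. `simple_minibatch` is a hypothetical stand-in for spacy's minibatch util with a fixed batch size, and the `TRAIN_DATA` sentences are made up; the point is only to show how one batch is unzipped into a texts list and an annotations list.

```python
import random


def simple_minibatch(items, size):
    """Hypothetical stand-in for spacy's minibatch util: fixed-size chunks."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up examples in the (text, {"entities": [...]}) format.
TRAIN_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I want the new iPhone X", {"entities": [(15, 23, "GADGET")]}),
    ("Is the iPhone X worth it?", {"entities": [(7, 15, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]

rng = random.Random(0)
data = list(TRAIN_DATA)
rng.shuffle(data)  # reshuffle at the start of each iteration

for batch in simple_minibatch(data, size=2):
    # Split the batch into parallel tuples of texts and annotations.
    texts, annotations = zip(*batch)
    print(len(texts), "texts,", len(annotations), "annotations")
```

Note that an example with an empty entities list is still useful, since it tells the model that the text contains nothing to tag.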

We then pass the texts and their respective annotations (created from the training data using the minibatch util) to the nlp.update function. If you recall the algorithm for training or updating models, the update is driven by comparing the predictions with the true labels.

Here we call it in the form nlp.update(texts, annotations, losses=losses). To store and observe the losses at each step of the training iterations, you need to create an empty dict, losses = {}, beforehand and inspect it after each update. We will discuss how update works in later posts that dive into spacy's code.

Printing the losses gives you a good idea of how the training is going and how the loss slowly decreases. Finally, see the code snippet for what we discussed above:
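
Here is a minimal sketch of those steps, assuming spacy v2.x, where nlp.create_pipe, nlp.begin_training and the nlp.update(texts, annotations, ...) call used in this post are available (spacy v3 changed this API). It is an illustration, not necessarily the exact snippet from the course: the `train_blank_ner` function name, the `TRAIN_DATA` sentences, the batch size and the iteration count are all assumptions.

```python
import random

# Made-up examples in the (text, {"entities": [...]}) format.
TRAIN_DATA = [
    ("How to preorder the iPhone X", {"entities": [(20, 28, "GADGET")]}),
    ("iPhone X is coming", {"entities": [(0, 8, "GADGET")]}),
    ("I need a new phone! Any tips?", {"entities": []}),
]


def train_blank_ner(train_data, label, n_iter=10, batch_size=2):
    """Train an entity recognizer from scratch (sketch, assumes spacy v2.x)."""
    # Imported here so the data above can be inspected without spacy installed.
    import spacy
    from spacy.util import minibatch

    nlp = spacy.blank("en")        # empty English pipeline
    ner = nlp.create_pipe("ner")   # create the entity recognizer (v2 style)
    nlp.add_pipe(ner)              # add it to the pipeline
    ner.add_label(label)           # register the new label

    optimizer = nlp.begin_training()  # initialize random weights
    data = list(train_data)
    for iteration in range(n_iter):
        random.shuffle(data)
        losses = {}  # create the dict first so update can fill it in
        for batch in minibatch(data, size=batch_size):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print(iteration, losses)
    return nlp
```

With spacy v2 installed, calling `train_blank_ner(TRAIN_DATA, "GADGET")` should print one losses dict per iteration. With only a handful of examples the resulting model will not be useful; a real run needs far more data.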



Problems in training models:

A number of problems can occur when you update a model, and the data you feed into it needs some quality control as well. One common problem is:

(1) catastrophic forgetting: 

Let's say you are trying to update the existing ner model. You have around 1000 examples of electronic gadgets, and you update the model with these 1000-odd examples using the label "gadget". It may very well happen that the model then forgets how to tag GPE, ORG or some other label. This is because the weights get updated in such a way that the model "forgets"; mathematically, the importance of the other labels is reduced so much that the model no longer tags those classes. See the question and answer below, which explains this with an example:

 


This is called catastrophic forgetting. To avoid it, you need to include examples of one or more of the already trained labels in each batch of the training data. If you are preparing one big training set and simply iterating over it in minibatches, you should include a similar proportion of data for the other labels as well. A more practical feel for this can only be gained by actually doing it. We will cover this in a future discussion about the practical implementation of spacy.
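
One possible way to arrange such batches is sketched below in plain Python. The `mixed_batches` and `span` helpers, the `old_share` parameter and the sample sentences are all hypothetical; the idea is simply that every batch carries some examples of labels the model already knows (here ORG and GPE) next to the new "GADGET" examples.

```python
import random


def span(text, phrase, label):
    """Hypothetical helper: character offsets of the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def mixed_batches(new_examples, old_examples, batch_size=4, old_share=0.5, seed=0):
    """Yield batches mixing new-label examples with already known labels."""
    rng = random.Random(seed)
    old_pool = list(old_examples)
    # Reserve part of each batch for old labels, but keep room for new ones.
    n_old = min(batch_size - 1, max(1, round(batch_size * old_share)))
    n_new = batch_size - n_old
    new_pool = list(new_examples)
    rng.shuffle(new_pool)
    for start in range(0, len(new_pool), n_new):
        batch = new_pool[start:start + n_new]
        # Old examples are re-sampled for every batch, so they may repeat.
        batch += rng.sample(old_pool, min(n_old, len(old_pool)))
        rng.shuffle(batch)
        yield batch


NEW = [(t, {"entities": [span(t, "iPhone X", "GADGET")]}) for t in (
    "How to preorder the iPhone X",
    "iPhone X is coming",
    "I want the new iPhone X",
    "Is the iPhone X worth it?",
)]
OLD = [
    ("Apple is based in California",
     {"entities": [span("Apple is based in California", "Apple", "ORG"),
                   span("Apple is based in California", "California", "GPE")]}),
    ("Google opened an office in Berlin",
     {"entities": [span("Google opened an office in Berlin", "Google", "ORG"),
                   span("Google opened an office in Berlin", "Berlin", "GPE")]}),
]

for batch in mixed_batches(NEW, OLD):
    print(sorted({label for _, ann in batch for _, _, label in ann["entities"]}))
```

Where the old-label examples come from is a separate question; one common option is to run the existing model over some text and treat its predictions as annotations, but that is an assumption beyond what is covered here.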

(2) Models can't learn everything:

Another issue when modeling specific labels is that these are statistical models, and there is a limit to how specific a label can be before a model can no longer learn it. We cannot create a model with the spacy pipeline that learns to label everything at a human level. That is why, when labelling data, we need to keep this in mind and not make the labels too specific. Let's look at several examples from the course of good labels versus bad labels, along with guidance on how to choose labels.


So these are several rules you should remember while labelling your data for training in spacy.

Conclusion:

With this, our basic 4 part spacy exploration ends. I have relied heavily on the spacy training course on the official website in drafting these 4 parts, and have used screenshots and completed exercises from it. The final result is a concise, smaller series covering almost all of the same content. I have omitted similarity calculations, since spacy uses word2vec in the background, which has pretty much been dropped in industry now, as well as setting extensions with spacy, which was complicated and did not fit well with the flow. While these 4 parts are only a basic beginning, they should give you enough background to start a spacy based pipeline in your work/research for analyzing text data.

Note that while spacy is awesome in this way, some of spacy's models may be outperformed by BERT and other transformer-based models (or their derivatives). So for creating cutting edge models and pipelines, you will need to use BERT, another transformer, or whatever comes after them (GPT-3 and many more to come).

The biggest use of spacy, though, is not its models but its structuring of nlp processes, its modular usage and its ability to create fast yet smooth processing pipelines. Therefore, even if you are using cutting edge models, integrating spacy's cython driven processes into your pipeline deserves serious thought.

I will write about some of these advanced usages, such as how people are using transformers/gpt-3 with spacy pipelines and how you can integrate spacy with your tensorflow code, to finally reach our goal: blazing fast, human like nlp.

Thanks for reading! Stay tuned for further articles.