Skip to main content

what is bigbird models and why is it such a great successor to transformer?

Introduction:

Transformer models made a huge news and then a huge impact in our nlp world. Researchers and industrialists across the world, took the new idea of language modeling and created new standard models in text processing, classification, natural language generation and many other directions. Transformer architecture in language modeling is therefore considered to be a inflection point in the nlp research history.

Issues with initial transformer models:

But with great power, comes great responsibilities. Transformers too, came with one great responsibility; which is computation. In the core of transformers we use attention mechanism. Attention, in plain English is the way to get a signal of relation between different words/tokens of text. Transformers use what we call, a full quadratic transformer; i.e. in case of transformers, we calculate the relevance of each token with every other token. 

This simple thing, in turn, increases the computation cost in a 0(n2) order where, n is the number of input tokens. Now, this creates a computation challenge in training transformers; which roughs out to the fact that at most 512 tokens are possible to enter as an input length.

The solution:

To solve this problem, a series of paper tried out a lot of different architecture for attention; but none of them solved the issue totally. Before bigbird, the best solution was 0(n(logn)k). Then bigbird model came and solved this issue; by actually achieving 0(n) order of computation complexity. This marks the next milestone in transformer architecture. So, in short, BIGBIRD by google research is the first linear complexity transformer architecture.

what does it mean to have a linear complexity?

For starters, this means that now we can build transformer models for very long texts, which are say 4096 or even ~16k tokens long. This also introduces transformer models into the genomics science; where bigbird achieved upto 99.9% accuracy in some of the tasks. So, the basic meaning is that, with the linear computation complexity, the bar on training long texts is almost non-existent now; and therefore future is long for transformers. 

This also means another thing for small or medium level organizations which are working on nlp domains. This directly means that the transformer model is now democratized more than ever, as with this decreased token length and computational complexity; most medium level organizations can train their own transformer models in custom domain specific datasets; to attain great performances in their applications.

How the linear computation complexities are achieved?

To understand how the linear complexity is being achieved, I went through the first part of the paper. I will describe the idea in a high level in this section.

To see how linear is achieved, first you have to look at the attention like a graph. Let's say we are giving a input of n tokens. Then a full attention according to vaswani et al. corresponds to a matrix full of 1s. 

In this case, we will define attention distribution via an adjacency matrix A such that if there is attention being calculated from token i to token j then the A(i,j) = 1.

In such a settings, a full attention as we have seen in original transformer model, i.e. let's refer to it from now on, as BERT, the A matrix is full of 1s.

In bigbird, what was done was that they questioned this: " do we need to use the full matrix?" and then they tried to figure out what minimum amount of attentions can solve this problem.

The problem therefore reduces down from here onwards to this: can we approximate the same behaviors of BERT, while decreasing down the attentions to 0(n) complexity i.e. using 0(n) number of attentions in the total matrix. 

The initial thing to do in this approach was to set up the minimum number of attributes of full attention which one would need to achieve, to say we have achieved same or better performance than the full attention model.

The attributes chosen are theoretically 

(1) the transformer model chosen should be proven to be a universal approximator.

(2) the transformer model chosen should be turing complete; i.e. it should be able to theoretically be able to simulate any other calculating machine. Turing complete has obviously a very rigorous definition which is out of the scope of this article; but the proofs are provided in the paper's appendix.

The practical attributes were obviously that in most of the encoder specific ( i.e. summarization, question answering), masked token prediction, next sentence prediction (NSP) and GLUE tasks; the model should perform equivalent to the best of BERT models or better. 

Considering these attributes in mind, researchers in google started solving this problem as a graph reduction problem. i.e. they tried to approximate the full matrix as a graph approximation problem.

First, they tried approximating the graphs using Erdos-renyi model which creates sequence of random graphs while approximating the eigenvalue of the matrix. Using this they took random attention models and tried using that. This model didn't work out that well.

Secondly, they again tried windows attention model. This model was initiated from the concept that the locality is a very important in NLP and genomics. Locality means that it is assumed that a token's relation with the nearby tokens is very important in nlp and genetics context. 

From this locality concept, a window attention matrix A is created; where for a window w, and row i, A(i-w/2+j)  = 1 for j running from 0 to w; and other A(k,l)=0.

This locality based attention is basically 0(n); but the researchers tried this out; and this worked better; still didn't reach the level of full transformer performance.

Finally, they came up with one additional part; which was inspired from the theoretical concepts in nlp. This theoretical concept is that some of the tokens like CLS ( CLS is a token which denotes the beginning of a sentence), SEP ( separator between two sentences) are much more important to take only windows or random attention with other tokens. These are called global tokens.

From the matrix concept, if there are g such tokens, for these tokens, global attention is provided; i.e. A(i,gj) = 1; A(gj,i) = 1; where j runs from 1 to n.

After merging the global tokens with the windows and random attentions; they reached an approximate nw+gn+e number of attentions; where w,g and e are fixed numbers not dependent on n. So basically the number of attentions to compute still remains at 0(n) range. 


 

But this new combination of G+R+W ( global + random + windows) reached the same amount of performance as par the full attention models in most of the tasks; as well as in the same paper it is proved that this sparse attention transformers satisfy both of the theoretical properties of the full transformer models.

So briefly without getting into the theoretical details, this is how BIGBIRD reached the linear complexity while staying in the 0(n) boundary.

Conclusion:

This is just the high level description of how the research proceeded to find the linear complexity with sparse attention mechanism. A much more detailed read of the bigbird model leads to understanding of how the universal approximation proof worked; what are the new SOTA achievements the bigbird model achieved and others.

Recommended posts:

(1) huggingface transformer library exploration: part 1, summarization

(2) Introduction to NLP using spaCy: part 1

Comments

Popular posts from this blog

Mastering SQL for Data Science: Top SQL Interview Questions by Experience Level

Introduction: SQL (Structured Query Language) is a cornerstone of data manipulation and querying in data science. SQL technical rounds are designed to assess a candidate’s ability to work with databases, retrieve, and manipulate data efficiently. This guide provides a comprehensive list of SQL interview questions segmented by experience level—beginner, intermediate, and experienced. For each level, you'll find key questions designed to evaluate the candidate’s proficiency in SQL and their ability to solve data-related problems. The difficulty increases as the experience level rises, and the final section will guide you on how to prepare effectively for these rounds. Beginner (0-2 Years of Experience) At this stage, candidates are expected to know the basics of SQL, common commands, and elementary data manipulation. What is SQL? Explain its importance in data science. Hint: Think about querying, relational databases, and data manipulation. What is the difference between WHERE ...

Spacy errors and their solutions

 Introduction: There are a bunch of errors in spacy, which never makes sense until you get to the depth of it. In this post, we will analyze the attribute error E046 and why it occurs. (1) AttributeError: [E046] Can't retrieve unregistered extension attribute 'tag_name'. Did you forget to call the set_extension method? Let's first understand what the error means on superficial level. There is a tag_name extension in your code. i.e. from a doc object, probably you are calling doc._.tag_name. But spacy suggests to you that probably you forgot to call the set_extension method. So what to do from here? The problem in hand is that your extension is not created where it should have been created. Now in general this means that your pipeline is incorrect at some level.  So how should you solve it? Look into the pipeline of your spacy language object. Chances are that the pipeline component which creates the extension is not included in the pipeline. To check the pipe eleme...

What is Bort?

 Introduction: Bort, is the new and more optimized version of BERT; which came out this october from amazon science. I came to know about it today while parsing amazon science's news on facebook about bort. So Bort is the newest addition to the long list of great LM models with extra-ordinary achievements.  Why is Bort important? Bort, is a model of 5.5% effective and 16% total size of the original BERT model; and is 20x faster than BERT, while being able to surpass the BERT model in 20 out of 23 tasks; to quote the abstract of the paper,  ' it obtains performance improvements of between 0 . 3% and 31%, absolute, with respect to BERT-large, on multiple public natural language understanding (NLU) benchmarks. ' So what made this achievement possible? The main idea behind creation of Bort is to go beyond the shallow depth of weight pruning, connection deletion or merely factoring the NN into different matrix factorizations and thus distilling it. While methods like know...