
What is the BigBird model, and why is it such a great successor to the transformer?

Introduction:

Transformer models made huge news and then a huge impact in the NLP world. Researchers and practitioners across the world took this new idea of language modeling and built new standard models for text processing, classification, natural language generation and many other directions. The transformer architecture is therefore considered an inflection point in NLP research history.

Issues with initial transformer models:

But with great power comes great responsibility. Transformers, too, came with one great cost: computation. At the core of transformers is the attention mechanism. Attention, in plain English, is a way to get a signal of how related the different words/tokens of a text are. Transformers use what is called full quadratic attention; i.e. we calculate the relevance of each token with every other token.

This, in turn, makes the computation cost grow in O(n^2) order, where n is the number of input tokens. This creates a computational challenge in training transformers; in practice it roughly means that the input length is capped at around 512 tokens.
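To make the quadratic cost concrete, here is a minimal sketch (not from the BigBird paper, just plain scaled dot-product attention in NumPy) showing the n x n score matrix that every token pair produces; doubling n quadruples that matrix.

```python
# Illustrative sketch: the n x n score matrix behind full self-attention,
# which is what makes the cost grow quadratically with sequence length n.
import numpy as np

def full_attention(Q, K, V):
    """Plain scaled dot-product attention over all token pairs."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                          # n x n: one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # n x d output

n, d = 512, 64                      # toy sizes; the scores alone take n * n floats
Q = K = V = np.random.randn(n, d)
out = full_attention(Q, K, V)
print(out.shape)                    # (512, 64)
```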

The solution:

To solve this problem, a series of papers tried out many different attention architectures, but none of them solved the issue completely. Before BigBird, the best solution was O(n(log n)^k). Then the BigBird model came and solved this issue by actually achieving O(n) computational complexity. This marks the next milestone in transformer architecture. So, in short, BigBird by Google Research is the first linear-complexity transformer architecture.

What does it mean to have linear complexity?

For starters, this means that we can now build transformer models for very long texts, say 4096 or even ~16k tokens long. This also brings transformer models into genomics, where BigBird achieved up to 99.9% accuracy in some tasks. So the basic meaning is that, with linear computational complexity, the barrier to training on long texts is almost non-existent now; and therefore the future is long for transformers.

This also means something for small and medium-sized organizations working in the NLP domain. Transformer models are now democratized more than ever: with the increased token length and decreased computational complexity, most medium-sized organizations can train their own transformer models on custom, domain-specific datasets and attain great performance in their applications.

How is the linear computational complexity achieved?

To understand how linear complexity is achieved, I went through the first part of the paper. I will describe the idea at a high level in this section.

To see how linearity is achieved, you first have to look at attention as a graph. Let's say we are giving an input of n tokens. Then full attention, as in Vaswani et al., corresponds to a matrix full of 1s.

In this case, we define the attention pattern via an adjacency matrix A such that if attention is calculated from token i to token j, then A(i,j) = 1.

In such a setting, for full attention as in the original transformer model (let's refer to it from now on as BERT), the matrix A is full of 1s.
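As a tiny illustration of this adjacency-matrix view (toy sizes, my own example), full BERT-style attention is just an n x n matrix of ones, so the number of attention pairs is n^2:

```python
# Adjacency-matrix view of full attention: A[i, j] = 1 means token i attends to token j.
import numpy as np

n = 8                                  # toy sequence length
A_full = np.ones((n, n), dtype=int)    # full attention: every pair is connected
print(A_full.sum())                    # n * n = 64 attention pairs, i.e. O(n^2)
```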

In BigBird, the authors questioned this: "Do we need to use the full matrix?" They then tried to figure out the minimum amount of attention that can solve the problem.

From here onwards, the problem therefore reduces to this: can we approximate the same behavior as BERT while decreasing the attention to O(n) complexity, i.e. using only O(n) entries in the total matrix?

The first step in this approach was to set out the minimum set of attributes of full attention which one would need to retain in order to claim the same or better performance than the full-attention model.

The theoretical attributes chosen are:

(1) The transformer model chosen should be proven to be a universal approximator.

(2) The transformer model chosen should be Turing complete; i.e. it should theoretically be able to simulate any other computing machine. Turing completeness obviously has a very rigorous definition which is out of the scope of this article, but the proofs are provided in the paper's appendix.

The practical attribute was, obviously, that on most of the encoder-focused tasks (e.g. summarization, question answering), masked token prediction, next sentence prediction (NSP) and GLUE tasks, the model should perform on par with the best of the BERT models or better.

With these attributes in mind, researchers at Google started treating this as a graph reduction problem; i.e. they tried to approximate the full matrix as a graph approximation problem.

First, they tried approximating the graph using the Erdős–Rényi model, which creates a sequence of random graphs while approximating the eigenvalues of the full matrix. Using this, they built random-attention models and tried those out. This model didn't work out that well on its own.
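A hedged sketch of the random-attention idea (my own toy code, with made-up parameters, not the paper's implementation): each token attends to a fixed number r of randomly chosen tokens, so the total number of entries is n * r, which is linear in n.

```python
# Toy random attention mask: each row gets r randomly chosen 1s (Erdos-Renyi style).
import numpy as np

def random_attention_mask(n, r, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, rng.choice(n, size=r, replace=False)] = 1   # r random attention targets for token i
    return A

A_rand = random_attention_mask(n=16, r=3)
print(A_rand.sum())   # 16 * 3 = 48 entries, i.e. O(n) instead of O(n^2)
```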

Second, they tried the window attention model. This model stems from the idea that locality is very important in NLP and genomics; i.e. it is assumed that a token's relation with its nearby tokens matters the most in both the NLP and genetics contexts.

From this locality concept, a window attention matrix A is created, where for a window of size w and row i, A(i, i - w/2 + j) = 1 for j running from 0 to w, and all other entries A(k, l) = 0.

This locality-based attention is basically O(n). The researchers tried this out and it worked better, but it still didn't reach the level of full-transformer performance.
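Here is a minimal sketch of such a window (local) attention mask, again with toy parameters of my own choosing: token i attends only to the w tokens around it, so the total count is roughly n * w.

```python
# Toy window attention mask: a banded matrix where token i attends to its w neighbours.
import numpy as np

def window_attention_mask(n, w):
    A = np.zeros((n, n), dtype=int)
    half = w // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        A[i, lo:hi] = 1          # local window around token i
    return A

A_win = window_attention_mask(n=16, w=3)
print(A_win.sum())               # about 16 * 3 entries, i.e. O(n)
```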

Finally, they came up with one additional piece, inspired by theoretical concepts in NLP: some tokens, like CLS (a token which denotes the beginning of a sequence) and SEP (the separator between two sentences), are far too important to receive only window or random attention with the other tokens. These are called global tokens.

In matrix terms, if there are g such global tokens g_1, ..., g_g, global attention is provided for them; i.e. A(i, g_j) = 1 and A(g_j, i) = 1, where i runs from 1 to n and j runs from 1 to g.
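As a small sketch of this global-attention pattern (toy example, assuming the global token sits at position 0 like a [CLS] token), the designated tokens attend to every token and every token attends back to them:

```python
# Toy global attention mask: global tokens get a full row and a full column of 1s.
import numpy as np

def global_attention_mask(n, global_idx):
    A = np.zeros((n, n), dtype=int)
    A[global_idx, :] = 1     # global tokens attend to all tokens
    A[:, global_idx] = 1     # all tokens attend to the global tokens
    return A

A_glob = global_attention_mask(n=16, global_idx=[0])
print(A_glob.sum())          # roughly 2 * g * n entries, i.e. O(n) for a fixed g
```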

After merging the global tokens with the window and random attentions, they reached approximately nw + 2gn + rn attention entries, where w (window size), g (number of global tokens) and r (random attentions per token) are fixed numbers not dependent on n. So the number of attention entries to compute still remains in the O(n) range.
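Putting the three pieces together gives a rough illustration of this BigBird-style sparse pattern (window + random + global). The parameters w, r, g below are made-up toy values, not the ones used in the paper; the point is only that the number of entries grows linearly with n.

```python
# Toy combined mask: window + random + global attention, counted for increasing n.
import numpy as np

def sparse_attention_mask(n, w=3, r=2, g=1, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    half = w // 2
    for i in range(n):
        A[i, max(0, i - half):min(n, i + half + 1)] = 1       # window attention
        A[i, rng.choice(n, size=r, replace=False)] = 1        # random attention
    A[:g, :] = 1                                              # global tokens attend to all
    A[:, :g] = 1                                              # all attend to global tokens
    return A

for n in (64, 128, 256):
    print(n, sparse_attention_mask(n).sum())   # entry count grows linearly in n, not quadratically
```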


 

But this new combination of G+R+W (global + random + window) reached performance on par with the full-attention models in most of the tasks; and in the same paper it is proved that this sparse-attention transformer satisfies both of the theoretical properties of the full transformer models.

So briefly, without getting into the theoretical details, this is how BigBird reached linear complexity while staying within the O(n) boundary.

Conclusion:

This is just a high-level description of how the research arrived at linear complexity with a sparse attention mechanism. A more detailed read of the BigBird paper explains how the universal approximation proof works, what new SOTA results the BigBird model achieved, and more.

Recommended posts:

(1) huggingface transformer library exploration: part 1, summarization

(2) Introduction to NLP using spaCy: part 1
