
What is the BigBird model, and why is it such a great successor to the transformer?

Introduction:

Transformer models made huge news and then a huge impact in the NLP world. Researchers and practitioners across the world took this new idea of language modeling and built new standard models for text processing, classification, natural language generation and many other directions. The transformer architecture is therefore considered an inflection point in NLP research history.

Issues with initial transformer models:

But with great power comes great responsibility. Transformers, too, came with one great cost: computation. At the core of transformers is the attention mechanism. Attention, in plain English, is a way to get a signal of how related the different words/tokens of a text are. Transformers use what is called full quadratic attention; i.e. we calculate the relevance of each token with every other token.

This, in turn, makes the computation cost grow in O(n^2) order, where n is the number of input tokens. This creates a computational challenge in training transformers; in practice it roughly means that the input length is capped at around 512 tokens.
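To make the quadratic cost concrete, here is a minimal sketch (not from the BigBird paper, just plain scaled dot-product attention in NumPy) showing the n x n score matrix that every token pair produces; doubling n quadruples that matrix.

```python
# Illustrative sketch: the n x n score matrix behind full self-attention,
# which is what makes the cost grow quadratically with sequence length n.
import numpy as np

def full_attention(Q, K, V):
    """Plain scaled dot-product attention over all token pairs."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                          # n x n: one score per token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # n x d output

n, d = 512, 64                      # toy sizes; the scores alone take n * n floats
Q = K = V = np.random.randn(n, d)
out = full_attention(Q, K, V)
print(out.shape)                    # (512, 64)
```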

The solution:

To solve this problem, a series of papers tried out many different attention architectures, but none of them solved the issue completely. Before BigBird, the best solution was O(n(log n)^k). Then the BigBird model came and solved this issue by actually achieving O(n) computational complexity. This marks the next milestone in transformer architecture. So, in short, BigBird by Google Research is the first linear-complexity transformer architecture.

What does it mean to have linear complexity?

For starters, this means that we can now build transformer models for very long texts, say 4096 or even ~16k tokens long. This also brings transformer models into genomics, where BigBird achieved up to 99.9% accuracy in some tasks. So the basic meaning is that, with linear computational complexity, the barrier to training on long texts is almost non-existent now; and therefore the future is long for transformers.

This also means something for small and medium-sized organizations working in the NLP domain. Transformer models are now democratized more than ever: with the increased token length and decreased computational complexity, most medium-sized organizations can train their own transformer models on custom, domain-specific datasets and attain great performance in their applications.

How is the linear computational complexity achieved?

To understand how linear complexity is achieved, I went through the first part of the paper. I will describe the idea at a high level in this section.

To see how linearity is achieved, you first have to look at attention as a graph. Let's say we are giving an input of n tokens. Then full attention, as in Vaswani et al., corresponds to a matrix full of 1s.

In this case, we define the attention pattern via an adjacency matrix A such that if attention is calculated from token i to token j, then A(i,j) = 1.

In such a setting, for full attention as in the original transformer model (let's refer to it from now on as BERT), the matrix A is full of 1s.
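As a tiny illustration of this adjacency-matrix view (toy sizes, my own example), full BERT-style attention is just an n x n matrix of ones, so the number of attention pairs is n^2:

```python
# Adjacency-matrix view of full attention: A[i, j] = 1 means token i attends to token j.
import numpy as np

n = 8                                  # toy sequence length
A_full = np.ones((n, n), dtype=int)    # full attention: every pair is connected
print(A_full.sum())                    # n * n = 64 attention pairs, i.e. O(n^2)
```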

In BigBird, the authors questioned this: "Do we need to use the full matrix?" They then tried to figure out the minimum amount of attention that can solve the problem.

From here onwards, the problem therefore reduces to this: can we approximate the same behavior as BERT while decreasing the attention to O(n) complexity, i.e. using only O(n) entries in the total matrix?

The first step in this approach was to set out the minimum set of attributes of full attention which one would need to retain in order to claim the same or better performance than the full-attention model.

The theoretical attributes chosen are:

(1) The transformer model chosen should be proven to be a universal approximator.

(2) The transformer model chosen should be Turing complete; i.e. it should theoretically be able to simulate any other computing machine. Turing completeness obviously has a very rigorous definition which is out of the scope of this article, but the proofs are provided in the paper's appendix.

The practical attribute was, obviously, that on most of the encoder-focused tasks (e.g. summarization, question answering), masked token prediction, next sentence prediction (NSP) and GLUE tasks, the model should perform on par with the best of the BERT models or better.

With these attributes in mind, researchers at Google started treating this as a graph reduction problem; i.e. they tried to approximate the full matrix as a graph approximation problem.

First, they tried approximating the graph using the Erdős–Rényi model, which creates a sequence of random graphs while approximating the eigenvalues of the full matrix. Using this, they built random-attention models and tried those out. This model didn't work out that well on its own.
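A hedged sketch of the random-attention idea (my own toy code, with made-up parameters, not the paper's implementation): each token attends to a fixed number r of randomly chosen tokens, so the total number of entries is n * r, which is linear in n.

```python
# Toy random attention mask: each row gets r randomly chosen 1s (Erdos-Renyi style).
import numpy as np

def random_attention_mask(n, r, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, rng.choice(n, size=r, replace=False)] = 1   # r random attention targets for token i
    return A

A_rand = random_attention_mask(n=16, r=3)
print(A_rand.sum())   # 16 * 3 = 48 entries, i.e. O(n) instead of O(n^2)
```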

Second, they tried the window attention model. This model stems from the idea that locality is very important in NLP and genomics; i.e. it is assumed that a token's relation with its nearby tokens matters the most in both the NLP and genetics contexts.

From this locality concept, a window attention matrix A is created, where for a window of size w and row i, A(i, i - w/2 + j) = 1 for j running from 0 to w, and all other entries A(k, l) = 0.

This locality-based attention is basically O(n). The researchers tried this out and it worked better, but it still didn't reach the level of full-transformer performance.
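Here is a minimal sketch of such a window (local) attention mask, again with toy parameters of my own choosing: token i attends only to the w tokens around it, so the total count is roughly n * w.

```python
# Toy window attention mask: a banded matrix where token i attends to its w neighbours.
import numpy as np

def window_attention_mask(n, w):
    A = np.zeros((n, n), dtype=int)
    half = w // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        A[i, lo:hi] = 1          # local window around token i
    return A

A_win = window_attention_mask(n=16, w=3)
print(A_win.sum())               # about 16 * 3 entries, i.e. O(n)
```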

Finally, they came up with one additional piece, inspired by theoretical concepts in NLP: some tokens, like CLS (a token which denotes the beginning of a sequence) and SEP (the separator between two sentences), are far too important to receive only window or random attention with the other tokens. These are called global tokens.

In matrix terms, if there are g such global tokens g_1, ..., g_g, global attention is provided for them; i.e. A(i, g_j) = 1 and A(g_j, i) = 1, where i runs from 1 to n and j runs from 1 to g.
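As a small sketch of this global-attention pattern (toy example, assuming the global token sits at position 0 like a [CLS] token), the designated tokens attend to every token and every token attends back to them:

```python
# Toy global attention mask: global tokens get a full row and a full column of 1s.
import numpy as np

def global_attention_mask(n, global_idx):
    A = np.zeros((n, n), dtype=int)
    A[global_idx, :] = 1     # global tokens attend to all tokens
    A[:, global_idx] = 1     # all tokens attend to the global tokens
    return A

A_glob = global_attention_mask(n=16, global_idx=[0])
print(A_glob.sum())          # roughly 2 * g * n entries, i.e. O(n) for a fixed g
```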

After merging the global tokens with the window and random attentions, they reached approximately nw + 2gn + rn attention entries, where w (window size), g (number of global tokens) and r (random attentions per token) are fixed numbers not dependent on n. So the number of attention entries to compute still remains in the O(n) range.
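Putting the three pieces together gives a rough illustration of this BigBird-style sparse pattern (window + random + global). The parameters w, r, g below are made-up toy values, not the ones used in the paper; the point is only that the number of entries grows linearly with n.

```python
# Toy combined mask: window + random + global attention, counted for increasing n.
import numpy as np

def sparse_attention_mask(n, w=3, r=2, g=1, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=int)
    half = w // 2
    for i in range(n):
        A[i, max(0, i - half):min(n, i + half + 1)] = 1       # window attention
        A[i, rng.choice(n, size=r, replace=False)] = 1        # random attention
    A[:g, :] = 1                                              # global tokens attend to all
    A[:, :g] = 1                                              # all attend to global tokens
    return A

for n in (64, 128, 256):
    print(n, sparse_attention_mask(n).sum())   # entry count grows linearly in n, not quadratically
```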


 

But this new combination of G+R+W (global + random + window) reached performance on par with the full-attention models in most of the tasks; and in the same paper it is proved that this sparse-attention transformer satisfies both of the theoretical properties of the full transformer models.

So briefly, without getting into the theoretical details, this is how BigBird reached linear complexity while staying within the O(n) boundary.

Conclusion:

This is just a high-level description of how the research arrived at linear complexity with a sparse attention mechanism. A more detailed read of the BigBird paper explains how the universal approximation proof works, what new SOTA results the BigBird model achieved, and more.

Recommended posts:

(1) huggingface transformer library exploration: part 1, summarization

(2) Introduction to NLP using spaCy: part 1
