
Hugging Face transformers library exploration, part 2: problems with summarization and possible solutions


Introduction:

We will talk about one big issue that hangs over the summarization problem: the maximum sequence length of 512/1024 tokens. This is a big drawback when abstractively summarizing long documents. We will first discuss why this issue happens, and then I will share some literature on the practices used to solve it.

Description:

I have been working with the transformers library from Hugging Face on long documents (3000-4000 words). When I use the official example code or custom code (without changing any part of the model architecture), whether with BART, T5, or some other model, there is a number of tokens beyond which the text either gets truncated or a warning is raised: "Token indices sequence length is longer than the specified maximum sequence length for this model (no_of_token_in_your_text > 512)".
So what, you may ask. The problem is fairly obvious: a long text will contain far more than 512 tokens. In that case the transformers code truncates the input to the first 512/1024 tokens and summarizes only that small part, which defeats the purpose of summarizing the whole document. To understand the issue from the main source, read this issue thread on GitHub.
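Here is a minimal sketch that reproduces the behavior, assuming the facebook/bart-large-cnn checkpoint; long_article is a placeholder for your own 3000-4000 word document:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

long_article = "..."  # placeholder: put a 3000-4000 word document here

# Without truncation, the tokenizer warns that the sequence is longer than
# the model's maximum (1024 for BART); the model itself would fail on it.
token_ids = tokenizer.encode(long_article)
print(len(token_ids))  # typically far more than 1024 for a long article

# With truncation=True everything after the first 1024 tokens is silently
# dropped, so the summary only reflects the beginning of the document.
truncated_ids = tokenizer.encode(long_article, truncation=True, max_length=1024)
print(len(truncated_ids))  # capped at 1024
```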

Why is there a token limit at all?

Even after knowing that there is a token limit, another question remains: why does such a limit exist at all? The answer lies in the complexity of BART and other transformer-architecture models. The transformer architecture uses self-attention, which links every token in a sequence to every other token. For a document of n tokens, this creates O(n²) connections. Theoretically that may not sound formidable, but in practice this quadratic complexity leads to a hard limit of 1024 tokens per sequence when training BART and similar models. There have been recent improvements: notably, the Big Bird model from Google reduces this complexity to O(n). But being a recent innovation, Big Bird will probably take some time to reach ready-made code; until then, the limit on token count will remain.
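A rough back-of-the-envelope calculation shows why the quadratic cost bites so quickly; the numbers below assume 4-byte floats, a single attention head, and a single layer, so real models multiply them by the number of heads, layers, and the batch size:

```python
# The attention score matrix for a sequence of n tokens has n * n entries.
for n in (512, 1024, 4096):
    connections = n * n
    approx_mb = connections * 4 / (1024 ** 2)  # 4 bytes per float32 entry
    print(f"n={n:5d} tokens -> {connections:>12,} attention entries "
          f"(~{approx_mb:.0f} MB per head per layer)")
```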

How to avoid or bypass the token limit?

The token limit, although it may sound formidable, can actually be bypassed. Of course the results will not be as good as they would be if there were no limit, but there are some ways to work around it. Let's discuss them:
 
(1) Splitting and merging the article:
 
We can simply break the article into multiple parts, each within the permissible token count; summarize each part; and finally combine the partial summaries into the final summary. This is, in a way, an easy procedure for summarizing longer articles, as the sketch below shows.
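A minimal split-and-merge sketch, again assuming the facebook/bart-large-cnn checkpoint and a long_article placeholder; chunking by tokens (rather than characters) keeps each piece safely under the 1024-token limit:

```python
from transformers import AutoTokenizer, pipeline

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def summarize_long(text, chunk_tokens=900):
    # Tokenize once, then slice the ids into chunks below the model limit.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    partial_summaries = []
    for chunk in chunks:
        chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
        out = summarizer(chunk_text, max_length=150, min_length=30, do_sample=False)
        partial_summaries.append(out[0]["summary_text"])
    # Merge step: simply concatenate the partial summaries (one could also
    # run the summarizer once more over the concatenation).
    return " ".join(partial_summaries)

# final_summary = summarize_long(long_article)
```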
 
(2) Complex procedures:

In 2018, Google researchers generated Wikipedia-like articles from multiple documents using both extractive summarization and neural abstractive models. You can also use such hybrid models to build custom pipelines and thereby avoid the token limit. You can read about the procedure in this paper.
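Below is a hedged sketch of a hybrid pipeline in the same spirit, not the paper's actual method: a simple TF-IDF based extractive step picks the most salient sentences, and only that reduced text is passed to the abstractive model. The regex sentence splitter and the scoring are illustrative placeholders.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def hybrid_summary(text, keep_sentences=25):
    # Naive sentence split; a proper sentence tokenizer would be better.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Score each sentence by the sum of its TF-IDF weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    # Keep the highest-scoring sentences in their original order.
    top_idx = sorted(np.argsort(scores)[-keep_sentences:])
    extracted = " ".join(sentences[i] for i in top_idx)
    # The extracted text should now fit within the model's token limit.
    out = summarizer(extracted, max_length=200, min_length=50, do_sample=False)
    return out[0]["summary_text"]
```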

(3) Retraining the language model:

Although mostly out of reach because of the high compute cost and infrastructural requirements, you can try training a custom model with a Big Bird-like architecture if you work at a company that can afford it. The advantage of this approach is that you can create an architecture that best suits your purpose; and by training on your domain-specific data, you can make the model more effective in that domain, even with a smaller amount of data.
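Before investing in training from scratch, it is worth trying an off-the-shelf long-input checkpoint. The sketch below assumes the publicly available google/bigbird-pegasus-large-arxiv checkpoint, which accepts sequences up to 4096 tokens; substitute whatever long-input model fits your domain.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/bigbird-pegasus-large-arxiv"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

long_article = "..."  # placeholder for your long document

# 4096 tokens instead of 1024: roughly four times more of the document
# is actually seen by the model before truncation kicks in.
inputs = tokenizer(long_article, return_tensors="pt",
                   truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_length=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```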

Conclusion:

To say the least, despite being much better in performance than their extractive counterparts, abstractive models from the cutting-edge transformers library still suffer from the token limit and are therefore limited in practical use. We can work around it with hybrid models and split-and-merge manipulations until state-of-the-art research finally overcomes the problem.
