
Hugging Face transformers library exploration, part 2: problems with summarization and possible solutions


Introduction:

We will talk about only one big issue that hangs over summarization with transformer models: the maximum input sequence length of 512 or 1024 tokens. This is a big drawback for abstractive summarization of long documents. We will first discuss why this limit exists, and then I will share a short literature study of the practices used to work around it.

Description:

I have been working with the Hugging Face transformers library and long documents (documents of 3000-4000 words). Whether I use the official example code or custom code (without changing any part of the model architecture), and whether I use BART, T5, or some other model, there is a token count beyond which the text either gets truncated or the tokenizer logs a warning saying "Token indices sequence length is longer than the specified maximum sequence length for this model (no_of_token_in_your_text > 512)".
Why does this matter? A long text contains far more than 512 tokens. In such cases the library truncates the input to the first 512 or 1024 tokens and summarizes only that small part, which defeats the purpose of summarizing the whole document. To understand the issue from the main source, read this issue thread on GitHub.
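
To get a feel for how much text is lost, here is a minimal sketch. It assumes whitespace-separated words as a rough stand-in for model tokens (real subword tokenizers usually produce more tokens than words, so the true loss is larger), and the helper names `truncate_tokens` and `retained_fraction` are made up for this illustration.

```python
def truncate_tokens(tokens, max_len=512):
    """Keep only the first max_len tokens, as a length-limited input would."""
    return tokens[:max_len]


def retained_fraction(text, max_len=512):
    """Fraction of the document that survives truncation.

    Assumption: whitespace-separated words approximate model tokens.
    """
    tokens = text.split()
    if not tokens:
        return 1.0
    return len(truncate_tokens(tokens, max_len)) / len(tokens)


# A synthetic 3500-word document, similar in size to the ones described above.
document = " ".join(f"word{i}" for i in range(3500))
print(f"{retained_fraction(document):.1%} of the document reaches the model")
# prints "14.6% of the document reaches the model"
```

With a 512-token limit, roughly six sevenths of such a document never reaches the model at all.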

Why is there a token limit at all?

Even after knowing that there is a token limit, another question remains: why is there such a limit at all? The answer lies in the complexity of BART and other models based on the transformer architecture. The transformer architecture uses self-attention, where attention is essentially a link from every token in a sequence to every other token. So, given a document of n tokens, you have to compute O(n^2) connections. Although this may not sound formidable in theory, in practice this quadratic complexity leads to a machine limit of around 1024 tokens per sequence when training BART and similar models. There have been recent improvements on this; notably, the BigBird model created by Google reduces this complexity to O(n). But since it is a recent innovation, BigBird will probably take some time to appear in ready-made code, so until then the token limit will remain.
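
The quadratic growth is easy to check with a little arithmetic. The sketch below counts the pairwise attention scores for a single attention head in a single layer and estimates the memory of that one score matrix, assuming 4-byte float32 scores; the function names are illustrative, and a real model multiplies these numbers by the number of heads, layers, and sequences in a batch.

```python
def attention_scores(n_tokens):
    """Number of pairwise attention scores for one head in one layer: n * n."""
    return n_tokens * n_tokens


def score_matrix_mib(n_tokens, bytes_per_score=4):
    """Memory of one n x n score matrix in MiB (assumes float32 scores)."""
    return attention_scores(n_tokens) * bytes_per_score / (1024 ** 2)


for n in (512, 1024, 4096):
    print(n, attention_scores(n), f"{score_matrix_mib(n):.0f} MiB")
# prints:
# 512 262144 1 MiB
# 1024 1048576 4 MiB
# 4096 16777216 64 MiB
```

Doubling the sequence length quadruples the cost, which is why a 4000-word document is so much more expensive than a 1024-token input.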

How to avoid or bypass the token limit?

The token limit may sound formidable, but it can actually be bypassed. Of course, the results will not be as good as they would be without a limit, but there are some ways to work around it. Let's discuss them:
 
(1) splitting and merging the article: 
 
We can simply break the article into multiple parts, each within the permissible token count; then summarize each part; and finally combine the partial summaries to create the final summary. This is, in a way, an easy procedure for summarizing longer articles.
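
The split-and-merge idea can be sketched as follows. This is only an illustration under stated assumptions: tokens are approximated by whitespace-separated words, sentences are split with a simple regular expression, and `summarize` is any callable that maps a chunk of text to a summary (for example, a wrapper around a BART or T5 model). The `first_sentence` function is a hypothetical stand-in so that the example runs without downloading a model.

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_into_chunks(text, max_tokens=512):
    """Group whole sentences into chunks of at most max_tokens words.

    Assumption: whitespace words approximate model tokens. A single sentence
    longer than max_tokens still becomes its own (oversized) chunk.
    """
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(text, summarize, max_tokens=512):
    """Summarize each chunk separately, then merge the partial summaries."""
    partial = [summarize(chunk) for chunk in split_into_chunks(text, max_tokens)]
    return " ".join(partial)


def first_sentence(chunk):
    """Hypothetical stand-in summarizer: returns the chunk's first sentence."""
    return SENTENCE_END.split(chunk)[0]


article = " ".join(f"Sentence number {i} ends here." for i in range(300))
print(len(split_into_chunks(article)))
print(summarize_long(article, first_sentence))
```

Splitting on sentence boundaries rather than at a fixed token offset keeps each chunk readable on its own. If the merged summary is still too long, the same function can be applied to it again.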
 
(2) complex procedures:

In 2018, researchers at Google generated Wikipedia-like articles from multiple documents using both extractive summarization and neural abstractive models. You can also use such hybrid models to build custom pipelines and thereby avoid the token limit. You can read about the procedure in this paper.
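
A hybrid pipeline can be sketched in a few lines. Note that this is not the method from the paper: a simple word-frequency sentence scorer stands in for the extractive stage, tokens are approximated by whitespace-separated words, and `abstractive` is a hypothetical callable that would wrap a model such as BART or T5. The point is only to show the shape of the pipeline: first select the most salient sentences that fit in the token budget, then hand that shortened text to the abstractive model.

```python
import re
from collections import Counter

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_salient(text, budget=512):
    """Pick the highest-scoring sentences that fit within `budget` words.

    Assumption: a sentence's salience is the average document frequency of
    its words. Selected sentences are returned in their original order.
    """
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    freq = Counter(w.lower() for w in re.findall(r"\w+", text))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= budget:  # greedily take every sentence that still fits
            chosen.append(i)
            used += n
    return " ".join(sentences[i] for i in sorted(chosen))


def hybrid_summarize(text, abstractive, budget=512):
    """Extractive stage shortens the input; abstractive stage rewrites it."""
    return abstractive(extract_salient(text, budget))


sample = "Cats sleep a lot. Cats chase mice. Dogs bark."
print(extract_salient(sample, budget=6))
```

Unlike plain truncation, the abstractive model here sees material drawn from the whole document, not just its opening.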

(3) retraining the language model:

Although this is mostly out of reach because of high compute cost and infrastructure requirements, you can try training a custom model with a BigBird-like architecture if you work at a company that can afford it. The advantages of this approach are that you can design an architecture that best suits your purpose, and that by using your domain-specific data you can train the model to be more effective in that domain, even with a smaller amount of data.

Conclusion:

To say the least, despite performing much better than their extractive counterparts, abstractive models from the cutting-edge transformers library still suffer from the token limit and are therefore limited in practical use. Until state-of-the-art research finally overcomes this problem, we can work around it using hybrid models and split-and-merge manipulations.
