
Hugging Face transformers library exploration: part 2: problems with summarization and possible solutions


Introduction:

We will talk about only one big issue that hovers over the problem of summarization: the maximum sequence length of 512/1024 tokens. This is a big drawback when summarizing long documents with abstractive models. We will first discuss why this issue happens, and then I will share some literature on practices for solving this problem.

Description:

I have been working with the Hugging Face transformers library and long documents (documents with 3000-4000 words). When I use the official example code, or custom code (without changing any part of the model architecture), whether I am using BART or T5 or some other model, there is a number of tokens after which the text either gets truncated or a warning is raised saying "Token indices sequence length is longer than the specified maximum sequence length for this model (no_of_token_in_your_text > 512)".
"So what?", you may ask. The consequence is pretty obvious: if your text is long, it will contain many more than 512 tokens. In such cases the transformers code truncates the input to the first 512/1024 tokens and summarizes only that small part, so the main purpose of summarizing the whole document is defeated. To understand the issue from the main source, read this issue thread on GitHub.
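If you want to see the limit for yourself, the snippet below is a minimal sketch: it tokenizes a long document and compares the token count against the model's maximum. The model name and the input file path are placeholders of my choosing, not anything prescribed by the library.

```python
# Minimal sketch of the limit (assumptions: "facebook/bart-large-cnn" as the
# model and a local text file holding a 3000-4000 word article).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
long_text = open("my_long_article.txt").read()   # placeholder path

ids = tokenizer.encode(long_text)
print(len(ids), "tokens vs. model limit of", tokenizer.model_max_length)

# Encoding more tokens than model_max_length is what triggers the warning
# quoted above; passing truncation=True instead silently keeps only the
# first 1024 tokens for this model.
truncated = tokenizer.encode(long_text, truncation=True,
                             max_length=tokenizer.model_max_length)
print(len(truncated), "tokens after truncation")
```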

Why is there a token limit at all?

Even after knowing that there is a token limit, another question remains: why does such a limit exist at all? The answer lies in the complexity of BART and other models based on the transformer architecture. The transformer architecture uses self-attention, where attention is nothing but linking the different tokens of a sequence to each other. Given a document of n tokens, you therefore have to create O(n²) connections. Theoretically this may not sound formidable, but in practice the quadratic complexity leads to a hard limit of 1024 tokens per sequence when training BART and similar models. There have been recent improvements on this front; notably the Big Bird model created by Google reduces this complexity to the O(n) level. But being a recent innovation, Big Bird will probably take some time to arrive in ready-made code, so until that point the limit on token count will remain.
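To get a feel for the numbers, here is a back-of-the-envelope sketch of how the attention-score matrix grows with sequence length. The head count and float size are assumptions (16 heads, as in BART-large, and 4-byte floats), and everything else (activations, gradients, multiple layers, batching) is ignored.

```python
# Rough memory footprint of one layer's full self-attention scores.
# Assumptions: 16 attention heads, float32 scores; all other costs ignored.
def attention_scores_mb(n_tokens, n_heads=16, bytes_per_score=4):
    return n_tokens * n_tokens * n_heads * bytes_per_score / 1e6

for n in (512, 1024, 4000):
    print(f"{n:5d} tokens -> {attention_scores_mb(n):7.1f} MB per layer")

# 512 -> ~16.8 MB, 1024 -> ~67.1 MB, 4000 -> ~1024 MB, and this is repeated
# for every layer and every example in the batch, which is why the quadratic
# term quickly becomes the bottleneck.
```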

How to avoid or bypass the token limit?

The token limit, although it may sound formidable, can actually be bypassed. Of course the results will not be as good as they would have been if there were no limit, but there are some ways to avoid it. Let's discuss the ways:
 
(1) splitting and merging the article: 
 
We can simply break the article into multiple parts, each within the permissible token count, summarize each part, and finally combine the partial summaries into the final summary. This is, in a way, an easy procedure for summarizing longer articles; a minimal sketch is shown below.
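Here is a minimal sketch of that split-and-merge idea, assuming the standard summarization pipeline with "facebook/bart-large-cnn"; the chunk size, generation lengths, and plain concatenation of partial summaries are simplifying choices, not requirements.

```python
# Split a long document into token chunks, summarize each, then merge.
# Assumptions: "facebook/bart-large-cnn" (1024-token limit) and a chunk size
# of 900 tokens to leave room for special tokens.
from transformers import pipeline, AutoTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def summarize_long(text, chunk_tokens=900):
    # Split on token ids so every chunk stays under the model's limit.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = [tokenizer.decode(ids[i:i + chunk_tokens])
              for i in range(0, len(ids), chunk_tokens)]
    # Summarize each chunk separately, then merge the partial summaries.
    partials = [summarizer(chunk, max_length=150, min_length=30)[0]["summary_text"]
                for chunk in chunks]
    return " ".join(partials)

long_article = open("my_long_article.txt").read()   # placeholder path
print(summarize_long(long_article))
```

If the merged summary is itself still too long, the same function can be applied to it once more to produce a shorter second-level summary.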
 
(2) complex procedures:

In 2018, researchers at Google created Wikipedia-like articles from multiple documents using both extractive summarization and neural abstractive models. You too can use such hybrid models to create custom pipelines and thereby avoid the token limit. You can read about the procedure in this paper; a simplified sketch of the extract-then-abstract idea is shown below.
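The sketch below is not the paper's method, only a toy version of the same extract-then-abstract idea: sentences are scored extractively with TF-IDF, the top-scoring ones are kept until they fit the model's input budget, and that extract is then summarized abstractively. scikit-learn, the BART checkpoint, and the regex sentence splitter are all assumptions of this sketch.

```python
# Toy hybrid pipeline: extractive sentence selection + abstractive summary.
# Assumptions: scikit-learn is installed, "facebook/bart-large-cnn" is the
# abstractive model, and a simple regex is good enough to split sentences.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline, AutoTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def hybrid_summary(text, budget_tokens=900):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Crude salience score: sum of TF-IDF weights of each sentence.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

    # Greedy extractive stage: keep top sentences until the token budget is full.
    selected, used = [], 0
    for i in ranked:
        n = len(tokenizer.encode(sentences[i], add_special_tokens=False))
        if used + n <= budget_tokens:
            selected.append(i)
            used += n
    extract = " ".join(sentences[i] for i in sorted(selected))

    # Abstractive stage on the extract, which now fits within the token limit.
    return summarizer(extract, max_length=200, min_length=50)[0]["summary_text"]
```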

(3) retraining the language model:

Although mostly prohibitive because of high compute cost and infrastructure requirements, you can try training a custom model using a Big Bird-like architecture if you work at a company that can afford it. The advantage of such a procedure is that you can create an architecture that best suits your purpose, and by using your domain-specific data you can train the model to be more efficient in your specific domain, even with a smaller amount of data. A rough sketch of what such a fine-tuning setup could look like is given below.
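As a very rough illustration, the sketch below fine-tunes a Big Bird based sequence-to-sequence checkpoint on a toy in-memory dataset. The checkpoint name ("google/bigbird-pegasus-large-arxiv"), the sequence lengths, and the training hyperparameters are all assumptions for illustration; a real run needs a proper domain corpus and serious hardware.

```python
# Rough fine-tuning sketch for a long-input summarizer on domain data.
# Assumptions: the "google/bigbird-pegasus-large-arxiv" checkpoint, the
# datasets library, and a toy two-example corpus standing in for real data.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

checkpoint = "google/bigbird-pegasus-large-arxiv"   # assumed long-input model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy in-memory dataset; replace with your domain-specific documents/summaries.
raw = Dataset.from_dict({
    "document": ["first long domain document ...", "second long domain document ..."],
    "summary":  ["first reference summary", "second reference summary"],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=4096, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="long-doc-summarizer",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # long inputs are memory hungry
    num_train_epochs=3,
    learning_rate=3e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```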

Conclusion:

To say the least, despite being much better in performance than their extractive counterparts, abstractive models from the cutting-edge transformers library still suffer from the token limit and are therefore limited in practical use. We can work around it with hybrid models and split-and-merge manipulations until state-of-the-art research finally overcomes these problems.
