Hugging Face Transformers library exploration, part 2: problems with summarization and possible solutions
Introduction:
We will talk about only one big issue that seems to hover over the problem of summarization: the maximum sequence length of 512/1024 tokens. This is a big drawback for abstractive summarization of long documents. We will first discuss why this issue happens, and then I will share some literature study on practices to solve this problem.
Description:
I have been working with the transformers library from Hugging Face and long documents (documents with 3000-4000 words). Whether I use the official example code or custom code (without changing any part of the model architecture), and whether the model is BART, T5, or something else, there is a number of tokens after which the text either gets truncated or an error/warning is raised saying "Token indices sequence length is longer than the specified maximum sequence length for this model (no_of_token_in_your_text > 512)".
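As a hedged illustration, here is a minimal sketch of how that message shows up, assuming the facebook/bart-large-cnn checkpoint (any BART/T5 summarization checkpoint behaves the same way):

```python
# A minimal sketch, assuming the "facebook/bart-large-cnn" checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

long_article = "word " * 4000  # stand-in for a 3000-4000 word document

# Encoding without truncation triggers the
# "Token indices sequence length is longer than ..." warning.
token_ids = tokenizer(long_article, truncation=False)["input_ids"]
print(len(token_ids), "tokens vs. limit of", tokenizer.model_max_length)
```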
So what, you may ask? The problem is pretty obvious: if your text is long, it will contain many more than 512 tokens. In such cases the code in transformers truncates it to the first 512/1024 tokens and summarizes only that small part, so the main purpose is defeated. To understand the issue from the main source, read this issue thread on GitHub.
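To make the truncation concrete, here is a sketch (again with an assumed BART checkpoint, and assuming a transformers version whose summarization pipeline accepts truncation=True): everything beyond the first 1024 tokens never reaches the model, so it cannot show up in the summary.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_article = ("First part of the article. " * 300
                + "Crucial conclusion buried at the end.")

# truncation=True silences the warning, but the model only sees the
# first 1024 tokens, so the "crucial conclusion" is never summarized.
summary = summarizer(long_article, truncation=True, max_length=60, min_length=20)
print(summary[0]["summary_text"])
```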
Why is there a token limit at all?
Even after knowing that there is a token limit, another question remains: why does such a limit exist at all? The answer lies in the complexity of BART and other transformer-architecture-based models. The transformer architecture uses self-attention-based modeling, where attention is nothing but linking every token in a sequence to every other token. Given a document of n tokens, you therefore have to create O(n²) connections. Theoretically this may not sound that formidable, but practically this quadratic complexity leads to a hard limit of 1024 tokens per sequence while training BART and similar models. There have been recent improvements upon this; notably the Big Bird model created by Google reduces this complexity to the O(n) level. But being a recent innovation, Big Bird will probably take some time to reach ready-made code, so until that point the limit on the token number will remain.
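As a rough illustration, the sketch below reads the positional limit baked into a BART checkpoint (the checkpoint name is an assumption) and shows how quickly the number of attention connections grows with sequence length:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/bart-large-cnn")
print("Positional limit baked into the model:", config.max_position_embeddings)  # 1024

# Self-attention builds an n x n matrix of token-to-token connections,
# so doubling the input length quadruples the work and memory.
for n in (512, 1024, 4096):
    print(f"{n} tokens -> {n * n:,} attention connections per head per layer")
```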
How to avoid or bypass the token limit?
The token limit, although it may sound formidable, can actually be bypassed. Of course the results will not be as good as they would have been if there were no limit, but there are some ways to avoid it. Let's discuss them:
(1) splitting and merging the article:
We can simply break the article into multiple parts, each within the permissible token count, summarize each part, and finally combine the partial summaries to create the final summary. This is, in a way, an easy procedure for summarizing longer articles; a minimal sketch is shown below.
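This sketch assumes the facebook/bart-large-cnn checkpoint and splits on raw token windows for brevity; real code would rather split on sentence or paragraph boundaries:

```python
from transformers import AutoTokenizer, pipeline

model_name = "facebook/bart-large-cnn"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def summarize_long(text, chunk_tokens=900):
    # Split the article into token windows that fit under the 1024-token limit.
    ids = tokenizer(text, truncation=False)["input_ids"]
    chunks = [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]
    partial_summaries = []
    for chunk in chunks:
        chunk_text = tokenizer.decode(chunk, skip_special_tokens=True)
        out = summarizer(chunk_text, max_length=120, min_length=30, truncation=True)
        partial_summaries.append(out[0]["summary_text"])
    # Merge step: concatenate, or optionally summarize the summaries once more.
    return " ".join(partial_summaries)

print(summarize_long("Some very long article text. " * 500))
```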
(2) complex procedures:
In 2018, researchers at Google created Wikipedia-like articles from multiple documents using both extractive summarization and neural abstractive models. You can also use such hybrid models to build custom pipelines and finally avoid the token limit: first select the most relevant sentences extractively, then run the abstractive model on that reduced text. You can read about the procedure in this paper; a simplified sketch follows.
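The sketch below is not the paper's pipeline, only a crude stand-in: sentences are scored by word frequency, the top ones are kept until a token budget is reached, and an abstractive model (checkpoint name assumed) summarizes the reduced text:

```python
from collections import Counter

from transformers import AutoTokenizer, pipeline

model_name = "facebook/bart-large-cnn"  # assumed abstractive checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline("summarization", model=model_name, tokenizer=tokenizer)

def extract_then_abstract(text, token_budget=900):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    word_freq = Counter(text.lower().split())
    # Crude extractive step: rank sentences by average word frequency.
    scored = sorted(
        sentences,
        key=lambda s: -sum(word_freq[w] for w in s.lower().split()) / max(len(s.split()), 1),
    )
    selected, used = [], 0
    for sent in scored:
        n_tokens = len(tokenizer(sent)["input_ids"])
        if used + n_tokens > token_budget:
            continue
        selected.append(sent)
        used += n_tokens
    reduced = ". ".join(selected)
    # Abstractive step on the extractively reduced text.
    return summarizer(reduced, max_length=120, min_length=30, truncation=True)[0]["summary_text"]

print(extract_then_abstract("Some long article with many sentences. " * 400))
```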
(3) retraining the language model:
Although mostly prohibitive because of high machine cost and infrastructure problems, you can try training your own custom model using a Big Bird-like architecture if you work at a company that can afford it. The advantage of such a procedure is that you can create an architecture that suits your purpose best, and by using your domain-specific data you can train the model to be more efficient on that specific domain, even with a smaller amount of data. A rough fine-tuning sketch is shown below.
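If a long-input checkpoint is available in your version of transformers (the google/bigbird-pegasus-large-arxiv name, the 4096-token limit, and the toy document/summary pair below are all assumptions on my part), one fine-tuning step looks roughly like this; a real setup would use a proper dataset and something like Seq2SeqTrainer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed long-input checkpoint; adjust to whatever your transformers version offers.
checkpoint = "google/bigbird-pegasus-large-arxiv"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Toy domain-specific training pair; replace with your own corpus.
document = "long domain specific report ... " * 200
reference_summary = "short summary of the report."

inputs = tokenizer(document, truncation=True, max_length=4096, return_tensors="pt")
labels = tokenizer(reference_summary, truncation=True, max_length=128, return_tensors="pt")["input_ids"]

# One gradient step of standard seq2seq fine-tuning.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```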
Conclusion:
To say the least, despite being much better in performance than their extractive counterparts, abstractive models from the cutting-edge transformers library still suffer from the token limit and are therefore limited in practical usage. We can work around it using hybrid models and split-and-merge style manipulations until state-of-the-art research finally overcomes such problems.