
Summarization: an unsolved NLP task

Summarization is one of the core tasks in NLP. In this post, we will go over the basic theory of summarization.

Summarization is the process of condensing a text into a short, concise version. A summary is expected to carry all the key points of the original while maintaining flow and coherence, and every sentence in it should be meaningful. And that's where things start to get a little bit tricky.

In a previous post, I described my experiments with the transformers library and concluded that, although quite advanced, models like bart, t5 and distilbart are still far from human-level summarization. Now, in this part, I want to get into the theory of summarization a little bit.

Summarization methods are of two types. These are:
(1) extractive summarization: creating a summary from sentences extracted from the original text, selected according to sentence weights.
(2) abstractive summarization: a more human-like approach that reads the whole text and then writes an abstract summary, typically using neural network architectures.

Now, we will dive into the different approaches within each type of summarization.

Extractive summarization:


Extractive summarization involves three main steps:
(1) creating an intermediate representation of the text
(2) scoring the sentences based on that representation
(3) selecting sentences based on the scores

(1)
There are two types of representation approaches: topic representation and indicator representation.
Topic representation transforms the text into an intermediate representation that captures the topics discussed in it. These techniques are divided into frequency-driven approaches, topic word approaches, latent semantic analysis and Bayesian topic models.
Indicator representation describes every sentence as a list of formal features, or indicators of importance, such as sentence length, position in the document, presence of certain phrases, etc. A small feature-extraction sketch follows.
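
Here is a minimal sketch of what an indicator representation can look like in Python. The particular features and cue-phrase list are illustrative assumptions on my part, not a fixed standard:

```python
# Illustrative indicator representation: each sentence becomes a small
# feature vector (length, position, presence of cue phrases).
CUE_PHRASES = ("in conclusion", "in summary", "importantly")

def indicator_features(sentences):
    features = []
    n = len(sentences)
    for i, sent in enumerate(sentences):
        words = sent.lower().split()
        features.append({
            "length": len(words),                       # sentence length in words
            "position": 1.0 - i / max(n - 1, 1),        # earlier sentences score higher
            "has_cue_phrase": any(p in sent.lower() for p in CUE_PHRASES),
        })
    return features

if __name__ == "__main__":
    sents = [
        "Summarization condenses a text into a shorter version.",
        "There are extractive and abstractive approaches.",
        "In conclusion, extractive methods select existing sentences.",
    ]
    for sent, feats in zip(sents, indicator_features(sents)):
        print(feats, "->", sent)
```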

(2)
Scoring the sentences is the next important step. Using the intermediate representation, each sentence is assigned a weight. The weight reflects how well the sentence represents the intermediate topics, and therefore how important it is to the summary of the text.

(3)
The final step is to select a number of sentences based on the weights assigned above. There are two ways to create the final summary here. The straightforward way is to select sentences in decreasing order of weight. More complex procedures, however, turn this into an optimization problem that maximizes importance and coherence while reducing redundancy. A small sketch of both ideas follows.
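
Below is a minimal sketch of the selection step, assuming sentence weights have already been produced by one of the representation methods discussed later. Plain selection by decreasing weight is a one-liner; the greedy loop adds a simple word-overlap redundancy penalty (an MMR-style heuristic used here only as an example):

```python
# Selection step: pick k sentences given precomputed weights.
def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def select_by_weight(sentences, weights, k):
    ranked = sorted(zip(weights, sentences), reverse=True)
    return [s for _, s in ranked[:k]]

def select_greedy(sentences, weights, k, redundancy_penalty=0.5):
    chosen, remaining = [], list(range(len(sentences)))
    while remaining and len(chosen) < k:
        def score(i):
            overlap = max((word_overlap(sentences[i], sentences[j]) for j in chosen),
                          default=0.0)
            return weights[i] - redundancy_penalty * overlap
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [sentences[i] for i in sorted(chosen)]  # keep original order

if __name__ == "__main__":
    sents = ["Cats are small mammals.",
             "Cats are small domestic mammals.",
             "Dogs are loyal companions."]
    w = [0.9, 0.85, 0.6]
    print(select_by_weight(sents, w, 2))
    print(select_greedy(sents, w, 2))
```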

Now, we will look at the intermediate representation procedures in a bit more detail, along with the sentence-scoring methods based on each representation.

Frequency-driven approaches:

In frequency-driven approaches, a tf-idf based representation of the sentences is computed. The sentences are then clustered, and the summary is built from the centroid sentences together with the sentences that lie within a certain radius of those centroids.
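
A minimal sketch of this centroid-style selection, assuming scikit-learn is available: sentences are embedded with tf-idf, clustered with k-means, and from each cluster the sentence closest to its centroid is kept (here the number of clusters simply equals the desired summary length):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def centroid_summary(sentences, n_sentences=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)          # sentence x term matrix
    km = KMeans(n_clusters=n_sentences, n_init=10, random_state=0).fit(tfidf)
    summary_ids = []
    for c in range(n_sentences):
        members = np.where(km.labels_ == c)[0]
        # distance of each member sentence to its cluster centroid
        dists = np.linalg.norm(tfidf[members].toarray() - km.cluster_centers_[c], axis=1)
        summary_ids.append(members[np.argmin(dists)])
    return [sentences[i] for i in sorted(summary_ids)]

if __name__ == "__main__":
    sents = [
        "The cat sat on the mat.",
        "A cat was resting on a mat.",
        "Stock markets rallied on Friday.",
        "Investors cheered as markets rose.",
    ]
    print(centroid_summary(sents, n_sentences=2))
```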

Latent semantic analysis:

In this case, a term-sentence occurrence matrix X is created and decomposed with SVD:
X = U D V^T
where U is a term-topic matrix, D is a diagonal matrix of topic weights, and V^T is a topic-sentence weight matrix. Multiplying D and V^T, i.e. weighting the topic-wise sentence loadings by the weight of each topic, gives a score for each sentence. Using these scores, one can choose the summary sentences.
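
A minimal sketch of LSA-based scoring, assuming numpy and scikit-learn: the term-sentence count matrix is decomposed as X = U D V^T, and each sentence is scored from the topic-weighted loadings D V^T, collapsed here with a simple column norm (one of several scoring variants in the literature):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_scores(sentences, n_topics=2):
    # rows are terms, columns are sentences
    X = CountVectorizer().fit_transform(sentences).T.toarray().astype(float)
    U, D, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(D) @ Vt
    weighted = np.diag(D[:n_topics]) @ Vt[:n_topics]   # topic-weighted sentence loadings
    return np.linalg.norm(weighted, axis=0)            # one score per sentence

if __name__ == "__main__":
    sents = [
        "The cat sat on the mat.",
        "Cats like warm mats.",
        "Stock markets rallied on Friday.",
    ]
    for s, sc in sorted(zip(sents, lsa_scores(sents)), key=lambda t: -t[1]):
        print(f"{sc:.3f}  {s}")
```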

Bayesian topic modelling:

In this case, a Bayesian model is used to find the words related to each topic discussed in the text. The prior probability distribution comes from analyzing topics over a large corpus of documents. Then, going through the current document, the probabilities are updated so that a better, more accurate topic-based selection of words can be made. Among the extractive summarization procedures, this is the only one that uses a fully probabilistic treatment of topics.
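
Published Bayesian summarizers are more involved than this, but as an illustration, here is a sketch that uses LDA (a Bayesian topic model) from scikit-learn to estimate topic-word probabilities and then scores each sentence by how much probability mass its words carry under the dominant topic. The scoring rule is an assumption made for illustration, not a specific published method:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_sentence_scores(sentences, n_topics=2):
    counts = CountVectorizer().fit_transform(sentences)        # sentence x term counts
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
    # normalize pseudo-counts into topic-word probability distributions
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    dominant = np.argmax(lda.transform(counts).sum(axis=0))    # most prominent topic overall
    # illustrative score: topic-word probability mass of each sentence under that topic
    return counts.toarray() @ topic_word[dominant]

if __name__ == "__main__":
    sents = [
        "The cat sat on the mat.",
        "Cats and dogs are common pets.",
        "Stock markets rallied on Friday.",
    ]
    for s, sc in zip(sents, lda_sentence_scores(sents)):
        print(f"{sc:.3f}  {s}")
```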

Graph methods:

In this case, a sentence-to-sentence similarity matrix is created. This similarity matrix is then used to compute sentence importance with the PageRank algorithm, and the highest-ranking sentences form the summary. The gensim library's summarization module used this procedure (TextRank).
The heuristic behind this graph method is that sentences which are connected to many other sentences are more important and should therefore appear in the summary. The PageRank algorithm captures exactly that.
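
A minimal TextRank-style sketch of this idea, assuming scikit-learn and networkx are installed: sentences are embedded with tf-idf, a cosine-similarity graph is built over them, and PageRank scores decide which sentences enter the summary:

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pagerank_summary(sentences, n_sentences=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)                 # sentence-sentence similarity matrix
    graph = nx.from_numpy_array(sim)               # weighted graph over sentences
    scores = nx.pagerank(graph, weight="weight")   # importance of each sentence
    top = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]     # keep original sentence order

if __name__ == "__main__":
    sents = [
        "The cat sat on the mat.",
        "A cat was resting on a mat near the window.",
        "Stock markets rallied on Friday.",
        "Investors cheered as markets rose sharply.",
    ]
    print(pagerank_summary(sents, n_sentences=2))
```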

There are many other extractive methods, but the ones above are the most well known. Next, we will discuss the application of machine learning to summarization. In general, machine-learning-driven summarization is abstractive, without hand-crafted rules.
Abstractive summarization:
We will discuss this part later on.
