Summarization is one of the core tasks in NLP. In this post, we will discuss the basic theory behind summarization.
Summarization is the process of condensing a text into a short, concise version. A summary is expected to carry all the key points of the original while maintaining flow and coherence, and every sentence it contains should be meaningful. And that's where things start to get a little tricky.
In a previous post, I shared my experiments with the transformers library and concluded that, although quite advanced, models like BART, T5 and DistilBART are still far from human-level summarization. Now, in this part, I intend to get into the theory of summarization a little bit.
Summarization comes in two basic types:
(1) Extractive summarization: this creates a summary from sentences extracted from the original text, selected according to weights assigned to the sentences.
(2) Abstractive summarization: this is a more human-like approach, which reads the whole text and then writes an abstract summary of it, typically using advanced neural network architectures.
Now we will dive into the different approaches used within each type of summarization.
Extractive summarization:
In extractive summarization, there are three main steps:
(1) creating an intermediate representation of the text
(2) scoring the sentences based on that representation
(3) selecting sentences based on the scores
(1) Creating an intermediate representation
There are two kinds of representation approaches: topic representation and indicator representation.
Topic representation transforms the text into an intermediate representation that captures the topics discussed in it. The techniques fall into frequency-driven approaches, topic word approaches, latent semantic analysis and Bayesian topic models.
Indicator representation describes every sentence as a list of formal features, or indicators of importance, such as sentence length, position in the document, presence of certain phrases, etc. A rough sketch of such features is shown below.
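As a rough illustration, here is what an indicator representation might look like in Python. The particular features and the cue-phrase list are my own illustrative assumptions, not a standard feature set.

```python
# A minimal sketch of an indicator representation: each sentence becomes a
# small set of surface-level importance indicators.
CUE_PHRASES = ("in conclusion", "in summary", "importantly")  # hypothetical list

def indicator_features(sentences):
    features = []
    n = len(sentences)
    for i, sent in enumerate(sentences):
        words = sent.split()
        features.append({
            "length": len(words),                        # sentence length in words
            "position": 1.0 - i / max(n - 1, 1),         # earlier sentences score higher
            "has_cue_phrase": any(p in sent.lower() for p in CUE_PHRASES),
        })
    return features
```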
(2) Scoring the sentences
Scoring the sentences is the next important step. Using the intermediate representation, a weight is assigned to each sentence. Each weight reflects how well the sentence covers the intermediate topics, and therefore how important it is for the summary.
(3) Selecting the summary sentences
The final step is to select a number of sentences based on the weights assigned above. There are two broad ways to create the final summary. The straightforward one is to pick sentences in decreasing order of weight, as sketched below. More complex procedures turn selection into an optimization problem that maximizes importance and coherence while minimizing redundancy.
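Here is a minimal sketch of the straightforward, weight-ordered selection. The `weights` argument is assumed to come from whichever scoring method is used.

```python
def select_summary(sentences, weights, k=3):
    # indices of the k highest-weighted sentences
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:k]
    # restore document order so the summary reads naturally
    return [sentences[i] for i in sorted(top)]
```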
Now we will look at the intermediate representation procedures in a bit more detail, along with the sentence-scoring methods that go with each representation.
Frequency-driven approaches:
In frequency-driven approaches, a tf-idf based representation of the sentences is built. The sentences are then clustered, and the summary is created from the centroid sentences together with the sentences that fall within a certain radius of those centroids.
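Below is a small sketch of a centroid-style, tf-idf based extractor using scikit-learn. The choice of KMeans and the number of clusters are my own assumptions; the cluster count must not exceed the number of sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def centroid_summary(sentences, n_clusters=3):
    X = TfidfVectorizer().fit_transform(sentences)       # sentence x term tf-idf matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    summary_idx = []
    for c in range(n_clusters):
        # pick the sentence closest to each cluster centroid
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members].toarray() - km.cluster_centers_[c], axis=1)
        summary_idx.append(members[np.argmin(dists)])
    return [sentences[i] for i in sorted(summary_idx)]
```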
Latent semantic analysis:
In this case, a term-sentence occurrence matrix X is created and decomposed with SVD:
X = U D V^T
where U is a term-topic matrix, D is a diagonal matrix of topic weights, and V^T is a topic-sentence weight matrix. Multiplying D and V^T, i.e. weighting the topic-wise sentence scores by the weight of the topic itself, gives a score for every sentence. Using these scores, one can then choose the summary sentences.
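A small sketch of LSA-based scoring with numpy and scikit-learn follows. Using raw counts for the term-sentence matrix and keeping only the top two topics are simplifying assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_scores(sentences, n_topics=2):
    # CountVectorizer gives a sentence x term matrix; transpose to term x sentence
    X = CountVectorizer().fit_transform(sentences).toarray().T
    U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U D V^T
    d, Vt = d[:n_topics], Vt[:n_topics]                  # keep the strongest topics
    # topic-weighted sentence scores: sum over topics of (topic weight * loading)
    return (d[:, None] * np.abs(Vt)).sum(axis=0)
```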
Bayesian topic modelling:
In this case, a Bayesian model is used to find the words related to each topic discussed in the text. The prior probability distribution comes from analyzing topics over a large corpus of documents. Then, going through the current corpus, the probabilities are continuously updated so that a better, more accurate topic-based selection of words can be made. This is the only extractive procedure that rests on an explicitly probabilistic model of topics.
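As one concrete (and simplified) stand-in for a Bayesian topic model, the sketch below uses LDA from scikit-learn and scores each sentence by how strongly it covers the overall prominent topics. The scoring rule is my own assumption, not the exact procedure described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_scores(sentences, n_topics=2):
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)        # per-sentence topic distributions
    topic_weights = doc_topics.mean(axis=0)       # overall prominence of each topic
    # score a sentence by how strongly it covers the prominent topics
    return doc_topics @ topic_weights
```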
Graph methods:
In this case, a sentence-to-sentence similarity matrix is created. This similarity matrix is then used to rank sentence importance with the PageRank algorithm, and the top-ranked sentences form the summary. The gensim library's summarization module is based on this procedure.
The heuristic behind the graph method is that sentences which are connected to many other sentences are more important and therefore belong in the summary. The PageRank algorithm captures exactly that.
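Here is a TextRank-style sketch using scikit-learn and networkx. Gensim's own implementation differs in its similarity function and preprocessing, so treat this only as an approximation of the idea.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pagerank_summary(sentences, k=3):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                    # sentence-to-sentence similarity matrix
    graph = nx.from_numpy_array(sim)              # weighted sentence graph
    scores = nx.pagerank(graph, weight="weight")  # importance of each sentence
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```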
There are many other extractive methods, but the ones above are the best known. Next we will discuss the application of machine learning to summarization. In general, machine-learning-driven summarization is abstractive, learning from data rather than following hand-crafted rules.
Abstractive summarization:
We will discuss this part later on.