
Summarization: an unsolved NLP task

Summarization is one of the core tasks in NLP. In this post, we will discuss the basic theory of summarization.

Summarization is the process of condensing a text into a short, concise copy that keeps what matters. A summary is expected to carry all the key points while maintaining flow and coherence, and every sentence it contains should be meaningful. And that's where things start to get a little tricky.

In a previous post, I showed my experiments with the transformers library and concluded that, although quite advanced, models like BART, T5 and DistilBART are still far from human-level summarization. Now, in this part, I intend to get into the theory of summarization a little.

Summarization is inherently of two types:
(1) Extractive summarization: creating a summary from sentences extracted from the original text, selected according to weights assigned to the sentences.
(2) Abstractive summarization: a more human-like approach that reads the whole text and then generates an abstract summary, typically using advanced neural network architectures.

Now we will dive into the different approaches within each summarization type.

Extractive summarization:


Extractive summarization involves three main steps:
(1) creating an intermediate representation of the text
(2) scoring the sentences based on that representation
(3) selecting sentences based on the scores

(1)
There are two types of representation approaches:
topic representation
indicator representation.
Topic representation transforms the text into an intermediate representation that captures the topics discussed in the text. These techniques divide into frequency-driven approaches, topic-word approaches, latent semantic analysis and Bayesian topic models.
Indicator representation describes every sentence as a list of formal features, or indicators of importance, such as sentence length, position in the document, presence of certain phrases, etc.

(2)
Scoring the sentences is the next step. Using the intermediate representation, each sentence is assigned a weight. The weight reflects how well the sentence covers the intermediate topics, and therefore how important it is to the summary of the text.

(3)
The final step is to select a number of sentences based on the assigned weights. There are two ways to build the final summary. The straightforward way is to pick sentences in decreasing order of weight. More sophisticated procedures turn selection into an optimization problem: maximize importance and coherence while minimizing redundancy, as in the sketch below.
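To make the optimization flavour concrete, here is a minimal greedy sketch in the spirit of maximal marginal relevance (MMR), which trades a sentence's weight against its similarity to what is already selected. The function name, the precomputed weights and similarity matrix, and the trade-off parameter are all illustrative assumptions, not a reference implementation.

```python
import numpy as np

def greedy_select(weights, sim, k=3, lam=0.7):
    """Greedily pick k sentences, balancing importance against redundancy.

    weights: per-sentence importance scores (from any scoring step)
    sim:     pairwise sentence similarity matrix
    lam:     trade-off between importance (1.0) and novelty (0.0)
    """
    selected = []
    candidates = list(range(len(weights)))
    while candidates and len(selected) < k:
        def mmr(i):
            # redundancy = similarity to the closest already-selected sentence
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * weights[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # return indices in original document order
```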

Now we will look at the intermediate representation procedures in a bit more detail, along with the sentence-scoring methods each representation supports.

Frequency-driven approaches:

In frequency-driven approaches, a tf-idf based representation of the sentences is built. The sentences are then clustered, and the summary is created from the centroid sentences together with the sentences that lie within a certain radius of those centroids.
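As a rough illustration, here is one way such centroid-based selection could look with scikit-learn. The helper name, the sentence list and the cluster count are assumptions for the sketch, not a standard implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def centroid_summary(sentences, n_clusters=3):
    """For each cluster, pick the sentence closest to the cluster centroid."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # sentence x term matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(tfidf)
    picked = []
    for k in range(n_clusters):
        idx = np.where(km.labels_ == k)[0]
        if len(idx) == 0:
            continue
        # distance of each member sentence to its cluster centroid
        dists = np.linalg.norm(tfidf[idx].toarray() - km.cluster_centers_[k], axis=1)
        picked.append(idx[np.argmin(dists)])
    return [sentences[i] for i in sorted(picked)]  # keep original order
```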

Latent semantic Analysis:

In this case, a term-sentence occurrence matrix X is created and decomposed with SVD:
X = U D V^T
where U is a term-topic matrix, D is a diagonal matrix of topic weights, and V^T is a topic-sentence matrix. Multiplying D and V^T, i.e. scaling the topic-wise sentence weights by the weight of each topic, gives a score for each sentence. Using these scores, one can again choose the summary sentences.
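A minimal NumPy sketch of this scoring, assuming X is a term-sentence count matrix with terms as rows and sentences as columns. Scoring a sentence by the norm of its column in topic space is one common choice, not the only one.

```python
import numpy as np

def lsa_sentence_scores(X, n_topics=2):
    """Score sentences via SVD of the term-sentence matrix X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(d) @ Vt
    # D @ V^T: topic-wise sentence weights scaled by each topic's weight
    weighted = np.diag(d[:n_topics]) @ Vt[:n_topics, :]
    # length of each sentence's column in the weighted topic space
    return np.linalg.norm(weighted, axis=0)

# usage: scores = lsa_sentence_scores(X); top = np.argsort(scores)[::-1]
```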

Bayesian topic modelling:

Here a Bayesian model is used to find the words associated with each topic discussed in the text. The prior probability distribution comes from analyzing topics over a large corpus of documents. While going through the current document, the probabilities are updated continuously, so that a better and more accurate topic-based selection of words can be made. Among the extractive procedures, this is the only one with a probabilistic foundation for topic modelling.
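The corpus-level prior-updating scheme described above does not compress into a few lines, so the sketch below uses LDA from scikit-learn as a stand-in Bayesian topic model, scoring each sentence by how well its topic mixture matches the document's overall mix. That scoring rule is an illustrative choice, not a standard from the literature.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_sentence_scores(sentences, n_topics=3):
    """Score sentences against the document's overall topic mixture."""
    counts = CountVectorizer(stop_words="english").fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    sent_topics = lda.fit_transform(counts)   # sentence x topic weights
    doc_topics = sent_topics.sum(axis=0)      # aggregate topic mix
    doc_topics /= doc_topics.sum()
    # higher score = sentence topics align with the document's topic mix
    return sent_topics @ doc_topics
```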

Graph methods:

In this case, a sentence similarity matrix is created and interpreted as a graph, and the PageRank algorithm is run on it to score sentence importance. The highest-ranked sentences then form the summary. The gensim library used this procedure (TextRank) in its summarization module.
The heuristic behind the graph method is that sentences connected to many other sentences are more important and therefore belong in the summary. PageRank captures exactly that.
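A minimal TextRank-style sketch with scikit-learn and networkx. Since tf-idf vectors from scikit-learn are L2-normalized by default, their dot products are cosine similarities. The function name and the sentence count are assumptions.

```python
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer

def textrank_summary(sentences, n_sentences=2):
    """Rank sentences by PageRank over a similarity graph."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = (tfidf @ tfidf.T).toarray()   # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)          # ignore self-similarity
    graph = nx.from_numpy_array(sim)    # weighted sentence graph
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]  # keep original order
```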

There are many other extractive methods, but the above are the better-known ones. Next we will discuss the application of machine learning to summarization. In general, machine-learning-led summarization is abstractive, without hand-written rules.
Abstractive summarization:
We will discuss this part in a later post.
