
pytextrank: a module for TextRank-based phrase extraction and text summarization

Introduction:

We have described spaCy in part 1, part 2, part 3, and part 4. In this post, we will describe the pytextrank project, which is built on top of spaCy and solves phrase extraction and text summarization. pytextrank is written by Paco Nathan, an American computer scientist based in Texas. pytextrank is interesting to me mainly for two reasons:

(1) it implements the TextRank algorithm very cleanly in a spaCy extension format

(2) the package is easy to use: it properly abstracts all of its complexity away from the user, so it can be used with little to no understanding of the underlying algorithm.

Now that I have hopefully given you enough motivation to read about and use this package, we will explore the basic usage first and then dive into the inner workings, which is the more advanced part of this post.

How to use pytextrank:

pytextrank can be installed via pip3 install pytextrank, as it is included in the PyPI listing. Once it is installed that way, you can import pytextrank in your Python environment.

There are two main uses of pytextrank:

(1) creating phrases

(2) creating summaries of documents

To use either of these, you first have to load a spaCy language model using the following lines:

import spacy

nlp = spacy.load("en_core_web_sm")

At this point, you may get an error saying that the en_core_web_sm model is not installed. In that case, install the model using

python -m spacy download en_core_web_sm
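If you prefer to handle this from Python instead of the shell, one possible fallback is the following; this is my own sketch, not part of pytextrank, and it relies on spacy.cli.download, spaCy's bundled model download helper:

import spacy

# try to load the model, and download it on the fly if it is missing
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    spacy.cli.download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")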

That should install the model, and then we can proceed. Now, to add the TextRank part of the pipeline, we have to instantiate a TextRank object using the following line:

tr = pytextrank.TextRank(logger=None)

Now, tr is a TextRank object that implements a spaCy extension; i.e., it is something we can add to the model's pipeline as a custom component. That is what we do with the following lines:

nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

nlp.max_length = 10**7

doc = nlp(text)  # run the pipeline (including the textrank component) on your text string

After this point, phrase extraction and summarization are used slightly differently. To list the extracted phrases:

for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

To get a summary of the text, the following code will do:

for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
     print(sent)
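Putting the pieces together, a complete script would look roughly like the following; the calls are the same as above, and text is just a placeholder for whatever string you want to analyze:

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")

# attach the TextRank component to the end of the pipeline
tr = pytextrank.TextRank(logger=None)
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

text = "..."  # your input text goes here
doc = nlp(text)

# top-ranked phrases
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))

# extractive summary
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)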

So, these are the methods to extract phrases and a summary using pytextrank. Now we will go into the details of how this package is written and how its different parts work.

Code analysis:

In the GitHub repo of the pytextrank project, the main code lives in the pytextrank.py file inside the pytextrank package. We will go through specific lines of code and explain what each part does; by the end, it should be clear how it all works together.

First, look at this code snippet.

How we set the extensions:

def PipelineComponent (self, doc):
        """
        define a custom pipeline component for spaCy and extend the
        Doc class to add TextRank
        """
        self.doc = doc
        Doc.set_extension("phrases", force=True, default=[])
        Doc.set_extension("textrank", force=True, default=self)
        doc._.phrases = self.calc_textrank()

        return doc

If you don't know how spaCy extensions work, go through these slides written by Ines Montani, a spaCy core developer. Basically, spaCy extensions are custom attributes, properties, or methods that we can register on spaCy objects and populate from a custom pipeline component. In the code above, textrank and phrases are added as two extensions on the Doc class, which is what lets us call doc._.textrank and doc._.phrases in the usage section.
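As a quick illustration, here is a minimal sketch of my own (not pytextrank's code) showing how a custom Doc extension and pipeline component fit together, using the same spaCy 2.x-style add_pipe call as the rest of this post:

import spacy
from spacy.tokens import Doc

# register a custom attribute on the Doc class
Doc.set_extension("token_count", default=0, force=True)

def count_tokens(doc):
    # a pipeline component receives the Doc, modifies it, and returns it
    doc._.token_count = len(doc)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(count_tokens, name="token_count", last=True)

doc = nlp("spaCy extensions attach custom attributes to the Doc object.")
print(doc._.token_count)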

So now let's see how the textrank and phrases extensions are built.

Reading a few lines above, it turns out that the above method is part of the TextRank class; and by calling Doc.set_extension("textrank", force=True, default=self), the code registers the TextRank instance itself as the textrank extension.

We set the phrases extension to an empty list by default, but then we fill it in with

doc._.phrases = self.calc_textrank()

How the phrases get created:

Now, obviously, the calc_textrank method is what creates the list of phrases. Let's inspect that code.

def calc_textrank (self):
        """
        iterate through each sentence in the doc, constructing a lemma graph
        then returning the top-ranked phrases
        """
        self.reset()
        t0 = time.time()

        for sent in self.doc.sents:
            self.link_sentence(sent)

        if self.logger:
            self.logger.debug(self.seen_lemma)

        # to run the algorithm, we use the NetworkX implementation of
        # PageRank – i.e., approximating eigenvalue centrality – to
        # calculate ranks for each of the nodes in the lemma graph

        self.ranks = nx.pagerank(self.lemma_graph)

        # collect the top-ranked phrases based on both the noun chunks
        # and the named entities

        for chunk in self.doc.noun_chunks:
            self.collect_phrases(chunk)

        for ent in self.doc.ents:
            self.collect_phrases(ent)

        # TODO:
        # there are edge cases where the built-in noun chunking in
        # spaCy fails to extract much, for example:

        # > "everything you need to know about student loan interest rates variable and fixed rates capitalization amortization student loan refinancing and more."

        # we should test `len(self.phrases.keys())` vs. a brute-force
        # noun chunking approach, then add the brute-force override to
        # `self.phrases`

        #for k, p in self.phrases.items():
        #    print(" >>", k, p)

        #for key, lemma in self.seen_lemma.items():
        #    node_id = list(self.seen_lemma.keys()).index(key)
        #    print(node_id, key, lemma, self.ranks[node_id])


        # since noun chunks can be expressed in different ways (e.g., may
        # have articles or prepositions), we need to find a minimum span
        # for each phrase based on combinations of lemmas

        min_phrases = {}

        for phrase_key, phrase_list in self.phrases.items():
            phrase_list.sort(key=lambda p: p.rank, reverse=True)
            best_phrase = phrase_list[0]
            min_phrases[best_phrase.text] = (best_phrase.rank, len(phrase_list), phrase_key)

        # yield results

        results = sorted(min_phrases.items(), key=lambda x: x[1][0], reverse=True)

        phrase_list = [
            Phrase(p, r, c, self.phrases[k]) for p, (r, c, k) in results
        ]

        t1 = time.time()
        self.elapsed_time = (t1 - t0) * 1000.0

        return phrase_list
 

As you can see from the docstring, this method iterates through each sentence, constructs a lemma graph, and returns the top-ranked phrases. The function keeps calling other functions, which makes it a bit hard to follow, so we will take some parts on faith and walk through the code.

First, we call reset(), which resets some of the variables inside the class. After that and a few other formalities, we call the link_sentence() function on each sentence in the document.

The link_sentence() function, in turn, takes each sentence, tokenizes and lemmatizes it, and builds a graph out of the lemmas. It also updates the weights of the different edges as more sentences get added.

After calling link_sentence() over all the sentences, the graph of lemmatized tokens is built. Next, we call the PageRank algorithm from networkx (the well-known graph and network analysis package in Python) to calculate a rank for each node in the lemma graph, and we store the result in self.ranks.
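To make the idea concrete, here is a simplified sketch of a lemma co-occurrence graph ranked with networkx's PageRank. This is my own illustration of the approach, not pytextrank's exact link_sentence() logic; the window size and weighting scheme here are assumptions:

import networkx as nx

def build_lemma_graph(sentences, window=3):
    # each sentence is a list of lemmas; connect lemmas that co-occur
    # within a small window, accumulating edge weights
    graph = nx.Graph()
    for lemmas in sentences:
        for i, lemma in enumerate(lemmas):
            graph.add_node(lemma)
            for other in lemmas[i + 1 : i + window]:
                if graph.has_edge(lemma, other):
                    graph[lemma][other]["weight"] += 1.0
                else:
                    graph.add_edge(lemma, other, weight=1.0)
    return graph

sentences = [
    ["textrank", "build", "lemma", "graph"],
    ["pagerank", "rank", "node", "lemma", "graph"],
]
graph = build_lemma_graph(sentences)
ranks = nx.pagerank(graph)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))

The lemmas that co-occur with the most other lemmas end up with the highest ranks, which is the kind of signal TextRank exploits.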

Now that the ranks are calculated, we collect the phrases using the collect_phrases() method, for both the noun chunks and the named entities detected in the document. If you have read about phrase extraction or summarization, you will know that noun chunks and entities are the important objects in these contexts.
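For reference, these are the standard spaCy attributes that feed collect_phrases(); the example sentence is just something I made up:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Paco Nathan released pytextrank as a spaCy pipeline extension.")

print([chunk.text for chunk in doc.noun_chunks])       # noun chunks
print([(ent.text, ent.label_) for ent in doc.ents])    # named entities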

Finally, we sort the phrases by their ranks and pack them into Phrase objects in a phrase list, which is what calc_textrank() returns.

The other method, summary, we haven't looked at yet. From the usage section, we know that the textrank extension has a summary method which we call to get our result; this suggests that the summary method ranks the sentences and builds the summary. Let's see its code:

 def summary(self, limit_phrases=10, limit_sentences=4):
        """
        run extractive summarization, based on vector distance
        per sentence from the top-ranked phrases
        """
        unit_vector = []

        # construct a list of sentence boundaries with a phrase set
        # for each (initialized to empty)

        sent_bounds = [ [s.start, s.end, set([])] for s in self.doc.sents ]

        # iterate through the top-ranked phrases, added them to the
        # phrase vector for each sentence

        phrase_id = 0

        for p in self.doc._.phrases:
            unit_vector.append(p.rank)

            if self.logger:
                self.logger.debug(
                    "{} {} {}".format(phrase_id, p.text, p.rank)
                )
    
            for chunk in p.chunks:
                for sent_start, sent_end, sent_vector in sent_bounds:
                    if chunk.start >= sent_start and chunk.start <= sent_end:
                        sent_vector.add(phrase_id)

                        if self.logger:
                            self.logger.debug(
                                " {} {} {} {}".format(sent_start, chunk.start, chunk.end, sent_end)
                                )

                        break

            phrase_id += 1

            if phrase_id == limit_phrases:
                break

        # construct a unit_vector for the top-ranked phrases, up to
        # the requested limit

        sum_ranks = sum(unit_vector)

        try:
            unit_vector = [ rank/sum_ranks for rank in unit_vector ]
        except ZeroDivisionError:
            unit_vector = (0.0,) * len(unit_vector)

        # iterate through each sentence, calculating its euclidean
        # distance from the unit vector

        sent_rank = {}
        sent_id = 0

        for sent_start, sent_end, sent_vector in sent_bounds:
            sum_sq = 0.0
    
            for phrase_id in range(len(unit_vector)):
                if phrase_id not in sent_vector:
                    sum_sq += unit_vector[phrase_id]**2.0

            sent_rank[sent_id] = sqrt(sum_sq)
            sent_id += 1

        # extract the sentences with the lowest distance

        sent_text = {}
        sent_id = 0

        for sent in self.doc.sents:
            sent_text[sent_id] = sent
            sent_id += 1

        # yield results, up to the limit requested

        num_sent = 0

        for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
            yield sent_text[sent_id]
            num_sent += 1

            if num_sent == limit_sentences:
                break

So, let's understand this code. From the comments, we learn that the sentences are ranked based on their vector distance from the top-ranked phrases. Let's see how this is actually implemented.

The first thing the code does is initialize a few variables. Among these are an empty unit_vector list and a list of sentence boundary tuples, each containing (sentence_beginning, sentence_ending, an_empty_set).

Now, let's see how each of these gets filled.

Basically, the whole operation is done by iterating through the phrases in self.doc._.phrases. For each phrase, we append its rank to the unit vector. In the same loop, we check, for each sentence, whether the phrase occurs inside that sentence, and if so we add the phrase's index to that sentence's (initially empty) set. We leave the loop once we hit the number of top-ranking phrases we want to consider (limit_phrases).

Then, in the small try/except below, we normalize the ranks in the unit vector so they sum to one:

try:
     unit_vector = [ rank/sum_ranks for rank in unit_vector ]
except ZeroDivisionError:
     unit_vector = (0.0,) * len(unit_vector)

Finally, we calculate a rank for each sentence by computing its Euclidean distance from the unit vector of top phrases: for every top phrase that does not appear in the sentence, its squared (normalized) rank is added to a running sum, and the square root of that sum is the distance. We store these distances in the sent_rank dictionary, mapping sent_id to the Euclidean distance.

At the very beginning we said that the sentences closest to the vector of all the top-ranking phrases are the most important. That's why, in the last part of the code, we sort the dictionary by distance from smallest to largest and yield the most important sentences until we hit the requested number of sentences.
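Here is a toy numeric illustration of that distance computation, with made-up ranks and phrase membership sets (not taken from the package):

from math import sqrt

unit_vector = [0.5, 0.3, 0.2]            # normalized ranks of the top phrases
sent_vectors = [{0, 2}, {1}, {0, 1, 2}]  # phrase ids found in each sentence

sent_rank = {}
for sent_id, sent_vector in enumerate(sent_vectors):
    # a sentence is penalized for every top phrase it does not contain
    sum_sq = sum(unit_vector[pid] ** 2
                 for pid in range(len(unit_vector))
                 if pid not in sent_vector)
    sent_rank[sent_id] = sqrt(sum_sq)

# lower distance means the sentence covers more of the top phrases
print(sorted(sent_rank.items(), key=lambda kv: kv[1]))

In this toy case, sentence 2 contains every top phrase, ends up with distance 0, and would be picked first for the summary.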

So, this is how summarization works in the pytextrank package. We will keep publishing more on spaCy and NLP. Until then, stay tuned and thanks for reading!
