
pytextrank: a TextRank module for phrase extraction and text summarization

Introduction:

We have described spaCy in part 1, part 2, part 3, and part 4. In this post, we will describe the pytextrank project, which is built on top of spaCy and solves phrase extraction and text summarization. pytextrank is written by Paco Nathan, an American computer scientist based in Texas. pytextrank is interesting to me mainly for two reasons:

(1) it implements the TextRank algorithm very cleanly as a spaCy extension, and

(2) the package is easy to use: it properly abstracts all of its complexity away from the user, so it can be used with little to no understanding of the underlying algorithm.

Now that I have hopefully given you enough motivation to read about and use this package, we will explore the basic usage first and then dive in to see the inner workings, which is the more advanced part of this post.

How to use pytextrank:

pytextrank is included in the PyPI listing, so it can be installed via pip3 install pytextrank. Once it is installed in that manner, we can go forward and import pytextrank in a Python script or IDE.

Now, there are two main usages of pytextrank:

(1) creating phrases

(2) creating summaries of documents

To use either of them, one first has to instantiate a spaCy model object using the following lines:

import spacy

nlp = spacy.load("en_core_web_sm")

At this point, you may get an error saying that there is no en_core_web_sm model installed. In that case, you will have to install the model using

python -m spacy download en_core_web_sm

That should install the model, and then we can proceed with the following lines. Now, to add the TextRank part, we have to instantiate a TextRank object using the following line:

tr = pytextrank.TextRank(logger=None)

Now, tr is basically what we call a spaCy extension, i.e. an object we can add to the model's pipeline. So that's what we will do, using the following lines:

nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)

nlp.max_length = 10**7  # raise spaCy's maximum document length to handle long texts

doc = nlp(text)  # run the pipeline (including the textrank component) on the text

Now, after this, the usage for creating phrases and for summarization differs a bit. For creating phrases:

for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

Now, to get a summary of the text, the following code will do:

for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)

So, these are the methods to extract phrases and a summary using pytextrank. Now, we will go into the details of how this package is written and how the different parts work.

Code analysis:

In the GitHub repo of the pytextrank project, the main code lives in the pytextrank.py file. We will go through specific pieces of the code and explain what each part does; by the end, it will be clear how everything works together.

First, look at this code snippet.

How we set the extensions:

def PipelineComponent (self, doc):
        """
        define a custom pipeline component for spaCy and extend the
        Doc class to add TextRank
        """
        self.doc = doc
        Doc.set_extension("phrases", force=True, default=[])
        Doc.set_extension("textrank", force=True, default=self)
        doc._.phrases = self.calc_textrank()

        return doc

Now, if you don't know how spaCy extensions work, go through these slides written by Ines Montani, a spaCy core developer. Basically, spaCy extensions are custom attributes, functions, or classes which we can register on spaCy objects and then use from a custom pipeline component. In the code above, textrank and phrases are added as two spaCy extensions, which is exactly what we used in the usage section by calling doc._.textrank and doc._.phrases.
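As a quick, self-contained illustration of how extensions work (the attribute name word_count below is just a made-up example, not something pytextrank defines):

import spacy
from spacy.tokens import Doc

# register a custom attribute; it becomes readable/writable as doc._.word_count
Doc.set_extension("word_count", default=0, force=True)

nlp = spacy.load("en_core_web_sm")
doc = nlp("pytextrank builds on top of spaCy pipelines.")

# fill the custom attribute, just like pytextrank fills doc._.phrases
doc._.word_count = len(doc)
print(doc._.word_count)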

So now let's see how the textrank and phrases extensions are built.

Parsing a few lines above, it turns out that the above method is part of the TextRank class; and by calling Doc.set_extension("textrank", force=True, default=self), the code assigns the class instance itself as the textrank extension.

We set the phrases extension to an empty list, but then we fill it by saying

doc._.phrases = self.calc_textrank()

How the phrases get created:

Now, obviously the calc_textrank method creates the phrases object. Let's inspect that code.

def calc_textrank (self):
        """
        iterate through each sentence in the doc, constructing a lemma graph
        then returning the top-ranked phrases
        """
        self.reset()
        t0 = time.time()

        for sent in self.doc.sents:
            self.link_sentence(sent)

        if self.logger:
            self.logger.debug(self.seen_lemma)

        # to run the algorithm, we use the NetworkX implementation of
        # PageRank – i.e., approximating eigenvalue centrality – to
        # calculate ranks for each of the nodes in the lemma graph

        self.ranks = nx.pagerank(self.lemma_graph)

        # collect the top-ranked phrases based on both the noun chunks
        # and the named entities

        for chunk in self.doc.noun_chunks:
            self.collect_phrases(chunk)

        for ent in self.doc.ents:
            self.collect_phrases(ent)

        # TODO:
        # there are edge cases where the built-in noun chunking in
        # spaCy fails to extract much, for example:

        # > "everything you need to know about student loan interest rates variable and fixed rates capitalization amortization student loan refinancing and more."

        # we should test `len(self.phrases.keys())` vs. a brute-force
        # noun chunking approach, then add the brute-force override to
        # `self.phrases`

        #for k, p in self.phrases.items():
        #    print(" >>", k, p)

        #for key, lemma in self.seen_lemma.items():
        #    node_id = list(self.seen_lemma.keys()).index(key)
        #    print(node_id, key, lemma, self.ranks[node_id])


        # since noun chunks can be expressed in different ways (e.g., may
        # have articles or prepositions), we need to find a minimum span
        # for each phrase based on combinations of lemmas

        min_phrases = {}

        for phrase_key, phrase_list in self.phrases.items():
            phrase_list.sort(key=lambda p: p.rank, reverse=True)
            best_phrase = phrase_list[0]
            min_phrases[best_phrase.text] = (best_phrase.rank, len(phrase_list), phrase_key)

        # yield results

        results = sorted(min_phrases.items(), key=lambda x: x[1][0], reverse=True)

        phrase_list = [
            Phrase(p, r, c, self.phrases[k]) for p, (r, c, k) in results
        ]

        t1 = time.time()
        self.elapsed_time = (t1 - t0) * 1000.0

        return phrase_list
 

As you can see from the docstring, this method iterates through each sentence, constructs a lemma graph, and returns the top-ranked phrases as a list. This function keeps calling other functions, so it is a bit difficult to follow in one pass; we will take some parts for granted and go through the code step by step.

First, we call reset(), which reinitializes some of the variables inside the class. After that and a few other formalities, we call the link_sentence() function on each of the sentences in the document.
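For intuition, here is a rough sketch of what reset() presumably does, inferred from the attributes that calc_textrank() uses later; treat the exact names and defaults as an approximation rather than the literal source:

from collections import OrderedDict, defaultdict
import networkx as nx

def reset (self):
    # clear the per-document state that calc_textrank() builds up (sketch)
    self.elapsed_time = 0.0
    self.lemma_graph = nx.Graph()      # co-occurrence graph over lemmas
    self.phrases = defaultdict(list)   # candidate phrases keyed by lemma key
    self.ranks = {}                    # PageRank score per graph node
    self.seen_lemma = OrderedDict()    # lemma -> positions where it was seen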

The link_sentence() function, in turn, takes each sentence, lemmatizes it and breaks it into tokens, and builds a graph out of the lemmas. It also updates the weights of the different edges as more sentences get added.
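To make that concrete, here is a simplified sketch of what a link_sentence-style routine does: it keeps only content tokens, assigns each new lemma a node id, and connects lemmas that co-occur within a small window, bumping the edge weight on repeated co-occurrence. The POS filter and window size here are assumptions for illustration, not the exact pytextrank source:

import networkx as nx

POS_KEPT = {"NOUN", "PROPN", "ADJ", "VERB"}   # assumed content-word filter

def link_sentence_sketch(sent, lemma_graph, seen_lemma, window=3):
    # map each kept token to a stable node id based on its lemma
    node_ids = []
    for token in sent:
        if token.pos_ not in POS_KEPT:
            continue
        key = (token.lemma_, token.pos_)
        if key not in seen_lemma:
            seen_lemma[key] = len(seen_lemma)
        node_ids.append(seen_lemma[key])

    # connect lemmas that co-occur within the window, incrementing edge weights
    for i, node_i in enumerate(node_ids):
        for node_j in node_ids[i + 1 : i + window]:
            if lemma_graph.has_edge(node_i, node_j):
                lemma_graph[node_i][node_j]["weight"] += 1.0
            else:
                lemma_graph.add_edge(node_i, node_j, weight=1.0)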

So, after calling link_sentence() over all the sentences, the graph made of lemmas is built. Now, we call the PageRank algorithm from networkx (the well-known graph and network analysis package in Python) to calculate a rank for each node in the lemma graph, and we store the result in self.ranks.
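For reference, nx.pagerank just takes a graph (optionally weighted) and returns a dict mapping each node to its score; here is a toy example unrelated to any real document:

import networkx as nx

g = nx.Graph()
g.add_edge("cat", "animal", weight=2.0)
g.add_edge("dog", "animal", weight=1.0)

ranks = nx.pagerank(g)             # {'cat': ..., 'animal': ..., 'dog': ...}
print(max(ranks, key=ranks.get))   # 'animal' is the most central node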

Now that the ranks are calculated, we collect the phrases using the collect_phrases() method, for both the noun chunks and the named entities detected in the document. If you have read about phrase extraction or summarization, you will know that noun chunks and named entities are the important objects in these contexts.
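If you have not used these spaCy attributes before, both are exposed directly on the parsed document; for example:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Paco Nathan wrote pytextrank as a pipeline extension for spaCy.")

print([chunk.text for chunk in doc.noun_chunks])        # base noun phrases
print([(ent.text, ent.label_) for ent in doc.ents])     # named entities and labels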

Finally, we sort the phrases based on their ranks and put the Phrase objects into a phrase list, which is what calc_textrank() returns.

Now, the other method, summary, has not been covered yet. In the usage section, we saw that the textrank extension has a summary method, which we call to get our result. This implies that the summary method ranks the sentences and builds the summary. Let's look at its code:

 def summary(self, limit_phrases=10, limit_sentences=4):
        """
        run extractive summarization, based on vector distance
        per sentence from the top-ranked phrases
        """
        unit_vector = []

        # construct a list of sentence boundaries with a phrase set
        # for each (initialized to empty)

        sent_bounds = [ [s.start, s.end, set([])] for s in self.doc.sents ]

        # iterate through the top-ranked phrases, added them to the
        # phrase vector for each sentence

        phrase_id = 0

        for p in self.doc._.phrases:
            unit_vector.append(p.rank)

            if self.logger:
                self.logger.debug(
                    "{} {} {}".format(phrase_id, p.text, p.rank)
                )
    
            for chunk in p.chunks:
                for sent_start, sent_end, sent_vector in sent_bounds:
                    if chunk.start >= sent_start and chunk.start <= sent_end:
                        sent_vector.add(phrase_id)

                        if self.logger:
                            self.logger.debug(
                                " {} {} {} {}".format(sent_start, chunk.start, chunk.end, sent_end)
                                )

                        break

            phrase_id += 1

            if phrase_id == limit_phrases:
                break

        # construct a unit_vector for the top-ranked phrases, up to
        # the requested limit

        sum_ranks = sum(unit_vector)

        try:
            unit_vector = [ rank/sum_ranks for rank in unit_vector ]
        except ZeroDivisionError:
            unit_vector = (0.0,) * len(unit_vector)

        # iterate through each sentence, calculating its euclidean
        # distance from the unit vector

        sent_rank = {}
        sent_id = 0

        for sent_start, sent_end, sent_vector in sent_bounds:
            sum_sq = 0.0
    
            for phrase_id in range(len(unit_vector)):
                if phrase_id not in sent_vector:
                    sum_sq += unit_vector[phrase_id]**2.0

            sent_rank[sent_id] = sqrt(sum_sq)
            sent_id += 1

        # extract the sentences with the lowest distance

        sent_text = {}
        sent_id = 0

        for sent in self.doc.sents:
            sent_text[sent_id] = sent
            sent_id += 1

        # yield results, up to the limit requested

        num_sent = 0

        for sent_id, rank in sorted(sent_rank.items(), key=itemgetter(1)):
            yield sent_text[sent_id]
            num_sent += 1

            if num_sent == limit_sentences:
                break

So, let's understand this code now. From the comments, we understand that the sentences are ranked by their vector distance from the top-ranked phrases. Let's see how that is actually implemented.

The first thing the code does is initialize a few variables. Among these are an empty unit_vector list, and a list of sentence-boundary tuples, each containing (sentence_beginning, sentence_ending, one_empty_set).

Now, lets see how we are filling each of these.

So basically, we do the whole operation by iterating through the phrases present in self.doc._.phrases. For each phrase, we append its rank value to the unit vector. Further, in the same loop, we check each sentence to see whether the phrase occurs inside it, and fill that sentence's set with the indices of the phrases that are present in that specific sentence. We leave this loop once we hit the number of top-ranked phrases we are going to consider (limit_phrases).

Now, in the small try/except below, we normalize the ranks of the phrases in the unit vector so that they sum to 1.

try:
     unit_vector = [ rank/sum_ranks for rank in unit_vector ]
except ZeroDivisionError:
     unit_vector = (0.0,) * len(unit_vector)

Finally, we calculate a rank for each sentence by computing the Euclidean distance between the sentence's vector (with respect to the phrases present in it) and the unit vector containing all the top phrases. We store these distances in a sent_rank dictionary, mapping sent_id to the Euclidean distance.
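A tiny worked example of that normalization and distance computation, with made-up numbers:

from math import sqrt

# suppose the top 3 phrases have raw ranks 0.30, 0.20 and 0.10
unit_vector = [0.30, 0.20, 0.10]
sum_ranks = sum(unit_vector)
unit_vector = [rank / sum_ranks for rank in unit_vector]   # [0.5, 0.333..., 0.166...]

# suppose a sentence contains phrases 0 and 2 but misses phrase 1
sent_vector = {0, 2}
sum_sq = sum(unit_vector[phrase_id] ** 2
             for phrase_id in range(len(unit_vector))
             if phrase_id not in sent_vector)
print(sqrt(sum_sq))   # ~0.333: the distance contributed by the missing phrase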

At the very beginning, we said that the sentences nearest to the combined vector of all the top-ranking phrases are the most important. That's why, in the last part of the code, we sort the dictionary from smallest to largest distance and yield the most important sentences until we hit the number of sentences required.

So, this is how summarization works in the pytextrank package. We will keep publishing more on spaCy and NLP. Until then, stay tuned and thanks for reading!
