
Huggingface transformers library exploration: Part 1, Summarization

Introduction:


Being in the NLP field, one of the burning topics currently is the transformer architecture. Built around attention and pooling in a totally new architecture, transformer-based models have been pushing new improvements every year. These models are generally pretty tough to understand and implement from scratch, so people look for a good library that has them already implemented. One such good library is Hugging Face's transformers library, which I have recently started to explore.
transformers has both PyTorch and TensorFlow support. To install transformers on Linux, you can just type
pip install transformers
and it will download and install. Now, I will go through the quick tour section of the documentation and try out a couple of examples from it.

Quick tour:


From the quick tour, I have decided to try out the summarization task, on a Microsoft-related text I had downloaded to my PC earlier.
Now, I will try out different summarization models through the easy-to-use pipeline structure.
A pipeline produces a ready-to-use model when provided a task, the type of pre-trained model we want, the framework we use, and a couple of other relevant parameters.
I have used this pipeline class and instantiated a summarizer as below:

from transformers import pipeline
summarizer = pipeline('summarization', model='t5-base')

Now, when running this code, I get the following error:

ImportError: FloatProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
Clearly the error asks us to install or update jupyter. I hadn't updated jupyter in a long time, so I just reinstalled it, and that solved the problem.
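For reference, the upgrade I mean is roughly this (the ipywidgets page linked in the error suggests the same packages):

pip install --upgrade jupyter ipywidgets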

The second error appears on writing this next line:
summary = summarizer(text, min_length = 5, max_length = 20)
is:
RuntimeError: method 'detach' already has a docstring
So, now we will resolve this.
As discussed in a closed Spyder IDE issue, this happens because PyTorch gets loaded multiple times inside Spyder. So there is no fixed solution other than not loading it more than once (for example, by restarting the console before re-running the script).

Now we finally run the script: a few HBox progress widgets appear in the console while the model downloads, and then the model is ready.

Now, as you can see, we have a file named micro_soft.txt in which the following corpus is stored:
"
In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow." The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. This program also included developer-focused AI school that provided a bunch of assets to help build AI skills.
"
Now, in the summarizer call, there are two parameters you can control: min_length and max_length (both counted in tokens, not words).
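Here is a minimal sketch of how these runs can be scripted (assuming the corpus sits in micro_soft.txt, as above; variable names are mine):

from transformers import pipeline

# build the T5 summarizer once; the model downloads on first use
summarizer = pipeline('summarization', model='t5-base')

# read the corpus shown above
with open('micro_soft.txt') as f:
    text = f.read()

# try the three (min_length, max_length) combinations; lengths are in tokens
for min_len, max_len in [(5, 20), (10, 30), (100, 150)]:
    summary = summarizer(text, min_length=min_len, max_length=max_len)
    print(min_len, max_len, ':', summary[0]['summary_text'])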
We ran with the following combinations and got these results:
(1) min_length = 5, max_length = 20:
Microsoft has launched a three-year collaborative program to empower students with AI-ready skills.

(2) min_length = 10, max_length = 30:
Microsoft has launched Intelligent Cloud Hub to empower students with AI-ready skills . the three-year collaborative program will support around 100 institutions .

(3) min_length = 100, max_length = 150:
Microsoft has launched Intelligent Cloud Hub to empower students with AI-ready skills . the three-year collaborative program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services . as part of the program, the redmond giant will set up the core AI infrastructure and IoT Hub for the selected campuses . Microsoft will also provide Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning .

Clearly, the generated summaries are all meaningful; and although they may not be the best summaries in some absolute sense, they are pretty good ones. The advantage, in my view, is that the text keeps a coherent sequence of meaning while still compressing well.

But this model does not hold up if I ask for longer lengths! See this, for example:
if we set min_length = 500 and max_length = 1000, it first gives a common warning:
Your max_length is set to 1000, but you input_length is only 368. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
which goes off whenever the requested max_length exceeds the tokenized input length.
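One way to silence it, if you are happy to cap the output by the input length anyway, is to measure the tokenized input first. A sketch (the pipeline exposes its tokenizer, and text is the corpus from before):

# count the input tokens and cap max_length accordingly
n_tokens = len(summarizer.tokenizer(text).input_ids)
summary = summarizer(text, min_length=5, max_length=min(1000, n_tokens))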
Now look at the created summary with extended length. It fails miserably:
"
Microsoft has launched Intelligent Cloud Hub to empower
 students with AI-ready skills . the three-year collaborative program will
 support around 100 institutions with AI infrastructure, course content and
 curriculum, developer support, development tools and give students access
 to cloud and AI services . as part of the program, the redmond giant will
 set up the core AI infrastructure and IoT Hub for the selected campuses .
 Microsoft will also provide Azure AI services such as Microsoft Cognitive
 Services, Bot Services and Azure Machine Learning . earlier this year, the
 company announced Microsoft Professional Program In AI .   . -  n    -  .
 .. -  .-  h  n   . an  an  .    s  "  ,   . " .s  t  - "  "  " " " s " " "
 ""  "" """ " "- " ".
  " " " -- "--- -, " , ""- & ., . and ./- .&--/--"--\'- s--.--&-. &.-...-/.
  - ".. "- d. /// / &//-// "//"/- "/ " / " "/" "i" "\' " "ii"- "i\' "-" " &
  " "& " \'" "& && \'& ," \'\' "&&\' " c" &###&&&#&#" "## "&/##"&#\' "#&
  /&&/&# /# & ### -& -/ "&# #&& "& $##\'&& # ##&\'&#/& n&#;&& $&# $&&-&&
  (&&( &-/&/-&/ -#&/.&#
"
Now I wanted a summarizer that gives me flexibility in the length of the generated text. Ideally, to my mind after this experiment, a summarizer should not be too restricted by length. So now I will try the bart-large-cnn model.
To load the bart-large-cnn model, the path to provide as model is 'facebook/bart-large-cnn'. With this model, we can actually extend the length to a much bigger extent. Since the experiment is otherwise the same, I am not repeating it in this blog; you can check it on my github repo, which will be updated.
Compared to the short T5 summaries above, we can see that bart is not that good on this same text:
"
Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum
"
i.e. this summary is clearly not as good as the one created by T5. But the good thing is that the summary still maintains the sentence sequence. A code sketch for this bart-large usage is given below:
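This is a minimal sketch rather than my exact script (text holds the corpus, as before):

from transformers import pipeline

# BART summarizer; 'facebook/bart-large-cnn' is the hub path given above
bart_summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

bart_summary = bart_summarizer(text, min_length=10, max_length=30)
print(bart_summary[0]['summary_text'])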

Now, we will test the distilbart model. With distilbart-cnn-6-6, I got stuck on the tokenizer and model name paths. Leaving that aside, I tried distilbart-cnn-12-6: using the path 'sshleifer/distilbart-cnn-12-6' for both the model and the tokenizer, it runs properly.
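A minimal sketch of that setup (again assuming text holds the corpus; the variable name is mine):

from transformers import pipeline

# the same hub path works for both the model and the tokenizer
summarizer_db = pipeline(
    'summarization',
    model='sshleifer/distilbart-cnn-12-6',
    tokenizer='sshleifer/distilbart-cnn-12-6',
)

print(summarizer_db(text, min_length=10, max_length=30)[0]['summary_text'])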
Again, the shorter results are not good enough; in the 10-30 length setting especially, a sentence is left unfinished, as below:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a
"
while the longer one (min_length = 100, max_length = 150) is again a fine, meaningful summary, as given below:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a three-year collaborative program, the program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and access to cloud and AI services . The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry . Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public .
"
Now, we tried to generate a larger text (min_length = 500, max_length = 1000) from the same model on the same input, and the result was catastrophic toward the end. Unlike bart-large-cnn, the model broke down in two respects:
(1) coherence within sentences
(2) flow of the text
I will cite the generated text below and analyze how that happens:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a three-year collaborative program, the program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and access to cloud and AI services . The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry . Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public . The company will set up the core AI infrastructure and IoT Hub for the selected campuses with the . selected campuses. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the . educators to . educate the workforce of tomorrow." The program also includes developer-focused AI schools that provided a bunch of assets to help build AI skills. Earlier in the company has also launched a series of online courses which featured hands-on labs and expert instructors as well as the AI school to help develop AI skills for the students with a number of AI skills as well . The programs are available in India for a range of courses and are available on Amazon Prime Prime Minister Narendra Modi’s Amazon Prime Minister in the US and Google’ Prime Minister of India’�s Prime Minister has been in New Delhi and the US’Sara Khan’ease to launch a similar program in the U.S. to launch its own Amazon Prime’ed Amazon Prime account in New York’inging of Amazon’es. The company’ll be available in the next month’ – the first time in the region’\u2009-to-Amazon Prime” – the world’. The first of its own in a new series of software software, which will be available to download its own software software for the first of the year’, which is available to create a package of software for a package to develop a package for the next of the month” to download a new software software to create its own products and use it in a package. The software, the software, is available in a bundle of software and software to develop and use the software for its first of a new product, which can be downloaded in a range from Amazon.com’ The software is available on a subscription to Amazon Prime.com and Microsoft’ve been released.
"
Clearly, the problem with distilbart is that, unlike bart, it pays much less attention to the context of the sentences and phrases, and as a result the sentences lose coherence very fast. And once per-sentence coherence is lost, the flow of the whole text is pretty much gone too.
I wonder what metric one would use to check the flow of a text, and whether a ROUGE score can measure something like that.

Conclusion:

Anyway, this is what my inspection of summarization looks like up to this point, after examining bart-large-cnn, Google's T5, and distilbart from Sam Shleifer. This is definitely more of a beginner-ish ramble, but I have experimented with this toy sample text and three of the models. The results give a preliminary idea that distilbart may lose attention to context more than bart in some cases; but that is understandable, given that distilbart is built to be 1.68 times faster than bart and much smaller too. And since this is 1 sample rather than millions or billions of samples, these results are by no means conclusive.
But to conclude, you can take a quick look at the ROUGE scores of these three models as a decisive ending note based on current metrics.
In order of ROUGE-2 scores, T5 sits at 21.55, bart at 22.06, and the distilbart 12-6 version at 22.37. So, although we saw discrepancies in coherence and flow in this article, they may be
(1) text specific, or
(2) not captured by the ROUGE scores.
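If you want to reproduce such numbers on your own text, here is a minimal sketch using the rouge-score package (install with pip install rouge-score; reference_summary and generated_summary are placeholder strings, not from this post):

from rouge_score import rouge_scorer

# ROUGE-2 F-measure between a reference and a generated summary
scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
scores = scorer.score(reference_summary, generated_summary)
print(scores['rouge2'].fmeasure)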
I will dig deeper in the next part of this transformers library exploration. We'll talk about the ROUGE score in detail in a separate blog post, and in the next part we will try some different texts and mainly discuss the question answering task.
Thanks for reading! You are awesome; have a great day!