
Hugging Face transformers library exploration: part 1, summarization

Introduction:


For anyone working in NLP, one of the hottest topics right now is the transformer architecture. Using attention in a completely new architecture, transformer-based models have been pushing out new improvements every year. These models are generally tough to understand and implement from scratch, so people look for a good library that already implements them. One such library is Hugging Face's transformers, which I have recently started to explore.
transformers supports both PyTorch and TensorFlow. To install it on Linux, you can just type
pip install transformers
and it will download and install. Now I will go through the quick tour and try out a couple of examples from it.

Quick tour:


From the quick tour, I decided to try out the summarization task. I had a Microsoft-related article saved on my PC from earlier.
I will try out different summarization models with the easy-to-use pipeline API.
Given a task, the pre-trained model we want to use, the framework (PyTorch or TensorFlow) and a couple of other relevant parameters, a pipeline returns a ready-to-use object that wraps the model.
I used this pipeline function and instantiated a summarizer as below:

from transformers import pipeline
summarizer = pipeline('summarization',model = "t5-base")

When I first ran this code, I got the following error:

ImportError: FloatProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
The error clearly asks to install or update jupyter and ipywidgets. I had not updated jupyter in a long time, so I simply reinstalled it, and that solved the problem.

The second error came on running this next line:
summary = summarizer(text, min_length = 5, max_length = 20)
The error is:
RuntimeError: method 'detach' already has a docstring
According to a closed issue on the Spyder IDE tracker, this happens because PyTorch gets loaded more than once in the same Spyder session. So there is no real fix other than not loading it more than once.

Now we finally run this script.
A few HBox progress widgets show up in the console while the model loads, and then the model is ready.

The input lives in a file named micro_soft.txt, which stores the following text:
"
In an attempt to build an AI-ready workforce, Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills. Envisioned as a three-year collaborative program, Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services. As part of the program, the Redmond giant which wants to expand its reach and is planning to build a strong developer ecosystem in India with the program will set up the core AI infrastructure and IoT Hub for the selected campuses. The company will provide AI development tools and Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning.According to Manish Prakash, Country General Manager-PS, Health and Education, Microsoft India, said, "With AI being the defining technology of our time, it is transforming lives and industry and the jobs of tomorrow will require a different skillset. This will require more collaborations and training and working with AI. That’s why it has become more critical than ever for educational institutions to integrate new cloud and AI technologies. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the workforce of tomorrow." The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry. Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public. The program was developed to provide job ready skills to programmers who wanted to hone their skills in AI and data science with a series of online courses which featured hands-on labs and expert instructors as well. 
This program also included developer-focused AI school that provided a bunch of assets to help build AI skills.
"
When calling the summarizer instance, there are two main parameters you can control: min_length and max_length. Both are measured in tokens rather than words.
I ran the following combinations and got these results:
(1) min_length = 5, max_length = 20 :
Microsoft has launched a three-year collaborative program to empower students with AI-ready skills.

(2) min_length = 10, max_length = 30:
Microsoft has launched Intelligent Cloud Hub to empower students with AI-ready skills . the three-year collaborative program will support around 100 institutions .

(3) min_length = 100, max_length = 150:
Microsoft has launched Intelligent Cloud Hub to empower students with AI-ready skills . the three-year collaborative program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and give students access to cloud and AI services . as part of the program, the redmond giant will set up the core AI infrastructure and IoT Hub for the selected campuses . Microsoft will also provide Azure AI services such as Microsoft Cognitive Services, Bot Services and Azure Machine Learning .
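
The three runs above can also be scripted in one loop. Here is a minimal sketch, assuming the summarizer returns a list with one dict that carries a summary_text key (which is how the summarization pipeline responds). The helper name summarize_with_lengths is made up for illustration.

```python
def summarize_with_lengths(summarizer, text, length_pairs):
    """Run `summarizer` once per (min_length, max_length) pair.

    `summarizer` is any callable shaped like the summarization pipeline:
    it returns a list holding one dict with a "summary_text" key.
    """
    results = {}
    for min_len, max_len in length_pairs:
        output = summarizer(text, min_length=min_len, max_length=max_len)
        results[(min_len, max_len)] = output[0]["summary_text"]
    return results


# Usage with the real pipeline (downloads t5-base on first run):
# from transformers import pipeline
# text = open("micro_soft.txt", encoding="utf-8").read()
# summarizer = pipeline("summarization", model="t5-base")
# pairs = [(5, 20), (10, 30), (100, 150)]
# for lengths, summary in summarize_with_lengths(summarizer, text, pairs).items():
#     print(lengths, summary)
```

Passing the summarizer in as an argument keeps the loop independent of which model is loaded, so the same helper works for the other models tried later.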

Clearly, the generated summaries are meaningful. They may not be the best possible summaries, but they are pretty good. The advantage, in my view, is that the sentences keep a sensible order of meaning while the model still condenses the text well.

But this model does not hold up if I ask for longer lengths. See this for example:
If we set min_length = 500, max_length = 1000, it first gives this warning:
Your max_length is set to 1000, but you input_length is only 368. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
This warning appears whenever the requested length is larger than the input length, which the pipeline counts in tokens (368 here).
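
To see this coming before calling the model, one can count the input tokens with the pipeline's own tokenizer and cap the requested lengths. This is only a rough sketch: count_tokens, clamp_lengths and the max_ratio default of 0.5 are illustrative assumptions, not something the library prescribes.

```python
def count_tokens(tokenizer, text):
    """Number of token ids the tokenizer produces for `text`."""
    return len(tokenizer.encode(text))


def clamp_lengths(input_tokens, min_length, max_length, max_ratio=0.5):
    """Cap the requested summary lengths relative to the input length.

    `max_ratio` is an arbitrary choice for this sketch: the summary may be
    at most that fraction of the input, counted in tokens.
    """
    ceiling = max(1, int(input_tokens * max_ratio))
    max_length = min(max_length, ceiling)
    # min_length can never exceed the (possibly lowered) max_length.
    min_length = min(min_length, max_length)
    return min_length, max_length


# Usage with a loaded pipeline (its tokenizer is exposed as .tokenizer):
# n = count_tokens(summarizer.tokenizer, text)
# min_len, max_len = clamp_lengths(n, 500, 1000)
# summary = summarizer(text, min_length=min_len, max_length=max_len)
```

The token count from encode can differ by a token or so from what the pipeline reports, since special tokens may be included, so treat it as an estimate.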
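Now look at the summary generated with the extended length. Because min_length = 500 forces the model to keep generating far beyond what the 368-token input can support, it fails miserably: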
"
Microsoft has launched Intelligent Cloud Hub to empower
 students with AI-ready skills . the three-year collaborative program will
 support around 100 institutions with AI infrastructure, course content and
 curriculum, developer support, development tools and give students access
 to cloud and AI services . as part of the program, the redmond giant will
 set up the core AI infrastructure and IoT Hub for the selected campuses .
 Microsoft will also provide Azure AI services such as Microsoft Cognitive
 Services, Bot Services and Azure Machine Learning . earlier this year, the
 company announced Microsoft Professional Program In AI .   . -  n    -  .
 .. -  .-  h  n   . an  an  .    s  "  ,   . " .s  t  - "  "  " " " s " " "
 ""  "" """ " "- " ".
  " " " -- "--- -, " , ""- & ., . and ./- .&--/--"--\'- s--.--&-. &.-...-/.
  - ".. "- d. /// / &//-// "//"/- "/ " / " "/" "i" "\' " "ii"- "i\' "-" " &
  " "& " \'" "& && \'& ," \'\' "&&\' " c" &###&&&#&#" "## "&/##"&#\' "#&
  /&&/&# /# & ### -& -/ "&# #&& "& $##\'&& # ##&\'&#/& n&#;&& $&# $&&-&&
  (&&( &-/&/-&/ -#&/.&#
"
I wanted a summarizer that is flexible about the length of the generated text. Ideally, for my experiment, the summarizer should not restrict the length much. So next I tried the bart-large-cnn model.
To load the bart-large-cnn model, the path to pass as model is 'facebook/bart-large-cnn'. With this model, the length can be extended to a much bigger extent. Since the experiment is otherwise the same, I am not repeating it in this blog, but you can check it in my GitHub repo, which will be updated.
On short summaries like the T5 ones above, bart does not do as well on this same text:
"
Intelligent Cloud Hub will support around 100 institutions with AI infrastructure, course content and curriculum
"
That is, this summary is clearly not as good as the one created by T5. The good thing is that the summary still maintains the sentence order. (The original post showed a screenshot of the bart-large-cnn code here.)
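
For reference, here is a minimal sketch of how this model can be loaded through the same pipeline call. It is an assumed reconstruction, not the original code, and build_summarizer is a hypothetical helper name.

```python
def build_summarizer(checkpoint="facebook/bart-large-cnn"):
    """Create a summarization pipeline for the given model path."""
    # Imported inside the function so that defining it does not load the
    # library (and PyTorch) until a summarizer is actually requested.
    from transformers import pipeline

    # Passing the same path for the tokenizer makes the pairing explicit.
    return pipeline("summarization", model=checkpoint, tokenizer=checkpoint)


# Usage (downloads the model weights on first run):
# summarizer = build_summarizer()
# summary = summarizer(text, min_length=10, max_length=30)
# print(summary[0]["summary_text"])
```

The same helper should work for any other summarization model path, as long as the model and its tokenizer are published under the same name.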

Next, we test the distilbart models. With distilbart-cnn-6-6, I got stuck on the tokenizer and model name path, so I moved on to distilbart-cnn-12-6 instead. Using the path 'sshleifer/distilbart-cnn-12-6' for both the model and the tokenizer, it loads properly.
Again, the shorter results are not that good. In particular, with the 10-30 length setting, one sentence is left unfinished:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a
"
while the longer one (100-150 tokens) is again a fine, meaningful summary, as given below:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a three-year collaborative program, the program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and access to cloud and AI services . The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry . Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public .
"
Next, we tried to generate a longer text (500-1000 tokens) from the same model and the same input, and the end of the result was catastrophic. Unlike bart-large-cnn, the model broke down in two respects:
(1) coherence of sentences
(2) flow of text
I quote the generated text below and then look at how that happens:
"
Microsoft announced Intelligent Cloud Hub which has been launched to empower the next generation of students with AI-ready skills . Envisioned as a three-year collaborative program, the program will support around 100 institutions with AI infrastructure, course content and curriculum, developer support, development tools and access to cloud and AI services . The program aims to build up the cognitive skills and in-depth understanding of developing intelligent cloud connected solutions for applications across industry . Earlier in April this year, the company announced Microsoft Professional Program In AI as a learning track open to the public . The company will set up the core AI infrastructure and IoT Hub for the selected campuses with the . selected campuses. The program is an attempt to ramp up the institutional set-up and build capabilities among the educators to educate the . educators to . educate the workforce of tomorrow." The program also includes developer-focused AI schools that provided a bunch of assets to help build AI skills. Earlier in the company has also launched a series of online courses which featured hands-on labs and expert instructors as well as the AI school to help develop AI skills for the students with a number of AI skills as well . The programs are available in India for a range of courses and are available on Amazon Prime Prime Minister Narendra Modi’s Amazon Prime Minister in the US and Google’ Prime Minister of India’�s Prime Minister has been in New Delhi and the US’Sara Khan’ease to launch a similar program in the U.S. to launch its own Amazon Prime’ed Amazon Prime account in New York’inging of Amazon’es. The company’ll be available in the next month’ – the first time in the region’\u2009-to-Amazon Prime” – the world’. 
The first of its own in a new series of software software, which will be available to download its own software software for the first of the year’, which is available to create a package of software for a package to develop a package for the next of the month” to download a new software software to create its own products and use it in a package. The software, the software, is available in a bundle of software and software to develop and use the software for its first of a new product, which can be downloaded in a range from Amazon.com’ The software is available on a subscription to Amazon Prime.com and Microsoft’ve been released.
"
Clearly, the problem with distilbart is that, unlike bart, it seems to focus much less on the context of sentences and phrases, so the sentences lose coherence really fast. And once coherence is lost per sentence, the flow of the text is pretty much gone too.
I wonder what metric could check the flow of a text, and how a ROUGE score relates to measuring something like that.
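
As a small illustration of both questions, here is a simplified sketch. It is not the official ROUGE implementation (no stemming or special tokenization), and the names rouge2 and distinct_ngram_ratio are my own. ROUGE-2 counts word bigrams shared between a generated summary and a reference summary, so it rewards overlapping word pairs but does not look at the flow across sentences directly. The second function, a distinct n-gram ratio, is one crude signal for looping text like the "software software" near the end of the output above.

```python
from collections import Counter


def bigrams(text):
    """Lowercased word bigrams of `text`, counted with multiplicity."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def rouge2(candidate, reference):
    """Simplified ROUGE-2: bigram overlap as precision, recall and F1."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Counter & Counter keeps the minimum count per bigram (clipped overlap).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def distinct_ngram_ratio(text, n=2):
    """Share of n-grams in `text` that are unique (1.0 means no repeats)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 1.0
    return len(set(ngrams)) / len(ngrams)


# Usage: compare a generated summary against a human-written reference,
# and check how repetitive the generated text is.
# scores = rouge2(generated_summary, reference_summary)
# repetition = distinct_ngram_ratio(generated_summary)
```

A low distinct ratio only says that the text repeats itself. It is an assumption on my side that this tracks the loss of flow seen here, and a fluent text can still score badly on ROUGE-2 if it uses different words than the reference.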

Conclusion:

Anyway, this is where my inspection of summarization stands at this point, after examining bart-large-cnn, Google's T5, and distilbart from our beloved Sam Shleifer. This is definitely more of a beginner-ish ramble, since I only experimented with one toy text and three models. The results give some preliminary hints that distilbart may pay less attention to context than bart in some cases. That is quite explainable, given that distilbart is made to be 1.68 times faster than bart and much smaller too. And with just one sample instead of millions or billions of samples, these results are certainly not conclusive.
To conclude, you can take a quick look at the ROUGE scores of these three models, which gives a more decisive ending note based on current metrics.
In terms of ROUGE-2 scores, T5 has 21.55, bart has 22.06, and distilbart (the 12.3 version) has 22.37. So, although we saw discrepancies in coherence and flow in this article, they may be
(1) text specific
(2) not captured by the rouge scores.
I will dig deeper in the next part of this transformers library exploration. We'll talk about the ROUGE score in detail in a separate blog post, and in the next part we will try some different texts and mainly look at the question answering task.
Thanks for reading! You are awesome, and have a great day!
