
A comprehensive article on Stemming



Introduction:

In natural language processing, we face one challenge every day: cleaning text properly. To clean text, we often use two methods to modify it, namely stemming and lemmatization. In this post, we will analyze stemming and lemmatization in detail.

What are stemming and lemmatization?

For grammatical reasons, we tend to use words that share the same core but end with different suffixes, creating different parts of speech as well as different meanings. For example, cars, car, car's, and cars' all have car as the main part, and the last parts make the difference in meaning. When analyzing text statistically, we need to consider the core parts of these words, to avoid redundancy of information, among other reasons.

This is the context in which stemming and lemmatization enter. It is important to note that stemming algorithms were originally developed for a purpose quite different from general natural language processing: the main goal of stemming is to produce the root of any morphological variation of a word in order to improve information retrieval (IR) quality; i.e., it is a product of IR research. Both stemming and lemmatization attempt to reduce a word to its irreducible core part. Stemming refers to algorithmically reducing a word to its core part, i.e., the stem or root (don't confuse these terms with their grammatical meanings); whether that core part is a subword or a word is not important. Lemmatization also algorithmically reduces a word to its core part, but it ensures that the core part is itself a word, often using a vocabulary or word dictionary. Therefore, with stemming you may end up with subwords in your text, while lemmatization always guarantees that the changed words are real words. In this post, we will focus on stemming algorithms; in the following post, we will discuss the different varieties of lemmatization.

How does stemming work?

Stemming is the process of reducing the morphological variants of a word to a root/base form. There are three main problems in stemming: (1) under-stemming, (2) over-stemming, and (3) mis-stemming. At a high level, a stemmer is either a set of rules, a dictionary-guided lookup algorithm, or a hybrid of the two. A rule-based stemmer matches different types of suffixes in a word and removes them according to its rules, finally reaching the stem. A dictionary-based stemmer uses the original word to look up the relevant stem in a dictionary of words. A hybrid approach mostly uses a set of rules for lookup and easy stemming, while using a dictionary for disambiguation, minimizing mis-stemming, and ruling out exceptions.
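
To make this taxonomy concrete, here is a minimal sketch of the rule-based and hybrid approaches; the suffix rules and the dictionary below are hypothetical toys, not any published stemmer:

    # Hypothetical suffix rules and a toy lookup table, for illustration only.
    SUFFIX_RULES = ["ational", "iveness", "ing", "ies", "ed", "s"]
    STEM_DICTIONARY = {"ran": "run", "mice": "mouse"}

    def rule_based_stem(word):
        # Rule-based: strip the longest matching suffix, keeping a minimal stem.
        for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                return word[:-len(suffix)]
        return word

    def hybrid_stem(word):
        # Hybrid: consult the dictionary first (irregular forms), rules otherwise.
        return STEM_DICTIONARY.get(word, rule_based_stem(word))

On these lines, numerous stemmers for English have been created. Some of these stemmers are: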

  1. Lovins Stemmer

    The first stemmer written was the Lovins stemmer, by Julie Beth Lovins. According to Wikipedia:

    "This paper was remarkable for its early date and had a great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms, by Professor John W. Tukey of Princeton University, the algorithm developed at Harvard University by Michael Lesk, under the direction of Professor Gerard Salton, and a third algorithm developed by James L. Dolby of R and D Consultants, Los Altos, California."

    The Lovins stemmer was the first to categorize suffix endings and then organize the removal of suffixes based on conditions, which is still pretty much the core of all rule-based stemmers.

    "It [Lovins stemmer] performs a lookup on a table of 294 endings, 29 conditions and 35 transformation rules, which have been arranged on the longest-match principle. The Lovins stemmer removes the longest suffix from a word. Once the ending is removed, the word is recoded using a different table that makes various adjustments to convert these stems into valid words. It always removes a maximum of one suffix from a word, due to its nature as a single-pass algorithm. The advantage of this algorithm is it is very fast and can handle the removal of double letters in words like ‘getting’ being transformed to ‘get’ and also handles many irregular plurals like – mouse and mice, index and indices, etc. Drawbacks of the Lovins approach are that it is time and data consuming. Furthermore, many suffixes are not available in the table of endings. It is sometimes highly unreliable and frequently fails to form words from the stems or to match the stems of like-meaning words. The reason being the technical vocabulary being used by the author." -- a description of Lovins stemmer from comparative study of stemming algorithms by Anjali G.jivani.

  2. Dawson Stemmer

    J. L. Dawson invented the Dawson stemmer in 1974, shortly after the Lovins stemmer. It extends the Lovins approach with a list of more than a thousand suffixes of the English language. Here is the generic algorithm for the Dawson stemmer:

    1. Get the input word
    2. Get the matching suffix
    3. The suffix pool is reverse indexed by length and the last character
    4. Remove the longest suffix from the word with exact match
    5. Recode the word using a mapping table
    6. Convert stem into a valid word

    The advantage of the Dawson stemmer is that it is a single-pass stemmer. But the main problem is that, again, it relies on a mapping table and a reverse-indexed store of suffixes. Although that does not make it as slow as a dictionary-based stemmer, it is still slower than traditional rule-based stemmers such as Porter's algorithm.
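
    As a sketch of steps 2-4 above, the suffix pool can be reverse indexed by its last character and sorted by length, so only a small bucket is scanned per word; the suffix list here is a toy stand-in for Dawson's list of over a thousand entries:

        from collections import defaultdict

        SUFFIXES = ["ations", "ation", "ness", "ing", "ed", "s"]  # toy pool
        INDEX = defaultdict(list)
        for sfx in SUFFIXES:
            INDEX[sfx[-1]].append(sfx)
        for bucket in INDEX.values():
            bucket.sort(key=len, reverse=True)  # longest match first

        def dawson_style_stem(word):
            # Only suffixes ending in the word's last character are checked.
            for sfx in INDEX.get(word[-1], []):
                if word.endswith(sfx) and len(word) - len(sfx) >= 2:
                    return word[:-len(sfx)]  # recoding (steps 5-6) omitted here
            return word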

  3. Porter’s Stemmer algorithm

    Porter's stemming algorithm was first published in 1980 by M. F. Porter, and it has since been widely reimplemented and rewritten in dozens of different languages. The Porter stemmer is a five-step process that iteratively removes suffixes based on a number of conditions specified at each step. Thanks to this step-by-step design, it can remove compound suffixes, while the conditions reduce over-stemming. The conditions are mainly based on the "measure" of the word before the ending (roughly, how many vowel-consonant sequences it contains) and on what type of suffix sits at the end of the word. Each set of rules also removes the longest possible suffix allowed by the conditions, a similarity to its predecessors Lovins and Dawson. As the algorithm is quite detailed, I will provide proper links to read the details of the stemming process (a small sketch of the measure computation follows these links):

    1. Porter stemming process
    2. Porter stemming original paper
    3. Research on advances of Porter stemming
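
    As a taste of the conditions, here is a small helper computing Porter's measure m, under which a word is viewed as [C](VC)^m[V] with maximal vowel and consonant runs; this is only one ingredient of the full algorithm, not the algorithm itself:

        def is_vowel(word, i):
            # a, e, i, o, u are vowels; 'y' counts as one after a consonant.
            if word[i] in "aeiou":
                return True
            return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

        def porter_measure(stem):
            # m equals the number of vowel-run -> consonant-run transitions.
            pattern = "".join("v" if is_vowel(stem, i) else "c"
                              for i in range(len(stem)))
            return pattern.count("vc")

        # porter_measure("tree") == 0, porter_measure("trouble") == 1,
        # porter_measure("troubles") == 2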

  4. Snowball stemming language

    Snowball is the standard language developed by M. F. Porter for rigorous and unambiguous implementation of stemming algorithms. Once an algorithm is properly defined in Snowball, Snowball can generate the stemming program directly in ANSI C or Java. It is the language currently used by both academics and industry practitioners to write new, standard stemmers.

    Using Snowball, Porter revised his stemmer and published an enhanced and improved version called Porter2. This stemmer is sometimes known as the Snowball stemmer too; NLTK also refers to Porter2 as the Snowball stemmer.

    For Porter's stemmer and the Snowball stemmer, there are NLTK modules that implement these processes. The link for how to use them is the following: NLTK resources for Porter stemmer.
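
    A quick usage sketch with NLTK's built-in classes (assuming NLTK is installed):

        from nltk.stem import PorterStemmer, SnowballStemmer

        porter = PorterStemmer()
        snowball = SnowballStemmer("english")  # the Porter2 algorithm

        for word in ["caresses", "ponies", "relational", "generously"]:
            print(word, "->", porter.stem(word), "/", snowball.stem(word))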

  5. Krovetz Stemmer

    The Krovetz stemmer was proposed in 1993 by Robert Krovetz. While we have mainly discussed algorithm-based stemmers up to this point, the Krovetz stemmer is the first hybrid approach, combining a dictionary with a set of rules. Given a word, this stemmer first looks it up in the dictionary to see whether it is already a proper word without suffixes; if it is, there is no reason to stem it. If it is not in the dictionary, the word is stemmed using a set of rules, just like the other stemmers, and the result is looked up in the dictionary again. In this way, the Krovetz stemmer ensures that we reach a word as the stem, and not a non-word, individually nonsensical subword.

    While the Krovetz stemmer always reaches a word after stemming and also tends to have very little mis-stemming, its problem is the dictionary. The dictionary, as we know, needs regular maintenance and updates and generally occupies quite a lot of memory. The lookup process also ends up being computationally heavy, and therefore the Krovetz stemmer becomes quite slow and somewhat inapplicable to large text repositories, where traditional, slightly less accurate rule-based stemmers are much faster and more easily applied.
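
    Here is a sketch of the hybrid lookup described above, with a toy dictionary and toy rules; the real Krovetz stemmer uses a full English lexicon and much richer inflectional rules:

        DICTIONARY = {"run", "runner", "happy", "index"}  # toy lexicon

        def krovetz_style_stem(word, rules=("ies", "ing", "ed", "es", "s")):
            if word in DICTIONARY:        # already a proper word: keep it
                return word
            for suffix in rules:          # otherwise strip, rule-style
                if word.endswith(suffix):
                    candidate = word[:-len(suffix)]
                    if candidate in DICTIONARY:   # accept only real words
                        return candidate
            return word

        # krovetz_style_stem("indexes") == "index"; unmatched forms pass through.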

  6. Xerox Stemmer

    Computational linguists at Xerox developed a stemmer based on a lexicon database that maps surface forms to base forms, with the infrastructure to quickly transform surface-level words, i.e., words with suffixes and prefixes, into search terms for the database. At a generic level, the stemmer works as follows:

    1. It changes the original words using an inflectional database: nouns from plural to singular, verbs to the infinitive form, pronouns to the nominative, and adjectives to the positive form.
    2. These changed words are used to find the related stems in a derivational database, based on both form and semantics. The use of semantics here reduces the mis-stemming errors that are often prevalent in the Porter and Snowball stemmers. This derivational database removes not only suffixes but also prefixes, unlike most other stemmers, which focus solely on removing suffixes.
    3. Special tables handle irregular forms and exceptions.

    While the Xerox stemmer is technologically stronger, with its prefix-removal process and its derivational and inflectional databases, it is also limited by the curse of a dictionary-like database, which must be updated every few years and needs maintenance and storage space. It also cannot handle suffixes or prefixes that are missing from the database, and is therefore vulnerable to constant changes in the language.
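
    Conceptually, the two-database pipeline reduces to a pair of lookups; the entries below are toys standing in for the full lexical databases:

        INFLECTION_DB = {"mice": "mouse", "ran": "run", "better": "good"}
        DERIVATION_DB = {"unhappiness": "happy", "government": "govern"}

        def xerox_style_stem(word):
            base = INFLECTION_DB.get(word, word)   # step 1: undo inflection
            return DERIVATION_DB.get(base, base)   # step 2: derivational root

    Note that the derivational lookup maps "unhappiness" to "happy", illustrating how this design removes prefixes as well as suffixes.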

  7. Statistical stemmers:

    While we have discussed rule-based and dictionary- or database-dependent stemmers up to this point in the article, the stemming literature also contains stemmers based on statistical methods. The main stemmers in this genre are:

    1. N-gram stemmer
    2. Hidden Markov model-based stemmer
    3. YASS-based stemmer
    These stemmers each require a somewhat different contextual background in machine learning and statistics. Although such methods are not unfamiliar on this site, considering the scope of this post I will not go into their details. To give a high-level idea, these stemmers use contextual information, and often a hidden language model, to recover the original stem from the raw word; the n-gram stemmer, for instance, clusters words by character n-gram similarity, as sketched below. You can read further in this paper on a comparative study of stemming algorithms.
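
    Just as a flavor, the core of the n-gram stemmer is a character n-gram similarity, here the Dice coefficient over unique bigrams; word pairs scoring above a chosen threshold get clustered under one stem:

        def bigrams(word):
            return {word[i:i + 2] for i in range(len(word) - 1)}

        def dice_similarity(a, b):
            ga, gb = bigrams(a), bigrams(b)
            return 2 * len(ga & gb) / (len(ga) + len(gb))

        # dice_similarity("statistics", "statistical") is high (same cluster);
        # dice_similarity("statistics", "stemming") is low (different clusters).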

Stemming in other languages and conclusion:

We have discussed only English stemming algorithms in this article. It is important to note that stemming is not applicable to every language, but mostly to Indo-European languages; one famous language to which we cannot apply stemming is Chinese. Still, similar algorithms are being developed by computer scientists and linguists for other languages, and significant work has followed in the last few years. Being a Bengali myself, I conclude this post with a few references to advances in Bengali stemmers and their implementations.

  1. Paper on a rule-based Bangla stemmer
  2. Implementation of a Bangla stemmer
The paper on the rule-based Bangla stemmer also includes a good collection of the history of the other types of stemmers. Beyond Aryan-rooted languages, works like the Tamil stemmer also focus on India's Dravidian-rooted languages. So, if someone is willing to research stemming for Indo-Aryan and other Indian languages, these references can be a good starting point. I hope you liked the article. Please share it with your colleagues and friends, and leave a comment if you spot any inadvertent mistake.
