
A comprehensive article on Stemming



Introduction:

In natural language processing, we face one challenge every day: cleaning text properly. To clean text, we often use two methods to modify it, namely stemming and lemmatization. In this post, we will analyze stemming and lemmatization in detail.

What are stemming and lemmatization?

For grammatical reasons, we tend to use words that share the same core but end with different suffixes, creating different parts of speech as well as different meanings. For example, cars, car, car's, and cars' all have car as the main part, and the last parts make the difference in meaning. When analyzing text statistically, we need to consider the core parts of these words, to avoid redundancy of information, among other reasons.

This is the context in which stemming and lemmatization enter. It is important to note that stemming algorithms were originally developed for a purpose quite different from general natural language processing: the main goal of stemming is to produce the root of any morphological variation of a word in order to improve information retrieval (IR) quality; i.e., it is a product of IR research. Both stemming and lemmatization attempt to reduce a word to its irreducible core part. Stemming refers to algorithmically reducing a word to its core part, i.e., the stem or root (don't confuse these terms with their grammatical meanings); whether that core part is a subword or a word is not important. Lemmatization also algorithmically reduces a word to its core part, but it ensures that the core part is itself a word, often using a vocabulary or word dictionary. Therefore, with stemming you may end up with subwords in your text, while lemmatization always guarantees that the changed words are real words. In this post, we will focus on stemming algorithms; in the following post, we will discuss the different varieties of lemmatization.

How does stemming work?

Stemming is the process of reducing the morphological variants of a word to a root/base form. There are three main problems in stemming: (1) under-stemming, (2) over-stemming, and (3) mis-stemming. At a high level, a stemmer is either a set of rules, a dictionary-guided lookup algorithm, or a hybrid of the two. A rule-based stemmer matches different types of suffixes in a word and removes them according to its rules, finally reaching the stem. A dictionary-based stemmer uses the original word to look up the relevant stem in a dictionary of words. A hybrid approach mostly uses a set of rules for lookup and easy stemming, while using a dictionary for disambiguation, minimizing mis-stemming, and ruling out exceptions.
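
To make this taxonomy concrete, here is a minimal sketch of the rule-based and hybrid approaches; the suffix rules and the dictionary below are hypothetical toys, not any published stemmer:

    # Hypothetical suffix rules and a toy lookup table, for illustration only.
    SUFFIX_RULES = ["ational", "iveness", "ing", "ies", "ed", "s"]
    STEM_DICTIONARY = {"ran": "run", "mice": "mouse"}

    def rule_based_stem(word):
        # Rule-based: strip the longest matching suffix, keeping a minimal stem.
        for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                return word[:-len(suffix)]
        return word

    def hybrid_stem(word):
        # Hybrid: consult the dictionary first (irregular forms), rules otherwise.
        return STEM_DICTIONARY.get(word, rule_based_stem(word))

On these lines, numerous stemmers for English have been created. Some of these stemmers are: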

  1. Lovins Stemmer

    The first stemmer written was the Lovins stemmer, by Julie Beth Lovins. According to Wikipedia:

    "This paper was remarkable for its early date and had a great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms, by Professor John W. Tukey of Princeton University, the algorithm developed at Harvard University by Michael Lesk, under the direction of Professor Gerard Salton, and a third algorithm developed by James L. Dolby of R and D Consultants, Los Altos, California."

    The Lovins stemmer was the first to categorize suffix endings and then organize the removal of suffixes based on conditions, which is still pretty much the core of all rule-based stemmers.

    "It [Lovins stemmer] performs a lookup on a table of 294 endings, 29 conditions and 35 transformation rules, which have been arranged on the longest-match principle. The Lovins stemmer removes the longest suffix from a word. Once the ending is removed, the word is recoded using a different table that makes various adjustments to convert these stems into valid words. It always removes a maximum of one suffix from a word, due to its nature as a single-pass algorithm. The advantage of this algorithm is it is very fast and can handle the removal of double letters in words like ‘getting’ being transformed to ‘get’ and also handles many irregular plurals like – mouse and mice, index and indices, etc. Drawbacks of the Lovins approach are that it is time and data consuming. Furthermore, many suffixes are not available in the table of endings. It is sometimes highly unreliable and frequently fails to form words from the stems or to match the stems of like-meaning words. The reason being the technical vocabulary being used by the author." -- a description of Lovins stemmer from comparative study of stemming algorithms by Anjali G.jivani.

  2. Dawson Stemmer

    J. L. Dawson invented the Dawson stemmer in 1974, shortly after the Lovins stemmer. It extends the Lovins approach with a list of more than a thousand suffixes of the English language. Here is the generic algorithm for the Dawson stemmer:

    1. Get the input word
    2. Get the matching suffix
    3. The suffix pool is reverse indexed by length and the last character
    4. Remove the longest suffix from the word with exact match
    5. Recode the word using a mapping table
    6. Convert stem into a valid word

    The advantage of the Dawson stemmer is that it is a single-pass stemmer. But the main problem is that, again, it relies on a mapping table and a reverse-indexed store of suffixes. Although that does not make it as slow as a dictionary-based stemmer, it is still slower than traditional rule-based stemmers such as Porter's algorithm.
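
    As a sketch of steps 2-4 above, the suffix pool can be reverse indexed by its last character and sorted by length, so only a small bucket is scanned per word; the suffix list here is a toy stand-in for Dawson's list of over a thousand entries:

        from collections import defaultdict

        SUFFIXES = ["ations", "ation", "ness", "ing", "ed", "s"]  # toy pool
        INDEX = defaultdict(list)
        for sfx in SUFFIXES:
            INDEX[sfx[-1]].append(sfx)
        for bucket in INDEX.values():
            bucket.sort(key=len, reverse=True)  # longest match first

        def dawson_style_stem(word):
            # Only suffixes ending in the word's last character are checked.
            for sfx in INDEX.get(word[-1], []):
                if word.endswith(sfx) and len(word) - len(sfx) >= 2:
                    return word[:-len(sfx)]  # recoding (steps 5-6) omitted here
            return word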

  3. Porter’s Stemmer algorithm

    Porter's stemming algorithm was first published in 1980 by M. F. Porter, and it has since been widely reimplemented and rewritten in dozens of different languages. The Porter stemmer is a five-step process that iteratively removes suffixes based on a number of conditions specified at each step. Thanks to this step-by-step design, it can remove compound suffixes, while the conditions reduce over-stemming. The conditions are mainly based on the "measure" of the word before the ending (roughly, how many vowel-consonant sequences it contains) and on what type of suffix sits at the end of the word. Each set of rules also removes the longest possible suffix allowed by the conditions, a similarity to its predecessors Lovins and Dawson. As the algorithm is quite detailed, I will provide proper links to read the details of the stemming process (a small sketch of the measure computation follows these links):

    1. Porter stemming process
    2. Porter stemming original paper
    3. Research on advances of Porter stemming
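
    As a taste of the conditions, here is a small helper computing Porter's measure m, under which a word is viewed as [C](VC)^m[V] with maximal vowel and consonant runs; this is only one ingredient of the full algorithm, not the algorithm itself:

        def is_vowel(word, i):
            # a, e, i, o, u are vowels; 'y' counts as one after a consonant.
            if word[i] in "aeiou":
                return True
            return word[i] == "y" and i > 0 and not is_vowel(word, i - 1)

        def porter_measure(stem):
            # m equals the number of vowel-run -> consonant-run transitions.
            pattern = "".join("v" if is_vowel(stem, i) else "c"
                              for i in range(len(stem)))
            return pattern.count("vc")

        # porter_measure("tree") == 0, porter_measure("trouble") == 1,
        # porter_measure("troubles") == 2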

  4. Snowball stemming language

    Snowball is the standard language developed by M. F. Porter for rigorous and unambiguous implementation of stemming algorithms. Once an algorithm is properly defined in Snowball, Snowball can generate the stemming program directly in ANSI C or Java. It is the language currently used by both academics and industry practitioners to write new, standard stemmers.

    Using Snowball, Porter revised his stemmer and published an enhanced and improved version called Porter2. This stemmer is sometimes known as the Snowball stemmer too; NLTK also refers to Porter2 as the Snowball stemmer.

    For Porter's stemmer and the Snowball stemmer, there are NLTK modules that implement these processes. The link for how to use them is the following: NLTK resources for Porter stemmer.
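
    A quick usage sketch with NLTK's built-in classes (assuming NLTK is installed):

        from nltk.stem import PorterStemmer, SnowballStemmer

        porter = PorterStemmer()
        snowball = SnowballStemmer("english")  # the Porter2 algorithm

        for word in ["caresses", "ponies", "relational", "generously"]:
            print(word, "->", porter.stem(word), "/", snowball.stem(word))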

  5. Krovetz Stemmer

    The Krovetz stemmer was proposed in 1993 by Robert Krovetz. While we have mainly discussed algorithm-based stemmers up to this point, the Krovetz stemmer is the first hybrid approach, combining a dictionary with a set of rules. Given a word, this stemmer first looks it up in the dictionary to see whether it is already a proper word without suffixes; if it is, there is no reason to stem it. If it is not in the dictionary, the word is stemmed using a set of rules, just like the other stemmers, and the result is looked up in the dictionary again. In this way, the Krovetz stemmer ensures that we reach a word as the stem, and not a non-word, individually nonsensical subword.

    While the Krovetz stemmer always reaches a word after stemming and also tends to have very little mis-stemming, its problem is the dictionary. The dictionary, as we know, needs regular maintenance and updates and generally occupies quite a lot of memory. The lookup process also ends up being computationally heavy, and therefore the Krovetz stemmer becomes quite slow and somewhat inapplicable to large text repositories, where traditional, slightly less accurate rule-based stemmers are much faster and more easily applied.
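
    Here is a sketch of the hybrid lookup described above, with a toy dictionary and toy rules; the real Krovetz stemmer uses a full English lexicon and much richer inflectional rules:

        DICTIONARY = {"run", "runner", "happy", "index"}  # toy lexicon

        def krovetz_style_stem(word, rules=("ies", "ing", "ed", "es", "s")):
            if word in DICTIONARY:        # already a proper word: keep it
                return word
            for suffix in rules:          # otherwise strip, rule-style
                if word.endswith(suffix):
                    candidate = word[:-len(suffix)]
                    if candidate in DICTIONARY:   # accept only real words
                        return candidate
            return word

        # krovetz_style_stem("indexes") == "index"; unmatched forms pass through.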

  6. Xerox Stemmer

    Computational linguists at Xerox developed a stemmer based on a lexicon database that maps surface forms to base forms, with the infrastructure to quickly transform surface-level words, i.e., words with suffixes and prefixes, into search terms for the database. At a generic level, the stemmer works as follows:

    1. It changes the original words using an inflectional database: nouns from plural to singular, verbs to the infinitive form, pronouns to the nominative, and adjectives to the positive form.
    2. These changed words are used to find the related stems in a derivational database, based on both form and semantics. The use of semantics here reduces the mis-stemming errors that are often prevalent in the Porter and Snowball stemmers. This derivational database removes not only suffixes but also prefixes, unlike most other stemmers, which focus solely on removing suffixes.
    3. Special tables handle irregular forms and exceptions.

    While the Xerox stemmer is technologically stronger, with its prefix-removal process and its derivational and inflectional databases, it is also limited by the curse of a dictionary-like database, which must be updated every few years and needs maintenance and storage space. It also cannot handle suffixes or prefixes that are missing from the database, and is therefore vulnerable to constant changes in the language.
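
    Conceptually, the two-database pipeline reduces to a pair of lookups; the entries below are toys standing in for the full lexical databases:

        INFLECTION_DB = {"mice": "mouse", "ran": "run", "better": "good"}
        DERIVATION_DB = {"unhappiness": "happy", "government": "govern"}

        def xerox_style_stem(word):
            base = INFLECTION_DB.get(word, word)   # step 1: undo inflection
            return DERIVATION_DB.get(base, base)   # step 2: derivational root

    Note that the derivational lookup maps "unhappiness" to "happy", illustrating how this design removes prefixes as well as suffixes.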

  7. Statistical stemmers:

    While we have discussed rule-based and dictionary- or database-dependent stemmers up to this point in the article, the stemming literature also contains stemmers based on statistical methods. The main stemmers in this genre are:

    1. N-gram stemmer
    2. Hidden Markov model-based stemmer
    3. YASS-based stemmer
    These stemmers each require a somewhat different contextual background in machine learning and statistics. Although such methods are not unfamiliar on this site, considering the scope of this post I will not go into their details. To give a high-level idea, these stemmers use contextual information, and often a hidden language model, to recover the original stem from the raw word; the n-gram stemmer, for instance, clusters words by character n-gram similarity, as sketched below. You can read further in this paper on a comparative study of stemming algorithms.
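
    Just as a flavor, the core of the n-gram stemmer is a character n-gram similarity, here the Dice coefficient over unique bigrams; word pairs scoring above a chosen threshold get clustered under one stem:

        def bigrams(word):
            return {word[i:i + 2] for i in range(len(word) - 1)}

        def dice_similarity(a, b):
            ga, gb = bigrams(a), bigrams(b)
            return 2 * len(ga & gb) / (len(ga) + len(gb))

        # dice_similarity("statistics", "statistical") is high (same cluster);
        # dice_similarity("statistics", "stemming") is low (different clusters).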

Stemming in other languages and conclusion:

We have discussed only English stemming algorithms in this article. It is important to note that stemming is not applicable to every language, but mostly to Indo-European languages; one famous language to which we cannot apply stemming is Chinese. Still, similar algorithms are being developed by computer scientists and linguists for other languages, and significant work has followed in the last few years. Being a Bengali myself, I conclude this post with a few references to advances in Bengali stemmers and their implementations.

  1. Paper on a rule-based Bangla stemmer
  2. Implementation of a Bangla stemmer
The paper on the rule-based Bangla stemmer also includes a good collection of the history of the other types of stemmers. Beyond Aryan-rooted languages, works like the Tamil stemmer also focus on India's Dravidian-rooted languages. So, if someone is willing to research stemming for Indo-Aryan and other Indian languages, these references can be a good starting point. I hope you liked the article. Please share it with your colleagues and friends, and leave a comment if you spot any inadvertent mistake.
