Best stemming algorithm in NLTK, Python - python

I am trying to stem the word tokens I get after tokenizing the data using PorterStemmer but am getting incorrect results. Which stemming algorithm would be the best one to go with?
Code:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("mobile"))
Output:
mobil
Expected output:
mobile

You might be looking for lemmatization and not stemming. Check out https://www.guru99.com/stemming-lemmatization-python-nltk.html.
Stemming reduces a word to its root or base, which need not be a real word. Lemmatization reduces a word to its uninflected dictionary form (e.g. the infinitive for verbs).
The stem of "mobile" is "mobil" because of related words like "mobility". In this case the shared root does not include the final e.
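As a quick check of the point above, the sketch below runs NLTK's PorterStemmer over "mobile" and one related word. The word list is only an illustrative assumption. The stemmer itself needs no corpus download.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Illustrative word list; only "mobile" comes from the question.
words = ["mobile", "mobility"]

# Map each word to its Porter stem.
stems = {word: porter.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both words should reduce to the same stem, which is why the final e is dropped. If the output has to be a real word, a lemmatizer such as WordNetLemmatizer is the better fit.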

Related

How to lemmatise nouns?

I am trying to lemmatise words like "escalation" to "escalate" using WordNetLemmatizer from nltk.stem.
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()
print(word_lem.lemmatize("escalation", pos="n"))
Which pos tag should be used to get a result like "escalate"?
First, please note that:
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Now, if you want a single canonical form for both "escalation" and "escalate", you can use a stemmer, e.g. the Porter stemmer.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("escalate"))
print(ps.stem("escalation"))
The result, escal, is not a real word, but it is the same for both words.

How to exclude certain names and terms from stemming (Python NLTK SnowballStemmer (Porter2))

I am newly getting into NLP, Python, and posting on Stackoverflow at the same time, so please be patient with me if I might seem ignorant :).
I am using SnowballStemmer in Python's NLTK in order to stem words for textual analysis. While lemmatization seems to understem my tokens, the snowball porter2 stemmer, which I read is mostly preferred to the basic porter stemmer, overstems my tokens. I am analyzing tweets including many names and probably also places and other words which should not be stemmed, like: hillary, hannity, president, which are now reduced to hillari, hanniti, and presid (you probably guessed already whose tweets I am analyzing).
Is there an easy way to exclude certain terms from stemming? Conversely, I could also merely lemmatize tokens and include a rule for common suffixes like -ed, -s, …. Another idea might be to merely stem verbs and adjectives as well as nouns ending in s. That might also be close enough…
This is the code I am using at the moment:
# LEMMATIZE AND STEM WORDS
import nltk
from nltk.stem.snowball import EnglishStemmer
lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()
def lemmatize_text(text):
    # text is a list of tokens
    return [lemmatizer.lemmatize(w) for w in text]
def snowball_stemmer(text):
    # text is a list of tokens
    return [snowball.stem(w) for w in text]
# APPLY FUNCTIONS
tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)
I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.
Thanks!
Have you looked at named entity recognition (NER)? You can preprocess your text to locate all named entities and exclude them from stemming. After stemming the remaining tokens, you can merge the two sets again.
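A full NER pass is one option. A simpler stand-in, sketched below, is a hand-maintained set of protected terms that bypass the stemmer. The names PROTECTED and stem_unless_protected are hypothetical, and the term list is an assumption taken from the examples in the question.

```python
from nltk.stem.snowball import SnowballStemmer

snowball = SnowballStemmer("english")

# Hypothetical list of terms that must survive unchanged.
# In practice it could be filled from the output of an NER step.
PROTECTED = {"hillary", "hannity", "president"}


def stem_unless_protected(tokens, protected=PROTECTED):
    """Stem every token except those listed in `protected`."""
    return [w if w.lower() in protected else snowball.stem(w) for w in tokens]


tokens = ["hillary", "hannity", "president", "running"]
stemmed = stem_unless_protected(tokens)
print(stemmed)
```

With a column of token lists like the one in the question, this could be applied as tweets.text_processed.apply(stem_unless_protected).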

Keyword extraction: same word in plural/singular/past_tense/-ing format

When extracting keywords from a text, I realized that I get back mostly the same words in different formats. Is there a way to enable the same word to show up only once?
Example: updated updates update updating | research researched researchers | files filed file
Code, using the Summa (TextRank) package:
from summa import keywords
k_words = keywords.keywords(str(document), words=10, ratio=0.2, language='english')
You need to stem or lemmatize the text before extracting keywords from it (and also remove stop words and punctuation). NLTK has built-in stemmers and lemmatizers that you can use:
For stemming:
import nltk
from nltk.stem import PorterStemmer
porter = PorterStemmer()
print(porter.stem("cats")) # => cat
print(porter.stem("trouble")) # => troubl
print(porter.stem("troubling")) # => troubl
print(porter.stem("troubled")) # => troubl
From DataCamp:
"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."
For lemmatization:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
wordnet_lemmatizer.lemmatize("has") # => has
wordnet_lemmatizer.lemmatize("was") # => wa (pos defaults to "n", i.e. noun; pass pos="v" to treat the word as a verb)
From DataCamp:
"Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words."
You can read more about Stemming and Lemmatization with Python NLTK in this article.
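To collapse variants like the ones in the question, one possible approach is to group the keywords by stem and keep one representative per group. This is a minimal sketch and not part of Summa. The function name dedupe_by_stem and the choice of the shortest variant as representative are assumptions.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()


def dedupe_by_stem(words):
    """Keep one word per Porter stem, preferring the shortest variant."""
    groups = {}
    for word in words:
        groups.setdefault(porter.stem(word), []).append(word)
    # dicts preserve insertion order, so groups come out in first-seen order
    return [min(variants, key=len) for variants in groups.values()]


keywords_found = [
    "updated", "updates", "update", "updating",
    "research", "researched", "researchers",
    "files", "filed", "file",
]
unique_keywords = dedupe_by_stem(keywords_found)
print(unique_keywords)
```

With the Porter stemmer, the ten example words should collapse to update, research and file. Keeping a surface form rather than the stem itself avoids showing non-words such as updat in the final keyword list.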

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.
So far I have found that the stem package can transform "cars" to "car" and "women" to "woman", but I cannot lemmatize words with affixes, such as "acknowledgement".
WordNetLemmatizer() returns "acknowledgement" for "acknowledgement", and PorterStemmer() returns "acknowledg" rather than "acknowledge".
Can anyone tell me how to strip the affixes from such words?
For example, when the input is "acknowledgement", the output should be "acknowledge".
Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply applies a fixed set of rules, so the only way to get "acknowledge" out of it is to change the rules at the source, which is not the right way to fix your problem.
What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this online on WordNet.
There are quite a few WordNet-based libraries that you can use for this (e.g. JWNL in Java). In Python, NLTK should be able to get the derivationally related form you saw online:
from nltk.corpus import wordnet as wn
acknowledgment_synset = wn.synset('acknowledgement.n.01')
# lemmas() is a method in NLTK 3; older releases exposed it as an attribute
acknowledgment_lemma = acknowledgment_synset.lemmas()[1]
print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]

Performing Stemming outputs gibberish/concatenated words

I am experimenting with the python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
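from nltk import stem
# Assumption: the question does not show how stemmer was created; PorterStemmer is used here
stemmer = stem.PorterStemmer()
words = ["forgot","forgotten","there's","myself","remuneration"]
for word in words:
    print(stemmer.stem(word))
# output is:
# forgot forgotten there' myself remuner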
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
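The difference between the two can be sketched without any library. The suffix rules and the lookup table below are deliberately tiny, hypothetical stand-ins (they are neither the Porter algorithm nor WordNet). They are only meant to show why a stemmer can emit non-words while a lemmatizer returns dictionary forms.

```python
# Toy suffix rules, tried in order; NOT the Porter algorithm.
SUFFIXES = ("ation", "ing", "en", "ed", "'s", "s")

# Toy dictionary mapping inflected forms to lemmas; a real lemmatizer
# would consult a full vocabulary such as WordNet.
LEMMAS = {"forgot": "forget", "forgotten": "forget", "cars": "car"}


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for w in ["forgot", "forgotten", "there's", "myself", "remuneration"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The toy stemmer turns "forgotten" into the non-word "forgott", while the lookup maps it to "forget". The price of the lookup is that words missing from the table, such as "remuneration" here, pass through untouched.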
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
