There are several words that use "-ing" as the present continuous, like "shining". But when I try to lemmatize "shining" using NLTK, it changes into "shin". The code is this:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
word = "shining"
newlemma = wordnet_lemmatizer.lemmatize(word, 'v')
print(newlemma)
Even without using 'v', the output is still the same "shining" and doesn't change.
I'm expecting the output "shine".
Can anybody help? Thanks.
Because of the way WordNet applies rules and exception lists when searching for the root form.
It has a list of rules specifically for removing word endings, for instance:
"ing" -> ""
"ing" -> "e"
It applies the rules and checks whether the resulting word form exists in WordNet. So for instance, with "mining", it would try "min" and not find anything; then it would try "mine" (second rule), find that "mine" is a valid word, and return it. But with "shining", it likely tries "shin", finds "shin" in the list of valid words, and believes this to be the proper root, so it returns it.
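You can watch this happen directly. A small sketch, checked against recent NLTK versions (note that _morphy is an internal helper, so its exact return values may vary; the lemmatizer currently picks the shortest surviving candidate, which is how "shin" beats "shine"):

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# _morphy applies the suffix rules ("ing" -> "e", "ing" -> "", ...) and keeps
# every candidate that exists in WordNet with the requested part of speech
print(wordnet._morphy("mining", wordnet.VERB))   # ['mine'] -- "min" is not a verb
print(wordnet._morphy("shining", wordnet.VERB))  # ['shine', 'shin'] -- both are verbs

# the lemmatizer then picks one of the candidates
print(WordNetLemmatizer().lemmatize("shining", 'v'))  # 'shin'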
I would like to know how you would find all the variations of a word, or the words that are related or very similar to the original word, in Python.
An example of the sort of thing I am looking for is like this:
word = "summary" # any word
word_variations = find_variations_of_word(word) # a function that finds all the variations of a word; this is what I want to know how to make
print(word_variations)
# What it should print out: ["summaries", "summarize", "summarizing", "summarized"]
This is just an example of what the code should do. I have seen other similar questions on this same topic, but none of them were accurate enough. I found some code and altered it on my own, which kinda works, but not the way I would like it to.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def find_inflections(word):
    inflections = []
    for synset in wordnet.synsets(word):  # find all synsets for the word
        for lemma in synset.lemmas():  # find all lemmas for each synset
            inflected_form = lemma.name().replace("_", " ")  # get the form of the lemma
            if inflected_form != word:  # only add it if it differs from the original word
                inflections.append(inflected_form)
    return inflections
word = "summary"
inflections = find_inflections(word)
print(inflections)
# Output: ['sum-up', 'drumhead', 'compendious', 'compact', 'succinct']
# What the Output should be: ["summaries", "summarize", "summarizing", "summarized"]
This probably isn't of any use to you, but may help someone else who finds this with a search -
If the aim is just to find the words, rather than specifically to use a machine-learning approach to the problem, you could try using a regular expression (regex).
W3Schools seems to cover enough to get the result you want here, or there is a more technical overview on python.org.
To search case-insensitively for the specific words you listed, the following would work:
import re

string = "A SUMMARY ON SUMMATION: " \
         "We use summaries to summarize. This action is summarizing. " \
         "Once the action is complete things have been summarized."

# find every word that starts with "summ", ignoring case
occurrences = re.findall("summ[a-zA-Z]*", string, re.IGNORECASE)
print(occurrences)
However, depending on your precise needs you may need to modify the regular expression as this would also find words like 'summer' and 'summon'.
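For instance, a tighter pattern that whitelists only the endings from your list would skip those words (a hypothetical refinement; extend the alternatives as needed):

import re

text = "We use summaries to summarize; summarizing helps once things are summarized, even in summer."

# anchor to word boundaries and allow only the known endings of "summar-"
pattern = r"\bsummar(?:y|ies|ize[sd]?|izing)\b"
print(re.findall(pattern, text, re.IGNORECASE))  # 'summer' is no longer matched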
I'm not very good at regex but they can be a powerful tool if you know precisely what you are looking for and spend a little time crafting the right expression.
Sorry this probably isn't relevant to your circumstance but good luck.
I want to remove repeating characters from a sentence but make it so that the words still retain their meaning (if they have any). For example: I'm so haaappppyyyy about offline school
should become I'm so happy about offline school. See, haaappppyyyy became happy, while offline and school stayed the same instead of becoming ofline and schol.
I've tried two solutions, using re and itertools, but neither really fits what I'm searching for.
Using regex:
import re

tweet = "I'm so haaappppyyyy about offline school"

# collapse every run of 2+ identical characters down to exactly 2
repeat_char = re.compile(r"(.)\1{1,}", re.IGNORECASE)
tweet = repeat_char.sub(r"\1\1", tweet)

# collapse runs of 3+ identical characters down to 1
tweet = re.sub(r"(.)\1{2,}", r"\1", tweet)
output:
I'm so haappyy about offline school  # it keeps 2 chars for every run of repeated chars
Using itertools:
import itertools

tweet = "I'm so happy about offline school"

# groupby collapses every run of identical consecutive characters to one
tweet = ''.join(ch for ch, _ in itertools.groupby(tweet))
output:
I'm so hapy about ofline schol
How can I fix this? Should I make a list of words I want to exclude?
In addition, I want it to also be able to reduce words that follow a pattern to their base form. For example:
wkwk (base form)
wkwkwkwk
wkwkwkwkwkwkwk
I want to make the second and the third word into the first word, the base form
You can combine regex and NLP here by iterating over all words in a string; once you find one with identical consecutive letters, reduce them to at most 2 consecutive occurrences of the same letter and run an automatic spellcheck to fix the spelling.
See an example in Python:
import re
from textblob import Word

tweet = "I'm so haaappppyyyy about offline school"

# a letter repeated three or more times in a row
rx = re.compile(r'([^\W\d_])\1{2,}')

# for each word: if it contains such a run, squash the runs down to two letters
# and pass the word through TextBlob's spellchecker; otherwise keep it unchanged
print(re.sub(
    r'[^\W\d_]+',
    lambda x: Word(rx.sub(r'\1\1', x.group())).correct() if rx.search(x.group()) else x.group(),
    tweet,
))
# => "I'm so happy about offline school"
The code uses the TextBlob library, but you may use any spellchecking library you like.
Note that ([^\W\d_])\1{2,} matches any letter repeated three or more times in a row, while [^\W\d_]+ matches one or more letters.
This answer was originally written for Regex to reduce repeated chars in a string which was closed as duplicate before I could submit my post. So I "recycled" it here.
Regex is not always the best solution
Regex for validation of formats or input
A regex is often used for low-level pattern recognition and substitution.
It may be useful for validation of formats. You can see it as "dumb" automation.
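For example, a format check happily accepts well-formed nonsense (a minimal sketch; the date pattern is only an illustration):

import re

# "dumb" automation: the pattern checks the shape of the input, not its meaning
matches = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2023-02-31")
print(bool(matches))  # True, even though February 31st does not exist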
Linguistics (NLP)
When it comes to natural language (NLP), or here spelling (dictionary lookup), the semantics may play a role. Depending on the context, "ass" and "as" may both be correctly spelled, although the semantics are very different.
(I apologize for the rude examples, but I am not a native speaker and those two had the most distinct meanings depending on reduplication.)
For those cases a regex or simple pattern recognition may not be sufficient. Applying it correctly can take more effort than researching a language-specific library or solution (including a basic application).
Examples for spelling that a regex may struggle with
Consider the difference between "haappy" (orthographically invalid, but only because of the duplicated vowels "aa", not the consonants "pp"), "yeees" (contains no duplicates in correct spelling), and "kiss" (correctly spelled with duplicate consonants).
Spelling correction requires more
For example, a dictionary to look up whether duplicated characters (vowels or consonants) are valid for the correct spelling of the word in its form.
Consider a spelling-correction module
You could use the textblob module for spelling correction:
To install:
pip install textblob
Example for some test-cases (independent words):
from textblob import TextBlob
incorrect_words = ["cmputr", "yeees", "haappy"] # incorrect spelling
text = ",".join(incorrect_words) # join them as comma separated list
print(f"original words: {text}")
b = TextBlob(text)
# prints the corrected spelling
print(f"corrected words: {b.correct()}")
Prints:
original words: cmputr,yeees,haappy
corrected words: computer,eyes,happy
Surprise: you might have expected "yes" (so did I). But the correction does not simply remove the duplicated vowels "ee"; it rearranges the letters to keep almost all of them (4 of 5, only one "e" removed).
Example for the given sentence:
from textblob import TextBlob
tweet = "I'm so haaappppyyyy about offline school" # either escape or use different quotes when a single-quote (') is enclosed
print(TextBlob(tweet).correct())
Prints:
I'm so haaappppyyyy about office school
Unfortunately, quite a bit worse:
not "happy"
semantically out of scope with "office" instead of "offline"
Apparently a preceding cleaning step using regex, like Wiktor suggests, may ameliorate the result.
See also:
Stackabuse: Spelling Correction in Python with TextBlob, tutorial
documentation: TextBlob: Simplified Text Processing
Well, first of all you need a list (or set) of all allowed words to compare with.
I'd approach it with the assumption (which might be wrong) that no words contain sequences of more than two repeating characters. So for each word generate a list of all potential candidates; for example, "haaappppppyyyy" would yield ["haappyy", "happyy", "happy", etc.]. Then it's just a matter of checking which of those candidates actually exists by comparing against the allowed word list.
The time complexity of this is quite high, though, so if it needs to go fast then throw a hash table on it or something :)
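A minimal sketch of that idea (the tiny valid_words set here is a stand-in for a real dictionary such as nltk.corpus.words; using a set gives the fast lookups mentioned above):

import itertools

valid_words = {"happy", "offline", "school"}  # stand-in for a real word list

def candidates(word):
    # split into runs of identical characters: "haaappppyyyy" -> ["h", "aaa", "pppp", "yyyy"]
    runs = ["".join(group) for _, group in itertools.groupby(word)]
    # every run may be shortened to one or two characters
    options = [{run[:1], run[:2]} for run in runs]
    return {"".join(parts) for parts in itertools.product(*options)}

def reduce_word(word):
    matches = candidates(word) & valid_words
    # if several candidates are real words, picking one is a judgment call
    return min(matches, key=len) if matches else word

print(reduce_word("haaappppyyyy"))  # -> 'happy'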
I am trying to lemmatise words like "Escalation" to "Escalate" using nltk.stem's WordNetLemmatizer.
from nltk.stem import WordNetLemmatizer

word_lem = WordNetLemmatizer()
print(word_lem.lemmatize("escalation", pos="n"))
Which pos tag should be used to get a result like "escalate"?
First, please notice that:
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Now, if you desire to obtain a canonical form for both "escalation" and "escalate", you can use a stemmer, e.g., the Porter stemmer.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem("escalate"))    # escal
print(ps.stem("escalation"))  # escal
Although the result, "escal", is not a real word, it is the same for both words.
This code loops through every word in words.words() from the nltk library and pushes the lowercased word into an array. Then it checks every word in the array to see if it is an actual word using the same library, and somehow many of them come out as strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
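You can confirm the case-sensitivity issue directly (a small sketch; the set also avoids the very slow in words.words() list scan performed on every iteration above):

from nltk.corpus import words

word_set = set(words.words())  # set lookups are O(1); list lookups scan every entry

# proper nouns are stored capitalized, so the lowercased copies fail the lookup
print("Aani" in word_set)  # True
print("aani" in word_set)  # False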
I am experimenting with the Python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, reducing words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I am attempting to get?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
from nltk import stem

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
stemmer = stem.PorterStemmer()  # assumed; the original snippet never defines `stemmer`

for word in words:
    print(stemmer.stem(word))

# output is:
# forgot forgotten there' myself remuner
There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
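A quick illustration of the difference on the poster's own word list (a sketch using NLTK's Porter stemmer and WordNet lemmatizer):

from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["forgot", "forgotten", "there's", "myself", "remuneration"]

# stemming: fast, but the tokens are not guaranteed to be real words
print([PorterStemmer().stem(w) for w in words])

# lemmatization: returns dictionary forms, but wants a part-of-speech hint
print([WordNetLemmatizer().lemmatize(w, 'v') for w in words])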
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.
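For instance, through nltk.corpus.wordnet (a minimal sketch; morphy returns None when WordNet has no match for the given part of speech):

from nltk.corpus import wordnet

# morphy looks a word form up in WordNet and returns its base form
print(wordnet.morphy("forgotten", wordnet.VERB))     # forget
print(wordnet.morphy("remuneration", wordnet.NOUN))  # remuneration (already a lemma)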