Performing stemming outputs gibberish/concatenated words - Python

I am experimenting with the python library NLTK for Natural Language Processing.
My problem: I'm trying to perform stemming, i.e. reduce words to their normalised form, but it's not producing correct words. Am I using the stemming class correctly? And how can I get the results I'm after?
I want to normalise the following words:
words = ["forgot","forgotten","there's","myself","remuneration"]
...into this:
words = ["forgot","forgot","there","myself","remunerate"]
My code:
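from nltk import stem

# Assumption: the original snippet never defined `stemmer`; a Porter stemmer is used here.
stemmer = stem.PorterStemmer()
words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
for word in words:
    print(stemmer.stem(word))
# output is:
# forgot forgotten there' myself remuner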

There are two types of normalization you can do at a word level.
Stemming - a quick and dirty hack to convert words into some token which is not guaranteed to be an actual word, but generally different forms of the same word should map to the same stemmed token
Lemmatization - converting a word into some base form (singular, present tense, etc) which is always a legitimate word on its own. This can obviously be slower and more complicated and is generally not required for a lot of NLP tasks.
You seem to be looking for a lemmatizer instead of a stemmer. Searching Stack Overflow for 'lemmatization' should give you plenty of clues about how to set one of those up. I have played with this one called morpha and have found it to be pretty useful and cool.
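To make the difference concrete, here is a toy sketch in plain Python. Neither function is NLTK's implementation: `toy_stem` strips a few hard-coded suffixes (`SUFFIXES` is a made-up rule list), and `toy_lemmatize` looks words up in a tiny hand-written table (`LEMMA_TABLE`, also made up for this example). It only illustrates why a stemmer can return non-words while a lemmatizer returns dictionary forms.

```python
# Toy illustration only: these are NOT NLTK's algorithms.
SUFFIXES = ("ation", "ten", "'s", "s")  # hypothetical rule list, checked in order

# Hypothetical hand-written lookup table standing in for a real lexicon.
LEMMA_TABLE = {
    "forgot": "forget",
    "forgotten": "forget",
    "there's": "there",
    "women": "woman",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)


words = ["forgot", "forgotten", "there's", "myself", "remuneration"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # "remuneration" is cut down to the non-word "remuner"
print(lemmas)  # every result is a real word
```

A real lemmatizer replaces the lookup table with a full lexicon plus part-of-speech information, which is why it tends to be slower than a stemmer.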

Like adi92, I too believe you're looking for lemmatization. Since you're using NLTK you could probably use its WordNet interface.

How to exclude certain names and terms from stemming (Python NLTK SnowballStemmer (Porter2))

I am new to NLP, Python, and posting on Stack Overflow all at the same time, so please be patient with me if I seem ignorant :).
I am using SnowballStemmer in Python's NLTK in order to stem words for textual analysis. While lemmatization seems to understem my tokens, the snowball porter2 stemmer, which I read is mostly preferred to the basic porter stemmer, overstems my tokens. I am analyzing tweets including many names and probably also places and other words which should not be stemmed, like: hillary, hannity, president, which are now reduced to hillari, hanniti, and presid (you probably guessed already whose tweets I am analyzing).
Is there an easy way to exclude certain terms from stemming? Conversely, I could also merely lemmatize tokens and include a rule for common suffixes like -ed, -s, …. Another idea might be to merely stem verbs and adjectives as well as nouns ending in s. That might also be close enough…
I am using below code as of now:
# LEMMATIZE AND STEM WORDS
import nltk
from nltk.stem.snowball import EnglishStemmer

lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()

def lemmatize_text(text):
    # `text` is a list of tokens
    return [lemmatizer.lemmatize(w) for w in text]

def snowball_stemmer(text):
    return [snowball.stem(w) for w in text]

# APPLY FUNCTIONS (`tweets` is presumably a pandas DataFrame with a tokenised `text_processed` column)
tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)
I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.
Thanks!
Have you looked at NER (named entity recognition)? You can preprocess your text to locate all named entities and exclude them from stemming. After stemming, you can merge the data again.
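One way to wire up such an exclusion is a thin wrapper around whatever stemmer you use. The sketch below is only an illustration: `stem_tokens`, `crude_stem` and the `PROTECTED` set are hypothetical names, and `crude_stem` is a stand-in so the example runs without NLTK. The protected terms could be written by hand or collected by an NER step.

```python
def stem_tokens(tokens, stem, protected=frozenset()):
    """Stem each token unless its lowercase form is in `protected`."""
    protected = {term.lower() for term in protected}
    return [tok if tok.lower() in protected else stem(tok) for tok in tokens]


def crude_stem(token):
    """Stand-in stemmer for the demo; swap in a real one in practice."""
    return token[:-1] + "i" if token.endswith("y") else token


# Hypothetical exclusion list; it could be hand-written or filled by an NER step.
PROTECTED = {"hillary", "hannity", "president"}

tokens = ["hillary", "hannity", "rally", "president"]
result = stem_tokens(tokens, crude_stem, PROTECTED)
print(result)  # protected names pass through, "rally" is stemmed
```

With the code from the question, passing `snowball.stem` as the `stem` argument should work the same way, for example `tweets.text_processed.apply(lambda toks: stem_tokens(toks, snowball.stem, PROTECTED))`.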

words.words() from the NLTK corpus seemingly contains strange non-valid words

This code loops through every word in words.words() from the nltk library, lowercases it, and pushes it into an array. It then checks every word in the array against the same word list, and somehow many of them come out as strange words that aren't real at all, like "adighe". What's going on here?
import nltk
from nltk.corpus import words

test_array = []
for i in words.words():
    i = i.lower()
    test_array.append(i)

for i in test_array:
    if i not in words.words():
        print(i)
I don't think there's anything mysterious going on here. The first such example I found is "Aani", "the dog-headed ape sacred to the Egyptian god Thoth". Since it's a proper noun, "Aani" is in the word list and "aani" isn't.
According to dictionary.com, "Adighe" is an alternative spelling of "Adygei", which is another proper noun meaning a region of Russia. Since it's also a language I suppose you might argue that "adighe" should also be allowed. This particular word list will argue that it shouldn't.
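The effect is easy to reproduce without the corpus. In this sketch `sample_words` is a hypothetical four-entry stand-in for `words.words()`; it shows that the printed words are exactly those whose only entry in the list is capitalised, and that a case-insensitive comparison makes them go away.

```python
# Hypothetical four-entry sample standing in for nltk.corpus.words.words().
sample_words = ["Aani", "Adighe", "aardvark", "zebra"]

# Build the lookup set once; testing `in words.words()` inside a loop
# scans a very long list on every iteration.
exact = set(sample_words)
lowered = [w.lower() for w in sample_words]

# Lowercased forms that have no exact, case-sensitive match in the list.
missing = [w for w in lowered if w not in exact]
print(missing)

# Comparing case-insensitively makes the "missing" words disappear.
folded = {w.lower() for w in sample_words}
still_missing = [w for w in lowered if w not in folded]
print(still_missing)
```

Using a set for the lookup should also make the original loop far faster on the full word list.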

Getting a Large List of Nouns (or Adjectives) in Python with NLTK; or Python Mad Libs

Like this question, I am interested in getting a large list of words by part of speech (a long list of nouns; a list of adjectives) to be used programmatically elsewhere. This answer has a solution using the WordNet database in SQL format.
Is there a way to get at such a list using the corpora/tools built into the Python NLTK? I could take a large bunch of text, parse it and then store the nouns and adjectives. But given the dictionaries and other tools built in, is there a smarter way to simply extract the words that are already present in the NLTK datasets, encoded as nouns/adjectives (whatever)?
Thanks.
It's worth noting that Wordnet is actually one of the corpora included in the NLTK downloader by default. So you could conceivably just use the solution you already found without having to reinvent any wheels.
For instance, you could just do something like this to get all noun synsets:
from nltk.corpus import wordnet as wn

for synset in wn.all_synsets('n'):
    print(synset)

# Or, equivalently
for synset in wn.all_synsets(wn.NOUN):
    print(synset)
That example will give you every noun that you want and it will even group them into their synsets so you can try to be sure that they're being used in the correct context.
If you want to get them all into a list you can do something like the following (though this will vary quite a bit based on how you want to use the words and synsets):
all_nouns = []
for synset in wn.all_synsets('n'):
    all_nouns.extend(synset.lemma_names())
Or as a one-liner:
all_nouns = [word for synset in wn.all_synsets('n') for word in synset.lemma_names()]
You should use the Moby Parts of Speech Project data. Don't be fixated on using only what is directly in NLTK by default. It would be little work to download the files for this and pretty easy to parse them with NLTK once loaded.
I saw a similar question earlier this week (can't find the link), but like I said then, I don't think maintaining a list of nouns/adjectives/whatever is a great idea. This is primarily because the same word can have different parts of speech, depending on the context.
However, if you are still dead set on using these lists, then here's how I would do it (I don't have a working NLTK install on this machine, but I remember the basics):
import nltk

nouns = set()
# `my_corpus` is a placeholder for any NLTK corpus reader that provides .sents()
for sentence in my_corpus.sents():
    # each sentence is either a list of words or a list of (word, POS tag) tuples
    # remove the call to nltk.pos_tag if `sentence` is already a list of tuples
    for word, pos in nltk.pos_tag(sentence):
        if pos in ['NN', 'NNP']:  # feel free to add any other noun tags
            nouns.add(word)
Hope this helps
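The caveat about context can be checked on data that is already tagged. The sketch below uses `tagged_sents`, a hypothetical pair of hand-tagged sentences (Penn Treebank tags) standing in for tagger output, and records every tag each word was seen with, so ambiguous words such as "book" show up explicitly.

```python
from collections import defaultdict

# Hypothetical pre-tagged sentences (Penn Treebank tags), standing in for
# the output of a tagger or a tagged corpus.
tagged_sents = [
    [("I", "PRP"), ("book", "VBP"), ("flights", "NNS")],
    [("The", "DT"), ("book", "NN"), ("is", "VBZ"), ("long", "JJ")],
]

# Record every tag each (lowercased) word was seen with.
tags_by_word = defaultdict(set)
for sent in tagged_sents:
    for word, tag in sent:
        tags_by_word[word.lower()].add(tag)

# Any tag starting with "NN" is a noun tag: NN, NNS, NNP, NNPS.
nouns = {w for w, tags in tags_by_word.items()
         if any(t.startswith("NN") for t in tags)}
# Nouns that were also seen with a non-noun tag are ambiguous.
ambiguous = {w for w in nouns
             if any(not t.startswith("NN") for t in tags_by_word[w])}
print(sorted(nouns))
print(sorted(ambiguous))
```

Keeping the tag sets around, rather than a flat noun list, lets you decide later whether ambiguous words belong in your list.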

NLTK words lemmatizing

I am trying to do lemmatization on words with NLTK.
So far I have found that I can use the stem package to transform "cars" into "car" and "women" into "woman", but I cannot lemmatize words with affixes, such as "acknowledgement".
Using WordNetLemmatizer() on "acknowledgement" returns "acknowledgement", and using PorterStemmer() returns "acknowledg" rather than "acknowledge".
Can anyone tell me how to eliminate the affixes of words?
Say, when the input is "acknowledgement", the output should be "acknowledge".
Lemmatization does not (and should not) return "acknowledge" for "acknowledgement". The former is a verb, while the latter is a noun. Porter's stemming algorithm, on the other hand, simply applies a fixed set of rules, so the only way to get that output from it would be to change the rules at the source, which is NOT the right way to fix your problem.
What you are looking for is the derivationally related form of "acknowledgement", and for this, your best source is WordNet. You can check this online on WordNet.
There are quite a few WordNet-based libraries that you can use for this (e.g. JWNL in Java). In Python, NLTK should be able to get the derivationally related form you saw online:
from nltk.corpus import wordnet as wn

acknowledgment_synset = wn.synset('acknowledgement.n.01')
# lemmas() is a method in NLTK 3; older releases exposed it as the attribute `.lemmas`
acknowledgment_lemma = acknowledgment_synset.lemmas()[1]
print(acknowledgment_lemma.derivationally_related_forms())
# [Lemma('admit.v.01.acknowledge'), Lemma('acknowledge.v.06.acknowledge')]

python nltk keyword extraction from sentence

"First thing we do, let's kill all the lawyers." - William Shakespeare
Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords to describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:
[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]
The more general problem I am trying to solve is to distill a sentence to the "most important"* words/tags to summarise the overall "meaning"* of a sentence.
*note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags)
I don't think there's any perfect answer to this question, because there is no gold-standard set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you are able to describe precisely and unambiguously what you want your system to do, your problem will be more than half solved.
Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf dictionary using your data, i.e. build a mapping from every word to a number that correlates with how rare that word is. Bonus points for doing it for larger n-grams as well.
By combining the idf values of each word in your input sentence along with their POS tags, you answer questions of the form 'What is the rarest verb in this sentence?', 'What is the rarest noun in this sentence', etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so maybe trying to find the rarest noun and rarest verb in a sentence and returning just those two will do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see if that seems to do the job better.
Ways to expand this include trying to identify larger phrases using n-gram idf's, or building a full parse tree of the sentence (using, say, the Stanford parser) and looking for patterns in these trees that tell you which parts of the tree the important words tend to sit in.
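The rarest-noun/rarest-verb heuristic can be sketched in a few lines. Everything here is an assumption for illustration: `corpus` is a hypothetical four-document toy collection (real idf values need far more data), `build_idf` and `rarest` are made-up helper names, and the tagged input is the list from the question.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of tokenised documents.
corpus = [
    ["first", "thing", "we", "do"],
    ["lets", "do", "the", "first", "thing"],
    ["lets", "not", "kill", "the", "thing"],
    ["lawyers", "do", "paperwork"],
]


def build_idf(docs):
    """Map each word to log(N / document frequency); rarer words score higher."""
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(len(docs) / df) for word, df in doc_freq.items()}


def rarest(tagged, idf, tag_prefix):
    """Return the highest-idf token whose POS tag starts with `tag_prefix`."""
    candidates = [tok.lower() for tok, tag in tagged if tag.startswith(tag_prefix)]
    # Assumption: words missing from the idf table score 0.0 and are never preferred.
    return max(candidates, key=lambda tok: idf.get(tok, 0.0), default=None)


idf = build_idf(corpus)

tagged = [("First", "NNP"), ("thing", "NN"), ("do", "VBP"),
          ("lets", "NNS"), ("kill", "VB"), ("lawyers", "NNS")]
keywords = (rarest(tagged, idf, "VB"), rarest(tagged, idf, "NN"))
print(keywords)
```

On this toy corpus the result is ('kill', 'lawyers'), but only because the corpus was built so that those two words are the rarest; on real data the outcome depends entirely on the document frequencies you collect.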
One simple approach would be to keep stop word lists for NN, VB etc. These would be high frequency words that usually don't add much semantic content to a sentence.
The snippet below shows distinct lists for each type of word token, but you could just as well employ a single stop word list for both verbs and nouns (such as this one).
stop_words = dict(
    NNP=['first', 'second'],
    NN=['thing'],
    VBP=['do', 'done'],
    VB=[],
    NNS=['lets', 'things'],
)

def filter_stop_words(pos_list):
    # tags without a stop word list (e.g. JJ) are kept unfiltered
    return [[token, token_type]
            for token, token_type in pos_list
            if token.lower() not in stop_words.get(token_type, [])]
In your case, you can simply use the RAKE package for Python (thanks to Fabian) to get what you need:
>>> import RAKE
>>> path = ...  # your path (to a stop word list file)
>>> r = RAKE.Rake(path)
>>> r.run("First thing we do, let's kill all the lawyers")
[('lawyers', 1.0), ('kill', 1.0), ('thing', 1.0)]
The path can point, for example, to this file.
In general, though, you are better off using the NLTK package for NLP work.
