NLTK convert tokenized sentence to synset format - python

I'm looking to get the similarity between a single word and each word in a sentence using NLTK.
NLTK can get the similarity between two specific words as shown below. This method requires that a specific reference to the word is given; in this case it is 'dog.n.01', where dog is a noun and we want to use the first (01) NLTK definition.
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.path_similarity(cat))
>> 0.2
The problem is that I need to get the part-of-speech information for each word in the sentence. The NLTK package can tag the parts of speech for each word in a sentence, as shown below. However, these POS tags ('NN', 'VB', 'PRP'...) don't match the format that the synset takes as a parameter.
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Is it possible to get the synset-formatted data from the pos_tag() results in NLTK? By synset-formatted I mean a format like dog.n.01.

You can use a simple conversion function:
from nltk.corpus import wordnet as wn
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
After tagging a sentence, you can use this function to tie each word in the sentence to a synset. Here's an example:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))
synsets = []
lemmatzr = WordNetLemmatizer()
for token in tagged:
    wn_tag = penn_to_wn(token[1])
    if not wn_tag:
        continue
    lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)
    synsets.append(wn.synsets(lemma, pos=wn_tag)[0])

print(synsets)
Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
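To tie this back to the question, you can then score each of these synsets against your single reference word. Continuing from the snippet above, a minimal sketch (the reference sense 'gift.n.01' is just an illustration; note that path_similarity may return None for synsets with different parts of speech):
reference = wn.synset('gift.n.01')  # assumed reference sense; pick the one that fits your use case
for s in synsets:
    print(s.name(), s.path_similarity(reference))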

You can use the alternative form, wordnet.synsets, which accepts a part of speech and returns the list of matching synsets:
wordnet.synsets('dog', pos=wordnet.NOUN)
You'll still need to translate the tags offered by pos_tag into those supported by wordnet.synsets -- unfortunately, I don't know of a pre-built dictionary doing that, so (unless I'm missing the existence of such a correspondence table) you'll need to build your own (you can do that once and pickle it for subsequent reloading).
See http://www.nltk.org/book/ch05.html, subchapter 1, on how to get help about a specific tagset -- e.g. nltk.help.upenn_tagset('N.*') will confirm that the UPenn tagset (which I believe is the default one used by pos_tag) uses 'N' followed by something to identify variants of what synsets will see as wordnet.NOUN.
I have not tried http://www.nltk.org/_modules/nltk/tag/mapping.html but it might be just what you require -- give it a try!
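A minimal sketch of the build-your-own-mapping-and-pickle idea mentioned above, assuming the four WordNet parts of speech are all you need (the file name pos_map.pickle is just an illustration):
import pickle
from nltk.corpus import wordnet

# Build the correspondence table once: Penn tag prefix -> WordNet POS constant.
penn_prefix_to_wn = {'J': wordnet.ADJ, 'N': wordnet.NOUN,
                     'R': wordnet.ADV, 'V': wordnet.VERB}

with open('pos_map.pickle', 'wb') as f:   # save it once
    pickle.dump(penn_prefix_to_wn, f)

with open('pos_map.pickle', 'rb') as f:   # reload in later sessions
    penn_prefix_to_wn = pickle.load(f)

# Look a Penn tag up by its first letter; returns None for tags WordNet has no POS for.
print(penn_prefix_to_wn.get('NN'[0]))     # 'n', i.e. wordnet.NOUN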

Related

How to remove words from a sentence that carry no positive or negative sentiment?

I'm trying a sentiment-analysis-based approach on YouTube comments, but the comments often contain words like mrbeast, tiger/'s, lion/'s, pewdiepie, james, etc. which do not add any feeling to the sentence. I've gone through NLTK's averaged_perceptron_tagger but it didn't work well, as it gave results like the following.
My input:
"mrbeast james lion tigers bad sad clickbait fight nice good"
Words that I need in my sentence:
"bad sad clickbait fight nice good"
What I got using averaged_perceptron_tagger:
[('mrbeast', 'NN'),
('james', 'NNS'),
('lion', 'JJ'),
('tigers', 'NNS'),
('bad', 'JJ'),
('sad', 'JJ'),
('clickbait', 'NN'),
('fight', 'NN'),
('nice', 'RB'),
('good', 'JJ')]
So, as you can see, if I remove mrbeast, i.e. NN, then words like clickbait and fight will also get removed, which ultimately removes expressions from the sentence.
OK, this is what I do for companies that report on the LSE. You can do something similar with your words.
# define what you consider to be positive, negative or neutral keywords
posKeyWords = ['profit', 'increase', 'pleased', 'excellent', 'good', 'solid financial', 'robust', 'significantly improved', 'improve']
negKeyWords = ['loss', 'decrease', 'disappoint', 'poor', 'bad', 'decline', 'negative', 'weather', 'covid']
neutralKeyWords = ['financial']
keyWords = posKeyWords + neutralKeyWords + negKeyWords
Next you get your data as text (from whatever source you choose) and put it into a string variable:
dataText = resp.text # or whatever source you are reading from
Mine is a response from a web query, but yours could be from a text file or another source.
Next, create an empty dictionary to count the keywords into (hashing is fast):
keyWordSummary = {} # dictionary of keywords & values
Finally, loop through the keywords and put them into the dict.
import re

# look for some keywords
for kw in keyWords:
    kwVal = re.findall(kw, dataText)
    # print('keyword count:', kw, len(kwVal))
    # put the count into the dict
    keyWordSummary[kw] = len(kwVal)
You now have a dictionary of keyword frequencies, which you could for example analyse in a dataframe (that part is outside the scope of this particular question, but a quick sketch follows).
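If you do want to take that one step further, a minimal sketch (assuming pandas is installed and keyWordSummary has been filled as above):
import pandas as pd

# Turn the keyword counts into a one-column dataframe, sorted by frequency.
freq_df = pd.DataFrame.from_dict(keyWordSummary, orient='index', columns=['count'])
print(freq_df.sort_values('count', ascending=False))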
There are multiple ways of doing this, for example:
You can create a set of positive and negative words and, for each word in your text, check whether it exists in the set; if it does, keep the word, otherwise delete it. This, however, first requires a dataset of all positive and negative words.
You can use something like TextBlob, which can give you the sentiment score of a word or a sentence, so with a cutoff sentiment score you can filter out the words that you don't need (see the sketch below).
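A minimal sketch of the TextBlob idea, with an assumed polarity cutoff of 0.1 (tune it to your data); words whose polarity is close to zero are treated as carrying no sentiment:
from textblob import TextBlob

sentence = "mrbeast james lion tigers bad sad clickbait fight nice good"
cutoff = 0.1  # assumed threshold, adjust as needed

# Keep only words whose polarity is clearly positive or negative.
kept = [word for word in sentence.split()
        if abs(TextBlob(word).sentiment.polarity) >= cutoff]
print(kept)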

How to remove stop words and get lemmas in a pandas data frame using spacy?

I have a column of tokens in a pandas data frame in python. Something that looks like:
word_tokens
(the,cheeseburger,was,great)
(i,never,did,like,the,pizza,too,much)
(yellow,submarine,was,only,an,ok,song)
I want to get two new columns in this dataframe using the spaCy library: one column that contains each row's tokens with the stop words removed, and the other containing the lemmas from that column. How could I do that?
You're right about making your text a spaCy type - you want to transform every tuple of tokens into a spaCy Doc. From there, it is best to use the attributes of the tokens to answer the questions of "is the token a stop word" (use token.is_stop), or "what is the lemma of this token" (use token.lemma_). My implementation is below, I altered your input data slightly to include some examples of plurals so you can see that the lemmatization works properly.
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
texts = [('the', 'cheeseburger', 'was', 'great'),
         ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much'),
         ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')]
df = pd.DataFrame({'word_tokens': texts})
The initial DataFrame looks like this:
   word_tokens
0  ('the', 'cheeseburger', 'was', 'great')
1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')
2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')
I define functions to perform the main tasks:
tuple of tokens -> spaCy Doc
spaCy Doc -> list of non-stop words
spaCy Doc -> list of non-stop, lemmatized words
def to_doc(words: tuple) -> spacy.tokens.Doc:
    # Create a spaCy Doc by joining the words into a string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]
Applying these looks like:
# create documents for all tuples of tokens
docs = list(map(to_doc, df.word_tokens))
# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops, docs))
# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize, docs))
The output you get should look like this:
   word_tokens                                                     removed_stops                             lemmatized
0  ('the', 'cheeseburger', 'was', 'great')                         ['cheeseburger', 'great']                 ['cheeseburger', 'great']
1  ('i', 'never', 'did', 'like', 'the', 'pizzas', 'too', 'much')   ['like', 'pizzas']                        ['like', 'pizza']
2  ('yellowed', 'submarines', 'was', 'only', 'an', 'ok', 'song')   ['yellowed', 'submarines', 'ok', 'song']  ['yellow', 'submarine', 'ok', 'song']
Based on your use case, you may want to explore other attributes of spaCy's document object (https://spacy.io/api/doc). Particularly, take a look at doc.noun_chunks and doc.ents if you're trying to extract more meaning out of text.
It is also worth noting that if you plan on using this with a very large number of texts, you should consider nlp.pipe: https://spacy.io/usage/processing-pipelines. It processes your documents in batches instead of one by one, and could make your implementation more efficient.
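A minimal sketch of that batching idea, reusing the df and helper functions defined above (the batch_size value is just an illustration):
# Join each tuple of tokens into one string, then let spaCy process them in batches.
joined = [' '.join(words) for words in df.word_tokens]
docs = list(nlp.pipe(joined, batch_size=50))

df['removed_stops'] = [remove_stops(doc) for doc in docs]
df['lemmatized'] = [lemmatize(doc) for doc in docs]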
If you are working with spaCy you should make your text a spaCy type, so something like this:
nlp = spacy.load("en_core_web_sm")
text = topic_data['word_tokens'].values.tolist()
text = '.'.join(map(str, text))
text = nlp(text)
This makes it easier to work with. You can then tokenize the words like this:
token_list = []
for token in text:
    token_list.append(token.text)
And remove stop words like so:
token_list = [word for word in token_list if word not in nlp.Defaults.stop_words]
I haven't figured out the lemmatization part yet, but this is a start till then.
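A possible sketch of that missing lemmatization step (not the original answerer's code), continuing from the text object above and using spaCy's token.lemma_ attribute:
# Lemmatize the tokens that are not stop words.
lemma_list = [token.lemma_ for token in text
              if token.text not in nlp.Defaults.stop_words]
print(lemma_list)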

Why is the stem of the word "got" still "got" instead of "get"?

from stemming.porter2 import stem
documents = ['got',"get"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)
The result is :
[['got'], ['get']]
Can someone help to explain this ?
Thank you !
What you want is a lemmatizer instead of a stemmer. The difference is subtle.
Generally, a stemmer drops suffixes as much as possible and, in some cases, handles an exception list for words that cannot reach a normalized form by simply dropping suffixes.
A lemmatizer tries to find the "basic"/root/infinitive form of a word, and it usually requires specialized rules for different languages. (A short comparison follows the links below.)
See
what is the true difference between lemmatization vs stemming?
Stemmers vs Lemmatizers
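To make the contrast concrete, here is a small sketch using NLTK's PorterStemmer and WordNetLemmatizer (the porter2 stemmer used above behaves the same way on this word):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('got'))                   # 'got' -- suffix stripping cannot reach 'get'
print(lemmatizer.lemmatize('got', pos='v'))  # 'get' -- dictionary lookup with a verb POS tag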
Lemmatization using the NLTK implementation of the morphy lemmatizer requires the correct part-of-speech (POS) tag to be fairly accurate.
Avoid (or in fact, never try to) lemmatize individual words in isolation. Try lemmatizing a fully POS-tagged sentence, e.g.:
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    morphy_tag = {'NN': wn.NOUN, 'JJ': wn.ADJ,
                  'VB': wn.VERB, 'RB': wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
With the penn2morphy helper function, you can convert the POS tags from pos_tag() to morphy tags and then:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
For convenience you can also try the pywsd lemmatizer.
>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']
See also https://stackoverflow.com/a/22343640/610569

Identify the word as a noun, verb or adjective

Given a single word such as "table", I want to identify what it is most commonly used as: whether its most common usage is as a noun, verb or adjective. I want to do this in Python. Is there anything else besides WordNet? I'd prefer not to use WordNet. Or, if I do use WordNet, how exactly would I do it?
import nltk
text = 'This is a table. We should table this offer. The table is in the center.'
text = nltk.word_tokenize(text)
result = nltk.pos_tag(text)
result = [i for i in result if i[0].lower() == 'table']
print(result) # [('table', 'JJ'), ('table', 'VB'), ('table', 'NN')]
If you have a word out of context and want to know its most common use, you could look at someone else's frequency table (e.g. WordNet; a sketch of that follows the Brown example below), or you can do your own counts: just find a tagged corpus that's large enough for your purposes and count its instances. If you want to use a free corpus, the NLTK includes the Brown corpus (1 million words). The NLTK also provides methods for working with larger, non-free corpora (e.g. the British National Corpus).
import nltk
from nltk.corpus import brown
table = nltk.FreqDist(t for w, t in brown.tagged_words() if w.lower() == 'table')
print(table.most_common())
[('NN', 147), ('NN-TL', 50), ('VB', 1)]
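For the WordNet route mentioned above, a minimal sketch using the SemCor-based lemma counts exposed by NLTK's WordNet interface (these counts are sparse, so treat them as rough):
from collections import Counter
from nltk.corpus import wordnet as wn

pos_counts = Counter()
for synset in wn.synsets('table'):
    for lemma in synset.lemmas():
        if lemma.name().lower() == 'table':
            # lemma.count() is the frequency of this sense in the tagged SemCor sample
            pos_counts[synset.pos()] += lemma.count()

print(pos_counts.most_common())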

What does NN VBD IN DT NNS RB means in NLTK?

When I chunk text, I get lots of codes in the output like
NN, VBD, IN, DT, NNS, RB.
Is there a list documented somewhere which tells me the meaning of these?
I have tried googling "nltk chunk code", "nltk chunk grammar", and "nltk chunk tokens",
but I am not able to find any documentation which explains what these codes mean.
The tags that you see are not a result of the chunks but the POS tagging that happens before chunking. It's the Penn Treebank tagset, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
>>> from nltk import word_tokenize, pos_tag, ne_chunk
>>> sent = "This is a Foo Bar sentence."
# POS tag.
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('Foo', 'NNP'), ('Bar', 'NNP'), ('sentence', 'NN'), ('.', '.')]
>>> tagged_sent = pos_tag(word_tokenize(sent))
# Chunk.
>>> ne_chunk(tagged_sent)
Tree('S', [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]), ('sentence', 'NN'), ('.', '.')])
To get the chunks look for subtrees within the chunked outputs. From the above output, the Tree('ORGANIZATION', [('Foo', 'NNP'), ('Bar', 'NNP')]) indicates the chunk.
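A small sketch of pulling those chunks out programmatically, continuing from tagged_sent above (the label check targets the NE chunker's ORGANIZATION chunks):
>>> for subtree in ne_chunk(tagged_sent).subtrees():
...     if subtree.label() == 'ORGANIZATION':
...         print(' '.join(word for word, tag in subtree.leaves()))
Foo Bar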
This tutorial site is pretty helpful to explain the chunking process in NLTK: http://www.eecis.udel.edu/~trnka/CISC889-11S/lectures/dongqing-chunking.pdf.
For official documentation, see http://www.nltk.org/howto/chunk.html
Even though the above links cover all of these, I hope this is still helpful for someone; I have added a few that are missed in the other links.
CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
VP: Verb Phrase
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
PP: Preposition Phrase
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
S: Simple declarative clause
SBAR: Clause introduced by a (possibly empty) subordinating conjunction
SBARQ: Direct question introduced by a wh-word or a wh-phrase.
SINV: Inverted declarative sentence, i.e. one in which the subject follows the tensed verb or modal.
SQ: Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ.
SYM: Symbol
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WDT: Wh-determiner
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb
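If you want the official descriptions (with examples) from inside Python rather than a static list, NLTK ships a help utility for this tagset; a quick sketch:
import nltk

# Print the definition and example words for a single tag...
nltk.help.upenn_tagset('NN')
# ...or for every tag matching a regular expression.
nltk.help.upenn_tagset('VB.*')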
As told by alvas above, these tags are part-of-speech tags, which tell whether a word/phrase is a noun phrase, adverb, determiner, verb, etc.
Here are the POS tag details you can refer to.
Chunking recovers the phrases from the part-of-speech tags.
You can refer to this link for further reading about chunking.
