Can we create a simple thesaurus from a field in a dataframe? - python

I am trying to find synonyms and antonyms for one word, using strings from a field in a dataframe and not a standard wordnet.synsets lexical database. I'm pretty sure this is possible, but I'm not sure how to feed in the appropriate data source (my specific field).
For instance, the code below works fine.
import nltk
from nltk.corpus import wordnet  # Import wordnet from the NLTK

syn = list()
ant = list()
for synset in wordnet.synsets("fake"):
    for lemma in synset.lemmas():
        syn.append(lemma.name())  # add the synonyms
        if lemma.antonyms():  # When antonyms are available, add them into the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))
I tried to convert the field to an array, and use that...
import pandas as pd
import nltk.corpus
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

df = pd.read_csv("C:\\my_path\\dataset.csv")
df['review_text'] = df['review_text'].astype(str)
type(df)
df.dtypes
asarray = pd.array(df['review_text'])

import nltk
from nltk.corpus import wordnet  # Import wordnet from the NLTK

syn = list()
ant = list()
for synset in wordnet.asarray('fake'):
    for lemma in df['review_text'].iterrows():
        syn.append(lemma.name())  # add the synonyms
        if lemma.antonyms():  # When antonyms are available, add them into the list
            ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))
When I run that, I get this error:
AttributeError: 'WordNetCorpusReader' object has no attribute 'asarray'
The field in the dataframe looks like this:
feels comfortable i wear day!
package came end missing box. since it’s gift i update actual fit.
birkenstock amazing shoe!!!! i wish i ten pairs!
delivered advertised.... shoe looks & fits expected. leather color seems bit lighter one seen store, still satisfactory.
second pair i had. nothing beats them.
These are the first 5 rows. Maybe the issue is related to the curly apostrophe in the second row (not sure): it’s

It's a typo or something. The error happens on:
for synset in wordnet.asarray('fake'):
where wordnet is the object imported from nltk.corpus, i.e. a WordNetCorpusReader, and that class has no asarray attribute, hence the AttributeError.
Before the for loop, you have
asarray = pd.array(df['review_text'])
which reads the Pandas series into a variable named asarray. But that variable is not related to the WordNetCorpusReader in any way.
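For reference, here is a minimal sketch of what the loop might look like instead, assuming the intent was to look up WordNet synonyms and antonyms for every token in the review_text field (the small inline DataFrame below is a hypothetical stand-in for the CSV):
import pandas as pd
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# Hypothetical stand-in for the CSV in the question
df = pd.DataFrame({'review_text': ['feels comfortable i wear day!']})

syn = list()
ant = list()
for text in df['review_text']:
    for token in word_tokenize(text):
        for synset in wordnet.synsets(token):  # look each token up in WordNet
            for lemma in synset.lemmas():
                syn.append(lemma.name())  # add the synonyms
                if lemma.antonyms():  # when antonyms are available, add them
                    ant.append(lemma.antonyms()[0].name())
print('Synonyms: ' + str(syn))
print('Antonyms: ' + str(ant))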

Related

How to import a nltk.corpus by using a variable

So I am trying to receive input specifying which corpus we want to access in this problem. As shown below, I have the input set to 'corp'.
The problem is that when I run my program, Python says it can't find a corpus called 'corp' instead of treating 'corp' as the variable it is. Any idea how to take an input and use it as a variable in the phrase 'from nltk.corpus import XXX'?
Thanks in advance! :)
def chi_square(w1, w2, corp):
    from nltk.corpus import corp
import nltk.corpus as corp
This is how you import a Python package under a custom name of your own.
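If the goal is really to pick the corpus at runtime, one option (my own suggestion, not part of the original answer) is to look the name up as an attribute of nltk.corpus with getattr, since an import statement cannot take a variable:
import nltk.corpus

def load_corpus(name):
    # 'name' must match an attribute of nltk.corpus, e.g. 'brown' or 'gutenberg'
    return getattr(nltk.corpus, name)

corp = load_corpus('brown')  # roughly equivalent to: from nltk.corpus import brown
print(corp.words()[:10])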

Incomplete list of synset hypernyms in NLTK's WordNet?

While trying to recover any given WordNet synset's hypernyms through NLTK's WordNet interface, I am getting what I think are different results from WordNet's web search interface. For example:
from nltk.corpus import wordnet as wn
bank6ss = wn.synsets("bank")[5] # 'bank' as gambling house funds
bank6ss.hypernyms()
# returns [Synset('funds.n.01')]
That is, only one hypernym found (no others are found with, for instance, instance_hypernyms()). However, when looking at WN's web interface, this sense of 'bank' lists several other hypernyms under "Direct hypernym":
funds, finances, monetary resource, cash in hand, pecuniary resource
What would explain this difference, and how could I get that longer list of hypernyms in NLTK's WordNet?
The WordNet version used in my NLTK installation is 3.0.
I just realized that I'm looking at two different types of output: what NLTK's WordNet returns is a hypernym synset (Synset('funds.n.01')), while the list of hypernyms in the web interface is composed of the lemmas belonging to that one synset.
To fully answer the question, this list of lemmas can be recovered in NLTK as follows:
from nltk.corpus import wordnet as wn
bank6ss = wn.synsets("bank")[5] # 'bank' as gambling house funds
hn1ss = bank6ss.hypernyms()[0]
hn1ss.lemmas()
# returns [Lemma('funds.n.01.funds'),
# Lemma('funds.n.01.finances'),
# Lemma('funds.n.01.monetary_resource'),
# Lemma('funds.n.01.cash_in_hand'),
# Lemma('funds.n.01.pecuniary_resource')]
Or, if only lemma names are of interest:
hn1ss.lemma_names()
# returns [u'funds',
# u'finances',
# u'monetary_resource',
# u'cash_in_hand',
# u'pecuniary_resource']

How to do POS tagging for Bigrams in Python

Firstly, I must admit that I am a newbie to Python and R.
Here I am trying to create a file with the list of bi-grams / 2-grams along with their POS tags (NN, VB, etc.). This is used to easily identify meaningful bi-grams and their POS tag combinations.
For example: the bigram 'Gross' 'Profit' has the POS tag combination JJ & NN, but the bigram 'quarter' 'of' has the combination NN & IN. With this I can find meaningful POS combinations. It may not be accurate; that is fine. I just want to research with it.
For reference, please check the section "2-gram Results" on this page. My requirement is something like that, but it was done in R, so it was not useful to me.
As I have come across in Python, POS tagging and creation of bi-grams can be done using the NLTK or TextBlob packages. But I am unable to find the logic to assign POS tags to the bi-grams generated in Python. Please see below for the code and relevant output.
import nltk
from textblob import TextBlob
from nltk import word_tokenize
from nltk import bigrams
################# Code snippet using TextBlob Package #######################
text1 = """This is an example for using TextBlob Package"""
blobs = TextBlob(text1) ### Converting str to textblob object
blob_tags = blobs.tags ### Assigning POS tags to the word blobs
print(blob_tags)
blob_bigrams = blobs.ngrams(n=2) ### Creating bi-grams from word blobs
print(blob_bigrams)
################# Code snippet using NLTK Package #######################
text2 = """This is an example for using NLTK Package"""
tokens = word_tokenize(text2) ### Converting str object to List object
nltk_tags = nltk.pos_tag(tokens) ### Assigning POS tags to the word tokens
print(nltk_tags)
nltk_bigrams = bigrams(tokens) ### Creating bi-grams from word tokens
print(list(nltk_bigrams))
Any help is much appreciated. Thanks in advance.
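One way to get there (a sketch of my own, not from this thread): POS-tag the token list first, then build bigrams over the (word, tag) pairs, so each bigram carries its tag combination:
import nltk
from nltk import word_tokenize

text = "This is an example for using NLTK Package"
tagged = nltk.pos_tag(word_tokenize(text))   # [('This', 'DT'), ('is', 'VBZ'), ...]
tagged_bigrams = list(nltk.bigrams(tagged))  # pairs of (word, tag) tuples

for (w1, t1), (w2, t2) in tagged_bigrams:
    print(w1, w2, '->', t1, '&', t2)  # e.g. This is -> DT & VBZ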

Does the WordNet API have a function ".is_parent_of()"?

I want to check if 'worda' is a hypernym of 'wordb', according to the WordNet hierarchy of word relationships.
Does the NLTK WordNet API have a function like
worda.is_parent_of(wordb)
Thanks
There is the hypernyms() method for synsets. Also, lowest_common_hypernyms() can be useful. Bear in mind that synsets can contain more than one word. Some example code to navigate WordNet can be found below.
from nltk.corpus import wordnet as wn

right_whale = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
print(orca)
print(right_whale.lowest_common_hypernyms(orca))
baleen_whale = right_whale.hypernyms()[0]
print(baleen_whale)
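NLTK has no is_parent_of() method, but a small helper along these lines (my sketch, using Synset.closure() to walk the transitive hypernyms) achieves the same check:
from nltk.corpus import wordnet as wn

def is_parent_of(parent, child):
    # True if 'parent' appears anywhere among the transitive hypernyms of 'child'
    return parent in child.closure(lambda s: s.hypernyms())

whale = wn.synset('whale.n.02')
orca = wn.synset('orca.n.01')
print(is_parent_of(whale, orca))  # expected True under WordNet 3.0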

Plotting the ConditionalFreqDist of a book

Using nltk (already imported). Playing around with gutenberg corpus
import nltk
from nltk.corpus import gutenberg
Checked out the fileids, to find one I could play with:
gutenberg.fileids()
I made a small code to find the most common words (in order to choose a few for the graph)
kjv = gutenberg.words('bible-kjv.txt')  # the fileid picked from the list above
kjv_text = nltk.Text(kjv)
from collections import Counter
for words in [kjv_text]:
    c = Counter(words)
    print(c.most_common()[:100])  # top 100
kjv_text.dispersion_plot(["LORD", "God", "Israel", "king", "people"])
Until here it works perfectly. Then I try and implement the ConditionalFreqDist, but get bunch of errors:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
                                for fileid in gutenberg.fileids()
                                for w in gutenberg.words(fileid)
                                for target in ['lord']
                                if w.lower().startswith(target))
cfd2.plot()
I have tried to change a few things, but always get some errors. Any experts that can tell me what I'm doing wrong?
Thanks
Here is what was wrong:
The fileid in:
cfd2 = nltk.ConditionalFreqDist((target, fileid['bible-kjv.txt'])
should reference the element by its position (in this case the 4th in the list of Gutenberg texts), not be indexed with the filename string.
So the line should instead say:
cfd2 = nltk.ConditionalFreqDist((target, fileid[3])
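For what it's worth, here is a variant that runs end to end. This is a sketch of my own, assuming the goal is to count words starting with 'lord' across every Gutenberg text; it uses the fileid string itself as the sample, which differs slightly from the fix above:
import nltk
from nltk.corpus import gutenberg

# Count words beginning with 'lord' in each Gutenberg text,
# keyed by the condition 'lord' and sampled by fileid.
cfd2 = nltk.ConditionalFreqDist((target, fileid)
                                for fileid in gutenberg.fileids()
                                for w in gutenberg.words(fileid)
                                for target in ['lord']
                                if w.lower().startswith(target))
cfd2.plot()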