Sentiment analysis with own lexicon - python

Hi, I am supposed to do a sentiment analysis of the three sentences below and I am wondering how to start, because I am really stuck.
I am supposed to write a function that determines the sentiment of a comment, taking a word list as an argument. Given a sentiment lexicon as follows:
• positive words: "good", "awesome", "excellent", "great"
• negative words: "bad", "broke", "terrible", "poor"
I then need to count the positive and negative words in the list to work out whether the comment is positive or negative and print "positive comment" or "negative comment".
import string

def splitandremovepunc(s):
    t = s.maketrans("", "", string.punctuation)
    return s.translate(t).split()

lst = "Good for the price, but poor Bluetooth connections."
lst2 = "Excellent product. Awesome quality and good customer service."
lst3 = "The quality is terrible. I would not buy this product again."

print(splitandremovepunc(lst))
print(splitandremovepunc(lst2))
print(splitandremovepunc(lst3))

One of the most basic ways to do this is simply counting the number of negative and positive words in each string, like so:

lst_split = "Good for the price, but poor Bluetooth connections.".lower().split()

positive_word_list = ['good', 'great']
negative_word_list = ['bad', 'poor']

positive_score = 0
negative_score = 0
for word in positive_word_list:
    positive_score += lst_split.count(word)
for word in negative_word_list:
    negative_score += lst_split.count(word)
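Putting the two pieces together, a minimal sketch of the function the assignment describes could look like this (the tie-breaking rule is my own choice, so adjust it to whatever the assignment requires):

import string

positive_words = ["good", "awesome", "excellent", "great"]
negative_words = ["bad", "broke", "terrible", "poor"]

def splitandremovepunc(s):
    t = s.maketrans("", "", string.punctuation)
    return s.translate(t).split()

def comment_sentiment(words):
    # words is a list of tokens; lowercase them so "Good" matches "good"
    words = [w.lower() for w in words]
    positive_score = sum(words.count(w) for w in positive_words)
    negative_score = sum(words.count(w) for w in negative_words)
    if positive_score > negative_score:
        print("positive comment")
    else:
        # ties fall through to "negative comment" here
        print("negative comment")

comment_sentiment(splitandremovepunc("Excellent product. Awesome quality and good customer service."))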

May I propose VADER (Valence Aware Dictionary and sEntiment Reasoner)? It’s a sentiment lexicon and you can use it like this:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
result = sentiment_analyzer.polarity_scores('Good for the price, but poor Bluetooth connections.')
result is a dictionary with information about the sentence's sentiment polarity. The first three values (neg, neu, pos) are the proportions of the text that are negative, neutral and positive. The last value, compound, is the overall negativity or positivity score, normalised to the range -1 to +1.
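A small sketch of how you could map the compound score onto the two labels the assignment asks for (the ±0.05 cut-offs below are a commonly used convention, not something required by the library):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("Excellent product. Awesome quality and good customer service.")

# compound ranges from -1 (most negative) to +1 (most positive)
if scores["compound"] >= 0.05:
    print("positive comment")
elif scores["compound"] <= -0.05:
    print("negative comment")
else:
    print("neutral comment")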


Clustering script fails with German, but works like expected with English

I have a script to cluster keywords, using pandas and PolyFuzz. With English it works as expected. With German keywords, however, the script recognizes multiple keywords wrongly.
What "wrongly recognized" means: the clustering picks up the first and second word of the keyword, and as you can see in the screenshot, columns G and H (First Word and Second Word) contain different words than the corresponding keywords in column B (Keyword).
The script does not always fail with German, many keywords are clustered correctly, but the share of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, ideally, how to improve the script so that it works with German?
Here is the part of the script that does the clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1

def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A
df_matched['Found'] = df_matched.apply(ismatch, axis=1)

df_matched = df_matched.fillna('')

def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A
df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# todo - document this algo. Essentially if it matches on the second word only, it renames the cluster to the second word
# clean up code and variable names
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"

# count cluster_size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here, the working Colab with the whole script.
I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 104 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German data the problem is more visible because the language uses a lot of compound words and many frequent suffixes (e.g., "ung"), so the words fuzzy-match a lot.
In the df_matched example for German, the "From" words are not present in "To", but there are large overlaps between the strings.
In the df_matched example for English, some words of "From" are not even close to the words in "To", and the similarity score can be worse than in the German dataset.
Possible improvements
I think the part where you could improve the clustering is this (from the Colab notebook):
df_1_list = df_1.Keyword.tolist()  # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()
model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of compound words.
You could try to normalize the text with (language-specific) stemming or lemmatization before, or instead of, checking for substrings and creating clusters. This should transform each word into its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for lemmatization, embeddings and other language operations.
You can select the correct model for each language and use the lemmatizer to replace each word of df_1_list with its "base form" before trying to cluster.
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
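As a rough sketch of how this could be wired into the matching step (assuming the German model de_core_news_sm is installed; lemmatize_keywords is a helper name I made up):

import spacy
from polyfuzz import PolyFuzz

nlp = spacy.load("de_core_news_sm")  # German model; use en_core_web_sm for English data

def lemmatize_keywords(keywords):
    # replace every word of each keyword with its lemma
    return [" ".join(token.lemma_ for token in doc) for doc in nlp.pipe(keywords)]

df_1_list = lemmatize_keywords(df_1_list)
# shortest_word_list would then be rebuilt from the lemmatized list, as before

model = PolyFuzz("TF-IDF")
model.match(df_1_list, shortest_word_list)
model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()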

How to remove adjectives or attributive before noun?

Currently I am using NLTK to remove all the adjectives. This is my attempt:
import nltk

def remove_adj(sentence):
    adjective_tags = ["JJ", "JJR", "JJS"]
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    word = [word for word, pos in tags if pos not in adjective_tags]
    return ' '.join(word)
But what I need is different from this one. Here are some examples:
input: "who has the highest revenue" output: "who has the revenue"
input: "who earned more than average income" output: "who earned more than income"
input: "what is the mean of profit" output: "what is the profit"
Can anyone give me some suggestions? Thanks all in advance.
I think I understand what you are trying to achieve, but what problem are you having? I've run your code and it appears to work perfectly at removing adjectives.
A couple of things are throwing me off, though. For the input/output below, you should expect the word 'more' to be removed, since it is an adjective with the tag 'JJR'. Your post suggests that you were not expecting it to be removed.
input: "who earned more than average income" output: "who earned more than income"
Also, I'm not sure why you were expecting the word 'mean' to be removed in the below input/output, as it isn't an adjective.
input: "what is the mean of profit" output: "what is the profit"
A great place to check your sentences is a Parts of Speech reference.
Below are the actual outputs your code produces; it removes the adjectives correctly, and it seems to be doing just that:
input: "who has the highest revenue" output: "who has the revenue"
input: "who earned more than average income" output: "who earned than income"
input: "what is the mean of profit" output: "what is the mean of profit"
If you are simply trying to remove any descriptive elements pertaining to the noun, I would have to ask more about your problem. Your examples all end with a noun, and this appears to be the noun you are focusing on. Will that be the case for all sentences this code has to handle? If so, you might consider iterating through the sentence backwards. You can easily identify the noun; as you step through, you then check whether the noun has a determiner (a, an, the) with tag 'DT', which you wouldn't want to remove from what I see. You continue stepping through, removing everything until you reach an adjective or another noun. I don't know what your actual rules for removing words are, but working backwards may help.
EDIT:
I tinkered with this a bit and got the code below to work exactly as you wanted on the example outputs. You can add tags to the 'stop_tags' variable if there are other part-of-speech tags you want it to stop on.
import nltk

def remove_adj(sentence):
    stop_tags = ["JJ", "JJR", "JJS", "NN"]
    tokens = nltk.word_tokenize(sentence)
    tags = list(reversed(nltk.pos_tag(tokens)))
    noun_located = False
    stop_reached = False
    final_sent = ''
    for word, pos in tags:
        if noun_located == False and pos == 'NN':
            noun_located = True
            final_sent += f' {word}'
        elif stop_reached == False and pos in stop_tags:
            stop_reached = True
        elif stop_reached == True:
            final_sent += f' {word}'
    final_sent = ' '.join(reversed(final_sent.split(' ')))
    return final_sent

x = remove_adj('what is the mean of profit')
print(x)
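For reference, with the default NLTK tagger this should print roughly "what is the profit" (possibly with a stray space, and the exact result depends on how the tagger labels "mean").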

NLP- sentiment analysis using word vectors

I have code that does the following:
• generate word vectors using the Brown corpus from NLTK
• maintain two lists, one with a few positive sentiment words (e.g. good, happy, nice) and the other with negative sentiment words (e.g. bad, sad, unhappy)
• define a statement whose sentiment we wish to obtain
• perform preprocessing on this statement (tokenize, lowercase, remove special characters, remove stopwords, lemmatize words)
• generate word vectors for all these words and store them in a list
I have a test sentence of 7 words and I wish to determine its sentiment. First I define two lists:
good_words = ['good', 'excellent', 'happy']
bad_words = ['bad', 'terrible', 'sad']
Now I run a loop taking i words at a time, where i ranges from 1 to the sentence length. For a particular i, I have a few windows of words that span the test sentence. For each window, I take the average of the word vectors in the window and compute the Euclidean distance between this windowed vector and the words in the two lists. For example, with i = 3 and the test sentence "food looks fresh healthy", I have two windows: "food looks fresh" and "looks fresh healthy". I take the mean of the vectors of the words in each window and compute the Euclidean distance to the good_words and bad_words, so corresponding to each word in both lists I get two values (one per window). I then take the mean of these two values for each word in the lists, and whichever word has the smallest distance lies closest to the test sentence.
I wish to show that a window size (i) of 3 or 4 gives the highest accuracy in determining the sentiment of the test sentence, but I am having difficulty achieving this. Any leads on how I can produce my results would be highly appreciated.
Thanks in advance.
import multiprocessing
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from nltk import word_tokenize
from nltk.corpus import brown

# note: size/iter and b[word] are the pre-4.0 gensim API
# (newer versions use vector_size/epochs and b.wv[word])
b = Word2Vec(brown.sents(), window=5, min_count=5, negative=15, size=50, iter=10, workers=multiprocessing.cpu_count())

pos_words = ['good', 'happy', 'nice', 'excellent', 'satisfied']
neg_words = ['bad', 'sad', 'unhappy', 'disgusted', 'afraid', 'fearful', 'angry']
pos_vec = [b[word] for word in pos_words]
neg_vec = [b[word] for word in neg_words]

test = "Sound quality on both end is excellent."

# convert_lowercase, remove_specialchar, removestopwords, lemmatize and
# euclidian_distance are my own helper functions (sketched below)
tokenized_word = word_tokenize(test)
lower_tokens = convert_lowercase(tokenized_word)
alpha_tokens = remove_specialchar(lower_tokens)
rem_tokens = removestopwords(alpha_tokens)
lemma_tokens = lemmatize(rem_tokens)
word_vec = [b[word] for word in lemma_tokens]

for i in range(0, len(lemma_tokens)):
    windowed_vec = []
    for j in range(0, len(lemma_tokens) - i):
        windowed_vec.append(np.mean([word_vec[j + k] for k in range(0, i + 1)], axis=0))
    gen_pos_arr = []
    gen_neg_arr = []
    for p in range(0, len(pos_vec)):
        gen_pos_arr.append([euclidian_distance(vec, pos_vec[p]) for vec in windowed_vec])
    for q in range(0, len(neg_vec)):
        gen_neg_arr.append([euclidian_distance(vec, neg_vec[q]) for vec in windowed_vec])
    gen_pos_arr_mean = []
    gen_pos_arr_mean.append([np.mean(x) for x in gen_pos_arr])
    gen_neg_arr_mean = []
    gen_neg_arr_mean.append([np.mean(x) for x in gen_neg_arr])
    min_value = np.min([np.min(gen_pos_arr_mean), np.min(gen_neg_arr_mean)])
    for v in gen_pos_arr_mean:
        print('min value:', min_value)
        if min_value in v:
            print('pos', v)
            plt.scatter(i, min_value, color='blue')
            plt.text(i, min_value, pos_words[gen_pos_arr_mean[0].index(min_value)])
        else:
            print('neg', v)
            plt.scatter(i, min_value, color='red')
            plt.text(i, min_value, neg_words[gen_neg_arr_mean[0].index(min_value)])

print(test)
plt.title('')
plt.xlabel('window size')
plt.ylabel('avg of distances of windows from sentiment words')
plt.show()
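The snippet above relies on a few helpers that are not shown (convert_lowercase, remove_specialchar, removestopwords, lemmatize, euclidian_distance). A minimal sketch of what they might look like, assuming NLTK's stopword list and WordNet lemmatizer, rather than the original definitions:

import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def convert_lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_specialchar(tokens):
    # keep purely alphabetic tokens only
    return [t for t in tokens if t.isalpha()]

def removestopwords(tokens):
    stops = set(stopwords.words('english'))
    return [t for t in tokens if t not in stops]

def lemmatize(tokens):
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(t) for t in tokens]

def euclidian_distance(a, b):
    return np.linalg.norm(np.asarray(a) - np.asarray(b))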

Python - count word frequency of string from list, number of words from list varies

I am trying to create a program that runs through a list of mental health terms, looks in a research abstract, and counts the number of times each word or phrase appears. I can get this to work with single words, but I'm struggling to do it with multi-word terms. I also tried using NLTK ngrams, but since the number of words per term varies (i.e., not all terms in the mental health list are bigrams or trigrams), I couldn't get that to work either.
I want to emphasize that I know splitting on whitespace will only allow single words to be counted; I'm just stuck on how to deal with a varying number of words from my list when counting in the abstract.
Thanks!
from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.']

for x2 in abstracts:
    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']
    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',', '')
        term = term.replace('.', '')
        xx = (term, c.get(term, 0))
    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)
From my example, both abstracts are getting a count of 1, but I want a count of two.
The problem is that you will never match "mental health", because you are only counting occurrences of single words split on the " " character.
I don't know if a Counter is the right solution here. If you need a highly scalable and indexable solution, then n-grams are probably the way to go, but for small to medium problems regex pattern matching should be quick enough.
import re

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.',
]

mh_terms = [
    'bipolar disorder', 'anxiety', 'substance abuse disorder',
    'ptsd', 'schizophrenia', 'mental health',
]

def _regex_word(text):
    """ wrap text with special regex expression for start/end of words """
    return '\\b{}\\b'.format(text)

def _normalize(text):
    """ Remove any non alpha/numeric/space character """
    return re.sub('[^a-z0-9 ]', '', text.lower())

normed_terms = [_normalize(term) for term in mh_terms]

for raw_abstract in abstracts:
    print('--------')
    normed_abstract = _normalize(raw_abstract)

    # Search for all occurrences of chosen terms
    found = {}
    for norm_term in normed_terms:
        pattern = _regex_word(norm_term)
        found[norm_term] = len(re.findall(pattern, normed_abstract))
    print('found = {!r}'.format(found))

    mh_total_occur = sum(found.values())
    print('mh_total_occur = {!r}'.format(mh_total_occur))
I added helper functions and comments to make it clear what I am doing.
Using the \b regex word-boundary character is important in general use cases because it prevents a search term like "miss" from matching words like "dismiss".
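Running this against the two example abstracts should print something like the following (3 term hits in the first abstract, 2 in the second):

--------
found = {'bipolar disorder': 1, 'anxiety': 1, 'substance abuse disorder': 0, 'ptsd': 0, 'schizophrenia': 0, 'mental health': 1}
mh_total_occur = 3
--------
found = {'bipolar disorder': 0, 'anxiety': 0, 'substance abuse disorder': 0, 'ptsd': 1, 'schizophrenia': 0, 'mental health': 1}
mh_total_occur = 2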

Accumulated Frequencies, Ngrams

Quick question here: if you run the code below, you get a list of bigram frequencies per sentence from the corpus.
I would like to be able to display and keep a total running tally. That is, instead of the frequency of 1 or maybe 2 that you see when you run it (because each sentence is so short), it should count through the whole corpus and display the accumulated frequencies.
I then need to generate text from the frequencies that models the original corpus.
#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project
#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown
#---------------------------------------------------------
#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'
#---------------------------------------------------------
#This function will take in the corpus one line at a time
#After searching through and adding a <s> to the beggning of each list item, it also annotates periods out for </s>'
def alter_list(corpus_list):
    #Simply check for an instance of a period, and if so, replace with '</s>'
    if corpus_list[-1] == '.':
        corpus_list[-1] = '</s>'
        #Strip removes special characters such as '\n'
        corpus_list[-1].strip()
    #Else add to the end of the list item
    else:
        corpus_list.append('</s>')
    return ['<s>'] + corpus_list
#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if(user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [",user,"] rows of the corpus : ", '\n'
    for corpus_list in news[:user]:
        print(alter_list(corpus_list),'\n')
#Non positive number catch
else:
    print "Fine I Won't Show You Any... ",'\n'
#---------------------------------------------------------
print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0
#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information
#Displayed to the user
while(count < user2):
    passer = news[count]
    def ngrams(passer, n = 2, padding = True):
        #Padding refers to the same idea demonstrated above, that is bump the first word to the second, making
        #'None' the first item in each list so that calculations of frequencies can be made
        pad = [] if not padding else [None]*(n-1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))
    #In this case, arguments are first: n-gram type (bi, tri, quad)
    #Followed by in our case the addition of 'padding'
    #Padding is used in every case here because we need it for calculations
    #This function structure allows us to pull in corpus parts without the added annotations if need be
    for size, padding in ((1,1), (2,1), (3,1), (4,1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))
    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1
    print ("======================================================================================")
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    print '\nFrequencies Of Trigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    count = count + 1
#---------------------------------------------------------
I'm not sure I understand the question. nltk has a generate function, and the book that nltk comes from is available online.
http://nltk.org/book/ch01.html
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
The problem is that you define the dict counts anew for each sentence, so the ngram counts get reset to zero. Define it above the while loop and the counts will accumulate over the entire Brown corpus.
Bonus advice: you should also move the definition of ngrams outside the loop; it makes no sense to define the same function over and over (although it does no harm, except to performance). Better yet, use nltk's own ngrams function and read about FreqDist, which is like a dict counter on steroids. It will come in handy when you tackle the statistical text generation.
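To make the accumulation concrete, here is a minimal sketch (not your full script) that counts bigrams over the whole editorial section with nltk's own helpers; nltk.bigrams and FreqDist are standard nltk functions, and the padding step is only a rough approximation of your alter_list:

import nltk
from nltk import FreqDist
from nltk.corpus import brown

news = brown.sents(categories='editorial')

# one FreqDist shared by all sentences, so the counts accumulate over the whole corpus
bigram_counts = FreqDist()
for sentence in news:
    # add sentence-boundary markers, roughly what alter_list does
    padded = ['<s>'] + list(sentence) + ['</s>']
    bigram_counts.update(nltk.bigrams(padded))

# the 20 most frequent bigrams across the entire editorial section
for bigram, count in bigram_counts.most_common(20):
    print(count, bigram)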
