I am working on a basic sentiment analysis project using afinn and twitter data. My goal is to end up with a dataframe that displays the individual tweets, dates, retweets, favorites, and afinn scores.
Here is my code:
import sklearn as sk
import pandas as pd
import got3
tweetCriteria = got3.manager.TweetCriteria()
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-04-01")
tweetCriteria.setUntil("2016-05-30")
tweetCriteria.setMaxTweets(25)
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)
from afinn import Afinn
afinn = Afinn()
for x in KYDerby_tweets:
    afinn.score
AF = afinn.score
for x in KYDerby_tweets:
    print(x.text)
    print(x.date)
    print(x.retweets)
    print(x.favorites)
    print(AF)
    print("*"*50)
Everything prints out fine EXCEPT for the afinn score. In its place, I am getting the following error:
<bound method Afinn.score of ...>
So the first tweet in the list looks like this:
NBO: Kentucky Derby - Bourbon Barrel Edition http:// ift.tt/1pySg8M #Beer
2016-05-29 19:29:40
0
3
<bound method Afinn.score of ...>
Sorry for the newbie question, but can anyone tell me what I'm doing wrong with the afinn part of my code? Thanks!
Afinn.score is a method, not an attribute. You need to call the method with the text you want scored; something like AF = afinn.score(x.text) should work. That line needs to be inside the loop if you want multiple tweets scored.
"bound method" means that the value of AF is the function object itself (a reference to the function), not the value returned from calling it.
My use case:
Given an item, I would like to get recommendations of users who have not rated this item.
I found this amazing Python library that can answer my use case:
python-recsys https://github.com/ocelma/python-recsys
The example is given as below.
Which users should see Toy Story? (e.g. which users -that have not rated Toy Story- would give it a high rating?)
svd.recommend(ITEMID)
# Returns: <USERID, Predicted Rating>
[(283, 5.716264440514446),
(3604, 5.6471765418323141),
(5056, 5.6218800339214496),
(446, 5.5707524860615738),
(3902, 5.5494529168484652),
(4634, 5.51643364021289),
(3324, 5.5138903299082802),
(4801, 5.4947999354188548),
(1131, 5.4941438045650068),
(2339, 5.4916048051511659)]
This implementation uses SVD to predict the ratings users would give, and returns the ids of the users with the highest predicted ratings among those who have not rated the movie yet.
Unfortunately, this library is written using Python 2.7, which is not compatible with my project.
I also found the Scikit Surprise library which has a similar example.
https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-k-nearest-neighbors-of-a-user-or-item
import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
Prints
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
How do I change the code to get an outcome like the python-recsys example above?
Thanks in advance.
This is just an implementation of the k-nearest neighbors algorithm. Take a look at how it works before you continue.
What's happening is that the second chunk of code you posted is just classifying movies based on some metrics. The first bit is (probably) taking the already-seen movies and matching them up against all the existing classes. From there, it computes a similarity score and returns the highest ones.
So take Beauty and the Beast, which has been classified as a children's cartoon. You compare each user's watched movies to the full set of movies, and take the x users whose scores indicate the highest similarity between the class of movies Beauty and the Beast falls into and that user's previously watched movies, while also requiring that Beauty and the Beast itself is unwatched.
This is the math behind the algorithm: https://youtu.be/4ObVzTuFivY
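To make the "similarity score" idea concrete, here is a minimal, hypothetical sketch (not Surprise's internals) of the Pearson similarity between two items' rating vectors; the ratings are made up for illustration:

import numpy as np

# hypothetical rating vectors: the same five users rating two movies
toy_story = np.array([5.0, 4.0, 4.5, 3.0, 5.0])
lion_king = np.array([4.5, 4.0, 5.0, 2.5, 4.5])

def pearson(a, b):
    # center each vector, then take the cosine of the angle between them
    a_c, b_c = a - a.mean(), b - b.mean()
    return np.dot(a_c, b_c) / (np.linalg.norm(a_c) * np.linalg.norm(b_c))

print(pearson(toy_story, lion_king))  # values near 1.0 mean the two movies are rated similarly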
I am not sure if it's too late to answer, but I wanted to try the same thing and came up with this workaround using the Surprise package. I'm not sure whether this is the right approach, though:
import numpy as np
import pandas as pd
from surprise import KNNBaseline

movieid = 1
# get the list of all user ids (ratingSA is the ratings dataframe the trainset was built from)
unique_ids = ratingSA['userID'].unique()
# get the ids of the users who have already rated this movie
iids1001 = ratingSA.loc[ratingSA['item'] == movieid, 'userID']
# remove the users who already rated it, so we only recommend to the others
users_to_predict = np.setdiff1d(unique_ids, iids1001)

# predicting ratings of movie 1 for those users
algo = KNNBaseline(sim_options=sim_options)  # same options and trainset as in the example above
algo.fit(trainset)

my_recs = []
for uid in users_to_predict:
    my_recs.append((uid, algo.predict(uid=uid, iid=movieid).est))

recomend = pd.DataFrame(my_recs, columns=['uid', 'predictions']).sort_values('predictions', ascending=False).head(5)
recomend = recomend.rename({'uid': 'userId'}, axis=1)
recomend
I want to split a row into a new row whenever an adverb is present. However, if multiple adverbs occur in a row, then I only want to split into a new row after the last adverb.
A sample of my dataframe looks like this:
0 but well that's alright
1 otherwise however we'll have to
2 okay sure
3 what?
With adverbs = ['but', 'well', 'otherwise', 'however'], I want the resulting df to look like this:
0 but well
1 that's alright
2 otherwise however
3 we'll have to
2 okay sure
3 what?
I have a partial solution, maybe it can help.
You could use the TextBlob package.
Using this API, you can assign each word a part-of-speech tag. A list of possible tags is available here.
The issue is that the tagging isn't perfect, and your definition of adverb might not match theirs (for instance, but is a coordinating conjunction in the API, and the well tag is, for some reason, a verb). But it still works for the most part.
The splitting could be done this way:
from textblob import TextBlob

def adv_split(s):
    annotations = TextBlob(s).tags
    # Extract adverbs (CC for coordinating conjunction or RB for adverbs)
    adv_words = [word for word, tag in annotations
                 if tag.startswith('CC') or tag.startswith('RB')]
    # We have at least one adverb
    if len(adv_words) > 0:
        # Get the last one
        adv_pos = s.index(adv_words[-1]) + len(adv_words[-1])
        return [s[:adv_pos], s[adv_pos:]]
    else:
        return s
Then, you can use the pandas apply() and the new explode() method (pandas>0.25) to split your dataframe:
import pandas as pd

data = pd.Series(["but well that's alright",
                  "otherwise however we'll have to",
                  "okay sure",
                  "what?"])

data.apply(adv_split).explode()
You get:
0 but
0 well that's alright
1 otherwise however
1 we'll have to
2 okay sure
3 what?
It's not exactly right since well's tag is wrong, but you have the idea.
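If you prefer a fresh sequential index on the result, resetting it after the explode should do (a small, untested addition to the snippet above):

data.apply(adv_split).explode().reset_index(drop=True)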
Alternatively, staying in pandas (df is the original single-column frame and adverbs is the list from the question):

df = df[0].str.split().explode().to_frame()        # one word per row, original index preserved
df[1] = df[0].str.contains('|'.join(adverbs))      # flag the adverb words
# within each original row, group the words by the adverb flag and join them back together
df = df.groupby([df.index, 1], sort=False).agg(' '.join).reset_index(drop=True)
print(df)
0
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?
I am really a beginner in programming, and I have run into a problem. I am making a comparative analysis between fake news and real news. I have a text corpus with approx. 3000 real news stories and 3000 fake news stories. I need to figure out whether fake or real news evokes more high-arousal emotions. I want to do that by using the Warriner et al. word list: http://crr.ugent.be/archives/1003
I have imported the word list to my script:
warriner = pd.read_csv('warriner.csv', sep = '\t', encoding = 'utf-8')
print warriner.head()
I (think I) want to find the arousal mean sum, which in the word list is called A.Mean.Sum. But I can't make it work; Spyder just says: 'DataFrame' object has no attribute 'A'.
Can anyone help? I have already calculated the sentiment scores by using LabMT, as seen below, but I can't make Warriner et al. work.
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text)
    text_scored.append(sent_score)
df['abs_sent'] = text_scored  # adding the scored text to the df

# relative sentiment score
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text, rel=True)
    text_scored.append(sent_score)
df['rel_sent'] = text_scored  # adding the scored text to the df

# overall mean
df['abs_sent'].mean()
df['abs_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -22.1
df['abs_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -41.95

# relative score mean calculations
df['rel_sent'].mean()  # overall mean
df['rel_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -0.02
df['rel_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -0.05
The example code you provided is hard for me to read. You report the problem as having to do with A.Mean.Sum, but there's no code relating to that. There are also references to Spyder and DataFrame without explanation, code, or tags. Finally, the title should tell a potential answerer something about the problem itself, not just the general field the code works in; as it stands, the reader has to work out what is being asked from within the report.
I'll readily admit I'm a novice here, but I suggest reading the How to Ask intro and clarifying your question with it.
I'm also guessing this is a pandas related question, so its docs page might help you.
I hope I was of any help!
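One more guess, since the failing line isn't shown: if you wrote something like warriner.A.Mean.Sum, pandas attribute access can't handle a column name that contains dots, which produces exactly "'DataFrame' object has no attribute 'A'". Bracket indexing should work. Below is a rough sketch; it assumes the word column is named 'Word', so adjust to the actual header:

# select the arousal column by its exact name; attribute access stops at the first dot
arousal = warriner['A.Mean.Sum']
print(arousal.mean())

# a rough per-text arousal score: average A.Mean.Sum over the words found in the list
arousal_by_word = warriner.set_index('Word')['A.Mean.Sum']

def arousal_score(text):
    words = text.lower().split()
    hits = [arousal_by_word[w] for w in words if w in arousal_by_word.index]
    return sum(hits) / len(hits) if hits else None

df['arousal'] = df['text'].apply(arousal_score)
df.groupby('label')['arousal'].mean()  # compare FAKE vs REAL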
On the Python DecisionTree module homepage (DecisionTree-1.6.1), they give a piece of example code. Here it is:
dt = DecisionTree( training_datafile = "training.dat", debug1 = 1 )
dt.get_training_data()
dt.show_training_data()
root_node = dt.construct_decision_tree_classifier()
root_node.display_decision_tree(" ")
test_sample = ['exercising=>never', 'smoking=>heavy',
               'fatIntake=>heavy', 'videoAddiction=>heavy']
classification = dt.classify(root_node, test_sample)
print "Classification: ", classification
My question is: How can I specify sample data (test_sample here) from variables? On the project homepage, it says: "You classify new data by first constructing a new data vector:" I have searched around but have been unable to find out what a data vector is or the answer to my question.
Any help would be appreciated!
Um, the example says it all. It's a list of strings, with features and values separated by '=>'. To use the example, a feature is 'exercising', and the value is 'never'.
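For example, if the feature values live in ordinary variables, the list can be built with string formatting (a sketch using the feature names from the example above):

exercising = "never"
smoking = "heavy"
fat_intake = "heavy"
video_addiction = "heavy"

# the "data vector" is just one 'feature=>value' string per feature
test_sample = ["exercising=>{}".format(exercising),
               "smoking=>{}".format(smoking),
               "fatIntake=>{}".format(fat_intake),
               "videoAddiction=>{}".format(video_addiction)]

classification = dt.classify(root_node, test_sample)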
I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of excel files that are formatted strangely (and not consistently) and I need to extract certain fields for each entry. An example data set is
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is: is there a more robust method of approaching this problem than simple string processing? What I had in mind was more of a fuzzy-logic approach for trying to pin down which field an item belongs to, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years

#import the data csv
import sys
import re
import csv

def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x

def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district

def splitforstos(string):
    for itemind,item in enumerate(string):                 # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item)       # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item)         # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])

def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1]    #the file is the first argument
f=open(filename,'r')
contents=f.read()    #entire csv as string
#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)]    #alternative implementation in regex

# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
district= x[0],
vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
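(SomeSensibleName is just a placeholder; defining it as a namedtuple with your real field names might look like this sketch:)

from collections import namedtuple

# field names mirror the loose variables in the original analyze()
LevyRecord = namedtuple('LevyRecord',
                        'filename county district vote1 vote2 mills '
                        'votetype numyears yearcom yeardue reason')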
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
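A minimal sketch of that dispatch, reusing the hypothetical LevyRecord from above (the shape test and field positions are only illustrative; each real matcher would encode one layout you have actually observed in the files):

def match_long_row(county, fields):
    # the twelve-field layout handled by the first branch of analyze()
    if len(fields) <= 8:
        return None                            # wrong shape: let the next matcher try
    vote2, mills = fields[2].rsplit(' ', 1)    # last space separates votes from mills
    return LevyRecord(filename, county, fields[0], fields[1], vote2, mills,
                      fields[4], fields[6], fields[8], fields[10], fields[11])

def match_short_row(county, fields):
    # the shorter layout from the else branch would get its own matcher here
    return None

def classify_row(county, fields):
    for matcher in (match_long_row, match_short_row):
        match = matcher(county, fields)
        if match:
            return match
    return None    # no known layout matched; log it instead of guessing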