I am really a beginner in programming, and I have run into a problem. I am making a comparative analysis of fake news and real news. I have a text corpus with approx. 3000 real news articles and 3000 fake news articles, and I need to figure out whether fake or real news evokes more high-arousal emotions. I want to do that using the Warriner et al. word list: http://crr.ugent.be/archives/1003
I have imported the word list to my script:
warriner = pd.read_csv('warriner.csv', sep = '\t', encoding = 'utf-8')
print(warriner.head())
I (think I) want to find the arousal mean sum, which in the word list is called A.Mean.Sum. But I can't make it work; Spyder just says: 'DataFrame' object has no attribute 'A'.
Can anyone help? I have already calculated the sentiment scores using labMT, as seen below, but I can't make Warriner et al. work.
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text)
    text_scored.append(sent_score)
df['abs_sent'] = text_scored  # adding the scored text to the df

# relative sentiment score
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text, rel=True)
    text_scored.append(sent_score)
df['rel_sent'] = text_scored  # adding the scored text to the df

# overall mean
df['abs_sent'].mean()
df['abs_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -22.1
df['abs_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -41.95

# relative score mean calculations
df['rel_sent'].mean()  # overall mean
df['rel_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -0.02
df['rel_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -0.05
The example code you provided is hard to read. You report the problem as having to do with A.Mean.Sum, but there is no code relating to that. There are also references to Spyder and DataFrame without explanation, code, or tags. Finally, the title should tell a potential answerer something about the problem itself, not just the general field the code works in; the current one expects the reader to work out the task from the body of the post.
I'll readily admit I'm a novice here, but I suggest reading the intro How-to-Ask page and clarifying your question with it.
I'm also guessing this is a pandas-related question, so its docs page might help you.
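On the concrete error: attribute access like warriner.A only works for simple column names, and A.Mean.Sum contains dots, so it has to be read with bracket notation, warriner['A.Mean.Sum']. Below is a minimal, untested sketch of how the arousal comparison could then look; the column names ('Word', 'A.Mean.Sum') are assumed from the published norms, and df is assumed to have the 'text' and 'label' columns from your other code.

# Untested sketch; column names 'Word' and 'A.Mean.Sum' are assumptions.
arousal = dict(zip(warriner['Word'], warriner['A.Mean.Sum']))

def mean_arousal(text):
    # average arousal over the words that appear in the Warriner list
    scores = [arousal[w] for w in text.lower().split() if w in arousal]
    return sum(scores) / len(scores) if scores else None

df['arousal'] = df['text'].apply(mean_arousal)
print(df.groupby('label')['arousal'].mean())  # compare FAKE vs REAL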
I hope this helps!
Related
I wrote code based on the TF-IDF algorithm to extract keywords from a very large text.
The problem is that I keep getting a division-by-zero error. When I debug my code, everything works perfectly, and if I shorten the text so that it still contains the word that causes the problem, it works too. So I assume it's a memory problem.
I thought maybe I could read the big text file in chunks (1 KB) instead of reading the whole document at once. Unfortunately, that does not work either. What should I do?
(I am using PyCharm on Windows.)
I am a beginner in programming, Python, and the NLP domain, so I would really appreciate any help here.
if __name__ == "__main__":
    with open('spli.txt') as f:
        for piece in read_in_chunks(f):
            # print(piece)
            piece = piece.lower()
            no_punc_words, all_words = text_split(piece)
            no_punc_words, all_words = rm_stop_word(no_punc_words, all_words)
            no_punc_words_freq, all_words_freq = calc_freq(no_punc_words, all_words)
            tf_score = calc_tf_score(no_punc_words_freq)
            idf_score = calc_idf_score(no_punc_words_freq, all_words_freq, piece)

            tf_idf_score = {}
            for k in tf_score:
                tf_idf_score[k] = tf_score[k] * idf_score[k]
            # print(final_score)

            final_tf_idf = {}
            for scores in tf_idf_score:
                final_tf_idf += tf_idf_score
            print(final_tf_idf)
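The helper functions in this snippet (read_in_chunks, text_split, calc_idf_score, and so on) aren't shown, so this is only a guess, but a division by zero in TF-IDF usually comes from the IDF denominator: when the file is read in 1 KB chunks, each chunk acts as its own "document", so a term's document frequency can be 0 and N/df blows up. Add-one smoothing avoids that; here is a minimal sketch with a hypothetical helper name.

import math

def smoothed_idf(doc_freqs, num_documents):
    # Hypothetical IDF step: idf = log((1 + N) / (1 + df)) + 1,
    # which never divides by zero even if a term appears in no document/chunk.
    return {term: math.log((1.0 + num_documents) / (1 + df)) + 1
            for term, df in doc_freqs.items()}

Separately, final_tf_idf += tf_idf_score will raise a TypeError, because a dict cannot be added with +=; you probably want final_tf_idf.update(tf_idf_score), or to collect each chunk's scores in a list.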
My use case:
Given an item, I would like to get recommendations of users who have not rated this item.
I found this amazing Python library that can answer my use case:
python-recsys https://github.com/ocelma/python-recsys
The example is given as below.
Which users should see Toy Story? (e.g. which users -that have not rated Toy Story- would give it a high rating?)
svd.recommend(ITEMID)
# Returns: <USERID, Predicted Rating>
[(283, 5.716264440514446),
(3604, 5.6471765418323141),
(5056, 5.6218800339214496),
(446, 5.5707524860615738),
(3902, 5.5494529168484652),
(4634, 5.51643364021289),
(3324, 5.5138903299082802),
(4801, 5.4947999354188548),
(1131, 5.4941438045650068),
(2339, 5.4916048051511659)]
This implementation uses SVD to predict the ratings users would give, and returns the IDs of the users with the highest predicted ratings among those who have not yet rated the movie.
Unfortunately, this library is written in Python 2.7, which is not compatible with my project.
I also found the Scikit Surprise library which has a similar example.
https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-k-nearest-neighbors-of-a-user-or-item
import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """
    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)
Prints
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
How do I change the code to get the outcome like the python-recsys's example above?
Thanks in advance.
This is just an implementation of the k-nearest neighbors algorithm. Take a look at how it works before you continue.
What's happening is that the second chunk of code you posted is just classifying movies based on some metrics. The first bit is (probably) taking the already-seen movies and matching them up against all the existing classes; from there, it computes a similarity score and returns the highest.
So take Beauty and the Beast, which has been classified as a children's cartoon. You compare each user's watched movies against the full set of movies and take the x users whose scores indicate the highest similarity between the group of movies Beauty and the Beast falls into and the user's previously watched movies, restricted to users for whom Beauty and the Beast is unwatched.
This is the math behind the algorithm: https://youtu.be/4ObVzTuFivY
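For the use case in the question (predicted ratings from users who have not rated the item), a minimal sketch with Surprise's SVD that mirrors python-recsys's recommend(ITEMID) output might look like this; it reuses the built-in ml-100k data and the name_to_rid mapping from the question's code, and nothing here has been benchmarked.

from surprise import SVD, Dataset

# Train a rating predictor on the same built-in data used in the question.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Raw/inner ids of the target movie (name_to_rid comes from read_item_names() above).
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = trainset.to_inner_iid(toy_story_raw_id)

# Inner ids of the users who have already rated it.
already_rated = set(uid for (uid, _) in trainset.ir[toy_story_inner_id])

# Predict the rating every other user would give it.
predictions = []
for inner_uid in trainset.all_users():
    if inner_uid not in already_rated:
        raw_uid = trainset.to_raw_uid(inner_uid)
        predictions.append((raw_uid, algo.predict(raw_uid, toy_story_raw_id).est))

# Users with the highest predicted ratings, like the <USERID, Predicted Rating> list above.
predictions.sort(key=lambda pair: pair[1], reverse=True)
print(predictions[:10])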
I am not sure if it's too late to answer, but I wanted to try the same thing and got this workaround with the Surprise package. I'm not sure if this is the right approach, though:
import numpy as np
import pandas as pd

movieid = 1

# get the list of all user ids
unique_ids = ratingSA['userID'].unique()

# get the ids of the users who have already rated the movie
iids1001 = ratingSA.loc[ratingSA['item'] == movieid, 'userID']

# remove the users who have already rated it from the candidates
users_to_predict = np.setdiff1d(unique_ids, iids1001)

# predicting for movie 1 (training_parameters is assumed to be defined earlier)
algo = KNNBaseline(n_epochs=training_parameters['n_epochs'],
                   lr_all=training_parameters['lr_all'],
                   reg_all=training_parameters['reg_all'])
algo.fit(trainset)

my_recs = []
for uid in users_to_predict:
    # predict the rating this user would give the movie
    my_recs.append((uid, algo.predict(uid=uid, iid=movieid).est))

recomend = pd.DataFrame(my_recs, columns=['userId', 'predictions']) \
             .sort_values('predictions', ascending=False).head(5)
recomend
I am working on a basic sentiment analysis project using afinn and twitter data. My goal is to end up with a dataframe that displays the individual tweets, dates, retweets, favorites, and afinn scores.
Here is my code:
import sklearn as sk
import pandas as pd
import got3

tweetCriteria = got3.manager.TweetCriteria()
tweetCriteria.setQuerySearch("Kentucky Derby")
tweetCriteria.setSince("2016-04-01")
tweetCriteria.setUntil("2016-05-30")
tweetCriteria.setMaxTweets(25)
KYDerby_tweets = got3.manager.TweetManager.getTweets(tweetCriteria)

from afinn import Afinn
afinn = Afinn()

for x in KYDerby_tweets:
    afinn.score
AF = afinn.score

for x in KYDerby_tweets:
    print(x.text)
    print(x.date)
    print(x.retweets)
    print(x.favorites)
    print(AF)
    print("*"*50)
Everything prints out fine EXCEPT for the afinn score. In its place, I am getting a 'bound method' reference instead of a number.
So the first tweet in the list looks like this:
NBO: Kentucky Derby - Bourbon Barrel Edition http:// ift.tt/1pySg8M #Beer
2016-05-29 19:29:40
0
3
<bound method Afinn.score of <afinn.afinn.Afinn object at 0x...>>
Sorry for the newbie question, but can anyone tell me what I'm doing wrong with the afinn part of my code? Thanks!
Afinn.score is a method, not an attribute. You need to call the method with the text you want scored; something like AF = afinn.score(x.text) should work. That line of code needs to be inside the loop if you want multiple tweets scored.
"bound method" means that the value of AF is the function itself (a reference to the function), not the value returned from calling the function.
I give a lot of information on the methods that I used to write my code. If you just want to read my question, skip to the quotes at the end.
I'm working on a project that has a goal of detecting sub populations in a group of patients. I thought this sounded like the perfect opportunity to use association rule mining as I'm currently taking a class on the subject.
There are 42 variables in total. Of those, 20 are continuous and had to be discretized. For each of these variables, I used the Freedman-Diaconis rule to determine how many categories to divide the values into.
def Freedman_Diaconis(column_values):
    # sort the list first
    column_values[1].sort()

    first_quartile = int(len(column_values[1]) * .25)
    third_quartile = int(len(column_values[1]) * .75)

    fq_value = column_values[1][first_quartile]
    tq_value = column_values[1][third_quartile]

    iqr = tq_value - fq_value
    n_to_pow = len(column_values[1])**(-1/3)
    h = 2 * iqr * n_to_pow
    retval = (column_values[1][-1] - column_values[1][1])/h

    test = int(retval+1)
    return test
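For comparison, here is a minimal numpy-based sketch of the same rule (bin width h = 2 * IQR * n**(-1/3); number of bins = (max - min) / h). One thing to watch: the exponent needs to be a float, because under Python 2 the expression (-1/3) is integer division and evaluates to -1, which silently changes the result. The helper name below is hypothetical.

import numpy as np

def freedman_diaconis_bins(values):
    # Freedman-Diaconis rule: h = 2 * IQR * n**(-1/3); bins = ceil((max - min) / h)
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    h = 2 * (q3 - q1) * len(values) ** (-1.0 / 3.0)
    if h == 0:
        return 1
    return int(np.ceil((values.max() - values.min()) / h))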
From there I used min-max normalization
def min_max_transform(column_of_data, num_bins):
    min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(1, num_bins))
    data_min_max = min_max_normalizer.fit_transform(column_of_data[1])
    data_min_max_ints = take_int(data_min_max)
    return data_min_max_ints
to transform my data, and then I simply took the integer portion to get the final categorization.
def take_int(list_of_float):
    ints = []
    for flt in list_of_float:
        asint = int(flt)
        ints.append(asint)
    return ints
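A tiny worked example of the two steps above, with made-up numbers (MinMaxScaler expects a 2-D array, hence the column shape):

from sklearn import preprocessing
import numpy as np

col = np.array([[10.0], [20.0], [30.0], [40.0]])
scaler = preprocessing.MinMaxScaler(feature_range=(1, 4))   # num_bins = 4
scaled = scaler.fit_transform(col)                          # [[1.], [2.], [3.], [4.]]
bins = [int(v) for v in scaled.ravel()]                     # [1, 2, 3, 4]
print(bins)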
I then also wrote a function that I used to combine this value with the variable name.
def string_transform(prefix, column, index):
    transformed_list = []
    transformed = ""
    if index < 4:
        for entry in column[1]:
            transformed = prefix + str(entry)
            transformed_list.append(transformed)
    else:
        prefix_num = prefix.split('x')
        for entry in column[1]:
            transformed = str(prefix_num[1]) + 'x' + str(entry)
            transformed_list.append(transformed)
    return transformed_list
This was done to differentiate variables that have the same value, but appear in different columns. For example, having a value of 1 for variable x14 means something different from getting a value of 1 in variable x20. The string transform function would create 14x1 and 20x1 for the previously mentioned examples.
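A hypothetical call, assuming column is a (name, values) pair as in the other helpers:

column = ('x14', [1, 2, 1])
print(string_transform('x14', column, 14))   # ['14x1', '14x2', '14x1']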
After this, I wrote everything to a file in basket format
def create_basket(list_of_lists, headers):
    #for filename in os.listdir("."):
    #    if filename.e
    if not os.path.exists('baskets'):
        os.makedirs('baskets')

    down_length = len(list_of_lists[0])

    with open('baskets/dataset.basket', 'w') as basketfile:
        basket_writer = csv.DictWriter(basketfile, fieldnames=headers)
        for i in range(0, down_length):
            basket_writer.writerow({"trt": list_of_lists[0][i], "y": list_of_lists[1][i], "x1": list_of_lists[2][i],
                                    "x2": list_of_lists[3][i], "x3": list_of_lists[4][i], "x4": list_of_lists[5][i],
                                    "x5": list_of_lists[6][i], "x6": list_of_lists[7][i], "x7": list_of_lists[8][i],
                                    "x8": list_of_lists[9][i], "x9": list_of_lists[10][i], "x10": list_of_lists[11][i],
                                    "x11": list_of_lists[12][i], "x12": list_of_lists[13][i], "x13": list_of_lists[14][i],
                                    "x14": list_of_lists[15][i], "x15": list_of_lists[16][i], "x16": list_of_lists[17][i],
                                    "x17": list_of_lists[18][i], "x18": list_of_lists[19][i], "x19": list_of_lists[20][i],
                                    "x20": list_of_lists[21][i], "x21": list_of_lists[22][i], "x22": list_of_lists[23][i],
                                    "x23": list_of_lists[24][i], "x24": list_of_lists[25][i], "x25": list_of_lists[26][i],
                                    "x26": list_of_lists[27][i], "x27": list_of_lists[28][i], "x28": list_of_lists[29][i],
                                    "x29": list_of_lists[30][i], "x30": list_of_lists[31][i], "x31": list_of_lists[32][i],
                                    "x32": list_of_lists[33][i], "x33": list_of_lists[34][i], "x34": list_of_lists[35][i],
                                    "x35": list_of_lists[36][i], "x36": list_of_lists[37][i], "x37": list_of_lists[38][i],
                                    "x38": list_of_lists[39][i], "x39": list_of_lists[40][i], "x40": list_of_lists[41][i]})
and I used the apriori package in Orange to see if there were any association rules.
rules = Orange.associate.AssociationRulesSparseInducer(patient_basket, support=0.3, confidence=0.3)

print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules:
    my_rule = str(r)
    split_rule = my_rule.split("->")
    if 'trt' in split_rule[1]:
        print 'treatment rule'
        print "%4.1f %4.1f %s" % (r.support, r.confidence, r)
Using this technique, I found quite a few association rules with my testing data.
THIS IS WHERE I HAVE A PROBLEM
When I read the notes for the training data, there is this note
...That is, the only
reason for the differences among observed responses to the same treatment across patients is
random noise. Hence, there is NO meaningful subgroup for this dataset...
My question is:
Why do I get multiple association rules that imply there are subgroups, when according to the notes I shouldn't see anything?
I'm getting lift values above 2, as opposed to the 1 you would expect if everything were random as the notes state (lift is a rule's confidence divided by the support of its consequent, so a lift of 1 means the antecedent tells you nothing about the consequent).
Supp Conf Rule
0.3 0.7 6x0 -> trt1
Even though my code runs, I'm not getting results anywhere close to what should be expected. This leads me to believe that I messed something up, but I'm not sure what it is.
After some research, I realized that my sample size is too small for the number of variables that I have. I would need a way larger sample size in order to really use the method that I was using. In fact, the method that I tried to use was developed with the assumption that it would be run on databases with hundreds of thousands or millions of rows.
On the Python DecisionTree module homepage (DecisionTree-1.6.1), they give a piece of example code. Here it is:
dt = DecisionTree( training_datafile = "training.dat", debug1 = 1 )
dt.get_training_data()
dt.show_training_data()
root_node = dt.construct_decision_tree_classifier()
root_node.display_decision_tree(" ")
test_sample = ['exercising=>never', 'smoking=>heavy',
'fatIntake=>heavy', 'videoAddiction=>heavy']
classification = dt.classify(root_node, test_sample)
print "Classification: ", classification
My question is: how can I construct the sample data (test_sample here) from my own variables? The project homepage says: "You classify new data by first constructing a new data vector." I have searched around but have been unable to find out what a data vector is, or the answer to my question.
Any help would be appreciated!
Um, the example says it all. It's a list of strings, with each feature and its value separated by '=>'. In the example, one feature is 'exercising' and its value is 'never'.
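So, to build test_sample from ordinary Python variables, a sketch like the one below should work; the feature names are taken from the example and have to match the ones used in the training file.

# Values held in ordinary variables (made-up data).
exercising = 'never'
smoking = 'heavy'
fat_intake = 'heavy'
video_addiction = 'heavy'

# The "data vector" is just a list of 'feature=>value' strings.
test_sample = ['exercising=>' + exercising,
               'smoking=>' + smoking,
               'fatIntake=>' + fat_intake,
               'videoAddiction=>' + video_addiction]

classification = dt.classify(root_node, test_sample)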