Briefing:
I'm working with the MovieLens 100k dataset for movie recommendation. So far I've done the following:
Sorting of values
df_sorted_values = df.sort_values(['UserID', 'MovieID'])
print type(df_sorted_values)
Printing Matrix with NaN values
df_matrix = df.pivot_table(values='Rating', index='UserID', columns='MovieID')
Performed 5 Fold CV on it
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1,5))
df = Dataset.load_from_file('ml-100k/u.data', reader=reader)
df.split(n_folds=5)
I've evaluated the dataset using SVD
perf = evaluate(SVD(),df,measures=['RMSE','MAE'])
print_perf(perf)
Here I need to use the similarity algorithm provided by the same package (Surprise), written as surprise.cosine, to predict the missing values. The documentation shows that it takes (*args, **kwargs) arguments, but I'm clueless as to what actually has to be passed.
Once the similarities are generated, I need to print the matrix with the NaN values replaced by the predictions, which will later be used for recommendation.
P.S. I'm open to different solutions using CRAB, RECSYS, pandas or GraphLab, provided they can cover steps 1 to 4 as well.
My past references have been:
This manual, which doesn't show how the arguments are passed, nor give an example.
This, which isn't much different from the first.
Computing the cosine similarity between two vectors yourself is easy: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) gives the similarity, and subtracting it from 1 gives the cosine distance. If you don't want to implement it yourself, I would recommend working with SciPy:
from scipy.spatial.distance import cosine
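A tiny sketch of both options (note that scipy's cosine, like the expression above, returns the distance, so subtract from 1 to get the similarity):
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine distance
assert abs(dist - cosine(a, b)) < 1e-12                            # same value via SciPy
similarity = 1 - dist                                              # cosine similarity, in [-1, 1]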
Those similarity functions are used the way the docs describe: see "Using prediction algorithms", the FAQ, and the algorithm base class's compute_similarities for KNN-based algorithms. They are not meant to be used the way you want to use them.
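For completeness, a minimal sketch of how the cosine similarity is normally selected, through the sim_options of a KNN-style algorithm rather than by calling surprise.cosine directly (df is the Dataset loaded above; algo.train is the older API, newer Surprise versions use algo.fit):
from surprise import KNNBasic

sim_options = {'name': 'cosine', 'user_based': True}  # user-user cosine similarities
algo = KNNBasic(sim_options=sim_options)

trainset = df.build_full_trainset()   # train on the whole data set
algo.train(trainset)                  # algo.fit(trainset) on newer versions

sim_matrix = algo.compute_similarities()   # the user-user cosine similarity matrix
print(sim_matrix)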
If you choose the SVD algorithm, you may want to use the predict function (see the algorithm base class's predict), like:
# Build an algorithm, and train it on the full data set.
trainset = df.build_full_trainset()
algo = SVD()
algo.train(trainset)
uid = str(196) # raw user id
iid = str(302) # raw item id
# get a prediction for specific users and items.
pred = algo.predict(uid, iid)
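To get the matrix from step 2 with the NaN values replaced, one option (a rough sketch; it assumes df_matrix is the UserID x MovieID pivot table built above and algo is the trained SVD model) is to fill every missing cell with the estimate returned by predict:
import numpy as np

filled_matrix = df_matrix.copy()
for uid in filled_matrix.index:
    for iid in filled_matrix.columns:
        if np.isnan(filled_matrix.at[uid, iid]):
            # raw ids are strings when the data was loaded from u.data, hence str()
            filled_matrix.at[uid, iid] = algo.predict(str(uid), str(iid)).est

print(filled_matrix)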
Related
I am actually working with doc2vec from the gensim library, and I want to get all similarities with probabilities, not only the top 10 similarities provided by model.docvecs.most_similar().
Once my model is trained:
In [1]: print(model)
Out [1]: Doc2vec(...)
If I use model.docvecs.most_similar() I get only the Top 10 similar docs
In [2]: model.docvecs.most_similar('1')
Out [2]: [('2007', 0.9171321988105774),
('606', 0.5638039708137512),
('2578', 0.530228853225708),
('4506', 0.5193327069282532),
('2550', 0.5178008675575256),
('4620', 0.5098666548728943),
('1296', 0.5071642994880676),
('3943', 0.5070815086364746),
('438', 0.5057751536369324),
('1922', 0.5048809051513672)]
And I am looking to get all probabilities, not only the top 10, for some analysis.
Thanks for your help :)
most_similar() takes an optional topn parameter, with a default value of 10, meaning just the top 10 results will be returned.
If you supply another integer, such as the total number of doc-vectors known to the model, then that many sorted results will be provided.
(You can also supply Python None, which returns all similarities unsorted, in the same order as the vectors are stored in the model.)
Note these values are cosine similarities, with a range of values from -1.0 to 1.0, not 'probabilities'.
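A short sketch of both variants (assuming model is the trained Doc2Vec model above; exactly how you get the total number of doc-vectors can vary a little between gensim versions):
# All doc-vectors, sorted by cosine similarity (largest first):
all_sorted = model.docvecs.most_similar('1', topn=len(model.docvecs))

# Raw similarity values for every doc-vector, unsorted, in storage order:
all_unsorted = model.docvecs.most_similar('1', topn=None)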
I have downloaded the popular 20 newsgroups data set, which has 20 classes, but I want to re-classify all the documents into six classes, since some classes are very related.
So for example, all computer related docs should have a new class say 1. As it is now, the docs are assigned from 1-20 reflecting the classes. The computer related classes are 2,3,4,5,and 6.
I want, say, 1 to be the class of all the computer related ones (2,3,4,5,6). I tested it by using 20_newsgroups.target[0], and it gave me 7, meaning the class of the doc at 0 is 7.
I re-assigned it to a new class using 20_newsgroups.target[0]='1' and when I try 20_newsgroups.target[0], it shows 1 which is OK.
But how can I do this for all the documents that currently have (2,3,4,5,6) as their class? I can easily extend it to the other classes if I understand that one. I also tried:
for d in 20_newsgroups:
    if 20_newsgroups.target in [2,3,4,5,6]:
        20_newsgroups.target = '1'
But this shows the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".
I'm not sure if I understand your question, but you seem to want to join categories into supercategories. This should not be hard to do, but it's less than optimal to do this at a late stage of the experiment. If you want to reduce the number of categories, do this by joining some of the categories as the very first step of your process. That way, similar samples from different (original) categories will not cause confusion in the training phase (provided, of course, that they now belong to the same new category), thereby producing a better overall result.
You could do something like this. The code is based on the retrieval of the 20newsgroup data set with scikit learn: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
topic_1 = [0,15,19]
topic_2 = [1,2,3,4,5]
topic_3 = [6]
topic_4 = [7,8,9,10]
topic_5 = [11,12,13,14]
topic_6 = [16,17,18]
topics = [topic_1, topic_2, topic_3, topic_4, topic_5, topic_6]
The topic distribution is based on the table provided by http://qwone.com/~jason/20Newsgroups/ (but can be adjusted). The following code reduces the amount of categories of the data set.
twenty_train_reduced = twenty_train.target.copy()  # note the (): copy the array, not the method
for index, target in enumerate(twenty_train.target):
    for topic_i, topic in enumerate(topics):
        if target in topic:
            twenty_train_reduced[index] = topic_i
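If you prefer a vectorized version, a small numpy sketch (assuming, as above, that the six topic lists together cover all 20 original labels) would be:
import numpy as np

# Lookup table: original label (0-19) -> new topic index (0-5)
mapping = np.empty(20, dtype=int)
for topic_i, topic in enumerate(topics):
    mapping[topic] = topic_i

# Remap every document's label in one step
twenty_train_reduced = mapping[twenty_train.target]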
I have a Python program that fetches articles from a few sites and stores them in a database. When I want to add a new article to the database, I have to check that it's not a duplicate. I want to do this simply by computing a percentage of similarity and setting a threshold for it (for example, if the similarity between the two strings is > 70%, the new article is a duplicate).
My problem is computing that percentage of similarity. Right now I use difflib and the SequenceMatcher class:
diff = SequenceMatcher(
    None, article1.content, article2.content).ratio()
But it's not quite right, and I think using HashingVectorizer would be better for this case (?):
vectorizer = HashingVectorizer(n_features=(2**18))
article1_vector = vectorizer.transform([article1.content])
article2_vector = vectorizer.transform([article2.content])
How can I get the similarity between two hashed vectors (for example the cosine distance), and how can I convert it to a percentage? Thanks for your answers.
With the default settings for HashingVectorizer (in particular, norm="l2"), the cosine similarity between these two vectors is
sim = (article1_vector * article2_vector.T).A[0, 0]
This is really just a dot product with some trickery to get rid of the SciPy sparse matrix format.
This gives a similarity between -1 and 1, so you could add one, divide by two and multiply by 100 to get a percentage.
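Putting it together, a minimal sketch (the two strings stand in for article1.content and article2.content from the question):
from sklearn.feature_extraction.text import HashingVectorizer

text1 = "first article body"    # article1.content in the question
text2 = "second article body"   # article2.content in the question

vectorizer = HashingVectorizer(n_features=2**18)   # norm='l2' is the default
v1 = vectorizer.transform([text1])
v2 = vectorizer.transform([text2])

sim = (v1 * v2.T).A[0, 0]          # cosine similarity of two l2-normalised rows
percent = (sim + 1) / 2 * 100      # map [-1, 1] onto [0, 100]
print(percent)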
I'm trying to run an AB test - comparing revenue amongst variants on websites.
Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random
def resampler(original_array, number_of_samples):
    sample_array = np.zeros(number_of_samples)
    choice = random.choice
    for i in range(number_of_samples):
        sample_array[i] = sum([choice(original_array) for _ in range(len(original_array))])
    y = stats.normaltest(sample_array)
    if y[1] > 0.001:
        print y
        new_y = resampler(original_array, number_of_samples * 2)
        y = new_y
    return sample_array
Basically, randomly sample from the 'revenue vector' (a sparsely populated vector - a zero for all non-converting visitors) and sum the resulting vectors until you've got a normal distribution.
I can perform this for both test groups at which point I've got two normally distributed quantities for t-testing. Using scipy.stats.ttest_ind I was able to get results that looked someway reasonable.
However, I wondered what the effect of running this procedure on the cookie split would be (we expected each group to see 50% of the cookies). Here, I saw something fairly unexpected. Given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print stats.ttest_ind(x,y)
I get the output: (0.0021911476165975929, 0.99827342714956546)
Not at all significant (I think I'm interpreting that correctly?)
However, when I run this code:
t_value_array = []
p_value_array = []
for i in range(1000, 100000, 5000):
    one_array = resampler(x, i)
    two_array = resampler(y, i)
    t_value, p_value = stats.ttest_ind(one_array, two_array)
    t_value_array.append(t_value)
    p_value_array.append(p_value)
print np.mean(t_value_array)
print np.mean(p_value_array)
I get:
0.642213492773
0.490587258892
I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.
Am I missing an obvious way to run this kind of test?
Cheers,
Matt
P.S.
The data we have:
Website 1 : test group 1: unique cookies: revenue
Website 1 : test group 2: unique cookies: revenue
Website 2 : test group 1: unique cookies: revenue
Website 2 : test group 2: unique cookies: revenue
etc.
What we'd like:
Test group x is beating test group y with z% certainty
(null hypothesis of test group 1 = test group 2)
Bonus:
The same as above but at a per site, as well as overall, basis
Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
On to your question. It's very hard to read that code and understand what you think you're testing: what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense), it looks pretty confused.
I'm going to have to guess what the data look like. You have a bunch of samples like this:
Website Method Revenue
------- ------ -------
w1 A 12
w2 B 0
w3 A 6
w4 B 0
etc. Does this look correct? Do you have repeated measures (i.e. a revenue measurement for each website for each method), or did you randomly assign websites to methods? I'm guessing that what you're passing to your method is an array of all revenues for one of the methods in turn, but do they pair up across methods in any way?
I can imagine testing various hypotheses with this data. For example, is method A more likely to generate non-zero revenue than method B (use logistic regression, response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (t-test on non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality when you include the zeros). I assume this troubling assumption is why you run the procedure of repeatedly subsampling until your data look normal, but you can't do this and test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only this part of it! In fact, I wouldn't be surprised to see that what this essentially does is excludes either most of the zero entries or most of the non-zero entries.
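For concreteness, here is a rough sketch of the first two of those tests under the guessed data layout (rev_a and rev_b are hypothetical per-visitor revenue arrays for methods A and B, zeros for non-converters; statsmodels is used for the logistic regression):
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

# Hypothetical per-visitor revenues (zero = no conversion)
rev_a = np.array([12.0, 0.0, 6.0, 0.0, 3.5, 0.0])
rev_b = np.array([0.0, 8.0, 0.0, 0.0, 1.0, 2.0])

# 1) Is method A more likely to generate any revenue at all? Logistic regression.
converted = np.concatenate([rev_a > 0, rev_b > 0]).astype(int)
is_method_a = np.concatenate([np.ones(len(rev_a)), np.zeros(len(rev_b))])
logit = sm.Logit(converted, sm.add_constant(is_method_a)).fit(disp=0)
print(logit.summary())

# 2) Of the cases where revenue was generated, does A generate more? t-test on non-zeros.
print(stats.ttest_ind(rev_a[rev_a > 0], rev_b[rev_b > 0]))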
If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.
I have a document d1 consisting of lines of the form user_id tag_id.
There is another document d2 consisting of lines of the form tag_id tag_name.
I need to generate clusters of users with similar tagging behaviour.
I want to try this with k-means algorithm in python.
I am completely new to this and can't figure out how to start on this.
Can anyone give any pointers?
Do I need to first create a separate document for each user from d1, containing his tag vocabulary, and then apply the k-means algorithm to these documents?
There are about 1 million users in d1. I am not sure I am thinking in the right direction; do I really need to create 1 million files?
The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.
Anyway, if you want to give k-means a try, have a look at the variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
As mentioned by Jacob Eggers, you have to denormalize the data to form the matrix, which is indeed a sparse one.
Use the SciPy package in Python for k-means; see Scipy Kmeans for examples and execution.
Also check Kmeans in python (Stack Overflow) for more information on k-means clustering in Python.
First you need to denormalize the data so that you have one file like this:
userid tag1 tag2 tag3 tag4 ....
0001 1 0 1 0 ....
0002 0 1 1 0 ....
0003 0 0 1 1 ....
Then you need to run the k-means loop. Here is Matlab code from the ml-class:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
For sparse k-means, see the examples under
scikit-learn clustering.
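A small Python sketch of the same idea with scikit-learn on a sparse matrix (the tiny matrix below is a toy stand-in for the denormalised user x tag data; MiniBatchKMeans copes better with a million users than plain KMeans):
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

# Toy user x tag matrix: rows = users, columns = tags, 1 = user used the tag
X = csr_matrix(np.array([[1, 0, 1, 0],
                         [1, 0, 1, 1],
                         [0, 1, 0, 1],
                         [0, 1, 0, 0]]))

km = MiniBatchKMeans(n_clusters=2, random_state=0)
labels = km.fit_predict(X)   # cluster index for each user
print(labels)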
About how many ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who's done clustering in that range (or else to a back-of-the-envelope "impossible").
MinHash may be better suited to your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets; also SO questions/tagged/similarity+algorithm+python.
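If you want to try the MinHash route, a minimal sketch using the third-party datasketch package (its MinHash and MinHashLSH classes; the dictionary below is a toy stand-in for the user -> tag sets built from d1):
from datasketch import MinHash, MinHashLSH

def minhash_of(tags, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for tag in tags:
        m.update(str(tag).encode('utf8'))
    return m

# Toy data: user id -> set of tag ids
users = {'u1': {1, 2, 3}, 'u2': {1, 2, 4}, 'u3': {7, 8, 9}}

lsh = MinHashLSH(threshold=0.5, num_perm=128)
signatures = {uid: minhash_of(tags) for uid, tags in users.items()}
for uid, sig in signatures.items():
    lsh.insert(uid, sig)

# Users whose tag sets are roughly >= 50% Jaccard-similar to u1 (includes u1 itself)
print(lsh.query(signatures['u1']))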