Clustering using k-means in Python

I have a document d1 consisting of lines of the form user_id tag_id.
There is another document d2 consisting of lines of the form tag_id tag_name.
I need to generate clusters of users with similar tagging behaviour.
I want to try this with the k-means algorithm in Python.
I am completely new to this and can't figure out how to start.
Can anyone give any pointers?
Do I need to first create a separate document for each user from d1, containing his tag vocabulary?
And then apply the k-means algorithm on these documents?
There are about 1 million users in d1, so I am not sure creating 1 million files is the right direction.

The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.
Anyway, if you want to give k-means a try, have a look at the variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
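If you want to experiment with the cosine-based idea without a dedicated spherical k-means implementation, here is a rough sketch (only an approximation, not the exact algorithms named above: for unit-length rows, squared Euclidean distance is 2 - 2*cos, so ordinary k-means on L2-normalized rows effectively clusters by cosine). X is a random placeholder for your user-by-tag 0/1 matrix.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# X stands in for the (n_users x n_tags) 0/1 matrix built from d1
X = np.random.randint(0, 2, size=(1000, 50)).astype(float)
X_unit = normalize(X, norm="l2")                    # each user row now has unit length
labels = KMeans(n_clusters=20, n_init=10).fit_predict(X_unit)
print(np.bincount(labels))                          # cluster sizes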

As mentioned by @Jacob Eggers, you have to denormalize the data to form the matrix, which is indeed sparse. Use the SciPy package in Python for k-means; see
Scipy Kmeans
for examples and usage.
Also check Kmeans in python (Stack Overflow) for more information on k-means clustering in Python.
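For instance, a minimal sketch with scipy.cluster.vq (the data matrix here is a random placeholder for your denormalized user-tag matrix):

import numpy as np
from scipy.cluster.vq import whiten, kmeans2

data = np.random.rand(100, 20)              # placeholder for your user-tag matrix
whitened = whiten(data)                     # scale each column to unit variance
centroids, labels = kmeans2(whitened, 5, minit='points')
print(labels[:10])                          # cluster index of the first 10 users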

First you need to denormalize the data so that you have one file like this:
userid tag1 tag2 tag3 tag4 ....
0001 1 0 1 0 ....
0002 0 1 1 0 ....
0003 0 0 1 1 ....
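If writing that file out is awkward at your scale, one sketch of building the same thing directly as a SciPy sparse matrix (assuming d1 is a whitespace-separated file named d1.txt; the file name and format are assumptions):

import numpy as np
from scipy.sparse import csr_matrix

users, tags = [], []
with open("d1.txt") as f:                   # assumed: whitespace-separated "user_id tag_id"
    for line in f:
        u, t = line.split()
        users.append(u)
        tags.append(t)

# map raw ids onto consecutive row/column indices
user_index = {u: i for i, u in enumerate(sorted(set(users)))}
tag_index = {t: j for j, t in enumerate(sorted(set(tags)))}

rows = [user_index[u] for u in users]
cols = [tag_index[t] for t in tags]
X = csr_matrix((np.ones(len(rows), dtype=np.int8), (rows, cols)),
               shape=(len(user_index), len(tag_index)))
X.data[:] = 1                               # collapse any duplicate (user, tag) pairs to 1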
Then you need to loop through the k-means algorithm. Here is MATLAB code from the ml-class:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
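For reference, a rough NumPy sketch of the same loop in Python (a toy version for small dense data, not a drop-in replacement for the ml-class helpers; with a million users you would want a sparse, chunked implementation instead):

import numpy as np

def run_kmeans(X, K, iterations=10):
    rng = np.random.default_rng(0)
    # initialize centroids as K randomly chosen data points
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iterations):
        # cluster assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        idx = dists.argmin(axis=1)
        # move-centroid step: each centroid becomes the mean of its assigned points
        for k in range(K):
            if np.any(idx == k):
                centroids[k] = X[idx == k].mean(axis=0)
    return centroids, idx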

For sparse k-means, see the examples under
scikit-learn clustering.
About how many ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who's done clustering in that range (or else to a back-of-the-envelope "impossible").
MinHash may be better suited for your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets; also see SO questions tagged similarity+algorithm+python.
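To give a flavour of MinHash, here is a toy sketch (an illustration only, not the book's exact construction): each user's tag set is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the tag sets.

import random

NUM_HASHES = 50
PRIME = 2_147_483_647          # large prime for the hash functions h(x) = (a*x + b) % PRIME
random.seed(0)
hash_params = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def minhash_signature(tag_ids):
    # minimum hash value over the tag set, for each hash function
    return [min((a * t + b) % PRIME for t in tag_ids) for a, b in hash_params]

def estimated_jaccard(sig1, sig2):
    # fraction of positions where the signatures agree
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# toy example: two users with overlapping tag sets (true Jaccard = 3/7)
s1 = minhash_signature({1, 2, 3, 4, 5})
s2 = minhash_signature({3, 4, 5, 6, 7})
print(estimated_jaccard(s1, s2))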

Related

Using similarities.cosine (with dataset) of the Surprise package in Python

Briefing:
I'm working with the MovieLens 100k dataset for movie recommendation. So far I've done the following:
1. Sorted the values
df_sorted_values = df.sort_values(['UserID', 'MovieID'])
print type(df_sorted_values)
2. Printed the matrix with NaN values
df_matrix = df.pivot_table(values='Rating', index='UserID', columns='MovieID')
3. Performed 5-fold CV on it
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1,5))
df = Dataset.load_from_file('ml-100k/u.data', reader=reader)
df.split(n_folds=5)
4. Evaluated the dataset using SVD
perf = evaluate(SVD(), df, measures=['RMSE','MAE'])
print_perf(perf)
Here I need to use the similarity algorithm provided by the same package (Surprise), written as surprise.cosine, to predict the missing values. The documentation shows that it takes (*args, **kwargs) arguments, but I'm clueless as to what actually has to be passed.
Once the similarities are generated, I need to print the matrix with the NaN values replaced by the now-predicted ones, which will later be used for recommendation.
P.S. I'm open to different solutions from CRAB, RECSYS, PANDAS and GRAPHLAB, provided they can be worked out on steps 1 to 4 as well.
My past references have been:
this manual, which shows neither how the arguments are passed nor an example, and
this question, which isn't much different from the first.
While computing the cosine distance between two vectors yourself is very easy (how about 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))?), I would recommend working with SciPy if you don't want to implement it yourself:
from scipy.spatial.distance import cosine
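For example, a minimal usage sketch (note that scipy.spatial.distance.cosine returns the cosine distance, so subtract it from 1 to get the similarity the question asks about):

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
distance = cosine(a, b)        # 1 - cos(a, b)
similarity = 1 - distance      # the quantity the question asks for
print(distance, similarity)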
Those similarity functions are used as described in these docs: Using prediction algorithms, the FAQ, and The algorithm base class - compute_similarities for KNN-based algorithms. They are not meant to be used the way you want to use them.
You may want to use the predict function if you choose to use the SVD algorithm (see The algorithm base class - predict), like:
# Build an algorithm, and train it.
algo = SVD()
algo.train(trainset)
uid = str(196) # raw user id
iid = str(302) # raw item id
# get a prediction for specific users and items.
pred = algo.predict(uid, iid)
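If you then want the full matrix with the NaNs replaced by predicted ratings, here is a sketch (assuming df_matrix is the UserID x MovieID pivot table from step 2 of the question and algo has been trained as in the snippet above; this brute-force loop is fine for 100k but slow for much larger matrices):

import numpy as np

filled = df_matrix.copy()                      # UserID x MovieID pivot table with NaNs
for uid in filled.index:
    for iid in filled.columns:
        if np.isnan(filled.at[uid, iid]):
            # Surprise's raw ids are strings when the data is loaded from u.data
            filled.at[uid, iid] = algo.predict(str(uid), str(iid)).est
print(filled.head())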

Fast comparison of a large number of lists of lists

Comparing lists of lists has been posted about before, but the Python environment that I am working in cannot fully integrate all the methods and classes in NumPy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists, and ideally 1 million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using NumPy or pandas?
I am able to use some of the methods and classes in NumPy but not all of them. For example, numpy.allclose and numpy.all do not work properly because of the environment that I am working in.
global rel_tol, avg_lists
rel_tol = .1
avg_lists = []
cntr = 0
# compare the lists in the big list and output ~8-10 lists that approximate
# all the lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is < 10%
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
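For example, a sketch of that last step, continuing from the big_list, model and centers variables above:

import numpy as np

X = np.array(big_list)
labels = model.labels_                      # cluster index assigned to each list by fit()
representatives = []
for k in range(model.n_clusters):
    members = np.where(labels == k)[0]      # indices of the lists in cluster k
    dists = np.linalg.norm(X[members] - centers[k], axis=1)
    representatives.append(big_list[members[dists.argmin()]])
print(len(representatives))                 # 8 representative lists, one per cluster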

Get similarity percent with sklearn hashing vectorizer

I have a Python program that fetches articles from a few sites and stores them in a database. When I want to add a new article to the database, I should check that it's not a duplicate. I want to do this simply by computing a percentage of similarity and setting a threshold for it (for example, if the similarity of the two strings is > 70%, then the new article is a duplicate).
My problem is finding the percentage of similarity. Right now I use difflib and the SequenceMatcher class:
diff = SequenceMatcher(
    None, article1.content, article2.content).ratio()
But it's not right, and I think using HashingVectorizer is better for this case(?):
vectorizer = HashingVectorizer(n_features=(2**18))
article1_vector = vectorizer.transform([article1.content])
article2_vector = vectorizer.transform([article2.content])
How can I get the percentage of similarity of two hashed vectors (for example via cosine distance), and how can I convert it to a percentage? Thanks for your answers.
With the default settings for HashingVectorizer (in particular, norm="l2"), the cosine similarity between these two vectors is
sim = (article1_vector * article2_vector.T).A[0, 0]
This is really just a dot product with some trickery to get rid of the SciPy sparse matrix format.
This gives a similarity between -1 and 1, so you could add one and divide by two to get a percentage.
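Putting it together, a self-contained sketch with placeholder texts (your real article1.content / article2.content strings would go where the literals are):

from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**18)      # norm="l2" is the default
v1 = vectorizer.transform(["the quick brown fox jumps over the lazy dog"])
v2 = vectorizer.transform(["a quick brown fox leaped over a lazy dog"])

sim = (v1 * v2.T).A[0, 0]          # cosine similarity, since the rows are L2-normalized
percent = (sim + 1) / 2 * 100      # map [-1, 1] onto [0, 100] as described above
print(sim, percent)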

Running AB tests on Revenue in Python

I'm trying to run an AB test - comparing revenue amongst variants on websites.
Our standard approach (using t-tests) didn't seem like it would work because revenue can't be modelled binomially. However, I read about bootstrapping and came up with the following code:
import numpy as np
import scipy.stats as stats
import random

def resampler(original_array, number_of_samples):
    sample_array = np.zeros(number_of_samples)
    choice = random.choice
    for i in range(number_of_samples):
        sample_array[i] = sum([choice(original_array) for _ in range(len(original_array))])
    y = stats.normaltest(sample_array)
    if y[1] > 0.001:
        print y
        new_y = resampler(original_array, number_of_samples * 2)
        y = new_y
    return sample_array
Basically, randomly sample from the 'revenue vector' (a sparsely populated vector - a zero for all non-converting visitors) and sum the resulting vectors until you've got a normal distribution.
I can perform this for both test groups at which point I've got two normally distributed quantities for t-testing. Using scipy.stats.ttest_ind I was able to get results that looked someway reasonable.
However, I wondered what the effect of running this procedure on cookie split would be (expected each group to see 50% of the cookies). Here, I saw something fairly unexpected - given the following code:
x = [272898,389076,61091,65251,10060,1468815,216014,25863,42421,476379,73761]
y = [274253,387941,61333,65020,10056,1466908,214679,25682,42873,474692,73837]
print stats.ttest_ind(x,y)
I get the output: (0.0021911476165975929, 0.99827342714956546)
Not at all significant (I think I'm interpreting that correctly?)
However, when I run this code:
t_value_array = []
p_value_array = []
for i in range(1000, 100000, 5000):
    one_array = resampler(x, i)
    two_array = resampler(y, i)
    t_value, p_value = stats.ttest_ind(one_array, two_array)
    t_value_array.append(t_value)
    p_value_array.append(p_value)

print np.mean(t_value_array)
print np.mean(p_value_array)
I get:
0.642213492773
0.490587258892
I'm not really sure how to interpret these numbers - as far as I'm aware, I've repeatedly generated normal distributions from the actual cookie splits (each number in the array represents a different site). In each of these cases, I've used a t-test on the two distributions and gotten a t-statistic and a p-value.
Is this a legitimate thing to do? I only ran these tests multiple times because I was seeing so much variation in the p-value and t-statistic when not doing this.
Am I missing an obvious way to run this kind of test?
Cheers,
Matt
P.S.
The data we have:
Website 1 : test group 1: unique cookies: revenue
Website 1 : test group 2: unique cookies: revenue
Website 2 : test group 1: unique cookies: revenue
Website 2 : test group 2: unique cookies: revenue
etc.
What we'd like:
Test group x is beating test group y with z% certainty
(null hypothesis of test group 1 = test group 2)
Bonus:
The same as above but at a per site, as well as overall, basis
Firstly, using a t-test to test binomial response variables isn't correct. You need to use a logistic regression model.
On to your question. It's very hard to read that code and understand what you think you're testing: what's your H_0 (null hypothesis)? If I'm being honest (and I hope you don't take offense), it looks pretty confused.
I'm going to have to guess what the data look like---you have a bunch of samples like this:
Website Method Revenue
------- ------ -------
w1 A 12
w2 B 0
w3 A 6
w4 B 0
etc. Does this look correct? Do you have repeated measures, i.e. a revenue measurement for each website for each method, or did you randomly assign websites to methods? I'm guessing that what you're passing to your method is an array of all revenues for one of the methods in turn, but do they pair up across methods in any way?
I can imagine testing various hypotheses with this data. For example, is method A more likely to generate non-zero revenue than method B (use logistic regression, response is binary)? Of the cases where a method generates revenue at all, does method A generate more than method B (t-test on non-zero revenues)? Does method A generate more revenue than method B across all instances (probably a sign test, due to problems with the assumption of normality when you include the zeros). I assume this troubling assumption is why you run the procedure of repeatedly subsampling until your data look normal, but you can't do this and test anything meaningful: just because some subset of your data is normally distributed doesn't mean you can look at only this part of it! In fact, I wouldn't be surprised to see that what this essentially does is excludes either most of the zero entries or most of the non-zero entries.
If you elaborate with what some of the actual data look like, and what questions you want to answer, I'm happy to make more specific suggestions.
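To make two of those suggestions concrete, here is a sketch on made-up per-visitor revenue vectors (revenue_a / revenue_b are hypothetical placeholders, and a two-proportion z-test stands in for the logistic regression mentioned above; your real analysis would use the actual per-visitor data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000
# mostly-zero revenue per visitor: ~2% of visitors in A convert, ~2.5% in B
revenue_a = np.where(rng.random(n) < 0.020, rng.exponential(50.0, n), 0.0)
revenue_b = np.where(rng.random(n) < 0.025, rng.exponential(48.0, n), 0.0)

# 1) "Does one group convert more often?" -- binary response.
conv_a, conv_b = (revenue_a > 0).sum(), (revenue_b > 0).sum()
p_pool = (conv_a + conv_b) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (conv_a / n - conv_b / n) / se
print("conversion z:", z, "two-sided p:", 2 * stats.norm.sf(abs(z)))

# 2) "Of the converters, does one group spend more?" -- Welch's t-test on the
#    non-zero revenues only.
t, p = stats.ttest_ind(revenue_a[revenue_a > 0], revenue_b[revenue_b > 0],
                       equal_var=False)
print("revenue-given-conversion t:", t, "p:", p)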

Collaborative Filtering: Non-Personalized item-to-item similarity

I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for either computing item similarity for ranked items, for finding user-user similarity, or for finding recommended items based on the current users' history. I'd like to start off with a non-targeted approach before factoring in the current users' preferences.
Looking at the Amazon.com recommendations white paper, they use the following logic for offline item-item similarity:
For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2
If I understand correctly, by the time we're at "Compute the similarity between I1 and I2", I have a list of items (I2) purchased in conjunction with a single item I1 (the outer loop).
How is this calculation performed?
Another idea is that I'm overthinking this and making it more difficult than I need to - Would it be enough to do a top-n query on the count of I2 bought in conjunction with I1?
I would also appreciate suggestions on whether or not this approach is the correct one. My product database has about 150k items at any time. Since the bulk of the reading material I've seen covers user-item or even user-user similarity, should I be looking to go that route instead?
I've worked with similarity algorithms in the past but they've always involved a rank or a score. I think the only way this would work would be to build a customer-product matrix scoring 0/1 for not purchased/purchased. Given the purchase history and the item size, this could get really large.
Edit: although I listed Python as a tag, I'd prefer to keep the logic inside the database, preferably using Oracle PL/SQL.
Let's understand item-to-item collaborative filtering.
Suppose we have the purchase matrix

      Item1  Item2  ...  ItemN
User1   0      1    ...    0
User2   1      1    ...    0
 .
 .
 .
UserM   1      0    ...    0

Then we can calculate the item similarity using the column vectors, e.g. with cosine similarity. This gives a symmetric item-similarity matrix, as below

      Item1  Item2  ...  ItemN
Item1   1     1/M   ...    0
Item2  1/M     1    ...    0
 .
 .
 .
ItemN   0      0    ...    1

This can be read as "customers who viewed/purchased X have also viewed/purchased Y, Z, ..." (collaborative filtering), because each item's vector is built from the users who purchased it.
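In Python, that column-wise cosine computation looks roughly like this (a small dense sketch; a 150k-item catalog would need a sparse representation):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# rows = users, columns = items (0/1 purchase flags)
purchases = np.array([
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
])
item_sim = cosine_similarity(purchases.T)   # transpose so similarity is between item columns
print(np.round(item_sim, 2))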
Amazon's logic is exactly the same as above; their aim is just to improve efficiency. As they say:
We could build a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair. However, many product pairs have no common customers, and thus the approach is inefficient in terms of processing time and memory usage. The iterative algorithm provides a better approach by calculating the similarity between a single product and all related products.
There's a good O'Reilly book on this topic. While the whitepaper might lay the logic out in pseudo-code like that, I don't think that approach would scale very well. The calculations are all probability calculations, so things like Bayes' theorem get used to say, "Given that person A purchased X, what's the likelihood they purchased Z?" Straightforward looping over the data is working too hard; you have to go through all of it for each person.
@Neil, or whoever comes to this question later on:
The choice of similarity metric is up to you, and you might want to leave it malleable for the future. Check out the Wikipedia article on the Frobenius norm for a start, or, as in the link you submitted, the Jaccard coefficient cos(I1, I2).
User-item vs. user-user vs. item-item, or whatever combination, cannot be answered objectively. It depends on what kind of data you can get from your users, how the UI draws information out of them, what parts of your data you consider reliable, and your own time constraints (as far as hybrids go).
Since many people have done masters theses on the questions above, you probably want to start with the easiest implementable solution while leaving room for growth in the complexity of the algorithm.
This may not be a perfect answer to your question, but another way to look at this problem is frequent itemset mining, which computes all the frequently co-purchased product pairs/groups given a minimum frequency threshold. You can then map a customer's purchase to its commonly co-purchased products.
There is no model training or Bayesian probability prediction, because it's a pure counting problem: you just need to count the frequency of all product pairs purchased together in your transaction base. It's an exponential search space, but there are many efficient algorithms and implementations out there (SPMF is a very good one, written in Java). This could work as a quick baseline model.
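As a rough illustration of that counting baseline (toy transactions, no SPMF involved):

from collections import Counter
from itertools import combinations

transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C"},
    {"A", "C"},
]
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2
frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent_pairs)       # {('A', 'B'): 2, ('A', 'C'): 2, ('B', 'C'): 2}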
