I have a Python program that fetches articles from a few sites and stores them in a database. When I want to add a new article to the database, I should check that it is not a duplicate. I want to do this simply by computing a percentage of similarity and setting a threshold for it (for example, if the similarity of the two strings is greater than 70%, the new article is a duplicate).
My problem is computing that similarity percentage. Currently I use difflib's SequenceMatcher class:
diff = SequenceMatcher(
None, article1.content, article2.content).ratio()
But it's not quite right, and I think using HashingVectorizer is better for this case(?):
vectorizer = HashingVectorizer(n_features=(2**18))
article1_vector = vectorizer.transform([article1.content])
article2_vector = vectorizer.transform([article2.content])
How can I compute the similarity of two hashed vectors (for example, the cosine distance), and how can I convert it to a percentage? Thanks for your answers.
With the default settings for HashingVectorizer (in particular, norm="l2"), the cosine similarity between these two vectors is
sim = (article1_vector * article2_vector.T).A[0, 0]
This is really just a dot product with some trickery to get rid of the SciPy sparse matrix format.
This gives a similarity between -1 and 1, so you could add one and divide by two to get a percentage.
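A minimal end-to-end sketch of that idea (assuming scikit-learn's HashingVectorizer with its default norm="l2"; the two article strings here are just placeholders):

from sklearn.feature_extraction.text import HashingVectorizer

article1_text = "Example article fetched from site A ..."
article2_text = "Example article fetched from site B ..."

vectorizer = HashingVectorizer(n_features=2**18)   # default norm="l2" keeps each row unit-length
v1 = vectorizer.transform([article1_text])
v2 = vectorizer.transform([article2_text])

# Dot product of two L2-normalized rows is the cosine similarity, in [-1, 1]
sim = (v1 * v2.T).A[0, 0]

# Map to a 0-100 scale and apply the duplicate threshold from the question
percent = (sim + 1) / 2 * 100
is_duplicate = percent > 70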
Related
I have a list of words, and I need to create a pairwise similarity matrix using the Fasttext word embedding. This is what I am currently doing:
from gensim.models import fasttext as ft
from sklearn.metrics import pairwise_distances
path='cc.en.300.bin'
model=ft.load_facebook_vectors(path, encoding='utf-8')
wordlist = [x for x in df_['word']] # list of words from dataframe
wordlist_vec = [model[x] for x in df_['word']] #get word vector
wd_arr = np.array(wordlist_vec).reshape(-1, 1) # reshape to compute pairwise distance
distances = pairwise_distances(wd_arr, wd_arr, metric=model.similarity) # pairwise distance matrix
This should yield a pairwise distance matrix using Gensim's cosine similarity function. Unfortunately, I get a memory error:
Unable to allocate 1013. GiB for an array with shape (368700, 368700) and data type float64
I guess this is because it's trying to store all the word vectors in memory (we are talking about ~1100 words, tops).
I am not sure which way to proceed here. Is there a native gensim function to create a similarity matrix starting from a list of words? Alternatively, what could be a clever way to get it?
The error clearly indicates that pairwise_distances() has been given 368,700 items whose distances should be calculated with 368,700 other items.
That would take (368,700^2) * 8 bytes ≈ 1013 GiB of RAM to compute, which your machine likely does not have, hence the error.
If you think it should be only "~1100 words, tops", take a look at your interim values – wordlist, wordlist_vec, & wd_arr – to make sure each is the size/shape/contents you intend.
(You may run into another issue when you fix that, though: I don't think model.similarity is of the exact type expected by the metric parameter of pairwise_distances().)
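Not a definitive fix, but a sketch of what the intended shapes might look like for roughly a thousand 300-dimensional vectors (note that 368,700 / 300 = 1,229, so the word list may also be larger than expected): keep each word vector as one row of an (n_words, 300) array instead of reshaping everything into a single column, and let scikit-learn compute the pairwise cosine similarities.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for np.array([model[w] for w in wordlist]): n_words vectors of dimension 300
n_words, dim = 1100, 300
word_vectors = np.random.rand(n_words, dim).astype(np.float32)

# Keep the shape (n_words, 300); reshape(-1, 1) would turn this into
# n_words * 300 separate one-dimensional "items", which is where the huge
# item count in the error comes from.
sim_matrix = cosine_similarity(word_vectors)   # shape (1100, 1100), fits in memory easily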
I am trying to implement the TextRank algorithm, where I am calculating the cosine-similarity matrix for all the sentences. I want to parallelize the creation of the similarity matrix using Spark but don't know how to implement it. Here is the code:
cluster_summary_dict = {}
for cluster, sentences in tqdm(cluster_wise_sen.items()):
    sen_sim_matrix = np.zeros([len(sentences), len(sentences)])
    for row in range(len(sentences)):
        for col in range(len(sentences)):
            if row != col:
                sen_sim_matrix[row][col] = cosine_similarity(
                    cluster_dict[cluster][row].reshape(1, 100),
                    cluster_dict[cluster][col].reshape(1, 100))[0, 0]
    sentence_graph = nx.from_numpy_array(sen_sim_matrix)
    scores = nx.pagerank(sentence_graph)
    pagerank_sentences = sorted(((scores[k], sent) for k, sent in enumerate(sentences)),
                                reverse=True)
    cluster_summary_dict[cluster] = pagerank_sentences
Here, cluster_wise_sen is a dictionary that contains the list of sentences for each cluster ({'cluster 1': [list of sentences], ..., 'cluster n': [list of sentences]}). cluster_dict contains the 100-d vector representation of the sentences. I have to compute the sentence similarity matrix for each cluster. Since this is time consuming, I am looking to parallelize it using Spark.
The experiments with large-scale matrix calculation for cosine similarity are well written up here!
To achieve speed without compromising much on accuracy, you can also try hashing methods like MinHash and evaluate Jaccard distance similarity. It comes with a nice implementation in Spark MLlib; the documentation has very detailed examples for reference: http://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance
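Following the linked Spark ML documentation, a minimal PySpark MinHashLSH sketch (the toy binary vectors here stand in for hashed token-set features of your sentences; all names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("minhash-demo").getOrCreate()

# Toy binary feature vectors, one row per sentence
data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
        (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0])),
        (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]))]
df = spark.createDataFrame(data, ["id", "features"])

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)

# Approximate self-join: keep pairs whose Jaccard distance is below 0.8
pairs = model.approxSimilarityJoin(df, df, 0.8, distCol="JaccardDistance")
pairs.filter("datasetA.id < datasetB.id") \
     .select("datasetA.id", "datasetB.id", "JaccardDistance") \
     .show()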
This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to
Construct the cumulative distribution of document frequencies (df).
The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear fewer than x_i times. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every sample I've found calculates a matrix of documents per word from that "bag of words" matrix this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words) is a word frequency vector. Summing over X provides information on how often each word occurs within the corpus of documents. But the instructions as for how many documents a word appears in (e.g. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 - n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
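For what it's worth, a sketch of that conversion and of the cumulative distribution on top of it (counting each word at most once per document):

import numpy as np

docs_per_word = (X > 0).sum(axis=0)          # 1 x n_words: document frequency of each word
df_counts = np.asarray(docs_per_word).ravel()

# Fraction of words that appear in x or fewer documents, for each x
xs = np.arange(1, df_counts.max() + 1)
cdf = [(df_counts <= x).mean() for x in xs]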
Briefing:
I'm working with the MovieLens 100k dataset for movie recommendation. So far I've done the following.
Sorting of values
df_sorted_values = df.sort_values(['UserID', 'MovieID'])
print type(df_sorted_values)
Printing Matrix with NaN values
df_matrix = df.pivot_table(values='Rating', index='UserID', columns='MovieID')
Performed 5 Fold CV on it
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1,5))
df = Dataset.load_from_file('ml-100k/u.data', reader=reader)
df.split(n_folds=5)
I've evaluated the dataset using SVD
perf = evaluate(SVD(),df,measures=['RMSE','MAE'])
print_perf(perf)
HERE I NEED TO USE THE SIMILARITY ALGORITHM provided by the same package (Surprise), which is written as surprise.cosine, to predict the missing values. It shows that it needs (*args, **kwargs) arguments, but I'm clueless as to what actually has to be passed.
ONCE THE SIMILARITIES ARE GENERATED I NEED TO PRINT THE MATRIX WITH THE NaN VALUES REPLACED BY THE PREDICTED ONES, which will later be used for recommendation.
P.S. I'm open to different solutions using CRAB, RECSYS, PANDAS and GRAPHLAB, provided they can cover steps 1 to 4 as well.
My past references have been:
This manual, which doesn't show how the arguments are passed, nor does the example.
This, which doesn't have much difference from the first.
While computing the cosine distance between two vectors by hand is very easy (how about 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))?),
I would recommend working with SciPy if you don't want to implement it yourself:
from scipy.spatial.distance import cosine
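Note that scipy.spatial.distance.cosine returns the cosine distance (1 minus the similarity), so for example:

from scipy.spatial.distance import cosine

a, b = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
similarity = 1 - cosine(a, b)   # cosine() is the distance; this recovers the similarity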
Those similarity functions are used as described in these docs: Using prediction algorithms, FAQ, and The algorithm base class - compute_similarities for KNN-based algorithms. They are not meant to be used the way you want to use them.
You may want to use the predict function if you choose the SVD algorithm (The algorithm base class - predict), like:
# Build an algorithm, and train it.
algo = SVD()
algo.train(trainset)
uid = str(196) # raw user id
iid = str(302) # raw item id
# get a prediction for specific users and items.
pred = algo.predict(uid, iid)
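If the end goal really is the full matrix with the NaNs replaced by predictions, here is a rough sketch continuing the question's variables (df is the Surprise Dataset from step 3, df_matrix the pivot table from step 2; newer Surprise versions use fit() instead of train()):

import pandas as pd

trainset = df.build_full_trainset()
algo = SVD()
algo.train(trainset)                 # algo.fit(trainset) on newer Surprise versions

filled = df_matrix.copy()
for uid in filled.index:
    for iid in filled.columns:
        if pd.isna(filled.at[uid, iid]):
            # raw ids in ml-100k are strings, hence str(); .est is the predicted rating
            filled.at[uid, iid] = algo.predict(str(uid), str(iid)).est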
I have a document d1 consisting of lines of the form user_id tag_id.
There is another document d2 consisting of lines of the form tag_id tag_name.
I need to generate clusters of users with similar tagging behaviour.
I want to try this with k-means algorithm in python.
I am completely new to this and can't figure out how to start.
Can anyone give any pointers?
Do I need to first create different documents for each user using d1 with his tag vocabulary?
And then apply k-means algorithm on these documents?
There are around 1 million users in d1. I am not sure I am thinking in the right direction, creating 1 million files?
The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.
Anyway, if you want to give k-means a try, have a look at the variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
As mentioned by @Jacob Eggers, you have to denormalize the data to form the matrix, which is indeed a sparse one.
Use the SciPy package in Python for k-means. See Scipy Kmeans for examples and execution.
Also check Kmeans in python (Stack Overflow) for more information on k-means clustering in Python.
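A small sketch of that SciPy route on a toy user-tag matrix (k = 2 here, purely illustrative):

import numpy as np
from scipy.cluster.vq import kmeans2

# Toy denormalized user-tag matrix: one row per user, one column per tag (1 = user used the tag)
user_tag = np.array([[1, 0, 1, 0],
                     [1, 1, 1, 0],
                     [0, 0, 1, 1],
                     [0, 1, 0, 1]], dtype=float)

centroids, labels = kmeans2(user_tag, 2, minit="points")
print(labels)   # labels[i] is the cluster assigned to user i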
First you need to denormalize the data so that you have one file like this:
userid tag1 tag2 tag3 tag4 ....
0001 1 0 1 0 ....
0002 0 1 1 0 ....
0003 0 0 1 1 ....
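A hypothetical pandas sketch of that denormalization step (assuming d1 is loaded into a DataFrame with user_id and tag_id columns; the file and column names are placeholders):

import pandas as pd

d1 = pd.read_csv("d1.txt", sep=" ", names=["user_id", "tag_id"])

# Binary user x tag matrix; clip() guards against a user repeating a tag
user_tag = pd.crosstab(d1["user_id"], d1["tag_id"]).clip(upper=1)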
Then you need to loop through the k-means algorithm. Here is matlab code from the ml-class:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
For sparse k-means, see the examples under
scikit-learn clustering.
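As a sketch of that sparse route (MiniBatchKMeans scales better to a million users; the tiny matrix below is a stand-in for the denormalized user-tag data):

import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans

# Tiny stand-in for the sparse user x tag matrix
rows = [0, 0, 1, 1, 2, 2, 3, 3]
cols = [0, 2, 1, 2, 2, 3, 1, 3]
user_tag = sp.csr_matrix(([1.0] * len(rows), (rows, cols)), shape=(4, 5))

km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
labels = km.fit_predict(user_tag)   # KMeans/MiniBatchKMeans accept CSR input directly
print(labels)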
Roughly how many ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who has done clustering in that range (or else to a back-of-the-envelope "impossible").
MinHash may be better suited for your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets; also SO questions/tagged/similarity+algorithm+python.