Get all similar documents with doc2vec - python

I am working with doc2vec from the gensim library and I want to get the similarities to all documents, not only the top 10 similarities provided by model.docvecs.most_similar().
Once my model is trained
In [1]: print(model)
Out [1]: Doc2vec(...)
If I use model.docvecs.most_similar() I only get the top 10 similar docs:
In [2]: model.docvecs.most_similar('1')
Out [2]: [('2007', 0.9171321988105774),
('606', 0.5638039708137512),
('2578', 0.530228853225708),
('4506', 0.5193327069282532),
('2550', 0.5178008675575256),
('4620', 0.5098666548728943),
('1296', 0.5071642994880676),
('3943', 0.5070815086364746),
('438', 0.5057751536369324),
('1922', 0.5048809051513672)]
I am looking to get all similarity values, not only the top 10, for some analysis.
Thanks for your help :)

most_similar() takes an optional topn parameter, with a default value of 10, meaning just the top 10 results will be returned.
If you supply another integer, such as the total number of doc-vectors known to the model, then that many sorted results will be provided.
(You can also supply Python None, which returns all similarities unsorted, in the same order as the vectors are stored in the model.)
Note these values are cosine similarities, with a range of values from -1.0 to 1.0, not 'probabilities'.
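For example, something like this (a quick sketch; len(model.docvecs) is one way to get the total number of doc-vectors in recent gensim versions):
# every doc-vector in the model, sorted by similarity to doc-tag '1'
all_sorted = model.docvecs.most_similar('1', topn=len(model.docvecs))
# every similarity, unsorted, in the model's storage order
all_unsorted = model.docvecs.most_similar('1', topn=None)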

Naive Bayes, Text Analysis, SKLearn

This is from a text analysis exercise using data from Rotten Tomatoes. The data is in critics.csv, imported as a pandas DataFrame, "critics".
This piece of the exercise is to:
Construct the cumulative distribution of document frequencies (df). The x-axis is a document count (x_i) and the y-axis is the percentage of words that appear less than x_i times. For example, at x = 5, plot a point representing the percentage or number of words that appear in 5 or fewer documents.
From a previous exercise, I have a "Bag of Words"
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# build the vocabulary and transform to a "bag of words"
X = vectorizer.fit_transform(critics.quote)
# Convert matrix to Compressed Sparse Column (CSC) format
X = X.tocsc()
Every example I've found calculates a matrix of documents per word from that "bag of words" matrix in this way:
docs_per_word = X.sum(axis=0)
I buy that this works; I've looked at the result.
But I'm confused about what's actually happening and why it works, what is being summed, and how I might have been able to figure out how to do this without needing to look up what other people did.
I figured this out last night. It doesn't actually work; I misinterpreted the result. (I thought it was working because Jupyter notebooks only show a few values in a large array. But, examined more closely, the array values were too big. The max value in the array was larger than the number of 'documents'!)
X (my "bag of words") is a word-frequency matrix. Summing over X tells you how often each word occurs across the corpus of documents. But the instructions ask for how many documents a word appears in (i.e. between 0 and 4 for four documents), not how many times it appears in the set of those documents (0 to n for four documents).
I need to convert X to a boolean matrix. (Now I just have to figure out how to do this. ;-)
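For what it's worth, here is a minimal sketch of that last step (reusing X from the snippet above; the comparison against 0 is what turns counts into presence/absence):
import numpy as np
X_bool = (X > 0)  # True where word j appears in document i at all
docs_per_word = np.asarray(X_bool.sum(axis=0)).ravel()  # number of documents each word appears in
# cumulative distribution: fraction of words appearing in <= x documents
xs = np.arange(0, docs_per_word.max() + 1)
cdf = [(docs_per_word <= x).mean() for x in xs]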

Using similarities.cosine (with dataset) of SurPRISE package python

Briefing:
I'm working with the MovieLens 100k dataset for movie recommendation. So far I've done the following.
Sorting of values
df_sorted_values = df.sort_values(['UserID', 'MovieID'])
print type(df_sorted_values)
Printing Matrix with NaN values
df_matrix = df.pivot_table(values='Rating', index='UserID', columns='MovieID')
Performed 5 Fold CV on it
reader = Reader(line_format="user item rating", sep='\t', rating_scale=(1,5))
df = Dataset.load_from_file('ml-100k/u.data', reader=reader)
df.split(n_folds=5)
I've evaluated the dataset using SVD
perf = evaluate(SVD(),df,measures=['RMSE','MAE'])
print_perf(perf)
Here I need to use the similarity measure provided by the same package (Surprise), written as surprise.cosine, to predict the missing values. Its signature only shows (*args, **kwargs), so I'm clueless about what actually has to be passed.
Once the similarities are generated, I need to print the matrix with the NaN values replaced by predictions, which will later be used for recommendation.
P.S. I'm open to different solutions using CRAB, RECSYS, PANDAS or GRAPHLAB, provided they can be applied to steps 1 to 4 as well.
My references so far have been:
This manual, which doesn't show how the arguments are passed, nor an example.
This, which isn't much different from the first.
While computing the cosine similarity between two vectors is very easy (for example np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)); one minus that value is the cosine distance),
I would recommend working with SciPy if you don't want to implement it yourself:
from scipy.spatial.distance import cosine
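A tiny usage sketch (a and b are just illustrative vectors; SciPy's cosine() returns the distance, so the similarity is one minus it):
import numpy as np
from scipy.spatial.distance import cosine
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
similarity = 1 - cosine(a, b)  # cosine similarity, in [-1, 1]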
Those similarity functions are used as described in these docs: Using prediction algorithms, FAQ, and The algorithm base class - compute_similarities, for KNN-based algorithms. They are not meant to be used the way you want to use them.
You may want to use the predict function if you choose the SVD algorithm (see The algorithm base class - predict), like:
# Build a trainset from the dataset loaded above, then train the algorithm on it.
trainset = df.build_full_trainset()
algo = SVD()
algo.train(trainset)
uid = str(196)  # raw user id
iid = str(302)  # raw item id
# get a prediction for this specific user and item
pred = algo.predict(uid, iid)
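To then fill the NaN cells of the df_matrix pivot table you built earlier, one slow but simple sketch is to loop over every empty cell (here .est is the estimated rating on the returned Prediction object, and the raw ids are passed as strings because the data was loaded from a file):
import numpy as np
filled_matrix = df_matrix.copy()
for uid in filled_matrix.index:
    for iid in filled_matrix.columns:
        if np.isnan(filled_matrix.at[uid, iid]):
            filled_matrix.at[uid, iid] = algo.predict(str(uid), str(iid)).est
print(filled_matrix)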

Get similarity percent with sklearn hashing vectorizer

I have a Python program that fetches articles from a few sites and stores them in a database. When I want to add a new article to the database, I should check that it's not a duplicate. I want to do this simply by computing a similarity percentage and setting a threshold for it (for example, if the similarity of the two strings is > 70%, the new article is a duplicate).
My problem is finding that similarity percentage. Right now I use difflib and the SequenceMatcher class:
diff = SequenceMatcher(None, article1.content, article2.content).ratio()
But it doesn't feel right, and I think using HashingVectorizer would be better for this case(?):
vectorizer = HashingVectorizer(n_features=(2**18))
article1_vector = vectorizer.transform([article1.content])
article2_vector = vectorizer.transform([article2.content])
How can I get the similarity of the two hashed vectors (for example the cosine similarity), and how can I convert it to a percentage? Thanks for your answers.
With the default settings for HashingVectorizer (in particular, norm="l2"), the cosine similarity between these two vectors is
sim = (article1_vector * article2_vector.T).A[0, 0]
This is really just a dot product with some trickery to get rid of the SciPy sparse matrix format.
This gives a similarity between -1 and 1, so you could add one and divide by two to get a percentage.
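Putting the pieces together, a sketch with illustrative strings in place of your article contents (the last line is just one way to map [-1, 1] onto a percentage):
from sklearn.feature_extraction.text import HashingVectorizer
text1 = "the cat sat on the mat"  # stand-ins for article1.content / article2.content
text2 = "the cat sat on a mat"
vectorizer = HashingVectorizer(n_features=2**18)  # rows are L2-normalised by default
v1 = vectorizer.transform([text1])
v2 = vectorizer.transform([text2])
# dot product of two unit-length rows = cosine similarity, in [-1, 1]
sim = (v1 * v2.T).A[0, 0]
percent = (sim + 1) / 2 * 100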

sci-kit learn: Identifying the corresponding feature-id values when using SelectKBest

I am using sci-kit learn (version 0.11 with Python version 2.7.3) to select the top K features from a binary classification dataset in svmlight format.
I am trying to identify the feature-id values of the selected features. I assumed this would be quite simple - and may well be!
(By feature-id, I mean the number before the feature value as described here)
The following code illustrates exactly how I have been trying to do this:
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2
svmlight_format_train_file = 'contrived_svmlight_train_file.txt' #I present the contents of this file below
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)
featureSelector = SelectKBest(score_func=chi2, k=2)
featureSelector.fit(X_train_data, Y_train_data)
assumed_to_be_the_feature_ids_of_the_top_k_features = list(featureSelector.get_support(indices=True)) #indices=False just gives me a list of True, False, etc.
print assumed_to_be_the_feature_ids_of_the_top_k_features #this gives: [0, 2]
Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1.
Now, I suspect that assumed_to_be_the_feature_ids_of_the_top_k_features may, in fact, correspond to the list indices of the feature-id values sorted in order of increasing value. In my case, index 0 would correspond to feature-id=1 etc. - such that the code is telling me that feature-id=1 and feature-id=3 were selected.
I'd be grateful if someone could either confirm or deny this, however.
Thanks in advance.
Contents of contrived_svmlight_train_file.txt:
1 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA
1 1:1.000000 2:1.000000#mB
0 5:1.000000#mC
1 1:1.000000 2:1.000000#mD
0 3:1.000000 4:1.000000#mE
0 3:1.000000#mF
0 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG
0 2:1.000000#mH
P.S. Apologies for not formatting correctly (first time here); I hope this is legible and comprehensible!
Clearly, assumed_to_be_the_feature_ids_of_the_top_k_features cannot correspond to the feature-id values - since (see below) the feature-id values in my input file start from 1.
Actually, they are. The SVMlight format loader will detect that your input file has one-based indices and will subtract one from every index so as not to waste a column. If that's not what you want, then pass zero_based=True to load_svmlight_file to pretend that it's actually zero-based and insert an extra column; see its documentation for details.
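So, under that default behaviour, recovering the one-based feature-ids from your file is just a matter of adding one to each selected column index (a small sketch reusing featureSelector from above):
selected_columns = featureSelector.get_support(indices=True)  # e.g. [0, 2]
original_feature_ids = [i + 1 for i in selected_columns]      # e.g. [1, 3]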

Clustering using k-means in python

I have a document d1 consisting of lines of the form user_id tag_id.
There is another document d2 consisting of lines of the form tag_id tag_name.
I need to generate clusters of users with similar tagging behaviour.
I want to try this with k-means algorithm in python.
I am completely new to this and can't figure out how to start.
Can anyone give any pointers?
Do I first need to create a separate document for each user from d1, containing his tag vocabulary, and then apply the k-means algorithm to these documents?
There are about 1 million users in d1. I am not sure I am thinking in the right direction; do I really create 1 million files?
The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.
Anyway, if you want to give k-means a try, have a look at the variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
As mentioned by @Jacob Eggers, you have to denormalize the data to form the matrix, which is indeed a sparse one.
Use the SciPy package in Python for k-means. See
Scipy Kmeans
for examples and execution.
Also check Kmeans in python (Stack Overflow) for more information on k-means clustering in Python. A rough sketch follows.
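Here is what that could look like (the file name, column layout of d1, and the cluster count of 100 are assumptions from the question; kmeans2 needs a dense array, so this only scales if you sample or chunk the users):
import pandas as pd
from scipy.cluster.vq import kmeans2
# d1: one "user_id tag_id" pair per line
d1 = pd.read_csv('d1.txt', sep=r'\s+', names=['user_id', 'tag_id'])
# denormalize: one row per user, one column per tag, 1 if the user used that tag
user_tag = pd.crosstab(d1.user_id, d1.tag_id).clip(upper=1)
X = user_tag.values.astype(float)
# labels[i] is the cluster assigned to the i-th user
centroids, labels = kmeans2(X, 100, minit='points')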
First you need to denormalize the data so that you have one file like this:
userid tag1 tag2 tag3 tag4 ....
0001 1 0 1 0 ....
0002 0 1 1 0 ....
0003 0 0 1 1 ....
Then you need to loop through the k-means algorithm. Here is matlab code from the ml-class:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);
    % Move centroid step: compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
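A rough NumPy translation of that loop, assuming X is the denormalized user-tag matrix as a NumPy array (a sketch only: the pairwise-distance step builds a users-by-K-by-features array, so it won't scale to a million users without chunking, and empty clusters aren't handled):
import numpy as np

def find_closest_centroids(X, centroids):
    # squared Euclidean distance from every row of X to every centroid
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def compute_means(X, idx, K):
    # new centroid = mean of the points currently assigned to it (assumes no empty cluster)
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])

K, iterations = 100, 10
rng = np.random.default_rng()
centroids = X[rng.choice(X.shape[0], size=K, replace=False)]  # init from random data points
for _ in range(iterations):
    idx = find_closest_centroids(X, centroids)
    centroids = compute_means(X, idx, K)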
For sparse k-means, see the examples under
scikit-learn clustering.
Roughly how many ids are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, e.g. 100k ids, an average of 10 per user, 100 clusters, may lead to someone who's done clustering in that range (or else to a back-of-the-envelope "impossible").
MinHash may be better suited to your problem than k-means; see chapter 3, Finding Similar Items, of Ullman, Mining Massive Datasets; also SO questions/tagged/similarity+algorithm+python.
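A toy illustration of the MinHash idea (self-contained, not the book's exact construction): each user's tag set is compressed into a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the underlying tag sets.
import random

def minhash_signature(tags, num_hashes=64, seed=0):
    # one "hash function" per salt; the signature stores, for each salt,
    # the minimum hash value seen over the user's tag set
    salts = random.Random(seed).sample(range(2**32), num_hashes)
    return [min(hash((salt, tag)) for tag in tags) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # fraction of agreeing positions approximates the Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# toy usage with two users' tag-id sets
sig_1 = minhash_signature({10, 11, 12, 13})
sig_2 = minhash_signature({11, 12, 13, 14})
print(estimated_jaccard(sig_1, sig_2))  # roughly 0.6 (true Jaccard = 3/5)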
