I am following this tutorial to learn a bit about content recommenders: https://www.datacamp.com/community/tutorials/recommender-systems-python
However, I ran into a MemoryError when running the "content based" part of the tutorial. From some reading I found that this happens because of just how large the dataset being used is. I couldn't find an exact way to run this specific case with low memory, so instead I modified it a little: I split the original dataframe into 6 pieces, ran the cosine similarity calculation for each split dataframe, merged the results together, and then ran the process one last time on the merged data to get a final result. Here is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, indices, cosine_sim, final=False):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    if not final:
        return metadata.iloc[movie_indices, :]
    else:
        return metadata['title'].iloc[movie_indices]
# Load Movies Metadata
metadata = pd.read_csv('dataset/movies_metadata.csv', low_memory=False)
#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')
#Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')
split_db = np.array_split(metadata, 6)
source_db = None
search_db = None
db_remove_idx = None
new_db_list = list()
for x, db in enumerate(split_db):
    search = db.loc[db['title'] == 'The Dark Knight Rises']
    if not search.empty:
        source_db = db
        new_db_list.append(source_db)
        search_db = search
        db_remove_idx = x
        break
split_db.pop(db_remove_idx)
for x, db in enumerate(split_db):
    new_db_list.append(db.append(search_db, ignore_index=True))
del(split_db)
refined_db = None
for db in new_db_list:
    small_db = db.reset_index()
    # Construct the required TF-IDF matrix by fitting and transforming the data
    tfidf_matrix = tfidf.fit_transform(small_db['overview'])
    # Compute the cosine similarity matrix
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    #cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
    # Construct a reverse map of indices and movie titles
    indices = pd.Series(small_db.index, index=small_db['title']).drop_duplicates()
    result = get_recommendations('The Dark Knight Rises', indices, cosine_sim)
    if not isinstance(refined_db, pd.DataFrame):
        refined_db = result.append(search_db, ignore_index=True)
    else:
        refined_db = refined_db.append(result, ignore_index=True)
final_db = refined_db.reset_index()
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(final_db['overview'])
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
#Construct a reverse map of indices and movie titles
indices = pd.Series(final_db.index, index=final_db['title']).drop_duplicates()
final_result = (get_recommendations('The Dark Knight Rises', indices, cosine_sim, final=True))
print(final_result)
I thought this would work, but the results are not even close to what is given in the tutorial:
11 Dracula: Dead and Loving It
13 Nixon
12 Balto
15 Casino
20 Get Shorty
18 Ace Ventura: When Nature Calls
14 Cutthroat Island
16 Sense and Sensibility
19 Money Train
17 Four Rooms
Name: title, dtype: object
Could anyone explain what I am doing wrong here? Since the dataset was too large, I figured that splitting it up, running this "cosine similarity" process first as a refinement, and then running the process again on the resulting data would give a similar result. So why is the result I am getting so different from what is expected?
And this is the data I am using: https://www.kaggle.com/rounakbanik/the-movies-dataset/data
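Side note: the full N x N similarity matrix is what exhausts memory here. A possible lower-memory variant, shown only as a sketch and not as the tutorial's code, would compute just the one row for the query movie:
# Sketch only: similarity of one movie against all others, instead of the full N x N matrix
# (assumes the tutorial's full tfidf_matrix and a single query index idx)
sim_row = linear_kernel(tfidf_matrix[idx], tfidf_matrix).flatten()
top10 = sim_row.argsort()[::-1][1:11]   # indices of the 10 most similar overviews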
Related
I have two dataframes, both with two columns. I want to use WMD to find the closest match for each entity in column source_label among the entities in column target_label. At the end, I would like a DataFrame with all 4 columns (plus the WMD score), matched with respect to the entities.
df1
,source_Label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"
df2
,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"
Expected result
,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65
The dataframes are so big that I am looking for a faster way to iterate over both label columns. So far I tried this:
list_distances = []
temp = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

entity = df1['source_label']
target = df2['target_label']

for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))

# print("list_distances", list_distances)
WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()
First of all, this code is not working well: the other two columns come directly from the dataframes and do not take the entities' relation to their URIs into consideration.
How can one make it faster, given that the entities number in the millions? Thanks in advance.
A quick fix:
closest_neighbour_index_df2 = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

for i in tqdm(entity):
    temp = []
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # maybe assert to make sure it's always right
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))
    # argmin returns the index rather than the value

# Add the indices from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2

# add information to the respective row from df2 using the closest_neighbour column, e.g. as sketched below
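One way to finish that last step (a sketch only, assuming df2 has a default 0..n-1 RangeIndex so the positional indices collected above can be used with .iloc):

# Sketch: pull the matched rows out of df2 and put them next to df1
matched = df2.iloc[df1['closest_neighbour']].reset_index(drop=True)
result = pd.concat([df1.reset_index(drop=True), matched], axis=1)
# result now holds source_label, source_uri, closest_neighbour, target_label, target_uri

If the WMD score column is also needed, min(temp) can be appended to a second list inside the same loop and attached to the result in the same way.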
I am trying to measure the Word Mover's Distance between a lot of texts using Gensim's Word2Vec tools in Python. I am comparing each text with all other texts, so I first use itertools to create pairwise combinations like [1,2,3] -> [(1,2), (1,3), (2,3)]. For memory's sake, I don't do the combinations by having all texts repeated in a big dataframe, but instead make a reference dataframe combinations with indices of the texts, which looks like:
0 1
0 0 1
1 0 2
2 0 3
And then in the comparison function I use these indices to look up the text in the original dataframe. The solution works fine, but I am wondering whether I would be able to do it with big datasets. For instance, I have a 300,000-row dataset of texts, which gives me about 100 years' worth of computation on my laptop:
C(300000, 2) = 300000! / (2! ⋅ (300000 − 2)!)
             = (300000 ⋅ 299999) / (2 ⋅ 1)
             = 44999850000 combinations
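For reference, the count can be verified with the standard-library math.comb (available since Python 3.8):

from math import comb
print(comb(300_000, 2))  # 44999850000 pairwise combinations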
Is there any way this could be optimized better?
My code right now:
import multiprocessing
import itertools
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from gensim.models.word2vec import Word2Vec
from gensim.corpora.wikicorpus import WikiCorpus
def get_distance(row):
    try:
        sent1 = df.loc[row[0], 'text'].split()
        sent2 = df.loc[row[1], 'text'].split()
        return model.wv.wmdistance(sent1, sent2)  # Compute WMD
    except Exception as e:
        return np.nan
df = pd.read_csv('data.csv')
# I then set up the gensim model, let me know if you need that bit of code too.
# Make pairwise combination of all indices
combinations = pd.DataFrame(itertools.combinations(df.index, 2))
# To dask df and apply function
dcombinations = dd.from_pandas(combinations, npartitions= 2 * multiprocessing.cpu_count())
dcombinations['distance'] = dcombinations.apply(get_distance, axis=1)
with ProgressBar():
    combinations = dcombinations.compute()
You might use wmd-relax for performance improvement. However, you'll first have to convert your model to spaCy and use the SimilarityHook as described on their webpage:
import spacy
import wmd
nlp = spacy.load('en_core_web_md')
nlp.add_pipe(wmd.WMD.SpacySimilarityHook(nlp), last=True)
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
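To plug this back into the dask pipeline from the question, a rough sketch (assuming nlp is the pipeline with the SimilarityHook registered above and df still holds the texts) could look like:

# Hypothetical adaptation of get_distance to the spaCy/wmd-relax pipeline above
def get_distance(row):
    try:
        doc1 = nlp(df.loc[row[0], 'text'])
        doc2 = nlp(df.loc[row[1], 'text'])
        return doc1.similarity(doc2)  # WMD computed by the hook
    except Exception:
        return float('nan')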
I have a CSV containing selling figures for various dates.
Here is an example of the file:
DATE, ARTICLENO, QUANTITY
2018-07-17, 101, 50
2018-07-16, 101, 55
2018-07-16, 105, 36
2018-07-15, 105, 23
I read this into a pandas dataframe and ran a basic k-means algorithm on it, but I need more help.
Data description:
The date column is the index of the dataframe and describes the date for the selling value. There are multiple tuples (Date-Quantity-ArticleNo) so there is a time series for each article number. Those can have different lengths and starting dates, which makes predicting and recognizing trends (e.g. good selling in summer or winter) even harder. The CSV is sorted by ArticleNo and Date.
Goal:
Cluster a given set of data from a csv and create labels for good selling articles in summer or winter (seasonal trends) and match future articles to them.
Here is what I did so far (currently I do not have the date as index yet, but that is the goal):
from __future__ import absolute_import, division, print_function

import pandas as pd
import numpy as np
from matplotlib import pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans
import sys

def extract_articles(data, article_numbers):
    return pd.DataFrame(
        [
            data[data['ARTICLENO'] == article_no]['QUANTITY'].values
            for article_no in article_numbers
        ]
    ).fillna(0)

def read_csv_file(file_name, number_of_lines):
    return pd.read_csv(file_name, parse_dates=['DATE'],
                       nrows=number_of_lines)

def get_unique_article_numbers(data):
    return data['ARTICLENO'].unique()

def main():
    data = read_csv_file('statistic.csv', 400000)
    modeling_article_numbers = get_unique_article_numbers(data)
    print("Clustering on", len(modeling_article_numbers), "article numbers")
    modeling_data = extract_articles(data, modeling_article_numbers)
    modeling_data = modeling_data.iloc[:50, :]

    # 'switch' dataframe
    modeling_data = modeling_data.T
    modeling_data = modeling_data.pct_change().fillna(0)
    normalized_modeling_data = preprocessing.normalize(modeling_data,
                                                       norm='l2', axis=0)
    print(modeling_data)

    predicting_article_numbers = [30079229, 30079854, 30086845]
    predicting_article_data = extract_articles(data,
                                               predicting_article_numbers)
    predicting_article_data = predicting_article_data.pct_change().fillna(0)
    normalized_predicting_article_data = preprocessing.normalize(
        predicting_article_data, norm='l2'
    )

    kmeans = KMeans(n_clusters=5,
                    random_state=0).fit(normalized_modeling_data)
    print(kmeans.labels_)

    # for data, article_no in [
    #     (normalized_predicting_article_data, 430079229),
    #     (normalized_predicting_article_data, 430079854),
    #     (modeling_data, 430074590),
    # ]:
    #     print('Predicting article {0}'.format(article_no))
    #     print(kmeans.predict([data[0]]))

    for i, cluster_center in enumerate(kmeans.cluster_centers_):
        plp.plot(cluster_center, label='Center {0}'.format(i))
    plp.legend(loc='best')
    plp.title('Cluster based on ' + str(len(modeling_article_numbers)) +
              ' article numbers')
    plp.show()

main()
I transposed the dataframe because it did not contain the series for each article number along axis 1.
My question is: how can I get a 'description' of the labels? Can I name them?
Maybe k-means is the wrong algorithm for my purposes?
Have you tried making each article a row in your dataset?
After reading your question, I'm not sure whether you did.
Once you have done that, you can aggregate your data, e.g. as quantity per week. If you have more than one year of data, make it the average quantity per week. That gives you a table with 52 features {week 1: sold 500; week 2: sold 520, ...} for every article, as sketched below.
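A minimal sketch of that weekly aggregation (assuming the CSV layout from the question, with columns DATE, ARTICLENO, QUANTITY, and pandas >= 1.1 for dt.isocalendar()):

import pandas as pd

df = pd.read_csv('statistic.csv', parse_dates=['DATE'])
# average quantity per ISO week: one row per article, one column per week
weekly = (df.assign(week=df['DATE'].dt.isocalendar().week)
            .groupby(['ARTICLENO', 'week'])['QUANTITY'].mean()
            .unstack(fill_value=0))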
I don't think k-means is what you are looking for, because you know pretty well what you want, and that makes you a good "teacher" for your algorithm; ergo: use supervised algorithms.
Therefore you need to label at least some (ideally all) of your aggregated product data by hand, but it should be worth the work due to better results.
You could also look into time-series seasonality analysis / time-series decomposition.
Anyway, if you are familiar with scikit-learn, I would give the supervised algorithms (decision trees, random forest, SVM, MLPClassifier, ...) a chance; it might be way easier to accomplish. A rough sketch follows.
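As an illustration only, not a full solution: here weekly is the per-article table from the sketch above, and labels is a hypothetical hand-made Series of seasonal labels ('summer', 'winter', 'none') indexed by article number.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = weekly
y = labels.loc[weekly.index]          # hand-assigned labels, aligned to the articles
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))      # held-out accuracy
# clf.predict(new_weekly_rows) would then label future articles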
I am trying to perform a principal component analysis for work. While I have been successful in getting the principal components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for output sort of like this:
Town PrinComponent 1 PrinComponent 2 PrinComponent 3
Columbia 0.31989 -0.44216 -0.44369
Middletown -0.37101 -0.24531 -0.47020
Harrisburg -0.00974 -0.06105 0.32792
Newport -0.38678 0.40935 -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain to me how I can reach this output?
The code I have so far is below.
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.as_matrix().astype(float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    #print pca_components_df
    #pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    #print filtered.T  #Transformed Dataframe
    trans_filtered.to_csv('trans_filtered.csv')
    print pca.explained_variance_ratio_
I pumped the transformed array into the data portion of the DataFrame function, and then defined the index and columns by passing them to index= and columns= respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
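Putting that together with the code above, a minimal sketch (assuming df is indexed by town and contains only numeric columns):

import pandas as pd
from sklearn import decomposition, preprocessing

pca = decomposition.PCA(n_components=3)
scaled_data = preprocessing.scale(df.astype(float))
transformed = pca.fit_transform(scaled_data)   # one row of component scores per town

scores_df = pd.DataFrame(transformed,
                         columns=['PrinComponent1', 'PrinComponent2', 'PrinComponent3'],
                         index=df.index)
print(scores_df.head())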
I want to do some text processing on a dataset containing Twitter messages. So far I'm able to load the data (.CSV) into a Pandas dataframe and index it by a (custom) 'timestamp' column.
df = pandas.read_csv(f)
df.index = pandas.to_datetime(df.pop('timestamp'))
Looks a bit like this:
user_name user_handle
timestamp
2015-02-02 23:58:42 Netherlands Startups NLTechStartups
2015-02-02 23:58:42 shareNL share_NL
2015-02-02 23:58:42 BreakngAmsterdamNews iAmsterdamNews
[49570 rows x 8 columns]
I can create a new object (Series) containing just the text like so:
texts = pandas.Series(df['text'])
Which creates this:
2015-06-02 14:50:54 Business Update Meer cruiseschepen dan ooit in...
2015-06-02 14:50:53 RT #ProvincieNH: Provincie maakt Markermeerdij...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat: In ...
2015-06-02 14:50:53 Amsterdam - Nieuwe flitspaal Wibautstraat http...
2015-06-02 14:50:53 Lugar secreto em Amsterdam: Begijnhof // Hidde...
Name: text, Length: 49570
1. Is this new object the same sort of type (dataframe) as my initial df variable, just with different columns/rows?
Now, together with the NLTK toolkit, I'm able to tokenize the strings using this:
for w in words:
    print(nltk.word_tokenize(w))
This iterates over the array instead of mapping the 'text' column to a multiple-column 'words' array. 2. How would I do this, and moreover, how do I then count the occurrences of each word?
I know there is a unique() method which I could use to create a distinct list of words. But then again I'd need an extra column with a count over the array, which I'm unable to produce in the first place. :) 3. Or would the next step towards 'counting' occurrences of those words be grouping?
EDIT 3: I seem to need "CountVectorizer", thanks EdChum:
documents = df['text'].values
vectorizer = CountVectorizer(min_df=0, stop_words=[])
X = vectorizer.fit_transform(documents)
print(X.toarray())
My main goal is to count the occurrences of each word and select the top X results. I feel I'm on the right track, but I can't get the final steps quite right.
Building on EdChum's comments, here is a way to get the (I assume global) word counts from CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
df = pd.DataFrame({'text': ['cat on the cat', 'angel eyes has', 'blue red angel',
                            'one two blue', 'blue whales eat', 'hot tin roof',
                            'angel eyes has', 'have a cat'],
                   'class': ['a', 'a', 'a', 'a', 'c', 'c', 'b', 'e']})
X = vect.fit_transform(df['text'].values)
y = df['class'].values
Convert the sparse matrix returned by CountVectorizer to a dense matrix, pass it and the feature names to the DataFrame constructor, then transpose the frame and sum along axis=1 to get the total per word:
word_counts = pd.DataFrame(X.todense(), columns=vect.get_feature_names()).T.sum(axis=1)
word_counts = word_counts.sort_values(ascending=False)  # Series.sort() was removed in newer pandas
word_counts[:3]
If all you are interested in is the frequency distribution of the words, consider using FreqDist from NLTK:
import nltk
import itertools
from nltk.probability import FreqDist
texts = ['cat on the cat','angel eyes has','blue red angel','one two blue','blue whales eat','hot tin roof','angel eyes has','have a cat']
texts = [nltk.word_tokenize(text) for text in texts]
# collapse into a single list
tokens = list(itertools.chain(*texts))
FD = FreqDist(tokens)
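FreqDist then gives you the top X directly, for example:

print(FD.most_common(3))  # the three most frequent tokens and their counts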