How to build features using pandas column and a dictionary efficiently? - python

I have a machine learning problem where I am calculating bigram Jaccard similarity of a pandas dataframe text column with values of a dictionary. Currently I am storing them as a list and then converting them to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
Following are the steps I am currently following:
For each key in dict:
1. Get bigrams for the pandas column and the dict[key]
2. Calculate Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
from itertools import tee, islice
import numpy as np
def count_ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
def n_gram_jaccard_similarity(str1, str2, n):
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except ZeroDivisionError:
        # neither string has enough words to form an n-gram
        return np.nan
def jc_list(sample_dict, row, n):
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
Using the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:

So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short time I had, I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
and define a CountVectorizer to create character-based n-grams:
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise taking an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g. convert to lowercase, get rid of whitespace, etc.).
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because the vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the CountVectorizer and transform the data accordingly:
ngrammed = ngram_vectorizer.fit_transform(all_data) > 0
ngrammed is now a sparse boolean matrix that marks which tokens appear in each row, rather than their counts. You can inspect ngram_vectorizer (its vocabulary_ attribute) to find the mapping from tokens to column ids.
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)
jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    repeated_row_matrix = (csr_matrix(np.ones([text_rows,1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(pd.Series((intersection.sum(axis=1)/support.sum(axis=1)).A1, name=key))
support is the boolean matrix that measures the union of the n-grams of the two items being compared. intersection is True only where a token is present in both. Note that .A1 returns the matrix object flattened into a 1-D base array.
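As an aside, the same union and intersection counts can be computed for every (text, sample) pair at once with dense NumPy arrays. This is only a sketch on made-up toy matrices, not the sparse data from the answer; the names t, s, union_size and jaccard_all are illustrative, and it assumes the matrices are small enough to hold densely and that no pair has an empty union.

```python
import numpy as np

# Toy boolean presence matrices (rows = documents, columns = n-gram ids).
texts_toy = np.array([[1, 1, 0, 1],
                      [0, 1, 1, 1]], dtype=bool)
samples_toy = np.array([[1, 1, 0, 0],
                        [0, 0, 1, 1],
                        [1, 1, 1, 1]], dtype=bool)

t = texts_toy.astype(np.int64)
s = samples_toy.astype(np.int64)

# |A & B| for every (text, sample) pair via a single matrix product.
intersection_size = t @ s.T
# |A | B| = |A| + |B| - |A & B|
union_size = t.sum(axis=1)[:, None] + s.sum(axis=1)[None, :] - intersection_size

# Shape (n_texts, n_samples); assumes union_size has no zeros.
jaccard_all = intersection_size / union_size
```

Each column of jaccard_all then corresponds to one dictionary entry, which is the same layout the loop above builds one Series at a time.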
Now
pd.concat(jaccard_similarities, axis=1)
gives
r1 r2 r3
0 0.631579 0.444444 0.500000
1 0.480000 0.333333 0.384615
You can also concatenate it with df:
pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
which gives
id text r1 r2 r3
0 1 this is a sample text 0.631579 0.444444 0.500000
1 2 this is a second sample text 0.480000 0.333333 0.384615
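Note that the answer above uses character bigrams, whereas the question asks about word bigrams. If word bigrams are what is needed, a plain-Python sketch along the following lines may already help, because it tokenizes every dictionary value only once and writes the columns directly instead of going through a string. The helper names word_ngrams and jaccard are made up for this example.

```python
import pandas as pd

def word_ngrams(text, n=2):
    """Return the set of word n-grams in `text` (empty if it has fewer than n words)."""
    words = text.split()
    return set(zip(*(words[i:] for i in range(n))))

def jaccard(a, b):
    """Jaccard similarity of two sets; NaN when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else float("nan")

df = pd.DataFrame({"id": ["1", "2"],
                   "text": ["this is a sample text", "this is a second sample text"]})
sample_dict = {"r1": "this is sample 1", "r2": "is sample", "r3": "sample text 2"}

# Build the n-gram sets once per dictionary value and once per row.
dict_grams = {key: word_ngrams(value) for key, value in sample_dict.items()}
row_grams = df["text"].map(word_ngrams)

# One column per dictionary key, with no intermediate string representation.
for key, grams in dict_grams.items():
    df[key] = [jaccard(row, grams) for row in row_grams]
```

The loop over rows is still Python-level, so this is a simplification of the original approach rather than a vectorized one.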

Related

How to get average pairwise cosine similarity per group in Pandas

I have a sample dataframe as below
df=pd.DataFrame(np.array([['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],['apple', "vice president"], ['apple', 'swimming contest']]),columns=['firm','text'])
Now I'd like to calculate the degree of text similarity within each firm using word embedding. For example, the average cosine similarity for facebook would be the cosine similarity between row 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for each company, since it is a within-firm pairwise comparison.
I wrote below code:
import gensim
from gensim import utils
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity
# map each word to vector space
def represent(sentence):
    vectors = []
    for word in sentence:
        try:
            vector = model.wv[word]
            vectors.append(vector)
        except KeyError:
            pass
    return np.array(vectors).mean(axis=0)
# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items if word in model_glove.vocab]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec=np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
# get average pairwise cosine distance score
def mean_cos_sim(grp):
    output = []
    for i,j in combinations(grp.index.tolist(),2 ):
        doc_vec=document_vector(grp.iloc[i]['text'])
        if doc_vec is not None and len(doc_vec) > 0:
            sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1,-1),document_vector(grp.iloc[j]['text']).reshape(1,-1))
            output.append([i, j, sim])
    return np.mean(np.array(output), axis=0)
# save the result to a new column
df['mean_cos_between_items']=df.groupby(['firm']).apply(mean_cos_sim)
However, I got below error:
Could you kindly help? Thanks!
Note that sklearn.metrics.pairwise.cosine_similarity, when passed a single matrix X, automatically returns the pairwise similarities between all samples in X. I.e., it isn't necessary to manually construct pairs.
Say you construct your average embeddings with something like this (I'm using glove-twitter-25 here),
def mean_embeddings(s):
    """Transfer a list of words into mean embedding"""
    return np.mean([model_glove.get_vector(x) for x in s], axis=0)
df["embeddings"] = df.text.str.split().apply(mean_embeddings)
so df.embeddings turns out
>>> df.embeddings
0 [-0.2597, -0.153495, -0.5106895, -1.070115, 0....
1 [0.0600965, 0.39806002, -0.45810497, -1.375365...
2 [-0.43819, 0.66232, 0.04611, -0.91103, 0.32231...
3 [0.1912625, 0.0066999793, -0.500785, -0.529915...
4 [-0.82556, 0.24555385, 0.38557374, -0.78941, 0...
Name: embeddings, dtype: object
You can get the mean pairwise cosine similarity like so, with the main point being that you can directly apply cosine_similarity to the adequately prepared matrix for each group:
(
    df.groupby("firm").embeddings  # extract 'embeddings' for each group
    .apply(np.stack)  # turns sequence of arrays into proper matrix
    .apply(cosine_similarity)  # the magic: compute pairwise similarity matrix
    .apply(np.mean)  # get the mean
)
which, for the model I used, results in:
firm
apple 0.765953
facebook 0.893262
Name: embeddings, dtype: float32
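One caveat: cosine_similarity returns the full square matrix, so np.mean also averages the diagonal (each item compared with itself, which is always 1) and counts every pair twice. If only distinct pairs should count, a sketch like the following may be closer to what the question describes. It uses plain NumPy on toy vectors; the function name mean_pairwise_cosine is made up for this example, and it assumes no vector is all zeros.

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Mean cosine similarity over distinct pairs (i < j) of row vectors."""
    x = np.asarray(vectors, dtype=float)
    if len(x) < 2:
        return float("nan")  # a single item has no pairs to compare
    # Normalize rows to unit length (assumes no all-zero rows).
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = unit @ unit.T
    upper = np.triu_indices(len(x), k=1)  # indices strictly above the diagonal
    return float(sim[upper].mean())

toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_mean = mean_pairwise_cosine(toy_vectors)
```

It could be used in place of the last two .apply calls, for example .apply(np.stack).apply(mean_pairwise_cosine).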
Remove the .vocab in model_glove.vocab; it is no longer supported in the current version of gensim. Edit: the text also needs split() so that you iterate over words and not characters.
# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items.split() if word in model_glove]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. You also put everything into output, including the words (or indices) i and j, so to get an average you would have to specify what exactly to average over. Since you don't seem to need i and j, you can put only the resulting similarities in a list and then take the list's average:
from itertools import combinations
# get pairwise cosine similarity score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.tolist(), 2):
        if document_vector(i) is not None and len(document_vector(i)) > 0:
            sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
            output.append(sim)
    return np.mean(output, axis=0)
Here you try to add the results as a column but the number of rows is going to be different as the result DataFrame only has one row per firm while the original DataFrame has one per text. So you have to create a new DataFrame (which you can optionally then merge/join with the original DataFrame based on the firm column):
df = pd.DataFrame(np.array(
[['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)
Which overall will give you (Edit: updated):
print(df_grpd)
> firm
apple [[0.53190523]]
facebook [[0.83989316]]
Name: text, dtype: object
Edit:
I just noticed that the reason for the super high score is that this is missing a tokenization, see the changed part. Without the split() this just compares character similarities which tend to be super high.
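To put the per-firm value next to every row, as the question asks, one option is to map the firm column through the per-firm result. The sketch below shows only the mechanics: the scores are made-up placeholders, not output of the model, and per_firm stands in for a flattened version of df_grpd.

```python
import pandas as pd

df = pd.DataFrame({
    "firm": ["facebook", "facebook", "facebook", "apple", "apple"],
    "text": ["women tennis", "men basketball", "club",
             "vice president", "swimming contest"],
})

# Placeholder per-firm scores (made-up numbers, only to illustrate the lookup).
per_firm = pd.Series({"apple": 0.5, "facebook": 0.8})

# Look up each row's firm in the per-firm Series; rows of one firm share a value.
df["mean_cos_between_items"] = df["firm"].map(per_firm)
```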

pandas: calculate overlapping words between rows only if values in another column match (issue with multiple instances)

I have a dataframe that looks like the following, but with many rows:
import pandas as pd
data = {'intent': ['order_food', 'order_food','order_taxi','order_call','order_call','order_call','order_taxi'],
'Sent': ['i need hamburger','she wants sushi','i need a cab','call me at 6','she called me','order call','i would like a new taxi' ],
'key_words': [['need','hamburger'], ['want','sushi'],['need','cab'],['call','6'],['call'],['order','call'],['new','taxi']]}
df = pd.DataFrame (data, columns = ['intent','Sent','key_words'])
I have calculated the jaccard similarity using the code below (not my solution):
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    return intersection
and modified the code given by @Amit Amola to compare overlapping words between every possible pair of rows and created a dataframe out of it:
overlapping_word_list=[]
for val in list(combinations(range(len(data_new)), 2)):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
#creating an overlap dataframe
banking_overlapping_words_per_sent = DataFrame(overlapping_word_list,columns=['overlapping_list'])
@gold_cy's answer has helped me, and I made some changes to it to get the output I want:
for intent in df.intent.unique():
    # loc returns a DataFrame but we need just the column
    rows = df.loc[df.intent == intent,['intent','key_words','Sent']].values.tolist()
    combos = combinations(rows, 2)
    for combo in combos:
        x, y = rows
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
The issue is that when there are more than two instances of the same intent, I run into the error:
ValueError: too many values to unpack (expected 2)
and I do not know how to handle that for the many more examples I have in my dataset.
Do you want this?
from itertools import combinations
from operator import itemgetter
items_to_consider = []
for item in list(combinations(zip(df.Sent.values, map(set,df.key_words.values)),2)):
    keywords = (list(map(itemgetter(1),item)))
    intersect = keywords[0].intersection(keywords[1])
    if len(intersect) > 0:
        str_list = list(map(itemgetter(0),item))
        str_list.append(intersect)
        items_to_consider.append(str_list)
for i in items_to_consider:
    for item in i[2]:
        if item in i[0] and item in i[1]:
            print(f"Overlap of intent (order_food) for ({i[0]}) and ({i[1]}) is {item}")
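For reference, the ValueError in the question comes from the line x, y = rows, which unpacks the whole list of rows instead of the current pair, so it fails as soon as an intent has more than two sentences. A minimal sketch that compares rows only within the same intent, assuming the sample data from the question (the results list and its tuple layout are choices made for this example):

```python
from itertools import combinations

import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call',
                   'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6',
                 'she called me', 'order call', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'],
                      ['call', '6'], ['call'], ['order', 'call'], ['new', 'taxi']]}
df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])

results = []
for intent, group in df.groupby('intent'):
    rows = list(zip(group['Sent'], group['key_words']))
    # Unpack each pair produced by combinations, not the list of rows itself.
    for (sent_x, kw_x), (sent_y, kw_y) in combinations(rows, 2):
        overlap = set(kw_x) & set(kw_y)
        results.append((intent, sent_x, sent_y, overlap))

for intent, sent_x, sent_y, overlap in results:
    print(f"Overlap of intent ({intent}) for ({sent_x}) and ({sent_y}) is {overlap}")
```

Pairs with an empty overlap are kept here; filter on the last tuple element if only non-empty overlaps are wanted.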

How to make a pandas dataframe from list data generated

I have a list of co-authors:
ten_author_pairs = [('creutzig', 'gao'),
('creutzig', 'linshaw'),
('gao', 'linshaw'),
('jing', 'zhang'),
('jing', 'liu'),
('zhang', 'liu'),
('jing', 'xu'),
('briant', 'einav'),
('chen', 'gao'),
('chen', 'jing')]
From here I can generate a list of negative examples - i.e. authors-pairs which are unconnected using the following code:
#generating negative examples -
from itertools import combinations
elements = list(set([e for l in ten_author_pairs for e in l])) # find all unique elements
complete_list = list(combinations(elements, 2)) # generate all possible combinations
#convert to sets to negate the order
set1 = [set(l) for l in ten_author_pairs]
complete_set = [set(l) for l in complete_list]
# find sets in `complete_set` but not in `set1`
ten_unconnnected = [list(l) for l in complete_set if l not in set1]
print(len(ten_author_pairs))
print(len(ten_unconnnected))
Next, I want to implement a link prediction problem for which I want to obtain a dataframe as follows:
author-pair jaccard Resource_Allocation Adamic_Adar Preferential cn_soundarajan_hopcroft within_inter_cluster link
creutzig-linshaw 0.25 0.25 0.25 0.25 0.25 0.25 1
I can calculate these and have lists with scores as output using networkx documentation, but I am not able to put it together as a table as shown above.
Like for the positive examples (the list mentioned above), I can generate a dataframe using:
df = pd.DataFrame(ten_author_pairs, columns = ['u1','u2'])
and then make a graph with:
G = nx.from_pandas_edgelist(df, 'u1', 'u2', create_using = nx.Graph())
After which say for jaccard index I can apply:
nx.jaccard_coefficient(G)
which returns an iterator over node pairs with their Jaccard scores.
The 'link' column is generated with the logic - 1 for co-authors and 0 for pairs in the negative example.
But, I need all the respective scores as a table as mentioned.
Can anyone please help me with how to construct the above dataframe.
(The scores mentioned are just for example purpose to indicate the kind of table i need)
Oh -- this has been a good two years, but I just stumbled upon this...in case I understood you correctly, building on your basis:
from itertools import combinations
import pandas as pd
import networkx as nx
elements = list(set([e for l in ten_author_pairs for e in l]))
complete_list = list(combinations(elements, 2))
set1 = [set(l) for l in ten_author_pairs]
df = pd.DataFrame(set1, columns=["u1", "u2"])
G = nx.from_pandas_edgelist(df, "u1", "u2", create_using=nx.Graph())
Then defining the list of generators
list_generators = [
    nx.jaccard_coefficient,
    nx.resource_allocation_index,
    nx.adamic_adar_index,
    nx.preferential_attachment,
]
Building the score dataframe:
dfx = pd.DataFrame()
for item_generator in list_generators:
    if dfx.shape[0]:
        dfx = dfx.merge(
            right=get_df_network(generator=item_generator, graph=G),
            left_index=True,
            right_index=True,
        )
    else:
        dfx = get_df_network(generator=item_generator, graph=G)
And finally merging in the link dataframe
df_link = (
    pd.DataFrame(set1, columns=["node_0", "node_1"])
    .set_index(["node_0", "node_1"])
    .assign(link=[1] * len(set1))
)
dfx.merge(df_link, left_index=True, right_index=True, how="outer").fillna(0)
could do the job?
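The answer calls get_df_network without defining it. A minimal sketch of what such a helper might look like, assuming it should turn the (u, v, score) triples yielded by a networkx link-prediction function into a one-column DataFrame indexed by node pair. The body, the column naming and the node_0/node_1 index names are assumptions chosen to line up with df_link, not the answerer's code.

```python
import pandas as pd

def get_df_network(generator, graph):
    """Hypothetical helper: collect (u, v, score) triples into a one-column frame."""
    # Link-prediction functions such as nx.jaccard_coefficient yield (u, v, score).
    rows = list(generator(graph))
    # Name the score column after the function (naming scheme is an assumption).
    name = getattr(generator, "__name__", "score")
    return (
        pd.DataFrame(rows, columns=["node_0", "node_1", name])
        .set_index(["node_0", "node_1"])
    )
```

Keep in mind that, when no ebunch argument is passed, these networkx functions score only node pairs that are not already connected, so the co-author pairs themselves would need to be passed explicitly to get scores for them. The order of nodes within a pair may also differ between networkx output and df_link, so sorting each pair before indexing may be needed for the merge to line up.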

index 0 is out of bounds for axis 0 with size 0 Python

PLEASE READ:
I have looked at all the other answers related to this question and none of them solve my specific problem so please carry on reading below.
I have the code below. It keeps the Title column and concatenates the rest of the columns into one in order to create a cosine similarity matrix.
The main point is the recommendations function, which is supposed to take a title as input and return the top 10 matches based on that title. What I get instead is the "index 0 is out of bounds for axis 0 with size 0" error, and I have no idea why.
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
df = df[['Title','Genre','Director','Actors','Plot']]
df.head()
df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Plot']
    # instantiating Rake, by default it uses english stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words as keys and their scores as values
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column for the corresponding movie
    row['Key_words'] = list(key_words_dict_scores.keys())
# dropping the Plot column
df.drop(columns = ['Plot'], inplace = True)
# instantiating and generating the count matrix
df['bag_of_words'] = df[df.columns[1:]].apply(lambda x: ' '.join(x.astype(str)),axis=1)
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
indices = pd.Series(df.index)
# defining the function that takes in movie title
# as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is '+ idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This line:
idx = indices[indices == title].index[0]
will fail if you do not return a match:
df.loc[df['Title']=='This is not a valid title'].index[0]
returns:
IndexError: index 0 is out of bounds for axis 0 with size 0
You need to confirm that the title you are passing in is actually in df before trying to access any data associated with it:
def recommendations(title, cosine_sim = cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    # `in` on a Series checks the index labels, so compare against the values
    if title not in indices.values:
        raise KeyError("title is not in indices")
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    # idx is an integer, so format it instead of concatenating it to a string
    print(f'idx is {idx}')
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This expression also seems to be doing nothing:
for index, row in df.iterrows():
    plot = row['Plot']
If you just want a single plot record with which to do some development try:
plot = df['Plot'].sample(n=1)
Finally, it appears that recommendations is using the global variable indices - in general this is bad practice, as if indices changes outside of the scope of recommendations the function might break. I would consider refactoring this to be a little less brittle overall.
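One way to make the function less dependent on global state is to pass in everything it needs. The sketch below assumes the titles are kept in a Series whose positions line up with the rows of the similarity matrix; the names recommend, titles and top_n are illustrative and not part of the original code, and the matrix is a made-up toy example.

```python
import numpy as np
import pandas as pd

def recommend(title, titles, cosine_sim, top_n=10):
    """Return up to `top_n` titles most similar to `title`."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if matches.size == 0:
        # Fail with a clear message instead of an IndexError on .index[0].
        raise KeyError(f"title not found: {title!r}")
    idx = matches[0]
    order = np.argsort(cosine_sim[idx])[::-1]  # most similar first
    order = order[order != idx][:top_n]        # drop the query movie itself
    return titles.iloc[order].tolist()

titles = pd.Series(["A", "B", "C"])
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.1],
                [0.9, 0.1, 1.0]])
top_two = recommend("A", titles, sim, top_n=2)
```

Because titles and cosine_sim are arguments, the function keeps working even if module-level variables are reassigned later.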

How to remove rows that have 3 word or less in dataframe?

Because I want to remove ambiguity when I train the data. I want to clean it well. So how can I remove all rows that contain 3 words or less in python?
Hello World! This will be my first contribution ever to SO :-)
Let's create some data:
data = { 'Source':['Hello all Im Happy','Its a lie, dont trust him','Oops','foo','bar']}
df = pd.DataFrame (data, columns = ['Source'])
My approach is very straightforward and simple, if a little "brute" and inefficient; however, I ran it on a large dataframe (1013952 rows) and the time was fairly acceptable.
Let's find the indices of the data frame where there are fewer than n tokens:
from nltk.tokenize import word_tokenize
def get_indices(df,col,n):
    """
    Get the indices of dataframe rows that have fewer than n tokens in a specific column
    Parameters:
    df(pandas dataframe)
    n(int): threshold value for minimum words
    col(string): column name
    """
    tmp = []
    for i in range(len(df)):#df.iterrows() wasnt working for me
        if len(word_tokenize(df[col][i])) < n:
            tmp.append(i)
    return tmp
Next we just need to call the function and drop the rows at said indices (n=4 drops rows with 3 tokens or fewer):
tmp = get_indices(df, 'Source', 4)
df_clean = df.drop(tmp)
Best!
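df = pd.DataFrame({"mycolumn": ["", " ", "test string", "test string 1", "test string 2 2"]})
df = df.loc[df["mycolumn"].str.count(" ") >= 3]
Avoid looping over a dataframe when a vectorized operation is available. Note that counting spaces assumes words are separated by exactly one space; at least 3 spaces means more than 3 words.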
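Counting spaces can miscount when a string has leading, trailing or repeated spaces. A small sketch that counts whitespace-separated words instead, assuming the Source column from the earlier answer (the extra last row is made up to show the whitespace case):

```python
import pandas as pd

df = pd.DataFrame({"Source": ["Hello all Im Happy", "Its a lie, dont trust him",
                              "Oops", "foo", "bar", "  three  word  row "]})

# str.split() with no arguments splits on any run of whitespace.
word_counts = df["Source"].str.split().str.len()

# Keep only rows with more than 3 words.
df_clean = df[word_counts > 3]
```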
