I have a sample dataframe as below
df=pd.DataFrame(np.array([['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],['apple', "vice president"], ['apple', 'swimming contest']]),columns=['firm','text'])
Now I'd like to calculate the degree of text similarity within each firm using word embeddings. For example, the average cosine similarity for facebook would be the average pairwise cosine similarity among rows 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for every row of a given company, since it is a within-firm pairwise comparison.
I wrote the code below:
import numpy as np
import pandas as pd
from itertools import combinations
import gensim
from gensim import utils
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity
# map each word to vector space
def represent(sentence):
    vectors = []
    for word in sentence:
        try:
            vector = model.wv[word]
            vectors.append(vector)
        except KeyError:
            pass
    return np.array(vectors).mean(axis=0)
# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items if word in model_glove.vocab]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
# get average pairwise cosine distance score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.index.tolist(), 2):
        doc_vec = document_vector(grp.iloc[i]['text'])
        if doc_vec is not None and len(doc_vec) > 0:
            sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1, -1),
                                    document_vector(grp.iloc[j]['text']).reshape(1, -1))
            output.append([i, j, sim])
    return np.mean(np.array(output), axis=0)
# save the result to a new column
df['mean_cos_between_items']=df.groupby(['firm']).apply(mean_cos_sim)
However, I got the error below:
Could you kindly help? Thanks!
Note that sklearn.metrics.pairwise.cosine_similarity, when passed a single matrix X, automatically returns the pairwise similarities between all samples in X. I.e., it isn't necessary to manually construct pairs.
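For instance, a tiny illustration with a made-up 3x2 matrix:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])  # three "documents", two dimensions
sims = cosine_similarity(X)  # shape (3, 3); sims[i, j] is the similarity between rows i and j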
Say you construct your average embeddings with something like this (I'm using glove-twitter-25 here),
def mean_embeddings(s):
    """Transfer a list of words into mean embedding"""
    return np.mean([model_glove.get_vector(x) for x in s], axis=0)
df["embeddings"] = df.text.str.split().apply(mean_embeddings)
so df.embeddings comes out as
>>> df.embeddings
0 [-0.2597, -0.153495, -0.5106895, -1.070115, 0....
1 [0.0600965, 0.39806002, -0.45810497, -1.375365...
2 [-0.43819, 0.66232, 0.04611, -0.91103, 0.32231...
3 [0.1912625, 0.0066999793, -0.500785, -0.529915...
4 [-0.82556, 0.24555385, 0.38557374, -0.78941, 0...
Name: embeddings, dtype: object
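For reference, one way to obtain model_glove (the loading step isn't shown above, so this is an assumption on my part) is gensim's downloader API:
import gensim.downloader as api
model_glove = api.load("glove-twitter-25")  # KeyedVectors with 25-dimensional word vectors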
You can get the mean pairwise cosine similarity like so, with the main point being that you can directly apply cosine_similarity to the adequately prepared matrix for each group:
(
    df.groupby("firm").embeddings  # extract 'embeddings' for each group
    .apply(np.stack)               # turns sequence of arrays into proper matrix
    .apply(cosine_similarity)      # the magic: compute pairwise similarity matrix
    .apply(np.mean)                # get the mean
)
which, for the model I used, results in:
firm
apple 0.765953
facebook 0.893262
Name: embeddings, dtype: float32
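One design note: the chain above averages over the full similarity matrix, which includes the diagonal self-similarities of 1.0. If you only want the mean over distinct pairs, a small variant (a sketch on my part, not part of the answer above) could be:
def mean_offdiagonal(sim_matrix):
    """Mean of the pairwise similarities, excluding the diagonal of self-similarities."""
    n = sim_matrix.shape[0]
    if n < 2:
        return np.nan
    return (sim_matrix.sum() - n) / (n * (n - 1))

df.groupby("firm").embeddings.apply(np.stack).apply(cosine_similarity).apply(mean_offdiagonal)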
Remove the .vocab in model_glove.vocab; it is no longer supported in current versions of gensim. Edit: you also need split() so that you iterate over words rather than characters here:
# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items.split() if word in model_glove]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
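A quick usage check, assuming model_glove is a loaded KeyedVectors (e.g. glove-twitter-25 as mentioned above):
vec = document_vector("women tennis")  # mean of the in-vocabulary word vectors, or None if none are known
vec.shape                              # e.g. (25,) for glove-twitter-25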
Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. You also put all values in output, including the indices i and j, so to take an average you would have to specify exactly what you want to average over. Since you don't seem to need i and j, you can put only the resulting sims in the list and then take the list's average:
# get pairwise cosine similarity score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.tolist(), 2):
        if document_vector(i) is not None and len(document_vector(i)) > 0:
            sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
            output.append(sim)
    return np.mean(output, axis=0)
Here you try to add the results as a column, but the number of rows will differ: the result only has one row per firm, while the original DataFrame has one row per text. So you have to create a new DataFrame (which you can optionally merge/join with the original DataFrame on the firm column):
df = pd.DataFrame(np.array(
[['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)
Which overall will give you (Edit: updated):
print(df_grpd)
> firm
apple [[0.53190523]]
facebook [[0.83989316]]
Name: text, dtype: object
Edit:
I just noticed that the reason for the suspiciously high scores is that the original code is missing tokenization; see the changed part above. Without the split(), it just compares character-level similarities, which tend to be very high.
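To see why: iterating over a string yields characters, not words, so without split() the "document" becomes its letters:
>>> [w for w in "women tennis"]
['w', 'o', 'm', 'e', 'n', ' ', 't', 'e', 'n', 'n', 'i', 's']
>>> [w for w in "women tennis".split()]
['women', 'tennis']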
Related
I've written code that takes a number of n-grams, a specific index, and a threshold, and returns the values that fall within that threshold. However, currently it only compares one set of tokens (at a given index) to the sets of tokens at all other indices. I want to compare each set of tokens at every index to every other set of tokens at every other index. I don't think this is a difficult question, but I struggle with for loops in Python a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since the function would iterate over all indices.
Let me know if that isn't clear enough and I will think more about how to phrase it; it is difficult to describe because the thing I am asking about is the thing I am struggling with.
import pandas as pd
import numpy as np
import py_stringmatching as sm

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)

jac = sm.Jaccard()
def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')
predict_label(ph, 1, .5)
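For what it's worth, a minimal sketch of the all-pairs comparison described above, reusing the same tokenizer/Jaccard setup (itertools.combinations and the helper name all_pairs_above are my own additions, not from the original code):
from itertools import combinations

def all_pairs_above(companies, num, thresh):
    qg_tok = sm.QgramTokenizer(qval=num)
    jac = sm.Jaccard()
    tokens = [qg_tok.tokenize(c) for c in companies]
    pairs = []
    for i, j in combinations(range(len(companies)), 2):
        sim = jac.get_sim_score(tokens[i], tokens[j])
        if sim > thresh:
            pairs.append((companies[i], companies[j], sim))
    return pairs

all_pairs_above(ph['companies'].tolist(), num=3, thresh=.5)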
I have a subset of a dataframe that looks like (note, the new_tags are not exhaustively illustrated here):
df = pd.DataFrame({'PageNumber': [175, 162, 576], 'new_tags': [['flower architecture people'], ['hair red bobbles'], ['sweets chocolate shop']]})
<OUT>
PageNumber new_tags
175 flower architecture people...
162 hair red bobbles...
576 sweets chocolate shop...
I am hoping to iterate through each row (also termed a document) and conduct a topic model then extract the top 20 words from each topic into a csv. I am using Gensim.
I have the code that works for conducting the topic model, but I am unsure how to do this by row. The issue I think I am having is that when converting the df into a dictionary it doesn't allow me to subset it for the loop.
Here is my progress at the moment:
First, I want to tokenize and lemmatize the tags.
import spacy
import gensim
from gensim import corpora

nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output

# convert column to list
text_list = df['new_tags'].tolist()
# lemmatisation and tokenisation
tokenized_tags = lemmatization(text_list)
Next, I define a function to conduct a topic model and then write that to the csv.
i = 1

def topic_model(tokenized_tags):
    ''' this function is used to conduct topic modelling for each grid/document '''
    for row in tokenized_tags:
        # convert tokenized lists into dictionary
        dictionary = corpora.Dictionary(row)
        # create document term matrix
        doc_term_matrix = [dictionary.doc2bow(tag) for tag in row]
        # initialise topic model from gensim
        LDA = gensim.models.ldamodel.LdaModel
        # build and train topic model
        lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=40, random_state=100, chunksize=400, passes=50, iterations=100)
        # write top 20 words from each document as csv
        top_words_per_topic = []
        for t in range(lda_model.num_topics):
            top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=20)])
        # return csv - write first row then append subsequent rows
        return pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv", mode='a', index=False, header=False)
        i += 1

topic_model(tokenized_tags)
As a side note, is there a way to work out the optimal parameters e.g. coherence value for each document after running the topic model and somehow adjust the model to take in the best value?
Any help is very much appreciated! Thanks!
UPDATED CODE:
I've updated the function so I'm passing the tokenized version of the df; I want to apply a topic model to each row and append the result to the df as a new column. How can I do this?
tokens = central_edi_posts_grouped['new_tags'].astype(str).apply(nltk.word_tokenize)

def topic_model(central_edi_posts_grouped):
    ''' this function is used to conduct topic modelling for each grid/document '''
    # convert tokenized lists into dictionary
    dictionary = corpora.Dictionary(tokens)
    # create document term matrix
    doc_term_matrix = [dictionary.doc2bow(tag) for tag in tokens]
    # initialise topic model from gensim
    LDA = gensim.models.ldamodel.LdaModel
    # build and train topic model
    lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=8, random_state=100,
                    chunksize=400, passes=50, iterations=100)
    # let's check out the coherence number
    from gensim.models.coherencemodel import CoherenceModel
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    # write top 20 words from each document as csv
    top_words_per_topic = []
    for t in range(lda_model.num_topics):
        top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=20)])
    # return csv - write first row then append subsequent rows
    pd.DataFrame(top_words_per_topic, coherence_lda, columns=['Topic', 'Word', 'P', 'Coherence_value']).to_csv("top_words_loop_test.csv", mode='a', index=False, header=False)
    return coherence_lda

df['new_col'] = df['new_tags'].apply(lambda tokens: topic_model((tokens)))
You can use the apply() function in Pandas to iterate over rows.
df['new_col'] = df['new_tags'].apply(lambda text_list: topic_model(lemmatization(text_list)))
You may have to modify your topic_model() function a bit, so that it returns just the values you need rather than a whole pd.DataFrame.
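For example, a minimal sketch of such a modification, assuming each row has already been tokenized into a flat list of words (the column name tokenized_tags below is hypothetical), might look like:
import gensim
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

def topic_model(row_tokens, num_topics=8):
    """Fit an LDA model on a single row's tokens and return its c_v coherence."""
    dictionary = corpora.Dictionary([row_tokens])
    doc_term_matrix = [dictionary.doc2bow(row_tokens)]
    lda_model = gensim.models.ldamodel.LdaModel(
        corpus=doc_term_matrix, id2word=dictionary,
        num_topics=num_topics, random_state=100, passes=50)
    coherence = CoherenceModel(model=lda_model, texts=[row_tokens],
                               dictionary=dictionary, coherence='c_v').get_coherence()
    return coherence

# assuming the tokenization step has produced a flat list of words per row:
df['new_col'] = df['tokenized_tags'].apply(topic_model)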
I have a list of job titles, the no of unique job titles are around 24,000. Most of the job titles are very similar.
For example:
Software Developer, Software engineer, software engineering, software engineer 2, senior software engineer, junior software engineer,....
I want to find the most similar titles and replace all the similar titles with the most repeated title, to reduce the number of unique values in the column.
For example, in the screenshots, all the variations of aadhar card supervisor will be replaced with aadhar supervisor, since it is repeated most often; all variants of software engineering jobs will be replaced with the software engineer title, since it is repeated most often, and so on.
Please suggest solutions and approaches to achieve this desired result.
Sample Job titles for your reference is here in this repository:
https://github.com/skwolvie/jobprofile_sample
I have also tabulated the similarity scores of a title with every other title in the sample job title dataset. Each title is associated with a value_counts score.
Screenshots:
Here's another try:
There are 5 steps in this solution:
Use pd.Series.value_counts().reset_index() to get only unique titles in descending order of frequency.
Calculate the distances between these unique titles using the Levenshtein distance measure
Find the indices of the words closest to each word by applying a threshold to the Levenshtein distances
Consolidate the nodes of duplicates to avoid repetition (i.e., if ids 1, 2, and 5 are duplicates, we want only one entry for them and not [1, 2, 5], [2, 1, 5], and [5, 1, 2]).
Finally, we consolidate the information into the df.title.value_counts() series and in a dictionary to replace in the original DataFrame.
Code based on the csv file you shared earlier:
# Load required libraries
import pandas as pd
import numpy as np
import Levenshtein
from collections import defaultdict
STEP1: Load data (it is already in the value_counts() desired format)
df = pd.read_csv("https://raw.githubusercontent.com/skwolvie/jobprofile_sample/main/sample_jobprofiles.csv",
index_col=False)
df.columns = ['title', "frequency"]
STEP2: Calculate the distances
def levenshtein_matrix(titles):
    """
    Fill a matrix with the Levenshtein ratio between each word in a list
    of words with each other.
    Since Levenshtein.ratio(w1, w2) == Levenshtein.ratio(w2, w1), we can
    sequentially decrease the length of the inner loop in order to calculate
    the Levenshtein ratio distance only once
    """
    size = len(titles)
    final = np.zeros((size, size))
    for i, w1 in enumerate(titles):
        for j, w2 in enumerate(titles[i:], i):
            lev = Levenshtein.ratio(w1, w2)
            final[i, j] = lev
            final[j, i] = lev
    return final
titles = df.title
lev_matrix = levenshtein_matrix(titles) # 30 seconds to run in my machine with 7k+ items
STEP3: Loop through each row of the lev_matrix to find the ids of similar entries
# Create function
def get_similar_nodes(distance_matrix, threshold=.9):
    """
    Takes a matrix of distances and returns a generator with the entries
    that have a distance measure higher than threshold for each row
    in the matrix.
    """
    for row in distance_matrix:
        yield np.where(row > threshold)[0].tolist()
similar_nodes = get_similar_nodes(lev_matrix)
STEP4: Consolidate all the lists that share at least one item in a single list
def connected_components(lists):
    """
    This function yields a generator with all connected lists inside the given
    list of lists.
    """
    neighbors = defaultdict(set)
    seen = set()
    for each in lists:
        for item in each:
            neighbors[item].update(each)

    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        nodes = set([node])
        next_node = nodes.pop
        while nodes:
            node = next_node()
            see(node)
            nodes |= neighbors[node] - seen
            yield node

    for node in neighbors:
        if node not in seen:
            yield sorted(component(node))
connected_nodes = list(connected_components(similar_nodes))
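A quick sanity check of the consolidation step, with made-up index lists:
>>> list(connected_components([[1, 2, 5], [2, 1, 5], [5, 1, 2], [3, 4]]))
[[1, 2, 5], [3, 4]]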
For updating the values, you create a dictionary mapping all names to the most common name in their group and pass it to the DataFrame.
Note that using nodes[0] as the most common title in the node works because the DataFrame is ordered by descending frequency since we created it using .value_counts().
# Copy the DataFrame for comparison
df_test = df.copy()
dict_most_popular_names = {}
for nodes in connected_nodes:
    dict_most_popular_names |= {key: titles[nodes[0]] for key in titles[nodes]}
# Check the dictionary
titles[connected_nodes[0]][:3]
# >>> 0 'software engineer'
# >>> 20 'software qa engineer'
# >>> 23 'software engineer ii'
# >>> Name: title, dtype: object
dict_most_popular_names["software engineer qa"]
# >>> 'software engineer'
dict_most_popular_names["software engineer"]
# >>> 'software engineer'
dict_most_popular_names["software engineer ii"]
# >>> 'software engineer'
# Update the dataframe
df_test["clean_title"] = [dict_most_popular_names[x] for x in titles]
You can use the dict_most_popular_names to replace in your original dataframe as well.
For me, running this whole script takes 30 seconds, which is pretty much all spent calculating the Levenshtein distances. If you need to optimize further, that is where to look.
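If that step ever becomes the bottleneck, one option (an assumption on my part; the rapidfuzz library provides a vectorised, multithreaded cdist, and its fuzz.ratio is the 0-100 analogue of Levenshtein.ratio) would be something like:
from rapidfuzz import fuzz, process

# computes the full ratio matrix in parallel; divide by 100 to match Levenshtein.ratio's 0-1 scale
lev_matrix = process.cdist(titles.tolist(), titles.tolist(), scorer=fuzz.ratio, workers=-1) / 100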
I have a machine learning problem where I am calculating bigram Jaccard similarity of a pandas dataframe text column with values of a dictionary. Currently I am storing them as a list and then converting them to columns. This is proving to be very slow in production. Is there a more efficient way to do it?
Following are the steps I am currently following:
For each key in dict:
1. Get bigrams for the pandas column and the dict[key]
2. Calculate Jaccard similarity
3. Append to an empty list
4. Store the list in the dataframe
5. Convert the list to columns
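(For reference, the quantity computed in step 2 for each key is the Jaccard similarity of the two bigram sets A and B: J(A, B) = |A ∩ B| / |A ∪ B|.)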
from itertools import tee, islice
import numpy as np

def count_ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

def n_gram_jaccard_similarity(str1, str2, n):
    a = set(count_ngrams(str1.split(), n))
    b = set(count_ngrams(str2.split(), n))
    intersection = a.intersection(b)
    union = a.union(b)
    try:
        return len(intersection) / float(len(union))
    except:
        return np.nan

def jc_list(sample_dict, row, n):
    sim_list = []
    for key in sample_dict:
        sim_list.append(n_gram_jaccard_similarity(sample_dict[key], row["text"], n))
    return str(sim_list)
Using the above functions to build the bigram Jaccard similarity features as follows:
df["bigram_jaccard_similarity"]=df.apply(lambda row: jc_list(sample_dict,row,2),axis=1)
df["bigram_jaccard_similarity"] = df["bigram_jaccard_similarity"].map(lambda x:[float(i) for i in [a for a in [s.replace(',','').replace(']', '').replace('[','') for s in x.split()] if a!='']])
df[[i for i in sample_dict]] = pd.DataFrame(df["bigram_jaccard_similarity"].values.tolist(), index= df.index)
Sample input:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
Expected output:
So, this is more difficult than I thought, due to some broadcasting issues with sparse matrices. Additionally, in the short period of time available I was not able to fully vectorize it.
I added an additional text row to the frame:
df = pd.DataFrame(columns=["id","text"],index=None)
df.loc[0] = ["1","this is a sample text"]
df.loc[1] = ["2","this is a second sample text"]
import collections
sample_dict = collections.defaultdict()
sample_dict["r1"] = "this is sample 1"
sample_dict["r2"] = "is sample"
sample_dict["r3"] = "sample text 2"
We will use the following modules/functions/classes:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import csr_matrix
import numpy as np
and define a CountVectorizer to create character-based n-grams
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="char")
Feel free to choose the n-grams you need. I'd advise using an existing tokenizer and n-gram creator; you should find plenty of those. The CountVectorizer can also be tweaked extensively (e.g., convert to lowercase, strip whitespace, etc.).
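For instance (an alternative configuration, not the one used below), a lowercased word-level bigram vectorizer would be:
word_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer="word", lowercase=True)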
We concatenate all the data:
all_data = np.concatenate((df.text.to_numpy(),np.array(list(sample_dict.values()))))
We do this because our vectorizer needs a common indexing scheme for all the tokens that appear.
Now let's fit the Count vectorizer and transform the data accordingly:
ngrammed = ngram_vectorizer.fit_transform(all_data) >0
ngrammed is now a sparse boolean matrix indicating which tokens appear in each row, rather than their counts as before. You can inspect the ngram_vectorizer to find the mapping from tokens to column ids.
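For example, the fitted vectorizer exposes that mapping directly:
ngram_vectorizer.vocabulary_  # dict mapping each character bigram to its column index in ngrammed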
Next we want to compare every n-grammed entry from the sample dict against every row of our n-grammed text data. We need some magic here:
texts = ngrammed[:len(df)]
samples = ngrammed[len(df):]
text_rows = len(df)

jaccard_similarities = []
for key, ngram_sample in zip(sample_dict.keys(), samples):
    repeated_row_matrix = (csr_matrix(np.ones([text_rows, 1])) * ngram_sample).astype(bool)
    support = texts.maximum(repeated_row_matrix)
    intersection = texts.multiply(repeated_row_matrix).todense()
    jaccard_similarities.append(pd.Series((intersection.sum(axis=1) / support.sum(axis=1)).A1, name=key))
support is the boolean array that measures the union of the n-grams over the two comparands; intersection is only True where a token is present in both. Note that .A1 returns the underlying matrix as a flat 1-D array.
Now
pd.concat(jaccard_similarities, axis=1)
gives
r1 r2 r3
0 0.631579 0.444444 0.500000
1 0.480000 0.333333 0.384615
You can concatenate it to df as well:
pd.concat([df, pd.concat(jaccard_similarities, axis=1)], axis=1)
which gives
id text r1 r2 r3
0 1 this is a sample text 0.631579 0.444444 0.500000
1 2 this is a second sample text 0.480000 0.333333 0.384615
PLEASE READ:
I have looked at all the other answers related to this question and none of them solve my specific problem so please carry on reading below.
I have the code below. What it basically does is keep the Title column and concatenate the rest of the columns into one, in order to be able to create a cosine similarity matrix.
The main point is the recommendations function, which is supposed to take in a Title as input and return the top 10 matches based on that title, but what I get at the end is the "index 0 is out of bounds for axis 0 with size 0" error and I have no idea why.
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('https://query.data.world/s/uikepcpffyo2nhig52xxeevdialfl7')
df = df[['Title','Genre','Director','Actors','Plot']]
df.head()
df['Key_words'] = ""
for index, row in df.iterrows():
    plot = row['Plot']
    # instantiating Rake, by default it uses english stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()
    # extracting the words by passing the text
    r.extract_keywords_from_text(plot)
    # getting the dictionary with key words as keys and their scores as values
    key_words_dict_scores = r.get_word_degrees()
    # assigning the key words to the new column for the corresponding movie
    row['Key_words'] = list(key_words_dict_scores.keys())

# dropping the Plot column
df.drop(columns=['Plot'], inplace=True)

# instantiating and generating the count matrix
df['bag_of_words'] = df[df.columns[1:]].apply(lambda x: ' '.join(x.astype(str)), axis=1)
count = CountVectorizer()
count_matrix = count.fit_transform(df['bag_of_words'])
# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim
indices = pd.Series(df.index)
# defining the function that takes in movie title
# as input and returns the top 10 recommended movies
def recommendations(title, cosine_sim=cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    # gettin the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is ' + idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This line:
idx = indices[indices == title].index[0]
will fail if there is no match:
df.loc[df['Title']=='This is not a valid title'].index[0]
returns:
IndexError: index 0 is out of bounds for axis 0 with size 0
You need to confirm that the title you are passing in is actually in DF before trying to access any data associated with it:
def recommendations(title, cosine_sim=cosine_sim):
    #print(title)
    # initializing the empty list of recommended movies
    recommended_movies = []
    if title not in indices:
        raise KeyError("title is not in indices")
    # gettin the index of the movie that matches the title
    idx = indices[indices == title].index[0]
    print('idx is ' + idx)
    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    # getting the indexes of the 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(df.index)[i])
    return recommended_movies
This expression also seems to be doing nothing:
for index, row in df.iterrows():
    plot = row['Plot']
If you just want a single plot record with which to do some development try:
plot = df['Plot'].sample(n=1)
Finally, it appears that recommendations is using the global variable indices - in general this is bad practice, as if indices changes outside of the scope of recommendations the function might break. I would consider refactoring this to be a little less brittle overall.
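For instance, a rough sketch of a less brittle version that takes its dependencies as arguments (my own sketch, not the original author's code, and assuming the DataFrame keeps its default RangeIndex so positions and labels coincide):
def recommendations(title, df, cosine_sim, top_n=10):
    """Return the top_n titles most similar to `title`, or raise if it is unknown."""
    matches = df.index[df['Title'] == title]
    if len(matches) == 0:
        raise KeyError(f"{title!r} is not in the DataFrame")
    idx = matches[0]
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
    top_indexes = score_series.iloc[1:top_n + 1].index
    return df['Title'].iloc[top_indexes].tolist()

# usage (with a placeholder title):
# recommendations("Some Movie Title", df, cosine_sim)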