I have two pandas data frames. The first one contains a list of unigrams extracted from the text, along with the count and probability of each unigram occurring in the text. The structure looks like this:
unigram_df
word count prob
0 we 109 0.003615
1 investigated 20 0.000663
2 the 1125 0.037315
3 potential 36 0.001194
4 of 1122 0.037215
The second one contains a list of skipgrams extracted from the same text, along with the count and probability of the skipgram occurring in the text. It looks like this:
skipgram_df
word count prob
0 (we, investigated) 5 0.000055
1 (we, the) 31 0.000343
2 (we, potential) 2 0.000022
3 (investigated, the) 11 0.000122
4 (investigated, potential) 3 0.000033
Now, I want to calculate the pointwise mutual information for each skipgram, which is basically the log of the skipgram probability divided by the product of its unigrams' probabilities. I wrote a function for that which iterates through the skipgram df, and it works exactly how I want, but I have huge performance issues, so I wanted to ask if there is a way to improve my code to make it calculate the PMI faster.
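For reference, the quantity per skipgram (x, y) is pmi(x, y) = log10( P(x, y) / ( P(x) * P(y) ) ), with the probabilities taken from the prob columns of the two frames above.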
Here's my code:
def calculate_pmi(row):
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0]]['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1]]['prob'])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result

pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))
Performance of the function is currently around 483.18 it/s, which is super slow, as I have hundreds of thousands of skipgrams to iterate through. Any suggestions would be welcome. Thanks.
This is a good question, and a good exercise, for new users of pandas. Use row-wise iteration (df.iterrows / df.itertuples) only as a last resort and, even then, consider alternatives. There are relatively few occasions when it is the right option.
Below is an example of how you can vectorise your calculations.
import pandas as pd
import numpy as np

uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663],
                    ['the', 1125, 0.037315], ['potential', 36, 0.001194],
                    ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])

skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
                     [('we', 'the'), 31, 0.000343],
                     [('we', 'potential'), 2, 0.000022],
                     [('investigated', 'the'), 11, 0.000122],
                     [('investigated', 'potential'), 3, 0.000033]],
                    columns=['word', 'count', 'prob'])

# first split column of tuples in skip
skip[['word1', 'word2']] = skip['word'].apply(pd.Series)

# set index of uni to 'word'
uni = uni.set_index('word')

# merge prob1 & prob2 from uni to skip
skip['prob1'] = skip['word1'].map(uni['prob'].get)
skip['prob2'] = skip['word2'].map(uni['prob'].get)

# perform calculation and filter columns
skip['result'] = np.log(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]
Related
I have two dataframes (A and B). I want to compare the strings in A and find, for each one, whether it matches or is contained in another string in B, then count the number of times each entry of A was matched or contained in B.
Dataframe A
0 "4012, 4065, 4682"
1 "4712, 2339, 5652, 10007"
2 "4618, 8987"
3 "7447, 4615, 4012"
4 "6515"
5 "4065, 2339, 4012"
Dataframe B
0 "6515, 4012, 4618, 8987" <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1 "4065, 5116, 2339, 8757, 4012" <- matches (DF A, Index 5) (4065, 2339, 4012)
2 "1101"
3 "6515" <- matches (DF A, Index 4) (6515)
4 "4012, 4615, 7447" <- matches (DF A, Index 3) (7447, 4615, 4012)
5 "7447, 6515, 4012, 4615" <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)
Desired Output:
Itemset Count
2 4618, 8987 1
3 7447, 4615, 4012 2
4 6515 3
5 4065, 2339, 4012 1
Basically, I want to count when there is a direct match of A in B (either in order or not) or if A is partially contained in B (in order or not). My goal is to count how many times A is being validated by B. These are all strings by the way.
EDIT Need for speed edition:
This is a redo question from my previous post:
Compare two dataframe columns for matching strings or are substrings then count in pandas
I have millions of rows in both dfA and dfB to make these comparisons against.
In my previous post, the following code got the job done:
import pandas as pd
dfA = pd.DataFrame(["4012, 4065, 4682",
"4712, 2339, 5652, 10007",
"4618, 8987",
"7447, 4615, 4012",
"6515",
"4065, 2339, 4012",],
columns=['values'])
dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
"4065, 5116, 2339, 8757, 4012",
"1101",
"6515",
"4012, 4615, 7447",
"7447, 6515, 4012, 4615"],
columns=['values'])
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]
However, with the total amount of rows to check, I am experiencing a performance issue and need another way to check the frequency / counts. It seems like NumPy is needed in this case; the snippet below is about the extent of my NumPy knowledge, as I work primarily in pandas. Does anyone have suggestions to make this faster?
dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()
Give this a try. Your algorithm is O(N²K): the square of the row count times the words per line. The approach below should improve that to O(NK).
from collections import defaultdict
from functools import reduce
d = defaultdict(set)
for i, t in enumerate(dfB['values']):
    for s in t.split(', '):
        d[s].add(i)

dfA['count'] = dfA['values'].apply(
    lambda x: len(reduce(lambda a, b: a.intersection(b),
                         [d[s] for s in x.split(', ')])))
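For readers less familiar with the reduce call, here is the same inverted-index idea spelled out step by step (my own sketch, not part of the answer above; the function name count_containing_rows is mine). d maps each value string to the set of dfB row numbers that contain it, and intersecting those sets gives the dfB rows containing every value of a given dfA row. On the demo frames it reproduces the desired counts shown earlier (index 2 → 1, index 3 → 2, index 4 → 3, index 5 → 1, and 0 for the rest).

# my sketch: same inverted index, written out explicitly
def count_containing_rows(value_str, index=d):
    row_sets = [index[s] for s in value_str.split(', ')]
    return len(set.intersection(*row_sets))

dfA['count'] = dfA['values'].apply(count_containing_rows)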
I have two dataframes, both with two columns. I want to use WMD (Word Mover's Distance) to find the closest match for each entity in column source_label among the entities in column target_label. However, at the end I would like to have a DataFrame with all 4 columns with respect to the entities.
df1
,source_Label,source_uri
'neuronal ceroid lipofuscinosis 8',"http://purl.obolibrary.org/obo/DOID_0110723"
'autosomal dominant distal hereditary motor neuronopathy',"http://purl.obolibrary.org/obo/DOID_0111198"
df2
,target_label,target_uri
'neuronal ceroid ',"http://purl.obolibrary.org/obo/DOID_0110748"
'autosomal dominanthereditary',"http://purl.obolibrary.org/obo/DOID_0111110"
Expected result
,source_label, target_label, source_uri, target_uri, wmd score
'neuronal ceroid lipofuscinosis 8', 'neuronal ceroid ', "http://purl.obolibrary.org/obo/DOID_0110723", "http://purl.obolibrary.org/obo/DOID_0110748", 0.98
'autosomal dominant distal hereditary motor neuronopathy', 'autosomal dominanthereditary', "http://purl.obolibrary.org/obo/DOID_0111198", "http://purl.obolibrary.org/obo/DOID_0111110", 0.65
The dataframes are so big that I am looking for a faster way to iterate over both label columns. So far I tried this:
list_distances = []
temp = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

entity = df1['source_label']
target = df2['target_label']

for i in tqdm(entity):
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    list_distances.append(min(temp))
    # print("list_distances", list_distances)

WMD_Dataframe = pd.DataFrame({'source_label': pd.Series(entity),
                              'target_label': pd.Series(target),
                              'source_uri': df1['source_uri'],
                              'target_uri': df2['target_uri'],
                              'wmd_Score': pd.Series(list_distances)}).sort_values(by=['wmd_Score'])
WMD_Dataframe = WMD_Dataframe.reset_index()
First of all, this code is not working well, as the other two columns come directly from the dfs and do not take the entities' relation with the URIs into consideration.
How can one make it faster, as the entities number in the millions? Thanks in advance.
A quick fix:
closest_neighbour_index_df2 = []

def preprocess(sentence):
    return [w for w in sentence.lower().split()]

for i in tqdm(entity):
    temp = []
    for j in target:
        wmd_distance = model.wmdistance(preprocess(i), preprocess(j))
        temp.append(wmd_distance)
    # maybe assert to make sure it's always right
    closest_neighbour_index_df2.append(np.argmin(np.array(temp)))
    # argmin returns the index rather than the value

# Add the indices from df2 to df1
df1['closest_neighbour'] = closest_neighbour_index_df2
# add information to the respective row from df2 using the closest_neighbour column
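To complete that last comment, here is one hedged sketch of how the stored indices could pull the matching df2 columns over (the variable names matched and result are mine): argmin gives positional indices into df2, so iloc can select the matching rows and line them up next to df1.

# my sketch: use the positional indices to attach df2's columns to df1
matched = df2.iloc[df1['closest_neighbour'].to_numpy()][['target_label', 'target_uri']]
result = pd.concat([df1.reset_index(drop=True),
                    matched.reset_index(drop=True)], axis=1)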
The following scenario is given.
I have 2 dataframes called orders and customers.
I want to look up whether the custId from the orders DataFrame appears in the linkedCustomers column of the customers DataFrame. The linkedCustomers field is a collection of customer IDs.
The orders dataframe contains approximately 5,800,000 rows.
The customers dataframe contains approximately 180,000 rows.
I am looking for a way to optimize the following code, because it runs but is very slow. How can I speed this up?
# demo data -- In the real scenario this data was read from csv-/json files.
orders = pd.DataFrame({'custId': [1, 2, 3, 4], 'orderId': [2, 3, 4, 5]})
customers = pd.DataFrame({'id': [5, 6, 7],
                          'linkedCustomers': [{1, 2}, {4, 5, 6}, {3, 7, 8, 9}]})

def getMergeCustomerID(row):
    customerOrderId = row['custId']
    searchMasterCustomer = customers[customers['linkedCustomers'].str.contains(str(customerOrderId))]
    searchMasterCustomer = searchMasterCustomer['id']
    if len(searchMasterCustomer) > 0:
        return searchMasterCustomer
    else:
        return customerOrderId

orders['newId'] = orders.apply(lambda x: getMergeCustomerID(x), axis=1)
# expected result
custId orderId newId
1 2 5
2 3 5
3 4 7
4 5 6
I think that in some circumstances this approach can solve your problem:
Build a dictionary first,
myDict = {}
for i, j in customers.iterrows():
    for j2 in j[1]:
        myDict[j2] = j[0]
then use the dictionary to create the new column:
orders['newId'] = [myDict[i] for i in orders['custId']]
IMO, even though this can solve your problem (speed up your program), it is not the most generic solution. Better answers are welcome!
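One more generic variant, as a hedged sketch of my own (not the answer above): explode the linkedCustomers sets into one row per (id, linked id) pair, build a lookup Series, and map with a fallback to the original custId. This assumes a pandas version where DataFrame.explode handles set values and that each linked id belongs to at most one customer; the name lookup is mine.

# my sketch: vectorised lookup via explode + map, with a fallback to custId
lookup = customers.explode('linkedCustomers').set_index('linkedCustomers')['id']
orders['newId'] = orders['custId'].map(lookup).fillna(orders['custId']).astype(int)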
I'm working on text analysis and am trying to quantify the value of a sentence as the sum of the values assigned to certain words when they appear in the sentence. I have a DF with words and values such as:
import pandas as pd
df_w = pd.DataFrame({'word': ['high', 'sell', 'hello'],
                     'value': [32, 45, 12]})
Then I have sentences in another DF such as:
df_s = pd.DataFrame({'sentence': ['hello life if good',
                                  'i sell this at a high price',
                                  'i sell or you sell']})
Now, I want to add a column in df_s with the sum of the value of each word in the sentence if the word is in the df_w. To do so, I tried:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(df_w['value'][df_w['word'].isin(x.split(' '))]))
The results is:
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 45
My problem with this approach is that for the last sentence, i sell or you sell, sell appears twice and I was expecting 90 (2*45), but sell is only counted once, so I got 45.
To solve this, I decided to create a dictionary and then do a apply:
dict_w = pd.Series(df_w['value'].values,index=df_w['word']).to_dict()
df_s['value'] = df_s['sentence'].apply(lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w.keys()]))
This time, the result is what I expected (90 for the last sentence). But my problem comes with larger DFs: the method with dict_w takes about 20 times longer than the method with isin for my test case.
Do you know a way to multiply the value of a word by its number of occurrences within the isin method? Any other solution is welcome too.
You can use str.split with stack, filter the result with isin, replace the keywords with their values, then assign it back:
s = df_s.sentence.str.split(' ', expand=True).stack()
df_s['Value'] = s[s.isin(df_w.word)].replace(dict(zip(df_w.word, df_w.value))).sum(level=0)
df_s
Out[984]:
sentence Value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
Create a function with a default value out of a dictionary's get method
dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])
sentence value
0 hello life if good 12
1 i sell this at a high price 77
2 i sell or you sell 90
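One small caveat on the snippet above, as a hedged side note of my own: the lambda re-creates the dict on every call, so for large frames it may be worth building the lookup once (the variable name word_value is mine):

# my tweak: build the lookup dict once instead of inside the lambda
word_value = dict(zip(df_w.word, df_w.value))
dw = lambda x: word_value.get(x, 0)
df_s['value'] = [sum(map(dw, s.split())) for s in df_s.sentence]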
Thanks to the answer of piRSquared with his map function, I had the idea to use merge such as:
df_s['value'] = df_s['sentence'].apply(lambda x: sum(pd.merge(pd.DataFrame({'word':x.split(' ')}),df_w)['value']))
Thanks to the answer of Wen with his stack function, I use his idea but with merge such as:
df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ',expand=True).stack()})
df_s['value'] = df_stack.reset_index().merge(df_w).set_index(['level_0','level_1'])['value'].sum(level=0)
And both methods give me the right answer.
Finally, to test which solution is faster, I define functions such as:
def sol_dict (df_s, df_w): # answer with a dict
dict_w = pd.Series(df_w['value'].values,index=df_w['word']).to_dict()
df_s['value'] = df_s['sentence'].apply(lambda x: sum([dict_w[word] for word in x.split(' ') if word in dict_w.keys()]))
return df_s
def sol_wen(df_s, df_w): # answer of Wen
s=df_s.sentence.str.split(' ',expand=True).stack()
df_s['value']=s[s.isin(df_w.word)].replace(dict(zip(df_w.word,df_w.value))).sum(level=0)
return df_s
def sol_pi (df_s, df_w): # answer of piRSquared
dw = lambda x: dict(zip(df_w.word, df_w.value)).get(x, 0)
df_s.assign(value=[sum(map(dw, s.split())) for s in df_s.sentence])
# or df_s['value'] = [sum(map(dw, s.split())) for s in df_s.sentence]
return df_s
def sol_merge(df_s, df_w): # answer with merge
df_s['value'] = df_s['sentence'].apply(lambda x: sum(pd.merge(pd.DataFrame({'word':x.split(' ')}),df_w)['value']))
return df_s
def sol_stack(df_s, df_w): # answer with stack and merge
df_stack = pd.DataFrame({'word': df_s['sentence'].str.split(' ',expand=True).stack()})
df_s['value'] = df_stack.reset_index().merge(df_w).set_index(['level_0','level_1'])['value'].sum(level=0)
return df_s
My "large" test DFs where composed of around 3200 words in df_w and around 42700 words in df_s (once split all sentences). I run timeit with several size of df_w (from 320 to 3200 words) with the full size of df_s and then with several size of df_s (from 3500 to 42700 words) with the full size of df_w. After curve fitting my results, I got:
To conclude, whatever is the size of both DFs, the method using stack then merge is really efficient (around 100ms, sorry not really visible on graphs). I run it on a my full size DFs with around 54k words in df_w 2.4 millions words in df_s and I got the results in few seconds.
Thanks both for your ideas.
Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22), as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other loops through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to make use of any SQL queries either. I've also included a section to immediately discard any redundant genes that don't have any SNPs falling within their range. This uses a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly, it does run well enough that I can actually get some answers. I'd still like to know if anyone has tips to make it run more efficiently, though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
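A hedged note on memory for the merge-based answer above: with roughly 6.8 million SNPs and 34k genes, an unrestricted merge on chromosome alone can produce a very large intermediate frame. One way to keep the same logic while bounding memory is to apply it per chromosome; the sketch below is my variation on the answer, not its original code (the names chunks and m are mine).

# my sketch: the same merge-and-filter logic, applied per chromosome to keep
# the intermediate cross-join manageable
chunks = []
for chrom, snp_chunk in snp_df.groupby('chromosome'):
    genes_chunk = gene_df[gene_df['chromosome'] == chrom]
    m = snp_chunk.merge(genes_chunk, on='chromosome', how='inner')
    m = m[(m.BP >= m.chr_start) & (m.BP <= m.chr_stop)]
    chunks.append(m[['SNP', 'feature_id']])

all_genic_snps = pd.concat(chunks, ignore_index=True)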