How to make a Python loop faster to run a pairwise association test - python

I have a list of patient IDs and drug names, and a list of patient IDs and disease names. I want to find the most indicative drug for each disease.
To find this I run a Fisher exact test to get the p-value for each disease/drug pair, but the loop runs very slowly, taking more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
My loop:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]
for i in disease_list:
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

You need multiprocessing/multithreading; I have added the code:
from multiprocessing.dummy import Pool as ThreadPool

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

def run_pairwise(i):
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)
pairwise_test_results = pool.map(run_pairwise, disease_list)
pool.close()
pool.join()
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
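A small tweak to that sketch (untested) is to have run_pairwise return its result, so that pool.map actually collects the (disease, drug) pairs instead of relying on mutation of a shared dict:

def run_pairwise(i):
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        tab = pd.crosstab(subset[i], subset[j])
        rx_dict[j] = fisher_exact(tab)[1] if len(tab.columns) == 2 else np.nan
    # return the disease together with its lowest-p-value drug
    return i, min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)
most_indicative_medication = dict(pool.map(run_pairwise, disease_list))
pool.close()
pool.join()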

Crunching faster is good, but a better algorithm usually beats it ;-)
Filling in a bit,
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact
# data files can be download at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds = pd.read_csv("meps_meds.csv") # 8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv") # 18 columns * 61,489 rows
# merge to get disease and drug info in same table
merged = pd.merge(  # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)
rx_list = meps_meds.rxName.unique().tolist() # 9218 items
disease_list = meps_base_data.columns.values[8:].tolist() # 10 items
Note that rx_list has a LOT of duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
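If you want to collapse at least the capitalization and whitespace variants first (this won't catch genuine misspellings), one small sketch is:

# normalize drug names before taking unique values (rough cleanup only)
meps_meds['rxName'] = meps_meds['rxName'].str.strip().str.title()
rx_list = meps_meds.rxName.unique().tolist()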
Then
most_indicative = {}

for disease in disease_list:
    # get unique (id, disease, prescription)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    # keep only Yes/No entries under disease
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])
    # bind "No" values for Fisher exact function
    nf, yf = tab.sum().values

    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]

    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    # the most-prescribed; get just the 100 most-prescribed drugs
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]
    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)
    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med
This now runs in about 18 minutes on my machine, and you could probably combine it with EM28's answer to speed it up by another factor of 4 or more.
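For example, an untested sketch of that combination wraps the per-disease work in a function (best_med_for is my name, and the pool size of 4 is arbitrary) and maps it over disease_list with the ThreadPool from the earlier answer:

from multiprocessing.dummy import Pool as ThreadPool

def best_med_for(disease):
    # same steps as the loop above, for a single disease
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    tab = pd.crosstab(subset.rxName, subset[disease])
    nf, yf = tab.sum().values
    tab = tab.sort_values("Yes", ascending=False)[:100]
    pvals = tab.apply(lambda x: fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1], axis=1)
    return disease, pvals.idxmin()

pool = ThreadPool(4)  # pool size is an arbitrary choice here
most_indicative = dict(pool.map(best_med_for, disease_list))
pool.close()
pool.join()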

Related

Going from code that compares one value to all other values, to code that compares all values to all other values

I've written code that takes an n-gram size, a specific index, and a threshold, and returns the values that fall within that threshold. However, it currently only compares the set of tokens at a given index to the set of tokens at every other index. I want to compare the set of tokens at every index to the set of tokens at every other index. I don't think this is a difficult question, but Python is my main language and I struggle with for loops a bit.
So essentially, the variable token in the function should iterate over each string in the column and be compared with comp_token, and the index argument would be removed, since the function would iterate over all indices.
Let me know if that isn't clear enough and I will think more about how to phrase it; it is difficult to describe because the thing I am asking about is the thing I am struggling with.
import pandas as pd
import numpy as np
import py_stringmatching as sm

data = ['Time', "NY Times", 'Atlantic']
ph = pd.DataFrame(data, columns=['companies'])
ph.reset_index(inplace=True)

jac = sm.Jaccard()

def predict_label(num, index, thresh):
    qg_num_tok = sm.QgramTokenizer(qval=num)
    companies = ph.companies.to_list()
    ids = ph['index']
    companies_qg_num_token = {}
    companies_id2index = {}
    for i in range(len(companies)):
        companies_id2index[i] = companies[i]
        companies_qg_num_token[i] = qg_num_tok.tokenize(companies[i])
    predicted_label = [1]
    token = companies_qg_num_token[index]  # index you want: to get all the tokens
    for comp_name in ids[1:]:
        comp_token = companies_qg_num_token[comp_name]
        sim = jac.get_sim_score(token, comp_token)
        if sim > thresh:
            predicted_label.append(1)
        else:
            predicted_label.append(0)
    # companies_id2index must be equal to token number
    ph.loc[ph['index'] != companies_id2index[index], 'label'] = 0  # if not equal to index
    ph['prediction'] = predicted_label
    print(token)
    print(companies_id2index[index])
    return ph.query('prediction==1')

predict_label(ph, 1, .5)
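For the all-against-all comparison the question asks for, a rough sketch (reusing the tokenizer and Jaccard scorer set up above; the function name and the q-gram size of 3 in the example call are my own choices) is to tokenize every company once and then loop over all index pairs with itertools.combinations:

from itertools import combinations

def all_pairs_above(num, thresh):
    qg_tok = sm.QgramTokenizer(qval=num)
    # tokenize each company name exactly once
    tokens = {i: qg_tok.tokenize(name) for i, name in enumerate(ph['companies'])}
    matches = []
    for i, j in combinations(tokens, 2):
        sim = jac.get_sim_score(tokens[i], tokens[j])
        if sim > thresh:
            matches.append((i, j, sim))
    return pd.DataFrame(matches, columns=['index_a', 'index_b', 'similarity'])

print(all_pairs_above(3, .5))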

How to convert the values of a column to rows - dataframe groupby

I have a table of "Borrower Personal ID" and "Loan ID".
BwrPersonId    LoanId
113225         16330
113225         27073
113225         68842
113253         16341
113269         16348
113285         16354
113289         26768
113297         16360
113299         16361
113319         16369
113418         16403
113418         26854
I'm trying to find out which loans belong to the same borrower, so I grouped by "BwrPersonId" and "LoanId".
What I'm expecting is one row per BwrPersonId, with that borrower's loans spread across columns Loan1, Loan2, Loan3, and so on.
Here is my code, but it doesn't work.
grouped = pd.DataFrame()
unique = loan['BwrPersonId'].unique()
grouped['BwrPersonId'] = ''*len(loan['BwrPersonId'].unique())
grouped['Loan1'] = ''
grouped['Loan2'] = ''
grouped['Loan3'] = ''
grouped['Loan4'] = ''
grouped['Loan5'] = ''
grouped.iloc[:,0] = unique
for i in grouped.index:
    idloan = loan.loc[loan['BwrPersonId'] == unique[i], 'LoanId']
    grouped.iloc[i, 1:len(idloan)+1] = idloan
    print(i)
How can I do this? And is there any other way that would simplify the code? Thanks a lot for your help.
Basically, you need to factorize the borrower IDs, use a running counter to work out each loan's position within its borrower, and then scatter the loan IDs into an empty array whose columns become Loan1, Loan2, and so on.
import pandas as pd
import numpy as np
from collections import defaultdict
from itertools import count

counters = defaultdict(count)
codes, uniques = pd.factorize(loan['BwrPersonId'])
# 0-based position of each loan within its borrower
position = np.array([next(counters[c]) for c in codes])
n_rows, n_cols = len(uniques), position.max() + 1
temp = np.empty((n_rows, n_cols), dtype=object)
temp[codes, position] = loan['LoanId']
df1 = pd.DataFrame(uniques, columns=['BwrPersonId'])
df2 = pd.DataFrame(temp, columns=['Loan{}'.format(i + 1) for i in range(n_cols)])
final = df1.join(df2)
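A shorter alternative (a sketch, assuming the source table is called loan as in the question) numbers each borrower's loans with cumcount and then pivots:

# number the loans within each borrower, pivot them into columns, and prefix the column names
out = (loan.assign(n=loan.groupby('BwrPersonId').cumcount() + 1)
           .pivot(index='BwrPersonId', columns='n', values='LoanId')
           .add_prefix('Loan')
           .reset_index())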

Why does performance decrease with the size of the data frame?

I'm working with a relatively large dataset (approx 5m observations, made up of about 5.5k firms).
I needed to run OLS regressions with a 60-month rolling window for each firm. I noticed that performance was insanely slow when I ran the following code:
for idx, sub_df in master_df.groupby("firm_id"):
    # OLS code
However, when I first split my dataframe into about 5.5k dfs and then iterated over each of the dfs, the performance improved dramatically.
grouped_df = master_df.groupby("firm_id")
df_list = [group for group in grouped_df]
for df in df_list:
    my_df = df[1]
    # OLS code
I'm talking 1-2 weeks of runtime (24/7) for the first version, compared to 8-9 hours tops for the second.
Can anyone please explain why splitting the master df into N smaller dfs and then iterating over each smaller df performs better than iterating over the same number of groups within the master df?
Thanks ever so much!
I'm unable to reproduce your observation. Here's some code that generates data and then times the direct and indirect methods separately. The time taken is very similar in either case.
Is it possible that you accidentally sorted the dataframe by the group key between the runs? Sorting by group key results in a noticeable difference in run time.
Otherwise, I'm beginning to think that there might be some other differences in your code. It would be great if you could post the full code.
import numpy as np
import pandas as pd
from datetime import datetime

def generate_data():
    ''' returns a Pandas DF with columns 'firm_id' and 'score' '''
    # configuration
    np.random.seed(22)
    num_groups = 50000       # number of distinct groups in the DF
    mean_group_length = 200  # how many records per group?
    cov_group_length = 0.10  # throw in some variability in the num records per group

    # simulate group lengths
    stdv_group_length = mean_group_length * cov_group_length
    group_lengths = np.random.normal(
        loc=mean_group_length,
        scale=stdv_group_length,
        size=(num_groups,)).astype(int)
    group_lengths[group_lengths <= 0] = mean_group_length

    # final length of DF
    total_length = sum(group_lengths)

    # compute entries for group key column
    firm_id_list = []
    for i, l in enumerate(group_lengths):
        firm_id_list.extend([(i + 1)] * l)

    # construct the DF; data column is 'score' populated with Numpy's U[0, 1)
    result_df = pd.DataFrame(data={
        'firm_id': firm_id_list,
        'score': np.random.rand(total_length)
    })

    # Optionally, shuffle or sort the DF by group keys
    # ALTERNATIVE 1: (badly) unsorted df
    result_df = result_df.sample(frac=1, random_state=13).reset_index(drop=True)
    # ALTERNATIVE 2: sort by group key
    # result_df.sort_values(by='firm_id', inplace=True)
    return result_df

def time_method(df, method):
    ''' time 'method' with 'df' as its argument '''
    t_start = datetime.now()
    method(df)
    t_final = datetime.now()
    delta_t = t_final - t_start
    print(f"Method '{method.__name__}' took {delta_t}.")
    return

def process_direct(df):
    ''' direct for-loop over groupby object '''
    for group, df in df.groupby('firm_id'):
        m = df.score.mean()
        s = df.score.std()
    return

def process_indirect(df):
    ''' indirect method: generate groups first as list and then loop over list '''
    grouped_df = df.groupby('firm_id')
    group_list = [pair for pair in grouped_df]
    for pair in group_list:
        m = pair[1].score.mean()
        s = pair[1].score.std()

df = generate_data()
time_method(df, process_direct)
time_method(df, process_indirect)

Problem with multiprocessing in Python writing into csv

I created a function which takes a date as an argument and writes the produced output into a csv file. If I run a multiprocessing Pool with e.g. 28 processes and a list of 100 dates, then the last 72 rows in the output csv file are twice as long as they should be (each one is just a joined repetition of itself).
My code:
import numpy as np
import pandas as pd
import multiprocessing

# Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is the number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
And the output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after first 28 rows like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried to insert a lock() into funk(), but it yielded the same results and just took longer. Any ideas how to fix this?
Edit: funk looks like this; e is equivalent to date.
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)] * sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret'] * sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)] * sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)] * (1 + sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)] * sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
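One guess, based on the doubled port_ret_* column names in the output: list_names looks like a module-level list, so each worker process keeps extending it for every date it handles, and from each process's second task onward the groupby selects duplicated columns. Building the list locally inside funk (a sketch, untested) would rule that out:

# inside funk, instead of list_names.extend(q) on a shared list:
list_names = ['port_ret_{}'.format(k) for k in range(1, 6)]
block = sample.groupby('date')[list_names].sum().copy()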

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I utilize a Pandas Dataframe to sort each row independently using values in another Pandas Dataframe. The function presented there works perfectly every single time it is directly called. For example:
import pandas as pd
import numpy as np
import os

## Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}

## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0, 100, 12)
    d3[col+'3'] = np.random.random_integers(0, 100, 12)
    d4[col+'4'] = np.random.random_integers(0, 100, 12)
t_index = pd.date_range(start='2015-01-31', periods=12, freq="M")

# place data into dataframes
dat1 = pd.DataFrame(d1, index=t_index)
dat2 = pd.DataFrame(d2, index=t_index)
dat3 = pd.DataFrame(d3, index=t_index)
dat4 = pd.DataFrame(d4, index=t_index)

## Functions
def sortByAnthr(X, Y, Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x, y) in sorted(zip(Xindex, Y), key=lambda pair: pair[1], reverse=Reverse)]
    return(ordrX)

def OrderRow(row, df):
    ordrd_row = df.ix[row.dropna().name, row.dropna().values].tolist()
    return(ordrd_row)

def r_selectr(dat2, dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))  # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)
    return([ordr_cols, ordr_r])

## Call functions
ordr_cols2, ordr_r = r_selectr(dat2, dat1, 5)

## print output:
print("Ordering set:\n", dat2.iloc[-2:, :])
print("Original set:\n", dat1.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem using the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2': dat2, 'dat3': dat3, 'dat4': dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols, _ = r_selectr(data_dicts[key], dat1, 5)
        out_list.append(ordr_cols)
        # do stuff here

# print output:
print("Ordering set:\n", dat3.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting data frames.
I currently work around the issue by separating the ordering and indexing operations in r_selectr into two separate functions. For some reason that resolves the issue, though I have no idea why. A sketch of that split follows.
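For reference, a minimal sketch of that split (my reconstruction of the workaround, not the actual code; it just divides the body of r_selectr in two):

def order_columns(dat2, dat1, n, Reverse=False):
    # step 1: compute the column ordering only
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))
    return ordr_cols

def reindex_rows(ordr_cols, dat1):
    # step 2: pull the ordered values out of dat1
    return ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)

ordr_cols2 = order_columns(dat2, dat1, 5)
ordr_r = reindex_rows(ordr_cols2, dat1)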
