Problem with multiprocessing in Python writing into CSV

I created a function which takes a date as an argument and writes the produced output into a CSV file. If I run a multiprocessing Pool with e.g. 28 processes and a list of 100 dates, the last 72 rows of the output CSV are twice as long as they should be (each of those rows is just its own values repeated and joined side by side, as shown in the output below).
My code:
import numpy as np
import pandas as pd
import multiprocessing

# Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is a number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
And the output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after the first 28 rows it looks like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried to insert a lock into funk(), but it yielded the same results and just took longer. Any ideas how to fix it?
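For reference, I know the repeated date,port_ret_1,port_ret_2 header lines come from to_csv(mode='a') writing a header on every call. A minimal sketch of the same writer loop (same funk and list_s as above) that writes the header only once looks like the following; it cleans up the headers but does not by itself explain the duplicated columns:

def mp_handler():
    p = multiprocessing.Pool(28)
    with open('crsp_full.csv', 'w') as f:
        for i, result in enumerate(p.imap(funk, list_s[0:100])):
            # write the column header only for the first chunk
            result.to_csv(f, header=(i == 0))
    p.close()
    p.join()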
Edit: funk looks like this; e is equivalent to date.
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window
        # and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)] * sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret'] * sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)] * sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)] * (1 + sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)] * sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block

Related

Time complexity of appending to a Dict vs a DataFrame

My program currently creates a number of DataFrames with a specific structure. The total number of DataFrames is, for now, 88 (with up to 10k rows each); however, this is just the testing phase with a small amount of data.
This number might grow to several hundred DataFrames, with up to a few 100k rows each.
I'm concerned about scalability. I have implemented two methods to produce the output, which is the concatenation of all these DataFrames. For now, they give approximately the same result; however, which of them will perform better?
Append to a DataFrame:
  Create an empty DataFrame df1 (with the correct structure)
  Loop:
    Create the DataFrame of results
    Append it to df1
  Export to csv
Append to a Dictionary:
  Create an empty Dict
  Loop:
    Create the DataFrame of results
    Append it to the Dict
  Concat all values of the Dict into a DataFrame
  Export to csv
Questions:
- Which of these will perform better as the amount of data grows?
- Does appending to a Dict give a better result than appending to a DataFrame even though there are more steps, or the other way around, or do they perform the same?
Approach 2 is definitely faster. Pandas is quite a heavy library. If the data is large and consumes a lot of memory, you could also consider inserting it into a database such as MySQL instead of keeping it in pandas; that way the data is stored in the database rather than held in memory.
import pandas as pd
from time import time

df = pd.DataFrame(columns=range(100))

# start to test approach1
approach1_start = time()
for i in range(1000):
    data_entry = ['test' for i in range(100)]
    new = pd.DataFrame([data_entry])
    df = pd.concat([df, new])
approach1_end = time()
approach1_time = approach1_end - approach1_start
print(approach1_time)
# 9.54729175567627

# start to test approach2
approach2_start = time()
data_entry_list = []
for i in range(1000):
    data_entry = ['test' for i in range(100)]
    data_entry_list.append(data_entry)
df = pd.DataFrame(data_entry_list)
approach2_end = time()
approach2_time = approach2_end - approach2_start
print(approach2_time)
# 0.021973371505737305
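As a hedged sketch of the database suggestion, here is the same append pattern written against a table on disk with pandas' to_sql; SQLite is used purely so the snippet runs without a server, and with MySQL you would pass an SQLAlchemy engine instead of the sqlite3 connection:

import sqlite3
import pandas as pd

conn = sqlite3.connect('results.db')  # with MySQL: pass an SQLAlchemy engine instead

for i in range(1000):
    data_entry = ['test' for _ in range(100)]
    chunk = pd.DataFrame([data_entry])
    # each chunk is appended to a table on disk instead of growing a DataFrame in memory
    chunk.to_sql('results', conn, if_exists='append', index=False)

conn.close()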
I have done some testing to get an idea. Here is the testing code:
import timeit
import time
import pandas as pd

def timing2(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        print('{:s} : {:.3f} ms'.format(f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing2
def withList():
    lst = []
    for i in range(100):
        df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        lst.append(df)
    df_new = pd.concat(lst)
    return df_new

@timing2
def withDataFrame():
    lst = []
    col_lst = ['A', 'B', 'C']
    df = pd.DataFrame(columns=col_lst)
    for i in range(100):
        df_r = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        df.append(df_r)
    return df

@timing2
def withDict():
    dic = {}
    col_lst = ['A', 'B', 'C']
    df = pd.DataFrame(columns=col_lst)
    for i in range(100):
        df_r = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [90, 53, 64]})
        dic[i] = df_r
    lst_result = [values for values in dic.values()]
    df = df.append(lst_result)
    return df

withList()
withDataFrame()
withDict()
Here are the results:
withList: 76.801 ms
withDataFrame: 101.746 ms
withDict: 57.819 ms

How to make a Python loop faster to run a pairwise association test

I have a list of patient IDs and drug names, and a list of patient IDs and disease names. I want to find the most indicative drug for each disease.
To do this I want to run Fisher's exact test to get the p-value for each disease/drug pair. But the loop runs very slowly, taking more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
My loop:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

for i in disease_list:
    print i
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
You need multiprocessing/multithreading; I have added the code:
from multiprocessing.dummy import Pool as ThreadPool

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

def run_pairwise(i):
    print i
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)
pairwise_test_results = pool.map(run_pairwise, disease_list)
pool.close()
pool.join()
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
Crunching faster is good, but a better algorithm usually beats it ;-)
Filling in a bit,
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# data files can be downloaded at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds = pd.read_csv("meps_meds.csv")            # 8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv")  # 18 cols * 61,489 rows

# merge to get disease and drug info in the same table
merged = pd.merge(                                  # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)
rx_list = meps_meds.rxName.unique().tolist()                # 9218 items
disease_list = meps_base_data.columns.values[8:].tolist()   # 10 items
Note that rx_list has a LOT of duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
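If you want to collapse at least the trivial duplicates before building the crosstab, one option is a quick normalization of the names; this is a hedged sketch that only handles case and whitespace variants, and genuine misspellings would need fuzzy matching:

# normalize the drug names in place; only fixes case/whitespace variants
merged['rxName'] = merged['rxName'].str.strip().str.upper()
rx_list = merged['rxName'].unique().tolist()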
Then
most_indicative = {}

for disease in disease_list:
    # get unique (id, disease, prescription)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    # keep only Yes/No entries under disease
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])
    # bind "No" and "Yes" totals for the Fisher exact function
    nf, yf = tab.sum().values

    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]

    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    # the most-prescribed; get just the 100 most-prescribed drugs
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]

    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)

    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med
This now runs in about 18 minutes on my machine, and you could probably combine it with EM28's answer to speed it up by another factor of 4 or more.
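A hedged sketch of that combination (assuming merged, disease_list, pd and fisher_exact from the code above): wrap the per-disease body in a function and map it over the diseases with a thread pool, as in the other answer.

from multiprocessing.dummy import Pool as ThreadPool

def best_med_for(disease):
    # same steps as the loop body above, for a single disease
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    tab = pd.crosstab(subset.rxName, subset[disease])
    nf, yf = tab.sum().values
    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]
    tab = tab.sort_values("Yes", ascending=False)[:100]
    tab["P-value"] = tab.apply(p_value, axis=1)
    return disease, tab.sort_values("P-value").index[0]

pool = ThreadPool(4)
most_indicative = dict(pool.map(best_med_for, disease_list))
pool.close()
pool.join()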

Pandas DF nested loop to find a value matching a value from loop 1

I'm new to Python/pandas, so please don't judge. :)
I have a DF with stock data (i.e. Date, Close value, ...).
Now I want to see whether a given Close value will hit a target value (e.g. Close + 50 €, Close - 50 €).
I wrote a nested loop that checks every Close value against the following Close values of that day:
def calc_zv(_df, _distance):
    _df['ZV_C'] = 0
    _df['ZV_P'] = 0
    for i in range(0, len(_df)):
        _date = _df.iloc[i].get('Date')
        target_put = _df.iloc[i].get('Close') - _distance
        target_call = _df.iloc[i].get('Close') + _distance
        for x in range(i, len(_df) - 1):
            a = _df.iloc[x + 1].get('Close')
            _date2 = _df.iloc[x + 1].get('Date')
            if target_call <= a and _date == _date2:
                _df.ix[i, 'ZV_C'] = 1
                break
            elif target_put >= a and _date == _date2:
                _df.ix[i, 'ZV_P'] = 1
                break
            elif _date != _date2:
                break
This works fine, but I wonder if there is a better (faster, more pandas-like) solution?
Thanks and best wishes.
M.
EDIT
Hi again, here is a sample data generator:
import numpy as np
import pandas as pd
from PX.indicator_macros import calc_zv
import datetime

abc = datetime.datetime.now()
print(abc)

df2 = pd.DataFrame({'DateTime': pd.Timestamp('20130102'),
                    'Close': pd.Series(np.random.randn(5000))})
# print(df2.to_string())
calc_zv(df2, 2)
# print(df2.to_string())

abc = datetime.datetime.now()
print(abc)
For 5000 rows I need approx. 10 s.
I have stock data for 3 years (at 15-minute intervals), which takes some minutes.
Cheers
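One hedged sketch of a speed-up that keeps the exact same algorithm: pull the Close and Date columns out as NumPy arrays once, so the inner loop avoids the repeated .iloc lookups that dominate the runtime here (this assumes a 'Date' column, as in calc_zv above):

import numpy as np

def calc_zv_arrays(_df, _distance):
    # same logic as calc_zv, but iterating over plain arrays instead of .iloc
    closes = _df['Close'].values
    dates = _df['Date'].values
    zv_c = np.zeros(len(_df), dtype=int)
    zv_p = np.zeros(len(_df), dtype=int)
    for i in range(len(_df)):
        target_put = closes[i] - _distance
        target_call = closes[i] + _distance
        for x in range(i + 1, len(_df)):
            if dates[x] != dates[i]:
                break
            if closes[x] >= target_call:
                zv_c[i] = 1
                break
            if closes[x] <= target_put:
                zv_p[i] = 1
                break
    _df['ZV_C'] = zv_c
    _df['ZV_P'] = zv_p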

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I use a pandas DataFrame to sort each row independently using values in another DataFrame. The function presented there works perfectly every time it is called directly. For example:
import pandas as pd
import numpy as np
import os

## Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}

## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0, 100, 12)
    d3[col+'3'] = np.random.random_integers(0, 100, 12)
    d4[col+'4'] = np.random.random_integers(0, 100, 12)
t_index = pd.date_range(start='2015-01-31', periods=12, freq="M")

# place data into dataframes
dat1 = pd.DataFrame(d1, index=t_index)
dat2 = pd.DataFrame(d2, index=t_index)
dat3 = pd.DataFrame(d3, index=t_index)
dat4 = pd.DataFrame(d4, index=t_index)

## Functions
def sortByAnthr(X, Y, Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x, y) in sorted(zip(Xindex, Y), key=lambda pair: pair[1], reverse=Reverse)]
    return ordrX

def OrderRow(row, df):
    ordrd_row = df.ix[row.dropna().name, row.dropna().values].tolist()
    return ordrd_row

def r_selectr(dat2, dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))  # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)
    return [ordr_cols, ordr_r]

## Call functions
ordr_cols2, ordr_r = r_selectr(dat2, dat1, 5)

## print output:
print("Ordering set:\n", dat2.iloc[-2:, :])
print("Original set:\n", dat1.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem with the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2': dat2, 'dat3': dat3, 'dat4': dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols, _ = r_selectr(data_dicts[key], dat1, 5)
        out_list.append(ordr_cols)
        # do stuff here

# print output:
print("Ordering set:\n", dat3.iloc[-2:, :])
print("Column ordr:\n", ordr_cols2.iloc[-2:, :])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting dataframes.
I currently work around the issue by separating the ordering and indexing operations of r_selectr into two separate functions. For some reason that resolves the issue, though I have no idea why.
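A hedged sketch of what that split might look like (hypothetical names; it simply cuts r_selectr in two, reusing sortByAnthr and OrderRow from above):

def r_order_cols(dat2, dat1, n, Reverse=False):
    # first step: compute only the column ordering
    ordr_cols = dat1.apply(
        lambda x: sortByAnthr(x, dat2.loc[x.name, :], x.index, Reverse), axis=1
    ).iloc[:, -n:]
    ordr_cols.columns = list(range(0, n))
    return ordr_cols

def r_order_rows(ordr_cols, dat1):
    # second step: apply the ordering to dat1 separately
    return ordr_cols.apply(lambda x: OrderRow(x, dat1), axis=1)

# equivalent to r_selectr(dat2, dat1, 5), but in two explicit steps
ordr_cols2 = r_order_cols(dat2, dat1, 5)
ordr_r = r_order_rows(ordr_cols2, dat1)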

Trying to use Deque to limit DataFrame of incoming data... suggestions?

I've imported deque from collections to limit the size of my DataFrame. When new data comes in, the older entries should be progressively dropped over time.
Big picture:
I'm creating a DataFrame of historical values of the previous 26 days from time "whatever day it is...".
Confusion:
I think my data arrives each minute as a Series, so I attempted to restrict its maximum length using deque and then put the data into a DataFrame. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque

def initialize(context):
    context.stocks = (symbol('AAPL'))

def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen=length)
    # note: deque.append() returns None, so this rebinds `data` to None,
    # which is why the DataFrame below ends up full of NaN values
    data = d.append(data)
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print df
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values of the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard-coded timezone if you are not in France. :-)
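A hypothetical usage sketch of the first function above (data and names are invented for illustration): feed it one new minute bar at a time and keep only whatever window it returns.

import numpy as np
import pandas as pd

rolling = pd.DataFrame({'price': []}, index=pd.DatetimeIndex([], tz='Europe/Paris'))
now = pd.Timestamp.now(tz='Europe/Paris')
for k in range(10):  # pretend ten one-minute bars arrive
    new_bar = pd.DataFrame(
        {'price': [np.random.randn()]},
        index=[now + pd.Timedelta(minutes=k)],
    )
    rolling = select_values_of_the_last_twenty_six_days(rolling, new_bar)
print(rolling)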
