Python - For loop millions of rows - python

I have a dataframe c with lots of different columns. Also, arr is a dataframe that corresponds to a subset of c: arr = c[c['A_D'] == 'A'].
The main idea of my code is to iterate over all rows in the c-dataframe and search for all the possible cases (in the arr dataframe) where some specific conditions should happen:
It is only necessary to iterate over rows were c['A_D'] == D and c['Already_linked'] == 0
The hour in the arr dataframe must be less than the hour_aux in the c dataframe
The column Already_linked of the arr dataframe must be zero: arr.Already_linked == 0
The Terminal and the Operator needs to be the same in the c and arr dataframe
Right now, the conditions are stored using both Boolean indexing and groupby get_group:
Groupby the arr dataframe in order to choose the same Operator and Terminal: g = groups.get_group((row.Operator, row.Terminal))
Choose only the arrivals where the hour is smaller than the hour in the c dataframe and where Already_linked==0: vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
For each of the rows in the c dataframe that verify all conditions, a vb dataframe is created. Naturally, this dataframe has different lengths in each iteration. After creating the vb dataframe, my goal is to choose the index of the vb dataframe that minimises the time between vb.START and c[x]. The FightID that corresponds to this index is then stored in the c dataframe on column a. Additionally, since the arrival was linked to a departure, the column Already_linked in the arr dataframe is changed from 0 to 1.
It is important to notice that the column Already_linked of the arr dataframe may change in every iteration (and arr.Already_linked == 0 is one of the conditions to create the vb dataframe). Therefore, it is not possible to parallelize this code.
I have already used c.itertuples() for efficiency, however since c has millions of rows, this code is still too time consuming.
Other option would also be to use pd.apply to every row. Nonetheless, this is not really straightforward since in each loop there are values that change in both c and arr (also, I believe that even with pd.apply it would be extremely slow).
Is there any possible way to convert this for loop in a vectorized solution (or decrease the running time by 10X(if possible even more) )?
Initial dataframe:
START END A_D Operator FlightID Terminal TROUND_ID tot
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR QR001 4 QR002 70
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL001 3 " " 84
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL001 3 " " 78
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS VS001 3 " " 45
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL DL401 3 " " 9
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL DL401 3 " " 19
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL DL401 3 " " 3
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL DL401 3 " " 32
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL DL401 3 " " 95
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL DL402 3 " " 58
Desired Output (some of the columns were excluded in the dataframe below. Only the a and Already_linked columns are relevant):
START END A_D Operator a Already_linked
0 2017-03-26 16:55:00 2017-10-28 16:55:00 A QR 0 1
1 2017-03-26 09:30:00 2017-06-11 09:30:00 D DL DL402 1
2 2017-03-27 09:30:00 2017-10-28 09:30:00 D DL DL401 1
3 2017-10-08 15:15:00 2017-10-22 15:15:00 D VS No_link_found 0
4 2017-03-26 06:50:00 2017-06-11 06:50:00 A DL 0 0
5 2017-03-27 06:50:00 2017-10-28 06:50:00 A DL 0 1
6 2017-03-29 06:50:00 2017-04-19 06:50:00 A DL 0 0
7 2017-05-03 06:50:00 2017-10-25 06:50:00 A DL 0 0
8 2017-06-25 06:50:00 2017-10-22 06:50:00 A DL 0 0
9 2017-03-26 07:45:00 2017-10-28 07:45:00 A DL 0 1
Code:
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == "D") & (c.Already_linked == 0)].itertuples():
try:
g = groups.get_group((row.Operator, row.Terminal))
vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
aux = (vb.START - row.x).abs().idxmin()
c.loc[row.Index, 'a'] = vb.loc[aux].FlightID
arr.loc[aux, 'Already_linked'] = 1
continue
except:
continue
c['Already_linked'] = np.where((c.a != 0) & (c.a != 'No_link_found') & (c.A_D == 'D'), 1, c['Already_linked'])
c.Already_linked.loc[arr.Already_linked.index] = arr.Already_linked
c['a'] = np.where((c.Already_linked == 0) & (c.A_D == 'D'),'No_link_found',c['a'])
Code for the initial c dataframe:
import numpy as np
import pandas as pd
import io
s = '''
A_D Operator FlightID Terminal TROUND_ID tot
A QR QR001 4 QR002 70
D DL DL001 3 " " 84
D DL DL001 3 " " 78
D VS VS001 3 " " 45
A DL DL401 3 " " 9
A DL DL401 3 " " 19
A DL DL401 3 " " 3
A DL DL401 3 " " 32
A DL DL401 3 " " 95
A DL DL402 3 " " 58
'''
data_aux = pd.read_table(io.StringIO(s), delim_whitespace=True)
data_aux.Terminal = data_aux.Terminal.astype(str)
data_aux.tot= data_aux.tot.astype(str)
d = {'START': ['2017-03-26 16:55:00', '2017-03-26 09:30:00','2017-03-27 09:30:00','2017-10-08 15:15:00',
'2017-03-26 06:50:00','2017-03-27 06:50:00','2017-03-29 06:50:00','2017-05-03 06:50:00',
'2017-06-25 06:50:00','2017-03-26 07:45:00'], 'END': ['2017-10-28 16:55:00' ,'2017-06-11 09:30:00' ,
'2017-10-28 09:30:00' ,'2017-10-22 15:15:00','2017-06-11 06:50:00' ,'2017-10-28 06:50:00',
'2017-04-19 06:50:00' ,'2017-10-25 06:50:00','2017-10-22 06:50:00' ,'2017-10-28 07:45:00']}
aux_df = pd.DataFrame(data=d)
aux_df.START = pd.to_datetime(aux_df.START)
aux_df.END = pd.to_datetime(aux_df.END)
c = pd.concat([aux_df, data_aux], axis = 1)
c['A_D'] = c['A_D'].astype(str)
c['Operator'] = c['Operator'].astype(str)
c['Terminal'] = c['Terminal'].astype(str)
c['hour'] = pd.to_datetime(c['START'], format='%H:%M').dt.time
c['hour_aux'] = pd.to_datetime(c['START'] - pd.Timedelta(15, unit='m'),
format='%H:%M').dt.time
c['start_day'] = c['START'].astype(str).str[0:10]
c['end_day'] = c['END'].astype(str).str[0:10]
c['x'] = c.START - pd.to_timedelta(c.tot.astype(int), unit='m')
c["a"] = 0
c["Already_linked"] = np.where(c.TROUND_ID != " ", 1 ,0)
arr = c[c['A_D'] == 'A']

While this is not a vecterized solution, it should speed things up rapidly if your sample data set mimics your true data set. Currently, you are wasting time looping over every row, but you only care about looping over rows where ['A_D'] == 'D' and ['Already_linked'] ==0. Instead remove the if's and loop over the truncated dataframe which is only 30% of the initial dataframe
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
vb = arr[(arr.Already_linked == 0) & (arr.hour < row.hour_aux)].copy().query(row.query_string)
try:
aux = (vb.START - row.x).abs().idxmin()
print(row.x)
c.loc[row.Index, 'a'] = vb.loc[aux,'FlightID']
arr.loc[aux, 'Already_linked'] = 1
continue
except:
continue

Your question was if there is a way to vectorize the for loop, but I think that question hides what you really want which is an easy way to speed your code up. For performance questions, a good starting point is always profiling. However, I have a strong suspicion that the dominant operation in your code is .query(row.query_string). Running that for every row is expensive if arr is large.
For arbitrary queries, that runtime can't really be improved at all without removing dependencies between iterations and parallelizing the expensive step. You might be a bit luckier though. Your query string always checks two different columns to see if they're equal to something you care about. However, for each row that requires going through your entire slice of arr. Since the slice changes each time, that could cause problems, but here are some ideas:
Since you're slicing arr each time anyway, maintain a view of just the arr.Already_Linked==0 rows so you're iterating over a smaller object.
Better yet, before you do any looping you should first group arr by Terminal and Operator. Then, instead of running through all of arr, first select the group you want and then do your slicing and filtering. This would require rethinking the exact implementation of query_string a little bit, but the advantage is that if you have a lot of terminals and operators you'll typically be working over a much smaller object than arr. Moreover, you wouldn't even have to query that object since that was implicitly done by the groupby.
Depending on how aux.hour typically relates to row.hour_aux, you might have improvements by sorting aux at the beginning with respect to hour. Just using the inequality operator you probably wouldn't see any gains, but you could pair that with a logarithmic search for the cutoff point and then just slice up to that cutoff point.
And so on. Again, I suspect any way of restructuring the query you're doing on all of arr for every row will offer substantially more gains than just switching frameworks or vectorizing bits and pieces.
Expanding on some of those points a little bit and adapting #DJK's code a bit, look at what happens when we have the following changes.
groups = arr.groupby(['Operator', 'Terminal'])
for row in c[(c.A_D == 'D') & (c.Already_linked == 0)].itertuples():
g = groups.get_group((row.Operator, row.Terminal))
vb = g[(g.Already_linked==0) & (g.hour<row.hour_aux)]
try:
aux = (vb.START - row.x).abs().idxmin()
print(row.x)
c.loc[row.Index, 'a'] = vb.loc[aux,'FlightID']
g.loc[aux, 'Already_linked'] = 1
continue
except:
continue
Part of the reason your query is so slow is because it's searching over all of arr each time. In contrast, the .groupby() executes in roughly the same time as one query, but then for every subsequent iteration you can just use .get_group() to efficiently find the tiny subset of the data you care about.
A helpful (extremely crude) rule of thumb when benchmarking is that a billion things takes a second. If you're seeing much longer times than that for something measured in millions of things, like your millions of rows, that means that for each of those rows you're doing tons of things to get up to billions of operations. That leaves a ton of potential for better algorithms to reduce the number of operations, whereas vectorization really only yields constant factor improvements (and for many string/query operations not even great improvements at that).

This solution uses pd.DataFrame.isin which uses numpy.in1d
Apparently 'isin' isn't necessarily faster for small datasets (like this sample), but is significantly faster for large datasets. You'll have to run it against your data to determine performance.
flight_record_linkage.ipynb
Expanded the dataset using c = pd.concat([c] * 10000, ignore_index=True)
Increase the dataset length by 3 orders of magnitude (10000 rows total).
Original method: Wall time: 8.98s
New method: Wall time: 16.4s
Increase the dataset length by 4 orders of magnitude (100000 rows total).
Original method: Wall time: 8min 17s
New method: Wall time: 1min 14s
Increase the dataset length by 5 orders of magnitude (1000000 rows total).
New method: Wall time: 11min 33s
New Method: Using isin and apply
def apply_do_g(it_row):
"""
This is your function, but using isin and apply
"""
keep = {'Operator': [it_row.Operator], 'Terminal': [it_row.Terminal]} # dict for isin combined mask
holder1 = arr[list(keep)].isin(keep).all(axis=1) # create boolean mask
holder2 = arr.Already_linked.isin([0]) # create boolean mask
holder3 = arr.hour < it_row.hour_aux # create boolean mask
holder = holder1 & holder2 & holder3 # combine the masks
holder = arr.loc[holder]
if not holder.empty:
aux = np.absolute(holder.START - it_row.x).idxmin()
c.loc[it_row.name, 'a'] = holder.loc[aux].FlightID # use with apply 'it_row.name'
arr.loc[aux, 'Already_linked'] = 1
def new_way_2():
keep = {'A_D': ['D'], 'Already_linked': [0]}
df_test = c[c[list(keep)].isin(keep).all(axis=1)].copy() # returns the resultant df
df_test.apply(lambda row: apply_do_g(row), axis=1) # g is multiple DataFrames"
#call the function
new_way_2()

Your problem looks like one of the most common problems in database operation. I do not fully understand what you want to get because you have not formulated the task. Now to the possible solution - avoid loops at all.
You have a very long table with columns time, FlightID, Operator, Terminal, A_D. Other columns and dates do not matter if I understand you correctly. Also start_time and end_time are the same in every row. By the way you may get time column with the code table.loc[:, 'time'] = table.loc[:, 'START'].dt.time.
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal']). And your table gonna become significantly shorter.
Split table into table_arr and table_dep according to A_D value: table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']], table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]
Seems like all you tried to get with loops you may get with a single line: table_result = table_arr.merge(table_dep, how='right', on=['Operator', 'Terminal'], suffixes=('_arr', '_dep')). It is basically the same operation as JOIN in SQL.
According to my understanding of your problem and having the tiny piece of data you have provided you get just the desired output (correspondence between FlightID_dep and FlightID_arr for all FlightID_dep values) without any loop so much faster. table_result is:
FlightID_arr Operator Terminal time_arr FlightID_dep time_dep
0 DL401 DL 3 06:50:00 DL001 09:30:00
1 DL402 DL 3 07:45:00 DL001 09:30:00
2 NaN VS 3 NaN VS001 15:15:00
Of course, in general case (with actual data) you will need one more step - filter table_result on condition time_arr < time_dep or any other condition you have. Unfortunately the data you have provided is not enough to fully solve your problem.
Complete code is:
import io
import pandas as pd
data = '''
START,END,A_D,Operator,FlightID,Terminal,TROUND_ID,tot
2017-03-26 16:55:00,2017-10-28 16:55:00,A,QR,QR001,4,QR002,70
2017-03-26 09:30:00,2017-06-11 09:30:00,D,DL,DL001,3,,84
2017-03-27 09:30:00,2017-10-28 09:30:00,D,DL,DL001,3,,78
2017-10-08 15:15:00,2017-10-22 15:15:00,D,VS,VS001,3,,45
2017-03-26 06:50:00,2017-06-11 06:50:00,A,DL,DL401,3,,9
2017-03-27 06:50:00,2017-10-28 06:50:00,A,DL,DL401,3,,19
2017-03-29 06:50:00,2017-04-19 06:50:00,A,DL,DL401,3,,3
2017-05-03 06:50:00,2017-10-25 06:50:00,A,DL,DL401,3,,32
2017-06-25 06:50:00,2017-10-22 06:50:00,A,DL,DL401,3,,95
2017-03-26 07:45:00,2017-10-28 07:45:00,A,DL,DL402,3,,58
'''
table = pd.read_csv(io.StringIO(data), parse_dates=[0, 1])
table.loc[:, 'time'] = table.loc[:, 'START'].dt.time
table = table.drop_duplicates(subset=['time', 'FlightID', 'Operator', 'Terminal'])
table_arr = table.loc[table.loc[:, 'A_D'] == 'A', ['FlightID', 'Operator', 'Terminal', 'time']]
table_dep = table.loc[table.loc[:, 'A_D'] == 'D', ['FlightID', 'Operator', 'Terminal', 'time']]
table_result = table_arr.merge(
table_dep,
how='right',
on=['Operator', 'Terminal'],
suffixes=('_arr', '_dep'))
print(table_result)

Related

Date Difference based on matching values in two columns - Pandas

I have a dataframe, I am struggling to create a column based out of other columns, I will share the problem for a sample data.
Date Target1 Close
0 2018-05-25 198.0090 188.580002
1 2018-05-25 197.6835 188.580002
2 2018-05-25 198.0090 188.580002
3 2018-05-29 196.6230 187.899994
4 2018-05-29 196.9800 187.899994
5 2018-05-30 197.1375 187.500000
6 2018-05-30 196.6965 187.500000
7 2018-05-30 196.8750 187.500000
8 2018-05-31 196.2135 186.869995
9 2018-05-31 196.2135 186.869995
10 2018-05-31 196.5600 186.869995
11 2018-05-31 196.7700 186.869995
12 2018-05-31 196.9275 186.869995
13 2018-05-31 196.2135 186.869995
14 2018-05-31 196.2135 186.869995
15 2018-06-01 197.2845 190.240005
16 2018-06-01 197.2845 190.240005
17 2018-06-04 201.2325 191.830002
18 2018-06-04 201.4740 191.830002
I want to create another column (for each observation) (called days_to_hit_target for example) which is the difference of days such that close hits (or crosses target of specific day), then it counts the difference of days and put them in the column days_to_hit_target.
The idea is, suppose close price today in 2018-05-25 is 188.58, so, I want to get the date for which this target (198.0090) is hit close which it is doing somewhere later on 2018-06-04, where close has reached to the target of first observation, (198.0090), that will be fed to the first observation of the column (days_to_hit_target ).
Use a combination of loc and at to find the date at which the target is hit, then subtract the dates.
df['TargetDate'] = 'NA'
for i, row in df.iterrows():
t = row['Target1']
d = row['Date']
targdf = df.loc[df['Close'] >= t]
if len(targdf)>0:
targdt = targdf.at[0,'Date']
df.at[i,'TargetDate'] = targdt
else:
df.at[i,'TargetDate'] = '0'
df['Diff'] = df['Date'].sub(df['TargetDate'], axis=0)
import pandas as pd
csv = pd.read_csv(
'sample.csv',
parse_dates=['Date']
)
csv.sort_values('Date', inplace=True)
def find_closest(row):
target = row['Target1']
date = row['Date']
matches = csv[
(csv['Close'] >= target) &
(csv['Date'] > date)
]
closest_date = matches['Date'].iloc[0] if not matches.empty else None
row['days to hit target'] = (closest_date - date).days if closest_date else None
return row
final = csv.apply(find_closest, axis=1)
It's a bit hard to test because none of the targets appear in the close. But the idea is simple. Subset your original frame such that date is after the current row date and Close is greater than or equal to Target1 and get the first entry (this is after you've sorted it using df.sort_values.
If the subset is empty, use None. Otherwise, use the Date. Days to hit target is pretty simple at that point.

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes, it skips a second or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the amount of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the amount of data points becomes less. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with complete time column and then interpolate:
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time

Is there vectorized string.format in Pandas?

From my research, I see that I can only use apply to format a string in Pandas, which is extremely slow in large datasets because apply is essentially a loop over the entire data. Theoretically, format is a vectorizable function because it does not depend on other rows. Therefore, is there any way that we can vectorize it?
For example, one of my work wants to do this:
joined["timestamp"] = joined.apply(lambda row: args.date + " {:0>2d}:{:0>2d}:00".format(row["tid"]/6, row["tid"]%6*10), axis=1)
where tid is an integer. Some sample data (joined): (date="20170101")
tid timestamp
1 20170101 00:10:00
10 20170101 01:40:00
I believe it is a common case to append a new string column by formatting some other columns.
Thank you!
I believe you need str.zfill and change division to floor division (//):
print (joined)
tid
0 1
1 10
a ='20170101'
b = ' ' + (joined["tid"] // 6).astype(str).str.zfill(2) + ':'
c = (joined["tid"] % 6 * 10).astype(str).str.zfill(2) + ':00'
joined["timestamp"] = a + b + c
print (joined)
tid timestamp
0 1 20170101 00:10:00
1 10 20170101 01:40:00

Finding start of the maximum drawdown in Pandas [duplicate]

Maximum Drawdown is a common risk metric used in quantitative finance to assess the largest negative return that has been experienced.
Recently, I became impatient with the time to calculate max drawdown using my looped approach.
def max_dd_loop(returns):
"""returns is assumed to be a pandas series"""
max_so_far = None
start, end = None, None
r = returns.add(1).cumprod()
for r_start in r.index:
for r_end in r.index:
if r_start < r_end:
current = r.ix[r_end] / r.ix[r_start] - 1
if (max_so_far is None) or (current < max_so_far):
max_so_far = current
start, end = r_start, r_end
return max_so_far, start, end
I'm familiar with the common perception that a vectorized solution would be better.
The questions are:
can I vectorize this problem?
What does this solution look like?
How beneficial is it?
Edit
I modified Alexander's answer into the following function:
def max_dd(returns):
"""Assumes returns is a pandas Series"""
r = returns.add(1).cumprod()
dd = r.div(r.cummax()).sub(1)
mdd = dd.min()
end = dd.argmin()
start = r.loc[:end].argmax()
return mdd, start, end
df_returns is assumed to be a dataframe of returns, where each column is a seperate strategy/manager/security, and each row is a new date (e.g. monthly or daily).
cum_returns = (1 + df_returns).cumprod()
drawdown = 1 - cum_returns.div(cum_returns.cummax())
I had first suggested using .expanding() window but that's obviously not necessary with the .cumprod() and .cummax() built ins to calculate max drawdown up to any given point:
df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)}, index=pd.date_range(start=date(2016,1,1), periods=1000, freq='D'))
df = pd.DataFrame(data={'returns': np.random.normal(0.001, 0.05, 1000)},
index=pd.date_range(start=date(2016, 1, 1), periods=1000, freq='D'))
df['cumulative_return'] = df.returns.add(1).cumprod().subtract(1)
df['max_drawdown'] = df.cumulative_return.add(1).div(df.cumulative_return.cummax().add(1)).subtract(1)
returns cumulative_return max_drawdown
2016-01-01 -0.014522 -0.014522 0.000000
2016-01-02 -0.022769 -0.036960 -0.022769
2016-01-03 0.026735 -0.011214 0.000000
2016-01-04 0.054129 0.042308 0.000000
2016-01-05 -0.017562 0.024004 -0.017562
2016-01-06 0.055254 0.080584 0.000000
2016-01-07 0.023135 0.105583 0.000000
2016-01-08 -0.072624 0.025291 -0.072624
2016-01-09 -0.055799 -0.031919 -0.124371
2016-01-10 0.129059 0.093020 -0.011363
2016-01-11 0.056123 0.154364 0.000000
2016-01-12 0.028213 0.186932 0.000000
2016-01-13 0.026914 0.218878 0.000000
2016-01-14 -0.009160 0.207713 -0.009160
2016-01-15 -0.017245 0.186886 -0.026247
2016-01-16 0.003357 0.190869 -0.022979
2016-01-17 -0.009284 0.179813 -0.032050
2016-01-18 -0.027361 0.147533 -0.058533
2016-01-19 -0.058118 0.080841 -0.113250
2016-01-20 -0.049893 0.026914 -0.157492
2016-01-21 -0.013382 0.013173 -0.168766
2016-01-22 -0.020350 -0.007445 -0.185681
2016-01-23 -0.085842 -0.092648 -0.255584
2016-01-24 0.022406 -0.072318 -0.238905
2016-01-25 0.044079 -0.031426 -0.205356
2016-01-26 0.045782 0.012917 -0.168976
2016-01-27 -0.018443 -0.005764 -0.184302
2016-01-28 0.021461 0.015573 -0.166797
2016-01-29 -0.062436 -0.047836 -0.218819
2016-01-30 -0.013274 -0.060475 -0.229189
... ... ... ...
2018-08-28 0.002124 0.559122 -0.478738
2018-08-29 -0.080303 0.433921 -0.520597
2018-08-30 -0.009798 0.419871 -0.525294
2018-08-31 -0.050365 0.348359 -0.549203
2018-09-01 0.080299 0.456631 -0.513004
2018-09-02 0.013601 0.476443 -0.506381
2018-09-03 -0.009678 0.462153 -0.511158
2018-09-04 -0.026805 0.422960 -0.524262
2018-09-05 0.040832 0.481062 -0.504836
2018-09-06 -0.035492 0.428496 -0.522411
2018-09-07 -0.011206 0.412489 -0.527762
2018-09-08 0.069765 0.511031 -0.494817
2018-09-09 0.049546 0.585896 -0.469787
2018-09-10 -0.060201 0.490423 -0.501707
2018-09-11 -0.018913 0.462235 -0.511131
2018-09-12 -0.094803 0.323611 -0.557477
2018-09-13 0.025736 0.357675 -0.546088
2018-09-14 -0.049468 0.290514 -0.568542
2018-09-15 0.018146 0.313932 -0.560713
2018-09-16 -0.034118 0.269104 -0.575700
2018-09-17 0.012191 0.284576 -0.570527
2018-09-18 -0.014888 0.265451 -0.576921
2018-09-19 0.041180 0.317562 -0.559499
2018-09-20 0.001988 0.320182 -0.558623
2018-09-21 -0.092268 0.198372 -0.599348
2018-09-22 -0.015386 0.179933 -0.605513
2018-09-23 -0.021231 0.154883 -0.613888
2018-09-24 -0.023536 0.127701 -0.622976
2018-09-25 0.030160 0.161712 -0.611605
2018-09-26 0.025528 0.191368 -0.601690
Given a time series of returns, we need to evaluate the aggregate return for every combination of starting point to ending point.
The first trick is to convert a time series of returns into a series of return indices. Given a series of return indices, I can calculate the return over any sub-period with the return index at the beginning ri_0 and at the end ri_1. The calculation is: ri_1 / ri_0 - 1.
The second trick is to produce a second series of inverses of return indices. If r is my series of return indices then 1 / r is my series of inverses.
The third trick is to take the matrix product of r * (1 / r).Transpose.
r is an n x 1 matrix. (1 / r).Transpose is a 1 x n matrix. The resulting product contains every combination of ri_j / ri_k. Just subtract 1 and I've actually got returns.
The fourth trick is to ensure that I'm constraining my denominator to represent periods prior to those being represented by the numerator.
Below is my vectorized function.
import numpy as np
import pandas as pd
def max_dd(returns):
# make into a DataFrame so that it is a 2-dimensional
# matrix such that I can perform an nx1 by 1xn matrix
# multiplication and end up with an nxn matrix
r = pd.DataFrame(returns).add(1).cumprod()
# I copy r.T to ensure r's index is not the same
# object as 1 / r.T's columns object
x = r.dot(1 / r.T.copy()) - 1
x.columns.name, x.index.name = 'start', 'end'
# let's make sure we only calculate a return when start
# is less than end.
y = x.stack().reset_index()
y = y[y.start < y.end]
# my choice is to return the periods and the actual max
# draw down
z = y.set_index(['start', 'end']).iloc[:, 0]
return z.min(), z.argmin()[0], z.argmin()[1]
How does this perform?
for the vectorized solution I ran 10 iterations over the time series of lengths [10, 50, 100, 150, 200]. The time it took is below:
10: 0.032 seconds
50: 0.044 seconds
100: 0.055 seconds
150: 0.082 seconds
200: 0.047 seconds
The same test for the looped solution is below:
10: 0.153 seconds
50: 3.169 seconds
100: 12.355 seconds
150: 27.756 seconds
200: 49.726 seconds
Edit
Alexander's answer provides superior results. Same test using modified code
10: 0.000 seconds
50: 0.000 seconds
100: 0.004 seconds
150: 0.007 seconds
200: 0.008 seconds
I modified his code into the following function:
def max_dd(returns):
r = returns.add(1).cumprod()
dd = r.div(r.cummax()).sub(1)
mdd = drawdown.min()
end = drawdown.argmin()
start = r.loc[:end].argmax()
return mdd, start, end
I recently had a similar issue, but instead of a global MDD, I was required to find the MDD for the interval after each peak. Also, in my case, I was supposed to take the MDD of each strategy alone and thus wasn't required to apply the cumprod. My vectorized implementation is also based on Investopedia.
def calc_MDD(networth):
df = pd.Series(networth, name="nw").to_frame()
max_peaks_idx = df.nw.expanding(min_periods=1).apply(lambda x: x.argmax()).fillna(0).astype(int)
df['max_peaks_idx'] = pd.Series(max_peaks_idx).to_frame()
nw_peaks = pd.Series(df.nw.iloc[max_peaks_idx.values].values, index=df.nw.index)
df['dd'] = ((df.nw-nw_peaks)/nw_peaks)
df['mdd'] = df.groupby('max_peaks_idx').dd.apply(lambda x: x.expanding(min_periods=1).apply(lambda y: y.min())).fillna(0)
return df
Here is an sample after running this code:
nw max_peaks_idx dd mdd
0 10000.000 0 0.000000 0.000000
1 9696.948 0 -0.030305 -0.030305
2 9538.576 0 -0.046142 -0.046142
3 9303.953 0 -0.069605 -0.069605
4 9247.259 0 -0.075274 -0.075274
5 9421.519 0 -0.057848 -0.075274
6 9315.938 0 -0.068406 -0.075274
7 9235.775 0 -0.076423 -0.076423
8 9091.121 0 -0.090888 -0.090888
9 9033.532 0 -0.096647 -0.096647
10 8947.504 0 -0.105250 -0.105250
11 8841.551 0 -0.115845 -0.115845
And here is an image of the complete applied to the complete dataset.
Although vectorized, this code is probably slower than the other, because for each time-series, there should be many peaks, and each one of these requires calculation, and so O(n_peaks*n_intervals).
PS: I could have eliminated the zero values in the dd and mdd columns, but I find it useful that these values help indicate when a new peak was observed in the time-series.

Speeding up the insertion of null rows into a large Pandas DataFrame?

I have a Pandas data frame with hundreds of millions of rows that looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/05/16 A 1 60
01/02/16 B 1 59
01/04/16 B 1 90
01/10/16 B 1 84
For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in empty dates starting from the oldest date for that unique group b to the maximum date in the entire dataframe df. That is, so it looks like this:
Date Attribute A Attribute B Value
01/01/16 A 1 50
01/02/16 A 1 0
01/03/16 A 1 0
01/04/16 A 1 0
01/05/16 A 1 60
01/02/16 B 1 59
01/03/16 B 1 0
01/04/16 B 1 90
01/05/16 B 1 0
01/06/16 B 1 0
01/07/16 B 1 0
01/08/16 B 1 84
and then calculate the coefficient of variation (standard deviation/mean) for each unique combination's values (after inserting 0s). My code is this:
final = pd.DataFrame()
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
idx = pd.date_range(group['Date'].min(),
max_date)
temp = group.set_index('Date').reindex(idx, fill_value=0)
coeff_var = temp['Value'].std()/temp['Value'].mean()
final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
This runs insanely slow, and I'm looking for a way to speed it up.
Suggestions?
I don't have a ready solution, however this is how I suggest you approach the problem:
Understand what makes this slow
Find ways to make the critical parts faster
Or, alternatively, find a new approach
Here's the analysis of your code using line profiler:
Timer unit: 1e-06 s
Total time: 0.028074 s
File: <ipython-input-54-ad49822d490b>
Function: foo at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def foo():
2 1 875 875.0 3.1 final = pd.DataFrame()
3 1 302 302.0 1.1 max_date = df['Date'].max()
4 3 3343 1114.3 11.9 for name, group in df.groupby(['Attribute_A','Attribute_B']):
5 2 836 418.0 3.0 idx = pd.date_range(group['Date'].min(),
6 2 3601 1800.5 12.8 max_date)
7
8 2 6713 3356.5 23.9 temp = group.set_index('Date').reindex(idx, fill_value=0)
9 2 1961 980.5 7.0 coeff_var = temp['Value'].std()/temp['Value'].mean()
10 2 10443 5221.5 37.2 final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])
In conclusion, the .reindex and concat statements take 60% of the time.
A first approach that saves 42% of time in my measurement is to collect the data for the final data frame as a list of rows, and create the dataframe as the very last step. Like so:
newdata = []
max_date = df['Date'].max()
for name, group in df.groupby(['Attribute_A','Attribute_B']):
idx = pd.date_range(group['Date'].min(),
max_date)
temp = group.set_index('Date').reindex(idx, fill_value=0)
coeff_var = temp['Value'].std()/temp['Value'].mean()
newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1],'Coeff_Var':coeff_var})
final = pd.DataFrame.from_records(newdata)
Using timeit to measure best execution times I get
your solution: 100 loops, best of 3: 11.5 ms per loop
improved concat: 100 loops, best of 3: 6.67 ms per loop
Details see this ipython notebook
Note: Your mileage may vary - I used the sample data provided in the original post. You should run the line profiler on a subset of your real data - the dominating factor in regards to time use may well be something else then.
I am not sure if my way is faster than the way that you set up, but here goes:
df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
'Attribute A': ['A', 'A', 'B', 'B', 'B'],
'Attribute B': [1, 1, 1, 1, 1],
'Value': [50, 60, 59, 90, 84]})
unique_attributes = df['Attribute A'].unique()
groups = []
for i in unique_attributes:
subset = df[df['Attribute A'] ==i]
dates = subset['Date'].tolist()
Dates = pd.date_range(dates[0], dates[-1])
subset.set_index('Date', inplace=True)
subset.index = pd.DatetimeIndex(subset.index)
subset = subset.reindex(Dates)
subset['Attribute A'].fillna(method='ffill', inplace=True)
subset['Attribute B'].fillna(method='ffill', inplace=True)
subset['Value'].fillna(0, inplace=True)
groups.append(subset)
result = pd.concat(groups)

Categories