I want to resample() my daily data into six-month chunks. However, I want the ends of the six-month chunks to be the ends of April and October. If I use df.resample('6M').sum() (or df.groupby(pd.Grouper(freq='6M')).sum()), the end of the first six-month chunk is the end of the first month in the data. I know about anchored offsets, but I do not know how to create a custom anchored offset (e.g., '6M-APR' does not work).
Here is some example code:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame(
data={'logret': np.random.randn(1000)},
index=pd.date_range(start='2001-05-25', periods=1000, freq='B')
)
df.resample('6M').sum()
Which yields the following output:
logret
2001-05-31 2.2950148716254297
2001-11-30 -12.536360930670858
2002-05-31 5.468848462868161
2002-11-30 13.027927629740189
2003-05-31 -10.37282118563155
2003-11-30 -0.156275418330286
2004-05-31 -3.0768727498370905
2004-11-30 28.328856464071546
2005-05-31 -3.6462613215100546
I have not achieved my goal (six-month resampling that ends in April and October) with the start, offset, and loffset arguments to .resample().
I have achieved my goal with the hack below. However, it loses the date index, and I would like a more robust/repeatable approach.
def sixmonth(d, b=4):
    y, m, h = d.year, d.month, 1
    if m > (b + 6): y += 1
    elif m > b: h += 1
    return y + h/10

df.groupby(sixmonth).sum()
Which yields the following output without a date:
logret
2001.2 -10.300839024148
2002.1 9.321994034984547
2002.2 8.855517878860585
2003.1 -2.4576797445001493
2003.2 -7.002919570231796
2004.1 -9.36895555474087
2004.2 27.13038641177464
2005.1 3.154551390326532
Of course, I could improve this hack. But is there a better/robust/repeatable solution for n-period resampling that ends in arbitrary months?
Another workaround, keeping the datetime index:
def custom_6M(df, month=4):
    df = df.resample("M").sum()
    df = df.rolling(6).sum()
    return df[df.index.month.isin([month, month + 6])]
>>> custom_6M(df)
logret
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
It's a pain. When I needed something similar, I ended up with the following approach:
anchor_month = 4
non_months = (anchor_month + 3) % 12, (anchor_month + 9) % 12
df = df.resample('Q-APR').sum()
df = (df.reset_index()
.groupby(df.index.month.isin(non_months).cumsum())
.agg({'index': 'last', 'logret': 'sum'})
.set_index('index'))
Result here:
logret
index
2001-10-31 -10.300839
2002-04-30 9.321994
2002-10-31 8.855518
2003-04-30 -2.457680
2003-10-31 -7.002920
2004-04-30 -9.368956
2004-10-31 27.130386
2005-04-30 3.154551
But the problem is that sometimes the last index doesn't fit (it is okay here). That can be fixed by another '6M' resample. Overall: not pretty.
Thanks for the answers.
I have two more options.
Append a time-stamped series to df to anchor the six-month resampling periods
I hoped that .resample()'s origin argument would let me manually anchor my six-month resampling periods. It doesn't, but the following code does.
df.append(pd.Series(name=pd.to_datetime('2001-04-30'), dtype='float')).resample('6M').sum()
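A hedged aside: DataFrame.append has been removed in newer pandas versions, so an equivalent sketch (the anchor date and columns are taken from the example above) prepends an empty anchor row with pd.concat:
# Sketch only: an all-NaN anchor row at 2001-04-30 plays the same role as the appended
# empty Series; sum() skips the NaN, so the totals are unchanged.
anchor = pd.DataFrame(index=[pd.Timestamp('2001-04-30')], columns=df.columns, dtype='float')
pd.concat([anchor, df]).resample('6M').sum()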
Improve my sixmonth() function to use timestamps
def sixmonth(d, m=6, n=4):
    o = (m - (d.month - n)) % m
    return d + pd.offsets.MonthEnd(o)
I first .resample('M') to make sure that I have end-of-month dates.
I could modify sixmonth() to check for end-of-month dates, but I'm more afraid of finding some new edge case than a little inefficiency.
df.resample('M').sum().groupby(sixmonth).sum()
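As a quick sanity check (added here for illustration, not part of the original post), the resulting group labels should land only on April and October month-ends:
# The group labels produced by sixmonth() should all be April/October month-ends
result = df.resample('M').sum().groupby(sixmonth).sum()
print(pd.DatetimeIndex(result.index).month.unique())  # expected: only 4 and 10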
I have two dataframes
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now, in the first dataframe, I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course, my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])     # find the nearest timestamp in the second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  # identify the index of this timestamp
    test1['values'][k] = b                           # assign this value to the cell
This code is very slow on large datasets, how can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days

if offset < 0:
    # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:
    # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))

test1['values'] = idx
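For completeness, a different technique (not from the answers above, just a sketch): pd.merge_asof with direction='nearest' also handles unevenly spaced timestamps, provided both 'time' columns are sorted, as stated in the question.
# Match each test1 timestamp to the closest test2 timestamp; reset_index() carries
# test2's positional index along as a column named 'index'.
nearest_match = pd.merge_asof(
    test1[['time']],        # left: the timestamps to label
    test2.reset_index(),    # right: test2 timestamps plus their original index
    on='time',
    direction='nearest',    # pick the closest match, earlier or later
)
test1['values'] = nearest_match['index'].to_numpy()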
I am dealing with a pandas dataframe where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the SP500 index, together with a column of returns from the index. It's fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks for the sake of example):
DateTime SPY AAPL AMZN T
2014-01-02 9:30 0.032 -0.01 0.164 0.007
2014-01-02 9:31 -0.012 0.02 0.001 -0.004
2014-01-02 9:32 -0.015 0.031 0.004 -0.001
I am trying to compute the betas of each stock for each day and each 30-minute window. The beta of a stock in this case is defined as the covariance between its returns and the SPY returns divided by the variance of SPY in the same period. My desired output is a 3-dimensional numpy array beta_HF where beta_HF[s, i, j], for instance, means the beta of stock s on day i in window j. At the moment, I am computing the betas in the following way (let returns be the full dataframe):
import datetime as dt  # needed for dt.time below

trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start="9:30", end="16:00", freq=window).time

def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))
    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)
        for j, end_time in enumerate(moments[1:]):
            moment_data = daily_data.between_time(start_time, end_time)
            covariances = np.array([moment_data[index].cov(moment_data[symbol]) for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]
        if verbose == True:
            if np.remainder(i, 100) == 0:
                print("Current Trading Day: {}".format(day))
    return beta_HF
The dispersion() function generates the correct output. However, I understand that I am looping over long iterables and this is not very efficient. I seek a more efficient way to "slice" the dataframe at each 30-minute window for each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 Macbook Pro) it's taking around 8 minutes to compute everything. I tested it on a research server of my university and the computing time was basically the same, which probably implies that computing power is not the bottleneck; rather, this part of my code is inefficient. I would appreciate any ideas on how to make this faster.
One might point out that parallelization is the way to go here since the elements in beta_HF never interact with each other. So this seems to be easy to parallelize. However, I have never implemented anything with parallelization so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
You can use pandas Grouper in order to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and it will iterate over times that do not exist.
The first issue basically means that the window will slide from 9:30-9:59 to 10:00-10:29 instead of 9:30-10:00 to 10:00-10:30.
The second issue comes to play during holidays and night when no trading takes place. Hence, if you have a large period without trading then you might want to split the DataFrame and combine them afterwards.
Create example data
import pandas as pd
import numpy as np
time = pd.date_range(start="2014-01-02 09:30",
end="2014-01-02 16:00", freq="min")
time = time.append( pd.date_range(start="2014-01-03 09:30",
end="2014-01-03 16:00", freq="min") )
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4)-0.5,
index=time, columns=['SPY','AAPL','AMZN','T'])
Define the range you want to use
freq = '30min'
obs_per_day = len(pd.date_range(start = "9:30", end = "16:00", freq = "30min"))
trading_days = len(pd.unique(df.index.date))
Make a function to calculate the beta values
def beta(df):
    if df.empty:                  # return NaN when no trading takes place
        return np.nan
    mat = df.to_numpy()           # numpy is faster than pandas
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]  # demean
    dof = mat.shape[0] - 1        # degrees of freedom
    if dof != 0:                  # check that the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof   # covariance with the first column
        return mat[1:] / mat[0]            # beta
    else:
        return np.zeros(mat.shape[1] - 1)  # return zeros for too-short data, e.g. at 16:00
And in the end, use pd.groupby().apply()
res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array( [k for k in res.values if ~np.isnan(k).any()] ) # remove NaN
res = res.reshape([trading_days, obs_per_day, df.shape[1]-1])
Note that the result is in a slightly different shape than yours.
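If you need the original [stock, day, window] layout, a small added sketch (assuming res from the code above) reorders the axes:
# Reorder axes from (day, window, stock) to (stock, day, window), matching beta_HF[s, i, j]
beta_HF = res.transpose(2, 0, 1)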
The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:
trading_days = pd.unique(df.index.date)
# Your result
moments1 = pd.date_range(start = "9:30", end = "10:00", freq = "30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))
# mine
moments2 = pd.date_range(start = "9:30", end = "10:00", freq = "29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))
I am trying to replace string values in a column without creating a copy. I have looked at the docs provided in the warning and also this question. I have also tried using .replace() with the same results. What am I not understanding?
Code:
import pandas as pd
from datetime import timedelta

# set csv file as constant
TRADER_READER = pd.read_csv('TastyTrades.csv')
TRADER_READER['Strategy'] = ''

def iron_condor():
    TRADER_READER['Date'] = pd.to_datetime(TRADER_READER['Date'], format="%Y-%m-%d %H:%M:%S")
    a = 0
    b = 1
    c = 2
    d = 3
    for row in TRADER_READER.index:
        start_time = TRADER_READER['Date'][a]
        end_time = start_time + timedelta(seconds=5)
        e = TRADER_READER.iloc[a]
        f = TRADER_READER.iloc[b]
        g = TRADER_READER.iloc[c]
        h = TRADER_READER.iloc[d]
        if start_time <= f['Date'] <= end_time and f['Underlying Symbol'] == e['Underlying Symbol']:
            if start_time <= g['Date'] <= end_time and g['Underlying Symbol'] == e['Underlying Symbol']:
                if start_time <= h['Date'] <= end_time and h['Underlying Symbol'] == e['Underlying Symbol']:
                    e.loc[e['Strategy']] = 'Iron Condor'
                    f.loc[f['Strategy']] = 'Iron Condor'
                    g.loc[g['Strategy']] = 'Iron Condor'
                    h.loc[h['Strategy']] = 'Iron Condor'
                    print(e, f, g, h)
        if (d + 1) > int(TRADER_READER.index[-1]):
            break
        else:
            a += 1
            b += 1
            c += 1
            d += 1

iron_condor()
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_with_indexer(indexer, value)
Hopefully this satisfies the data needed to replicate:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL
Expected result:
,Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
36,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103P00206500,Equity Option,Bought 1 QQQ 01/03/20 Put 206.50 # 0.07,-7,1,-7,-1.0,-0.14,100.0,QQQ,1/3/2020,206.5,PUT,Iron Condor
37,2019-12-31 16:01:44,Trade,BUY_TO_OPEN,QQQ 200103C00217500,Equity Option,Bought 1 QQQ 01/03/20 Call 217.50 # 0.03,-3,1,-3,-1.0,-0.14,100.0,QQQ,1/3/2020,217.5,CALL,Iron Condor
38,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103P00209000,Equity Option,Sold 1 QQQ 01/03/20 Put 209.00 # 0.14,14,1,14,-1.0,-0.15,100.0,QQQ,1/3/2020,209.0,PUT,Iron Condor
39,2019-12-31 16:01:44,Trade,SELL_TO_OPEN,QQQ 200103C00214500,Equity Option,Sold 1 QQQ 01/03/20 Call 214.50 # 0.30,30,1,30,-1.0,-0.15,100.0,QQQ,1/3/2020,214.5,CALL,Iron Condor
40,2020-01-03 16:08:13,Trade,BUY_TO_CLOSE,QQQ 200103C00214500,Equity Option,Bought 1 QQQ 01/03/20 Call 214.50 # 0.07,-7,1,-7,0.0,-0.14,100.0,QQQ,1/3/2020,214.5,CALL,
Let's start from some improvements in the initial part of your code:
The leftmost column of your input file is apparently the index column, so it should be read as the index. One consequence is a slightly different way of accessing rows (details later).
The Date column can be converted to datetime64 as early as at read time.
So the initial part of your code can be:
TRADER_READER = pd.read_csv('Input.csv', index_col=0, parse_dates=['Date'])
TRADER_READER['Strategy'] = ''
Then I decided to organize the loop another way:
indStart is the integer position within the index.
As you process your file in "overlapping" groups of 4 consecutive rows, a more natural way to organize the loop is to stop at the 4th row from the end. So the loop runs over range(TRADER_READER.index.size - 3).
The indices of the 4 rows of interest can be read from the respective slice of the index, i.e. [indStart : indStart + 4].
The check of a particular row can be performed with a nested function.
To avoid your warning, set values in the Strategy column using loc on the original DataFrame, with the row parameter for the respective row and the column parameter for Strategy.
The whole update (for the current group of 4 rows) can be performed in a single instruction, specifying the row parameter as a slice from a through d.
So the code can be something like below:
def iron_condor():
    def rowCheck(row):
        return start_time <= row.Date <= end_time and row['Underlying Symbol'] == undSymb

    for indStart in range(TRADER_READER.index.size - 3):
        a, b, c, d = TRADER_READER.index[indStart : indStart + 4]
        e = TRADER_READER.loc[a]
        undSymb = e['Underlying Symbol']
        start_time = e.Date
        end_time = start_time + pd.Timedelta('5S')
        if rowCheck(TRADER_READER.loc[b]) and rowCheck(TRADER_READER.loc[c]) and rowCheck(TRADER_READER.loc[d]):
            TRADER_READER.loc[a:d, 'Strategy'] = 'Iron Condor'
            print('New values:')
            print(TRADER_READER.loc[a:d])
There is no need to increment a, b, c and d, and no break is needed.
Edit
If for some reason you have to do other updates on the rows in question,
you can change my code accordingly.
But I don't understand "this csv file will make a new column" in your comment. For now, everything you do is performed on the DataFrame in memory. Only after that can you save the DataFrame back to the original file. But note that even your code changes the type of the Date column, so I assume you do it once and afterwards the type of this column is just datetime64.
So you should probably change the type of the Date column as a separate operation and then (possibly many times) update the DataFrame and save the updated content back to the source file.
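For illustration only (the file name is taken from the question), that separate conversion plus a final write-back could look like:
# One-time: read the file with Date already parsed, then work on the DataFrame in memory
TRADER_READER = pd.read_csv('TastyTrades.csv', index_col=0, parse_dates=['Date'])

# ... perform iron_condor() or any other updates here ...

# When finished, persist the updated content back to the source file
TRADER_READER.to_csv('TastyTrades.csv')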
Edit following the comment as of 21:22:46Z
re.search('.*TO_OPEN$', row['Action']) returns a re.Match object if
a match has been found, otherwise None.
So you cannot compare this result with the string you searched for. If you wanted to get the matched string, you could run e.g.:
mtch = re.search('.*TO_OPEN$', row['Action'])
textFound = None
if mtch:
    textFound = mtch.group(0)
But you actually don't need to do it. It is enough to check whether
a match has been found, so the condition can be:
found = bool(re.search('.*TO_OPEN$', row['Action']))
(note that None converts to False, while a Match object converts to True).
Yet another (probably simpler and quicker) solution is to just run:
row.Action.endswith('TO_OPEN')
without invoking any regex function.
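And if you need that check for every row at once, a vectorized variant is available (a sketch, using the Action column from the question):
# Boolean mask: True for rows whose Action ends with 'TO_OPEN'
opens = TRADER_READER['Action'].str.endswith('TO_OPEN')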
Here is a quite elaborate post that not only answers your question but also explains in detail why things are the way they are.
Deal with SettingWithCopyWarning
In short, if you want to set values on the original df, either use .replace(inplace=True) or df.loc[condition, theColtoBeSet] = new_val
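For instance, a minimal sketch of the loc pattern, reusing columns from this question (the condition itself is made up):
# Hypothetical condition: label every SELL_TO_OPEN row; the point is the single .loc assignment
TRADER_READER.loc[TRADER_READER['Action'] == 'SELL_TO_OPEN', 'Strategy'] = 'Iron Condor'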
I have a data frame with a column of start dates and a column of end dates. I want to check the integrity of the dates by ensuring that the start date is before the end date (i.e. start_date < end_date). I have over 14,000 observations to run through.
I have data in the form of:
Start End
0 2008-10-01 2008-10-31
1 2006-07-01 2006-12-31
2 2000-05-01 2002-12-31
3 1971-08-01 1973-12-31
4 1969-01-01 1969-12-31
I have added a column to write the result to, even though I just want to highlight whether there are incorrect ones so I can delete them:
dates['Correct'] = " "
And I have begun to check each date pair using the following, where my dataframe is called dates:
for index, row in dates.iterrows():
    if dates.Start[index] < dates.End[index]:
        dates.Correct[index] = "correct"
    elif dates.Start[index] == dates.End[index]:
        dates.Correct[index] = "same"
    elif dates.Start[index] > dates.End[index]:
        dates.Correct[index] = "incorrect"
This works; it is just taking a really long time (over 15 minutes). I need more efficient code. Is there something I am doing wrong or could improve?
Why not just do it in a vectorized way:
is_correct = dates['Start'] < dates['End']
is_incorrect = dates['Start'] > dates['End']
is_same = ~is_correct & ~is_incorrect
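To turn those masks into the Correct column from the question, one added option (a sketch, not part of the original answer) is np.select:
import numpy as np

# Map the boolean masks to labels; rows that are neither correct nor incorrect are 'same'
dates['Correct'] = np.select(
    [is_correct, is_incorrect],
    ['correct', 'incorrect'],
    default='same',
)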
Since the list doesn't need to be compared sequentially, you can gain performance by splitting your dataset and then using multiple processes to perform the comparison simultaneously. Take a look at the multiprocessing module for help.
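A rough sketch of that idea (all names are illustrative and not from the answer; in practice the vectorized comparison above will usually be faster):
from multiprocessing import Pool

import numpy as np
import pandas as pd


def check_chunk(chunk):
    # Label each row of one chunk by comparing Start and End
    labels = pd.Series('same', index=chunk.index)
    labels[chunk['Start'] < chunk['End']] = 'correct'
    labels[chunk['Start'] > chunk['End']] = 'incorrect'
    return labels


if __name__ == '__main__':
    dates = pd.DataFrame({
        'Start': pd.to_datetime(['2008-10-01', '2006-07-01', '2002-12-31']),
        'End':   pd.to_datetime(['2008-10-31', '2006-12-31', '2000-05-01']),
    })
    n_chunks = 2
    size = int(np.ceil(len(dates) / n_chunks))
    chunks = [dates.iloc[i:i + size] for i in range(0, len(dates), size)]
    with Pool(n_chunks) as pool:
        dates['Correct'] = pd.concat(pool.map(check_chunk, chunks))
    print(dates)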
Something like the following may be quicker:
import pandas as pd
import datetime
df = pd.DataFrame({
'start': ["2008-10-01", "2006-07-01", "2000-05-01"],
'end': ["2008-10-31", "2006-12-31", "2002-12-31"],
})
def comparison_check(df):
    start = datetime.datetime.strptime(df['start'], "%Y-%m-%d").date()
    end = datetime.datetime.strptime(df['end'], "%Y-%m-%d").date()
    if start < end:
        return "correct"
    elif start == end:
        return "same"
    return "incorrect"
In [23]: df.apply(comparison_check, axis=1)
Out[23]:
0 correct
1 correct
2 correct
dtype: object
Timings
In [26]: %timeit df.apply(comparison_check, axis=1)
1000 loops, best of 3: 447 µs per loop
So by my calculations, 14,000 rows should take (447 µs / 3) × 14,000 = 149 µs × 14,000 ≈ 2.086 s, quite a bit shorter than 15 minutes :)