Remove non-business days rows from pandas dataframe - python

I have a DataFrame with time series data for wheat in df.
df = wt["WHEAT_USD"]
2016-05-02 02:00:00+02:00 4.780
2016-05-02 02:01:00+02:00 4.777
2016-05-02 02:02:00+02:00 4.780
2016-05-02 02:03:00+02:00 4.780
2016-05-02 02:04:00+02:00 4.780
Name: closeAsk, dtype: float64
When I plot the data it has these annoying horizontal lines because of weekends. Is there a simple way of removing the non-business days from the dataframe itself?
Something like
df = df.BDays()

One simple solution is to select only the rows whose weekday falls Monday to Friday:
In [11]: s[s.index.dayofweek < 5]
Out[11]:
2016-05-02 00:00:00 4.780
2016-05-02 00:01:00 4.777
2016-05-02 00:02:00 4.780
2016-05-02 00:03:00 4.780
2016-05-02 00:04:00 4.780
Name: closeAsk, dtype: float64
Note: this doesn't take into account bank holidays etc.

Pandas BDay just ends up using .dayofweek<5 like the chosen answer, but can be extended to account for bank holidays, etc.
import pandas as pd
from pandas.tseries.offsets import BDay

# BDay().is_on_offset(ts) is True when ts falls on a business day
# (the method was named .onOffset in older pandas versions)
isBusinessDay = BDay().is_on_offset

csv_path = 'C:\\Python27\\Lib\\site-packages\\bokeh\\sampledata\\daylight_warsaw_2013.csv'
dates_df = pd.read_csv(csv_path)
match_series = pd.to_datetime(dates_df['Date']).map(isBusinessDay)
dates_df[match_series]
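For instance, to also exclude bank holidays you can combine the same idea with a holiday calendar. A minimal sketch, assuming a DataFrame df indexed by a DatetimeIndex (the US federal calendar here is illustrative, not from the original answer):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# business days (weekends and US federal holidays excluded) covering the data
holidays = USFederalHolidayCalendar().holidays(start=df.index.min(), end=df.index.max())
bdays = pd.bdate_range(df.index.min(), df.index.max(), freq='C', holidays=holidays)

# keep only rows whose (normalized) date is one of those business days
df_business = df[df.index.normalize().isin(bdays)]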

I am building a backtester for stock/FX trading and I also run into this issue of days that are NaN because they are holidays or other non-trading days.
You can download a financial calendar for the days on which there is no trading, but then you still need to think about time zones, weekends, and so on.
The best solution, though, is not to use date/time as the index for the candles or prices.
So do not tie your price data to a date/time; index it by a simple counter of candles or prices instead. You can keep the date/time as a second index or column.
For calculations of moving averages or other technical indicators, don't use date/time.
If you look at MetaTrader 4/5, it also doesn't use date/time: the index of the data is the candle number.
I think you need to let go of date/time as the index if you work with stock or FX data. Of course you can keep the timestamps in a column of the DataFrame, just don't use them as the index.
This way you can avoid many problems; a sketch of this approach follows below.
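A minimal sketch of that idea, assuming a price series shaped like the one in the question (the closeAsk column name and the 20-candle window are just illustrative):
import pandas as pd

# keep the timestamp as an ordinary column and index by candle number instead,
# so weekend/holiday gaps no longer appear in plots or window calculations
candles = df.reset_index()
candles.index.name = 'candle'

# e.g. a 20-candle moving average computed by position, not by time
candles['ma20'] = candles['closeAsk'].rolling(20).mean()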

Using the workdays package, you can account for holidays pretty easily:
import workdays as wd

def drop_non_busdays(df, holidays=None):
    if holidays is None:
        holidays = []
    # first and last calendar dates covered by the index
    start_date = df.index.to_list()[0].date()
    end_date = df.index.to_list()[-1].date()
    # snap both ends to the nearest business day inside the range
    start_wd = wd.workday(wd.workday(start_date, -1, holidays), 1, holidays)
    end_wd = wd.workday(wd.workday(end_date, 1, holidays), -1, holidays)
    # enumerate every business day between the two
    b_days = [start_wd]
    while b_days[-1] < end_wd:
        b_days.append(wd.workday(b_days[-1], 1, holidays))
    # keep only rows whose date is a business day
    # (compare dates, since the index holds full timestamps)
    valid = [ts.date() in b_days for ts in df.index]
    return df[valid]
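A hypothetical usage, passing an extra holiday (the date is just an example):
import datetime as dt
filtered = drop_non_busdays(df, holidays=[dt.date(2016, 5, 30)])  # e.g. US Memorial Day 2016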

Building on @Andy Hayden's solution, you can also use query with a DataFrame for better method chaining in a "modern pandas" fashion.
If the date is a column (e.g. named my_date):
df.query("my_date.dt.dayofweek < 5")
If the date is the index and has a name (e.g. my_index_name or date):
df.query("my_index_name.dt.dayofweek < 5")
If the date is the index and it has no name:
df.rename_axis("date").query("date.dt.dayofweek < 5")
(index.dt.dayofweek or index.dayofweek does not work for me.)

Filtering can also be done simply by day name. For example, if you don't want Saturdays and Sundays, you can use this:
df = df[(df['date'].dt.day_name() != 'Saturday') & (df['date'].dt.day_name() != 'Sunday')]
This does not account for special holidays, etc.
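An equivalent, slightly more compact form uses isin:
# keep only weekday rows
df = df[~df['date'].dt.day_name().isin(['Saturday', 'Sunday'])]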

Related

Get records for the nearest date if record does not exist for a particular date

I have a pandas DataFrame of stock records. My goal is to pass in a particular day, e.g. 8, and get the filtered DataFrame for the 8th of each month and year in the dataset.
I have gone through some SO questions and managed to get one part of my requirement, which was getting the records for a particular day. However, if the data for, say, the 8th does not exist for a particular month and year, I need to get the records for the closest day where a record exists for that month and year.
As an example, if I pass in 8 and there is no record for 8th Jan 2022, I need to check whether records exist for 7th and 9th Jan 2022, and so on, and get the record for the nearest date.
If records are present for both the 7th and the 9th, I will take the record for the 9th (the higher date).
However, it is possible that the record for the 7th exists and the 9th does not, in which case I will take the record for the 7th (the closest).
Code I have written so far
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I tried to make it clear but if there is any doubt, please let me know. Any help in the correct direction is appreciated.
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex() and method 'nearest'. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
start=df.first_valid_index().normalize(), end=df.last_valid_index(),
freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', interval='1d')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day is after the missing one) is called peeking ahead and can introduce look-ahead bias in quant research that uses that data. It is usually considered dangerous. You'd be safer using method='ffill'.
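A minimal sketch of that safer variant (same setup as above):
# forward-fill only, so a missing 8th is filled from the most recent earlier day
df2 = df.resample('D').ffill()
df2 = df2.loc[df2.index.day == 8]

# or, with the reindex approach from Alternative 2
df2 = df.reindex(dates, method='ffill')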

How to aggregate irregularly sampled data for Time Series Analysis

I am trying to forecast daily profit using time series analysis, but the daily profit is recorded unevenly and some of the data is missing.
Raw Data:
Date         Revenue
2020/1/19    10$
2020/1/20    7$
2020/1/25    14$
2020/1/29    18$
2020/2/1     12$
2020/2/2     17$
2020/2/9     28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. Not only that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onwards.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit after 2020/2/9 ~.
This is my general idea, but as a beginner with Python and the pandas library, I have trouble executing it. Could you please help me aggregate the profit every 6 days so the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have an index of type DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
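A minimal sketch of that, assuming the raw table above has been loaded into a DataFrame df with a 'Date' column and a numeric 'Revenue' column:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
profit_6d = df.set_index('Date')['Revenue'].resample('6D').sum()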
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19',10],
['2020/1/20',7],
['2020/1/25',14],
['2020/1/29',18],
['2020/2/1',12],
['2020/2/2',17],
['2020/2/9',28]],columns=['Date','Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum for each 6 day period. We're not interested in every day's six day revenue total, only every 6th days, though:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing with summary_df is just to format it the way you'd like so that it clearly states the date range which each row refers to.
summary_df['Start Date'] = summary_df.index-pd.Timedelta('6d')
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28

Pandas return the next Sunday for every row

In Pandas for Python, I have a data set that has a column of datetimes in it. I need to create a new column that has the date of the following Sunday for each row.
I've tried various methods trying to use iterrows and then figure out the day of the week, and add a day until the day is 7, but it hasn't worked and I'm not even sure how I'd return the date instead of just the day number then. I also don't feel like iterrows would be the best way to do it either.
What is the best way to return a column of the following Sunday from a date column?
Use the Pandas date offsets, e.g.:
>>> pd.to_datetime('2019-04-09') + pd.offsets.Week(n=0, weekday=6)
Timestamp('2019-04-14 00:00:00')
As the example shows, this moves the provided datetime to the following Sunday. It is vectorised, so you can run it against a series like so:
temp['sunday_dates'] = temp['our_dates'] + pd.offsets.Week(n=0, weekday=6)
our_dates random_data sunday_dates
0 2010-12-31 4012 2011-01-02
1 2007-12-31 3862 2008-01-06
2 2006-12-31 3831 2007-01-07
3 2011-12-31 3811 2012-01-01
N.b. Pass n=0 to keep a day, which is already on a Sunday, on that day. Pass n=1 if you want to force it to the next Sunday. The Week(weekday=INT) parameter is 0 indexed on Monday and takes values from 0 to 6 (inclusive). Thus, passing 0 yields all Mondays, 1 yields all Tuesdays, etc. Using this, you can make everything any day of the week you would like.
N.b. If you want to go to the last Sunday, just swap + to - to go back.
N.b. (Such note, much bene) The specific documentation on time series functionality can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
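A small illustration of the n=0 vs n=1 behaviour described above:
import pandas as pd

sun = pd.Timestamp('2019-04-14')          # already a Sunday
sun + pd.offsets.Week(n=0, weekday=6)     # Timestamp('2019-04-14'), kept in place
sun + pd.offsets.Week(n=1, weekday=6)     # Timestamp('2019-04-21'), forced to the following Sunday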
The function
import datetime
def datetime_to_next_sunday(original_datetime):
    return original_datetime + datetime.timedelta(days=6 - original_datetime.weekday())
returns the datetime.datetime shifted to the next Sunday. Having
import pandas as pd
df = pd.DataFrame({'A': ['Foo', 'Bar'],
'datetime': [datetime.datetime.now(),
datetime.datetime.now() + datetime.timedelta(days=1)]})
the following line should do the job:
df['dt_following_sunday'] = df[['datetime']].applymap(datetime_to_next_sunday)
I suggest using the calendar library:
import calendar
import datetime as dt

# today's date
now = dt.datetime.now()
print(now.year, now.month, now.day, now.hour, now.minute, now.second)

# difference in days between the current date and the next Sunday
# (calendar.weekday: Monday == 0 ... Sunday == 6)
difday = (6 - calendar.weekday(now.year, now.month, now.day)) % 7

# add the difference with a timedelta so month/year boundaries are handled
nextsunday = now.date() + dt.timedelta(days=difday)
print(nextsunday)
Wrap this in a function and apply it to your date column.
The accepted answer is the way to go, but you can also use Series.apply() and pandas.Timedelta() for this, i.e.:
df["ns"] = df["d"].apply(lambda d: d + pd.Timedelta(days=(6 if d.weekday() == 6 else 6-d.weekday())))
d ns
0 2019-04-09 21:22:10.886702 2019-04-14 21:22:10.886702
Demo

Resampling a time series to year to date

Let's say I'm looking at US Treasury bill maturity data. It's a per cent rate, nominally measured daily but not on every calendar day.
I can get the geometric mean of a quarter's rate like so:
import pandas as pd
from scipy.stats.mstats import gmean
# ...
tbill_quarterly = raw_tbill.resample('Q').apply(lambda x: gmean(x).item())
How would I get a year-to-date quarterly aggregate from this data? That is, a figure each quarter (for 2018: 2018-03-31, 2018-06-30, 2018-09-30, 2018-12-31) holding the return from the start of the year to the quarterly date.
The resampling documentation (or really, the StackOverflow answer which serves as replacement for non-existent documentation) only provides frequencies. And I can't seem to find some kind of year-to-date function in scipy.stats.
Using help from Pandas DataFrame groupby overlapping intervals of variable length
import pandas as pd
import numpy as np
from scipy.stats.mstats import gmean
# Get data & format
df = pd.read_csv("...\DGS1.csv")
def convert(x):
    try:
        return float(x)
    except ValueError:
        return np.nan
# Format data
df['DATE'] = pd.to_datetime(df.DATE)
df['DGS1'] = df.DGS1.map(convert)
df = df.set_index('DATE').dropna()
# Reindex date according to #ifly6 answer
df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D'),method='ffill')
df = df.reset_index().rename(columns={'index': 'date'})
# Determine if the date sits inside the date interval
def year_to_quarter_groups(x):
    return pd.Series([l for l in bins if l[0] <= x <= l[1]])
# Create all the date intervals
# bins = [
# (pd.datetime(2013, 1, 1), pd.datetime(2013, 3, 31)),
# (pd.datetime(2013, 1, 1), pd.datetime(2013, 6, 30)),
# ...
# ]
dates_from_ = [[i]*4 for i in pd.date_range('1/1/2013', end='1/1/2019', freq='AS')]
dates_from = [item for sublist in dates_from_ for item in sublist] # flatten list
dates_to = pd.date_range('1/1/2013', end='1/1/2019', freq='Q')
bins = list(zip(dates_from, dates_to))
# Create a set of intervals that each date belongs to
# A date can belong to up to four intervals/quarters [0, 1, 2, 3]
intervals = df['date'].apply(year_to_quarter_groups).stack().reset_index(1, drop=True)
# Combine the dataframes
new = pd.concat([df, intervals], axis=1).rename(columns={0: 'year_to_quarter'}).drop('date', axis=1)
# Calculate the geometric mean
new.groupby(['year_to_quarter']).DGS1.apply(lambda x: gmean(x))
Out[]:
year_to_quarter
(2013-01-01 00:00:00, 2013-06-30 00:00:00) 0.140469
(2013-01-01 00:00:00, 2013-09-30 00:00:00) 0.125079
(2013-01-01 00:00:00, 2013-12-31 00:00:00) 0.124699
(2014-01-01 00:00:00, 2014-03-31 00:00:00) 0.119801
(2014-01-01 00:00:00, 2014-06-30 00:00:00) 0.110655
(2014-01-01 00:00:00, 2014-09-30 00:00:00) 0.109624
(2014-01-01 00:00:00, 2014-12-31 00:00:00) 0.117386
(2015-01-01 00:00:00, 2015-03-31 00:00:00) 0.222842
(2015-01-01 00:00:00, 2015-06-30 00:00:00) 0.235393
(2015-01-01 00:00:00, 2015-09-30 00:00:00) 0.267067
(2015-01-01 00:00:00, 2015-12-31 00:00:00) 0.301378
(2016-01-01 00:00:00, 2016-03-31 00:00:00) 0.574620
(2016-01-01 00:00:00, 2016-06-30 00:00:00) 0.569675
(2016-01-01 00:00:00, 2016-09-30 00:00:00) 0.564723
(2016-01-01 00:00:00, 2016-12-31 00:00:00) 0.605566
(2017-01-01 00:00:00, 2017-03-31 00:00:00) 0.882396
(2017-01-01 00:00:00, 2017-06-30 00:00:00) 0.994391
(2017-01-01 00:00:00, 2017-09-30 00:00:00) 1.071789
(2017-01-01 00:00:00, 2017-12-31 00:00:00) 1.175368
(2018-01-01 00:00:00, 2018-03-31 00:00:00) 1.935798
(2018-01-01 00:00:00, 2018-06-30 00:00:00) 2.054127
(2018-01-01 00:00:00, 2018-09-30 00:00:00) 2.054127
(2018-01-01 00:00:00, 2018-12-31 00:00:00) 2.054127
I hate to post an answer to my own question, but having solved the problem, I feel that I should, in the case that someone else comes on a problem like this. I don't guarantee that this is the most elegant solution. It probably isn't.
I downloaded the data from FRED (link in answer) into a file treasury-1year-rates_1980-present.csv containing the data from the 1979-12-31 point to present (currently 2018-06-12). You need to get the data point for 1979-12-31 because 1980-01-01 is NA, since that is a federal holiday, being the New Year.
raw_tbill = pd.read_csv(path.join(base_dir, 'treasury-1year-rates_1980-present.csv'),
parse_dates=['DATE'], na_values=['.'])
raw_tbill.columns = [s.lower() for s in raw_tbill.columns.values.tolist()]
print(f'Loaded t-bill 1-year rates data, from 1980 to present, with {len(raw_tbill)} entries')
The FRED data uses . to represent missing data, hence the inclusion of na_values=['.']; we also want the date column parsed, hence the parse_dates parameter.
I happen to like to have everything in lower case. It's only kept here because I don't want to change all the following column names. That's a real pain.
Two misconceptions, or gotcha's, to get out of the way first.
Arithmetic means wrong. Arithmetic means are wrong for dealing with per cent data. You should be using geometric means. See this for more clarification. This creates the quarter by quarter data.
Data not actually daily. Anyway, this data isn't actually daily. To deal with that problem, and the fact that Treasury bills still pay on holidays and weekends, all of those weekends need to be filled in with forward propagated data. Otherwise, the geometric means will be wrong, since one of the geometric mean assumptions is that the data are evenly spaced in time (unless you weight them, which is effectively the same thing that I do here, but I did this because calculating weights takes time to think through. This doesn't).
# fill in days and put in the previous applicable figure
# need to deal with gaps in data
raw_tbill.set_index('date', inplace=True)
raw_tbill.dropna(inplace=True)
tbill_data = raw_tbill.reindex(pd.date_range(raw_tbill.index.min(), raw_tbill.index.max(), freq='D'),
method='ffill')
Years not complete. After completing this, I have years that aren't actually really filled in (for example, apparently 1979-12-31 is empty). They need to be removed for being useless.
# drop incomplete years
count = tbill_data.set_index([tbill_data.index.year, tbill_data.index.day]).count(level=0)
years = count[count['dgs1'] >= 365].index
tbill_data['tmp_remove'] = tbill_data.apply(lambda r : 0 if r.name.year in years else 1, axis=1)
tbill_data = tbill_data[tbill_data['tmp_remove'] == 0].drop('tmp_remove', axis=1)
From here, if you're following the code, the index is now DatetimeIndex. Thus, there is no date column.
Get quarter indices and calculate. Now, technically, you don't need to do this step. It's in my code because I have to produce it. In this processing path, you have to do it, however, just to get the indices for each quarter. Otherwise, no quarters, no cigar.
Also, the DGS1 data is in per cent; if you're doing anything with it, you probably want it as a proportion, i.e. 100 pc = 1.
# turn the daily tbill data into quarterly data
# use geometric means
tbill_data['dgs1'] = tbill_data['dgs1'] / 100
tbill_qtrly = tbill_data.resample('Q').apply(lambda x: gmean(x).item())
Anyway I then define a function to calculate the year to date, which also uses geometric means for this. This then subsets the relevant data to the date. I believe that year to date includes the report date, justifying <=. If it doesn't actually do that, comment.
def calculate_ytd(row):
    year = row.name.year
    year_data = tbill_data[tbill_data.index.year == year]
    applicable_data = year_data[year_data.index <= row.name]
    return gmean(applicable_data['dgs1'])
tbill_qtrly['dgs1_ytd'] = tbill_qtrly.apply(lambda r : calculate_ytd(r), axis=1)
Application of that function produces the data.
Post-script. One could also use the quarterly geometric means as a basis for calculation, if all input variables are positive, since, for example,
(a · b · c · d · e)^(1/5) = ( ((a · b · c)^(1/3))^3 · ((d · e)^(1/2))^2 )^(1/5)
where all the variables a through e are positive.
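A quick numeric check of that identity, with made-up rates purely for illustration:
import numpy as np
from scipy.stats.mstats import gmean

q1 = [0.12, 0.13, 0.14]   # hypothetical daily rates for the first part of the year
q2 = [0.15, 0.16]         # hypothetical daily rates for the next quarter
all_days = q1 + q2

# the year-to-date geometric mean equals the count-weighted combination
# of the sub-period geometric means
combined = (gmean(q1) ** len(q1) * gmean(q2) ** len(q2)) ** (1 / len(all_days))
assert np.isclose(gmean(all_days), combined)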

Convert daily pandas stock data to monthly data using first trade day of the month

I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:
>>> type(data_dy)
<class 'pandas.core.frame.DataFrame'>
>>> data_dy
Open High Low Close Volume Adj Close
Date
2012-12-28 140.64 141.42 139.87 140.03 148806700 134.63
2012-12-31 139.66 142.56 139.54 142.41 243935200 136.92
2013-01-02 145.11 146.15 144.73 146.06 192059000 140.43
2013-01-03 145.99 146.37 145.34 145.73 144761800 140.11
2013-01-04 145.97 146.61 145.67 146.37 116817700 140.72
[5 rows x 6 columns]
I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:
>>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
>>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
Volume Adj Close High Low Close Open
Date
2012-12-31 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-31 453638500 140.72 146.61 144.73 146.37 145.11
[2 rows x 6 columns]
This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.
So I'd like the answer set to be:
Volume Adj Close High Low Close Open
Date
2012-12-28 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-02 453638500 140.72 146.61 144.73 146.37 145.11
I could do this by converting the daily data to a Python list, processing the data, and returning it to a DataFrame, but how can this be done with pandas?
Instead of M you can pass MS as the resample rule:
df =pd.DataFrame( range(72), index = pd.date_range('1/1/2011', periods=72, freq='D'))
#df.resample('MS', how = 'mean') # pandas <0.18
df.resample('MS').mean() # pandas >= 0.18
Updated to use the first business day of the month respecting US Federal Holidays:
df =pd.DataFrame( range(200), index = pd.date_range('12/1/2012', periods=200, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df.resample(bmth_us).mean()
If you want custom month starts based on the minimum date found in the data for each month, try this (it isn't pretty, but it should work):
month_index = df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['level_0'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar=min_day_in_month_index)
Pass custom_month_starts as the first parameter of resample.
Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from their official website though as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:
#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web
#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')
#----- build custom calendar -----
month_index =df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['Open'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar = min_day_in_month_index)
#----- convert daily data to monthly data -----
ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)
This yielded the following:
>>> mthly_ohlcva
Volume Adj Close High Low Close Open
Date
2012-12-03 2889875900 136.92 145.58 139.54 142.41 142.80
2013-01-01 2587140200 143.92 150.94 144.73 149.70 145.11
2013-02-01 2581459300 145.76 153.28 148.73 151.61 150.65
2013-03-01 2330972300 151.30 156.85 150.41 156.67 151.09
2013-04-01 2907035000 154.20 159.72 153.55 159.68 156.59
2013-05-01 2781596000 157.84 169.07 158.10 163.45 159.33
2013-06-03 3533321800 155.74 165.99 155.73 160.42 163.83
2013-07-01 2330904500 163.78 169.86 160.22 168.71 161.26
2013-08-01 2283131700 158.87 170.97 163.05 163.65 169.99
2013-09-02 2226749600 163.90 173.60 163.70 168.01 165.23
2013-10-01 2901739000 171.49 177.51 164.53 175.79 168.14
2013-11-01 1930952900 176.57 181.75 174.76 181.00 176.02
2013-12-02 2232775900 181.15 184.69 177.32 184.69 181.09
In recent versions of pandas you can use the time offset alias 'BMS', which stands for "business month start frequency", or 'BM', which stands for "business month end frequency".
The code in the first case would look like
data_dy.resample('BMS', closed='right', label='right').apply(ohlc_dict)
or, in the second case,
data_dy.resample('BM', closed='right', label='right').apply(ohlc_dict)
