I've got a DataFrame storing daily data, as below:
Date Open High Low Close Volume
2010-01-04 38.660000 39.299999 38.509998 39.279999 1293400
2010-01-05 39.389999 39.520000 39.029999 39.430000 1261400
2010-01-06 39.549999 40.700001 39.020000 40.250000 1879800
2010-01-07 40.090000 40.349998 39.910000 40.090000 836400
2010-01-08 40.139999 40.310001 39.720001 40.290001 654600
2010-01-11 40.209999 40.520000 40.040001 40.290001 963600
2010-01-12 40.160000 40.340000 39.279999 39.980000 1012800
2010-01-13 39.930000 40.669998 39.709999 40.560001 1773400
2010-01-14 40.490002 40.970001 40.189999 40.520000 1240600
2010-01-15 40.570000 40.939999 40.099998 40.450001 1244200
What I intend to do is to merge it into weekly data. After grouping:
the Date should be every Monday (holidays need to be considered here: when Monday is not a trading day, the first trading day of that week should be used as the Date).
Open should be Monday's (or the first trading day of the week's) Open.
Close should be Friday's (or the last trading day of the week's) Close.
High should be the highest High of the trading days in the week.
Low should be the lowest Low of the trading days in the week.
Volume should be the sum of the Volumes of the trading days in the week.
which should look like this:
Date Open High Low Close Volume
2010-01-04 38.660000 40.700001 38.509998 40.290001 5925600
2010-01-11 40.209999 40.970001 39.279999 40.450001 6234600
Currently, my code snippet is as below. Which function should I use to map the daily data to the expected weekly data? Many thanks!
import datetime

import pandas_datareader.data as web

start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2016, 12, 31)
f = web.DataReader("MNST", "yahoo", start, end, session=session)  # `session` assumed to be defined elsewhere
print(f)
You can resample (to weekly), offset (shift), and apply aggregation rules as follows:
logic = {'Open'  : 'first',
         'High'  : 'max',
         'Low'   : 'min',
         'Close' : 'last',
         'Volume': 'sum'}

offset = pd.offsets.timedelta(days=-6)  # older pandas; equivalently pd.Timedelta(days=-6)

f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
to get:
Open High Low Close Volume
Date
2010-01-04 38.660000 40.700001 38.509998 40.290001 5925600
2010-01-11 40.209999 40.970001 39.279999 40.450001 6234600
In general, assuming that you have the dataframe in the form you specified, you need to do the following steps:
put Date in the index
resample the index.
What you have is a case of applying different functions to different columns, which resample supports via a dict mapping column names to aggregations.
You can resample in various ways, e.g. you can take the mean of the values, or a count, and so on; check the pandas resample documentation.
You can also apply custom aggregators (covered in the same documentation).
With that in mind, the code snippet for your case can be given as:
f['Date'] = pd.to_datetime(f['Date'])
f.set_index('Date', inplace=True)
f.sort_index(inplace=True)

def take_first(array_like):
    return array_like[0]

def take_last(array_like):
    return array_like[-1]

# note: how= and loffset= only work in older pandas; see the updated answers below
output = f.resample('W',                    # weekly resample
                    how={'Open': take_first,
                         'High': 'max',
                         'Low': 'min',
                         'Close': take_last,
                         'Volume': 'sum'},
                    loffset=pd.offsets.timedelta(days=-6))  # to put the labels on Monday
output = output[['Open', 'High', 'Low', 'Close', 'Volume']]
Here, 'W' signifies a weekly resampling, which by default spans Monday to Sunday. To keep the labels on Monday, loffset is used.
There are several predefined day specifiers; take a look at pandas offsets. You can even define custom offsets.
Coming back to the resampling method: for Open and Close you can specify custom methods (taking the first value, and so on) and pass the function handle to the how argument.
This answer assumes the data is daily, i.e. there is only one entry per day, and that no data is present for non-business days (Saturday and Sunday), so taking the last data point of the week as Friday's is OK. If you so want, you can use a business week instead of 'W'. Also, for more complex data you may want to use groupby to group the weekly data and then work on the time indices within each group, as sketched below.
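For illustration only, a hedged sketch of that groupby route using pd.Grouper (my own, not part of the original answer; it assumes f has the DatetimeIndex set above):

logic = {'Open': 'first', 'High': 'max', 'Low': 'min',
         'Close': 'last', 'Volume': 'sum'}
# group rows into Monday-to-Sunday weeks, labelled by the Monday
weekly = f.groupby(pd.Grouper(freq='W-MON', label='left', closed='left')).agg(logic)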
btw a gist for the solution can be found at:
https://gist.github.com/prithwi/339f87bf9c3c37bb3188
I had the exact same question and found a great solution here.
https://www.techtrekking.com/how-to-convert-daily-time-series-data-into-weekly-and-monthly-using-pandas-and-python/
The weekly code is posted below.
import pandas as pd
import numpy as np

print('*** Program Started ***')

df = pd.read_csv('15-06-2016-TO-14-06-2018HDFCBANKALLN.csv')

# ensuring only the equity series is considered
df = df.loc[df['Series'] == 'EQ']

# converting Date to pandas datetime format
df['Date'] = pd.to_datetime(df['Date'])

# getting the week number (Series.dt.week is deprecated in newer pandas;
# use df['Date'].dt.isocalendar().week there instead)
df['Week_Number'] = df['Date'].dt.week

# getting the year; week numbers repeat across years, so we need year plus week number as a unique key
df['Year'] = df['Date'].dt.year

# grouping based on the required values
df2 = df.groupby(['Year', 'Week_Number']).agg({'Open Price': 'first',
                                               'High Price': 'max',
                                               'Low Price': 'min',
                                               'Close Price': 'last',
                                               'Total Traded Quantity': 'sum'})
# optionally also include an average price column:
# df3 = df.groupby(['Year', 'Week_Number']).agg({'Open Price': 'first', 'High Price': 'max', 'Low Price': 'min', 'Close Price': 'last', 'Total Traded Quantity': 'sum', 'Average Price': 'mean'})
df2.to_csv('Weekly_OHLC.csv')
print('*** Program ended ***')
Adding to @Stefan's answer with the recent pandas API, since loffset was deprecated in version 1.1.0 and later removed:
df = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
logic = {'Open'  : 'first',
         'High'  : 'max',
         'Low'   : 'min',
         'Close' : 'last',
         'Volume': 'sum'}
dfw = df.resample('W').apply(logic)
# set the index to the beginning of the week
dfw.index = dfw.index - pd.tseries.frequencies.to_offset("6D")
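Equivalently (a sketch of my own, not from the answer above), you can avoid the offset arithmetic by anchoring the weekly bins on Monday and labelling them with the left edge:

dfw = df.resample('W-MON', label='left', closed='left').apply(logic)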
At first I used df.resample() as in the aforementioned answers, but it fills in NaN rows when a week is missing. Unhappy about that, after some research I used groupby() instead of resample(). Thanks for your sharing.
My original data is:
c date h l o
260 6014.78 20220321 6053.90 5984.79 6030.43
261 6052.59 20220322 6099.53 5995.22 6012.17
262 6040.86 20220323 6070.85 6008.26 6059.11
263 6003.05 20220324 6031.73 5987.40 6020.00
264 5931.33 20220325 6033.04 5928.72 6033.04
265 5946.98 20220328 5946.98 5830.93 5871.35
266 5900.04 20220329 5958.71 5894.82 5950.89
267 6003.05 20220330 6003.05 5913.08 5913.08
268 6033.04 20220331 6059.11 5978.27 5993.92
269 6126.91 20220401 6134.74 5975.66 6006.96
270 6149.08 20220406 6177.77 6106.05 6126.91
271 6134.74 20220407 6171.25 6091.71 6130.83
272 6151.69 20220408 6160.82 6096.93 6147.78
273 6095.62 20220411 6166.03 6072.15 6164.73
274 6184.28 20220412 6228.62 6049.99 6094.32
275 6119.09 20220413 6180.37 6117.79 6173.85
276 6188.20 20220414 6201.24 6132.13 6150.38
277 6173.85 20220415 6199.93 6137.35 6137.35
278 6124.31 20220418 6173.85 6108.66 6173.85
279 6065.63 20220419 6147.78 6042.16 6124.31
I don't care that the date is not a Monday, so I didn't handle that. The code is:
data['Date'] = pd.to_datetime(data['date'], format="%Y%m%d")
# Refer to: https://www.techtrekking.com/how-to-convert-daily-time-series-data-into-weekly-and-monthly-using-pandas-and-python/
# and here: https://stackoverflow.com/a/60518425/5449346
# and this: https://github.com/pandas-dev/pandas/issues/11217#issuecomment-145253671
logic = {'o'   : 'first',
         'h'   : 'max',
         'l'   : 'min',
         'c'   : 'last',
         'Date': 'first',
         }
data = data.groupby([data['Date'].dt.year, data['Date'].dt.week]).agg(logic)
data.set_index('Date', inplace=True)
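Note that Series.dt.week is deprecated in newer pandas (and removed in 2.0); a hedged equivalent of the grouping above uses the ISO calendar fields:

iso = data['Date'].dt.isocalendar()   # DataFrame with 'year', 'week', 'day' columns
data = data.groupby([iso.year, iso.week]).agg(logic)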
And in the result there is no NaN row for the week of 2022-01-31, which resample() would have produced:
l o h c
Date
2021-11-29 6284.68 6355.09 6421.59 6382.47
2021-12-06 6365.52 6372.04 6700.62 6593.70
2021-12-13 6445.06 6593.70 6690.19 6450.28
2021-12-20 6415.07 6437.24 6531.12 6463.31
2021-12-27 6463.31 6473.75 6794.50 6649.77
2022-01-04 6625.00 6649.77 7089.18 7055.27
2022-01-10 6804.93 7055.27 7181.75 6808.84
2022-01-17 6769.73 6776.25 7098.30 6919.67
2022-01-24 6692.80 6906.63 7048.76 6754.08
2022-02-07 6737.13 6811.45 7056.58 7023.98
2022-02-14 6815.36 7073.53 7086.57 6911.85
2022-02-21 6634.12 6880.56 6904.03 6668.02
2022-02-28 6452.88 6669.33 6671.93 6493.30
2022-03-07 5953.50 6463.31 6468.53 6228.62
2022-03-14 5817.90 6154.30 6205.15 6027.82
2022-03-21 5928.72 6030.43 6099.53 5931.33
2022-03-28 5830.93 5871.35 6134.74 6126.91
2022-04-06 6091.71 6126.91 6177.77 6151.69
2022-04-11 6049.99 6164.73 6228.62 6173.85
2022-04-18 6042.16 6173.85 6173.85 6065.63
Updated solution for 2022
import pandas as pd
from pandas.tseries.frequencies import to_offset

df = pd.read_csv('your_ticker.csv')

logic = {'<Open>'  : 'first',
         '<High>'  : 'max',
         '<Low>'   : 'min',
         '<Close>' : 'last',
         '<Volume>': 'sum'}

df['<DTYYYYMMDD>'] = pd.to_datetime(df['<DTYYYYMMDD>'])
df = df.set_index('<DTYYYYMMDD>')
df = df.sort_index()
df = df.resample('W').apply(logic)
df.index = df.index - to_offset("6D")  # shift the labels back to the start of the week
Not a direct answer, but suppose the columns are the dates (transpose of your table), without missing dates.
# sum up daily results in df to weekly results in wdf
wdf = pd.DataFrame(index=df.index)
for i in range(len(df.columns)):
    if (i != 0) & (i % 7 == 0):
        wdf['week' + str(i // 7)] = df[df.columns[i - 7:i]].sum(axis=1)
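A vectorized sketch of the same idea (my own, hedged; note that unlike the loop it also keeps a trailing partial week):

import numpy as np

# group every 7 consecutive columns together and sum them
wdf = df.T.groupby(np.arange(len(df.columns)) // 7).sum().T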
Related
I am trying to forecast daily profit using time series analysis, but daily profit is not only recorded unevenly, some of the data is also missing.
Raw Data:
Date       Revenue
2020/1/19  10$
2020/1/20  7$
2020/1/25  14$
2020/1/29  18$
2020/2/1   12$
2020/2/2   17$
2020/2/9   28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. Not only that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onwards.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                     Revenue
2020/1/16 ~ 2020/1/21    17$
2020/1/22 ~ 2020/1/27    14$
2020/1/28 ~ 2020/2/2     47$
2020/2/3 ~ 2020/2/8      ? (to predict)
After applying this to a time series model, I would like to further predict the profit after 2020/2/9 ~.
This is my general idea, but as a beginner at Python using the pandas library, I have trouble executing it. Could you please help me aggregate the profit every 6 days so that the data looks like the above table?
The easiest way is to use the pandas resample function.
Provided you have an index of type DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, begining of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
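As a minimal, self-contained sketch of that call (toy data, my own, not the asker's full table):

import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2020/1/19', '2020/1/20', '2020/1/25']),
                   'Revenue': [10, 7, 14]})
# bins start at the first date by default; see the base/origin examples below to align them
print(df.set_index('Date').resample('6D').sum())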
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing with summary_df is just to format it the way you'd like, so that it clearly states the date range each row refers to.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # a '6d' window covers the label day plus the 5 days before it
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True, inplace=True)
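As a check against the question's table: the windows ending 2020/1/21, 2020/1/27 and 2020/2/2 sum to 17 (10 + 7), 14, and 47 (18 + 12 + 17) respectively.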
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
Goal:
Calculate the 50-day moving average for each day, based on the past 50 days. I can calculate the mean for the entire dataset, but I am trying to continuously calculate the mean based on the past 50 days... with it changing each day, of course!
import numpy as np
import pandas_datareader.data as pdr
import pandas as pd
# Define the instruments to download. We would like to see Apple, Microsoft and the S&P500 index.
ticker = ['AAPL']
#Define the data period that you would like
start_date = '2017-07-01'
end_date = '2019-02-08'
# Use pandas_datareader.data.DataReader to load the stock prices from Yahoo Finance.
df = pdr.DataReader(ticker, 'yahoo', start_date, end_date)
# Yahoo Finance gives 'High', 'Low', 'Open', 'Close', 'Volume', 'Adj Close'.
#Export Close Price, Volume, and Date from yahoo finance
CloseP = df['Close']
CloseP.head()
Volm = df['Volume']
Volm.head()
Date = df["Date"] = df.index
#create a table with Date, Close Price, and Volume
Table = pd.DataFrame(np.array(Date), columns = ['Date'])
Table['Close Price'] = np.array(CloseP)
Table['Volume'] = np.array(Volm)
print (Table)
#create a column that continuously calculates the 50 day MA
#This is what I can't get to work!
MA = np.mean(df['Close'])
Table['Moving Average'] = np.array(MA)
print (Table)
First of all, please don't use CamelCase to name your variables, as they look like class names otherwise.
Next, use merge() to join your data frames instead of that np.array approach:
>>> table = CloseP.merge(Volm, left_index=True, right_index=True)
>>> table.columns = ['close', 'volume'] # give names to columns
>>> table.head(10)
close volume
Date
2017-07-03 143.500000 14277800.0
2017-07-05 144.089996 21569600.0
2017-07-06 142.729996 24128800.0
2017-07-07 144.179993 19201700.0
2017-07-10 145.059998 21090600.0
2017-07-11 145.529999 19781800.0
2017-07-12 145.740005 24884500.0
2017-07-13 147.770004 25199400.0
2017-07-14 149.039993 20132100.0
2017-07-17 149.559998 23793500.0
Finally, use a combination of rolling(), mean() and dropna() to calculate the moving average:
>>> ma50 = table.rolling(window=50).mean().dropna()
>>> ma50.head(10)
close volume
Date
2017-09-12 155.075401 26092540.0
2017-09-13 155.398401 26705132.0
2017-09-14 155.682201 26748954.0
2017-09-15 156.025201 27248670.0
2017-09-18 156.315001 27430024.0
2017-09-19 156.588401 27424424.0
2017-09-20 156.799201 28087816.0
2017-09-21 156.952201 28340360.0
2017-09-22 157.034601 28769280.0
2017-09-25 157.064801 29254384.0
Please refer to the docs of the mentioned API calls for more info about their usage. Good luck!
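If the goal is the continuously updating column from the question, a hedged sketch building on the table above:

# NaN for the first 49 rows, then the trailing 50-day mean, recomputed each day
table['ma50'] = table['close'].rolling(window=50).mean()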
I have managed to convert my daily data to weekly data by looking at a previous answer, but I am setting date as the index. My goal is to keep 'Symbol' as the index and include 'Date' as a column.
I tried including 'Date' in the dictionary and Symbol as the index, but it is resulting in an error that the index needs to be Datetime.
This is my code:
if data_duration == 'w':
    df['Date'] = pd.to_datetime(df['Date'])
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)

    def take_first(array_like):
        return array_like[0]

    def take_last(array_like):
        return array_like[-1]

    output = df.resample('W',                    # weekly resample
                         how={'Open': take_first,
                              'High': 'max',
                              'Low': 'min',
                              'Close': take_last,
                              'Volume': 'sum'},
                         loffset=pd.offsets.timedelta(days=-6))  # to put the labels on Monday
    df = output[['Open', 'High', 'Low', 'Close', 'Volume']]
But I want to retain my index as 'Symbol', like it is in the daily data, while including date in 'Output'.
This is how the daily data looks like:
Date Close High Low Open Volume
Symbol
AAPL 2017-05-25 153.87 154.3500 153.0300 153.7300 19235598
AAPL 2017-05-26 153.61 154.2400 153.3100 154.0000 21927637
However, after the weekly formatting, everything remains the same, apart from 'Symbol'. How can I fix this?
You want unstack(): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.unstack.html
It will move one of the index levels to be a column in the DataFrame. Something like this:
df.unstack('Date')
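For the weekly aggregation itself while keeping 'Symbol' as the index, one hedged sketch (my assumption about the frame's shape, not part of the answer above) groups on both the symbol and a weekly Grouper:

logic = {'Open': 'first', 'High': 'max', 'Low': 'min',
         'Close': 'last', 'Volume': 'sum'}
weekly = (df.reset_index()
            .groupby(['Symbol', pd.Grouper(key='Date', freq='W')])
            .agg(logic)
            .reset_index(level='Date'))  # 'Date' back as a column, 'Symbol' stays in the index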
I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:
>>> type(data_dy)
<class 'pandas.core.frame.DataFrame'>
>>> data_dy
Open High Low Close Volume Adj Close
Date
2012-12-28 140.64 141.42 139.87 140.03 148806700 134.63
2012-12-31 139.66 142.56 139.54 142.41 243935200 136.92
2013-01-02 145.11 146.15 144.73 146.06 192059000 140.43
2013-01-03 145.99 146.37 145.34 145.73 144761800 140.11
2013-01-04 145.97 146.61 145.67 146.37 116817700 140.72
[5 rows x 6 columns]
I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:
>>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
>>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
Volume Adj Close High Low Close Open
Date
2012-12-31 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-31 453638500 140.72 146.61 144.73 146.37 145.11
[2 rows x 6 columns]
This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.
So I'd like the answer set to be:
Volume Adj Close High Low Close Open
Date
2012-12-28 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-02 453638500 140.72 146.61 144.73 146.37 145.11
I could do this by converting the daily data to a Python list, processing the data, and returning it to a dataframe, but how can this be done with pandas?
Instead of M you can pass MS as the resample rule:
df = pd.DataFrame(range(72), index=pd.date_range('1/1/2011', periods=72, freq='D'))

# df.resample('MS', how='mean')  # pandas < 0.18
df.resample('MS').mean()         # pandas >= 0.18
Updated to use the first business day of the month respecting US Federal Holidays:
df = pd.DataFrame(range(200), index=pd.date_range('12/1/2012', periods=200, freq='D'))

from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar

bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df.resample(bmth_us).mean()
If you want custom starts of the month, using the minimum day found in the data for each month, try this (it isn't pretty, but it should work):
month_index = df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True)
                                          .reset_index(level=0)
                                          .groupby(level=0)['level_0'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar=min_day_in_month_index)
Pass custom_month_starts as the first parameter of resample.
Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from their official website though as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:
#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web

#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')

#----- build custom calendar -----
month_index = df.index.to_period('M')
# note: ['Open'].min() takes each month's minimum price, not its earliest date;
# arguably this should be the minimum 'Date' per month, as in the answer above
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True)
                                          .reset_index(level=0)
                                          .groupby(level=0)['Open'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar=min_day_in_month_index)

#----- convert daily data to monthly data -----
ohlc_dict = {'Open': 'first', 'High': 'max', 'Low': 'min',
             'Close': 'last', 'Volume': 'sum', 'Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)
This yielded the following:
>>> mthly_ohlcva
Volume Adj Close High Low Close Open
Date
2012-12-03 2889875900 136.92 145.58 139.54 142.41 142.80
2013-01-01 2587140200 143.92 150.94 144.73 149.70 145.11
2013-02-01 2581459300 145.76 153.28 148.73 151.61 150.65
2013-03-01 2330972300 151.30 156.85 150.41 156.67 151.09
2013-04-01 2907035000 154.20 159.72 153.55 159.68 156.59
2013-05-01 2781596000 157.84 169.07 158.10 163.45 159.33
2013-06-03 3533321800 155.74 165.99 155.73 160.42 163.83
2013-07-01 2330904500 163.78 169.86 160.22 168.71 161.26
2013-08-01 2283131700 158.87 170.97 163.05 163.65 169.99
2013-09-02 2226749600 163.90 173.60 163.70 168.01 165.23
2013-10-01 2901739000 171.49 177.51 164.53 175.79 168.14
2013-11-01 1930952900 176.57 181.75 174.76 181.00 176.02
2013-12-02 2232775900 181.15 184.69 177.32 184.69 181.09
I've seen that in recent versions of pandas you can use the time offset alias 'BMS', which stands for "business month start frequency", or 'BM', which stands for "business month end frequency".
The code in the first case would look like
data_dy.resample('BMS', closed='right', label='right').apply(ohlc_dict)
or, in the second case,
data_dy.resample('BM', closed='right', label='right').apply(ohlc_dict)
import datetime

import pandas as pd
import pandas.io.data

sp = pd.io.data.get_data_yahoo('^IXIC',
                               start=datetime.datetime(1972, 1, 3),
                               end=datetime.datetime(2010, 1, 3))
I have used the above example, but that just pulls DAILY data into a dataframe, when I would like to pull weekly. It doesn't seem like get_data_yahoo has a parameter for selecting daily, weekly or monthly, like the options made available on Yahoo itself. Any other packages or ideas that you know of that might facilitate this?
You can downsample using the asfreq method:
sp = sp.asfreq('W-FRI', method='pad')
The pad method will propagate the last valid observation forward.
Using resample (as #tshauck has shown) is another possibility.
Use asfreq if you want to guarantee that the values in your downsample are values found in the original data set. Use resample if you wish to aggregate groups of rows from the original data set (for example, by taking a mean). reindex might introduce NaN values if the original data set does not have a value on the date specified by the reindex -- though (as #behzad.nouri points out) you could use method=pad to propagate last observations here as well.
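A small sketch of the difference (toy series, my own illustration, not from the answer):

import pandas as pd

s = pd.Series(range(10), index=pd.date_range('2020-01-01', periods=10, freq='D'))
print(s.asfreq('W-FRI', method='pad'))  # picks the value already present on each Friday (padding if absent)
print(s.resample('W-FRI').mean())       # aggregates each week's values into a mean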
If you check the latest pandas source code on GitHub, you will see that an interval param is included in the latest master branch. You can manually modify your local copy by overwriting the corresponding data.py under your site-packages/pandas/io folder.
You can always reindex to your desired frequency:
sp.reindex(pd.date_range(start=sp.index.min(),
                         end=sp.index.max(),
                         freq='W-WED'))  # weekly, Wednesdays
Edit: you may add method='ffill' to forward-fill NaN values.
As a suggestion, take Wednesdays, because they tend to have the fewest missing values (i.e. fewer NYSE holidays fall on Wednesdays). I think Yahoo weekly data gives the stock price for each Monday, which is the worst weekly frequency, based on S&P data from 2000 onwards:
import datetime as dt

import pandas.io.data as web

sp = web.DataReader("^GSPC", "yahoo", start=dt.date(2000, 1, 1))

weekday = {0: 'MON', 1: 'TUE', 2: 'WED', 3: 'THU', 4: 'FRI'}
sp['weekday'] = list(map(weekday.get, sp.index.dayofweek))
sp.weekday.value_counts()
output:
WED 722
TUE 717
THU 707
FRI 705
MON 659
One option would be to mask on the day of week you want.
sp[sp.index.dayofweek == 0]
Another option would be to resample.
sp.resample('W', how='mean')  # older pandas; in modern pandas: sp.resample('W').mean()
That's how I convert daily to weekly price data:
import datetime
import pandas as pd
import pandas_datareader.data as web
start = datetime.datetime(1972, 1, 3)
end = datetime.datetime(2010, 1, 3)
stock_d = web.DataReader('^IXIC', 'yahoo', start, end)
def week_open(array_like):
    return array_like[0]

def week_close(array_like):
    return array_like[-1]

stock_w = stock_d.resample('W',
                           how={'Open': week_open,
                                'High': 'max',
                                'Low': 'min',
                                'Close': week_close,
                                'Volume': 'sum'},
                           loffset=pd.offsets.timedelta(days=-6))
stock_w = stock_w[['Open', 'High', 'Low', 'Close', 'Volume']]
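In modern pandas, where how= and loffset= are gone, a hedged equivalent of the above would be:

stock_w = stock_d.resample('W-MON', label='left', closed='left').agg(
    {'Open': 'first', 'High': 'max', 'Low': 'min', 'Close': 'last', 'Volume': 'sum'})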
more info:
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#yahoo-finance
https://gist.github.com/prithwi/339f87bf9c3c37bb3188