import datetime
import pandas as pd
import pandas.io.data

sp = pd.io.data.get_data_yahoo('^IXIC',
                               start=datetime.datetime(1972, 1, 3),
                               end=datetime.datetime(2010, 1, 3))
I have used the example above, but that just pulls daily data into a DataFrame when I would like to pull weekly data. get_data_yahoo doesn't seem to have a parameter for choosing between daily, weekly or monthly, like the options made available on Yahoo itself. Are there any other packages or ideas you know of that might facilitate this?
You can downsample using the asfreq method:
sp = sp.asfreq('W-FRI', method='pad')
The pad method will propagate the last valid observation forward.
Using resample (as @tshauck has shown) is another possibility.
Use asfreq if you want to guarantee that the values in your downsampled data are values found in the original data set. Use resample if you wish to aggregate groups of rows from the original data set (for example, by taking a mean). reindex might introduce NaN values if the original data set lacks a value on a date specified by the reindex, though (as @behzad.nouri points out) you could use method='pad' to propagate last observations there as well.
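A minimal sketch contrasting the three approaches on a toy daily series (the variable names here are illustrative, not from the question):
import pandas as pd

# Toy daily series standing in for the downloaded price data.
daily = pd.Series(range(10), index=pd.date_range('2010-01-01', periods=10))

# asfreq: pick existing values at the new frequency, padding gaps.
weekly_asfreq = daily.asfreq('W-FRI', method='pad')

# resample: aggregate all rows falling into each weekly bucket.
weekly_mean = daily.resample('W-FRI').mean()

# reindex: align to arbitrary dates, forward-filling missing ones.
weekly_reindex = daily.reindex(pd.date_range(daily.index.min(),
                                             daily.index.max(),
                                             freq='W-FRI'),
                               method='pad')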
If you check the latest pandas source code on GitHub, you will see that an interval parameter is included in the latest master branch. You can manually patch your local copy by overwriting the same data.py under your site-packages/pandas/io folder.
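With a patched copy (or a release that includes the change), the call would look something like the sketch below; the accepted interval values ('d' daily, 'w' weekly, 'm' monthly) are an assumption based on that master-branch code:
import datetime
import pandas.io.data as web

# Assumes a pandas build where get_data_yahoo accepts `interval`.
sp = web.get_data_yahoo('^IXIC',
                        start=datetime.datetime(1972, 1, 3),
                        end=datetime.datetime(2010, 1, 3),
                        interval='w')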
You can always reindex to your desired frequency:
sp.reindex(pd.date_range(start=sp.index.min(),
                         end=sp.index.max(),
                         freq='W-WED'))  # weekly, Wednesdays
Edit: you may add method='ffill' to forward-fill the NaN values:
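sp.reindex(pd.date_range(start=sp.index.min(),
                         end=sp.index.max(),
                         freq='W-WED'),
           method='ffill')  # forward-fill missing Wednesdays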
As a suggestion, take Wednesdays, because those tend to have the fewest missing values (i.e. fewer NYSE holidays fall on Wednesdays). I think Yahoo's weekly data gives the stock price as of each Monday, which is the worst weekday choice based on S&P data from 2000 onwards:
import datetime as dt
import pandas.io.data as web

sp = web.DataReader("^GSPC", "yahoo", start=dt.date(2000, 1, 1))
weekday = {0: 'MON', 1: 'TUE', 2: 'WED', 3: 'THU', 4: 'FRI'}
sp['weekday'] = list(map(weekday.get, sp.index.dayofweek))
sp.weekday.value_counts()
output:
WED 722
TUE 717
THU 707
FRI 705
MON 659
One option would be to mask on the day of week you want.
sp[sp.index.dayofweek == 0]
Another option would be to resample.
sp.resample('W').mean()
That's how I convert daily to weekly price data:
import datetime
import pandas as pd
import pandas_datareader.data as web
start = datetime.datetime(1972, 1, 3)
end = datetime.datetime(2010, 1, 3)
stock_d = web.DataReader('^IXIC', 'yahoo', start, end)
def week_open(array_like):
    return array_like.iloc[0]

def week_close(array_like):
    return array_like.iloc[-1]

# .agg replaces the how= dict, and shifting the index back 6 days
# replaces loffset; both were removed from resample in recent pandas.
stock_w = stock_d.resample('W').agg({'Open': week_open,
                                     'High': 'max',
                                     'Low': 'min',
                                     'Close': week_close,
                                     'Volume': 'sum'})
stock_w.index = stock_w.index - pd.Timedelta(days=6)  # label weeks by Monday
stock_w = stock_w[['Open', 'High', 'Low', 'Close', 'Volume']]
more info:
https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#yahoo-finance
https://gist.github.com/prithwi/339f87bf9c3c37bb3188
Related
I have about two years of monthly gas usage for a city and want to generate daily usage from it, such that the daily values sum to the monthly totals while keeping the time-series shape, but I don't know how to do that.
Here is my data: Link [1]
The following code sample demonstrates date and data interpolation using pandas.
The following steps are taken:
Read the provided dataset into a DataFrame.
Calculate a cumulative sum of the usage data.
Set the DataFrame's index to the date, to facilitate resampling.
Resample the dates to a daily frequency.
Calculate the daily usage.
Example code:
import pandas as pd

# Read the CSV and convert dates to datetime objects.
path = '~/Downloads/usage.csv'
df = pd.read_csv(path,
                 header=0,
                 names=['date', 'gas_usage'],
                 converters={'date': pd.to_datetime})
# Calculate a cumulative sum to be interpolated.
df['gas_usage_c'] = df['gas_usage'].cumsum()
# Move the date to the index, for resampling.
df.set_index('date', inplace=True)
# Resample the data to a daily ('D') frequency.
df2 = df.resample('D').interpolate('time')
# Calculate the daily usage.
df2['daily_usage'] = df2['gas_usage_c'].diff()
Sample output of df2:
gas_usage gas_usage_c daily_usage
date
2016-03-20 3.989903e+07 3.989903e+07 NaN
2016-03-21 3.932781e+07 4.061487e+07 7.158445e+05
2016-03-22 3.875659e+07 4.133072e+07 7.158445e+05
...                  ...           ...           ...
2018-02-18 4.899380e+07 7.967041e+08 1.598856e+06
2018-02-19 4.847973e+07 7.983029e+08 1.598856e+06
2018-02-20 4.796567e+07 7.999018e+08 1.598856e+06
[703 rows x 3 columns]
Visual confirmation
I've included two simple graphs to illustrate the dataset alignment and interpolation.
Plotting code:
For completeness, the rough plotting code is included below.
from plotly.offline import plot

plot({'data': [{'x': df.index,
                'y': df['gas_usage'],
                'type': 'bar'}],
      'layout': {'title': 'Original',
                 'template': 'plotly_dark'}})

plot({'data': [{'x': df2.index,
                'y': df2['daily_usage'],
                'type': 'bar'}],
      'layout': {'title': 'Interpolated',
                 'template': 'plotly_dark'}})
I have a dataset with 10 years of data from 2000 to 2010, with the initial datetime on 2000-01-01 and the data resampled to daily frequency. I also have a weekly counter so that, when I slice, I only ask for week 5 to week 21 (February 1 to May 30).
I am a little stuck on how to slice it every year: does it involve a loop, or is there a time-series function in Python that will know to slice a specific period in every year? Below is the code I have so far; I had a for loop that was supposed to slice(5, 21), but that didn't work.
Any suggestions on how I might get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (f.e. mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
From the daily stock price data, I want to sample and select the end-of-month price. I am accomplishing this using the following code.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin=end-pd.DateOffset(365*2)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
mon_data = pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x.iloc[-1])).set_index(data.index)
The line above selects the end-of-month data, and here is the output.
If I want to select penultimate value of the month, I can do it using the following code.
mon_data = pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x.iloc[-2]))
Here is the output.
However, the index shows the end-of-month date. When I choose the penultimate value of the month, I want the index to be 2015-12-30 instead of 2015-12-31.
Please suggest the way forward. I hope my question is clear.
I am not sure if there is a way to do it with resample, but you can get what you want using groupby and pd.Grouper (TimeGrouper in older pandas versions).
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin = end - pd.DateOffset(365*2)
st = begin.strftime('%Y-%m-%d')
ed = end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
data['Date'] = data.index
mon_data = (
    data[['Date', 'Adj Close']]
    .groupby(pd.Grouper(freq='M')).nth(-2)
    .set_index('Date')
)
The simplest solution is to take the index of your newly created dataframe and subtract the number of days you want to go back:
n = 1
mon_data = pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x.iloc[-1-n]))
mon_data.index = mon_data.index - datetime.timedelta(days=n)
Also, seeing your data, I think you should resample not to 'month end' frequency but rather to 'business month end' frequency:
.resample('BM')
But even that won't cover it all, because, for instance, December 29, 2017 is a business month end, yet this date doesn't appear in your data (which ends on December 8, 2017). So you could add a small fix for that (assuming the original data is sorted by date):
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
So the full code will look like:
n = 1
mon_data = pd.DataFrame(data['Adj Close'].resample('BM').apply(lambda x: x.iloc[-1-n]))
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
mon_data.index = mon_data.index - datetime.timedelta(days=n)
BTW: your .set_index(data.index) throws an error because data and mon_data have different dimensions (mon_data is grouped by month).
Using python and pandas I am trying to download security price data from Yahoo Finance with the aim of ending up with the month-end adjusted price in a time series.
My code is shown below. I have used reindex to filter the DataFrame to produce a list of business month-end dates. This works for all but two dates in the time series: 31 May 2010 and 29 March 2013 both appear as blanks, which I think is because these are federal holidays in the US.
Rather than going down the route of creating a calendar of trading days, is it possible to create a custom frequency or calendar that simply looks for the month-end date and, if it is not available, checks the previous dates until it finds a value? For example, 31 March 2013 has no data, so check 30 March (no data), 29 March (no data), 28 March (data) -> display 28 March in the series.
import io
import requests
from datetime import datetime
import pandas
ticker = 'SPY'
start_date = '2009-12-31'
end_date = '2016-12-08'
s_dt = datetime.strptime(start_date, '%Y-%m-%d')
e_dt = datetime.strptime(end_date, '%Y-%m-%d')
url = 'http://chart.finance.yahoo.com/table.csv?s={0}&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g=d&ignore=.csv'
url = url.format(ticker, s_dt.month-1, s_dt.day, s_dt.year, e_dt.month-1, e_dt.day, e_dt.year)
data = requests.get(url).content
df = pandas.read_csv(io.StringIO(data.decode('utf-8')))
df.drop(columns=['Open', 'High', 'Low', 'Volume', 'Close'], inplace=True)
df.columns = ['date', ticker]
df['date'] = pandas.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
df = df.reindex(pandas.date_range(start=start_date, end=end_date, freq='BM'))
I figured out a way of achieving what I wanted by using the fillna method.
The last line of my original code should be replaced with:
# expand series to add all dates in date range
df = df.reindex(pandas.date_range(start=start_date, end=end_date, freq='D'))
# fill in the NaN values with the last available value
df = df.fillna(method='pad')
# reduce series to just business month-end dates
df = df.reindex(pandas.date_range(start=start_date, end=end_date, freq='BM'))
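Alternatively, pandas has a custom business-month-end offset that accepts a holiday calendar, which is close to the custom frequency asked about. A minimal sketch, assuming the built-in US federal holiday calendar is an acceptable stand-in for the exchange calendar (NYSE holidays such as Good Friday differ slightly):
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessMonthEnd

# Month-end dates that skip weekends and US federal holidays.
cbme = CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
month_ends = pd.date_range(start=start_date, end=end_date, freq=cbme)

# Pad back to the last available trading day
# (assumes df is sorted ascending by date).
df = df.reindex(month_ends, method='pad')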
You could get all the raw daily data with pandas_datareader (pip install it if you do not have it already)
Then you would just do
from pandas_datareader.data import DataReader
df = DataReader('SPY', 'yahoo', '2009-12-31', '2016-12-08')
You could skip the post-processing step and get the monthly data directly, but the interface for this was a bit more finicky; you would do:
from pandas_datareader.yahoo.daily import YahooDailyReader
df_monthly = YahooDailyReader('SPY', '2009-12-31', '2016-12-08', interval='m').read()
I've got a DataFrame storing daily-based data which is as below:
Date Open High Low Close Volume
2010-01-04 38.660000 39.299999 38.509998 39.279999 1293400
2010-01-05 39.389999 39.520000 39.029999 39.430000 1261400
2010-01-06 39.549999 40.700001 39.020000 40.250000 1879800
2010-01-07 40.090000 40.349998 39.910000 40.090000 836400
2010-01-08 40.139999 40.310001 39.720001 40.290001 654600
2010-01-11 40.209999 40.520000 40.040001 40.290001 963600
2010-01-12 40.160000 40.340000 39.279999 39.980000 1012800
2010-01-13 39.930000 40.669998 39.709999 40.560001 1773400
2010-01-14 40.490002 40.970001 40.189999 40.520000 1240600
2010-01-15 40.570000 40.939999 40.099998 40.450001 1244200
What I intend to do is to merge it into weekly-based data, where after grouping:
the Date should be every Monday (holidays need to be considered here: when Monday is not a trading day, the first trading day of the current week should be used as the Date).
Open should be Monday's (or the first trading day of the current week's) Open.
Close should be Friday's (or the last trading day of the current week's) Close.
High should be the highest High of the trading days in the current week.
Low should be the lowest Low of the trading days in the current week.
Volume should be the sum of the Volumes of all trading days in the current week.
which should look like this:
Date Open High Low Close Volume
2010-01-04 38.660000 40.700001 38.509998 40.290001 5925600
2010-01-11 40.209999 40.970001 39.279999 40.450001 6234600
Currently, my code snippet is as below. Which function should I use to map the daily-based data to the expected weekly-based data? Many thanks!
import datetime
import pandas_datareader.data as web

start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2016, 12, 31)
f = web.DataReader("MNST", "yahoo", start, end)
print(f)
You can resample (to weekly), offset (shift), and apply aggregation rules as follows:
logic = {'Open'  : 'first',
         'High'  : 'max',
         'Low'   : 'min',
         'Close' : 'last',
         'Volume': 'sum'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
to get:
Open High Low Close Volume
Date
2010-01-04 38.660000 40.700001 38.509998 40.290001 5925600
2010-01-11 40.209999 40.970001 39.279999 40.450001 6234600
In general, assuming that you have the dataframe in the form you specified, you need to do the following steps:
put Date in the index
resample the index.
What you have is a case of applying different functions to different columns; see the pandas documentation on aggregation.
You can resample in various ways, e.g. by taking the mean or the count of the values; check the pandas resample documentation.
You can also apply custom aggregators (check the same link).
With that in mind, the code snippet for your case can be given as:
f['Date'] = pd.to_datetime(f['Date'])
f.set_index('Date', inplace=True)
f.sort_index(inplace=True)
def take_first(array_like):
    return array_like.iloc[0]

def take_last(array_like):
    return array_like.iloc[-1]
output = f.resample('W',                                   # weekly resample
                    how={'Open': take_first,
                         'High': 'max',
                         'Low': 'min',
                         'Close': take_last,
                         'Volume': 'sum'},
                    loffset=pd.offsets.timedelta(days=-6)) # to put the labels on Monday
output = output[['Open', 'High', 'Low', 'Close', 'Volume']]
Here, 'W' signifies a weekly resampling, which by default spans Monday to Sunday. To keep the labels on Monday, loffset is used.
There are several predefined day specifiers; take a look at pandas offsets. You can even define custom offsets (see the same documentation).
Coming back to the resampling method: for Open and Close you can specify custom methods to take the first or last value, passing the function handle to the how argument.
This answer is based on the assumption that the data is daily, i.e. there is only one entry per day, and that no data is present for the non-business days (Saturday and Sunday), so taking the last data point of the week as Friday's is OK. If you want, you can use a business week instead of 'W'. Also, for more complex data you may want to use groupby to group the weekly data and then work on the time indices within each group.
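For instance, a minimal sketch of anchoring the weekly buckets on Friday instead, so each label is that week's Friday (shown with the modern .agg spelling, reusing the names defined above):
# Same aggregation, but weekly buckets anchored on Friday.
weekly_fri = f.resample('W-FRI').agg({'Open': take_first,
                                      'High': 'max',
                                      'Low': 'min',
                                      'Close': take_last,
                                      'Volume': 'sum'})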
BTW, a gist for the solution can be found at:
https://gist.github.com/prithwi/339f87bf9c3c37bb3188
I had the exact same question and found a great solution here.
https://www.techtrekking.com/how-to-convert-daily-time-series-data-into-weekly-and-monthly-using-pandas-and-python/
The weekly code is posted below.
import pandas as pd
import numpy as np
print('*** Program Started ***')
df = pd.read_csv('15-06-2016-TO-14-06-2018HDFCBANKALLN.csv')
# ensuring only equity series is considered
df = df.loc[df['Series'] == 'EQ']
# Converting date to pandas datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Getting week number
df['Week_Number'] = df['Date'].dt.isocalendar().week  # dt.week is deprecated
# Getting year. Week numbers repeat across years, so we need a unique index built from year and week number
df['Year'] = df['Date'].dt.year
# Grouping based on required values
df2 = df.groupby(['Year','Week_Number']).agg({'Open Price':'first', 'High Price':'max', 'Low Price':'min', 'Close Price':'last','Total Traded Quantity':'sum'})
# df3 = df.groupby(['Year','Week_Number']).agg({'Open Price':'first', 'High Price':'max', 'Low Price':'min', 'Close Price':'last','Total Traded Quantity':'sum','Average Price':'mean'})
df2.to_csv('Weekly_OHLC.csv')
print('*** Program ended ***')
Adding to @Stefan's answer with the recent pandas API, since loffset was deprecated in version 1.1.0 and later removed.
df = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
logic = {'Open'  : 'first',
         'High'  : 'max',
         'Low'   : 'min',
         'Close' : 'last',
         'Volume': 'sum'}
dfw = df.resample('W').apply(logic)
# set the index to the beginning of the week
dfw.index = dfw.index - pd.tseries.frequencies.to_offset("6D")
At first I used df.resample() as in the aforementioned answers, but it fills in NaN when a week is missing. Unhappy about that, after some research I used groupby() instead of resample(). Thanks for your sharing.
My original data is:
c date h l o
260 6014.78 20220321 6053.90 5984.79 6030.43
261 6052.59 20220322 6099.53 5995.22 6012.17
262 6040.86 20220323 6070.85 6008.26 6059.11
263 6003.05 20220324 6031.73 5987.40 6020.00
264 5931.33 20220325 6033.04 5928.72 6033.04
265 5946.98 20220328 5946.98 5830.93 5871.35
266 5900.04 20220329 5958.71 5894.82 5950.89
267 6003.05 20220330 6003.05 5913.08 5913.08
268 6033.04 20220331 6059.11 5978.27 5993.92
269 6126.91 20220401 6134.74 5975.66 6006.96
270 6149.08 20220406 6177.77 6106.05 6126.91
271 6134.74 20220407 6171.25 6091.71 6130.83
272 6151.69 20220408 6160.82 6096.93 6147.78
273 6095.62 20220411 6166.03 6072.15 6164.73
274 6184.28 20220412 6228.62 6049.99 6094.32
275 6119.09 20220413 6180.37 6117.79 6173.85
276 6188.20 20220414 6201.24 6132.13 6150.38
277 6173.85 20220415 6199.93 6137.35 6137.35
278 6124.31 20220418 6173.85 6108.66 6173.85
279 6065.63 20220419 6147.78 6042.16 6124.31
I don't care that the date is not Monday, so I didn't handle that. The code is:
data['Date'] = pd.to_datetime(data['date'], format="%Y%m%d")
# Refer to: https://www.techtrekking.com/how-to-convert-daily-time-series-data-into-weekly-and-monthly-using-pandas-and-python/
# and here: https://stackoverflow.com/a/60518425/5449346
# and this: https://github.com/pandas-dev/pandas/issues/11217#issuecomment-145253671
logic = {'o'   : 'first',
         'h'   : 'max',
         'l'   : 'min',
         'c'   : 'last',
         'Date': 'first',
         }
data = data.groupby([data['Date'].dt.year,
                     data['Date'].dt.isocalendar().week]).agg(logic)  # dt.week is deprecated
data.set_index('Date', inplace=True)
And the result: there is no NaN row for the week of 2022-01-31, which resample() would have produced:
l o h c
Date
2021-11-29 6284.68 6355.09 6421.59 6382.47
2021-12-06 6365.52 6372.04 6700.62 6593.70
2021-12-13 6445.06 6593.70 6690.19 6450.28
2021-12-20 6415.07 6437.24 6531.12 6463.31
2021-12-27 6463.31 6473.75 6794.50 6649.77
2022-01-04 6625.00 6649.77 7089.18 7055.27
2022-01-10 6804.93 7055.27 7181.75 6808.84
2022-01-17 6769.73 6776.25 7098.30 6919.67
2022-01-24 6692.80 6906.63 7048.76 6754.08
2022-02-07 6737.13 6811.45 7056.58 7023.98
2022-02-14 6815.36 7073.53 7086.57 6911.85
2022-02-21 6634.12 6880.56 6904.03 6668.02
2022-02-28 6452.88 6669.33 6671.93 6493.30
2022-03-07 5953.50 6463.31 6468.53 6228.62
2022-03-14 5817.90 6154.30 6205.15 6027.82
2022-03-21 5928.72 6030.43 6099.53 5931.33
2022-03-28 5830.93 5871.35 6134.74 6126.91
2022-04-06 6091.71 6126.91 6177.77 6151.69
2022-04-11 6049.99 6164.73 6228.62 6173.85
2022-04-18 6042.16 6173.85 6173.85 6065.63
Updated solution for 2022
import pandas as pd
from pandas.tseries.frequencies import to_offset
df = pd.read_csv('your_ticker.csv')
logic = {'<Open>' : 'first',
'<High>' : 'max',
'<Low>' : 'min',
'<Close>' : 'last',
'<Volume>': 'sum'}
df['<DTYYYYMMDD>'] = pd.to_datetime(df['<DTYYYYMMDD>'])
df = df.set_index('<DTYYYYMMDD>')
df = df.sort_index()
df = df.resample('W').apply(logic)
df.index = df.index - to_offset("6D")
Not a direct answer, but suppose the columns are the dates (a transpose of your table), with no missing dates.
'''Sum up daily results in df to weekly results in wdf.'''
wdf = pd.DataFrame(index=df.index)
for i in range(7, len(df.columns) + 1, 7):
    wdf['week' + str(i // 7)] = df[df.columns[i-7:i]].sum(axis=1)
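A minimal usage sketch with a made-up transposed frame (the field names and dates are illustrative):
import numpy as np
import pandas as pd

# Fields as rows, fourteen consecutive dates as columns.
dates = pd.date_range('2010-01-04', periods=14, freq='D')
df = pd.DataFrame(np.random.rand(2, 14),
                  index=['Close', 'Volume'],
                  columns=dates)

wdf = pd.DataFrame(index=df.index)
for i in range(7, len(df.columns) + 1, 7):
    wdf['week' + str(i // 7)] = df[df.columns[i-7:i]].sum(axis=1)

print(wdf)  # columns: week1, week2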