I am doing time series analysis on financial data and I want to perform a seasonal decomposition:
from statsmodels.tsa.seasonal import seasonal_decompose
import pandas as pd
import datetime
import pandas_datareader as data
df = data.get_data_yahoo('UGA', start=everSince, end=today)  # everSince and today are datetime bounds defined elsewhere
df_close = df[['Close']]
result = seasonal_decompose(df_close, model='multiplicative')
The error I get is:
You must specify a period or x must be a pandas object with a PeriodIndex or a DatetimeIndex with a freq not set to None
I know I can specify the frequency with df.asfreq(), but financial data do not have a daily frequency (i.e., I do not have an entry for every single day), since trading runs Monday to Friday and sometimes there are holidays.
How can I apply seasonal_decompose to this kind of data? I have also tried df_close.index = df_close.index.to_period('B'), but that did not work.
An example of the df is:
Close
Date
2008-02-28 49.790001
2008-02-29 49.610001
2008-03-03 49.810001
2008-03-04 47.450001
2008-03-05 49.049999
2008-03-06 49.369999
2008-03-07 50.230000
2008-03-10 50.610001
2008-03-11 50.700001
2008-03-12 50.919998
2008-03-13 49.939999
2008-03-14 50.049999
2008-03-17 46.869999
2008-03-18 48.980000
2008-03-19 47.540001
2008-03-20 48.070000
2008-03-24 48.459999
2008-03-25 49.490002
2008-03-26 50.320000
2008-03-27 50.110001
2008-03-28 50.009998
2008-03-31 48.509998
2008-04-01 48.840000
2008-04-02 51.130001
2008-04-03 50.419998
2008-04-04 50.900002
2008-04-07 51.430000
2008-04-08 50.959999
2008-04-09 51.290001
2008-04-10 51.540001
where the index is of type pandas.core.indexes.datetimes.DatetimeIndex.
Your issue can be solved by:
1. filling the missing date gaps in the dataframe (since you don't have daily data) and replacing the missing values with 0;
2. setting a period/frequency on the index so that the seasonality can be extracted:
# import libraries
import numpy as np
import pandas as pd
import datetime as dt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
print(sm.__version__)
# Generate some data
TODAY = dt.date.today()
ONE_WEEK = dt.timedelta(days=107)  # a 107-day span (despite the name), giving the 108 daily rows shown below
ONE_DAY = dt.timedelta(days=1)
# Create pandas dataframe
df = pd.DataFrame({'Date': [TODAY-ONE_WEEK, TODAY-3*ONE_DAY, TODAY], 'Close': [42, 45,127]})
# Date Close
#0 2021-09-02 42
#1 2021-12-15 45
#2 2021-12-18 127
# Fill the missing dates and the corresponding values with 0
r = pd.date_range(start=df.Date.min(), end=df.Date.max())
df = df.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index().dropna()
# Set period/frequency using set_index() dates
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').asfreq('D').dropna()
# Close
#Date
#2021-09-02 42.0
#2021-09-03 0.0
#2021-09-04 0.0
#2021-09-05 0.0
#2021-09-06 0.0
#... ...
#2021-12-14 0.0
#2021-12-15 45.0
#2021-12-16 0.0
#2021-12-17 0.0
#2021-12-18 127.0
# 108 rows × 1 columns
Finally, we can use seasonal_decompose() to decompose the time series into its components:
# inspect frequency attribute
print(df.index.freq) #<Day>
# Reproduce the example for OP and plot output
seasonal_decompose(df, model='additive').plot()
This outputs the usual four-panel decomposition figure (observed, trend, seasonal, residual); the plot itself is omitted here.
Note: the decomposition does not work with model='multiplicative' here, because the zero fill values trigger:
ValueError: Multiplicative seasonality is not appropriate for zero and negative values
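If you do want model='multiplicative' as in the original question, one workaround is to fill the gaps by carrying the last price forward instead of inserting zeros, so the series stays strictly positive. A minimal sketch, assuming the df_close frame of trading-day prices from the question and a statsmodels version recent enough to accept period=:
# reindex to business days and forward-fill the holiday gaps (no zeros introduced)
df_filled = df_close.asfreq('B').ffill()
# period=252 (roughly the number of trading days in a year) is an assumed choice for yearly seasonality
result = seasonal_decompose(df_filled, model='multiplicative', period=252)
result.plot()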
Related
I have a half-hourly dataframe with two columns. I would like to take all the half-hours of a day, do some calculation that returns one number, and assign that number to all half-hours of that day. Below is some example code:
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
print(df)
DATA1 DATA2
2003-01-01 08:30:00 NaN 79.990866
2003-01-01 09:00:00 NaN 5.461791
2003-01-01 09:30:00 NaN 68.892447
2003-01-01 10:00:00 NaN 44.823338
2003-01-01 10:30:00 NaN 57.860309
... ... ...
2003-01-04 22:00:00 0.394574 31.943657
2003-01-04 22:30:00 0.140950 78.275981
Then I would like to apply the following function, which returns one number:
def my_f(data1, data2):
    y = data1[data2 > 20]
    return np.median(y)
This function selects all data in DATA1 based on a condition (DATA2>20) then takes the median of all these data.
How can I create a third column (let's say result) and assign back this fixed number (y) for all half-hours data of that day?
My guess is I should use something like this:
daily_tmp = df.resample('D').apply(my_f)
df['results'] = daily_tmp.reindex(df.index, method='ffill')
If this approach is correct, how can I pass my_f with two arguments to resample.apply()?
Or is there any other way to do the similar task?
My solution assumes that you have a fairly small dataset. Please let me know if it is not the case.
I would decompose your goal as follows:
(1) group data by day
(2) for each day, compute some complicated function
(3) assign the resulting value to the half-hours of that day.
# specify the day for each datapoint
df['day'] = df.index.map(lambda x: x.strftime('%Y-%m-%d'))
# compute a complicated function for each day and store the result
mapping = {}
for day, data_for_the_day in df.groupby(by='day'):
    # assign to mapping[day] the result of a complicated function
    # (np.mean here is a stand-in; the OP's my_f uses np.median)
    mapping[day] = np.mean(data_for_the_day[data_for_the_day['DATA2'] > 20]['DATA1'])
# assign the values back to every half-hour of the corresponding day
df['result'] = df.index.map(lambda x: mapping.get(x.strftime('%Y-%m-%d'), np.nan))
That's not the neatest solution, but it is straightforward, easy to understand, and works well on small datasets.
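A more compact variant of the same idea is to group with pd.Grouper and forward-fill the daily result onto the half-hourly index, much as the OP guessed. A sketch, assuming the DATA1/DATA2 frame from the question:
import numpy as np
import pandas as pd
def daily_value(g):
    # median of DATA1 where DATA2 > 20, computed over one day's rows
    y = g['DATA1'][g['DATA2'] > 20]
    return np.median(y) if len(y) else np.nan
daily = df.groupby(pd.Grouper(freq='D')).apply(daily_value)
df['result'] = daily.reindex(df.index, method='ffill')
Here reindex with method='ffill' pushes each day's single value onto every half-hour timestamp of that day.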
Here is a fast way to do it.
First, import the libraries:
import time
import pandas as pd
import numpy as np
import datetime as dt
Second, the code to achieve it:
%%time
dates = pd.date_range("2003-01-01 08:30:00","2003-01-05",freq="30min")
data = np.transpose(np.array([np.random.rand(dates.shape[0]),np.random.rand(dates.shape[0])*100]))
data[0:50,0]=np.nan # my actual dataframe includes nan
df = pd.DataFrame(data = data,index =dates,columns=["DATA1","DATA2"])
#### Create a unique marker per hour
df['Date'] = df.index
df['Date'] = df['Date'].dt.strftime(date_format='%Y-%m-%d %H')
#### Then stipulate some conditions
_condition_1 = df.Date == df.Date.shift(-1) # first half of a full hour
_condition_2 = df.DATA2 > 20 # yours
_condition_3 = df.Date == df.Date.shift(1) # second half of the hour
#### Now, report the median of the two half-hours where conditions 1 and 2 are fulfilled
df['result'] = np.where(_condition_1 & _condition_2, (df.DATA1 + df.DATA1.shift(-1)) / 2, 0)
#### Fill the second half-hour with that median
df['result'] = np.where(_condition_3, df.result.shift(1), df.result)
#### Drop useless column
df = df.drop(['Date'],axis=1)
df[df.DATA2>20].tail(20)
Third, the output: the last 20 rows where DATA2 > 20, with the new result column (screenshot omitted).
I'm trying to find the dates when the Wipro close price was at its maximum in each year (what date and what price?). Here's an example of some code I've tried:
import pandas as pd
import numpy as np
from nsepy import get_history
import datetime as dt
start = dt.datetime(2015, 1, 1)
end = dt.datetime.today()
wipro=get_history(symbol='WIPRO', start = start, end = end)
wipro.index = pd.to_datetime(wipro.index)
# This should get me my grouped results
wipro_agg = wipro.groupby(wipro.index.year).Close.idxmax()
Solving this problem requires two steps: first, get the max price each year; then, find the exact date of that instance.
# Find the max price each year
# note: double brackets keep the result as a dataframe
wipro_max_yr = wipro.groupby(wipro.index.year)[['Close']].max()
# Now, do an inner join on Close to recover the exact dates
wipro_max_dates = wipro_max_yr.merge(wipro.reset_index(), how='inner')
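Alternatively, a compact sketch that reuses the idxmax result from the question to pull the date and the price in one step:
# index the original frame by the per-year argmax dates
wipro.loc[wipro.groupby(wipro.index.year).Close.idxmax(), ['Close']]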
You can simply call "max" the same way you called "idxmax"
In [25]: df_ids = pd.DataFrame(wipro.groupby(wipro.index.year).Close.idxmax())
In [26]: df_ids['price'] = wipro.groupby(wipro.index.year).Close.max()
In [27]: df_ids.rename({'Close': 'date'}, axis= 1).set_index('date')
Out[27]:
price
date
2015-03-03 672.45
2016-04-20 601.25
2017-06-06 560.55
2018-12-19 340.70
2019-02-26 387.65
I have a set of calculated OHLCVA daily securities data in a pandas dataframe like this:
>>> type(data_dy)
<class 'pandas.core.frame.DataFrame'>
>>> data_dy
Open High Low Close Volume Adj Close
Date
2012-12-28 140.64 141.42 139.87 140.03 148806700 134.63
2012-12-31 139.66 142.56 139.54 142.41 243935200 136.92
2013-01-02 145.11 146.15 144.73 146.06 192059000 140.43
2013-01-03 145.99 146.37 145.34 145.73 144761800 140.11
2013-01-04 145.97 146.61 145.67 146.37 116817700 140.72
[5 rows x 6 columns]
I'm using the following dictionary and the pandas resample function to convert the dataframe to monthly data:
>>> ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
>>> data_dy.resample('M', how=ohlc_dict, closed='right', label='right')
Volume Adj Close High Low Close Open
Date
2012-12-31 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-31 453638500 140.72 146.61 144.73 146.37 145.11
[2 rows x 6 columns]
This does the calculations correctly, but I'd like to use the Yahoo! date convention for monthly data of using the first trading day of the period rather than the last calendar day of the period that pandas uses.
So I'd like the answer set to be:
Volume Adj Close High Low Close Open
Date
2012-12-28 392741900 136.92 142.56 139.54 142.41 140.64
2013-01-02 453638500 140.72 146.61 144.73 146.37 145.11
I could do this by converting the daily data to a Python list, processing the data, and returning it to a dataframe, but how can this be done with pandas?
Instead of M you can pass MS as the resample rule:
df =pd.DataFrame( range(72), index = pd.date_range('1/1/2011', periods=72, freq='D'))
#df.resample('MS', how = 'mean') # pandas <0.18
df.resample('MS').mean() # pandas >= 0.18
Updated to use the first business day of the month respecting US Federal Holidays:
df =pd.DataFrame( range(200), index = pd.date_range('12/1/2012', periods=200, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df.resample(bmth_us).mean()
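Applied to the OHLC aggregation from the question, that would look roughly like this (a sketch reusing the data_dy frame and the ohlc_dict defined there):
data_dy.resample(bmth_us).agg(ohlc_dict)  # monthly bars labelled with the first US business day of each month
In recent pandas, .agg(ohlc_dict) replaces the deprecated how= keyword used in the question.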
If you want custom starts of the month based on the earliest date found in the data for each month, try this (it isn't pretty, but it should work):
month_index = df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['level_0'].min())
custom_month_starts = CustomBusinessMonthBegin(calendar=min_day_in_month_index)
Pass custom_month_starts as the first parameter of resample.
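That is, roughly (a sketch continuing from the three lines above):
df.resample(custom_month_starts).mean()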
Thank you J Bradley, your solution worked perfectly. I did have to upgrade my version of pandas from the official website, though, as the version installed via pip did not have CustomBusinessMonthBegin in pandas.tseries.offsets. My final code was:
#----- imports -----
import pandas as pd
from pandas.tseries.offsets import CustomBusinessMonthBegin
import pandas.io.data as web
#----- get sample data -----
df = web.get_data_yahoo('SPY', '2012-12-01', '2013-12-31')
#----- build custom calendar -----
month_index = df.index.to_period('M')
min_day_in_month_index = pd.to_datetime(df.set_index(month_index, append=True).reset_index(level=0).groupby(level=0)['Date'].min())  # earliest trading day in each month
custom_month_starts = CustomBusinessMonthBegin(calendar=min_day_in_month_index)
#----- convert daily data to monthly data -----
ohlc_dict = {'Open':'first','High':'max','Low':'min','Close': 'last','Volume': 'sum','Adj Close': 'last'}
mthly_ohlcva = df.resample(custom_month_starts, how=ohlc_dict)
This yielded the following:
>>> mthly_ohlcva
Volume Adj Close High Low Close Open
Date
2012-12-03 2889875900 136.92 145.58 139.54 142.41 142.80
2013-01-01 2587140200 143.92 150.94 144.73 149.70 145.11
2013-02-01 2581459300 145.76 153.28 148.73 151.61 150.65
2013-03-01 2330972300 151.30 156.85 150.41 156.67 151.09
2013-04-01 2907035000 154.20 159.72 153.55 159.68 156.59
2013-05-01 2781596000 157.84 169.07 158.10 163.45 159.33
2013-06-03 3533321800 155.74 165.99 155.73 160.42 163.83
2013-07-01 2330904500 163.78 169.86 160.22 168.71 161.26
2013-08-01 2283131700 158.87 170.97 163.05 163.65 169.99
2013-09-02 2226749600 163.90 173.60 163.70 168.01 165.23
2013-10-01 2901739000 171.49 177.51 164.53 175.79 168.14
2013-11-01 1930952900 176.57 181.75 174.76 181.00 176.02
2013-12-02 2232775900 181.15 184.69 177.32 184.69 181.09
I've seen that in recent versions of pandas you can use the time offset alias 'BMS', which stands for "business month start frequency", or 'BM', which stands for "business month end frequency".
The code in the first case would look like
data_dy.resample('BMS', closed='right', label='right').apply(ohlc_dict)
or, in the second case,
data_dy.resample('BM', closed='right', label='right').apply(ohlc_dict)
I have a df, self.meter_readings, where the index is datetime values and there is a column of numbers, as below:
self.meter_readings['PointProduction']
2012-03 7707.443
2012-04 9595.481
2012-05 5923.493
2012-06 4813.446
2012-07 5384.159
2012-08 4108.496
2012-09 6370.271
2012-10 8829.357
2012-11 7495.700
2012-12 13709.940
2013-01 6148.129
2013-02 7249.951
2013-03 6546.819
2013-04 7290.730
2013-05 5056.485
Freq: M, Name: PointProduction, dtype: float64
I want to get the gradient of PointProduction against time, i.e. y = PointProduction, x = time. I'm currently trying to obtain m using a linear regression:
m,c,r,x,y = stats.linregress(list(self.meter_readings.index),list(self.meter_readings['PointProduction']))
However I am getting an error:
raise TypeError(other)
This is seemingly because the x-axis values are timestamps rather than plain numbers.
How can I correct this?
You could try converting each entry of the monthly index to its integer ordinal: linregress should then work with your freq='M' index.
import pandas as pd
from scipy import stats
data = [
7707.443,
9595.481,
5923.493,
4813.446,
5384.159,
4108.496,
6370.271,
8829.357,
7495.700,
13709.940,
6148.129,
7249.951,
6546.819,
7290.730,
5056.485
]
period_index = pd.period_range(start='2012-03', periods=len(data), freq='M')
df = pd.DataFrame(data=data,
                  index=period_index,
                  columns=['PointProduction'])
# these ordinals are months since the start of the Unix epoch
df['ords'] = [tstamp.ordinal for tstamp in df.index]
m, c, r, x, y = stats.linregress(list(df.ords),
                                 list(df['PointProduction']))
Convert the datetime stamps on the x-axis to epoch time in seconds.
If the index entries are datetime objects you need to convert them to epoch time; for example, if ts is a datetime object, the following expression does the conversion:
ts_epoch = int(ts.strftime('%s'))
Here is an example piece of code that converts a datetime index into epoch seconds:
import pandas as pd
from datetime import datetime
import numpy as np
rng = pd.date_range('1/1/2011', periods=5, freq='H')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
t = ts.index
print [int(t[x].strftime('%s')) for x in range(len(t)) ]
This code works on Python 2.7 (note that strftime('%s') is a platform-specific format code and may not be available everywhere).
Applied to your problem, the solution could look like the following:
t = self.meter_readings.index
indexes = [int(t[x].strftime('%s')) for x in range(len(t)) ]
m,c,r,x,y = stats.linregress(indexes,list(self.meter_readings['PointProduction']))
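On Python 3, or where strftime('%s') is unavailable, a rough equivalent sketch is to build the x values with Timestamp.timestamp(); this assumes the index is a monthly PeriodIndex as the "Freq: M" output suggests (drop the to_timestamp() call if it is already a DatetimeIndex):
t = self.meter_readings.index.to_timestamp()  # PeriodIndex -> DatetimeIndex
indexes = [ts.timestamp() for ts in t]        # epoch seconds as floats
m, c, r, x, y = stats.linregress(indexes, list(self.meter_readings['PointProduction']))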
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I have come up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
The actual problem now is that the data may not start at 00:30 of a day but at any time of day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values that are necessary to calculate daily sums? (e.g., if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
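If you also need the at-least-40-values rule from the question, the groupby sum accepts a min_count argument in recent pandas (0.22+), so a sketch would be:
d = s.groupby(pd.Grouper(freq='1D')).sum(min_count=40)  # days with fewer than 40 non-NaN values come back as NaN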