I have a data set with records at 30-second intervals, and I am trying to forecast it using the ARMA function from the statsmodels time series module. Because of data privacy, I have used random data to reproduce the error:
import numpy as np
from pandas import *
import statsmodels.api as sm
data = np.random.rand(100000)
data_index = date_range('2013-5-26', periods = len(data), freq='30s')
data = np.array(data)
data_series = Series(data, index = data_index)
model = sm.tsa.ARMA(data_series,(1,0)).fit()
My package versions:
Python version 2.7.3
pandas version 0.11.0
statsmodels version 0.5.0
The main error message is as follows (I omitted some of it):
ValueError Traceback (most recent call last)
<ipython-input-24-0f57c74f0fc9> in <module>()
6 data = np.array(data)
7 data_series = Series(data, index = data_index)
----> 8 model = sm.tsa.ARMA(data_series,(1,0)).fit()
...........
...........
ValueError: freq 30S not understood
It seems that ARMA does not support the date index generated by pandas? If I remove the freq option from date_range, the command again fails for large series, since the year goes well beyond the pandas limit.
Is there any way to get around this? Thanks
Update:
OK, using data_series.values works, but then how do I do the prediction? My data_index runs from 2013-05-26 00:00:00 to 2013-06-29 17:19:30, and
prediction = model.predict('2013-05-26 00:00:00', '2013-06-29 17:19:30', dynamic=False)
still gives me an error.
I know that prediction = model.predict() runs and generates the prediction for the whole sequence, which I can then match up with the dates, but overall that is not very convenient.
The problem is that this freq doesn't give back an offset from pandas for some reason, and we need an offset to be able to use the dates for anything. It looks like a pandas bug/not implemented to me.
from pandas.tseries.frequencies import get_offset
get_offset('30s')
Perhaps we could improve the error message though.
[Edit: We don't really need the dates except for adding them back in for convenience in prediction, so you can still estimate the model by using data_series.values.]
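A minimal sketch of that idea, under loud assumptions: it estimates on the raw values only and then reattaches the original dates to the in-sample predictions, so that they can be sliced by timestamp. The AR(1) here is a plain least-squares stand-in written with numpy, not the statsmodels ARMA estimator, and the series length of 1000 is arbitrary.

```python
import numpy as np
import pandas as pd

# Random stand-in data on a 30-second grid, as in the question.
rng = np.random.default_rng(0)
index = pd.date_range("2013-05-26", periods=1000, freq="30s")
series = pd.Series(rng.random(1000), index=index)

# Stand-in for the ARMA(1, 0) fit: least-squares AR(1) on the values only.
y = series.values
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
const, phi = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# One-step-ahead in-sample predictions; the first observation has no lag,
# so the predictions line up with the index from the second timestamp on.
fitted = const + phi * y[:-1]
predicted = pd.Series(fitted, index=series.index[1:])

# With the dates added back, predictions can be selected by timestamp.
window = predicted.loc["2013-05-26 00:05:00":"2013-05-26 00:10:00"]
print(window.head())
```

The same reindexing step should apply to whatever array a fitted model returns, as long as its length and starting position relative to the original index are known.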
I'm currently facing an issue with GARCH modelling in Python. I came across a data scale issue, where y is poorly scaled. I would really appreciate an explanation of the warning and perhaps a fix. The GARCH model still runs, but with warning messages saying that convergence was not successful.
[Image: error message]
This is the data which I used for my y-values. It is taken from the residuals of an ARIMA model which I fitted.
[Image: Y-dataset]
GARCH output: [Images: fit output, GARCH output]
Update: after setting rescale=False: [Image: inequality constraints incompatible]
Minimal reproducible example
import pandas as pd
import pandas_datareader.data as pdr
import numpy as np
import datetime
import arch
from statsmodels.tsa.arima.model import ARIMA
#Extract Data, create column log returns
eurusd = pdr.DataReader('DEXUSEU', 'fred', start='1/1/2010', end='31/12/2019')
eurusd.index = pd.DatetimeIndex(eurusd.index).to_period('D')
eurusd = eurusd.to_timestamp()
eurusd['LR'] = np.log(eurusd) - np.log(eurusd.shift(1))
# ARIMA model
arima_model = ARIMA(eurusd.LR.dropna(), order=(1,0,1)).fit()
print(arima_model.summary())
# GARCH model
am = arch.arch_model(arima_model.resid)
res = am.fit()
print(res.summary())
I am creating Accumulated Local Effect (ALE) plots using Python's PyALE package. I am using a RandomForestRegressor to build the model.
I can create 1D ALE plots. However, I get a ValueError when I try to create a 2D ALE plot using the same model and training data.
Here is my code.
ale(training_data, model=model1, feature=["feature1", "feature2"])
I can plot a 1D ALE plot for feature1 and feature2 with the following code.
ale(training_data, model=model1, feature=["feature1"], feature_type="continuous")
ale(training_data, model=model1, feature=["feature2"], feature_type="continuous")
There are no missing or infinite values for any column in the data frame.
I am getting the following error with the 2D ALE plot command.
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
This is a link to the function https://pypi.org/project/PyALE/#description
I am not sure why I am getting this error. I would appreciate some help on this.
Thank you,
Rohin
This issue was addressed in release v1.1.2 of the PyALE package. For those using earlier versions, the workaround mentioned in the issue thread on GitHub is to reset the index of the dataset passed to the function ale. For completeness, here is code that reproduces the error and shows the workaround:
from PyALE import ale
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.ensemble import RandomForestRegressor
# get the raw diamond data (from R's ggplot2)
dat_diamonds = pd.read_csv(
"https://raw.githubusercontent.com/tidyverse/ggplot2/master/data-raw/diamonds.csv"
)
X = dat_diamonds.loc[:, ~dat_diamonds.columns.str.contains("price")].copy()
y = dat_diamonds.loc[:, "price"].copy()
features = ["carat","depth", "table", "x", "y", "z"]
# fit the model
model = RandomForestRegressor(random_state=1345)
model.fit(X[features], y)
# sample the data
random.seed(1234)
indices = random.sample(range(X.shape[0]), 10000)
sampleData = X.loc[indices, :]
# get the effects.....
# This throws the error
ale_eff = ale(X=sampleData[features], model=model, feature=["z", "table"], grid_size=100)
# This will work, just reset the index with drop=True
ale_eff = ale(X=sampleData[features].reset_index(drop=True), model=model, feature=["z", "table"], grid_size=100)
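To see what the workaround changes, here is a small sketch using only pandas (the toy frame and its size are made up for illustration). Sampling rows with .loc keeps the original row labels, so the sampled frame's index is neither sorted nor contiguous. reset_index(drop=True) replaces it with 0..n-1 without touching the data. Why the non-contiguous labels triggered the error in older PyALE versions is not shown here.

```python
import random
import pandas as pd

# Toy frame standing in for the diamonds data (hypothetical values).
df = pd.DataFrame({"z": range(100), "table": range(100, 200)})

random.seed(1234)
indices = random.sample(range(df.shape[0]), 10)
sample = df.loc[indices, :]

# The sampled frame keeps the original row labels.
labels_before = list(sample.index)

# After the reset, the labels are simply 0..9 and the values are unchanged.
sample_reset = sample.reset_index(drop=True)
labels_after = list(sample_reset.index)

print(labels_before)
print(labels_after)
```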
This is my first time trying to forecast using basic linear regression in Python. I discovered that I had to convert the dates to ordinal dates and then into a 2D numpy array. I now want to convert the numpy array back to YYYY/MMM/DD for a usable plot, but am failing. I have never used numpy before; x_full_month.map(dt.datetime.fromordinal) does not work, as map does not seem to be valid on a numpy array.
import datetime as dt
import numpy as np
from sklearn.linear_model import LinearRegression
model=LinearRegression()
df['Date_Ordinal']=df['Date'].map(dt.datetime.toordinal)
x=df['Date_Ordinal']
y=df['Cost']
x_train = x.values.reshape(-1, 1)
y_train = y.values.reshape(-1, 1)
# the model has to be fitted before predict() can be called
model.fit(x_train, y_train)
y_pred = model.predict(x_train)
From the fitted model, I'm then creating a new X of ordinal dates for the full month, to get a full month's response:
x_full_month = np.arange(737850,737880,1).reshape((-1, 1))
y_pred_new = model.predict(x_full_month)
print('predicted response:', y_pred_new.T, sep='\n')
This seems to work, but X is in ordinal dates (as expected). How would I get a nicely formatted X for plotting? Or get this back into a pandas DataFrame, which I'm more familiar with? Or am I going about this in a completely roundabout way?
Edit: corrected parameter name
Several hours later and I have a solution. I'm still sure I'm going about this inefficiently, but the steps below do work for me.
# .flatten() collapses the 2D numpy arrays to 1D, so they can serve as a column and an index
df = pd.DataFrame(y_pred_new.flatten(),x_full_month.flatten())
# creates a new index (as pd.DataFrame made x_full_month the index initially)
df.reset_index(inplace=True)
# meaningful column names
df = df.rename(columns = {'index':'ord_date',0:'cumul_DN'})
# Convert ordinal date to yyyy-mm-dd
df['date']=df['ord_date'].map(dt.datetime.fromordinal)
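A slightly shorter route, sketched here as one possible alternative rather than the definitive way: convert the ordinals with a list comprehension and build the DataFrame directly from named columns. The prediction values below are placeholders standing in for the output of model.predict, and the column names are made up.

```python
import datetime as dt
import numpy as np
import pandas as pd

x_full_month = np.arange(737850, 737880, 1).reshape((-1, 1))
# Placeholder predictions (hypothetical values, in place of model.predict output).
y_pred_new = np.linspace(10.0, 39.0, 30).reshape((-1, 1))

# ravel() gives a 1D view; each ordinal is converted back to a calendar date.
dates = [dt.date.fromordinal(int(o)) for o in x_full_month.ravel()]

result = pd.DataFrame(
    {"date": pd.to_datetime(dates), "predicted": y_pred_new.ravel()}
)
print(result.head())
```

Because the date column has a datetime dtype, plotting result.set_index("date") should give a date-formatted x axis without further conversion.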
I am using the AirPassengers dataset to forecast a time series. For the model, I chose auto_arima to produce the forecast. However, it seems that the order chosen by auto_arima does not fit the data well. The corresponding chart is produced by the code below.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from pmdarima import auto_arima
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers":"Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month',inplace=True)
train,test=df[:-24],df[-24:]
model = auto_arima(train,trace=True,error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])
plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()
from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'],forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality (as I suggest in the comment, with m=12) and taking logs produces a fit which is essentially perfect.
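The effect of the log transform can be illustrated on a synthetic series. This is a sketch with made-up numbers, not the AirPassengers data: a trend multiplied by a 12-month seasonal factor, so that the seasonal swing grows with the level. On the raw scale the swing in the last year is several times that of the first year. On the log scale the two are the same.

```python
import numpy as np

# Synthetic monthly series: exponential trend times a 12-month seasonal factor.
t = np.arange(144)
trend = 100.0 * np.exp(0.01 * t)
season = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)
y = trend * season

def yearly_swing_ratio(series):
    """Peak-to-trough range of the last 12 points divided by that of the first 12."""
    first, last = series[:12], series[-12:]
    return (last.max() - last.min()) / (first.max() - first.min())

raw_ratio = yearly_swing_ratio(y)          # swing grows with the level
log_ratio = yearly_swing_ratio(np.log(y))  # swing is constant after taking logs

print(raw_ratio, log_ratio)
```

If a model is fitted to np.log of the series, its forecasts are on the log scale and need np.exp applied before they are compared with the original values.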
The problem was that I did not specify m. In this case, I set m to 12, the number of observations per seasonal cycle, since each data row is a month. That's how I understand it.
Feel free to comment; I'm not entirely sure, as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.
Result:
I am working with an autoregressive model in Python using Statsmodels. The package is great and I am getting the exact results I need. However, testing for residual correlation (Breusch-Godfrey LM-test) doesn't seem to work, because I get an error message.
My code:
import pandas as pd
import datetime
import numpy as np
from statsmodels.tsa.api import VAR
import statsmodels.api as sm
df = pd.read_csv('US_data.csv')
# converting str formatted dates to datetime and setting the index
j = []
for i in df['Date']:
    j.append(datetime.datetime.strptime(i, '%Y-%m-%d').date())
df['Date'] = j
df = df.set_index('Date')
# dataframe contains three columns (GDP, INV and CONS)
# log difference
df = pd.DataFrame(np.log(df)*100)
df = df.diff()
p = 4 # order
model = VAR(df[1:])
results = model.fit(p, method='ols')
sm.stats.diagnostic.acorr_breusch_godfrey(results)
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-11abf518baae> in <module>()
----> 1 sm.stats.diagnostic.acorr_breusch_godfrey(results)
/home/****/anaconda3/lib/python3.6/site-packages/statsmodels/sandbox/stats/diagnostic.py in acorr_breusch_godfrey(results, nlags, store)
501 nlags = int(nlags)
502
--> 503 x = np.concatenate((np.zeros(nlags), x))
504
505 #xdiff = np.diff(x)
ValueError: all the input arrays must have same number of dimensions
A similar question was asked here over five months ago, but with no luck. Does anybody have an idea how to resolve this? Thank you very much in advance!
Those diagnostic tests were designed for univariate models like OLS where we have a one-dimensional residual array.
The only way to use it is most likely to use only a single equation of the VAR system or loop over each equation or variable.
VARResults in statsmodels master has a test_whiteness_new method which is a test for no autocorrelation of the multivariate residuals of a VAR.
It uses a Portmanteau test, which I think is the same as Ljung-Box.
The statespace models also use Ljung-Box for the related tests.
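As a rough sketch of the "loop over each variable" idea, the snippet below computes a univariate Ljung-Box Q statistic for each residual column separately. Everything here is an assumption made for illustration: ljung_box_q is a hand-written helper, not a statsmodels function, and the residual matrix is simulated rather than taken from results.resid. No p-values are computed, since for residuals of a fitted VAR the reference distribution would need a degrees-of-freedom adjustment.

```python
import numpy as np

def ljung_box_q(x, nlags):
    """Ljung-Box Q statistic for a 1-D series (hypothetical helper, no p-value)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    q = 0.0
    for k in range(1, nlags + 1):
        rho_k = np.dot(x[k:], x[:-k]) / denom  # sample autocorrelation at lag k
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

# Simulated stand-in for a residual matrix with one column per equation:
# column 0 is white noise, column 1 is strongly autocorrelated (AR(1), 0.8).
rng = np.random.default_rng(0)
n = 500
white = rng.standard_normal(n)
shocks = rng.standard_normal(n)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + shocks[t]
resid = np.column_stack([white, ar])

# One statistic per column; a large Q points to remaining autocorrelation.
q_stats = [ljung_box_q(resid[:, j], nlags=10) for j in range(resid.shape[1])]
print(q_stats)
```

This treats each equation on its own and ignores cross-correlation between the residual series, which is what a multivariate Portmanteau test would additionally pick up.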