ARCH modelling, DataScaleWarning: y is poorly scaled - python

I'm currently facing an issue with GARCH modelling in Python: a DataScaleWarning telling me that y is poorly scaled. I would really appreciate an explanation of the warning and, ideally, a fix. The GARCH model still runs, but the fit warns that the optimizer did not converge successfully.
[Screenshot of the warning message omitted.]
The y-values I used are the residuals from an ARIMA model I fitted.
[Screenshots of the y-series and of the GARCH fit output omitted.]
Update: after setting rescale=False, the fit instead fails with "inequality constraints incompatible".
Minimal reproducible example:
import pandas as pd
import pandas_datareader.data as pdr
import numpy as np
import arch
from statsmodels.tsa.arima.model import ARIMA

# Extract the data and create a log-returns column
eurusd = pdr.DataReader('DEXUSEU', 'fred', start='2010-01-01', end='2019-12-31')
eurusd.index = pd.DatetimeIndex(eurusd.index).to_period('D')
eurusd = eurusd.to_timestamp()
eurusd['LR'] = np.log(eurusd['DEXUSEU']).diff()

# ARIMA model for the conditional mean
arima_model = ARIMA(eurusd.LR.dropna(), order=(1, 0, 1)).fit()
print(arima_model.summary())

# GARCH model on the ARIMA residuals
am = arch.arch_model(arima_model.resid)
res = am.fit()
print(res.summary())
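The warning itself points at the usual remedy: daily log returns are on the order of 0.001, and arch's parameter estimation works best when the scale of y is roughly between 1 and 1000, so poorly scaled input can end in non-convergence or, with rescale=False, the "inequality constraints incompatible" failure. A minimal sketch of the standard fix, reusing arima_model from above: multiply the residuals by a constant such as 100 before fitting, and divide the estimated volatility back by the same constant.
# Rescale the residuals so their magnitude falls in the range the
# optimizer is tuned for, then undo the scaling on the volatility.
scale = 100
am = arch.arch_model(scale * arima_model.resid)
res = am.fit(disp='off')
print(res.summary())

# Conditional volatility in the original (unscaled) units
cond_vol = res.conditional_volatility / scale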

Related

Get better fit on test data using Auto_Arima

I am using the AirPassengers dataset to predict a time series. I chose auto_arima to pick the model order, but the order it selects does not fit the data well, as the chart produced by the code below shows.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pmdarima import auto_arima
from sklearn.metrics import mean_squared_error
%matplotlib inline

# Load the AirPassengers data and index it by month
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers": "Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)

# Hold out the last 24 months as a test set
train, test = df[:-24], df[-24:]

# Let auto_arima choose the order, then forecast the test period
model = auto_arima(train, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast, index=test.index, columns=['Prediction'])

plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.legend()
plt.show()

print(mean_squared_error(test['Passengers'], forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better; experiment shows that it does do better, though not what I would call well. Setting the seasonality with m=12, as I suggested in a comment, and taking logs produces a fit that is essentially perfect.
The problem was that I did not specify m. In this case I set m=12, denoting a monthly cycle, i.e. that each data row is a month. That's how I understand it, at least; feel free to comment, as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.
Result: [chart omitted; with m=12 and logs the forecast tracks the test data almost perfectly.]
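For the log-transform variant mentioned above, a sketch along these lines should work (it reuses the question's train/test split and is illustrative, not the answerer's exact code): fit on np.log(train) with m=12, then exponentiate the forecast back to passenger counts.
# Stabilize the growing seasonal amplitude with a log transform,
# then undo the transform on the forecast.
log_model = auto_arima(np.log(train), m=12, trace=True,
                       error_action='ignore', suppress_warnings=True)
log_forecast = np.exp(log_model.predict(n_periods=24))
log_forecast = pd.DataFrame(log_forecast, index=test.index, columns=['Prediction'])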

Output of a statsmodels regression

I would like to perform a simple regression using statsmodels. I've tried several different approaches by now, but I just can't get it to work: the code below runs without errors, yet it doesn't show me the results.
I am trying to model the variable "Direction", which takes the value 0 if the return for the corresponding date was negative and 1 if it was positive. The explanatory variables are the five lags of the returns; df13 contains the lags and the direction for each observed date. The code doesn't raise an error, it just prints:
Optimization terminated successfully.
Current function value: 0.682314
Iterations 5
However, I would like to see the typical table with all the beta values, their significance, and so on.
Also, since Direction is a binary variable, might it be better to use a logit instead of a linear model? In the assignment it appeared as a linear model, however.
And lastly, I'm sorry the dataframe isn't displayed correctly here; I don't know how to format it as code.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import os
import itertools
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
...
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
X = sm.add_constant(X)
model = sm.Logit(Y.astype(float), X.astype(float)).fit()
predictions = model.predict(X)
print_model = model.summary
print(print_model)
Edit: I'm sure it has to be a logit regression, so I updated that part.
I don't know if this is unintentional, but it looks like you need to define X and Y separately:
X = df13[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
Y = df13['Direction']
Secondly, I'm not familiar with statsmodels, but I would try converting your dataframes to NumPy arrays. You can do this with
Xnum = X.to_numpy()
Ynum = Y.to_numpy()
and try passing those to the regression.
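One more thing, and it is probably the actual reason no table shows up: summary is a method, and the code above assigns it without calling it, so print displays the bound method object instead of the results table. Adding the parentheses should produce the usual coefficient table:
# summary is a method; the parentheses matter
print(model.summary())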

Python: Random intercept model (have to replicate R code)

I'm trying to replicate code from R that estimates a random intercept model. The R code is:
fit=lmer(resid~-1+(1|groupid),data=df)
I'm using the lmer command from the lme4 package to estimate random intercepts for the variable resid for observations in different groups (defined by groupid). There is no fixed-effects part, hence no variable before the (1|groupid). Moreover, I do not want an overall constant estimated, so that I get a separate intercept for each group.
I'm not sure how to do the same estimation in Python. I tried something like:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 4), columns=list('ABCD'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5]
df['groupid'] = df['groupid'].astype('category')
# Random intercept model
md = smf.mixedlm('A~B-1',data=df,groups=df['groupid'])
mdf = md.fit()
print(mdf.random_effects)
A is resid from the earlier example, while groupid is the same.
1) I am not sure whether the mdf.random_effects are the random intercepts I am looking for.
2) I cannot remove the variable B, which I understand is the fixed effects part. If I try:
md = smf.mixedlm('A~-1',data=df,groups=df['groupid'])
I get an error that "Arrays cannot be empty".
I'm just trying to estimate the exact same model as in the R code. Any advice will be appreciated.
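One possible workaround, assuming statsmodels' MixedLM really does require at least one fixed-effects column: keep a fixed intercept in the formula and recover each group's intercept as the fixed intercept plus that group's random effect. This is a close substitute rather than the exact R specification with no fixed effects at all.
# Fit with a fixed intercept, then combine it with each group's
# random effect (BLUP) to get per-group intercepts.
md = smf.mixedlm('A ~ 1', data=df, groups=df['groupid'])
mdf = md.fit()
fixed_intercept = mdf.fe_params['Intercept']
group_intercepts = {g: fixed_intercept + re.iloc[0]
                    for g, re in mdf.random_effects.items()}
print(group_intercepts)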

Statsmodels Poisson glm different than R

I am trying to fit some models (spatial interaction models) following code provided in R. I have been able to get some of the code to work using statsmodels in Python, but some of the models do not match at all. I believe the R and Python code should give identical results. Does anyone see any differences, or is there some fundamental difference that might be throwing things off? The R code is the original, and it matches the numbers given in a tutorial (found here: http://www.bartlett.ucl.ac.uk/casa/pdf/paper181).
R sample Code:
library(mosaic)
Data = fetchData('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
rsquared = cor * cor
rsquared
R output:
> Model = glm(Data~Origin+Destination+Dij+offset(log(Offset)), family=poisson(link="log"), data = Data)
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
> cor = cor(Data$Data, Model$fitted, method = "pearson", use = "complete")
> rsquared = cor * cor
> rsquared
[1] 0.9753279
Python Code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr

Data = pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
# log is the default link for the Poisson family
Model = smf.glm('Data~Origin+Destination+Dij', data=Data,
                offset=np.log(Data['Offset']),
                family=sm.families.Poisson()).fit()
cor = pearsonr(Model.fittedvalues, Data["Data"])[0]
print("R-squared for doubly-constrained model is: " + str(cor * cor))
Python Output:
R-squared for doubly-constrained model is: 0.104758481123
It looks like GLM has convergence problems here in statsmodels. Maybe in R too, but R only gives these warnings.
Warning messages:
1: glm.fit: fitted rates numerically 0 occurred
2: glm.fit: fitted rates numerically 0 occurred
That could mean something like perfect separation in Logit/Probit context. I'd have to think about it for a Poisson model.
R is doing a better, if subtle, job of telling you that something may be wrong in your fitting. If you look at the fitted likelihood in statsmodels for instance, it's -1.12e27. That should be a clue right there that something is off.
Using the Poisson model directly (I always prefer maximum likelihood to GLM when possible), I can replicate the R results, though I get a convergence warning. Tellingly, again, the default Newton-Raphson solver fails, so I use BFGS.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy.stats.stats import pearsonr
data = pd.read_csv('http://dl.dropbox.com/u/8649795/AT_Austria.csv')
mod = smf.poisson('Data~Origin+Destination+Dij', data=data,
                  offset=np.log(data['Offset'])).fit(method='bfgs')
print(mod.mle_retvals['converged'])
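To mirror the R r-squared check, something like the following sketch should work. One assumption to flag: mod.predict() with no arguments is expected to use the model's own exog and offset, returning fitted mean counts on the response scale.
# Correlate fitted mean counts with observed counts, as the R code does
fitted = mod.predict()  # assumed to apply the stored offset
cor = pearsonr(fitted, data['Data'])[0]
print("R-squared for the Poisson MLE model is: " + str(cor * cor))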

statsmodels ARMA function is incompatible with pandas?

I have a data set with records at 30-second intervals, and I am trying to do forecasting using the ARMA function from the time series module. Due to data privacy, I have used random data to reproduce the error:
import numpy as np
from pandas import *
import statsmodels.api as sm
data = np.random.rand(100000)
data_index = date_range('2013-5-26', periods = len(data), freq='30s')
data = np.array(data)
data_series = Series(data, index = data_index)
model = sm.tsa.ARMA(data_series,(1,0)).fit()
My package versions:
Python version 2.7.3
pandas version 0.11.0
statsmodels version 0.5.0
The main error message is as follows (I omitted part of the traceback):
ValueError Traceback (most recent call last)
<ipython-input-24-0f57c74f0fc9> in <module>()
6 data = np.array(data)
7 data_series = Series(data, index = data_index)
----> 8 model = sm.tsa.ARMA(data_series,(1,0)).fit()
...........
...........
ValueError: freq 30S not understood
It seems to me that ARMA does not support the date index generated by pandas? If I remove the freq option from date_range, the command again fails for large series, since the dates run past the pandas Timestamp limit (the year 2262). Any way to get around this? Thanks.
Update:
OK, using data_series.values works, but then how do I do the prediction? My data_index runs from [2013-05-26 00:00:00, ..., 2013-06-29 17:19:30], and
prediction = model.predict('2013-05-26 00:00:00', '2013-06-29 17:19:30', dynamic=False)
still gives me an error.
I know prediction = model.predict() goes through and generates the whole-sequence prediction, which I can then match up myself, but overall that is not very convenient.
The problem is that this freq doesn't give back an offset from pandas for some reason, and we need an offset to be able to use the dates for anything. It looks like a pandas bug or an unimplemented case to me:
from pandas.tseries.frequencies import get_offset
get_offset('30s')
Perhaps we could improve the error message, though.
[Edit: We don't really need the dates except for adding them back in for convenience in prediction, so you can still estimate the model using data_series.values.]
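A sketch of that workaround with the versions listed in the question: fit on the raw ndarray, take the full in-sample prediction, and reattach the original DatetimeIndex so you can slice by date. (Under the default exact-MLE fit, the prediction length should match the series length; treat that alignment as an assumption.)
# Fit on the plain ndarray to sidestep the '30S' frequency problem
res = sm.tsa.ARMA(data_series.values, (1, 0)).fit()
# Reattach the dates, then slice the prediction by timestamp as usual
prediction = Series(res.predict(), index=data_series.index)
print(prediction['2013-06-29 17:00:00':'2013-06-29 17:19:30'])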
