R's auto.arima() equivalent in Python

I would like to implement an equivalent of R's auto.arima() function in Python.
In R, the auto.arima function takes time series values as input, computes the ARIMA order parameters (p, d, q), and fits a model; the user does not need to supply p, d, q values.
I want an equivalent of auto.arima in Python (without calling R's auto.arima) to predict future values in a time series. For the time series below, the procedure is to run the auto.arima equivalent on 40 points, predict the next 6 values, then move the window forward by 1 point and repeat.
Here is some example data:
value
0
2.584751
2.884758
2.646735
2.882105
3.267503
3.94552
4.70788
5.384803
54.77972
62.87139
78.68957
112.7166
155.0074
170.8084
196.1941
237.4928
254.9718
175.0717
217.3807
244.7357
274.4517
304.6838
373.3202
345.6252
461.2653
443.5982
472.3653
469.3326
506.8819
532.1639
542.2837
514.9269
528.0194
540.539
542.7031
556.8262
569.7132
576.2339
577.7212
577.0873
569.6199
573.2445
573.7825
589.3506
I have tried to write functions that compute the order of differencing using the Augmented Dickey-Fuller (ADF) test, and then pass the differenced time series (which becomes stationary after differencing, according to the ADF result) to the ARMA order selection function to compute the p and q order values.
These values are then passed to the ARIMA function in statsmodels. But the functions do not seem to work.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # needed for the plots at the end
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf

def diff_terms(timeseries):
    i = 1
    j = 0
    while i != 0:
        dftest = adfuller(timeseries, autolag='AIC')
        if dftest[0] <= dftest[4]["5%"]:
            i = 0
        else:
            timeseries = np.diff(timeseries)
            i = 1
            j = j + 1
    return j

def p_q_values_estimator(timeseries):
    p = 0
    q = 0
    lag_acf = acf(timeseries, nlags=20)
    lag_pacf = pacf(timeseries, nlags=20, method='ols')
    y = 1.96/np.sqrt(len(timeseries))
    if lag_acf[0] < y:
        for a in lag_acf:
            if a < y:
                q = q + 1
                break
    elif lag_acf[0] > y:
        for c in lag_acf:
            if c > y:
                q = q + 1
                break
    if lag_pacf[0] < y:
        for b in lag_pacf:
            if b < y:
                p = p + 1
                break
    elif lag_pacf[0] > y:
        for d in lag_pacf:
            if d > y:
                p = p + 1
                break
    p_q = [p, q]
    return p_q

def p_q_values_estimator2(timeseries):
    res = sm.tsa.arma_order_select_ic(timeseries, ic=['aic'], max_ar=5, max_ma=4, trend='nc')
    return res.aic_min_order

data1 = []
data = pd.read_csv('ABC.csv')
d_value = diff_terms(data.value)
data1[:] = data[:]
data = data[0:40]
i = 0
while i < d_value:
    data_diff = np.diff(data)
    i = i + 1

p_q_values = p_q_values_estimator(data)
p_value = p_q_values[0]
q_value = p_q_values[1]
p_q_values2 = p_q_values_estimator2(data_diff)
p_value2 = p_q_values2[0]
q_value2 = p_q_values2[1]

exogx = np.array(range(0, 40))
fit2 = sm.tsa.ARIMA(np.array(data), (p_value, d_value, q_value), exog=exogx).fit()
print(fit2.fittedvalues)
pred2 = fit2.predict(start=40, end=45, exog=np.array(range(40, 46)))
print(pred2)
plt.plot(fit2.fittedvalues)
plt.plot(np.array(data))
plt.plot(range(40, 45), np.array(pred2))
plt.show()
Errors – on using arma_order_select_ic:
p_q_values2 = p_q_values_estimator2(data_diff)
line 56, in p_q_values_estimator2
    res = sm.tsa.arma_order_select_ic(timeseries, ic=['aic'], max_ar=5, max_ma=4, trend='nc')
File "C:\Python27\lib\site-packages\statsmodels\tsa\stattools.py", line 1052, in arma_order_select_ic
    min_res.update({i + '_min_order' : (mins[0][0], mins[1][0])})
IndexError: index 0 is out of bounds for axis 0 with size 0
Errors – on using the acf/pacf-based function to compute the p, q order:
fit2 = sm.tsa.ARIMA(np.array(data), (p_value, d_value, q_value), exog=exogx).fit()
File "C:\Python27\lib\site-packages\statsmodels\tsa\arima_model.py", line 1104, in fit
    callback, **kwargs)
File "C:\Python27\lib\site-packages\statsmodels\tsa\arima_model.py", line 942, in fit
    armafit.mle_retvals = mlefit.mle_retvals
AttributeError: 'LikelihoodModelResults' object has no attribute 'mle_retvals'
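
A plausible explanation for the first error, for what it's worth: np.diff differences along the last axis by default, so applied to the two-dimensional (40, 1) slice it returns an empty (40, 0) array, and arma_order_select_ic then fails with the size-0 IndexError. A hedged one-line sketch of the fix, differencing the column rather than the frame:

# Difference the Series, not the 2-D frame, so the result is a
# length-39 vector rather than an empty (40, 0) array.
data_diff = np.diff(data.value[0:40])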

vals is my own thing, but you can create your own index with pd.date_range. This snippet calls R's forecast package from Python; the imports below are an assumption (the original answer omitted them) and require rpy2 with R's forecast package installed:
# assumed imports: calling R's forecast package via rpy2
# (dots in R names become underscores, so auto.arima -> auto_arima)
from rpy2.robjects.packages import importr
stats = importr('stats')
forecast = importr('forecast')
ts = stats.ts

rdata = ts(traindf.requests_per_active.values, frequency=12)
# forecasts
fit = forecast.auto_arima(rdata)
forecast_output = forecast.forecast(fit, h=6, level=(95.0))
# convert forecasts to dataframe
forecast_results = pd.Series(forecast_output[3], index=vals.index)
lowerpi = pd.Series(forecast_output[4], index=vals.index)
upperpi = pd.Series(forecast_output[5], index=vals.index)
results = pd.DataFrame({'forecast': forecast_results, 'lowerpi': lowerpi, 'upperpi': upperpi})

You can use the pyramid-arima library, which brings R's auto.arima() to Python. It wraps "statsmodels.tsa.ARIMA and statsmodels.tsa.statespace.SARIMAX into one estimator class" (per https://pypi.org/project/pyramid-arima/).
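A minimal sketch, assuming pyramid-arima is installed (pip install pyramid-arima; the project was later renamed pmdarima) and that data is the DataFrame from the question:

# Fit on the first 40 points and forecast the next 6, mirroring the
# rolling-window step described in the question.
from pyramid.arima import auto_arima

model = auto_arima(data.value[:40], seasonal=False, stepwise=True,
                   suppress_warnings=True, error_action='ignore')
print(model.order)                 # the (p, d, q) chosen automatically
print(model.predict(n_periods=6))  # forecast of the next 6 values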

Related

IndexError: index 1967 is out of bounds for axis 0 with size 1967

I am reducing the number of features in a large sparse dataset by eliminating those with high p-values, but I get this error. I have seen similar posts, and this code works with non-sparse input. Can you help, please? (I can upload the input file if needed.)
import numpy as np  # needed for np.delete below
import statsmodels.formula.api as sm

def backwardElimination(x, Y, sl, columns):
    numVars = len(x[0])
    pvalue_removal_counter = 0
    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    pvalue_removal_counter += 1
                    columns = np.delete(columns, j)
    regressor_OLS.summary()
    return x, columns
Output:
0 of 1970
1 of 1970
2 of 1970
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    selected_columns)
  File "main.py", line 101, in backwardElimination
    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
IndexError: index 1967 is out of bounds for axis 0 with size 1967
Here is a fixed version.
I made a number of changes:
Import the correct OLS from statsmodels.api
Generate columns in the function
Use np.argmax to find the location of the maximum value
Use a boolean index to select columns. In pseudo-code it is like x[:, [True, False, True]] which keeps columns 0 and 2.
Stop if there is nothing to drop.
import numpy as np
# Wrong import in the question. Not using the formula interface, so use statsmodels.api
import statsmodels.api as sm

def backwardElimination(x, Y, sl):
    numVars = x.shape[1]  # variables in columns
    columns = np.arange(numVars)
    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = regressor_OLS.pvalues.max()  # largest remaining p-value
        if maxVar > sl:
            # Use boolean selection
            retain = np.ones(x.shape[1], bool)
            drop = np.argmax(regressor_OLS.pvalues)
            # Drop the highest p-value
            retain[drop] = False
            # Keep the x we wish to retain
            x = x[:, retain]
            # Also keep their column indices
            columns = columns[retain]
        else:
            # Exit early if every remaining p-value is at or below sl
            break
    # Show the final summary
    print(regressor_OLS.summary())
    return x, columns
You can test it with:
x = np.random.standard_normal((1000, 100))
y = np.random.standard_normal(1000)
backwardElimination(x, y, 0.1)

Using numpy digitize output in scipy minimize problem

I am trying to minimize a quadratic weighted kappa function using scipy's fmin_powell optimizer.
The two functions digitize_train and digitize_train2 give exactly the same results.
However, when I try to use these functions with scipy's minimizer, the second method fails.
I have been trying to debug the problem for hours; to my surprise, despite the two functions being identical, the numpy digitize version fails under fmin_powell minimization.
How do I fix the error?
Question
How to use numpy.digitize in scipy fmin_powell?
SETUP
# imports
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.optimize import fmin_powell
from sklearn import metrics
# data
train_labels = [1,1,8,7,6,5,3,2,4,4]
train_preds = [0.1,1.2,8.9, 7.6, 5.5, 5.5, 2.99, 2.4, 3.5, 4.0]
guess_lst = (1.5,2.9,3.1,4.5,5.5,6.1,7.1)
# functions
# here I am trying to convert real numbers from -inf to +inf into integers 1 to 8
def digitize_train(train_preds, guess_lst):
    (x1, x2, x3, x4, x5, x6, x7) = list(guess_lst)
    res = []
    for y in list(train_preds):
        if y < x1:
            res.append(1)
        elif y < x2:
            res.append(2)
        elif y < x3:
            res.append(3)
        elif y < x4:
            res.append(4)
        elif y < x5:
            res.append(5)
        elif y < x6:
            res.append(6)
        elif y < x7:
            res.append(7)
        else:
            res.append(8)
    return res
def digitize_train2(train_preds, guess_lst):
    return np.digitize(train_preds, guess_lst) + 1

# compare two functions
df = pd.DataFrame({'train_labels': train_labels,
                   'train_preds': train_preds,
                   'method_1': digitize_train(train_preds, guess_lst),
                   'method_2': digitize_train2(train_preds, guess_lst)
                   })
df
Note: the two functions give exactly the same results.
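As a quick sanity check on what np.digitize is doing here: it returns, for each value, how many bin edges that value has passed, so adding 1 maps the eight intervals onto the labels 1 to 8. A toy example with made-up inputs:

# np.digitize counts the bin edges each value has passed
# (right=False semantics), so +1 yields the labels 1..8
print(np.digitize([0.5, 3.0, 9.0], (1.5, 2.9, 3.1, 4.5, 5.5, 6.1, 7.1)) + 1)
# -> [1 3 8]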
Method 1: without numpy digitize, runs fine
# using fmin_powell for method 1
def get_offsets_minimizing_train_preds_kappa(guess_lst):
    res = digitize_train(train_preds, guess_lst)
    return -metrics.cohen_kappa_score(train_labels, res, weights='quadratic')

offsets = fmin_powell(get_offsets_minimizing_train_preds_kappa, guess_lst, disp=True)
print(offsets)
Method 2: using numpy digitize, fails
# using fmin_powell for method 2
def get_offsets_minimizing_train_preds_kappa2(guess_lst):
    res = digitize_train2(train_preds, guess_lst)
    return -metrics.cohen_kappa_score(train_labels, res, weights='quadratic')

offsets = fmin_powell(get_offsets_minimizing_train_preds_kappa2, guess_lst, disp=True)
print(offsets)
How can I use the numpy digitize method here?
Update
As per suggestions, I tried pandas cut, but it still gives an error:
ValueError: bins must increase monotonically.
# using fmin_powell for method 3
def get_offsets_minimizing_train_preds_kappa3(guess_lst):
    res = pd.cut(train_preds, bins=[-np.inf] + list(guess_lst) + [np.inf],
                 right=False)
    res = pd.Series(res).cat.codes + 1
    res = res.to_numpy()
    return -metrics.cohen_kappa_score(train_labels, res, weights='quadratic')

offsets = fmin_powell(get_offsets_minimizing_train_preds_kappa3, guess_lst, disp=True)
print(offsets)
It seems that during the minimization process, the values in guess_lst are no longer monotonically increasing. One workaround is to pass a sorted guess_lst to digitize, like:
def digitize_train2(train_preds, guess_lst):
    return np.digitize(train_preds, sorted(guess_lst)) + 1
and you get:
# using fmin_powell for method 2
def get_offsets_minimizing_train_preds_kappa2(guess_lst):
    res = digitize_train2(train_preds, guess_lst)
    return -metrics.cohen_kappa_score(train_labels, res, weights='quadratic')

offsets = fmin_powell(get_offsets_minimizing_train_preds_kappa2, guess_lst, disp=True)
print(offsets)

Optimization terminated successfully.
         Current function value: -0.990792
         Iterations: 2
         Function evaluations: 400
[1.5        2.7015062  3.1        4.50379942 4.72643334 8.12463415
 7.13652301]
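
The same workaround fixes the pd.cut attempt from the update: sort the cut points so the bins stay monotonic whatever order the optimizer tries. A minimal sketch, assuming the same train_preds and train_labels as in the setup:

# Same fix for the pd.cut variant: sorted() keeps the bins monotonic
# even when fmin_powell reorders guess_lst during the search.
def get_offsets_minimizing_train_preds_kappa3(guess_lst):
    bins = [-np.inf] + sorted(guess_lst) + [np.inf]
    res = pd.cut(train_preds, bins=bins, right=False)
    res = (pd.Series(res).cat.codes + 1).to_numpy()
    return -metrics.cohen_kappa_score(train_labels, res, weights='quadratic')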

How to resolve Boolean value error in linear regression model in python?

I am trying to run a Fama-MacBeth regression in Python. As a first step I am running the time series regression for every asset in my portfolio, but I am unable to run it because I am getting this error:
'ValueError: Must pass DataFrame with boolean values only'
I am relatively new to Python and have heavily relied on this forum to help me out. I hope you can help me with this issue.
Please let me know how I can resolve this. I will be very grateful to you!
I assume this line is producing the error, because when I run the function without the for loop, it works perfectly.
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=data_set.iloc[:, i],
                             independent=data_set.iloc[:, 30:], const=True,
                             parameters='beta', win=12)
The dimensions of my matrix are 108x35: 30 stocks and 5 factors over 108 time points. Hence I want to run a regression for every stock against the factors and store the resulting coefficients in a dataframe. Sample data:
Date       BAS GY   AI FP    SGL GY   LNA GY   AKZA NA  Market Factor
1/29/2010  -5.28%   -7.55%   -1.23%   -5.82%   -7.09%   -5.82%
2/26/2010   0.04%   13.04%   -1.84%    4.06%  -14.62%  -14.62%
3/31/2010  10.75%    1.32%    7.33%    6.61%   12.21%   12.21%
The following is the entire code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

data_set = pd.read_excel(r'C:\XXX\Research Project\Data\Regression.xlsx', sheet_name='Fama Macbeth')
data_set.set_index(data_set['Date'], inplace=True)
data_set.drop('Date', axis=1, inplace=True)
X = data_set.iloc[:, 30:]
y = data_set.iloc[:, :30]

def RegressionRoll(df, subset, dependent, independent, const, win, parameters):
    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df
    # Loop info
    end = df.shape[0]
    win = win
    rng = np.arange(start=win, stop=end, step=1)
    # Subset and store dataframes
    frames = {}
    n = 1
    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1
    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:
        # print(frames[frame])
        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent
        if const == True:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()
        if parameters == 'beta':
            theParams = model.params[0:]
            coefs = theParams.to_frame()
            df_temp = pd.DataFrame(coefs.T)
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_results = pd.concat([df_results, df_temp], axis=0)
        if parameters == 'R2':
            theParams = model.rsquared
            df_temp = pd.DataFrame([theParams])
            indx = dfr.tail(1).index[-1]
            df_temp['Date'] = indx
            df_temp = df_temp.set_index(['Date'])
            df_temp.columns = [', '.join(independent)]
            df_results = pd.concat([df_results, df_temp], axis=0)
    return df_results

cols = len(y.columns)
for i in range(cols):
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=data_set.iloc[:, i],
                             independent=data_set.iloc[:, 30:], const=True,
                             parameters='beta', win=12)
ValueError: Must pass DataFrame with boolean values only
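
A likely cause, though it was never confirmed in this thread: dependent and independent are passed in as a Series and a DataFrame, and then used again as indexers in dfr[y] and dfr[x]; pandas only accepts a DataFrame indexer when it contains booleans, which is exactly the error raised. A sketch of a possible fix is to pass column labels instead of data objects (the layout assumptions below are mine, not from the thread):

# Sketch (assumed fix): pass column labels, then select by label inside
# RegressionRoll, e.g. y = dfr[dependent] and x = dfr[independent].
dep_cols = list(data_set.columns[:30])    # the 30 stock columns (assumed layout)
indep_cols = list(data_set.columns[30:])  # the factor columns

for dep_col in dep_cols:
    df_beta = RegressionRoll(df=data_set, subset=0, dependent=dep_col,
                             independent=indep_cols, const=True,
                             parameters='beta', win=12)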

Simple Linear Regression Error in updating cost function and Theta Parameters

I wrote a piece of code to build a simple linear regression model in Python. However, I am having trouble getting the correct cost function and, most importantly, the correct theta parameters. The model is implemented from scratch, not with the scikit-learn module. I used Andrew Ng's notes from his ML Coursera course to create the model. The correct values of theta are [[-3.630291] [1.166362]].
I would be really grateful if someone could offer their expertise and point out what I'm doing wrong.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
dataset = pd.read_csv("Population vs Profit.txt", names=["Population", "Profit"])
print(dataset.head())
col = len(dataset.columns)
x = dataset.iloc[:, :col-1].values
y = dataset.iloc[:, col-1].values

# Visualizing the dataset
plt.scatter(x, y, color="red", marker="x", label="Profit")
plt.title("Population vs Profit")
plt.xlabel("Population")
plt.ylabel("Profit")
plt.legend()
plt.show()

# Preprocessing data
dataset.insert(0, "x0", 1)
col = len(dataset.columns)
x = dataset.iloc[:, :col-1].values
b = np.zeros(col-1)
m = len(y)
costlist = []
alpha = 0.001
iteration = 10000

# Defining functions
def hypothesis(x, b, y):
    h = x.dot(b.T) - y
    return h

def cost(x, b, y, m):
    j = np.sum(hypothesis(x, b, y)**2)
    j = j/(2*m)
    return j

print(cost(x, b, y, m))

def gradient_descent(x, b, y, m, alpha):
    for i in range(iteration):
        h = hypothesis(x, b, y)
        product = np.sum(h.dot(x))
        b = b - ((alpha/m)*product)
        costlist.append(cost(x, b, y, m))
    return b, cost(x, b, y, m)

b, mincost = gradient_descent(x, b, y, m, alpha)
print(b, mincost)
print(cost(x, b, y, m))
plt.plot(b, color="green")
plt.show()
The dataset I'm using is the following text.
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483
8.5781,12
6.4862,6.5987
5.0546,3.8166
5.7107,3.2522
14.164,15.505
5.734,3.1551
8.4084,7.2258
5.6407,0.71618
5.3794,3.5129
6.3654,5.3048
5.1301,0.56077
6.4296,3.6518
7.0708,5.3893
6.1891,3.1386
20.27,21.767
5.4901,4.263
6.3261,5.1875
5.5649,3.0825
18.945,22.638
12.828,13.501
10.957,7.0467
13.176,14.692
22.203,24.147
5.2524,-1.22
6.5894,5.9966
9.2482,12.134
5.8918,1.8495
8.2111,6.5426
7.9334,4.5623
8.0959,4.1164
5.6063,3.3928
12.836,10.117
6.3534,5.4974
5.4069,0.55657
6.8825,3.9115
11.708,5.3854
5.7737,2.4406
7.8247,6.7318
7.0931,1.0463
5.0702,5.1337
5.8014,1.844
11.7,8.0043
5.5416,1.0179
7.5402,6.7504
5.3077,1.8396
7.4239,4.2885
7.6031,4.9981
6.3328,1.4233
6.3589,-1.4211
6.2742,2.4756
5.6397,4.6042
9.3102,3.9624
9.4536,5.4141
8.8254,5.1694
5.1793,-0.74279
21.279,17.929
14.908,12.054
18.959,17.054
7.2182,4.8852
8.2951,5.7442
10.236,7.7754
5.4994,1.0173
20.341,20.992
10.136,6.6799
7.3345,4.0259
6.0062,1.2784
7.2259,3.3411
5.0269,-2.6807
6.5479,0.29678
7.5386,3.8845
5.0365,5.7014
10.274,6.7526
5.1077,2.0576
5.7292,0.47953
5.1884,0.20421
6.3557,0.67861
9.7687,7.5435
6.5159,5.3436
8.5172,4.2415
9.1802,6.7981
6.002,0.92695
5.5204,0.152
5.0594,2.8214
5.7077,1.8451
7.6366,4.2959
5.8707,7.2029
5.3054,1.9869
8.2934,0.14454
13.394,9.0551
5.4369,0.61705
One issue is with your "product": it is currently a scalar when it should be a vector. I was able to get the values [-3.24044334 1.12719788] by rewriting your for loop as follows:
def gradient_descent(x, b, y, m, alpha):
    for i in range(iteration):
        h = hypothesis(x, b, y)
        # product = np.sum(h.dot(x))
        xvalue = x[:, 1]
        product = h.dot(xvalue)
        hsum = np.sum(h)
        b = b - ((alpha/m) * np.array([hsum, product]))
        costlist.append(cost(x, b, y, m))
    return b, cost(x, b, y, m)
There is possibly another issue besides this, as it doesn't converge to your answer. You should also make sure you are using the same alpha.
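
For reference, a fully vectorized update handles both parameters at once without special-casing columns. A minimal sketch, assuming the same x and y as above; with Andrew Ng's course settings (alpha = 0.01, 1500 iterations) this lands close to the expected [[-3.630291] [1.166362]]:

# Vectorized gradient descent: x.T.dot(h) is the gradient for all
# parameters at once, so no per-column handling is needed.
def gradient_descent_vec(x, y, alpha=0.01, iterations=1500):
    m = len(y)
    b = np.zeros(x.shape[1])
    for _ in range(iterations):
        h = x.dot(b) - y                  # residuals
        b = b - (alpha / m) * x.T.dot(h)  # update all parameters together
    return b

print(gradient_descent_vec(x, y))  # roughly [-3.63  1.17] on this dataset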

How to remove outliers correctly and define predictors for linear model?

I am learning how to build a simple linear model to predict a flat's price based on its square meters and number of rooms. I have a .csv data set with several features, and 'Price' is of course one of them, but it contains several suspicious values like '1' or '4000'. I want to remove these values based on the mean and standard deviation, so I use the following function to remove outliers:
import numpy as np
import pandas as pd

def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I construct a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data))  # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set my features:
features = data[['SqrMeters', 'Rooms']]
target = data_filtered
X = features
Y = target
And here is my question: how can I get the same set of observations for my X and Y? Right now I have inconsistent numbers of samples (5000 for X and 4995 for Y after removing outliers). Thank you for any help on this topic.
The features and labels should have the same length, and you should pass the whole data object to reject_outliers:
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"] > (u - 2*s)) & (data["Price"] < (u + 2*s))]
    return data_filtered
You can use it this way:
data_filtered = reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X = features
y = target
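From there the model fit stays aligned, since X and y come from the same filtered frame. A minimal sketch, assuming scikit-learn:

# Fit on the filtered frame so X and y have matching rows.
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)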
The following works for pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u - 2*s) & (data.Price < u + 2*s)]
    return data_filtered
