I am writing a function that does a grid search for an ARIMA model based on the BIC criterion.
I am able to do the ARIMA grid search, but I am trying to update it to a SARIMA grid search.
The code looks like the below; I just can't make it work.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

p_min = 0
d_min = 0
q_min = 0
p_max = 4
d_max = 0
q_max = 4

# Initialize a DataFrame to store the results
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min, p_max + 1)],
                           columns=['MA{}'.format(i) for i in range(q_min, q_max + 1)])
for p, d, q in itertools.product(range(p_min, p_max + 1),
                                 range(d_min, d_max + 1),
                                 range(q_min, q_max + 1)):
    if p == 0 and d == 0 and q == 0:
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue
    try:
        # Note: reusing (p, d, q) as the seasonal order ties the seasonal and
        # non-seasonal orders together; they are usually searched separately.
        model = sm.tsa.SARIMAX(data, order=(p, d, q), trend='c',
                               seasonal_order=(p, d, q, 12))
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except Exception:
        continue
results_bic = results_bic[results_bic.columns].astype(float)
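A minimal sketch of how the seasonal order could be searched independently of the non-seasonal one, assuming the same statsmodels SARIMAX API and a series named data; results are collected as (order, seasonal_order, BIC) rows rather than a 2-D AR/MA table, since four loops no longer fit one grid:

import itertools
import pandas as pd
import statsmodels.api as sm

rows = []
# search non-seasonal (p, q) and seasonal (P, Q) independently;
# d and D are held fixed at 0 here for brevity
for p, q, P, Q in itertools.product(range(3), range(3), range(2), range(2)):
    if p == q == P == Q == 0:
        continue
    try:
        fit = sm.tsa.SARIMAX(data, order=(p, 0, q), trend='c',
                             seasonal_order=(P, 0, Q, 12)).fit(disp=False)
        rows.append({'order': (p, 0, q),
                     'seasonal_order': (P, 0, Q, 12),
                     'bic': fit.bic})
    except Exception:
        continue

grid = pd.DataFrame(rows)
print(grid.sort_values('bic').head())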
I'm pretty sure I am doing something wrong, but I have no idea what it is.
I have weekly data from 2018 to 09/2022, and I am trying to forecast the last 13 weeks of 2022 out of sample, using the last 13 weeks of 2021 as the exog variables for prediction.
The SARIMA_SEL_K function works really well in sample, i.e., it performs well on the test set after training. However, when I try to do out-of-sample prediction, the forecasts look very odd and do not capture the end-of-year decline that is inherent in the series' seasonality.
I am using Fourier terms, with the number of terms k selected by minimizing AIC through auto_arima. Either there is a fatal flaw in my understanding of this process or an error in my code; I cannot work out which one it is.
Question: does it make sense to use historical data for the exog Fourier terms? The rationale for doing this is a sharp decrease in values towards the end of the year. In my mind, using last year's data will allow the forecast to mimic this end-of-year behaviour while allowing the ARIMA components to take care of the short-term dynamics idiosyncratic to the current year.
Does this make conceptual sense as a forecasting strategy?
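For context on the question above: Fourier seasonal terms are deterministic functions of the time index, sin(2*pi*k*t/m) and cos(2*pi*k*t/m), so exog values for future periods can be generated by simply extending t, rather than recycling last year's observed data. A minimal sketch, assuming weekly seasonality m = 52 and a 13-step horizon (fourier_terms is a hypothetical helper, not part of pmdarima):

import numpy as np
import pandas as pd

def fourier_terms(t, m=52, k=3):
    # deterministic Fourier seasonal terms for an integer time index t
    terms = {}
    for j in range(1, k + 1):
        terms['sin_{}'.format(j)] = np.sin(2 * np.pi * j * t / m)
        terms['cos_{}'.format(j)] = np.cos(2 * np.pi * j * t / m)
    return pd.DataFrame(terms, index=t)

n_obs = 248                              # in-sample length (illustrative)
t_future = np.arange(n_obs, n_obs + 13)  # the next 13 weekly steps
future_exog = fourier_terms(t_future)    # exog for out-of-sample prediction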
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pmdarima as pm
from pmdarima.preprocessing import FourierFeaturizer

data['ts'] = pd.date_range('2017/12/29', '2022/12/23', freq='W-FRI')
data = data.set_index(pd.DatetimeIndex(data['ts'], freq='W-FRI'))

def fourier_transform(f, k, data):
    four_terms = FourierFeaturizer(f, k)
    y_prime, exog = four_terms.fit_transform(data)
    exog['date'] = y_prime.index
    exog = exog.set_index(exog['date'])
    exog.index.freq = 'W-FRI'
    exog = exog.drop(columns=['date'])
    return y_prime, exog
def SARIMA_SEL_K(m, range_k, data, n):
    if m > 1:
        score_best = np.inf
        best_model = None
        len_k = 0
        for k in range_k:
            y_prime, exog = fourier_transform(m, k, data)
            y_to_train = y_prime.iloc[:(len(y_prime) - n)]
            y_to_test = y_prime.iloc[(len(y_prime) - n):]  # last n weeks for testing
            exog_to_train = exog.iloc[:(len(exog) - n)]
            exog_to_test = exog.iloc[(len(exog) - n):]
            model_auto_arima = pm.auto_arima(y_to_train, exog_to_train, n_fits=15
                                             # out_of_sample_size=n, scoring='mse',
                                             )
            order = model_auto_arima.get_params().get('order')
            new_model = model_auto_arima
            # score on the held-out tail (mean absolute error)
            preds = model_auto_arima.predict(len(y_to_test), exog_to_test)
            abs_error = abs(y_to_test.values - preds)
            score_value = np.sum(abs_error) / len(y_to_test)
            print('k = ' + str(k) + ': ' + str(score_value) + ' ' + str(order))
            if score_value < score_best:
                score_best = score_value
                best_model = new_model
                len_k = k
        print(len_k)
    else:
        best_model = pm.auto_arima(data, n_fits=15)
    return best_model
forecast_n = 13
model = SARIMA_SEL_K(52, list(range(1, 26)), data, forecast_n)
test_y, test_exog = fourier_transform(52, 9, data[-forecast_n:])
model_refit = model.update(test_y, test_exog)
# data_pred holds last year's values used to build the prediction exog (see above)
pred_y, pred_exog = fourier_transform(52, 9, data_pred)
# pmdarima's predict takes n_periods and the exog matrix; it has no start/end arguments
fc, ci = model.predict(forecast_n, pred_exog, return_conf_int=True)
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(data.loc['2022-01-07':'2022-12-23'].index,
         data.loc['2022-01-07':'2022-12-23'], label='Actual')
plt.plot(fc, label='Forecast')
plt.legend()
plt.show()
I am trying to get RF feature importances. I fit the random forest on the data like this:
model = RandomForestRegressor()
n = model.fit(self.X_train, self.y_train)
if n is not None:
    df = pd.DataFrame(data=n, columns=["Feature", "Importance_Score"])
    df["Feature_Name"] = np.array(self.X_Headers)
    df = df.drop(["Feature"], axis=1)
    df[["Feature_Name", "Importance_Score"]].to_csv("RF_Importances.csv", index=False)
    del df
However, the n variable comes back as None. Why is this happening?
It's not clear how model.fit(self.X_train, self.y_train) could return None; in scikit-learn, fit returns the fitted estimator itself. More information about how you set up the model would help.
If we set this up using simulated data, it works:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

np.random.seed(111)
X = pd.DataFrame(np.random.normal(0, 1, (100, 5)), columns=['A', 'B', 'C', 'D', 'E'])
y = np.random.normal(0, 1, 100)

model = RandomForestRegressor()
n = model.fit(X, y)
if n is not None:
    df = pd.DataFrame({'features': X.columns, 'importance': n.feature_importances_})
df
  features  importance
0        A    0.176091
1        B    0.183817
2        C    0.169927
3        D    0.267574
4        E    0.202591
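As a side note on why n can never be None here: scikit-learn estimators return self from fit, so n and model are the same object. A small sketch reusing the simulated X and y above, which also sorts the importances before saving (the CSV name is just illustrative):

model = RandomForestRegressor(random_state=0)
n = model.fit(X, y)
print(n is model)  # True: fit returns the estimator itself, never None

# sort features by importance and save, mirroring the question's goal
importances = (pd.DataFrame({'Feature_Name': X.columns,
                             'Importance_Score': model.feature_importances_})
               .sort_values('Importance_Score', ascending=False))
importances.to_csv('RF_Importances.csv', index=False)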
I have written a function to compute KS values for various iterations of a logistic regression model. When I run the code, the KS value is printed on the screen, but it is not stored in the table.
Custom KS Function
def ks_value(bad_flag=None, predicted_prob=None):
    ## Data Prep
    ksdf = pd.DataFrame([])
    ksdf['bad_flag'] = bad_flag
    ksdf['probability'] = predicted_prob
    ksdf = ksdf.reset_index(drop=True)
    ksdf['decile'] = pd.qcut(ksdf['probability'], 10,
                             labels=['1','2','3','4','5','6','7','8','9','10'])
    ksdf['good_flag'] = 1 - ksdf['bad_flag']
    ## Pivot
    ksdf1 = pd.pivot_table(data=ksdf, index=['decile'],
                           values=['bad_flag', 'good_flag', 'probability'],
                           aggfunc={'bad_flag': [np.sum],
                                    'good_flag': [np.sum],
                                    'probability': [np.min, np.max]})
    ## Add Columns
    ksdf1['total_counts'] = ksdf1['bad_flag'] + ksdf1['good_flag']
    ksdf1 = ksdf1.reset_index()
    ksdf1.columns = ['Decile', 'Defaulter_Count', 'Non-Defaulter_Count',
                     'max_score', 'min_score', 'Total_Count']
    ksdf1 = ksdf1.sort_values(by='min_score', ascending=False)
    ksdf1['Default_Rate'] = (ksdf1['Defaulter_Count'] / ksdf1['Total_Count']).apply('{0:.2%}'.format)
    default_sum = ksdf1['Defaulter_Count'].sum()
    non_default_sum = ksdf1['Non-Defaulter_Count'].sum()
    ksdf1['Default %'] = (ksdf1['Defaulter_Count'] / default_sum).apply('{0:.2%}'.format)
    ksdf1['Non_Default %'] = (ksdf1['Non-Defaulter_Count'] / non_default_sum).apply('{0:.2%}'.format)
    ## Compute KS
    ksdf1['ks_stats'] = np.round(
        (ksdf1['Defaulter_Count'] / default_sum).cumsum()
        - (ksdf1['Non-Defaulter_Count'] / non_default_sum).cumsum(), 4) * 100
    return ksdf1['ks_stats'].max()
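As a cross-check on the decile-based calculation, the KS statistic can also be computed directly from the two score distributions with scipy (a sketch, assuming bad_flag and predicted_prob are aligned NumPy arrays):

from scipy.stats import ks_2samp

def ks_value_scipy(bad_flag, predicted_prob):
    # KS distance between the score distributions of defaulters and non-defaulters
    stat, _ = ks_2samp(predicted_prob[bad_flag == 1],
                       predicted_prob[bad_flag == 0])
    return stat * 100  # same percentage scale as ks_value above

Note this uses the exact empirical distributions, so it can differ slightly from the decile-grouped value.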
Code for iterating on Logistic Regression
# Iterate to find the optimal value of C (overfitting check on test data)
C_param_range = [0.001,0.01,0.1,1,10,100]
table1 = pd.DataFrame(columns = ['C_parameter','Test Accuracy','Train Accuracy','Test KS','Train KS'])
table1['C_parameter'] = C_param_range
j = 0
for i in C_param_range:
    # Apply logistic regression model to training data
    lr = LogisticRegression(penalty='l2', C=i, random_state=0, max_iter=1000)
    lr.fit(X_train, y_train)
    # Predict class (0, 1) using the model
    y_pred = lr.predict(X_test)
    y_pred2 = lr.predict(X_train)
    y_prob = lr.predict_proba(X_test)[:, 1]
    y_prob2 = lr.predict_proba(X_train)[:, 1]
    # KS value
    table1.iloc[j, 3] = ks_value(bad_flag=y_test, predicted_prob=y_prob)
    table1.iloc[j, 4] = ks_value(bad_flag=y_train, predicted_prob=y_prob2)
    # Saving accuracy score in table
    table1.iloc[j, 1] = accuracy_score(y_test, y_pred)
    table1.iloc[j, 2] = accuracy_score(y_train, y_pred2)
    j += 1
Output is something like this:

KS is 35.49
KS is 34.25

   C_parameter  Test Accuracy  Train Accuracy  Test KS  Train KS
0        0.001       0.919911        0.919056      NaN       NaN
Most of my samples are repetitions. Is there a way to give a weight to each sample that represents how frequent it is, so that the algorithm would only have to go through the unique set?
Or is there a way to manipulate the log(probability) function that I have defined to achieve this effect?
import numpy as np
import pymc as mc

# simple example for data:
data = [(0,1,10), (0,2,10), (1,0,20), (1,0,20), (1,0,20), (0,0,49), (1,1,12)]

member_a = mc.Uniform('a', lower=-1.0, upper=0.0)
member_d = mc.Uniform('d', lower=-1.0, upper=0.0)

@mc.stochastic(observed=True, dtype=int)
def logLikelihood(value=data, a=member_a, d=member_d):
    ratesMatrix = np.zeros((2, 2))
    ratesMatrix[0, 0] = a
    ratesMatrix[0, 1] = -a
    ratesMatrix[1, 0] = -d
    ratesMatrix[1, 1] = d
    r = []
    t = []
    for i in range(len(value)):
        r.append(ratesMatrix[int(value[i][0]), int(value[i][1])])
        t.append(value[i][2])
    r = np.array(r, dtype=np.float64)
    t = np.array(t, dtype=np.float64)
    # NOTE: a @mc.stochastic-decorated function must return the log-probability
    # of `value`; that return statement is missing from the snippet as posted.

model = mc.MCMC([member_a, member_d, logLikelihood])
trace = model.sample(iter=5000)
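Answering the weighting idea in the abstract: the total log-likelihood is a sum over observations, so repeated rows can be collapsed to unique rows with counts, and each unique row's log-probability multiplied by its count. A minimal sketch of that bookkeeping, independent of PyMC; per_row_logp is a hypothetical stand-in for whatever per-observation log-probability your model defines:

import numpy as np

data = np.array([(0,1,10), (0,2,10), (1,0,20), (1,0,20), (1,0,20), (0,0,49), (1,1,12)])

# collapse repeated samples into unique rows plus their multiplicities
unique_rows, counts = np.unique(data, axis=0, return_counts=True)

def per_row_logp(row):
    # hypothetical per-observation log-probability; plug in the model here
    return 0.0

def weighted_logp(rows, counts):
    # count-weighted sum: identical to summing over the full data set,
    # but iterates only over the unique rows
    return sum(c * per_row_logp(r) for r, c in zip(rows, counts))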
I would like to implement the equivalent of R's auto.arima() function in Python.
In R, auto.arima takes the time series values as input, computes the ARIMA order parameters (p, d, q), and fits a model; the user does not need to supply p, d, q.
I want to use the equivalent of auto.arima in Python (without calling R's auto.arima) to predict future values in a time series. For the time series below, that means running auto-ARIMA on 40 points and predicting the next 6 values, then moving the window by 1 point and repeating the procedure.
Here is some example data:
value
0
2.584751
2.884758
2.646735
2.882105
3.267503
3.94552
4.70788
5.384803
54.77972
62.87139
78.68957
112.7166
155.0074
170.8084
196.1941
237.4928
254.9718
175.0717
217.3807
244.7357
274.4517
304.6838
373.3202
345.6252
461.2653
443.5982
472.3653
469.3326
506.8819
532.1639
542.2837
514.9269
528.0194
540.539
542.7031
556.8262
569.7132
576.2339
577.7212
577.0873
569.6199
573.2445
573.7825
589.3506
I have tried to write functions that compute the order of differencing using the ADF (Augmented Dickey-Fuller) test, pass the differenced time series (which becomes stationary after differencing, per the ADF result) to the ARMA order-selection function to compute the p and q order values, and then pass these values to the ARIMA function in statsmodels. But the functions do not seem to work.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf

def diff_terms(timeseries):
    i = 1
    j = 0
    while i != 0:
        dftest = adfuller(timeseries, autolag='AIC')
        if dftest[0] <= dftest[4]["5%"]:
            i = 0
        else:
            timeseries = np.diff(timeseries)
            i = 1
            j = j + 1
    return j
def p_q_values_estimator(timeseries):
    p = 0
    q = 0
    lag_acf = acf(timeseries, nlags=20)
    lag_pacf = pacf(timeseries, nlags=20, method='ols')
    y = 1.96 / np.sqrt(len(timeseries))
    if lag_acf[0] < y:
        for a in lag_acf:
            if a < y:
                q = q + 1
                break
    elif lag_acf[0] > y:
        for c in lag_acf:
            if c > y:
                q = q + 1
                break
    if lag_pacf[0] < y:
        for b in lag_pacf:
            if b < y:
                p = p + 1
                break
    elif lag_pacf[0] > y:
        for d in lag_pacf:
            if d > y:
                p = p + 1
                break
    p_q = [p, q]
    return p_q
def p_q_values_estimator2(timeseries):
    res = sm.tsa.arma_order_select_ic(timeseries, ic=['aic'], max_ar=5, max_ma=4, trend='nc')
    return res.aic_min_order

data1 = []
data = pd.read_csv('ABC.csv')
d_value = diff_terms(data.value)
data1[:] = data[:]
data = data[0:40]
i = 0
while i < d_value:
    data_diff = np.diff(data)
    i = i + 1

p_q_values = p_q_values_estimator(data)
p_value = p_q_values[0]
q_value = p_q_values[1]

p_q_values2 = p_q_values_estimator2(data_diff)
p_value2 = p_q_values2[0]
q_value2 = p_q_values2[1]

exogx = np.array(range(0, 40))
fit2 = sm.tsa.ARIMA(np.array(data), (p_value, d_value, q_value), exog=exogx).fit()
print(fit2.fittedvalues)
pred2 = fit2.predict(start=40, end=45, exog=np.array(range(40, 46)))
print(pred2)
plt.plot(fit2.fittedvalues)
plt.plot(np.array(data))
plt.plot(range(40, 46), np.array(pred2))  # 6 predicted points
plt.show()
Errors when using arma_order_select_ic:

p_q_values2 = p_q_values_estimator2(data_diff)
  line 56, in p_q_values_estimator2
    res = sm.tsa.arma_order_select_ic(timeseries, ic=['aic'], max_ar=5, max_ma=4, trend='nc')
  File "C:\Python27\lib\site-packages\statsmodels\tsa\stattools.py", line 1052, in arma_order_select_ic
    min_res.update({i + '_min_order' : (mins[0][0], mins[1][0])})
IndexError: index 0 is out of bounds for axis 0 with size 0
Errors when using the ACF/PACF-based function to compute the p, q order:

fit2 = sm.tsa.ARIMA(np.array(data), (p_value, d_value, q_value), exog=exogx).fit()
  File "C:\Python27\lib\site-packages\statsmodels\tsa\arima_model.py", line 1104, in fit
    callback, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\tsa\arima_model.py", line 942, in fit
    armafit.mle_retvals = mlefit.mle_retvals
AttributeError: 'LikelihoodModelResults' object has no attribute 'mle_retvals'
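As a side note, both tracebacks come from the legacy statsmodels.tsa.arima_model.ARIMA under Python 2.7. That class has since been removed from statsmodels; the current equivalent lives in statsmodels.tsa.arima.model. A minimal sketch of the modern call, assuming the series and the (p, d, q) values found above:

from statsmodels.tsa.arima.model import ARIMA

# modern statsmodels API: order is a keyword argument, and trend='t' plays the
# role of the linear exog time trend used in the old call
model = ARIMA(data.value[0:40], order=(p_value, d_value, q_value), trend='t')
fit = model.fit()
pred = fit.forecast(steps=6)  # predict the next 6 values
print(pred)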
vals is my own thing, but you can create your own index with pd.date_range (this snippet drives R's forecast package from Python, e.g. through rpy2, so ts and forecast here are R bindings):

rdata = ts(traindf.requests_per_active.values, frequency=12)
# forecasts
fit = forecast.auto_arima(rdata)
forecast_output = forecast.forecast(fit, h=6, level=(95.0))
# convert forecasts to a dataframe
forecast_results = pd.Series(forecast_output[3], index=vals.index)
lowerpi = pd.Series(forecast_output[4], index=vals.index)
upperpi = pd.Series(forecast_output[5], index=vals.index)
results = pd.DataFrame({'forecast': forecast_results, 'lowerpi': lowerpi, 'upperpi': upperpi})
You can use the pyramid-arima library, which brings R's auto.arima() to Python. It wraps "statsmodels.tsa.ARIMA and statsmodels.tsa.statespace.SARIMAX into one estimator class" (per https://pypi.org/project/pyramid-arima/). The project has since been renamed to pmdarima.
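A minimal sketch of the rolling-window procedure described in the question, using pmdarima (pyramid-arima's successor). The name series is an assumption standing in for the numeric column of the example data above:

import pmdarima as pm

window, horizon = 40, 6
forecasts = []
# slide a 40-point window over the series, refitting and forecasting 6 steps each time
for start in range(0, len(series) - window + 1):
    train = series[start:start + window]
    model = pm.auto_arima(train, suppress_warnings=True, error_action='ignore')
    forecasts.append(model.predict(n_periods=horizon))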