ARIMA forecast gives different results with new python statsmodels - python

I'm doing out-of-sample forecasting with ARIMA(0,1,0).
With the latest stable version of Python's statsmodels (0.12), I calculate:
import statsmodels.tsa.arima_model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0))
model_fit = model.fit(disp=0)
forecast, _, intervals = model_fit.forecast(steps=steps, exog=None, alpha=alpha)
which results in
forecast = [21.125, 23.25, 25.375, 27.5]
intervals = [[19.5950036, 22.6549964 ], [21.08625835, 25.41374165], [22.72496851, 28.02503149], [24.44000721, 30.55999279]]
and a Future Warning, which suggests:
FutureWarning:
statsmodels.tsa.arima_model.ARMA and statsmodels.tsa.arima_model.ARIMA have
been deprecated in favor of statsmodels.tsa.arima.model.ARIMA (note the .
between arima and model) and
statsmodels.tsa.SARIMAX. These will be removed after the 0.12 release.
In the new version, as hinted to in the Future Warning, I calculate:
import statsmodels.tsa.arima.model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0))
model_fit = model.fit()
forecast = model_fit.get_forecast(steps=steps)
forecasts_and_intervals = forecast.summary_frame(alpha=alpha)
which gives different results:
forecasts_and_intervals =
y mean mean_se mean_ci_lower mean_ci_upper
0 19.0 2.263842 14.562951 23.437049
1 19.0 3.201556 12.725066 25.274934
2 19.0 3.921089 11.314806 26.685194
3 19.0 4.527684 10.125903 27.874097
I would like to obtain the same results as before.
Am I using the new interface correctly?
I need both the forecast and the intervals.
I have already tried functions other than forecast that the new interface offers.
In particular I'm wondering why the forecast result is 19 for the entire list.
Many thanks for any help.
Here is the documentation for statsmodels 0.12.2: https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html?highlight=arima#statsmodels.tsa.arima_model.ARIMA
Here is the documentation for newer version of Arima:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html?highlight=arima#statsmodels.tsa.arima.model.ARIMA

The difference comes down to whether the model includes a "constant" term. The older statsmodels.tsa.arima_model.ARIMA includes a constant term by default (its fit method defaults to trend="c"). With differencing, the constant is still included, but in the differenced domain (otherwise differencing would eliminate it). So here is its ARIMA(0, 1, 0) model:
y_t - y_{t-1} = c + e_t
which is "random walk with drift".
For the new statsmodels.tsa.arima.model.ARIMA, as the documentation you linked says, no trend term of any kind (including the constant c) is included by default when differencing is involved, which is the case for you. So here is its ARIMA(0, 1, 0) model:
y_t - y_{t-1} = e_t
which is a plain "random walk". As we know, its forecasts correspond to naive forecasts, i.e. repeating the last value (19 in your case).
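As a sanity check on the two model equations above, both sets of numbers in the question can be reproduced by hand with NumPy. This is only a minimal sketch, not what statsmodels does internally: it assumes the drift c is estimated as the mean of the first differences and the innovation variance as the mean squared deviation of the differences from that mean. The variable names (drift, sigma, horizon) are mine.

```python
import numpy as np

time_series = np.array([2, 3.0, 5, 7, 9, 11, 13, 17, 19])
steps = 4
z = 1.959964  # 97.5% standard normal quantile, i.e. alpha = 0.05

diffs = np.diff(time_series)   # y_t - y_{t-1}
horizon = np.arange(1, steps + 1)

# Random walk (no constant): every forecast repeats the last value.
forecast_naive = np.full(steps, time_series[-1])

# Random walk with drift: add the estimated constant c once per step.
drift = diffs.mean()
forecast_drift = time_series[-1] + drift * horizon

# Assumed interval: forecast error std grows like sigma * sqrt(h).
sigma = np.sqrt(np.mean((diffs - drift) ** 2))
lower = forecast_drift - z * sigma * np.sqrt(horizon)
upper = forecast_drift + z * sigma * np.sqrt(horizon)

print(forecast_naive)   # 19 repeated, as with the new default
print(forecast_drift)   # 21.125, 23.25, 25.375, 27.5, as with the old class
print(np.column_stack([lower, upper]))
```

Under these assumptions the drift forecasts and intervals agree with the old interface's output shown in the question, which supports the explanation that the constant term is the only difference.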
Then, how do you make the new one behave the same way?
It has a parameter called trend which you can specify to get the same behaviour. Since you are using differencing (d=1), passing trend="t" should give the same model as the old one. ("t" means linear trend, but since d = 1 it reduces to a constant in the differenced domain):
import statsmodels.tsa.arima.model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0), trend="t") # only change is here!
model_fit = model.fit()
forecast = model_fit.get_forecast(steps=steps)
forecasts_and_intervals = forecast.summary_frame(alpha=alpha)
and here is what I get for forecasts_and_intervals:
y mean mean_se mean_ci_lower mean_ci_upper
0 21.124995 0.780622 19.595004 22.654986
1 23.249990 1.103966 21.086256 25.413724
2 25.374985 1.352077 22.724962 28.025008
3 27.499980 1.561244 24.439997 30.559963

I think this raises another issue. I'm not sure exogenous variables are treated the same way in the new arima.model version. I believe that in the old version, arima_model, they are applied to the differenced series: for (0,0,0), y = mx + b, whereas for (0,1,0), dy = mx + b.

Related

Python weighted quantile as R wtd.quantile()

I want to port the R function Hmisc::wtd.quantile() to Python.
Here is the example in R:
I took this as a reference, but its logic seems to differ from R's:
import numpy as np

# First function
def weighted_quantile(values, quantiles, sample_weight=None,
                      values_sorted=False, old_style=False):
    """ Very close to numpy.percentile, but supports weights.
    NOTE: quantiles should be in [0, 1]!
    :param values: numpy.array with data
    :param quantiles: array-like with many quantiles needed
    :param sample_weight: array-like of the same length as `array`
    :return: numpy.array with computed quantiles.
    """
    values = np.array(values)
    quantiles = np.array(quantiles)
    if sample_weight is None:
        sample_weight = np.ones(len(values))
    sample_weight = np.array(sample_weight)
    assert np.all(quantiles >= 0) and np.all(quantiles <= 1), 'quantiles should be in [0, 1]'
    if not values_sorted:
        sorter = np.argsort(values)
        values = values[sorter]
        sample_weight = sample_weight[sorter]
    # weighted_quantiles = np.cumsum(sample_weight)
    # weighted_quantiles /= np.sum(sample_weight)
    weighted_quantiles = np.cumsum(sample_weight)/np.sum(sample_weight)
    return np.interp(quantiles, weighted_quantiles, values)

weighted_quantile(values = [0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319],
                  quantiles = np.arange(0, 1 + 1 / 5, 1 / 5),
                  sample_weight = [1,1,1,1,1])
>> array([0.2136325, 0.2136325, 0.4079128, 0.4890342, 0.5083345, 0.6197319])
# Second function
def weighted_percentile(data, weights, perc):
    """
    perc : percentile in [0-1]!
    """
    data = np.array(data)
    weights = np.array(weights)
    ix = np.argsort(data)
    data = data[ix]        # sort data
    weights = weights[ix]  # sort weights
    cdf = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)  # 'like' a CDF function
    return np.interp(perc, cdf, data)

weighted_percentile([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319], [1,1,1,1,1], np.arange(0, 1 + 1 / 5, 1 / 5))
>> array([0.2136325 , 0.31077265, 0.4484735 , 0.49868435, 0.5640332 ,
          0.6197319 ])
Both give results that differ from R's. Any idea why?
I am Python-illiterate, but from what I see and after some quick checks I can tell you the following.
Here you use uniform (sampling) weights, so you could also directly use the quantile() function. Not surprisingly, it gives the same results as wtd.quantile() with uniform weights:
x <- c(0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319)
n <- length(x)
x <- sort(x)
quantile(x, probs = seq(0,1,0.2))
# 0% 20% 40% 60% 80% 100%
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
The R quantile() function computes quantiles with its default (type 7) rule, i.e. by locating each quantile at the fractional index i = 1 + q(n-1) among the sorted observations.
In your case:
1 + seq(0,1,0.2)*(n-1)
# 1.0 1.8 2.6 3.4 4.2 5.0
Of course, since you have 5 values/obs and you want quintiles, most of the indices are not integers. But you know, for example, that the first quintile (i = 1.8) lies between obs 1 and obs 2. More precisely, it is a linear combination of the two observations (the 'weights' are derived from the fractional part of the index):
0.2*x[1] + 0.8*x[2]
# 0.3690567
You can do the same for all the quintiles, on the basis of the indices:
q <-
  c(min(x),              ## 1.0: exactly the first obs
    0.2*x[1] + 0.8*x[2], ## 1.8: quintile lies between obs 1 and 2
    0.4*x[2] + 0.6*x[3], ## 2.6: quintile lies between obs 2 and 3
    0.6*x[3] + 0.4*x[4], ## 3.4: quintile lies between obs 3 and 4
    0.8*x[4] + 0.2*x[5], ## 4.2: quintile lies between obs 4 and 5
    max(x)               ## 5.0: exactly the last obs
  )
q
# 0.2136325 0.3690567 0.4565856 0.4967543 0.5306140 0.6197319
You can see that you get exactly the output of quantile() and wtd.quantile().
If instead of 0.2*x[1] + 0.8*x[2] we consider the following:
0.5*x[1] + 0.5*x[2]
# 0.3107726
We get the output of your second Python function. It appears that your second function considers uniform 'weights' (obviously I am not talking about the sampling weights here) when combining the two observations. The issue (at least for the second Python function) seems to come from this. I know these are just insights, but I hope they will help.
EDIT: note that the difference between the two is not necessarily an 'issue' with the Python code. There are different quantile estimators (and weighted versions of them), and the Python functions could simply rely on a different estimator than Hmisc::wtd.quantile(). I think that the latter uses the weighted version of the Harrell-Davis quantile estimator. If you really want to implement this one, you should check the source code of Hmisc::wtd.quantile() and try to translate it 'directly' into Python.
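For reference, the hand computation above can be mirrored in Python. The unweighted part relies on np.quantile, whose default linear interpolation is the same rule as R's default quantile(). The weighted helper wtd_quantile_type7 is a hypothetical name and only a sketch: it assumes each weight can be read as a repeat count, and it has not been checked against Hmisc::wtd.quantile() for non-integer weights.

```python
import numpy as np

x = np.array([0.4890342, 0.4079128, 0.5083345, 0.2136325, 0.6197319])
probs = np.linspace(0, 1, 6)

# Unweighted case: same interpolation rule as R's default quantile().
print(np.quantile(x, probs))


def wtd_quantile_type7(values, weights, probs):
    """Sketch: interpolate order statistics at position 1 + (n - 1) * p,
    where n is the total weight and weights act like repeat counts."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    probs = np.asarray(probs, dtype=float)

    order = np.argsort(values)
    xs = values[order]
    cum_w = np.cumsum(weights[order])
    n = cum_w[-1]

    pos = 1 + (n - 1) * probs           # fractional 1-based position
    low = np.maximum(np.floor(pos), 1)
    high = np.minimum(low + 1, n)
    frac = pos - np.floor(pos)

    def order_stat(k):
        # first sorted value whose cumulative weight reaches k
        idx = np.searchsorted(cum_w, k, side="left")
        return xs[np.minimum(idx, len(xs) - 1)]

    return (1 - frac) * order_stat(low) + frac * order_stat(high)


print(wtd_quantile_type7(x, np.ones(len(x)), probs))
```

With uniform weights the helper returns the same values as np.quantile, and with integer weights it matches np.quantile applied to the data with each value repeated that many times.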

Genetic Algorithm Population Individual as Array

I don't have much experience with Genetic Algorithms, so I would like to ask the community for some useful comments. I apologize in advance for any terminology errors; please correct me where needed.
The problem I want to optimize is optimal power flow in an islanded microgrid. In the simple microgrid we have 2 diesel generators (DG), 1 PV array, 1 Energy Storage System (ESS) and Load. Let's assume we know Load and PV array output power values for next periods.
So, the objective function to be minimized is OPEX, the sum of every microgrid component's operational cost at each moment t in period T:
where a, b are some operational cost coefficients, is diesel generator binary (0/1 or ON/OFF) status variable and P is output power of the microgrid component at the time t.
And here are some of the constraints (the real problem is heavily and nonlinearly constrained, so I wrote down only three of them):
Power balance
ESS maximum depth of discharge
Diesel gensets power limit
So, it's a mixed-integer problem with nonlinear constraints. I tried to adapt the problem so that it can be solved with a Genetic Algorithm. I used the pymoo Python library for multi-objective optimization with the NSGA2 algorithm. Let's consider and for this T we have some Load and PV power data:
import numpy as np
from pymoo.model.problem import FunctionalProblem
from pymoo.operators.mixed_variable_operator import MixedVariableSampling, MixedVariableMutation, MixedVariableCrossover
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation, get_termination
from pymoo.optimize import minimize

PV = np.array([10, 19.8, 16, 25, 7.8, 42.8, 10]) #PV inverter output power, kW
Load = np.array([100, 108, 150, 150, 90, 16, 170]) #Load, kW
balance_eps = 0.001 #equality constraint relaxing coefficient
DG1_pmin = 0.3 #DG1 min power
DG2_pmin = 0.3 #DG2 min power
P_dg1 = 75 #DG1 rated power, kW
P_dg2 = 75 #DG2 rated power, kW
P_PV_inv = 50 #PV inverter rated power, kW
P_ESS_inv = 30 #ESS bidirectional inverter absolute rated discharge/charge power, kW
ESS_c = 100 #ESS capacity, kWh
SOC_min = 30
SOC_max = 100
n_var = 5 #number of decision variables per time step

# NOTE: the lambdas below read the globals t and SOC_t, which are set in the optimization loop
objs = [lambda x: x[0]*x[2]*200 + x[1]*x[3]*200 + x[4]*0.002] #objective function
constr_eq = [lambda x: ((Load[t] - x[0]*x[2] - x[1]*x[3] - x[4] - PV[t] )**2)]
constr_ieq = [lambda x: -SOC_t + 100*x[4]/ESS_c + SOC_min,
              lambda x: SOC_t - 100*x[4]/ESS_c - SOC_max]
problem = FunctionalProblem(n_var, objs, constr_eq=constr_eq, constr_eq_eps=1e-03, constr_ieq=constr_ieq,
                            xl=np.array([0, 0, DG1_pmin*P_dg1, DG2_pmin*P_dg2, -P_ESS_inv]),
                            xu=np.array([1, 1, P_dg1, P_dg2, P_ESS_inv]))
mask = ["int", "int", "real", "real", "real"]
sampling = MixedVariableSampling(mask, {
    "real": get_sampling("real_random"),
    "int": get_sampling("int_random")})
crossover = MixedVariableCrossover(mask, {
    "real": get_crossover("real_sbx", prob=1.0, eta=3.0),
    "int": get_crossover("int_sbx", prob=1.0, eta=3.0)})
mutation = MixedVariableMutation(mask, {
    "real": get_mutation("real_pm", eta=3.0),
    "int": get_mutation("int_pm", eta=3.0)})
algorithm = NSGA2(
    pop_size=150,
    sampling=sampling,
    crossover=crossover,
    mutation=mutation,
    eliminate_duplicates=True)
We have n_var = 5 decision variables which are being optimized: . We should also have access to the previous value of SOC.
I wrote a loop to implement a sequential optimization chain:
x=[]
s=[]
SOC_t = 100 #SOC at t = -1
for t in range (0, 7):
    res = minimize(
        problem,
        algorithm,
        seed=1,
        termination = get_termination("n_gen", 300),
        save_history=True, verbose=False)
    SOC_t = SOC_t - 100*res.X[4]/ESS_c
    print(res.X[:2], np.around(res.X[2:].astype(np.double), 3), np.around(SOC_t, 2))
    x.append(res.X)
    s.append(SOC_t)
So, we initialized a population of size 150 for every time step t, and the individuals in those populations looked like . Running this code, I get the following optimization results:
[1 1] [27.272 34.635 28.071] 71.93
[0 1] [28.127 58.168 30. ] 41.93
[1 1] [50.95 71.423 11.599] 30.33
[1 1] [53.966 70.97 0.034] 30.3
[1 1] [24.636 59.236 -1.702] 32.0
[0 0] [40.831 29.184 -26.832] 58.83
[1 1] [68.299 63.148 28.572] 30.26
Even my limited experience with Genetic Algorithms tells me that such an approach is inappropriate and inefficient.
So, here is my question (if you're still reading my post :)
Is there a way to optimize such a problem not by consecutively optimizing a particular set of variables at each t, but by defining the individuals in the population as arrays of size (T, n_var)?
For the problem described an individual in population may look like
Is it possible to implement such an approach? If yes, how to do it in pymoo?
Thank you very much for your time! Any comments and suggestions will be appreciated.
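One possible encoding, sketched here with NumPy only and no pymoo calls: since optimizers generally work on a 1-D decision vector, a (T, n_var) individual can be flattened to length T*n_var and reshaped inside the objective and constraint functions, so that the SOC trajectory is computed for all time steps at once. The helper names (decode, total_cost, balance_violation, soc_trajectory, soc_violation) are hypothetical, the initial SOC of 100 is taken from the loop above, and whether pymoo's mixed-variable operators accept a mask tiled T times is an assumption I have not verified.

```python
import numpy as np

PV = np.array([10, 19.8, 16, 25, 7.8, 42.8, 10])     # kW
Load = np.array([100, 108, 150, 150, 90, 16, 170])   # kW
T, n_var = len(Load), 5
ESS_c, SOC_0, SOC_min, SOC_max = 100.0, 100.0, 30.0, 100.0

# Per-step bounds from the question, repeated for every time step.
xl = np.tile(np.array([0, 0, 0.3 * 75, 0.3 * 75, -30.0]), T)
xu = np.tile(np.array([1, 1, 75.0, 75.0, 30.0]), T)


def decode(x_flat):
    """Flat vector of length T*n_var -> array of shape (T, n_var)."""
    return np.asarray(x_flat, dtype=float).reshape(T, n_var)


def total_cost(x_flat):
    x = decode(x_flat)
    return float(np.sum(x[:, 0] * x[:, 2] * 200 + x[:, 1] * x[:, 3] * 200 + x[:, 4] * 0.002))


def balance_violation(x_flat):
    """Squared power-balance residual for each time step (should be ~0)."""
    x = decode(x_flat)
    supplied = x[:, 0] * x[:, 2] + x[:, 1] * x[:, 3] + x[:, 4] + PV
    return (Load - supplied) ** 2


def soc_trajectory(x_flat):
    """SOC after each step; discharging (positive ESS power) lowers it."""
    x = decode(x_flat)
    return SOC_0 - 100 * np.cumsum(x[:, 4]) / ESS_c


def soc_violation(x_flat):
    """Inequality constraints in '<= 0' form for all time steps."""
    soc = soc_trajectory(x_flat)
    return np.concatenate([SOC_min - soc, soc - SOC_max])


# The step-by-step results from the question, stacked as one individual.
individual = np.array([
    [1, 1, 27.272, 34.635, 28.071],
    [0, 1, 28.127, 58.168, 30.0],
    [1, 1, 50.95, 71.423, 11.599],
    [1, 1, 53.966, 70.97, 0.034],
    [1, 1, 24.636, 59.236, -1.702],
    [0, 0, 40.831, 29.184, -26.832],
    [1, 1, 68.299, 63.148, 28.572],
]).ravel()

print(np.around(soc_trajectory(individual), 2))  # matches the SOC column above
print(total_cost(individual), balance_violation(individual).max())
```

The design point is that the coupling between time steps (the SOC recursion) becomes a cumulative sum over one individual, so a single optimization run can trade off decisions across the whole horizon instead of fixing each step greedily.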

pymc with observations on multiple variables

I'm using an example of linear regression from Bayesian Methods for Hackers but I'm having trouble extending it to my use case.
I have observations on a random variable, an assumed distribution on that random variable, and finally another assumed distribution on that random variable for which I have observations. I have tried to model it with intermediate distributions on a and b, but it complains: Wrong number of dimensions: expected 0, got 1 with shape (788,).
To describe the actual model, I am predicting the conversion rate for a certain amount (n) of cultivating emails. My prior is that the conversion rate (described by a Beta function on alpha and beta) will be updated by having alpha and beta scaled by some factors (0,inf] a and b, which start at 1 for n=0 and increase to their max value at some threshold.
import numpy as np
import pymc3 as pm

# Generate predictive data, X and target data, Y
data = [
    {'n': 0 , 'trials': 120, 'successes': 1},
    {'n': 5 , 'trials': 111, 'successes': 2},
    {'n': 10, 'trials': 78 , 'successes': 1},
    {'n': 15, 'trials': 144, 'successes': 3},
    {'n': 20, 'trials': 280, 'successes': 7},
    {'n': 25, 'trials': 55 , 'successes': 1}]
X = np.empty(0)
Y = np.empty(0)
for dat in data:
    X = np.insert(X, 0, np.ones(dat['trials']) * dat['n'])
    target = np.zeros(dat['trials'])
    target[:dat['successes']] = 1
    Y = np.insert(Y, 0, target)
with pm.Model() as model:
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    a_slope = pm.Deterministic('a_slope', 1 + (X/n_sat)*(a_gamma-1))
    b_slope = pm.Deterministic('b_slope', 1 + (X/n_sat)*(b_gamma-1))
    a = pm.math.switch(X >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(X >= n_sat, b_gamma, b_slope)
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b)
    observed = pm.Bernoulli("observed", p, observed=Y)
Is there a way to get this to work?
Data
First, note that the total likelihood of repeated Bernoulli trials is exactly a binomial likelihood (up to a constant factor that does not depend on p), so there is no need to expand your data to individual trials. I'd also suggest using a Pandas DataFrame to manage your data, since it helps to keep things tidy:
import pandas as pd
df = pd.DataFrame({
'n': [0, 5, 10, 15, 20, 25],
'trials': [120, 111, 78, 144, 280, 55],
'successes': [1, 2, 1, 3, 7, 1]
})
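The Bernoulli/binomial equivalence mentioned above can be checked numerically. The sketch below uses only NumPy and the standard library (no PyMC), takes the n = 20 row of the data as an example, and shows that the two log-likelihoods differ by a constant that does not depend on p, so both give the same posterior. The function names are mine.

```python
import numpy as np
from math import comb, log

trials, successes = 280, 7  # the n = 20 row of the data


def bernoulli_loglik(p):
    """Sum of per-trial Bernoulli log-likelihoods on the expanded 0/1 data."""
    y = np.zeros(trials)
    y[:successes] = 1
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def binomial_loglik(p):
    """Binomial log-pmf of the aggregated counts."""
    return (log(comb(trials, successes))
            + successes * log(p)
            + (trials - successes) * log(1 - p))


# The gap is log C(trials, successes) for every p.
for p in (0.01, 0.025, 0.05):
    print(p, binomial_loglik(p) - bernoulli_loglik(p))
```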
Solution
This will help simplify the model, but the solution really is to add a shape argument to the p random variable so that PyMC3 knows how to interpret the one-dimensional parameters. The fact is that you do want a different p distribution for each n case you have, so there is nothing conceptually wrong here.
with pm.Model() as model:
    # conversion rate hyperparameters
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    # switchpoint prior
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    # NB: I removed pm.Deterministic b/c (a|b)_slope[0] is constant
    # and this causes issues when using ArViZ
    a_slope = 1 + (df.n.values/n_sat)*(a_gamma-1)
    b_slope = 1 + (df.n.values/n_sat)*(b_gamma-1)
    a = pm.math.switch(df.n.values >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(df.n.values >= n_sat, b_gamma, b_slope)
    # conversion rates
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b, shape=len(df.n))
    # observations
    pm.Binomial("observed", n=df.trials, p=p, observed=df.successes)
    trace = pm.sample(5000, tune=10000)
This samples nicely
and yields reasonable intervals on the conversion rates
but the fact that the posteriors for alpha_n and beta_n go right up to your prior boundaries is a bit concerning:
I think the reason for this is that, for each condition, you only do 55-280 trials. If the conditions were independent (worst case), conjugacy would tell us that your Beta hyperparameters should be in that range. Since you are doing a regression, the best-case scenario for information sharing across the trials would put your hyperparameters in the range of the sum of trials (788), but that's an upper limit. Because you're outside this range, the concern here is that you're forcing the model to be more precise in its estimates than you really have the evidence to support. However, one can justify this if the prior is based on strong independent evidence.
Otherwise, I'd suggest expanding the ranges on those priors that affect the final alpha*a and beta*b numbers (the sums of those should be close to your trial counts in the posterior).
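The arithmetic behind that concern can be laid out in a few lines. This is only an illustration under a simplifying assumption: the scale factors a and b are taken to be 1 (their value at n = 0), so the Beta concentration is just alpha + beta, read as a pseudo-count of prior "trials".

```python
import numpy as np

trials = np.array([120, 111, 78, 144, 280, 55])

# Bounds of the Uniform priors on alpha_n and beta_n in the model above.
alpha_lo, alpha_hi = 5, 13
beta_lo, beta_hi = 1000, 1400

# Assumed pseudo-count of the Beta prior on p when a = b = 1.
prior_count_min = alpha_lo + beta_lo
prior_count_max = alpha_hi + beta_hi

print("trials per condition:", trials.min(), "to", trials.max())
print("total trials:", trials.sum())
print("prior pseudo-counts:", prior_count_min, "to", prior_count_max)
```

Even the smallest prior pseudo-count exceeds the total number of observed trials, which is why the prior, rather than the data, ends up setting the precision of the estimates.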
Alternative Model
I'd probably do something along the following lines, which I think has a more transparent parameterization, though it's not completely identical to your model:
with pm.Model() as model_br_sp:
    # regression coefficients
    alpha = pm.Normal("alpha", mu=0, sd=1)
    beta = pm.Normal("beta", mu=0, sd=1)
    # saturation parameters
    saturation_point = pm.Gamma("saturation_point", alpha=20, beta=2)
    max_success_rate = pm.Beta("max_success_rate", 1, 9)
    # probability of conversion
    success_rate = pm.Deterministic("success_rate",
        pm.math.switch(df.n.values > saturation_point,
                       max_success_rate,
                       max_success_rate*pm.math.sigmoid(alpha + beta*df.n)))
    # observations
    pm.Binomial("successes", n=df.trials, p=success_rate, observed=df.successes)
    trace_br_sp = pm.sample(draws=5000, tune=10000)
Here we map the predictor space to probability space through a sigmoid that maxes out at the maximum success rate. The prior on the saturation point is identical to yours, while that on the maximum success rate is weakly informative (Beta[1,9] - though I will say it runs on a flat prior nearly as well). This also samples well,
and gives similar intervals (though the switchpoint seems to dominate more):
We can compare the two models and see that there isn't a significant difference in their explanatory power:
import arviz as az
model_compare = az.compare({'Binomial Regression w/ Switchpoint': trace_br_sp,
                            'Original Model': trace})
az.plot_compare(model_compare)

Robust Linear Model - No exogenous var, just constants

I'm doing a robust linear regression on only a constant (a column of 1s) and no exogenous variable. I'm able to calculate the model just fine by inputting a list of 1's equal to the size of the 'xi_list' from the code snippet below.
import numpy as np
import statsmodels.api as sm

def sigma_and_miu(gvkey, statevar_dict):
    statevar_list = statevar_dict[gvkey]
    xi_list = [np.log(statevar_list[i]) - np.log(statevar_list[i-1]) for i in range(1, len(statevar_list))]
    x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    y = np.array(xi_list)
    rlm_model = sm.RLM(y, x, M=sm.robust.norms.HuberT())
    rlm_results = rlm_model.fit()
    sigma = np.std(rlm_results.resid * rlm_results.weights)
    miudelta = rlm_results.params[0] + (0.5 * sigma ** 2)
    return miudelta, sigma
This function is run with the following inputs.
dict = {1004:[1796.6, 1938.6, 2085.4, 2009.4, 1906.1, 2002.2, 2164.9, 2478.8, 2357.4, 2662.1, 2911.2, 2400.4, 2535.9, 2812.3, 2873.1, 2775.5, 3374.2, 3345.5, 3466.3, 2409.4]}
key = 1004
miu, sigma = sigma_and_miu(key,dict)
However, I'm looking for a more scalable approach. One solution could be a loop that appends as many 1's as the length of the xi_list variable, but this does not seem very efficient.
I know there is sm.add_constant(), and I tried adding this constant to my 'y' variable and leaving 'x' blank in the sm.RLM() call. The model then fails to run.
So my question is: is there a better way to create the list of 1s, or should I just go for the loop?
Use basic vectorized NumPy computation, e.g.
statevar = np.asarray(statevar_list)
y = np.log(statevar[1:]) - np.log(statevar[:-1])
x = np.ones(len(y))
Aside: The rlm_results should have the robust estimate of the standard deviation that is used in the estimation as a scale attribute.
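To confirm that the vectorized version is a drop-in replacement, the sketch below compares it against the list comprehension from the question on the same data, using NumPy only. np.diff(np.log(...)) is used here as an equivalent spelling of the slicing shown above; the variable names follow the question.

```python
import numpy as np

statevar_list = [1796.6, 1938.6, 2085.4, 2009.4, 1906.1, 2002.2, 2164.9,
                 2478.8, 2357.4, 2662.1, 2911.2, 2400.4, 2535.9, 2812.3,
                 2873.1, 2775.5, 3374.2, 3345.5, 3466.3, 2409.4]

# Original approach: explicit loop over consecutive pairs.
xi_list = [np.log(statevar_list[i]) - np.log(statevar_list[i - 1])
           for i in range(1, len(statevar_list))]

# Vectorized approach: log-differences in one call.
statevar = np.asarray(statevar_list)
y = np.diff(np.log(statevar))

# Constant regressor sized from y, so it scales with any input length.
x = np.ones(len(y))

print(np.allclose(y, xi_list), x.shape)
```

Because x is derived from len(y), the same function works for any gvkey regardless of how many observations its series has.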

Theano - Sum by group

I'm working on a custom likelihood function for Theano (Attempting to fit a conditional logistic regression.)
The likelihood requires summing values by group. In R we have the "ave()" function, in Python Pandas we have "groupby()". How would I do something similar in Theano?
Edited for more detail
I want to create a Cox proportional hazards model (same as conditional logistic regression). The log likelihood requires the sum of values by group:
In Pandas, this would be:
temp = df.groupby('groupid')['eta'].aggregate(np.sum)
denominator = np.log(temp).sum()
In the data, we have a column with group ID, and the values to be summed
group eta
1 2.1
1 1.8
1 0.9
2 1.2
2 0.75
2 1.42
The output for the group sums would then be:
group sum
1 4.8
2 3.37
Then, the sum of the log of the sums:
log(4.8) + log(3.37) = 2.7835
This is quick and easy to do in Pandas. How can I do something similar in Theano? Sure, I could write a nested loop, but I try to avoid manually coded loops when possible as they are slow.
Thanks!
Let's say you have "X" (a list of all your etas), with the dim. Nx1 (I guess), and a matrix H. H is an NxG matrix that holds a one-hot encoding of the groups.
Then you can write something like:
import numpy as np
from numpy import newaxis as na
import theano.tensor as T
X = T.vector()
H = T.matrix()
tmp = T.sum(X[:, na] * H, axis=0)
O = T.sum(T.log(tmp))
x = np.array([5, 10, 10, 0.5, 5, 0.5])
# create a 1-hot encoding
g = np.array([1, 2, 2, 0, 1, 0])
h = np.zeros(shape=(len(x), 3))
for i,j in enumerate(g):
    h[i,j] = 1.0
O.eval({X:x, H: h})
This should work as long as every group contains at least one eta (otherwise that group's sum is 0 and its log is -inf).
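The same one-hot trick can be checked in plain NumPy on the example data from the question, without Theano. This is a sketch of the idea only: np.unique with return_inverse is used to map arbitrary group IDs to column indices, and np.bincount serves as an independent cross-check of the group sums.

```python
import numpy as np

eta = np.array([2.1, 1.8, 0.9, 1.2, 0.75, 1.42])
group = np.array([1, 1, 1, 2, 2, 2])

# Map group IDs to 0-based column indices.
labels, codes = np.unique(group, return_inverse=True)

# N x G one-hot matrix: row i has a 1 in the column of its group.
one_hot = np.eye(len(labels))[codes]

# Broadcasting eta over the columns and summing rows gives per-group sums.
group_sums = (eta[:, None] * one_hot).sum(axis=0)
denominator = np.log(group_sums).sum()

print(group_sums)    # per-group sums: 4.8 and 3.37
print(denominator)   # log(4.8) + log(3.37), about 2.7835

# Independent check of the group sums.
print(np.allclose(group_sums, np.bincount(codes, weights=eta)))
```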
