pymc with observations on multiple variables - python

I'm working from a linear regression example in Bayesian Methods for Hackers, but I'm having trouble extending it to my use case.
I have observations on a random variable, an assumed distribution on that random variable, and finally another assumed distribution on that random variable for which I have observations. I have tried to model it with intermediate distributions on a and b, but it complains: Wrong number of dimensions: expected 0, got 1 with shape (788,).
To describe the actual model: I am predicting the conversion rate for a certain number (n) of cultivating emails. My prior is that the conversion rate (described by a Beta distribution on alpha and beta) will be updated by scaling alpha and beta by factors a and b in (0, inf], which start at 1 for n = 0 and increase to their maximum values at some threshold.
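For concreteness, this is the scaling scheme I have in mind, sketched in plain NumPy (here n_sat and f_max are just placeholders standing in for the prior draws described above):

import numpy as np

def scale_factor(n, n_sat, f_max):
    # 1 at n = 0, linear ramp up to f_max at the threshold n_sat,
    # then constant at f_max beyond it
    return np.where(n >= n_sat, f_max, 1 + (n / n_sat) * (f_max - 1))

print(scale_factor(np.array([0, 5, 10, 15, 20, 25]), n_sat=10.0, f_max=1.2))
# [1.  1.1 1.2 1.2 1.2 1.2]

My attempt so far: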
# Generate predictor data, X, and target data, Y
import numpy as np
import pymc3 as pm

data = [
    {'n': 0 , 'trials': 120, 'successes': 1},
    {'n': 5 , 'trials': 111, 'successes': 2},
    {'n': 10, 'trials': 78 , 'successes': 1},
    {'n': 15, 'trials': 144, 'successes': 3},
    {'n': 20, 'trials': 280, 'successes': 7},
    {'n': 25, 'trials': 55 , 'successes': 1}]

X = np.empty(0)
Y = np.empty(0)
for dat in data:
    X = np.insert(X, 0, np.ones(dat['trials']) * dat['n'])
    target = np.zeros(dat['trials'])
    target[:dat['successes']] = 1
    Y = np.insert(Y, 0, target)

with pm.Model() as model:
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)
    a_slope = pm.Deterministic('a_slope', 1 + (X/n_sat)*(a_gamma-1))
    b_slope = pm.Deterministic('b_slope', 1 + (X/n_sat)*(b_gamma-1))
    a = pm.math.switch(X >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(X >= n_sat, b_gamma, b_slope)
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b)
    observed = pm.Bernoulli("observed", p, observed=Y)
Is there a way to get this to work?

Data
First, note that the total likelihood of repeated Bernoulli trials is exactly a binomial likelihood, so there is no need to expand your data to individual trials. I'd also suggest using a Pandas DataFrame to manage your data; it helps to keep things tidy:
import pandas as pd

df = pd.DataFrame({
    'n': [0, 5, 10, 15, 20, 25],
    'trials': [120, 111, 78, 144, 280, 55],
    'successes': [1, 2, 1, 3, 7, 1]
})
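As a quick sanity check (purely descriptive, not part of the model), the empirical conversion rates fall right out of this frame:

df["rate"] = df.successes / df.trials
print(df[["n", "rate"]])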
Solution
This will help simplify the model, but the real solution is to add a shape argument to the p random variable, so that PyMC3 knows how to interpret the one-dimensional parameters. You do want a different p distribution for each n case you have, so there is nothing conceptually wrong here.
with pm.Model() as model:
    # conversion rate hyperparameters
    alpha = pm.Uniform("alpha_n", 5, 13)
    beta = pm.Uniform("beta_n", 1000, 1400)

    # switchpoint prior
    n_sat = pm.Gamma("n_sat", alpha=20, beta=2, testval=10)
    a_gamma = pm.Gamma("a_gamma", alpha=18, beta=15)
    b_gamma = pm.Gamma("b_gamma", alpha=18, beta=27)

    # NB: I removed pm.Deterministic b/c (a|b)_slope[0] is constant
    # and this causes issues when using ArviZ
    a_slope = 1 + (df.n.values/n_sat)*(a_gamma-1)
    b_slope = 1 + (df.n.values/n_sat)*(b_gamma-1)
    a = pm.math.switch(df.n.values >= n_sat, a_gamma, a_slope)
    b = pm.math.switch(df.n.values >= n_sat, b_gamma, b_slope)

    # conversion rates
    p = pm.Beta("p", alpha=alpha*a, beta=beta*b, shape=len(df.n))

    # observations
    pm.Binomial("observed", n=df.trials, p=p, observed=df.successes)

    trace = pm.sample(5000, tune=10000)
This samples nicely and yields reasonable intervals on the conversion rates, but the fact that the posteriors for alpha_n and beta_n go right up to your prior boundaries is a bit concerning.
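If you want to reproduce those diagnostics yourself, a minimal sketch with ArviZ (assuming the trace from the model above) would be:

import arviz as az

az.plot_trace(trace, var_names=["alpha_n", "beta_n", "n_sat"])
az.plot_posterior(trace, var_names=["p"])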
I think the reason for this is that, for each condition, you only run 55-280 trials. If the conditions were independent (worst case), conjugacy would tell us that your Beta hyperparameters should be in that range. Since you are doing a regression, the best-case scenario for information sharing across conditions would put your hyperparameters in the range of the sum of trials (788), but that's an upper limit. Because you're outside this range, the concern is that you're forcing the model to be more precise in its estimates than you really have the evidence to support. However, one can justify this if the prior is based on strong independent evidence.
Otherwise, I'd suggest expanding the ranges on those priors that affect the final alpha*a and beta*b numbers (the sums of those should be close to your trial counts in the posterior).
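For example (a hedged sketch; these wider bounds are arbitrary illustrations, not recommendations):

# wider hyperpriors, letting the data pull alpha*a and beta*b toward
# values consistent with the observed trial counts
alpha = pm.Uniform("alpha_n", 1, 50)
beta = pm.Uniform("beta_n", 100, 2000)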
Alternative Model
I'd probably do something along the following lines, which I think has a more transparent parameterization, though it's not completely identical to your model:
with pm.Model() as model_br_sp:
    # regression coefficients
    alpha = pm.Normal("alpha", mu=0, sd=1)
    beta = pm.Normal("beta", mu=0, sd=1)

    # saturation parameters
    saturation_point = pm.Gamma("saturation_point", alpha=20, beta=2)
    max_success_rate = pm.Beta("max_success_rate", 1, 9)

    # probability of conversion
    success_rate = pm.Deterministic("success_rate",
        pm.math.switch(df.n.values > saturation_point,
                       max_success_rate,
                       max_success_rate*pm.math.sigmoid(alpha + beta*df.n.values)))

    # observations
    pm.Binomial("successes", n=df.trials, p=success_rate, observed=df.successes)

    trace_br_sp = pm.sample(draws=5000, tune=10000)
Here we map the predictor space to probability space through a sigmoid that maxes out at the maximum success rate. The prior on the saturation point is identical to yours, while that on the maximum success rate is weakly informative (Beta[1,9], though I will say it runs on a flat prior nearly as well). This also samples well and gives similar intervals (though the switchpoint seems to dominate more).
We can compare the two models and see that there isn't a significant difference in their explanatory power:
import arviz as az

model_compare = az.compare({'Binomial Regression w/ Switchpoint': trace_br_sp,
                            'Original Model': trace})
az.plot_compare(model_compare)

Related

How to calculate multi-variable nonlinear regression in python?

I am trying to create a program for non-linear regression. I have three parameters [R, G, B] and I want to obtain the temperature of any pixel in an image with respect to my reference color codes. For example:
Reference file R,G,B,Temperature = [(157,158,19,300),(146,55,18,320),(136,57,22,340),(133,88,25,460),(141,105,27,500),(210,195,3,580),(203,186,10,580),(214,195,4,600),(193,176,10,580)]
As you can see above, all RGB values change non-linearly. Currently I use a "minimum error algorithm" to obtain the temperature for a given RGB color code, but I also want to obtain values that do not exist in the reference file (i.e., if I have (155,200,40), which does not exist in the reference file, I still need to determine which temperature these three codes correspond to).
Here is the code to select the closest reference temperature given an RGB value:
from math import sqrt

referenceColoursRGB = [(157,158,19),
                       (146,55,18),
                       (136,57,22),
                       (133,88,25),
                       (141,105,27),
                       (203,186,10),
                       (214,195,4)]
referenceTemperatures = [
    300,
    320,
    340,
    460,
    500,
    580,
    600]

def closest_color(rgb):
    r, g, b = rgb
    color_diffs = []
    for color in referenceColoursRGB:
        cr, cg, cb = color
        color_diff = sqrt((r - cr)**2 + (g - cg)**2 + (b - cb)**2)
        color_diffs.append((color_diff, color))
    minErrorIndex = color_diffs.index(min(color_diffs))
    return minErrorIndex

temperatureLocation = closest_color((149, 60, 25))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature : 320

temperatureLocation = closest_color((220, 145, 4))
print("Temperature : ", referenceTemperatures[temperatureLocation])
# => Temperature : 580
I really want to calculate temperature values that don't appear in the reference list, but I am having problems using all three RGB values to calculate/predict reasonable, accurate temperatures from them.
I tried to reduce the RGB triple to one parameter and then use polyfit, but there is a problem: every channel has the same kind of effect on that one parameter, so different colors can collide. For example, with "oneParameter = 1000*R + 100*G + 10*B", the color codes (2,20,50) and (2,5,200) come out equal with respect to the "oneParameter" equation, so I can't tell which color code is which.
I hope I explained my problem clearly. I look forward to your help!
Thank you.
N.B.: I can't vouch for the physical accuracy of this prediction, but this might be along the lines of what you're looking for. I.e., this makes the predictions match your reference data exactly, but I have no idea how accurate the temperature predictions might be for non-reference RGB colors. If I knew the exact physics of the mapping from RGB to temperature, I'd use that.
Bad Model 1
One simple way to do nonlinear regression is to preprocess your data so that you have nonlinear terms for the regression. sklearn has a built-in preprocessing function that does this by generating powers and interactions of the original input data.
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

referenceColoursRGB = [(157,158,19),
                       (146,55,18),
                       (136,57,22),
                       (133,88,25),
                       (141,105,27),
                       (203,186,10),
                       (214,195,4)]
referenceTemperatures = [
    300,
    320,
    340,
    460,
    500,
    580,
    600]

poly = PolynomialFeatures(degree=2)
poly_RGB = poly.fit_transform(referenceColoursRGB)
ols = linear_model.LinearRegression()
ols.fit(poly_RGB, referenceTemperatures)
ols.predict(poly_RGB)
# array([300., 320., 340., 460., 500., 580., 600.])
To make non-reference RGB predictions, you would do something like:
ols.predict(poly.transform([(149, 60, 25)]))
# array([369.68980598])
ols.predict(poly.transform([(220, 145, 4)]))
# array([949.34548347])
EDIT: Bad Model 2
Previously, I picked something simple to implement a nonlinear fit, using PolynomialFeatures without regard to any real physics that might be going on at the RGB sensor; you can decide if it fits your needs. Well, here's another model that uses RGB ratios, again without regard to whatever physics is happening. Once more, you can decide if this model is appropriate.
rat_RGB = [(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in referenceColoursRGB]
rat_ols = linear_model.LinearRegression()
rat_ols.fit(rat_RGB, referenceTemperatures)
rat_ols.predict(rat_RGB)
# array([300., 320., 340., 460., 500., 580., 600.])
You can see that this model can also be fit perfectly to the reference data. It's interesting, and probably important to note that the other example predictions produce different temperatures with this model.
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(149, 60, 25)]])
# array([481.79424789])
rat_ols.predict([(r, g, b, r/g, r/b, g/r, g/b, b/r, b/g) for r,g,b in [(220, 145, 4)]])
# array([653.06116368])
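If it helps to choose between the two, leave-one-out cross-validation is about the only honest score you can compute from 7 reference points. Here's a sketch (assuming mean absolute error is a sensible metric for your use case):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

y = np.array(referenceTemperatures)

# Bad Model 1: polynomial features; Bad Model 2: RGB ratios
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_scores = cross_val_score(poly_model, np.array(referenceColoursRGB), y,
                              cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
rat_scores = cross_val_score(LinearRegression(), np.array(rat_RGB), y,
                             cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print("Polynomial LOO MAE:", -poly_scores.mean())
print("RGB-ratio LOO MAE:", -rat_scores.mean())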
I hope you can find/develop an RGB/temperature model that is physics based. I am wondering if the manufacturer of your RGB sensor has specifications and/or engineering notes that might help.

Genetic Algorithm Population Individual as Array

I don't have much experience using Genetic Algorithms, so I would like to ask the community for some useful comments. I apologize for my terminology errors; please correct me where needed.
The problem I want to optimize is optimal power flow in an islanded microgrid. In this simple microgrid we have 2 diesel generators (DG), 1 PV array, 1 Energy Storage System (ESS), and a Load. Let's assume we know the Load and PV array output power values for the coming periods.
So, the objective function to be minimized is OPEX, the sum of every microgrid component's operational cost at each moment t in period T:

OPEX = sum_{t in T} sum_i u_{i,t} * (a_i + b_i * P_{i,t})

where a, b are some operational cost coefficients, u is the diesel generator binary (0/1, i.e. ON/OFF) status variable, and P is the output power of the microgrid component at time t.
And here are some of the constraints (the real problem is heavily and nonlinearly constrained, so I wrote down only three of them):
Power balance
ESS maximum depth of discharge
Diesel gensets' power limits
So, it's a mixed-integer problem with nonlinear constraints. I tried to adapt the problem for a Genetic Algorithm, using the pymoo Python library for multiobjective optimization with the NSGA2 algorithm. Let's consider T = 7; for this T we have some Load and PV power data:
import numpy as np
from pymoo.model.problem import FunctionalProblem
from pymoo.operators.mixed_variable_operator import MixedVariableSampling, MixedVariableMutation, MixedVariableCrossover
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation, get_termination
from pymoo.optimize import minimize

PV = np.array([10, 19.8, 16, 25, 7.8, 42.8, 10])    # PV inverter output power, kW
Load = np.array([100, 108, 150, 150, 90, 16, 170])  # Load, kW
balance_eps = 0.001  # equality constraint relaxing coefficient
DG1_pmin = 0.3       # DG1 min power (fraction of rated)
DG2_pmin = 0.3       # DG2 min power (fraction of rated)
P_dg1 = 75           # DG1 rated power, kW
P_dg2 = 75           # DG2 rated power, kW
P_PV_inv = 50        # PV inverter rated power, kW
P_ESS_inv = 30       # ESS bidirectional inverter absolute rated discharge/charge power, kW
ESS_c = 100          # ESS capacity, kWh
SOC_min = 30
SOC_max = 100
n_var = 5

# objective function
objs = [lambda x: x[0]*x[2]*200 + x[1]*x[3]*200 + x[4]*0.002]
# relaxed power-balance equality constraint
constr_eq = [lambda x: ((Load[t] - x[0]*x[2] - x[1]*x[3] - x[4] - PV[t])**2)]
# SOC limits
constr_ieq = [lambda x: -SOC_t + 100*x[4]/ESS_c + SOC_min,
              lambda x: SOC_t - 100*x[4]/ESS_c - SOC_max]

problem = FunctionalProblem(n_var, objs, constr_eq=constr_eq, constr_eq_eps=1e-03, constr_ieq=constr_ieq,
                            xl=np.array([0, 0, DG1_pmin*P_dg1, DG2_pmin*P_dg2, -P_ESS_inv]),
                            xu=np.array([1, 1, P_dg1, P_dg2, P_ESS_inv]))

mask = ["int", "int", "real", "real", "real"]
sampling = MixedVariableSampling(mask, {
    "real": get_sampling("real_random"),
    "int": get_sampling("int_random")})
crossover = MixedVariableCrossover(mask, {
    "real": get_crossover("real_sbx", prob=1.0, eta=3.0),
    "int": get_crossover("int_sbx", prob=1.0, eta=3.0)})
mutation = MixedVariableMutation(mask, {
    "real": get_mutation("real_pm", eta=3.0),
    "int": get_mutation("int_pm", eta=3.0)})

algorithm = NSGA2(
    pop_size=150,
    sampling=sampling,
    crossover=crossover,
    mutation=mutation,
    eliminate_duplicates=True)
mask = ["int", "int", "real", "real", "real"]
sampling = MixedVariableSampling(mask, {
"real": get_sampling("real_random"),
"int": get_sampling("int_random")})
crossover = MixedVariableCrossover(mask, {
"real": get_crossover("real_sbx", prob=1.0, eta=3.0),
"int": get_crossover("int_sbx", prob=1.0, eta=3.0)})
mutation = MixedVariableMutation(mask, {
"real": get_mutation("real_pm", eta=3.0),
"int": get_mutation("int_pm", eta=3.0)})
algorithm = NSGA2(
pop_size=150,
sampling=sampling,
crossover=crossover,
mutation=mutation,
eliminate_duplicates=True)
We have n_var = 5 decision variables being optimized: the two DG on/off statuses, the two DG output powers, and the ESS charge/discharge power. We should also have access to the previous value of SOC.
I wrote the following loop to implement a consecutive optimization chain:

x = []
s = []
SOC_t = 100  # SOC at t = -1
for t in range(0, 7):
    res = minimize(
        problem,
        algorithm,
        seed=1,
        termination=get_termination("n_gen", 300),
        save_history=True, verbose=False)
    SOC_t = SOC_t - 100*res.X[4]/ESS_c
    print(res.X[:2], np.around(res.X[2:].astype(np.double), 3), np.around(SOC_t, 2))
    x.append(res.X)
    s.append(SOC_t)
So, we have initialized populations of size 150 for every time step t, and each individual in those populations looks like [u1, u2, P_dg1, P_dg2, P_ess]. Running this code, I get these optimization results:
[1 1] [27.272 34.635 28.071] 71.93
[0 1] [28.127 58.168 30. ] 41.93
[1 1] [50.95 71.423 11.599] 30.33
[1 1] [53.966 70.97 0.034] 30.3
[1 1] [24.636 59.236 -1.702] 32.0
[0 0] [40.831 29.184 -26.832] 58.83
[1 1] [68.299 63.148 28.572] 30.26
Even my little experience with Genetic Algorithms allows me to state that such an approach is inappropriate and inefficient.
So, here is my question (if you're still reading my post :)
Is there a way to optimize such a problem not by consecutively optimizing a particular variable set at each t, but by defining each individual in the population as an array of size (T, n_var)?
For the problem described, such an individual would be a (7, 5) array whose row t holds [u1, u2, P_dg1, P_dg2, P_ess] for time step t.
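To illustrate the encoding I have in mind (plain NumPy, not pymoo API; the objective just restates the OPEX sum over the whole horizon):

import numpy as np

T, n_var = 7, 5

def opex(x_flat):
    x = x_flat.reshape(T, n_var)  # row t = [u1, u2, P_dg1, P_dg2, P_ess] at time t
    return np.sum(x[:, 0]*x[:, 2]*200 + x[:, 1]*x[:, 3]*200 + x[:, 4]*0.002)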
Is it possible to implement such an approach? If yes, how can it be done in pymoo?
Thank you very much for your time! Any comments and suggestions will be appreciated.

ARIMA forecast gives different results with new python statsmodels

I'm doing out-of-sample forecasting with ARIMA(0,1,0).
In Python's statsmodels (latest stable version, 0.12), I calculate:
import statsmodels.tsa.arima_model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0))
model_fit = model.fit(disp=0)
forecast, _, intervals = model_fit.forecast(steps=steps, exog=None, alpha=alpha)
which results in
forecast = [21.125, 23.25, 25.375, 27.5]
intervals = [[19.5950036, 22.6549964 ], [21.08625835, 25.41374165], [22.72496851, 28.02503149], [24.44000721, 30.55999279]]
and a FutureWarning, which says:
FutureWarning:
statsmodels.tsa.arima_model.ARMA and statsmodels.tsa.arima_model.ARIMA have
been deprecated in favor of statsmodels.tsa.arima.model.ARIMA (note the .
between arima and model) and
statsmodels.tsa.SARIMAX. These will be removed after the 0.12 release.
In the new version, as hinted to in the Future Warning, I calculate:
import statsmodels.tsa.arima.model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0))
model_fit = model.fit()
forecast = model_fit.get_forecast(steps=steps)
forecasts_and_intervals = forecast.summary_frame(alpha=alpha)
which gives different results:
forecasts_and_intervals =
y mean mean_se mean_ci_lower mean_ci_upper
0 19.0 2.263842 14.562951 23.437049
1 19.0 3.201556 12.725066 25.274934
2 19.0 3.921089 11.314806 26.685194
3 19.0 4.527684 10.125903 27.874097
I would like to obtain the same results as before.
Am I using the new interface correctly?
I need both the forecast and the intervals.
I already tried the different functions the new interface offers, such as forecast.
In particular I'm wondering why the forecast result is 19 for the entire list.
Many thanks for any help.
Here is the documentation for statsmodels 0.12.2: https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_model.ARIMA.html?highlight=arima#statsmodels.tsa.arima_model.ARIMA
Here is the documentation for the newer version of ARIMA:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html?highlight=arima#statsmodels.tsa.arima.model.ARIMA
The difference is due to whether the models include a "constant" term or not. The first one, the older statsmodels.tsa.arima_model.ARIMA, automatically includes a constant term (with no option to turn it on or off). If you have differencing, it still includes the constant, but does so in the differenced domain (otherwise it would be eliminated anyway). So here is its ARIMA(0, 1, 0) model:
y_t - y_{t-1} = c + e_t
which is "random walk with drift".
For the new statsmodels.tsa.arima.model.ARIMA, as the documentation you linked says, no trend term of any kind (including a constant, i.e., c) is included when differencing is involved, which is the case for you. So here is its ARIMA(0, 1, 0) model:
y_t - y_{t-1} = e_t
which is "random walk" and as we know, forecasts from it corresponds to naive forecasts i.e. repeating the last value (19 in your case).
Then, what can you do to make the new one behave the same way?
It includes a parameter called trend which you can specify to get the same behaviour. Since you are using differencing (d=1), passing trend="t" should give the same model as the old one ("t" means linear trend, but since d=1, it reduces to a constant in the differenced domain):
import statsmodels.tsa.arima.model as stats
time_series = [2, 3.0, 5, 7, 9, 11, 13, 17, 19]
steps = 4
alpha = 0.05
model = stats.ARIMA(time_series, order=(0, 1, 0), trend="t") # only change is here!
model_fit = model.fit()
forecast = model_fit.get_forecast(steps=steps)
forecasts_and_intervals = forecast.summary_frame(alpha=alpha)
and here is what I get for forecasts_and_intervals:
y mean mean_se mean_ci_lower mean_ci_upper
0 21.124995 0.780622 19.595004 22.654986
1 23.249990 1.103966 21.086256 25.413724
2 25.374985 1.352077 22.724962 28.025008
3 27.499980 1.561244 24.439997 30.559963
I think this raises another issue. I'm not sure exogenous variables are treated the same in the new arima.model version. I believe that in the old version, arima_model, they are applied at the order of differencing: for (0,0,0), y = m*x + b, while for (0,1,0), dy = m*x + b.

Given an existing distribution, how can I draw samples of size N with std of X?

I have an existing distribution of values and I want to draw samples of size 5, but those 5 samples need to have a std of X within some tolerance. For example, I need 5 samples that have a std of 10 (even though the overall distribution has std ≈ 32).
The example code below somewhat works, but it is quite slow for large datasets. It randomly samples the distribution until it finds something close to the target std, then removes those elements so they can't be drawn again.
Is there a smarter way to do this properly and faster? It works OK for some values of target_std (above 6), but it isn't accurate below 6.
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(23)

# Create a distribution
d1 = np.random.normal(95, 5, 200)
d2 = np.random.normal(125, 5, 200)
d3 = np.random.normal(115, 10, 200)
d4 = np.random.normal(70, 10, 100)
d5 = np.random.normal(160, 5, 200)
d6 = np.random.normal(170, 20, 100)
dist = np.concatenate((d1, d2, d3, d4, d5, d6))
print(f"Full distribution: len={len(dist)}, mean={np.mean(dist)}, std={np.std(dist)}")
plt.hist(dist, bins=100)
plt.title("Full Distribution")
plt.show()

batch_size = 5
num_batches = math.ceil(len(dist)/batch_size)
target_std = 10
tolerance = 1
# how many candidate samples to search per batch
num_samples = 100

result = []
# Find samples of batch_size that are closest to target_std
for i in range(num_batches):
    samples = []
    idxs = np.arange(len(dist))
    for j in range(num_samples):
        indices = np.random.choice(idxs, size=batch_size, replace=False)
        sample = dist[indices]
        std = sample.std()
        err = abs(std - target_std)
        samples.append((sample, indices, std, err, np.mean(sample), max(sample), min(sample)))
        if err <= tolerance:
            # close enough, stop sampling
            break
    # sort by smallest err first, then take the first/best result
    samples = sorted(samples, key=lambda x: x[3])
    best = samples[0]
    if i % 100 == 0:
        print(f"{i}, std={best[2]}, err={best[3]}, nsamples={num_samples}")
    result.append(best)
    # remove the data from our source
    dist = np.delete(dist, best[1])

df_samples = pd.DataFrame(result, columns=["sample", "indices", "std", "err", "mean", "max", "min"])
df_samples["err"].plot(title="Errors (target_std - batch_std)")

batch_std = df_samples["std"].mean()
batch_err = df_samples["err"].mean()
print(f"RESULT: Target std: {target_std}, Mean batch std: {batch_std}, Mean batch err: {batch_err}")
Since your problem is not restricted to a certain distribution, I use a normal distribution here, but this should work for any distribution. However, the run time will depend on the population size.
import numpy as np

population = np.random.randn(1000)*32
std = 10.
tol = 1.
n_samples = 5

samples = list(np.random.choice(population, n_samples))
while True:
    center = np.mean(samples)
    dis = [abs(i-center) for i in samples]
    if np.std(samples) > (std+tol):
        samples.pop(dis.index(max(dis)))
    elif np.std(samples) < (std-tol):
        samples.pop(dis.index(min(dis)))
    else:
        break
    samples.append(np.random.choice(population, 1)[0])
Here is how the code works.
First, we draw n_samples; most likely their std is not in the range you want, so we calculate the mean and each sample's absolute distance from it. If the std is larger than the desired value plus tolerance, we kick out the furthest sample and draw a new one, and vice versa.
Note that if this takes too much time for your data, then after kicking the outlier out you can calculate the range in which the next element should lie and draw it from that part of the population, instead of drawing randomly. Hopefully this works for you.
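That refinement might look something like the following sketch, which solves for the value x that would restore the target std (scipy.optimize.brentq and the bracket choice are my assumptions, not the only option):

import numpy as np
from scipy.optimize import brentq

def value_for_target_std(kept, target_std):
    # find x such that std(kept + [x]) == target_std;
    # assumes the current std is below target, so f changes sign on the bracket
    f = lambda x: np.std(np.append(kept, x)) - target_std
    center = np.mean(kept)
    hi = center + 10 * target_std  # crude upper bracket; widen if f(hi) < 0
    return brentq(f, center, hi)

One could then draw the next sample from the population near that value instead of uniformly at random.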
DISCLAIMER: This is not a random draw anymore, and you should be aware that the draw is biased and is not representative of the population.

Simulations Confidence Interval Not Equal to conf_int Results

Given this simulated data:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
We build an UnobservedComponents model with the first 70 points and run inference on the last 30 points like so:
model = UnobservedComponents(y[:70], level='llevel', exog=X[:70])
f_model = model.fit()
forecaster = f_model.get_forecast(
    steps=30,
    exog=X[70:].reshape(-1, 1)
)
conf_int = forecaster.conf_int()
If we observe the mean for the 95% confidence interval, we get the following:
conf_int.mean(axis=0)
array([118.19789195, 122.14101161])
But when trying to get the same values through model simulations, we don't quite get the same results. Here's the script we run for the simulated boundaries:
sim_model = UnobservedComponents(np.zeros(30), level='llevel', exog=X[70:])
res = []
predicted_state = f_model.predicted_state[..., -1]
predicted_state_cov = f_model.predicted_state_cov[..., -1]
for i in range(1000):
init_state = np.random.multivariate_normal(
predicted_state,
predicted_state_cov
)
sim = sim_model.simulate(
f_model.params,
30,
initial_state=init_state)
res.append(sim.mean())
Printing the lower 2.5 and upper 97.5 percentiles, we get:
np.percentile(res, [2.5, 97.5])
array([119.06735028, 121.26810407])
As we use model simulations to distinguish signal from noise in data, this difference ended up being big enough to lead to contradictory conclusions. If, for instance, we make:
y[70:] += 1
then according to the first technique we conclude that the new y carries no signal, as its mean is lower than 122.14. But the same is not true with the second technique: as the upper boundary is 121.2, we conclude that there is signal.
What we are trying to understand now is whether this is expected. Shouldn't the lower and upper 95% confidence intervals from the two techniques be equal?
