normalize_y in GPR - python

I'm trying to use GaussianProcessRegressor in sklearn to predict values of unknown.
The target values are typically between 1000-10000.
Since the targets do not have zero mean, I set up the model with normalize_y=False, which is the default.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0, alpha=1e-10, normalize_y=False)
When I predict unknown points with this model, the returned std values are unrealistically small, on the order of 0.1, which is about 0.001% of the predicted values.
When I change the setting to normalize_y=True, the returned std values are more realistic, around 500.
Can someone explain exactly what normalize_y does here, and whether I should set it to True or False in this case?

I found the closest answer HERE: https://github.com/scikit-learn/scikit-learn/issues/15612
"OK I think I know what might be going on here. It's a bit tricky to see but I think that none of the kernels have a vertical length scale parameter, so kernel(x,x) is always equal to 1. All the diagonal elements of K are equal to 1 (before we add the ridge to it), for example.
We can then see that the variance of the predictions can only be between 0 and 1. For example, if we're predicting at a point far from the training data (so kernel(X, x_new) is a vector of zeros) then the variance is just
sigma^2 = kernel(x_new, x_new) = 1
I think the real problem here is that the prior is for data with unit variance, but the data doesn't have unit variance. The solution would be to normalise the data so that it has unit variance after it 'enters' the GP, conduct the GP analysis, and then 'unnormalise' it back again at the end. The code already removes the mean automatically, so I think we just need to divide by the standard deviation at the same point and it would work OK.
So could just need a few extra lines!"
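For this reason, setting normalize_y=True should fix this issue in scikit-learn versions that include the change discussed in that issue, because the targets are then rescaled to unit variance before the GP sees them. Alternatively, multiplying the kernel by a ConstantKernel gives the prior a tunable vertical scale. Changing the length_scale_bounds parameter alone does not change the prior variance.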
I hope this helps those who land here as I faced the same issue!
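The effect described in the quoted issue can be reproduced on toy data. The following is a minimal sketch, assuming a recent scikit-learn in which normalize_y=True also rescales the targets to unit variance. The data and the names X_train, y_train, X_far and far_std are made up for illustration, and the kernel hyperparameters are held fixed (optimizer=None) so the comparison is deterministic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Synthetic stand-in data with targets in the 1000-10000 range.
X_train = np.linspace(0.0, 10.0, 8).reshape(-1, 1)
y_train = 5000.0 + 3000.0 * np.sin(X_train).ravel()

# A query point far from all training data, where the GP falls back to its prior.
X_far = np.array([[100.0]])


def far_std(normalize_y):
    """Predictive std far from the data for a fixed unit-variance RBF kernel."""
    gpr = GaussianProcessRegressor(
        kernel=RBF(length_scale=1.0),
        optimizer=None,  # keep the kernel fixed so only normalize_y differs
        alpha=1e-10,
        normalize_y=normalize_y,
    )
    gpr.fit(X_train, y_train)
    _, std = gpr.predict(X_far, return_std=True)
    return float(std[0])


std_raw = far_std(False)   # capped by the prior std of the kernel, which is 1
std_norm = far_std(True)   # prior std rescaled by the std of y_train

print("normalize_y=False:", std_raw)
print("normalize_y=True: ", std_norm)
print("std of y_train:   ", np.std(y_train))
```

With normalize_y=False the predictive std cannot exceed 1 regardless of the target scale, which matches the tiny values reported in the question.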

Related

Difference in relative error when comparing MinMaxScaled prediction and target with re-scaled prediction and target

I'm running a deep learning model which requires me to scale my dataset. I'm using scikit-learn's MinMaxScaler. After I make the prediction, if I compare the prediction with the target column I get a certain relative error. But if I rescale the dataset and the prediction, the relative error increases massively.
For reference, it's not a good model and the error when using the scaled dataset is around 40% and when I re-scale the error jumps to over 60%. I'm also calculating the relative error this way:
import numpy as np

def calculate_error(prediction, y):
    rel_error = 2 * np.absolute(y - prediction) / (np.absolute(y) + np.absolute(prediction))
    return rel_error
From this I get the mean and the standard deviation using numpy's mean() and std() functions. An example is the following:
predicted_scaled = [0.26652822, 0.2384195, 0.26829958, 0.25697553, 0.28840747]
real_scaled = [0.16201117, 0.37243948, 0.42085661, 0.49534451, 0.23649907]
rel_error.mean() = 44.02%
rel_error.std() = 14.03%
---
predicted_rescaled = [12.012565, 10.503127, 12.107687, 11.499586, 13.187481]
real_rescaled = [6.4, 17.7, 20.3, 24.3, 10.4]
rel_error.mean() = 51.54%
rel_error.std() = 17.8%
Why does this happen and how can I prevent it? Furthermore, which is the correct error: the one that compares prediction and target while scaled, or the one I get after re-scaling?
It's because the min value of your min/max scaler shifts the values, and your relative error is not invariant to such a shift. Take, for example, a single datapoint with pred=0.6 and true=0.8 in the scaled space.
The error on the scaled values is:
error = 2*|0.6-0.8|/(0.6+0.8) = 0.4/1.4
error = 2/7 = 0.29
Now re-scale both values with a (randomly chosen) scaler with a min of 2.2 and a max of 10.1, so that a value v becomes 2.2 + 7.9*v:
error = 2*|6.94-8.52|/(6.94+8.52) = 3.16/15.46
error = 0.20
So, this is not an error in the code, but rather the fact that you are calculating a relative error between two different distributions which will result in a different value!
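The single-point calculation can be checked with a few lines of NumPy. This is a small sketch using the same made-up scaler bounds (min 2.2, max 10.1); the helper names relative_error and inverse_min_max are hypothetical.

```python
import numpy as np


def relative_error(prediction, y):
    """Symmetric relative error, as defined in the question."""
    return 2 * np.abs(y - prediction) / (np.abs(y) + np.abs(prediction))


# Assumed scaler bounds, chosen arbitrarily for the illustration.
data_min, data_max = 2.2, 10.1


def inverse_min_max(v):
    """Map a value from the [0, 1] scaled space back to original units."""
    return v * (data_max - data_min) + data_min


pred_scaled, true_scaled = 0.6, 0.8

err_scaled = relative_error(pred_scaled, true_scaled)
err_rescaled = relative_error(inverse_min_max(pred_scaled),
                              inverse_min_max(true_scaled))

print("relative error in scaled space:  ", err_scaled)
print("relative error in original units:", err_rescaled)
```

The numerator scales by (max - min) in both cases, but the denominator also picks up twice the min offset, so the two errors only agree when the scaler's min is 0.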
In regards to which one is the 'correct' result to display, I would suggest it depends on what you're discussing. If you're conveying the real results, then I would suggest that you use the re-scaled results. If you're conveying model performance then either will suffice.
Also, I think it is important to scale your outputs/inputs, as a model will (generally) learn better with scaled outputs/inputs and an activated output (i.e. a sigmoid or tanh function at the output layer).

In the numpy std calculation np.std(y_test == y_test_predict), what is the meaning of comparing the test split and the predicted result?

Please look at this k-nearest neighbors classification example in Python:
for i in range(0, Ks):
    neigh = KNeighborsClassifier(n_neighbors=i+1).fit(x_train, y_train)
    y_test_predict = neigh.predict(x_test)
    mean_acc[i] = metrics.accuracy_score(y_test, y_test_predict)
    std_acc[i] = np.std(y_test == y_test_predict)/np.sqrt(y_test_predict.shape[0])
I have two questions:
(1) What is the meaning of the statement "np.std(y_test == y_test_predict)"? What is the output of the operation y_test == y_test_predict, and why is it passed as an argument to std()?
(2) What is the logic behind computing the standard deviation of the accuracy as below?
std_acc[i] = np.std(y_test == y_test_predict)/np.sqrt(y_test_predict.shape[0])
Let me break this up into two parts.
A comparison of two numpy arrays returns a boolean array. Assume that y_test = np.array([0, 0, 1]) and y_test_predict = np.array([0, 1, 1]). Then y_test == y_test_predict returns array([True, False, True]). Basically, an element-wise comparison of the two arrays is made, index by index. If you understand this, you can see the meaning of np.std(y_test == y_test_predict): it simply calculates the standard deviation of the boolean array returned by the comparison (with True counted as 1 and False as 0).
The formula np.std(y_test == y_test_predict)/np.sqrt(y_test_predict.shape[0]) takes the standard deviation of that boolean array and divides it by the square root of the array's length.
If anything is unclear, I'd be happy to provide further elaboration.
It looks like std_acc is calculated as the uncertainty of the accuracy score (mean_acc) for each k value in the k-nearest neighbors classification algorithm. It looks similar to the standard error of the mean, where the uncertainty of the population mean, given a data set of several values sampled from that population, can be calculated approximately as the standard deviation of the sample divided by the square root of the number of data points (values) in the sample. This comes from the Central Limit Theorem and applies when the number of data points is large.
I have seen std_acc plotted as error bars (margins) for the plot of accuracy score (mean_acc) versus k value.
It does seem rather strange and unintuitive to use this as an uncertainty, though. For a boolean array (interpreted as 0's and 1's) in which a fraction p of the values is True, the standard deviation is sqrt(p*(1-p)). When all predictions are correct or all are incorrect, the std of the boolean array (here np.std(y_test == y_test_predict)) is 0. When half of the predictions are correct and half are incorrect, it is 0.5. When 25% or 75% of predictions are correct, it is ~0.43. So np.std(y_test == y_test_predict) is between 0 and 0.5. Dividing by sqrt(n) therefore gives sqrt(p*(1-p)/n), the standard error of a proportion, rather than a spread of accuracy values around their mean in the typical sense.
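To make the numbers concrete, here is a minimal sketch with made-up labels (the arrays are invented for illustration). It shows that the std of the boolean comparison equals sqrt(p*(1-p)), where p is the accuracy, and that dividing by sqrt(n) gives the standard error of that accuracy.

```python
import numpy as np

# Made-up labels and predictions: 6 of the 8 predictions are correct.
y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_test_predict = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Element-wise comparison gives a boolean array (True where the prediction is right).
correct = (y_test == y_test_predict)

accuracy = correct.mean()                 # fraction of True values, p
std_correct = np.std(correct)             # std of the 0/1 array
std_acc = std_correct / np.sqrt(correct.shape[0])

print("accuracy p:       ", accuracy)
print("np.std(correct):  ", std_correct)
print("sqrt(p * (1 - p)):", np.sqrt(accuracy * (1 - accuracy)))
print("std_acc:          ", std_acc)
```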

GaussianProcessRegressor Fitting Kernel/Hyperparameters

Good day everyone. I have the following problem:
I am using a GaussianProcessRegressor object from the sklearn library.
After fitting the model, I want to sample points using predict, to get a better idea of what the model looks like so far. The issue is that it predicts zero everywhere except at the training points.
I reset the alpha value of the regressor from my initial 1e-5 back to the default 1e-10 and n_restarts_optimizer from 9 back to the default of zero; my kernel is a Matern kernel with nearly standard settings. Now I do get non-zero values, but I am not sure how to proceed:
I have the following:
a = df_reduced.values[0:4, :]
print("a[0,0]: ", a[0,0])
gp.predict(a)
Of course this gives me a nice result (since it's the fitting data):
a[0,0]: 150.0
Out[47]:
array([[10.4       ],
       [ 9.3       ],
       [78.39990234],
       [78.39990234]])
Now I slightly alter the first feature of the first sample, staying in its immediate vicinity:
a = df_reduced.values[1:4, :]
a[0, 0] = 151
gp.predict(a)
array([[4.85703698e-254],
       [7.83999023e+001],
       [7.83999023e+001]])
and for a[0, 0] = 152:
array([[ 0.        ],
       [78.39990234],
       [78.39990234]])
So it seems that in most of the area the function is simply zero, which is a problem, because I want to use this for Gaussian-process hyperparameter optimization with global minimisation. Would somebody have a lead on how to fit this better?
Btw I am using 16 features, and fitting on 30 samples so far and the output function takes values between 0 and 100.
Parameters are as follows (copy-paste):
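from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern
matern = C(1.0)*Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=matern)
gp.fit(df_reduced.values, Y) # df_reduced.values, because meanwhile we have overwritten X_reduced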
Thanks already for any lead,
Best regards,
robTheBob86
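One possible contributor to the zeros described above is that a GP with normalize_y=False has a zero prior mean, so predictions away from the training points decay to 0 whenever the length scale is small relative to the feature scale. The sketch below illustrates only that mechanism. It uses synthetic data in place of df_reduced and Y, keeps the kernel fixed with optimizer=None for determinism, and the helper predict_far is hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

# Synthetic stand-in for the 30 samples x 16 features described in the question.
rng = np.random.default_rng(0)
X = rng.uniform(100.0, 200.0, size=(30, 16))
Y = rng.uniform(0.0, 100.0, size=30)

kernel = C(1.0) * Matern(length_scale=1.0, nu=2.5)

# A query point far (in units of the length scale) from every training point.
X_far = X[:1] + 1000.0


def predict_far(normalize_y):
    """Prediction far from the data with the kernel hyperparameters held fixed."""
    gp = GaussianProcessRegressor(kernel=kernel, optimizer=None,
                                  normalize_y=normalize_y)
    gp.fit(X, Y)
    return float(gp.predict(X_far)[0])


pred_zero_mean = predict_far(False)  # reverts to the zero prior mean
pred_normalized = predict_far(True)  # reverts to the mean of the training targets

print("normalize_y=False:", pred_zero_mean)
print("normalize_y=True: ", pred_normalized)
print("mean of Y:        ", Y.mean())
```

If this is what is happening, standardizing the features so that a length scale near 1 is meaningful, and setting normalize_y=True, may be worth trying. Whether that resolves the original problem is an assumption, not something the question confirms.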

trouble getting started with simple pymc3 example

I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge with an actual (unknown) room-temperature length, length, and a measured length, m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215, 5.8e-6*np.sqrt(1000), 1000)
with basic_model:
    # prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1+alpha*theta)
with basic_model:
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)
length_samples = trace['length']
fig, ax = plt.subplots()
# density=True replaces the `normed` argument, which newer matplotlib versions no longer accept
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r"posterior of $\lambda_1$", color="#A60628", density=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
                             variable.names =
                               c("l","mu","alpha","theta"),#,"ypred"),
                             n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
  l~dnorm(50,1.0)
  alpha~dnorm(0.0000115,694444444444)
  theta~dnorm(-0.1,625)
  mu<-l*(1+alpha*theta)
  xbar~dnorm(mu,29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to highly correlated posterior parameters. Add the extreme scale imbalances, and it makes sense that this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
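For reference, the precision values in the JAGS model follow from tau = 1/sd^2. A short sketch of the conversion, using the standard deviations stated in the question (the dictionary names are just for illustration):

```python
import math

# Standard deviations from the problem statement.
sds = {
    "length": 1.0,       # loose prior on the length
    "alpha": 0.0000012,  # expansion coefficient prior
    "dT": 0.04,          # temperature deviation prior
    "obs": 5.8e-6,       # measurement sd
}

# JAGS dnorm (and the tau argument of pm.Normal) take the precision 1 / sd**2.
taus = {name: 1.0 / sd**2 for name, sd in sds.items()}

for name, tau in taus.items():
    print(f"{name}: sd={sds[name]:g} -> tau={tau:.6g}")
```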
Original Model with Metropolis
import pymc3 as pm

with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1+alpha*dT))
    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots, Forest Plot, Pair Plot (images not preserved)
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
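One cheap sanity check that avoids MCMC entirely is plain Monte Carlo propagation through length = m / (1 + alpha*dT). This is only an approximation: it assumes the N(50, 1) prior on length is so loose compared with the measurement that it can be ignored, and it treats m, alpha and dT as independent. The variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 200_000

# Draw the measured length and the two nuisance quantities from the stated distributions.
m = rng.normal(50.000215, 5.8e-6, n_draws)
alpha = rng.normal(0.0000115, 0.0000012, n_draws)
dT = rng.normal(-0.1, 0.04, n_draws)

# Propagate each draw through the measurement equation.
length_draws = m / (1.0 + alpha * dT)

length_mean = length_draws.mean()
length_sd = length_draws.std()

print(f"approximate posterior mean of length: {length_mean:.7f}")
print(f"approximate posterior sd of length:   {length_sd:.2e}")
```

If a long MCMC run gives a mean and standard deviation for length that differ substantially from these, that would be a hint that the sampler has not mixed well.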
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.

How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the parameters of the Gaussian found by using the fit of the first column to be the initial parameters of the starting point for the next column to use. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
import numpy as np
from astropy.modeling import models, fitting

x = np.linspace(-50,50,100)
Gauss_Model = models.Gaussian1D(amplitude = 1000., mean = 0, stddev = 1.)
Fitting_Model = fitting.LevMarLSQFitter()
Fit_Data = []
# loop over columns, i.e. over the second axis of the array
for i in range(0, Data_Array.shape[1]):
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!
I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like Fit_Data[-1].amplitude or Fit_Data[-1].mean, then you could modify your loop to use something like:
for i in range(0, Data_Array.shape[1]):
    if Fit_Data: # true if not an empty list
        # .value extracts the plain number from a fitted parameter
        Gauss_Model = models.Gaussian1D(amplitude=Fit_Data[-1].amplitude.value,
                                        mean=Fit_Data[-1].mean.value,
                                        stddev=Fit_Data[-1].stddev.value)
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
This basically checks whether you have already fit a model and, if you have, uses the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
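The same warm-start idea can be sketched without the original library. The example below swaps in scipy.optimize.curve_fit and a hand-written Gaussian, with synthetic columns standing in for Data_Array; the function gaussian and the data are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amplitude, mean, stddev):
    """Plain Gaussian profile with peak height `amplitude`."""
    return amplitude * np.exp(-0.5 * ((x - mean) / stddev) ** 2)


rng = np.random.default_rng(0)
x = np.linspace(-50, 50, 100)

# Synthetic stand-in for Data_Array: each column is a noisy Gaussian
# whose centre drifts slowly from one column to the next.
true_means = np.linspace(0.0, 5.0, 10)
data_array = np.column_stack(
    [gaussian(x, 800.0, m, 3.0) + rng.normal(0.0, 5.0, x.size) for m in true_means]
)

p0 = (1000.0, 0.0, 1.0)  # initial guess for the first column only
fitted = []
for i in range(data_array.shape[1]):
    popt, _ = curve_fit(gaussian, x, data_array[:, i], p0=p0)
    fitted.append(popt)
    p0 = popt  # warm start: the next column begins from this column's best fit

fitted = np.array(fitted)
print(fitted)
```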
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably already the best-fit model for the data it was presented. If you want to estimate the error in the parameters of your model, you can use the fact that, for two independent normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the distribution of A + B has mean m_a + m_b and variance v_a + v_b. Thus, the average of your n means is distributed as N(sum(means)/n, sum(variances)/n^2). Basically, you can say that your true mean is centered at the mean of your means, with standard deviation sqrt(sum(variances))/n, which reduces to stddev/sqrt(n) when all n standard deviations are equal.
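The standard deviation of an average of independent normal variables can be checked numerically. A quick sketch with made-up standard deviations, comparing the simulated spread of the average with sqrt(sum(variances))/n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up standard deviations of n independent normal variables.
stddevs = np.array([1.0, 1.5, 2.0, 2.5])
n = stddevs.size

# Each row is one joint draw of the n variables; average across each row.
draws = rng.normal(0.0, stddevs, size=(200_000, n))
averages = draws.mean(axis=1)

empirical_std = averages.std()
predicted_std = np.sqrt(np.sum(stddevs**2)) / n

print("empirical std of the average:", empirical_std)
print("sqrt(sum(variances)) / n:    ", predicted_std)
```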
I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit.models import GaussianModel
x = np.linspace(-50,50,100)
# get Data_Array from somewhere....
# create a model for a Gaussian
Gauss_Model = GaussianModel()
# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)
Fit_Results = []
for i in range(Data_Array.shape[1]):
    result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
    Fit_Results.append(result)
    # update `params` with the current best fit params for the next column
    params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.
