Numpy polyfit: possible error in the scaling of the covariance matrix? - python

I am having a hard time figuring out the scaling for the covariance matrix in numpy polyfit.
In the documentation I read that the scaling factor to go from an unscaled to a scaled covariance matrix is
chi2 / sqrt(N - DOF).
In the code attached below, it seems that the scaling factor actually is
chi2 / DOF
Here is my code
# Generate synthetically the data
# True parameters
import numpy as np
true_slope = 3
true_intercept = 7
x_data = np.linspace(-5, 5, 30)
# The y-data will have a noise term, to simulate imperfect observations
sigma = 1
y_data = true_slope * np.linspace(-5, 5, 30) + true_intercept
y_obs = y_data + np.random.normal(loc=0.0, scale=sigma, size=x_data.size)
# Here I generate artificially some unequal uncertainties
# (even if there is no reason for them to be so)
y_uncertainties = sigma * np.random.normal(loc=1.0, scale=0.5*sigma, size=x_data.size)
# Make the fit
popt, pcov = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov='unscaled')
popt, pcov_scaled = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov=True)
my_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 / y_uncertainties**2) \
                  / (len(y_obs) - 2)
scale_factor = pcov_scaled[0,0] / pcov[0,0]
If I run the code, I see that the actual scale factor is chi2 / DOF and not the value reported in the documentation. Is this true or am I missing something?
I have a further question. Why is it suggested to use just the inverse of the y-data error instead of the square of the inverse of the y-data errors for the weights in the case that the uncertainties are normally-distributed?
Edit to add the data generated by a run of the code
x_data = array([-5. , -4.65517241, -4.31034483, -3.96551724, -3.62068966,
-3.27586207, -2.93103448, -2.5862069 , -2.24137931, -1.89655172,
-1.55172414, -1.20689655, -0.86206897, -0.51724138, -0.17241379,
0.17241379, 0.51724138, 0.86206897, 1.20689655, 1.55172414,
1.89655172, 2.24137931, 2.5862069 , 2.93103448, 3.27586207,
3.62068966, 3.96551724, 4.31034483, 4.65517241, 5. ])
y_obs = array([-7.27819725, -8.41939411, -3.9089926 , -5.24622589, -3.78747379,
-1.92898727, -1.375255 , -1.84388812, -0.37092441, 0.27572306,
2.57470918, 3.860485 , 4.62580789, 5.34147103, 6.68231985,
7.38242258, 8.28346559, 9.46008873, 10.69300274, 12.46051285,
13.35049975, 13.28279961, 14.31604781, 16.8226239 , 16.81708308,
18.64342284, 19.37375515, 19.6714002 , 20.13700708, 22.72327533])
y_uncertainties = array([ 0.63543112, 1.07608924, 0.83603265, -0.03442888, -0.07049299,
1.30864191, 1.36015322, 1.42125414, 1.04099854, 1.20556608,
0.43749964, 1.635056 , 1.00627014, 0.40512511, 1.19638787,
1.26230966, 0.68253139, 0.98055035, 1.01512232, 1.83910276,
0.96763007, 0.57373151, 1.69358475, 0.62068133, 0.70030971,
0.34648312, 1.85234844, 1.18687269, 1.23841579, 1.19741206])
With this data I obtain that scale_factor = 1.6534129347542432, my_scale_factor = 1.653412934754234 and that the "nominal" scale factor reported in the documentation, i.e.
nominal_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 /
                              y_uncertainties**2) / np.sqrt(len(y_obs) - (len(y_obs) - 2))
has value nominal_scale_factor = 32.73590595145554
PS. My NumPy version is 1.18.5, on Python 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)].

Regarding the numpy.polyfit documentation:
By default, the covariance are scaled by chi2/sqrt(N-dof), i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
This looks like a documentation bug. The correct scaling factor for the covariance is chi_square/(N-M) where M is the number of fit parameters and N-M is the number of degrees of freedom. It looks like np.polyfit is implemented correctly, because my_scale_factor and scale_factor are consistent.
Regarding the question on why not "the square of the inverse of the y-data errors": a polynomial fit, or more generally a least-squares fit, involves solving for the p vector in
A @ p = y
where A is an (N, M) matrix for N data points in y and M elements in p and each column in A is the polynomial term evaluated at the corresponding x values.
The solution minimizes
SUM_i [ (SUM_j A[i, j] p[j] - y[i])^2 / sigma_y[i]^2 ]
Computationally, the cheapest way to calculate this is by multiplying each row in A and each y value by the corresponding 1/sigma_y and then taking a standard least-squares solution of the A @ p = y equation. By having the user supply the inverse errors, it saves the fit routine from handling division-by-zero issues and slow square-root operations.
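To make that equivalence concrete, here is a small sketch (not part of the original posts; the data and variable names are made up for illustration) showing that np.polyfit with w=1/sigma gives the same coefficients as an ordinary least-squares solve on the row-scaled system:
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 30)
sigma_y = rng.uniform(0.5, 1.5, x.size)        # per-point uncertainties
y = 3 * x + 7 + rng.normal(0, sigma_y)

# weighted fit, as np.polyfit does it
coef_polyfit = np.polyfit(x, y, 1, w=1/sigma_y)

# equivalent "scaled rows" ordinary least squares
A = np.vander(x, 2)                            # columns: [x, 1]
Aw = A / sigma_y[:, None]                      # divide each row i by sigma_y[i]
yw = y / sigma_y
coef_lstsq, *_ = np.linalg.lstsq(Aw, yw, rcond=None)

print(np.allclose(coef_polyfit, coef_lstsq))   # True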

Regarding the first part, I opened a Github issue
https://github.com/numpy/numpy/issues/16842
The conclusion on that thread is that the documentation is wrong, but the function behaves correctly.
The documentation should be updated to
By default, the covariance is scaled by chi2/dof, i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
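A quick numerical check of that corrected statement, assuming the variables from the code in the question (popt, pcov, pcov_scaled, x_data, y_obs, y_uncertainties) are still in scope:
chi2 = np.sum((y_obs - np.polyval(popt, x_data))**2 / y_uncertainties**2)
dof = len(y_obs) - 2                                 # N data points minus 2 fitted parameters
print(np.allclose(pcov_scaled, pcov * chi2 / dof))   # True: cov=True equals cov='unscaled' times chi2/dof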

Related

How to calculate the probability between two numbers from a probability distribution in python

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
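For instance, something along these lines, maybe (just a rough sketch: trapezoidal integration of the kdeline data, with np.interp resampling the curve onto a finer grid between the two values):
def area_between(a, b, xs, ys, n=500):
    """Approximate the area under the KDE curve (xs, ys) between a and b."""
    grid = np.linspace(a, b, n)
    return np.trapz(np.interp(grid, xs, ys), grid)

# e.g. area_between(x.mean() - x.std(), x.mean() + x.std(), xs, ys)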
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_'
    + str(YEAR) + '.csv.gz?raw=True',
    compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means, or even just bootstrap resampling methods, to either
(1) make a more "normal" distribution of sampling means in order to use CDFs when the initial distribution doesn't quite appear normal (though this would be a distribution of means rather than of individual samples; is that discouraged?), or
(2) if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated, but it can be tricky to combine the right tools, particularly since there are several statistical approaches to the problem.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area under the graph of the probability density function (PDF) f over the interval [lower, upper].
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area under its graph over the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or for convenience. Then one just needs to estimate its parameters mu (aka location or mean) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameters
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s='p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it via estimated parameters as seen above). Kernel density estimation is the most popular variant for doing so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function to evaluate the estimated density at a scalar x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but it is one that makes the result too high: in np.sum(kd_vals * step), N sample values are multiplied by a step whose denominator is N - 1, so the output comes out a factor of N/(N-1) too large. (If they wanted a trapezoid-rule computation of the integral, they should have halved the left and right endpoint values first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
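To illustrate both points, here is a sketch (new code, not the original poster's) that replaces the step-sum with a trapezoid rule and uses a much smaller bandwidth; with enough samples it lands close to the expected 0.68:
from sklearn.neighbors import KernelDensity
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x.reshape(-1, 1))

def get_probability(start, end, n_points, kd):
    grid = np.linspace(start, end, n_points)
    pdf = np.exp(kd.score_samples(grid[:, None]))   # KDE values on the grid
    return np.trapz(pdf, grid)                      # trapezoid rule, no N/(N-1) bias

print(get_probability(x.mean() - x.std(), x.mean() + x.std(), 200, kd))
# roughly 0.68 instead of 0.63, because the narrower kernel no longer flattens the distribution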

Diminishing the impact of one variable over output in a regression model

Currently I am implementing a Gaussian process regression model, and I have been having some problems when trying to apply it to my problem. I have three input variables, one of which (theta) has a much more significant impact on the output than the other two, alpha1 and alpha2. The inputs and outputs have the following values (just a few values, to give a better idea):
# X (theta, alpha1, alpha2)
array([[ 9.07660169, 0.61485493, 1.70396493],
[ 9.51498486, -5.49212002, -0.68659511],
[10.45737558, -2.2739529 , -2.03918961],
[10.46857663, -0.4587848 , 0.54434441],
[ 9.10133699, 8.38066374, 0.66538822],
[ 9.17279647, 0.36327109, -0.30558115],
[10.36532505, 0.87099676, -7.73775872],
[10.13681026, -1.64084098, -0.09169159],
[10.38549264, 1.80633583, 1.3453195 ],
[ 9.72533357, 0.55861224, 0.74180309]])
# y
array([4.93483686, 5.66226844, 7.51133372, 7.54435854, 4.92758927,
5.0955348 , 7.26606153, 6.86027353, 7.36488184, 6.06864003])
As can be seen, theta significantly alters the value of y, whereas changes in alpha1 and alpha2 have a much more subtle effect on y.
The situation I am facing is that I fit a model to my data, and on top of that model I run a minimization with SciPy in which one of the input variables is held fixed. The code below might illustrate this better:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
from scipy.optimize import minimize
import numpy as np
# model fitting
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9, optimizer='fmin_l_bfgs_b')
model.fit(X, y)
# minimization
bnds = np.array([(theta, theta),
                 (alpha1.min(), alpha1.max()),
                 (alpha2.min(), alpha2.max())])
x0 = [theta, alpha1.min(), alpha2.min()]
residual_plant = minimize(lambda x: -model.predict(np.array([x])),
                          x0, method='SLSQP', bounds=bnds,
                          options={'eps': np.radians(5)})
My goal is to fix the first variable at a given value and study the impact that the other two variables, alpha1 and alpha2, have on the output y for that specific theta. The reasoning behind the minimization is that I want to find the combination of alpha1 and alpha2 that returns the optimal y for this fixed theta. My concern is that theta influences the output so strongly that it drowns out the effect of alpha1 and alpha2 and may negatively affect my model for the task at hand. At the same time, I cannot simply ignore theta or leave it out of the model, because I need to find the optimal y for this fixed theta, so theta must remain an input.
My question is, how to deal with such issue? Is there any statistical trick to eliminate or at least diminish this influence without having to eliminate theta from my model? Is there a better way to deal with my problem?
First, did you normalize the data before training?
Second, it sounds like you want to see the relationship between x and y with a constant theta.
If you sort your dataset by theta, you can look for a group of records where theta is the same or very similar, i.e. where its variance is low and it barely changes. Take that group of data, form a new dataframe, and drop the theta column (since theta has low variance in this subset, it carries little useful information). Then you can train your model or do some data visualization on just the alpha1 and alpha2 data, as in the sketch below.
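As a rough sketch of that idea (the theta value and the tolerance below are made up; X and y are assumed to be the arrays from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(X, columns=['theta', 'alpha1', 'alpha2'])
df['y'] = y

# keep only the records whose theta falls in a narrow band, then drop theta
theta0 = 10.0          # hypothetical value of interest
band = df[np.abs(df['theta'] - theta0) < 0.2].drop(columns='theta')

# study alpha1/alpha2 against y on this near-constant-theta subset
print(band.corr())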
My overall understanding to your question is that you want to achieve two things:
To study the effect of alpha1 and alpha2 after turning theta into constant (i.e. eliminating the influence of theta on the model).
To find the best combination of alpha1 and alpha2 that returns the optimal y for this fixed theta.
Both can be framed as a study of the correlation between the input variables and the target variable.
Since correlation measures how one variable changes in relation to another, it gives good insight into the influence of alpha1, alpha2 and theta on y.
Two interesting correlations exist to help you:
Pearson's Correlation: Numerically reflects the strength of a linear correlation.
Spearman's Correlation: Numerically reflects the strength of a monotonic correlation (i.e. the rank, in case the correlation is not linear).
Let's give it a try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(columns=['theta', 'alpha1', 'alpha2', 'y'],
data=[[ 9.07660169, 0.61485493, 1.70396493, 4.93483686],
[ 9.51498486, -5.49212002, -0.68659511, 5.66226844],
[10.45737558, -2.2739529 , -2.03918961, 7.51133372],
[10.46857663, -0.4587848 , 0.54434441, 7.54435854],
[ 9.10133699, 8.38066374, 0.66538822, 4.92758927],
[ 9.17279647, 0.36327109, -0.30558115, 5.0955348],
[10.36532505, 0.87099676, -7.73775872, 7.26606153],
[10.13681026, -1.64084098, -0.09169159, 6.86027353],
[10.38549264, 1.80633583, 1.3453195, 7.36488184],
[ 9.72533357, 0.55861224, 0.74180309, 6.06864003]])
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap')
plt.show()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap')
plt.show()
As you can see, we got very good insights about the relation between theta, alpha1 and alpha2 with each other and with y.
According to Cohen's Standard, we can conclude that:
Alpha1 and Alpha2 have medium correlation with y.
Theta has very strong correlation with y.
Alpha1 has weak linear correlation with alpha2 but medium monotonic correlation.
Alpha1 and Alpha2 have medium correlation with theta.
But wait a minute: since alpha1 and alpha2 have medium correlation with y but only weak-to-medium correlation with each other, we can exploit the variances to build an optimization function L that is a linear combination of alpha1 and alpha2, as follows:
Let m, n be two weights that maximize the correlation between alpha1 and alpha2 features with y according to the optimization function L:
m * alpha1 + n * alpha2
The optimal coefficients m and n achieving the maximum correlation between L and y do depend on the variances of alpha1, alpha2 and y.
We can derive from that, the following optimization solution:
m = [ (Cov(b, c) * Cov(a, b) - Cov(a, c) * Var(b)) / (Cov(a, c) * Cov(a, b) - Cov(b, c) * Var(a)) ] * n
where a , b and c correspond to alpha1, alpha2 and y respectively.
By choosing m or n to be either 1 or -1 , we can find the optimal solution to engineer the new feature.
cov = df[['alpha1', 'alpha2', 'y']].cov()
# applying the optimization function: a = alpha1 , b = alpha2 and c = y
# note that cov of a feature with itself = variance
coef = (cov['alpha2']['y'] * cov['alpha1']['alpha2'] - cov['alpha1']['y'] * cov['alpha2']['alpha2']) / \
(cov['alpha1']['y'] * cov['alpha1']['alpha2'] - cov['alpha2']['y'] * cov['alpha1']['alpha1'])
# let n = 1 --> m = coef --> L = coef * alpha1 + alpha2 : which is the new feature to add
df['alpha12'] = coef * df['alpha1'] + df['alpha2']
As you can see, there is a noticeable improvement in the correlation of the introduced alpha12.
Furthermore, regarding point 1, to decrease the correlation between theta and y, note that the correlation is given by:
Corr(theta, y) = Cov(theta, y) / [sqrt(Var(theta)) * sqrt(Var(y))]
so you can decrease it by increasing the variance of theta. To do so, simply sample n points from some distribution and add them to the corresponding indices as noise.
Save this noise list for future use in case you need to get back to the original theta, something like this:
cov = df[['y', 'theta']].cov()
print("Theta Variance :: Before = {}".format(cov['theta']['theta']))
np.random.seed(2020) # add seed to make it reproducible for future undo
# create noise drawn from uniform distribution
noise = np.random.uniform(low=1.0, high=10., size=df.shape[0])
df['theta'] += noise # add noise to increase variance
cov = df[['y', 'theta']].cov()
print("Theta Variance :: After = {}".format(cov['theta']['theta']))
# df['theta'] -= noise to back to original variance
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
Theta Variance :: Before = 0.3478030891329485
Theta Variance :: After = 7.552229545792681
Now Alpha12 is taking the lead and has the highest influence on the target variable y.
I would say that the effect of theta on your predictor can't hide the effect of the other variables. The effect of the other variables may be small, but that's likely just the way it is. I'd take the estimate you have and optimise y for constant theta as is.
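If it helps, one simple way to look at this (a sketch reusing model, theta, alpha1 and alpha2 from the question's code) is to evaluate the fitted GP on an alpha1/alpha2 grid at the fixed theta and inspect the predicted surface directly; the grid maximum also serves as a sanity check for the SLSQP result:
import numpy as np

a1 = np.linspace(alpha1.min(), alpha1.max(), 50)
a2 = np.linspace(alpha2.min(), alpha2.max(), 50)
A1, A2 = np.meshgrid(a1, a2)
grid = np.column_stack([np.full(A1.size, theta), A1.ravel(), A2.ravel()])
Y = model.predict(grid).reshape(A1.shape)

# location and value of the best prediction on the grid at this fixed theta
i, j = np.unravel_index(np.argmax(Y), Y.shape)
print(a1[j], a2[i], Y[i, j])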

Gaussian process with constant variance after mean centering

I'm new to working with Gaussian processes, so please target answers to a relative beginner. I'm trying to sample correlated noise where the mean of each sample is 0 (i.e., mean-centered), so I've been running the following code. However, while the variance across samples is the same for each dimension before mean-centering each sample, mean-centering causes the variance to be increasingly large at the ends of the vectors. I'm fairly certain I understand why this happens, but I'm struggling to figure out if there's a way to have each sample mean-centered while maintaining equal variance across dimensions.
import numpy as np
def rbf_kernel(x_1, x_2, sig):
    return np.exp((-(x_1 - x_2)**2) / 2 * (sig**2))
X = np.array([[0.08333333],
[0.25 ],
[0.41666667],
[0.58333333],
[0.75 ],
[0.91666667]])
r = 0.1
covNoise = np.zeros((6, 6))
for i, x1 in enumerate(X):
    for j, x2 in enumerate(X):
        covNoise[i, j] = rbf_kernel(x1, x2, r)
noise = np.random.multivariate_normal(np.zeros(6), covNoise, 1000)
np.var(noise, axis=0)
# Variance before mean-centering -- variance is constant across the vector
# array([0.99994815, 0.99941361, 0.9989251 , 0.99848157, 0.99806782, 0.99768438])
noise_meanCentered = noise - noise.mean(axis=1, keepdims=True)
np.var(noise_meanCentered, axis=0)
# Variance after mean-centering -- variance is greatest at the ends of the vector
# array([0.15211363, 0.0589172, 0.01052137, 0.01053556, 0.0589244, 0.15203642])

Simulating correlated lognormals in Python

I'm following the answer in the question How can I sample a multivariate log-normal distribution in Python?, but the marginal distributions of the sample data fail to have the same mean and standard deviation as the inputted marginals. For example, consider the multivariate distribution in the code sample below. If we label the marginals as X, Y, and Z, then I would expect the location and scale parameters (implied from the sample data) to match the inputted data. However, for X, you can see below that the implied location and scale parameters are 0.1000 and 0.5219. So the location is what we expect, but the scale is off by about 4%. I'm thinking I'm doing something wrong with the covariance matrix, but I can't seem to figure out what. If I set the correlation matrix to the identity matrix, the location and scale of the sample data do match the inputted data. Something must be wrong with my covariance matrix, or I'm making another fundamental error. Any help would be appreciated. Please advise if the question is unclear.
import pandas as pd
import numpy as np
from copy import deepcopy
mu = [0.1, 0.2, 0.3]
sigma = [0.5, 0.8, 0.6]
sims = 3000000
rho = [[1, 0.9, 0.3], [0.9, 1, 0.8], [0.3, 0.8 ,1]]
cov = deepcopy(rho)
for row in range(len(rho)):
    for col in range(len(rho)):
        cov[row][col] = rho[row][col] * sigma[row] * sigma[col]
mvn = np.random.multivariate_normal(mu, cov, size=sims)
sim = pd.DataFrame(np.exp(mvn), columns=['X', 'Y', 'Z'])
def computeImpliedLogNormalsParams(mean, std):
    # This method implies lognormal params which match the inputted moments
    secondMoment = std * std + mean * mean
    location = np.log(mean * mean / np.sqrt(secondMoment))
    scale = np.sqrt(np.log(secondMoment / (mean * mean)))
    return (location, scale)

def printDistributionProp(col, sim):
    print(f"Mean = {sim[col].mean()}, std = {sim[col].std()}")
    location, scale = computeImpliedLogNormalsParams(sim[col].mean(), sim[col].std())
    print(f"Matching moments gives a lognormal with location {location} and scale {scale}")
printDistributionProp('X', sim)
Output:
Mean = 1.2665338803521895, std = 0.708713940557892
Matching moments gives a lognormal with location 0.10008162992913544 and scale 0.5219239625443672
Observing the output, we would expect the scale parameter to be very close to 0.5, but it is a bit off. Increasing the number of simulations does nothing, since the value has already converged.
The covariance matrix isn't positive semidefinite:
>>> mvn = np.random.multivariate_normal(mu, cov, size=sims, check_valid='raise')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 4542, in mtrand.RandomState.multivariate_normal
ValueError: covariance is not symmetric positive-semidefinite.
and therefore there is no distribution of data that actually has the requested covariance structure. At a high-level, consider that you are specifying X and Z to both be highly correlated with Y (0.8 and 0.9), but at the same time to be rather weakly correlated with each other (0.3). A detailed discussion specifically about three variable correlation constraints can be found on Mathematics SE.
I don't know the internals of how NumPy gets around it (you should have seen a warning), but if you check the final correlation structure:
>>> np.corrcoef(mvn.T)
array([[1. , 0.79817321, 0.33343102],
[0.79817321, 1. , 0.74525583],
[0.33343102, 0.74525583, 1. ]])
one can see that the X and Z have lower correlations with Y and higher correlation with each other than originally specified by rho. Again, not sure how exactly the variances get adjusted, but because the covariance is impossible, NumPy can pretty much do what it wants; fortunately, it seems to stay pretty close.
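A quick way to see the problem directly (using the cov list built in the question's code) is to inspect its eigenvalues; a valid covariance matrix must have none below zero:
import numpy as np

print(np.linalg.eigvalsh(np.array(cov)))
# one eigenvalue comes out negative, so no real data set can have this covariance structure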

Linear fit including all errors with NumPy/SciPy

I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.
There is just one caveat: the errors are only used as weights, but they are not propagated into the uncertainty of the result. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to check this.
Fit with scipy.optimize.curve_fit gives me:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
So you can see that the values are identical. This tells me that the algorithm does not take those into account, but I think the values should be different.
I read about another fit method here as well, so I tried to fit with scipy.odr as well:
Beta: [ 2.00538124 2.95000413]
Beta Std Error: [ 0.00652719 0.03870884]
Same but with 20 * y_err:
Beta: [ 2.00517894 2.9489472 ]
Beta Std Error: [ 0.00642428 0.03647149]
The values are slightly different, but I do not think that this accounts for the increase in the error at all. I think it is just rounding error or a slightly different weighting.
Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.
I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes, and the first one is what I need.
Fit with linfit:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00772283 0.04449971]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.15445662 0.88999413]
Fit with linfit(relsigma=True):
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Should I answer my question or just close/delete it now?
One way that works well, and actually gives a better result, is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. Then one draws a point from each of those distributions to obtain a new bootstrapped sample. Performing a simple unweighted fit gives one value for the parameters.
This process is repeated some 300 to a couple of thousand times. One ends up with a distribution of the fit parameters, from which one can take the mean and standard deviation to obtain a value and an error.
Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:
Further steps in the analysis are then performed again hundreds of times with the various fit parameters. This will then also take into account the correlation of the fit parameters as one can see clearly in the plot above: Although a symmetric function was fitted to the data, the error band is asymmetric. This will mean that interpolated values on the left have a larger uncertainty than on the right.
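A minimal sketch of such a parametric bootstrap (the straight-line model and the x, y, y_err arrays below are made up for illustration, not the test case linked above):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_err = np.full_like(x, 0.5)
y = model(x, 2.0, 3.0) + rng.normal(0, y_err)

boot_params = []
for _ in range(1000):
    y_boot = rng.normal(y, y_err)          # resample each y from its own Gaussian
    p, _ = curve_fit(model, x, y_boot)     # plain unweighted fit on the resampled data
    boot_params.append(p)

boot_params = np.array(boot_params)
print(boot_params.mean(axis=0))            # bootstrap parameter values
print(boot_params.std(axis=0))             # their uncertainties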
Please note that, from the documentation of curve_fit:
sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the
least-squares problem.
The key point here is as relative weights; therefore, yerr in line 53 and 2*yerr in line 57 should give you similar, if not identical, results.
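A small sketch that demonstrates this behaviour on generic data (not the linked test case): with absolute_sigma=False, doubling sigma changes nothing, while with absolute_sigma=True the covariance grows by a factor of four:
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 3 + rng.normal(0, 0.5, x.size)
sigma = np.full_like(x, 0.5)

for absolute in (False, True):
    _, cov1 = curve_fit(lambda x, a, b: a * x + b, x, y, sigma=sigma,
                        absolute_sigma=absolute)
    _, cov2 = curve_fit(lambda x, a, b: a * x + b, x, y, sigma=2 * sigma,
                        absolute_sigma=absolute)
    print(absolute, np.diag(cov2) / np.diag(cov1))
# False -> ratios of ~1 (sigma acts only as relative weights), True -> ratios of ~4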
When you increase the actual residual error, you will see the values in the covariance matrix grow larger. Say we change y += random to y += 5*random in the function generate_data():
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458, 3.97843448]))
('Errors: ', array([ 0.09617346, 0.64127574]))
Compares to the original result:
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386, 2.97817514]))
('Errors: ', array([ 0.00782591, 0.02983339]))
Also notice that the parameter estimate is now further off from (2, 3), as we would expect from the increased residual error and the larger confidence intervals of the parameter estimates.
Short answer
For absolute values that include uncertainty in y (and in x for odr case):
In the scipy.odr case use stddev = numpy.sqrt(numpy.diag(cov))
where the cov is the covariance matrix odr gives in the output.
In the scipy.optimize.curve_fit case use the absolute_sigma=True flag.
For relative values (excludes uncertainty):
In the scipy.odr case use the sd value from the output.
In the scipy.optimize.curve_fit case use absolute_sigma=False flag.
Use numpy.polyfit like this:
p, cov = numpy.polyfit(x, y, 1, cov=True)
errorbars = numpy.sqrt(numpy.diag(cov))
Long answer
There is some undocumented behavior in all of these functions. My guess is that they mix relative and absolute values. At the end of this answer is code that either gives what you want or doesn't, depending on how you process the output (is there a bug?). Also, curve_fit might have gained the 'absolute_sigma' flag only recently.
My point is in the output. It seems that odr calculates the standard deviation as if there were no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are there. curve_fit does this with the absolute_sigma=True flag. Below is the output containing:
- the diagonal elements of the covariance matrix, cov(0,0) and cov(1,1),
- the wrong way to get the standard deviations of the slope and the constant from the outputs, and
- the right way to get the standard deviations of the slope and the constant from the outputs:
odr: 1.739631e-06 0.02302262 [ 0.00014863 0.0170987 ] [ 0.00131895 0.15173207]
curve_fit: 2.209469e-08 0.00029239 [ 0.00014864 0.01709943] [ 0.0004899 0.05635713]
polyfit: 2.232016e-08 0.00029537 [ 0.0001494 0.01718643]
Notice that odr and polyfit have exactly the same standard deviation. Polyfit does not take the uncertainties as input, so odr evidently does not use the uncertainties when calculating the standard deviation either. The covariance matrix does use them: if, in the odr case, the standard deviation is calculated from the covariance matrix, the uncertainties are there, and they change when the uncertainty is increased. Fiddling with dy in the code below will show it.
I am writing this here mostly because it is important to know when determining error limits (and the Fortran ODRPACK guide that SciPy refers to has some misleading information about this: the standard deviation should be the square root of the covariance matrix, as the guide says, but it is not).
import scipy.odr
import scipy.optimize
import numpy
x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4
def stddev(cov): return numpy.sqrt(numpy.diag(cov))
def f(B, x): return B[0]*x + B[1]
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy = dy)
myodr = scipy.odr.ODR(mydata, linear, beta0 = [1.0, 1.0], sstol = 1e-20, job=00000)
myoutput = myodr.run()
cov = myoutput.cov_beta
sd = myoutput.sd_beta
p = myoutput.beta
print('odr: ', cov[0,0], cov[1,1], sd, stddev(cov))
# curve_fit expects an N-length sigma rather than a scalar
sigma_y = dy * numpy.ones_like(y)
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=sigma_y,
                                    absolute_sigma=False,
                                    xtol=1e-20)
p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma=sigma_y,
                                    absolute_sigma=True,
                                    xtol=1e-20)
print('curve_fit: ', cov2[0,0], cov2[1,1], stddev(cov2), stddev(cov3))
p, cov4 = numpy.polyfit(x, y, 1, cov=True)
print('polyfit: ', cov4[0,0], cov4[1,1], stddev(cov4))
