Currently I am implementing a gaussian regression process model and I have been having some problems when trying to apply it to the scope of my problem. My problem is that I have as input to my model three variables, which one of these values (theta) has way more significant impact than the other two, alpha1 and alpha2. The inputs and outputs have the following values (just a few values to better understand):
# X (theta, alpha1, alpha2)
array([[ 9.07660169, 0.61485493, 1.70396493],
[ 9.51498486, -5.49212002, -0.68659511],
[10.45737558, -2.2739529 , -2.03918961],
[10.46857663, -0.4587848 , 0.54434441],
[ 9.10133699, 8.38066374, 0.66538822],
[ 9.17279647, 0.36327109, -0.30558115],
[10.36532505, 0.87099676, -7.73775872],
[10.13681026, -1.64084098, -0.09169159],
[10.38549264, 1.80633583, 1.3453195 ],
[ 9.72533357, 0.55861224, 0.74180309])
# y
array([4.93483686, 5.66226844, 7.51133372, 7.54435854, 4.92758927,
5.0955348 , 7.26606153, 6.86027353, 7.36488184, 6.06864003])
As it can be seen, theta alters significantly the value of y, whereas changes in alpha1 and alpha2 are way more subtle over the y.
The situation that I am facing is that I am applying a model to my data and out of this model, I am applying a minimization with Scipy to the model setting one of the inputs variables fixed as on this minimization. The code bellow might illustrate better:
# model fitting
kernel = C(1.0, (1e-3, 1e3))*RBF(10,(1e-2,1e2))
model = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9,optimizer='fmin_l_bfgs_b')
model.fit(X,y)
# minimization
bnds = np.array([(theta,theta),
(alpha1.min(),
alpha1.max()),
(alpha2.min(),
alpha2.max())])
x0 = [theta,alpha1.min(),alpha2.min()]
residual_plant = minimize(lambda x: -model.predict(np.array([x])),
x0, method='SLSQP',bounds=bnds,
options = {'eps': np.radians(5)})
My goal with that is that I want to set the first variable value as a fixed value and I want to study impact that the other two variables, alpha1 and alpha2, have over the output y for that specific theta value. The specific reasoning behind the minimization is that I want to find the combinations of alpha t1 and alpha2 that return me the optimal y for this fixed theta. Therefore, I was wondering how would I do that, as I believe that theta must be influencing drastically the impact that my other two variables have over my output, and then it might be negatively influencing my model on the task that I have in hand, as it has a heavier weight and will hidden the influence of alpha1 and alpha2 on my model, however, I cannot really ignore it or not feed it into my model as I want to find the optimal y value for this fixed theta and therefore I would still need to use theta as input.
My question is, how to deal with such issue? Is there any statistical trick to eliminate or at least diminish this influence without having to eliminate theta from my model? Is there a better way to deal with my problem?
First, did you normalize the data before training?
Second, it sounds like you want to see the relationship between x and y with a constant theta.
If you take your dataset and sort it by theta, you can try to find a group of records where theta is the same or very similar, where its variance is low and it doesnβt change much. You can take that group of data and form a new dataframe, and drop the theta column (because we picked a portion of the dataset where theta has a low variance and so it isnβt very useful). Then, you can train your model or do some data visualization on just the alpha1 and alpha2 data.
My overall understanding to your question is that you want to achieve two things:
To study the effect of alpha1 and alpha2 after turning theta into constant (i.e. eliminating the influence of theta on the model).
To find the best combination of alpha1 and alpha2 that returns the optimal y for this fixed theta.
That can be summarized under the study of the Correlation between the input variables and the target variable.
Since Correlation studies the changes in the relation between one variable from another independently, then you can get a good insight about the influence of alpha1, alpha2 and theta on y.
Two interesting correlations exist to help you:
Pearson's Correlation: Numerically reflects the strength of a linear correlation.
Spearman's Correlation: Numerically reflects the strength of a monotonic correlation (i.e. the rank, in case the correlation is not linear).
Let's give it a try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(columns=['theta', 'alpha1', 'alpha2', 'y'],
data=[[ 9.07660169, 0.61485493, 1.70396493, 4.93483686],
[ 9.51498486, -5.49212002, -0.68659511, 5.66226844],
[10.45737558, -2.2739529 , -2.03918961, 7.51133372],
[10.46857663, -0.4587848 , 0.54434441, 7.54435854],
[ 9.10133699, 8.38066374, 0.66538822, 4.92758927],
[ 9.17279647, 0.36327109, -0.30558115, 5.0955348],
[10.36532505, 0.87099676, -7.73775872, 7.26606153],
[10.13681026, -1.64084098, -0.09169159, 6.86027353],
[10.38549264, 1.80633583, 1.3453195, 7.36488184],
[ 9.72533357, 0.55861224, 0.74180309, 6.06864003]])
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap')
plt.show()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap')
plt.show()
As you can see, we got very good insights about the relation between theta, alpha1 and alpha2 with each other and with y.
According to Cohen's Standard, we can conclude that:
Alpha1 and Alpha2 have medium correlation with y.
Theta has very strong correlation with y.
Alpha1 has weak linear correlation with alpha2 but medium monotonic correlation.
Alpha1 and Alpha2 have medium correlation with theta.
But wait a minute, since alpha1 and alpha2 have medium correlation with y but weak correlation (to medium) between each other, we can then exploit the variance to produce an optimization function L that is a linear combination between alpha1 and alpha2, as follows:
Let m, n be two weights that maximize the correlation between alpha1 and alpha2 features with y according to the optimization function L:
m * alpha1 + n * alpha2
The optimal coefficients m and n achieving the maximum correlation between L and y do depend on the variances of alpha1, alpha2 and y.
We can derive from that, the following optimization solution:
m = [ πΆππ£(π,π) * πΆππ£(π,π) β πΆππ£(π,π) * πππ(π) / πΆππ£(π,π) * πΆππ£(π,π) β πΆππ£(π,π) * πππ(π) ] * n
where a , b and c correspond to alpha1, alpha2 and y respectively.
By choosing m or n to be either 1 or -1 , we can find the optimal solution to engineer the new feature.
cov = df[['alpha1', 'alpha2', 'y']].cov()
# applying the optimization function: a = alpha1 , b = alpha2 and c = y
# note that cov of a feature with itself = variance
coef = (cov['alpha2']['y'] * cov['alpha1']['alpha2'] - cov['alpha1']['y'] * cov['alpha2']['alpha2']) / \
(cov['alpha1']['y'] * cov['alpha1']['alpha2'] - cov['alpha2']['y'] * cov['alpha1']['alpha1'])
# let n = 1 --> m = coef --> L = coef * alpha1 + alpha2 : which is the new feature to add
df['alpha12'] = coef * df['alpha1'] + df['alpha2']
As you can see, there is a noticeable improvement in the correlation of the introduced alpha12.
Furthermore, related to question 1, to decrease the correlation of theta; and since the correlation is given by:
Corr(theta, y) = Cov(theta, y) / [sqrt(Var(that)) * sqrt(Var(y))]
You can increase the variance of theta. To do so, simply sample n points from some distribution and add them to the corresponding indices as noise.
Save this noise list for future use in case you need to get back to the original theta, something like this:
cov = df[['y', 'theta']].cov()
print("Theta Variance :: Before = {}".format(cov['theta']['theta']))
np.random.seed(2020) # add seed to make it reproducible for future undo
# create noise drawn from uniform distribution
noise = np.random.uniform(low=1.0, high=10., size=df.shape[0])
df['theta'] += noise # add noise to increase variance
cov = df[['y', 'theta']].cov()
print("Theta Variance :: After = {}".format(cov['theta']['theta']))
# df['theta'] -= noise to back to original variance
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap After Reducing Variance of Theta\n')
plt.show()
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap After Reducing Variance of Theta\n')
plt.show()
Theta Variance :: Before = 0.3478030891329485
Theta Variance :: After = 7.552229545792681
Now Alpha12 is taking the lead and has the highest influence on the target variable y.
I would say that the effect of theta on your predictor can't hide the effect of the other variables. The effect of the other variables may be small but that's likely just the way it is. I'd take the estimate you have and optimise y for constant theta as is
Related
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
+ str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
y = np.random.choice(df, 500)
avg = np.mean(y)
sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
Make a more "normal" distribution with sampling means in order to incorporate cdfs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
or
If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it might be tricky to combine the right tools to do so. In particular, since there are several statistical approaches to do so.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function(PDF) f's graph on the interval [lower, upper].
However, when the CDF/PDF is unknown, it constitutes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph enclosed with the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka scale) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it by estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.inegrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
'''wrapper function to compute probability'''
return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
xaxis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(xaxis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
facecolor='red',
alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but that bug causes it to compute a too high result - in np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). However, they calculate an R-squared value when their function follows the power-law (f(x) = a*x^b). I'm trying to do the same but get negative R-squared values.
Here is my code:
def powerlaw(x, a, b):
'''Generic power law function.'''
return a * x**b
X = s_lt[4:] # independent variable (Pandas series)
Y = s_lm[4:] # dependent variable (Pandas series)
popt, pcov = curve_fit(powerlaw, X, Y)
residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2) # residual sum of squares
ss_tot = np.sum((Y-np.mean(Y))**2) # total sum of squares
r_squared = 1 - (ss_res / ss_tot) # r-squared value
print("R-squared of power-law fit = ", str(r_squared))
I got an R-squared value of -0.057....
From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model due to overfitting. Did something else go wrong?
See The R-squared and nonlinear regression: a difficult marriage?. Also When is R squared negative?.
Basically, we have two problems:
nonlinear models do not have an intercept term, at least, not in the usual sense;
the equality SStot=SSreg+SSres may not hold.
The first reference above denotes your statistic "pseudo-R-square" (in the case of non-linear models), and notes that it may be lower than 0.
To further understand what's going on you probably want to plot your data Y as a function of X, the predicted values from the power law as a function of X, and the residuals as a function of X.
For non-linear models I have sometimes calculated the sum of squared deviation from zero, to examine how much of that is explained by the model. Something like this:
pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2) # Not deviation from mean.
ss_resid = np.sum((Y - pred)**2)
pseudo_r_squared = 1 - ss_resid/ss_total
Calculated this way, pseudo_r_squared can potentially be negative (if the model is really bad, worse than just guessing the data are all 0), but if pseudo_r_squared is positive I interpret it as the amount of "variation from 0" explained by the model.
I need to fit a sine curve created from two sine waves and extract the parameters for the fitted curve (such as frequency, amplitude, etc).
Data example:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.arange(0, 50, 0.01)
x2 = np.arange(0, 100, 0.02)
x3 = np.arange(0, 150, 0.03)
sin1 = np.sin(x)
sin2 = np.sin(x2)
sin3= np.sin(x3/2)
sin4 = sin1 + sin2+sin3
plt.plot(x, sin4)
plt.show()
I used the codes provided in this answer.
yy = sin4
tt = x
res = fit_sin(tt, yy)
print(str(i), "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
fit_values=res["fitfunc"](tt)
Frequenc_fit= res['freq']
print(i, Frequenc_fit)
Frequenc_fit=Frequenc_fit
Amp_fit=res['amp']
Omega_fit=res['omega']
Phase_fit=res['phase']
Offset_fit=res['offset']
maxcov_fit=res['maxcov']
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt,fit_values, "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
I got a fitted sine curve with a single frequency and amplitude as follows:
2 Amplitude=1.0149282025860233, Angular freq.=2.01112187048004, phase=-0.2730905030152767, offset=0.003304158823058212, Max. Cov.=0.0015266032307905222
2 0.3200799868471169
Is there a method to obtain fitted curve matches with the original one?
Supposing that the function to be fitted is
y(x)=a * sin( w * x )+b * sin( W * x )
the principle of the method below is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
The graphical representation of the result is :
Blue curve : From data obtained by scanning the graph given in the question.
Black curve : From the above calculus.
The available data was not accurate because it comes from scanning of the original figure. The deviation is mainly due to the numerical integrations in computing the values of SS and SSSS (Four successive numerical integrations is not accurate especially with biaised data).
Probably the correct result should be : w=2 , W=1 , a=1 , b=1.
NOTE : The above method is not iterative and thus doesn't requires guessed values of the parameters to start an iterative process. The approximate results of the parameters can be good initial values in order to use an iterative non-linear regression process.
NOTE : If the values of w and W where known a-priori the solving thanks to linear regression would be very simple and much accurate (Only the last 2X2 matrix calculus shown above).
I am having a hard time figuring out the scaling for the covariance matrix in numpy polyfit.
In the documentation I read that the scaling factor to go from an unscaled to a scaled covariance matrix is
chi2 / sqrt(N - DOF).
In the code attached below, it seems that the scaling factor actually is
chi2 / DOF
Here is my code
# Generate synthetically the data
# True parameters
import numpy as np
true_slope = 3
true_intercept = 7
x_data = np.linspace(-5, 5, 30)
# The y-data will have a noise term, to simulate imperfect observations
sigma = 1
y_data = true_slope * np.linspace(-5, 5, 30) + true_intercept
y_obs = y_data + np.random.normal(loc=0.0, scale=sigma, size=x_data.size)
# Here I generate artificially some unequal uncertainties
# (even if there is no reason for them to be so)
y_uncertainties = sigma * np.random.normal(loc=1.0, scale=0.5*sigma, size=x_data.size)
# Make the fit
popt, pcov = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov='unscaled')
popt, pcov_scaled = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov=True)
my_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 / y_uncertainties**2)\
/ (len(y_obs)-2)
scale_factor = pcov_scaled[0,0] / pcov[0,0]
If I run the code, I see that the actual scale factor is chi2 / DOF and not the value reported in the documentation. Is this true or am I missing something?
I have a further question. Why is it suggested to use just the inverse of the y-data error instead of the square of the inverse of the y-data errors for the weights in the case that the uncertainties are normally-distributed?
Edit to add the data generated by a run of the code
x_data = array([-5. , -4.65517241, -4.31034483, -3.96551724, -3.62068966,
-3.27586207, -2.93103448, -2.5862069 , -2.24137931, -1.89655172,
-1.55172414, -1.20689655, -0.86206897, -0.51724138, -0.17241379,
0.17241379, 0.51724138, 0.86206897, 1.20689655, 1.55172414,
1.89655172, 2.24137931, 2.5862069 , 2.93103448, 3.27586207,
3.62068966, 3.96551724, 4.31034483, 4.65517241, 5. ])
y_obs = array([-7.27819725, -8.41939411, -3.9089926 , -5.24622589, -3.78747379,
-1.92898727, -1.375255 , -1.84388812, -0.37092441, 0.27572306,
2.57470918, 3.860485 , 4.62580789, 5.34147103, 6.68231985,
7.38242258, 8.28346559, 9.46008873, 10.69300274, 12.46051285,
13.35049975, 13.28279961, 14.31604781, 16.8226239 , 16.81708308,
18.64342284, 19.37375515, 19.6714002 , 20.13700708, 22.72327533])
y_uncertainties = array([ 0.63543112, 1.07608924, 0.83603265, -0.03442888, -0.07049299,
1.30864191, 1.36015322, 1.42125414, 1.04099854, 1.20556608,
0.43749964, 1.635056 , 1.00627014, 0.40512511, 1.19638787,
1.26230966, 0.68253139, 0.98055035, 1.01512232, 1.83910276,
0.96763007, 0.57373151, 1.69358475, 0.62068133, 0.70030971,
0.34648312, 1.85234844, 1.18687269, 1.23841579, 1.19741206])
With this data I obtain that scale_factor = 1.6534129347542432, my_scale_factor = 1.653412934754234 and that the "nominal" scale factor reported in the documentation, i.e.
nominal_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 /\
y_uncertainties**2) / np.sqrt(len(y_obs) - len(y_obs) + 2)
has value nominal_scale_factor = 32.73590595145554
PS. my numpy version is
1.18.5 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
Regarding the numpy.polyfit documentation:
By default, the covariance are scaled by chi2/sqrt(N-dof), i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
This looks like a documentation bug. The correct scaling factor for the covariance is chi_square/(N-M) where M is the number of fit parameters and N-M is the number of degrees of freedom. It looks like np.polyfit is implemented correctly, because my_scale_factor and scale_factor are consistent.
Regarding the question on why not "the square of the inverse of the y-data errors": a polynomial fit or more generally, a least-squares fit involves solving the p vector in
A # p = y
where A is an (N, M) matrix for N data points in y and M elements in p and each column in A is the polynomial term evaluated at the corresponding x values.
The solution minimizes
(SUM_j A[i, j] p[j] - y[i])^2
SUM -----------------------------
i sigma_y[i]^2
Computationally, the cheapest way to calculate this is by multiplying each row in A and each y value by the corresponding 1/sigma_y and then taking a standard least-square solution of the A#p=y equation. By having the user supply the inverse errors, it saves the fit routine from handling division by zero issues and slow square-root operations.
Regarding the first part, I opened a Github issue
https://github.com/numpy/numpy/issues/16842
The conclusion on that thread is that the documentation is wrong, but the function behaves correctly.
The documentation should be updated to
By default, the covariance is scaled by chi2/dof, i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:
plt.plot(xvalues, yvalues)
plt.show()
I assume the data needs to be smoothed somehow in order to give a smooth pdf.
I would like to plot the pdf. How can I do that?
The raw data is at: http://dpaste.com/1HVK5DR .
There are two main problems: Your data seems to be quite noisy, and it is not equally spaced: The points at the low end are sampled quite densly, while the ponts at the high end are sampled quite sparsely. This can cause numerical issues.
So first I suggest resampling the data using a linear interpolation to get equaly spaced samples: (Note that all the snippets appended to eachother form the content of one python file.)
import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file
print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]
#interpolate to have evenly space data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)
Then, to smoothen the data, I suggest performing a RBF regression (=using an "RBF Network"). The idea is fiting a curve of the form
c(t) = sum a(i) * phi(t - x(i)) #(not part of the program)
where phi is some radial basis function. (In theory we could use any functions.) To have a very smooth result I choose a very smooth function, namely a gaussian: phi(x) = exp( - x^2/sigma^2) where sigma is yet to be determined. The x(i) are just some nodes that we can define. If we have a smooth function, we just need a few nodes. The number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. In this case I just use a least squares approach.
Note that IF we can write a function in the form above, it is very easy to compute the derivative, it is just
c(t) = sum a(i) * phi'(t - x(i))
where phi' is the derivative of phi. #(not part of the program)
Regarding sigma: It is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma, the smoother the resulting function gets.
#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20 #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
# kernel & derivative
rbf = lambda r:1/(1+(r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2
#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)
#evaluate rbf network
N=1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae # coeffs
#evaluate derivative
N=1000
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xe[:, None] - rbf_nodes)
yd = Bd # coeffs
fig,ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')
ax2.plot(xd, yd, '-')
fig.savefig('graph.png')
print('done')
You need the derivative to go from CDF to PDF
PDF(x) = d CDF(x)/ dx
With NumPy, you could use gradient
pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()
or manual differential
pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()
Both produce something like, updated picture it got botched somehow