I've always thought it would be useful to calculate the probability of a value falling between two points on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found in an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges to 0.6338. This confused me, as the 68-95-99.7 rule states that, for a normal distribution, the probability of a value being within one standard deviation of the mean in either direction should be about 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
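Something like this is what I had in mind: integrate the kdeplot line data over the interval with np.trapz, using np.interp to evaluate the curve at the interval endpoints (the helper name is mine, and this is only a sketch of the idea):
def kde_interval_probability(xs, ys, lower, upper):
    # Approximate the area under the KDE curve (xs, ys) between lower and upper
    inside = (xs >= lower) & (xs <= upper)
    grid_x = np.concatenate(([lower], xs[inside], [upper]))
    grid_y = np.concatenate(([np.interp(lower, xs, ys)], ys[inside], [np.interp(upper, xs, ys)]))
    return np.trapz(grid_y, grid_x)

kde_interval_probability(xs, ys, x.mean() - x.std(), x.mean() + x.std())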
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's EPA (expected points added) per pass from last season:
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
+ str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
Make a more "normal" distribution with sampling means in order to incorporate cdfs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
or
If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
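For concreteness, here is a minimal sketch of what I mean, reusing the Brady EPA series df from above (the bootstrap loop, the number of resamples, and the 0.1 threshold are just illustrative choices):
from scipy.stats import norm
import numpy as np

rng = np.random.default_rng(0)
vals = df.dropna().to_numpy()   # Brady EPA values from above

# bootstrap distribution of the sample mean
boot_means = np.array([rng.choice(vals, size=len(vals), replace=True).mean()
                       for _ in range(2000)])

# fit a normal to the bootstrapped means and use its CDF,
# e.g. P(mean EPA per pass > 0.1); the 0.1 threshold is just for illustration
loc_hat, scale_hat = norm.fit(boot_means)
p = 1 - norm.cdf(0.1, loc=loc_hat, scale=scale_hat)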
Computing the probability p for some interval is not overly complicated, but it can be tricky to combine the right tools, in particular because there are several statistical approaches to choose from.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed between them? If the cumulative distribution function (CDF) F is known, it is simply p = F(upper) - F(lower). Equivalently, p coincides with the area enclosed by the graph of the probability density function (PDF) f over the interval [lower, upper].
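For instance, for a standard normal distribution this immediately recovers the familiar 68% figure from the question:
from scipy.stats import norm
p = norm.cdf(1.0) - norm.cdf(-1.0)   # P(-1 < X < 1) for X ~ N(0, 1), roughly 0.6827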
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area enclosed by its graph over the interval will do. But there are several paradigms and estimation procedures for obtaining that estimate.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or for convenience. Then one just needs to estimate its parameters mu (aka location or mean) and sigma (aka scale or standard deviation). scipy.stats provides all we need in this setting; it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields the plot of the fitted normal PDF with the interval [lower, upper] shaded in red and the value of p annotated.
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''Wrapper returning the KDE density as a plain function of x (so quad can integrate it).'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields the analogous plot for the kernel density estimate, again with the interval shaded and p annotated.
I do see a bug in the get_probability function, but that bug causes it to compute a result that is too high: in np.sum(kd_vals * step), it multiplies N sample values by a step whose denominator is N - 1, so the output is a factor of N/(N-1) too high. (If they wanted a trapezoid-rule computation of the integral, they should have halved the left and right endpoint values first.)
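A minimal fix along those lines (just a sketch; np.trapz applies the trapezoid rule, including the halved endpoints, for you):
import numpy as np

def get_probability_trapz(start_value, end_value, eval_points, kd):
    # Integrate the KDE density over [start_value, end_value] with the trapezoid rule
    x = np.linspace(start_value, end_value, eval_points)[:, np.newaxis]
    kd_vals = np.exp(kd.score_samples(x))
    return np.trapz(kd_vals, x.ravel())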
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
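You can check this directly by shrinking the bandwidth (a quick sketch; the 0.05 value is just an illustrative choice, and a data-driven rule such as Scott's or Silverman's would also do):
from sklearn.neighbors import KernelDensity
import numpy as np

x = np.random.normal(loc=0.0, scale=1.0, size=100_000)
kd_narrow = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(x.reshape(-1, 1))

grid = np.linspace(x.mean() - x.std(), x.mean() + x.std(), 200)
dens = np.exp(kd_narrow.score_samples(grid[:, np.newaxis]))
print(np.trapz(dens, grid))   # close to 0.68 rather than 0.63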
I am having a hard time figuring out the scaling for the covariance matrix in numpy polyfit.
In the documentation I read that the scaling factor to go from an unscaled to a scaled covariance matrix is
chi2 / sqrt(N - DOF).
In the code attached below, it seems that the scaling factor actually is
chi2 / DOF
Here is my code
# Generate synthetically the data
# True parameters
import numpy as np
true_slope = 3
true_intercept = 7
x_data = np.linspace(-5, 5, 30)
# The y-data will have a noise term, to simulate imperfect observations
sigma = 1
y_data = true_slope * x_data + true_intercept
y_obs = y_data + np.random.normal(loc=0.0, scale=sigma, size=x_data.size)
# Here I generate artificially some unequal uncertainties
# (even if there is no reason for them to be so)
y_uncertainties = sigma * np.random.normal(loc=1.0, scale=0.5*sigma, size=x_data.size)
# Make the fit
popt, pcov = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov='unscaled')
popt, pcov_scaled = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov=True)
my_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 / y_uncertainties**2)\
/ (len(y_obs)-2)
scale_factor = pcov_scaled[0,0] / pcov[0,0]
If I run the code, I see that the actual scale factor is chi2 / DOF and not the value reported in the documentation. Is this true or am I missing something?
I have a further question. Why is it suggested to use just the inverse of the y-data error instead of the square of the inverse of the y-data errors for the weights in the case that the uncertainties are normally-distributed?
Edit to add the data generated by a run of the code
x_data = array([-5. , -4.65517241, -4.31034483, -3.96551724, -3.62068966,
-3.27586207, -2.93103448, -2.5862069 , -2.24137931, -1.89655172,
-1.55172414, -1.20689655, -0.86206897, -0.51724138, -0.17241379,
0.17241379, 0.51724138, 0.86206897, 1.20689655, 1.55172414,
1.89655172, 2.24137931, 2.5862069 , 2.93103448, 3.27586207,
3.62068966, 3.96551724, 4.31034483, 4.65517241, 5. ])
y_obs = array([-7.27819725, -8.41939411, -3.9089926 , -5.24622589, -3.78747379,
-1.92898727, -1.375255 , -1.84388812, -0.37092441, 0.27572306,
2.57470918, 3.860485 , 4.62580789, 5.34147103, 6.68231985,
7.38242258, 8.28346559, 9.46008873, 10.69300274, 12.46051285,
13.35049975, 13.28279961, 14.31604781, 16.8226239 , 16.81708308,
18.64342284, 19.37375515, 19.6714002 , 20.13700708, 22.72327533])
y_uncertainties = array([ 0.63543112, 1.07608924, 0.83603265, -0.03442888, -0.07049299,
1.30864191, 1.36015322, 1.42125414, 1.04099854, 1.20556608,
0.43749964, 1.635056 , 1.00627014, 0.40512511, 1.19638787,
1.26230966, 0.68253139, 0.98055035, 1.01512232, 1.83910276,
0.96763007, 0.57373151, 1.69358475, 0.62068133, 0.70030971,
0.34648312, 1.85234844, 1.18687269, 1.23841579, 1.19741206])
With this data I obtain that scale_factor = 1.6534129347542432, my_scale_factor = 1.653412934754234 and that the "nominal" scale factor reported in the documentation, i.e.
nominal_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 /\
y_uncertainties**2) / np.sqrt(len(y_obs) - len(y_obs) + 2)
has value nominal_scale_factor = 32.73590595145554
PS. My numpy version is 1.18.5, on Python 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)].
Regarding the numpy.polyfit documentation:
By default, the covariance are scaled by chi2/sqrt(N-dof), i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
This looks like a documentation bug. The correct scaling factor for the covariance is chi_square/(N-M) where M is the number of fit parameters and N-M is the number of degrees of freedom. It looks like np.polyfit is implemented correctly, because my_scale_factor and scale_factor are consistent.
Regarding the question on why not "the square of the inverse of the y-data errors": a polynomial fit or more generally, a least-squares fit involves solving the p vector in
A @ p = y
where A is an (N, M) matrix for N data points in y and M elements in p and each column in A is the polynomial term evaluated at the corresponding x values.
The solution minimizes
SUM_i [ (SUM_j A[i, j] * p[j] - y[i])^2 / sigma_y[i]^2 ]
Computationally, the cheapest way to calculate this is by multiplying each row in A and each y value by the corresponding 1/sigma_y and then taking a standard least-squares solution of the A @ p = y equation. By having the user supply the inverse errors, it saves the fit routine from handling division by zero issues and slow square-root operations.
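A small sketch of that reweighting trick for a degree-1 fit (the helper name and variables are mine; it reproduces the 1/sigma^2 objective with a plain least-squares solve):
import numpy as np

def weighted_polyfit_deg1(x, y, sigma_y):
    # Weighted linear fit: scale each row of the design matrix and y by 1/sigma
    A = np.column_stack([x, np.ones_like(x)])   # columns: slope term, intercept term
    w = 1.0 / sigma_y                           # the weights np.polyfit expects
    p, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
    return p                                    # [slope, intercept]

# e.g. weighted_polyfit_deg1(x_data, y_obs, y_uncertainties) should agree with popt above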
Regarding the first part, I opened a Github issue
https://github.com/numpy/numpy/issues/16842
The conclusion on that thread is that the documentation is wrong, but the function behaves correctly.
The documentation should be updated to
By default, the covariance is scaled by chi2/dof, i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
Currently I am implementing a Gaussian process regression model and I have been having some problems when trying to apply it to my problem. My model has three input variables, one of which (theta) has a much more significant impact on the output than the other two, alpha1 and alpha2. The inputs and outputs have the following values (just a few values, to give a better idea):
# X (theta, alpha1, alpha2)
array([[ 9.07660169, 0.61485493, 1.70396493],
[ 9.51498486, -5.49212002, -0.68659511],
[10.45737558, -2.2739529 , -2.03918961],
[10.46857663, -0.4587848 , 0.54434441],
[ 9.10133699, 8.38066374, 0.66538822],
[ 9.17279647, 0.36327109, -0.30558115],
[10.36532505, 0.87099676, -7.73775872],
[10.13681026, -1.64084098, -0.09169159],
[10.38549264, 1.80633583, 1.3453195 ],
[ 9.72533357, 0.55861224, 0.74180309]])
# y
array([4.93483686, 5.66226844, 7.51133372, 7.54435854, 4.92758927,
5.0955348 , 7.26606153, 6.86027353, 7.36488184, 6.06864003])
As can be seen, theta significantly alters the value of y, whereas changes in alpha1 and alpha2 have a much more subtle effect on y.
The situation I am facing is that I fit a model to my data and then run a minimization with SciPy on that model, with one of the input variables held fixed during the minimization. The code below might illustrate this better:
# model fitting
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
from scipy.optimize import minimize

kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9, optimizer='fmin_l_bfgs_b')
model.fit(X, y)
# minimization
bnds = np.array([(theta, theta),
                 (alpha1.min(), alpha1.max()),
                 (alpha2.min(), alpha2.max())])
x0 = [theta, alpha1.min(), alpha2.min()]
residual_plant = minimize(lambda x: -model.predict(np.array([x])),
                          x0, method='SLSQP', bounds=bnds,
                          options={'eps': np.radians(5)})
My goal is to fix the first variable at a given value and study the impact that the other two variables, alpha1 and alpha2, have on the output y for that specific theta. The reasoning behind the minimization is that I want to find the combination of alpha1 and alpha2 that returns the optimal y for this fixed theta. My concern is that theta, because of its much heavier weight, influences the output so drastically that it may hide the influence of alpha1 and alpha2 on my model. Yet I cannot simply ignore it or leave it out, since I want the optimal y for this particular theta and therefore still need theta as an input.
My question is: how do I deal with this issue? Is there a statistical trick to eliminate, or at least diminish, this influence without having to remove theta from my model? Is there a better way to deal with my problem?
First, did you normalize the data before training?
Second, it sounds like you want to see the relationship between alpha1/alpha2 and y with theta held constant.
If you sort your dataset by theta, you can look for a group of records where theta is the same or very similar, i.e. where its variance is low and it doesn't change much. You can take that group of data, form a new dataframe, and drop the theta column (because within this subset theta has low variance and so isn't very useful). Then you can train your model or do some data visualization on just the alpha1 and alpha2 data, as sketched below.
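A minimal sketch of that idea, assuming X and y are available as in the question (the 0.1 tolerance around the chosen theta is an arbitrary illustration):
import pandas as pd

df_all = pd.DataFrame(X, columns=['theta', 'alpha1', 'alpha2'])
df_all['y'] = y

# keep only records whose theta is close to a chosen value, then drop theta
# (with only the 10 example rows this slice will be tiny; it is meant for the full dataset)
theta_target = df_all['theta'].median()
subset = df_all[(df_all['theta'] - theta_target).abs() < 0.1].drop(columns='theta')

# study alpha1/alpha2 vs y on this near-constant-theta slice
print(subset.corr())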
My overall understanding of your question is that you want to achieve two things:
To study the effect of alpha1 and alpha2 with theta held constant (i.e. eliminating the influence of theta on the model).
To find the best combination of alpha1 and alpha2 that returns the optimal y for this fixed theta.
This can be framed as studying the correlation between the input variables and the target variable.
Since correlation measures how one variable changes in relation to another, it can give you good insight into the influence of alpha1, alpha2 and theta on y.
Two interesting correlations exist to help you:
Pearson's Correlation: Numerically reflects the strength of a linear correlation.
Spearman's Correlation: Numerically reflects the strength of a monotonic relationship (it is rank-based, so it still applies when the relationship is not linear).
Let's give it a try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(columns=['theta', 'alpha1', 'alpha2', 'y'],
data=[[ 9.07660169, 0.61485493, 1.70396493, 4.93483686],
[ 9.51498486, -5.49212002, -0.68659511, 5.66226844],
[10.45737558, -2.2739529 , -2.03918961, 7.51133372],
[10.46857663, -0.4587848 , 0.54434441, 7.54435854],
[ 9.10133699, 8.38066374, 0.66538822, 4.92758927],
[ 9.17279647, 0.36327109, -0.30558115, 5.0955348],
[10.36532505, 0.87099676, -7.73775872, 7.26606153],
[10.13681026, -1.64084098, -0.09169159, 6.86027353],
[10.38549264, 1.80633583, 1.3453195, 7.36488184],
[ 9.72533357, 0.55861224, 0.74180309, 6.06864003]])
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap')
plt.show()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap')
plt.show()
As you can see, we get good insight into how theta, alpha1 and alpha2 relate to each other and to y.
According to Cohen's Standard, we can conclude that:
Alpha1 and Alpha2 have medium correlation with y.
Theta has very strong correlation with y.
Alpha1 has weak linear correlation with alpha2 but medium monotonic correlation.
Alpha1 and Alpha2 have medium correlation with theta.
But wait a minute: since alpha1 and alpha2 each have a medium correlation with y but only a weak-to-medium correlation with each other, we can exploit the variances to build a new feature L that is a linear combination of alpha1 and alpha2, as follows:
Let m and n be two weights chosen to maximize the correlation of the combined feature with y:
L = m * alpha1 + n * alpha2
The optimal coefficients m and n achieving the maximum correlation between L and y do depend on the variances of alpha1, alpha2 and y.
We can derive from that, the following optimization solution:
m = [ (Cov(b, c) * Cov(a, b) - Cov(a, c) * Var(b)) / (Cov(a, c) * Cov(a, b) - Cov(b, c) * Var(a)) ] * n
where a, b and c correspond to alpha1, alpha2 and y respectively.
By fixing one of m or n at 1 (or -1), we can solve for the other and engineer the new feature.
cov = df[['alpha1', 'alpha2', 'y']].cov()
# applying the optimization function: a = alpha1 , b = alpha2 and c = y
# note that cov of a feature with itself = variance
coef = (cov['alpha2']['y'] * cov['alpha1']['alpha2'] - cov['alpha1']['y'] * cov['alpha2']['alpha2']) / \
(cov['alpha1']['y'] * cov['alpha1']['alpha2'] - cov['alpha2']['y'] * cov['alpha1']['alpha1'])
# let n = 1 --> m = coef --> L = coef * alpha1 + alpha2 : which is the new feature to add
df['alpha12'] = coef * df['alpha1'] + df['alpha2']
As you can see, there is a noticeable improvement in the correlation of the introduced alpha12.
Furthermore, regarding point 1, to decrease the correlation of theta: the correlation is given by
Corr(theta, y) = Cov(theta, y) / [sqrt(Var(theta)) * sqrt(Var(y))]
You can increase the variance of theta. To do so, simply sample n points from some distribution and add them to the corresponding indices as noise.
Save this noise list for future use in case you need to get back to the original theta, something like this:
cov = df[['y', 'theta']].cov()
print("Theta Variance :: Before = {}".format(cov['theta']['theta']))
np.random.seed(2020) # add seed to make it reproducible for future undo
# create noise drawn from uniform distribution
noise = np.random.uniform(low=1.0, high=10., size=df.shape[0])
df['theta'] += noise # add noise to increase variance
cov = df[['y', 'theta']].cov()
print("Theta Variance :: After = {}".format(cov['theta']['theta']))
# df['theta'] -= noise   # undo: subtract the noise to recover the original theta
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
Theta Variance :: Before = 0.3478030891329485
Theta Variance :: After = 7.552229545792681
Now Alpha12 is taking the lead and has the highest influence on the target variable y.
I would say that the effect of theta on your predictor can't hide the effect of the other variables. The effect of the other variables may be small, but that's likely just the way it is. I'd take the estimate you have and optimise y for constant theta as is.
I'm following the answer to the question "How can I sample a multivariate log-normal distribution in Python?", but I'm finding that the marginal distributions of the sample data fail to have the same mean and standard deviation as the inputted marginals. For example, consider the multivariate distribution in the code sample below. If we label the marginals as X, Y, and Z, then I would expect the location and scale parameters (implied from the sample data) to match the inputted data. However, for X, you can see below that the implied location and scale parameters are 0.1000 and 0.5219: the location is what we expect, but the scale is off by about 4%. I think I'm doing something wrong with the covariance matrix, but I can't seem to figure out what. If I set the correlation matrix to the identity matrix, then the location and scale of the sample data do match the inputted data. Something must be wrong with my covariance matrix, or I'm making another fundamental error. Any help would be appreciated. Please advise if the question is unclear.
import pandas as pd
import numpy as np
from copy import deepcopy
mu = [0.1, 0.2, 0.3]
sigma = [0.5, 0.8, 0.6]
sims = 3000000
rho = [[1, 0.9, 0.3], [0.9, 1, 0.8], [0.3, 0.8 ,1]]
cov = deepcopy(rho)
for row in range(len(rho)):
    for col in range(len(rho)):
        cov[row][col] = rho[row][col] * sigma[row] * sigma[col]
mvn = np.random.multivariate_normal(mu, cov, size=sims)
sim = pd.DataFrame(np.exp(mvn), columns=['X', 'Y', 'Z'])
def computeImpliedLogNormalsParams(mean, std):
    # Imply lognormal parameters (location, scale) that match the inputted mean and std (moment matching)
    secondMoment = std * std + mean * mean
    location = np.log(mean * mean / np.sqrt(secondMoment))
    scale = np.sqrt(np.log(secondMoment / (mean * mean)))
    return (location, scale)
def printDistributionProp(col, sim):
    print(f"Mean = {sim[col].mean()}, std = {sim[col].std()}")
    location, scale = computeImpliedLogNormalsParams(sim[col].mean(), sim[col].std())
    print(f"Matching moments gives a lognormal with location {location} and scale {scale}")
printDistributionProp('X', sim)
Output:
Mean = 1.2665338803521895, std = 0.708713940557892
Matching moments gives a lognormal with location 0.10008162992913544 and scale 0.5219239625443672
Observing the output, we would expect the scale parameter to be very close to 0.5, but it's a bit off. Increasing the number of simulations does nothing, since the value has already converged.
The covariance matrix isn't positive semidefinite:
>>> mvn = np.random.multivariate_normal(mu, cov, size=sims, check_valid='raise')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mtrand.pyx", line 4542, in mtrand.RandomState.multivariate_normal
ValueError: covariance is not symmetric positive-semidefinite.
and therefore there is no distribution of data that actually has the requested covariance structure. At a high-level, consider that you are specifying X and Z to both be highly correlated with Y (0.8 and 0.9), but at the same time to be rather weakly correlated with each other (0.3). A detailed discussion specifically about three variable correlation constraints can be found on Mathematics SE.
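A quick way to check this yourself is to look at the eigenvalues of the correlation (or covariance) matrix; a valid one must have no negative eigenvalues (just a sketch):
import numpy as np

rho = np.array([[1, 0.9, 0.3], [0.9, 1, 0.8], [0.3, 0.8, 1]])
print(np.linalg.eigvalsh(rho))   # one eigenvalue is negative, so rho is not positive semidefinite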
I don't know the internals of how NumPy gets around it (you should have seen a warning), but if you check the final correlation structure:
>>> np.corrcoef(mvn.T)
array([[1. , 0.79817321, 0.33343102],
[0.79817321, 1. , 0.74525583],
[0.33343102, 0.74525583, 1. ]])
one can see that X and Z have lower correlations with Y, and a higher correlation with each other, than originally specified by rho. Again, I'm not sure how exactly the variances get adjusted, but because the requested covariance is impossible, NumPy can pretty much do what it wants; fortunately, it seems to stay pretty close.
I have been working to implement a Kalman filter to search for anomalies in a two dimensional data set. Very similar to the excellent post that I found here. As a next step, I'd like to predict confidence intervals (for example 95% confidence for floor and ceiling values) for what I predict the next values will fall in. So in addition to the line below, I'd like to be able to generate two additional lines which represent a 95% confidence that the next value will be above the floor or below the ceiling.
I assume that I'll want to use the uncertainty covariance matrix (P) that is returned with each prediction generated by the Kalman filter but I'm not sure if it's right. Any guidance or reference to how to do this would be much appreciated!
kalman 2d filter in python
The code in the post above generates a set of measurements over time and uses a Kalman filter to smooth the results.
import numpy as np
import matplotlib.pyplot as plt
def kalman_xy(x, P, measurement, R,
              motion = np.matrix('0. 0. 0. 0.').T,
              Q = np.matrix(np.eye(4))):
    """
    Parameters:
    x: initial state 4-tuple of location and velocity: (x0, x1, x0_dot, x1_dot)
    P: initial uncertainty covariance matrix
    measurement: observed position
    R: measurement noise
    motion: external motion added to state vector x
    Q: motion noise (same shape as P)
    """
    return kalman(x, P, measurement, R, motion, Q,
                  F = np.matrix('''
                      1. 0. 1. 0.;
                      0. 1. 0. 1.;
                      0. 0. 1. 0.;
                      0. 0. 0. 1.
                      '''),
                  H = np.matrix('''
                      1. 0. 0. 0.;
                      0. 1. 0. 0.'''))

def kalman(x, P, measurement, R, motion, Q, F, H):
    '''
    Parameters:
    x: initial state
    P: initial uncertainty covariance matrix
    measurement: observed position (same shape as H*x)
    R: measurement noise (same shape as H)
    motion: external motion added to state vector x
    Q: motion noise (same shape as P)
    F: next state function: x_prime = F*x
    H: measurement function: position = H*x

    Return: the updated and predicted new values for (x, P)

    See also http://en.wikipedia.org/wiki/Kalman_filter

    This version of kalman can be applied to many different situations by
    appropriately defining F and H
    '''
    # UPDATE x, P based on measurement m
    # distance between measured and current position-belief
    y = np.matrix(measurement).T - H * x
    S = H * P * H.T + R  # residual covariance
    K = P * H.T * S.I    # Kalman gain
    x = x + K * y
    I = np.matrix(np.eye(F.shape[0]))  # identity matrix
    P = (I - K * H) * P

    # PREDICT x, P based on motion
    x = F * x + motion
    P = F * P * F.T + Q

    return x, P

def demo_kalman_xy():
    x = np.matrix('0. 0. 0. 0.').T
    P = np.matrix(np.eye(4)) * 1000  # initial uncertainty

    N = 20
    true_x = np.linspace(0.0, 10.0, N)
    true_y = true_x**2
    observed_x = true_x + 0.05 * np.random.random(N) * true_x
    observed_y = true_y + 0.05 * np.random.random(N) * true_y
    plt.plot(observed_x, observed_y, 'ro')

    result = []
    R = 0.01**2
    for meas in zip(observed_x, observed_y):
        x, P = kalman_xy(x, P, meas, R)
        result.append((x[:2]).tolist())
    kalman_x, kalman_y = zip(*result)
    plt.plot(kalman_x, kalman_y, 'g-')
    plt.show()

demo_kalman_xy()
The 2D generalization of the 1-sigma interval is the confidence ellipse, characterized by the equation (x - mx).T * P^{-1} * (x - mx) == 1, where x is the 2D parameter vector, mx the 2D mean (ellipse center) and P^{-1} the inverse covariance matrix. See this answer on how to draw one. As with the sigma intervals, the ellipse's area corresponds to a fixed probability that the true value lies within it. By scaling with a factor n (scaling the interval length or the ellipse radii) a higher confidence can be reached. Note that a given factor n corresponds to different probabilities in one and two dimensions:
 n | 1D interval | 2D ellipse
---+-------------+------------
 1 |   68.27%    |   39.35%
 2 |   95.45%    |   86.47%
 3 |   99.73%    |   98.89%
Calculating these values in 2D is a bit involved and unfortunately I don't have a public reference to it.
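That said, the numbers in the table can be reproduced numerically: for Gaussian data the squared Mahalanobis distance follows a chi-square distribution with as many degrees of freedom as there are dimensions, so a quick check (my addition, not part of the original reasoning) is:
from scipy.stats import chi2

for n in (1, 2, 3):
    p1d = chi2.cdf(n**2, df=1)   # P(|x - m| <= n*sigma) in 1D
    p2d = chi2.cdf(n**2, df=2)   # P(inside the n-scaled ellipse) in 2D
    print(f"n={n}: 1D {p1d:.2%}, 2D {p2d:.2%}")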
If you want a 95% interval to predict the next values will fall in, then you want a prediction interval and not a confidence interval (http://en.wikipedia.org/wiki/Prediction_interval).
For 2-D (3-D) data, the semi-axes of the ellipse (ellipsoid) can be found by calculating the eigenvalues of the covariance matrix of the data and adjusting the size of the semi-axes to account for the necessary prediction probability.
See Prediction ellipse and prediction ellipsoid for Python code to calculate the 95% prediction ellipse or ellipsoid.
This might help you to calculate the prediction ellipse for your data.
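As a rough sketch of that eigenvalue approach, assuming P_xy is the 2x2 position block of the filter's covariance matrix: the scaling to 95% below uses the chi-square quantile with 2 degrees of freedom, which treats the covariance as known; a prediction ellipse built from sampled data, as in the linked reference, uses a somewhat larger F-based factor.
import numpy as np
from scipy.stats import chi2

def prediction_ellipse_axes(P_xy, prob=0.95):
    # Semi-axis lengths and directions of the ellipse enclosing `prob` probability
    s = chi2.ppf(prob, df=2)                   # scale factor for the desired probability
    eigvals, eigvecs = np.linalg.eigh(P_xy)    # eigen-decomposition of the 2x2 covariance
    semi_axes = np.sqrt(s * eigvals)           # semi-axis lengths
    return semi_axes, eigvecs                  # columns of eigvecs are the axis directions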
Because your statistic is derived from a sample, there is roughly a 50% probability that the population statistic is greater than your sample-based 2-sigma value. So consider whether you really have a good bound that the next measurement will fall below with probability 0.95 if you have not applied an upper confidence factor to the sample standard deviation. The magnitude of that factor depends on the sample size used to derive the covariance matrix: the smaller the sample, the larger the factor needed so that, with high confidence, the population 0.95 bound lies below the scaled-up sample statistic.
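To make that concrete, one standard way to get an upper confidence bound on a standard deviation estimated from n samples uses the chi-square distribution (a sketch under the usual normality assumption; the confidence level and n are placeholders):
import numpy as np
from scipy.stats import chi2

def upper_confidence_sigma(sample_std, n, conf=0.95):
    # (n-1) * s^2 / sigma^2 ~ chi2(n-1), so the upper bound on sigma uses the lower chi2 quantile
    factor = np.sqrt((n - 1) / chi2.ppf(1 - conf, df=n - 1))
    return sample_std * factor

# e.g. with n = 20 samples the 95% upper bound inflates s by roughly a factor of 1.37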