How to compute and plot the pdf from the empirical cdf? - python

I have two numpy arrays, one is an array of x values and the other an array of y values and together they give me the empirical cdf. E.g.:
plt.plot(xvalues, yvalues)
plt.show()
I assume the data needs to be smoothed somehow in order to give a smooth pdf.
I would like to plot the pdf. How can I do that?
The raw data is at: http://dpaste.com/1HVK5DR .

There are two main problems: Your data seems to be quite noisy, and it is not equally spaced: the points at the low end are sampled quite densely, while the points at the high end are sampled quite sparsely. This can cause numerical issues.
So first I suggest resampling the data using linear interpolation to get equally spaced samples. (Note that all the snippets appended to each other form the content of one Python file.)
import matplotlib.pyplot as plt
import numpy as np
from data import xvalues, yvalues #load data from file
print("#datapoints: {}".format(len(xvalues)))
#don't use every point if your computer is not very fast
xv = np.array(xvalues)[::5]
yv = np.array(yvalues)[::5]
#interpolate to get evenly spaced data
xi = np.linspace(xv.min(), xv.max(), 400)
yi = np.interp(xi, xv, yv)
Then, to smooth the data, I suggest performing an RBF regression (= using an "RBF network"). The idea is fitting a curve of the form
c(t) = sum a(i) * phi(t - x(i)) #(not part of the program)
where phi is some radial basis function. (In theory we could use any functions.) To get a very smooth result, choose a very smooth phi; the classic choice is a Gaussian phi(r) = exp(-r^2/sigma^2), though the code below uses the similarly smooth inverse quadratic phi(r) = 1/(1 + (r/sigma)^2), where sigma is yet to be determined. The x(i) are just some nodes that we can define. If the function is smooth, we only need a few nodes; the number of nodes also determines how much computation needs to be done. The a(i) are the coefficients we can optimize to get the best fit. Here I just use a least squares approach.
Note that if we can write a function in the form above, it is very easy to compute the derivative; it is just
c'(t) = sum a(i) * phi'(t - x(i)) #(not part of the program)
where phi' is the derivative of phi.
Regarding sigma: It is usually a good idea to choose it as a multiple of the step between the nodes we chose. The greater we choose sigma, the smoother the resulting function gets.
#set up rbf network
rbf_nodes = xv[::50][None, :]#use a subset of the x-values as rbf nodes
print("#rbfs: {}".format(rbf_nodes.shape[1]))
#estimate width of kernels:
sigma = 20 #greater = smoother, this is the primary parameter to play with
sigma *= np.max(np.abs(rbf_nodes[0,1:]-rbf_nodes[0,:-1]))
# kernel & derivative
rbf = lambda r: 1/(1 + (r/sigma)**2)
Drbf = lambda r: -2*r*sigma**2/(sigma**2 + r**2)**2
#compute coefficients of rbf network
r = np.abs(xi[:, None]-rbf_nodes)
A = rbf(r)
coeffs = np.linalg.lstsq(A, yi, rcond=None)[0]
print(coeffs)
#evaluate rbf network
N = 1000
xe = np.linspace(xi.min(), xi.max(), N)
Ae = rbf(xe[:, None] - rbf_nodes)
ye = Ae @ coeffs
#evaluate derivative (this is the pdf estimate)
xd = np.linspace(xi.min(), xi.max(), N)
Bd = Drbf(xd[:, None] - rbf_nodes)
yd = Bd @ coeffs
fig,ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(xv, yv, '-')
ax.plot(xi, yi, '-')
ax.plot(xe, ye, ':')
ax2.plot(xd, yd, '-')
fig.savefig('graph.png')
print('done')

You need the derivative to go from CDF to PDF:
PDF(x) = d CDF(x) / dx
With NumPy, you could use np.gradient:
pdf = np.gradient(yvalues, xvalues)
plt.plot(xvalues, pdf)
plt.show()
or a manual difference quotient:
pdf = np.diff(yvalues)/np.diff(xvalues)
l = np.asarray(xvalues[:-1])
r = np.asarray(xvalues[1:])
plt.plot((l+r)/2.0, pdf) # points in the middle of interval
plt.show()
Both approaches produce a similar plot of the resulting PDF.

Related

How to calculate the probability between two numbers from a probability distribution in python

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
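For example, something along these lines (just a sketch; lo and hi are hypothetical bounds, and xs, ys are the kde line extracted above):
lo, hi = 25, 75  # hypothetical bounds
grid = np.linspace(lo, hi, 200)
dens = np.interp(grid, xs, ys)  # interpolate the kde line onto the interval
p = np.trapz(dens, grid)        # area under the curve ~ probability mass on [lo, hi]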
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
1. make a more "normal" distribution with sampling means in order to incorporate CDFs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples; is this not encouraged?), or
2. if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it might be tricky to combine the right tools to do so, in particular because there are several statistical approaches available.
0. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Likewise, p coincides with the area under the probability density function (PDF) f's graph on the interval [lower, upper].
However, when the CDF/PDF is unknown, this becomes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures to obtain it.
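For instance, if the distribution were known exactly, say a standard normal, two lines of scipy.stats already suffice:
from scipy import stats
# p = F(upper) - F(lower); for a standard normal on [-1, 1] this is ~0.6827
p = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)
print(p)  # 0.6826894921370859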
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka location or mean) and sigma (aka scale or standard deviation). scipy.stats provides all we need in this setting: it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields a plot of the fitted normal PDF with the area p over [lower, upper] shaded in red.
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function to compute the density at point x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)  # returns (integral value, error estimate)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields the analogous plot for the kernel density estimate.
I do see a bug in the get_probability function, but that bug causes it to compute a too high result - in np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
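To see both effects at once, here is a hedged sketch of a corrected computation: np.trapz replaces the biased Riemann sum, and a much smaller bandwidth (0.05 here, an arbitrary illustrative value) keeps the model close to the underlying normal:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=100000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(x.reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    grid = np.linspace(start_value, end_value, eval_points)
    kd_vals = np.exp(kd.score_samples(grid[:, np.newaxis]))  # PDF values on the grid
    return np.trapz(kd_vals, grid)  # trapezoid rule avoids the N/(N-1) bias
print(get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd))
# ~0.68, in line with the 68-95-99.7 rule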

Is there a method to fit a wave created from two waves?

I need to fit a sine curve created from two sine waves and extract the parameters for the fitted curve (such as frequency, amplitude, etc).
Data example:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.arange(0, 50, 0.01)
x2 = np.arange(0, 100, 0.02)
x3 = np.arange(0, 150, 0.03)
sin1 = np.sin(x)
sin2 = np.sin(x2)
sin3 = np.sin(x3/2)
sin4 = sin1 + sin2 + sin3
plt.plot(x, sin4)
plt.show()
I used the code provided in this answer.
yy = sin4
tt = x
res = fit_sin(tt, yy)
print(str(i), "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
fit_values=res["fitfunc"](tt)
Frequenc_fit= res['freq']
print(i, Frequenc_fit)
Frequenc_fit=Frequenc_fit
Amp_fit=res['amp']
Omega_fit=res['omega']
Phase_fit=res['phase']
Offset_fit=res['offset']
maxcov_fit=res['maxcov']
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt,fit_values, "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
I got a fitted sine curve with a single frequency and amplitude as follows:
Amplitude=1.0149282025860233, Angular freq.=2.01112187048004, phase=-0.2730905030152767, offset=0.003304158823058212, Max. Cov.=0.0015266032307905222
0.3200799868471169
Is there a method to obtain a fitted curve that matches the original one?
Supposing that the function to be fitted is
y(x)=a * sin( w * x )+b * sin( W * x )
the principle of the method below is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
The graphical representation of the result is:
Blue curve: from data obtained by scanning the graph given in the question.
Black curve: from the above calculus.
The available data was not accurate because it comes from scanning the original figure. The deviation is mainly due to the numerical integrations in computing the values of SS and SSSS (four successive numerical integrations are not accurate, especially with biased data).
Probably the correct result should be: w=2, W=1, a=1, b=1.
NOTE : The above method is not iterative and thus doesn't require guessed values of the parameters to start an iterative process. The approximate results of the parameters can be good initial values in order to use an iterative non-linear regression process.
NOTE : If the values of w and W were known a priori, solving via linear regression would be very simple and much more accurate (only the last 2x2 matrix calculus shown above).
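For completeness, a sketch of the iterative route with scipy.optimize.curve_fit, assuming the two-sine model above and rough initial guesses (e.g. read off the FFT peaks); the data here is a synthetic two-component example, not the three-component curve from the question:
import numpy as np
from scipy.optimize import curve_fit
def two_sines(x, a, w, b, W):
    return a * np.sin(w * x) + b * np.sin(W * x)
x = np.arange(0, 50, 0.01)
y = np.sin(x) + np.sin(2 * x)   # synthetic two-component data
p0 = [1.0, 1.1, 1.0, 1.9]       # rough initial guesses, e.g. from FFT peaks
popt, pcov = curve_fit(two_sines, x, y, p0=p0)
print(popt)                     # should approach a=1, w=1, b=1, W=2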

scipy gaussian_kde produces different results depending on method used (weights vs no weights)

I have a series of coordinates that I want to apply a KDE to, and have been using scipy.stats.gaussian_kde to do so. The issue here is that this function expects a discrete set of coordinates, which it would then perform a density estimation of.
This causes issues when I wish to log my data (for sets where the coordinates are particularly sparse, since using the untouched data gives very little information). As you can imagine, if you must work with discrete counts of points, and one point appears 18 times and another 24 times, taking the log of 18 and 24 will make them identical, as the logs must be rounded to the nearest integer in order to remain discrete.
As a workaround for this, I have been using the weights parameter of scipy.stats.gaussian_kde. Instead of having an array where each point appears a number of times equal to its density, each point appears a single time and is instead weighted by its density. So now, using the example before, the two points that have density 18 and 24 will not be identical, since with weightings these densities can be continuous.
This works and produces what appears to be a good estimate, however using these two different methods, they both produce graphs with minor differences. If I had just been using one method, I would remain blissfully ignorant, but now that I've used both, I can't be sure the estimate is reasonable.
Is there a reason these two methods produce differing results?
See below some example code that reproduces the issue:
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
discrete_points = np.random.randint(0,10,size=(2,400))
continuous_points = [[],[]]
continuous_weights = []
recorded_points = []
for i in range(discrete_points.shape[1]):
    p = discrete_points[:,i]
    if tuple(p) in recorded_points:
        continuous_weights[recorded_points.index(tuple(p))] += 1
    else:
        continuous_points[0].append(p[0])
        continuous_points[1].append(p[1])
        continuous_weights.append(1)
        recorded_points.append(tuple(p))
resolution = 1
kde = gaussian_kde(discrete_points)
x, y = discrete_points
# https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
x_step = int((max(x)-min(x))/resolution)
y_step = int((max(y)-min(y))/resolution)
xgrid = np.linspace(min(x), max(x), x_step+1)
ygrid = np.linspace(min(y), max(y), y_step+1)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
Zgrid = Z.reshape(Xgrid.shape)
ext = [min(x), max(x), min(y), max(y)]
earth = plt.cm.gist_earth_r
plt.imshow(Zgrid,
           origin='lower', aspect='auto',
           extent=ext,
           alpha=0.8,
           cmap=earth)
plt.title("Discrete method (no weights)")
plt.savefig("noweights.png")
kde = gaussian_kde(continuous_points, weights=continuous_weights)
x, y = continuous_points
# https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
x_step = int((max(x)-min(x))/resolution)
y_step = int((max(y)-min(y))/resolution)
xgrid = np.linspace(min(x), max(x), x_step+1)
ygrid = np.linspace(min(y), max(y), y_step+1)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
Zgrid = Z.reshape(Xgrid.shape)
ext = [min(x), max(x), min(y), max(y)]
earth = plt.cm.gist_earth_r
plt.imshow(Zgrid,
           origin='lower', aspect='auto',
           extent=ext,
           alpha=0.8,
           cmap=earth)
plt.title("Continuous method (weights)")
plt.savefig("weights.png")
Which produces the two plots, "Discrete method (no weights)" and "Continuous method (weights)", which show minor differences.
An important aspect of a KDE is the bandwidth used. SciPy's gaussian_kde uses "Scott's factor" as a guess for the bandwidth.
In particular, gaussian_kde uses n**(-1./(d+4)) where d is the dimension (2 in this case) and n is
- the number of data points, in the case of the non-weighted version;
- the "effective number of data points", in the case of the weighted version; it is calculated as neff = sum(weights)^2 / sum(weights^2).
In the example of the post n = 400 and neff = sum(continuous_weights)**2 / sum([w**2 for w in continuous_weights]) = 84.0336.
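A quick check with the arrays defined in the question confirms that the two default factors differ:
n = discrete_points.shape[1]                     # 400
w = np.asarray(continuous_weights, dtype=float)
neff = w.sum()**2 / (w**2).sum()                 # ~84.03
print(n**(-1/6), neff**(-1/6))                   # Scott's factor for d=2: ~0.368 vs ~0.478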
To get the same result, the same bandwidth should be used in both cases. It can be set explicitly as gaussian_kde(..., bw_method=bandwidth).
bandwidth = discrete_points.shape[1]**(-1./(2+4))
# kde without weights
kde = gaussian_kde(discrete_points, bw_method=bandwidth)
# kde for the weighted points
kde = gaussian_kde(continuous_points, weights=continuous_weights, bw_method=bandwidth)
If you plan to create multiple plots, you probably want to use the same bandwidth for all of them, independent to the number of points or the weights. You might want to experiment with the resolution and the bandwidth. A higher bandwidth smooths everything out over a larger distance, a smaller bandwidth is more faithful to given data.
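For instance, a sketch that redraws the weighted estimate for a few explicit bandwidths (the values are arbitrary), reusing the grid from the question:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, bw in zip(axes, [0.2, 0.4, 0.8]):
    kde = gaussian_kde(continuous_points, weights=continuous_weights, bw_method=bw)
    Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()])).reshape(Xgrid.shape)
    ax.imshow(Z, origin='lower', aspect='auto', extent=ext, alpha=0.8, cmap=earth)
    ax.set_title("bw_method=%s" % bw)
plt.show()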

I know scipy curve_fit can do better

I'm using python/numpy/scipy to implement this algorithm for aligning two digital elevation models (DEMs) based on terrain aspect and slope:
"Co-registration and bias corrections of satellite elevation data sets for quantifying glacier thickness change", C. Nuth and A. Kääb, doi:10.5194/tc-5-271-2011
I have a framework set up, but the quality of the fit provided by scipy.optimize.curve_fit is poor.
import sys
import random
import numpy
import matplotlib.pyplot as plt
import scipy.optimize as optimization
from osgeo import gdal
import getperc_new  # user's helper module providing gdal_getma

def f(x, a, b, c):
    y = a * numpy.cos(numpy.deg2rad(b - x)) + c
    return y

def compute_offset(dh, slope, aspect):
    idx = random.sample(range(dh.compressed().size), 10000)
    xdata = numpy.array(aspect.compressed()[idx], float)
    ydata = numpy.array((dh/numpy.tan(numpy.deg2rad(slope))).compressed()[idx], float)
    #Generate synthetic data to test curve_fit
    #xdata = numpy.arange(0,360,0.01)
    #ydata = f(xdata, 20.0, 130.0, -3.0) + 20*numpy.random.normal(size=len(xdata))
    print(xdata)
    print(ydata)
    x0 = numpy.array([0.0, 0.0, 0.0])
    fit = optimization.curve_fit(f, xdata, ydata, x0)[0]
    #optimization.leastsq(f, x0[:], args=(xdata, ydata))
    genplot(xdata, ydata, fit)
    return fit

def genplot(x, y, fit):
    a = numpy.arange(0, 360)
    f_a = f(a, fit[0], fit[1], fit[2])
    idx = random.sample(range(x.size), 10000)
    plt.figure()
    plt.xlabel('Aspect (deg)')
    plt.ylabel('dh/tan(slope) (m)')
    plt.plot(x[idx], y[idx], 'r.')
    plt.axhline(color='k')
    plt.plot(a, f_a, 'b')
    plt.ylim(-80, 80)
    plt.show()
#Input DEMs
dem1_fn = sys.argv[1]
dem2_fn = sys.argv[2]
dem1_ds = gdal.Open(dem1_fn, gdal.GA_ReadOnly)
dem2_ds = gdal.Open(dem2_fn, gdal.GA_ReadOnly)
#Extract band 1 from each dataset as masked array using internal nodata value
dem1 = getperc_new.gdal_getma(dem1_ds, 1)
dem2 = getperc_new.gdal_getma(dem2_ds, 1)
#Produce slope and aspect maps using gdaldem and load into masked arrays
dem1_slope = gdaldem_slope(dem1_fn)
dem1_aspect = gdaldem_aspect(dem1_fn)
#Compute common mask and apply to all products
common_mask = dem1.mask + dem2.mask + dem1_slope.mask + dem1_aspect.mask
diff_euler = numpy.ma.array(dem2-dem1, mask=common_mask)
dem1_slope.__setmask__(common_mask)
dem1_aspect.__setmask__(common_mask)
#Compute relationship between elevation difference, slope and aspect
fit = compute_offset(diff_euler, dem1_slope, dem1_aspect)
print(fit)
Here is the fit for my data, which initially consists of ~2 million points, but I've randomly sampled for testing/plotting purposes:
[ -14.9639559 216.01093596 -41.96806735]
There is plenty of data there for a good fit, but the result from curve_fit is poor. When I run with synthetic data, I get a nice fit:
original input parameters [20.0, 130.0, -3.0]
result from curve_fit [-19.66719631 -49.6673076 -3.12198723]
(These parameters are equivalent up to sign and a 180-degree phase shift: -a*cos(b - x) = a*cos((b + 180) - x).)
Not sure if this has something to do with using masked arrays, a limitation of curve_fit, or if I'm just overlooking something simple. Thanks for any suggestions.
==========================
Edit 9/4/13 16:30 PDT
As suggested by @Evert and others, the problem was definitely related to outliers. I was able to obtain a much better fit after removing outliers. Looking at my old code, it seems I computed the median absolute deviation for each aspect, then removed anything outside of 2*mad before fitting.
I generated a few additional plots back in November 2012:
But looking at these again, I'm almost positive they were generated for different input data. It's all that I can find right now, so I'm including them here as an example of a case with biased sampling. This method for DEM alignment is bound to fail for cases like these - and it has nothing to do with scipy's curve fitting abilities.
I ended up developing a different approach for alignment involving normalized cross-correlation, sub-pixel refinement, and vertical offset removal for two masked 2D numpy arrays. It is faster and consistently provides better results, although even that approach has since been superseded by the Iterative Closest Point (ICP) tool (pc_align) developed by Oleg Alexandrov as part of the NASA Ames Stereo Pipeline.
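For reference, a heavily simplified sketch of the cross-correlation idea (integer-pixel shift only; no masking, sub-pixel refinement, or outlier handling, all of which matter in practice):
from scipy.signal import fftconvolve

def estimate_shift(a, b):
    #cross-correlate the mean-removed arrays; the peak location gives the shift of b vs a
    #(masked values would need to be filled first, e.g. dem1.filled(dem1.mean()))
    corr = fftconvolve(a - a.mean(), (b - b.mean())[::-1, ::-1], mode='same')
    iy, ix = numpy.unravel_index(numpy.argmax(corr), corr.shape)
    return iy - a.shape[0]//2, ix - a.shape[1]//2

#after shifting dem2 accordingly, remove the vertical offset, e.g.:
#dz = numpy.ma.median(dem2_shifted - dem1)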
Thanks for all of your responses and I apologize for abandoning this question.
If you're just trying to fit a sine wave with a phase offset, you don't need a non-linear fit.
You can replace a * sin(x - b) + c by a * sin(x) + b * cos(x) + c, because any sine with a phase offset can be written as an appropriate combination of a sine and a cosine ("phasor addition", as in a Fourier transform).
If that gives the same result, then it is not the non-linear fit that is the problem.
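A minimal sketch of that linear formulation, assuming the frequency is known (1 here); amplitude and phase fall out of the fitted coefficients:
import numpy as np
x = np.linspace(0, 10, 200)
y = 2.0 * np.sin(x - 0.5) + 1.0                   # synthetic data with a phase offset
M = np.column_stack([np.sin(x), np.cos(x), np.ones_like(x)])
a, b, c = np.linalg.lstsq(M, y, rcond=None)[0]
amp, phase = np.hypot(a, b), np.arctan2(-b, a)    # recovers y = amp*sin(x - phase) + c
print(amp, phase, c)                              # ~2.0, ~0.5, ~1.0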

Fitting a line in 3D

Are there any algorithms that will return the equation of a straight line from a set of 3D data points? I can find plenty of sources which will give the equation of a line from 2D data sets, but none in 3D.
Thanks.
If you are trying to predict one value from the other two, then you should use lstsq with the a argument as your independent variables (plus a column of 1's to estimate an intercept) and b as your dependent variable.
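A short sketch of that first case, predicting z from x and y (the names are illustrative):
import numpy as np
x, y, z = np.random.rand(3, 100)               # stand-in data
A = np.column_stack([x, y, np.ones_like(x)])   # independent variables plus intercept column
a, b, c = np.linalg.lstsq(A, z, rcond=None)[0] # z ~ a*x + b*y + c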
If, on the other hand, you just want to get the best fitting line to the data, i.e. the line which, if you projected the data onto it, would minimize the squared distance between the real point and its projection, then what you want is the first principal component.
One way to define it is the line whose direction vector is the eigenvector of the covariance matrix corresponding to the largest eigenvalue, that passes through the mean of your data. That said, eig(cov(data)) is a really bad way to calculate it, since it does a lot of needless computation and copying and is potentially less accurate than using svd. See below:
import numpy as np
# Generate some data that lies along a line
x = np.mgrid[-2:5:120j]
y = np.mgrid[1:9:120j]
z = np.mgrid[-5:3:120j]
data = np.concatenate((x[:, np.newaxis],
                       y[:, np.newaxis],
                       z[:, np.newaxis]),
                      axis=1)
# Perturb with some Gaussian noise
data += np.random.normal(size=data.shape) * 0.4
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# I use -7, 7 since the spread of the data is roughly 14
# and we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-7:7:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
Here's what it looks like:
If your data is fairly well behaved then it should be sufficient to find the least squares sum of the component distances. Then you can find the linear regression with z independent of x, and then again with z independent of y.
Following the documentation example:
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z, rcond=None)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z, rcond=None)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x, y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
If you want to minimize the actual orthogonal distances from the line to the points in 3-space (known as orthogonal distance or total least squares regression, rather than ordinary linear regression), then I would build a function that computes the RSS and use a scipy.optimize minimization function to solve it.
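A sketch of that approach: parameterize the line by a point p and a direction d, and minimize the summed squared orthogonal distances. It should agree with the SVD answer above, which solves the same problem in closed form.
import numpy as np
from scipy.optimize import minimize

def rss(params, data):
    p, d = params[:3], params[3:]
    d = d / np.linalg.norm(d)             # unit direction vector
    diff = data - p                       # vectors from the line point to the data points
    ortho = diff - np.outer(diff @ d, d)  # components orthogonal to the line
    return (ortho**2).sum()

data = np.random.rand(100, 3)             # stand-in data
x0 = np.concatenate([data.mean(axis=0), [1.0, 0.0, 0.0]])
res = minimize(rss, x0, args=(data,))
point, direction = res.x[:3], res.x[3:] / np.linalg.norm(res.x[3:])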
