2D PCA line fitting with numpy - python

I'm trying to implement a 2D PCA with numpy.
The code is rather simple:
import numpy as np
n=10
d=10
x=np.linspace(0,10,n)
y=x*d
covmat = np.cov([x,y])
print(covmat)
eig_values, eig_vecs = np.linalg.eig(covmat)
largest_index = np.argmax(eig_values)
largest_eig_vec = eig_vecs[largest_index]
The covariance matrix is:
[[  11.31687243  113.16872428]
 [ 113.16872428 1131.6872428 ]]
Then I've got a simple helper method to plot a line (as a series of points) around a given center, in a given direction.
This is meant to be used by pyplot, therefore I'm preparing separate lists for the x and y coordinate.
def plot_line(center, dir, num_steps, step_size):
    line_x = []
    line_y = []
    for i in range(num_steps):
        dist_from_center = step_size * (i - num_steps / 2)
        point_on_line = center + dist_from_center * dir
        line_x.append(point_on_line[0])
        line_y.append(point_on_line[1])
    return (line_x, line_y)
And finally the plot setup:
lines = []
mean_point=np.array([np.mean(x),np.mean(y)])
lines.append(plot_line(mean_point, largest_eig_vec, 200, 0.5))
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x, y, c="b", marker=".", s=10)
for line in lines:
    ax.plot(line[0], line[1], c="r")
ax.scatter(mean_point[0], mean_point[1], c="y", marker="o", s=20)
plt.axes().set_aspect('equal', 'datalim')
plt.show()
Unfortunately, the PCA doesn't seem to work.
Here's the plot:
I'm afraid I've got no idea what went wrong.
I've computed the covariance manually -> same result.
I've checked the other eigenvalue -> perpendicular to the red line.
I've tested plot_line with the direction (1,10). It's perfectly aligned to my points:
The final plot shows that the line fitted by PCA is the correct result, except that it is mirrored about the y axis.
In fact, if I change the x coordinate of the eigenvector, the line is fitted perfectly:
Apparently this is a fundamental problem. Somehow I've misunderstood how to use PCA.
Where is my mistake?
Online resources seem to describe PCA exactly as I implemented it.
I don't believe I have to categorically mirror my line-fits at the y-axis. It's got to be something else.

Your mistake is that you're extracting the last row of the eigenvector array. But the eigenvectors form the columns of the eigenvector array returned by np.linalg.eig, not the rows. From the documentation:
[...] the arrays a, w, and v satisfy the equations dot(a[:,:], v[:,i]) = w[i] * v[:,i] [for each i]
where a is the array that np.linalg.eig was applied to, w is the 1d array of eigenvalues, and v is the 2d array of eigenvectors. So the columns v[:, i] are the eigenvectors.
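As a quick check, reusing covmat, eig_values, and eig_vecs from the question, each column of eig_vecs satisfies the eigenvalue equation quoted above:
for i in range(len(eig_values)):
    # dot(covmat, v[:, i]) == w[i] * v[:, i] holds for the columns
    print(np.allclose(np.dot(covmat, eig_vecs[:, i]), eig_values[i] * eig_vecs[:, i]))  # prints True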
In this simple two-dimensional case, since the two eigenvectors are mutually orthogonal (because we're starting with a symmetric matrix) and unit length (because np.linalg.eig normalises them that way), the eigenvector array has one of the two forms
[[ cos(t) sin(t)]
[-sin(t) cos(t)]]
or
[[ cos(t) sin(t)]
[ sin(t) -cos(t)]]
for some real number t, and in the first case, reading the first row (for example) instead of the first column would give [cos(t), sin(t)] in place of [cos(t), -sin(t)]. This explains the apparent reflection that you're seeing.
Replace the line
largest_eig_vec = eig_vecs[largest_index]
with
largest_eig_vec = eig_vecs[:, largest_index]
and you should get the expected results.
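As a final sanity check (assuming the data from the question, where y = 10 * x), the corrected eigenvector should be parallel to (1, 10), up to sign:
direction = np.array([1.0, 10.0])
direction /= np.linalg.norm(direction)
# |cosine| of the angle between the two unit vectors should be 1
print(np.isclose(abs(np.dot(largest_eig_vec, direction)), 1.0))  # expect True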

Related

Manually recover the original function from numpy rfft

I have performed a numpy.fft.rfft on a function to obtain the Fourier coefficients. Since the docs do not seem to contain the exact formula used, I have been assuming a formula found in a textbook of mine:
S(x) = a_0/2 + SUM(real(a_n) * cos(nx) + imag(a_n) * sin(nx))
where imag(a_n) is the imaginary part of the n-th element of the Fourier coefficients.
To translate this into python-speak, I have implemented the following:
def fourier(freqs, X):
    # input the fourier frequencies from np.fft.rfft, and arbitrary X
    const_term = np.repeat(np.real(freqs[0])/2, X.shape[0]).reshape(-1,1)
    # this is the "n" part of the inside of the trig terms
    trig_terms = np.tile(np.arange(1,len(freqs)), (X.shape[0],1))
    sin_terms = np.imag(freqs[1:])*np.sin(np.einsum('i,ij->ij', X, trig_terms))
    cos_terms = np.real(freqs[1:])*np.cos(np.einsum('i,ij->ij', X, trig_terms))
    return np.concatenate((const_term, sin_terms, cos_terms), axis=1)
This should give me an [X.shape[0], 2*freqs.shape[0] - 1] array, containing at entry i,j the i-th element of X evaluated at the j-th term of the Fourier decomposition (where the j-th term is a sin term for odd j).
By summing this array over the axis of Fourier terms, I should obtain the function evaluated at the i-th element of X:
import numpy as np
import matplotlib.pyplot as plt
X = np.linspace(-1,1,50)
y = X*(X-0.8)*(X+1)
reconstructed_y = np.sum(
    fourier(
        np.fft.rfft(y),
        X
    ),
    axis=1
)
plt.plot(X,y)
plt.plot(X, reconstructed_y, c='r')
plt.show()
In any case, the red line should be basically on top of the blue line. Something has gone wrong either in my assumptions about what numpy.fft.rfft returns, or in my specific implementation, but I am having a hard time tracking down the bug. Can anyone shed some light on what I've done wrong here?
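For what it's worth, a minimal cross-check on the coefficients themselves (independent of my manual sum) is to invert the transform with np.fft.irfft, which should reproduce y exactly:
# inverse rfft should give back the original samples
print(np.allclose(np.fft.irfft(np.fft.rfft(y), n=len(y)), y))  # True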

Least-square fitting for points in 2d doesn't pass through symmetrical axis

I'm trying to draw the best fitting line for given (x,y) data points.
Here are the data points (red pixels) and the estimated line (green), which I obtained using the following library call:
import numpy as np
m, c = np.linalg.lstsq(A, y)[0]
Documentation for the library module used
We can see the data points are roughly symmetrically distributed. The problem is: why doesn't this line have a gradient similar to the long symmetric axis through the data points? Can you please explain whether this result is correct? If so, how does it give the minimum error? (The line is drawn correctly using the gradient returned by the lstsq method.) Thank you.
EDIT
Here is the code I'm trying. The input image can be downloaded from here. In this code I've not forced the line to pass through the center of the pixel distribution. (Note: here I've used polyfit instead of lstsq. Both give the same results.)
import numpy as np
import cv2
import math
img = cv2.imread('points.jpg',1);
h, w = img.shape[:2]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
points = np.argwhere(gray>10) # get (x,y) pairs where red pixels exist
y = points[:,0]
x = points[:,1]
m, c = np.polyfit(x, y, 1) # calculate least square fit line
# calculate two cordinates (x1,y1),(x2,y2) on the line
angle = np.arctan(m)
x1, y1, length = 0, int(c), 500
x2 = int(round(math.ceil(x1 + length * np.cos(angle)),0))
y2 = int(round(math.ceil(y1 + length * np.sin(angle)),0))
# draw line on the color image
cv2.line(img, (x1, y1), (x2, y2), (0,255,0), 1, cv2.LINE_8)
# show output the image
cv2.namedWindow("Display window", cv2.WINDOW_AUTOSIZE);
cv2.imshow("Display window", img);
cv2.waitKey(0);
cv2.destroyAllWindows()
How can I have the line pass through the longest symmetric axis of the pixel distribution? Can I use principal component analysis?
It's hard to say why this would be the case. The bottom line is that I can't see the data you're using, and I can't see what the calculated slope and y intercept are for the data you're using.
Here are a couple of things that could explain what we're seeing:
(1) The density of data points is actually quite different than it appears to a casual glance and everything is working properly.
(2) You're sending the wrong arguments to the least squares function and you've got a GIGO situation. (I haven't used numpy's least squares algorithm, so I can't check this.)
(3) The scatter plot and the line plot don't agree on the scale of the axes.
(4) The least squares function in question is broken.
(5) You're not passing the same data to the least squares algorithm as you're passing to the plotting routine.
(6) The data formatting is funky so that the scatter plot and least squares routines are interpreting your data differently.
I can't know which of these is the problem, and unless it's (3), I expect we'd need more data to be able to distinguish between these possibilities.
Here's how I'd proceed if I were you: (1) Create a small artificial data set that sits on a line and pass it to the least squares function and see if it spits out the right numbers. See if these look right when plotted or not. (2) If this looks okay, record the output of the least squares algorithm, see if you can find another least squares program to calculate the slope and y intercept and compare them. If they're the same, it's probably not the routine, it's probably something to do with plotting.
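A minimal sketch of step (1), using made-up numbers that sit exactly on y = 2*x + 5, might look like this:
import numpy as np
x = np.arange(10, dtype=float)
y = 2.0 * x + 5.0                      # points exactly on a known line
A = np.vstack([x, np.ones(len(x))]).T  # design matrix with an intercept column
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(m, c)                            # should be very close to 2.0 and 5.0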
If you get this far and it's still a mystery, let us know what you've found and maybe we can make another suggestion.
Good luck.
If the red dots truly represent your data, you are probably applying your linear regression function in a way that forces the line through the origin. How do I know? When using linear regression on two variables x and y, the line will pass through a few specific points: for example, the average of x and the average of y, and, depending on your specification, a calculated or specified intercept on the y axis. If all values of x and y are positive, you will get a line that looks like yours when the line is forced through the origin. Not much more can be said before you provide some reproducible data and code.
EDIT:
I didn't have much luck with the reproducible sample provided, so I built an example with random numbers to elaborate on my original answer. I think statsmodels is a decent library for linear regression analysis. First, I'll address this earlier comment:
If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin.
You'll see an increasing effect of this the larger your numbers are (the further away from the origin your numbers are). Using sm.OLS(y,sm.add_constant(x)).fit() and sm.OLS(y,x).fit() for two different sets of numbers will show you exactly what I mean. First, I'll run a regression on the dataset below without an estimated constant (the line goes through the origin). This will give us a plot that resembles your original plot:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
regLine_origin = x*results1.params[0]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.patch.set_facecolor('black')
plt.show()
Next, I'll include a constant in the regression. Now, the yellow line will represent what I think you were after in your question:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
results2 = sm.OLS(y,sm.add_constant(x)).fit()
regLine_origin = x*results1.params[0]
regLine_constant = results2.params[0] + x*results2.params[1]
# Plot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.scatter(x, regLine_constant, c = 'yellow', s = 1)
ax.patch.set_facecolor('black')
plt.show()
And lastly, we can take a look at what happens when the numbers are closer to the origin, so to speak. Here, I'll remove the +100 part when the numbers are generated:
# The following is changed in the snippet above:
# Data
x = np.random.normal(size=2500)
y = x * 2 + np.random.normal(size=2500)
And that's why I think your original regression line is set to go through the origin. Have a look at the statsmodels package. Here you can study the details of the estimate by running print(results2.summary()):
And as you've already seen in the snippets above, you'll have direct access to the regression coefficients by using results2.params.
Edit2: My explanation still isn't 100% valid. The x and y values will have to differ a bit in size to see this effect. You'll certainly find situations where the line goes through the origin no matter the size of the numbers.
Have a look at the different x labels, and you'll see what I mean.

numpy polyfit yields nonsense

I am trying to fit these values:
This is my code:
for i in range(-area,area):
    stDev1 = []
    for j in range(-area,area):
        stDev0 = stDev[i+i0][j+j0]
        stDev1.append(stDev0)
    slices[i] = stDev1
fitV = []
xV = []
for l in range(-area,area):
    y = np.asarray(slices[l])
    x = np.arange(0,2*area,1)
    for m in range(-area,area):
        fitV.append(slices[m][l])
        xV.append(l)
fit = np.polyfit(xV,fitV,4)
yfit = function(fit,area)
x100 = np.arange(0,100,1)
plt.plot(xV,fitV,'.')
plt.savefig("fits1.png")
def function(fit,area):
    yfit = []
    for x in range(-area,area):
        yfit.append(fit[0]+fit[1]*x+fit[2]*x**2+fit[3]*x**3+fit[4]*x**4)
    return(yfit)
i0 = 400
j0 = 400
area = 50
stdev = 2d np.array([1300][800]) #just an image of "noise" feel free to add any image // 2d np array you like.
This yields:
Obviously this is completely wrong?
I assume I misunderstand the concept of polyfit? From the docs, the requirement is that I feed it two arrays of shape x[i], y[i]? My values in
xV = [ x_1_-50,x_1_-49,...,x_1_49,x_2_-50,...,x_49_49]
and my ys are:
fitV = [y_1_-50,y_1_-49,...,y_1_49,...y_2_-50,...,y_2_49]
I do not completely understand your program. In the future, it would be helpful if you were to distill your issue to an MCVE. But here are some thoughts:
It seems, in your data, that for a given value of x there are multiple values of y. Given (x,y) data, polyfit returns an array of coefficients that represents a polynomial function, but no function can map a single value of x onto multiple values of y. As a first step, consider collapsing each set of y values into a single representative value using, for example, the mean, median, or mode. Or perhaps, in your domain, there's a more natural way to do this.
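A minimal sketch of that collapsing step, using hypothetical xV and fitV arrays with repeated x values and the mean as the representative value, could look like this:
import numpy as np
xV = np.array([0, 0, 1, 1, 1, 2, 2])
fitV = np.array([1.0, 1.2, 2.1, 1.9, 2.0, 3.2, 2.8])
# one representative y (here: the mean) per unique x value
unique_x = np.unique(xV)
mean_y = np.array([fitV[xV == xu].mean() for xu in unique_x])
fit = np.polyfit(unique_x, mean_y, 2)  # fit the collapsed data instead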
Second, there is an idiomatic way to use the pair of functions np.polyfit and np.polyval, and you're not using them in the standard way. Of course, numerous useful departures from this pattern exist, but first make sure you understand the basic pattern of these two functions.
a. Given your measurements y_data, taken at times or locations x_data, plot them and make a guess as to the order of the fit. That is, does it look like a line? Like a parabola? Let's assume you believe your data to be parabolic, and that you'll use a second order polynomial fit.
b. Make sure that your arrays are sorted in order of increasing x. There are many ways to do this, but np.argsort is an easy one.
c. Run polyfit: p = polyfit(x_data,y_data,2), which returns an array containing the 2nd, 1st, and 0th order coefficients in p, (c2,c1,c0).
d. In the idiomatic use of polyfit and polyval, next you would generate your fit: polyval(p,x_data). Or perhaps you want the fit to be sampled more coarsely or finely, in which case you might take a subset of x_data or interpolate more values in x_data.
A complete example is below.
import numpy as np
from matplotlib import pyplot as plt
# these are your measurements, unsorted
x_data = np.array([18, 6, 9, 12 , 3, 0, 15])
y_data = np.array([583.26347805, 63.16059915, 100.94286909, 183.72581827, 62.24497418,
134.99558191, 368.78421529])
# first, sort both vectors in increasing-x order:
sorted_indices = np.argsort(x_data)
x_data = x_data[sorted_indices]
y_data = y_data[sorted_indices]
# now, plot and observe the parabolic shape:
plt.plot(x_data,y_data,'ks')
plt.show()
# generate the 2nd order fitting polynomial:
p = np.polyfit(x_data,y_data,2)
# make a more finely sampled x_fit vector with, for example
# 1024 equally spaced points between the first and last
# values of x_data
x_fit = np.linspace(x_data[0],x_data[-1],1024)
# now, compute the fit using your polynomial:
y_fit = np.polyval(p,x_fit)
# and plot them together:
plt.plot(x_data,y_data,'ks')
plt.plot(x_fit,y_fit,'b--')
plt.show()
Hope that helps.

Plot cross section through heat map

I have an array of shape (201,201). I would like to plot some cross sections through the data, but I am having trouble accessing the relevant points. For example, say I want to plot the cross section given by the line in the figure produced by:
from pylab import *
Z = randn(201,201)
x = linspace(-1,1,201)
X,Y = meshgrid(x,x)
pcolormesh(X,Y,Z)
plot(x,x*.5)
I'd like to plot these at various orientations but they will always pass through the origin if that helps...
Basically, you want to interpolate a 2D grid along a line (or an arbitrary path).
First off, you should decide if you want to interpolate the grid or just do nearest-neighbor sampling. If you'd like to do the latter, you can just use indexing.
If you'd like to interpolate, have a look at scipy.ndimage.map_coordinates. It's a bit hard to wrap your head around at first, but it's perfect for this. (It's much, much more efficient than using an interpolation routine that assumes that the data points are randomly distributed.)
I'll give an example of both. These are adapted from an answer I gave to another question. However, in those examples, everything is plotted in "pixel" (i.e. row, column) coordinates.
In your case, you're working in a different coordinate system than the "pixel" coordinates, so you'll need to convert from "world" (i.e. x, y) coordinates to "pixel" coordinates for the interpolation.
First off, here's an example of using cubic interpolation with map_coordinates:
import numpy as np
import scipy.ndimage
import matplotlib.pyplot as plt
# Generate some data...
x, y = np.mgrid[-5:5:0.1, -5:5:0.1]
z = np.sqrt(x**2 + y**2) + np.sin(x**2 + y**2)
# Coordinates of the line we'd like to sample along
line = [(-3, -1), (4, 3)]
# Convert the line to pixel/index coordinates
x_world, y_world = np.array(list(zip(*line)))  # list() so this also works under Python 3
col = z.shape[1] * (x_world - x.min()) / x.ptp()
row = z.shape[0] * (y_world - y.min()) / y.ptp()
# Interpolate the line at "num" points...
num = 1000
row, col = [np.linspace(item[0], item[1], num) for item in [row, col]]
# Extract the values along the line, using cubic interpolation
zi = scipy.ndimage.map_coordinates(z, np.vstack((row, col)))
# Plot...
fig, axes = plt.subplots(nrows=2)
axes[0].pcolormesh(x, y, z)
axes[0].plot(x_world, y_world, 'ro-')
axes[0].axis('image')
axes[1].plot(zi)
plt.show()
Alternately, we could use nearest-neighbor interpolation. One way to do this would be to pass order=0 to map_coordinates in the example above. Instead, I'll use indexing just to show another approach. If we just change the line
# Extract the values along the line, using cubic interpolation
zi = scipy.ndimage.map_coordinates(z, np.vstack((row, col)))
To:
# Extract the values along the line, using nearest-neighbor interpolation
zi = z[row.astype(int), col.astype(int)]
We'll get:

Fitting a line in 3D

Are there any algorithms that will return the equation of a straight line from a set of 3D data points? I can find plenty of sources which will give the equation of a line from 2D data sets, but none in 3D.
Thanks.
If you are trying to predict one value from the other two, then you should use lstsq with the a argument as your independent variables (plus a column of 1's to estimate an intercept) and b as your dependent variable.
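A minimal sketch of that first approach, assuming pts is an (N, 3) array of (x, y, z) samples and z is the value being predicted, might be:
import numpy as np
pts = np.add.accumulate(np.random.random((10, 3)))              # example data
A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])  # columns: x, y, 1
(a, b, c), *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)       # fits z = a*x + b*y + c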
If, on the other hand, you just want to get the best fitting line to the data, i.e. the line which, if you projected the data onto it, would minimize the squared distance between the real point and its projection, then what you want is the first principal component.
One way to define it is the line whose direction vector is the eigenvector of the covariance matrix corresponding to the largest eigenvalue, that passes through the mean of your data. That said, eig(cov(data)) is a really bad way to calculate it, since it does a lot of needless computation and copying and is potentially less accurate than using svd. See below:
import numpy as np
# Generate some data that lies along a line
x = np.mgrid[-2:5:120j]
y = np.mgrid[1:9:120j]
z = np.mgrid[-5:3:120j]
data = np.concatenate((x[:, np.newaxis],
                       y[:, np.newaxis],
                       z[:, np.newaxis]),
                      axis=1)
# Perturb with some Gaussian noise
data += np.random.normal(size=data.shape) * 0.4
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# I use -7, 7 since the spread of the data is roughly 14
# and we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-7:7:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
Here's what it looks like:
If your data is fairly well behaved then it should be sufficient to find the least squares sum of the component distances. Then you can find the linear regression of z on x, and then again of z on y.
Following the documentation example:
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x,y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
If you want to minimize the actual orthogonal distances from the line (orthogonal to the line) to the points in 3-space (which I'm not sure is even referred to as linear regression), then I would build a function that computes the RSS and use a scipy.optimize minimization function to solve it.
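A rough sketch of that idea, assuming data is an (N, 3) array like the one built in the SVD example above, could use scipy.optimize.minimize on the sum of squared orthogonal distances:
import numpy as np
from scipy.optimize import minimize

def orthogonal_rss(params, pts):
    # params: a point on the line (3 values) followed by a direction (3 values)
    p, d = params[:3], params[3:]
    d = d / np.linalg.norm(d)            # keep the direction a unit vector
    diffs = pts - p                      # vectors from the line point to each data point
    along = np.outer(diffs @ d, d)       # components parallel to the line
    return np.sum((diffs - along) ** 2)  # sum of squared orthogonal distances

data = np.random.normal(size=(100, 3)) * [5.0, 1.0, 1.0]   # assumed example data
x0 = np.concatenate([data.mean(axis=0), [1.0, 0.0, 0.0]])  # start at the mean, rough direction
res = minimize(orthogonal_rss, x0, args=(data,))
point, direction = res.x[:3], res.x[3:] / np.linalg.norm(res.x[3:])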
