Matrix Approximation and Predicting Timeseries in Python/R with SVD - python

I have an Excel file with 126 rows and 5 columns of numbers. I have to use that data and SVD methods to predict 5-10 more rows of data. I have implemented SVD in Python successfully using numpy:
import numpy as np
from numpy import genfromtxt
my_data = genfromtxt('data.csv', delimiter=',')
U, s, V = np.linalg.svd(my_data)
print ("U:")
print (U)
print ("\nSigma:")
print (s)
print ("\nVT:")
print (V)
which outputs:
U:
[[-0.03339497 0.10018171 0.01013636 ..., -0.10076323 -0.09740801
-0.08901366]
[-0.02881809 0.0992715 -0.01239945 ..., -0.02920558 -0.04133748
-0.06100236]
[-0.02501102 0.10637736 -0.0528663 ..., -0.0885227 -0.05408083
-0.01678337]
...,
[-0.02418483 0.10993637 0.05200962 ..., 0.9734676 -0.01866914
-0.00870467]
[-0.02944344 0.10238372 0.02009676 ..., -0.01948701 0.98455034
-0.00975614]
[-0.03109401 0.0973963 -0.0279125 ..., -0.01072974 -0.0109425
0.98929811]]
Sigma:
[ 252943.48015512 74965.29844851 15170.76769244 4357.38062076
3934.63212778]
VT:
[[-0.16143572 -0.22105626 -0.93558846 -0.14545156 -0.16908786]
[ 0.5073101 0.40240734 -0.34460639 0.45443181 0.50541365]
[-0.11561044 0.87141558 -0.07426656 -0.26914744 -0.38641073]
[ 0.63320943 -0.09361249 0.00794671 -0.75788695 0.12580436]
[-0.54977724 0.14516905 -0.01849291 -0.35426346 0.74217676]]
But I am not sure how to use this data to predict my values. I am using this link http://datascientistinsights.com/2013/02/17/single-value-decomposition-a-golfers-tutotial/ as a reference, but that is in R. At the end they predict values with this R command:
approxGolf_1 <- golfSVD$u[,1] %*% t(golfSVD$v[,1]) * golfSVD$d[1]
Here is the IdeOne link to the entire R code: http://ideone.com/Yj3y6j
I'm not really familiar with R so can anyone let me know if there is a similar function in Python to the command above or explain what that command is doing exactly?
Thanks.

I will use the golf course example data you linked, to set the stage:
import numpy as np
A=np.matrix((4,4,3,4,4,3,4,2,5,4,5,3,5,4,5,4,4,5,5,5,2,4,4,4,3,4,5))
A=A.reshape((3,9)).T
This gives you the original 9 rows, 3 columns table with scores of 9 holes for 3 players:
matrix([[4, 4, 5],
[4, 5, 5],
[3, 3, 2],
[4, 5, 4],
[4, 4, 4],
[3, 5, 4],
[4, 4, 3],
[2, 4, 4],
[5, 5, 5]])
Now the singular value decomposition:
U, s, V = np.linalg.svd(A)
The most important thing to investigate is the vector s of singular values:
array([ 21.11673273, 2.0140035 , 1.423864 ])
It shows that the first value is much bigger than the others, indicating that the corresponding truncated SVD with only one singular value represents the original matrix A quite well. To calculate this representation, you take the first column of U, multiplied by the first row of V, multiplied by the first singular value. Note that np.linalg.svd returns the transposed matrix as its third output, so the first row of V here corresponds to the first column of v in R. This is what the last cited command in R does. Here is the same in Python:
U[:,0]*s[0]*V[0,:]
And here is the result of this product:
matrix([[ 3.95411864, 4.64939923, 4.34718814],
[ 4.28153222, 5.03438425, 4.70714912],
[ 2.42985854, 2.85711772, 2.67140498],
[ 3.97540054, 4.67442327, 4.37058562],
[ 3.64798696, 4.28943826, 4.01062464],
[ 3.69694905, 4.3470097 , 4.06445393],
[ 3.34185528, 3.92947728, 3.67406114],
[ 3.09108399, 3.63461111, 3.39836128],
[ 4.5599837 , 5.36179782, 5.0132808 ]])
Concerning the vector factors U[:,0] and V[0,:]: Figuratively speaking, U can be seen as a representation of a hole's difficulty, while V encodes a player's strength.
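For readers who work with plain NumPy arrays rather than np.matrix, the same rank-1 product can be written with np.outer. The following is only a minimal sketch and not part of the original answer: the helper name rank_k_approx is hypothetical, and the data is the golf table shown above.

```python
import numpy as np

# Golf scores from the example above: 9 holes (rows) x 3 players (columns).
A = np.array([[4, 4, 5], [4, 5, 5], [3, 3, 2],
              [4, 5, 4], [4, 4, 4], [3, 5, 4],
              [4, 4, 3], [2, 4, 4], [5, 5, 5]], dtype=float)

# full_matrices=False returns the compact factors: U is 9x3, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def rank_k_approx(U, s, Vt, k):
    """Hypothetical helper: sum of the first k terms s[i] * u_i * v_i^T."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Rank-1 approximation, the ndarray counterpart of U[:,0]*s[0]*V[0,:].
approx_1 = s[0] * np.outer(U[:, 0], Vt[0, :])

# Frobenius-norm error of the rank-1 approximation.
error_1 = np.linalg.norm(A - approx_1)
print(approx_1)
print(error_1)
```

The error of a truncated SVD in the Frobenius norm equals the square root of the sum of the squared singular values that were dropped, which gives a quick way to judge how many terms are worth keeping.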

Sort an array of multi D points by distance to a reference point

I have a reference point p_ref stored in a numpy array with a shape of (1024,), something like:
print(p_ref)
>>> array([ p1, p2, p3, ..., p_n])
I also have a numpy array A_points with a shape of (1024,5000) containing 5000 points, each having 1024 dimensions like p_ref. My problem: I would like to sort the points in A_points by their (Euclidean) distance to p_ref!
How can I do this? I read about scipy.spatial.distance.cdist and scipy.spatial.KDTree, but they both weren't doing exactly what I wanted and when I tried to combine them I made a mess. Thanks!
For reference and consistency let's assume:
p_ref = np.array([0,1,2,3])
A_points = np.reshape(np.array([10,3,2,13,4,5,16,3,8,19,4,11]), (4,3))
Expected output:
array([[ 3, 2, 10],
[ 4, 5, 13],
[ 3, 8, 16],
[ 4, 11, 19]])
EDIT: Updated on suggestions by the OP.
I hope I understand you correctly, but you can calculate the distance between two vectors by using numpy.linalg.norm. Using this it should be as simple as:
A_sorted = sorted( A_points.T, key = lambda x: np.linalg.norm(x - p_ref ) )
A_sorted = np.array(A_sorted).T  # back to one point per column, for any shape
You can do something like this -
A_points[:,np.linalg.norm(A_points-p_ref[:,None],axis=0).argsort()]
Another with np.einsum that should be more efficient than np.linalg.norm -
d = A_points-p_ref[:,None]
out = A_points[:,np.einsum('ij,ij->j',d,d).argsort()]
Further optimize to leverage fast matrix-multiplication to replace last step -
A_points[:,((A_points**2).sum(0)+(p_ref**2).sum()-2*p_ref.dot(A_points)).argsort()]
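As a self-contained check, the einsum variant can be run on the small sample data from the question. This is a sketch under the question's layout, where each column of A_points is one point; the names sq_dist and order are introduced here for illustration only.

```python
import numpy as np

p_ref = np.array([0, 1, 2, 3])
A_points = np.reshape(np.array([10, 3, 2, 13, 4, 5, 16, 3, 8, 19, 4, 11]), (4, 3))

# Each column of A_points is one point; broadcast p_ref down the rows.
d = A_points - p_ref[:, None]

# Squared distances are enough for ordering, so the square root is skipped.
sq_dist = np.einsum('ij,ij->j', d, d)
order = np.argsort(sq_dist)

# Reorder the columns (points) from nearest to farthest.
A_sorted = A_points[:, order]
print(A_sorted)
```

On this sample the result matches the expected output given in the question. The matrix-multiplication variant above computes the same squared distances, but it subtracts large terms from each other, so with floating-point data it may be somewhat less accurate for points that lie very close to p_ref.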

Why is my SVD calculation different than numpy's SVD calculation of this matrix?

I'm trying to manually compute the SVD of the matrix A defined below but I am having some problems. Computing it manually and with the svd method in numpy yields two different results.
Computed manually below:
import numpy as np
A = np.array([[3,2,2], [2,3,-2]])
V = np.linalg.eig(A.T @ A)[1]
U = np.linalg.eig(A @ A.T)[1]
S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
print(A)
print(U @ S @ V.T)
And computed via numpy's svd method:
X,Y,Z = np.linalg.svd(A)
Y = np.c_[np.diag(Y), [0,0]]
print(A)
print(X @ Y @ Z)
When these two snippets are run, the manual calculation doesn't match the result of the svd method. Why is there a discrepancy between these two calculations?
Look at the eigenvalues returned by np.linalg.eig(A.T @ A):
In [57]: evals, evecs = np.linalg.eig(A.T @ A)
In [58]: evals
Out[58]: array([2.50000000e+01, 3.61082692e-15, 9.00000000e+00])
So (ignoring the normal floating point imprecision), it computed [25, 0, 9]. The eigenvectors associated with those eigenvalues are in the columns of evecs, in the same order. But your construction of S doesn't match that order; here's your S:
In [60]: S
Out[60]:
array([[5., 0., 0.],
[0., 3., 0.]])
When you compute U @ S @ V.T, the values in S @ V.T are not correctly aligned.
As a quick fix, you can rerun your code with S set explicitly as follows:
S = np.array([[5, 0, 0],
[0, 0, 3]])
With that change, your code outputs
[[ 3 2 2]
[ 2 3 -2]]
[[-3. -2. -2.]
[-2. -3. 2.]]
That's better, but why are the signs wrong? Now the problem is that you have independently computed U and V. Eigenvectors are not unique; they are the basis of an eigenspace, and such a basis is not unique. If the eigenvalue is simple, and if the vector is normalized to have length one (which numpy.linalg.eig does), there is still a choice of the sign to be made. That is, if v is an eigenvector, then so is -v. The choices made by eig when computing U and V won't necessarily result in restoring the sign of A when U @ S @ V.T is computed.
It turns out that you can get the result that you expect by simply reversing all the signs in either U or V. Here is a modified version of your script that generates the output that you expected:
import numpy as np
A = np.array([[3, 2, 2],
[2, 3, -2]])
U = np.linalg.eig(A @ A.T)[1]
V = -np.linalg.eig(A.T @ A)[1]
#S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
S = np.array([[5, 0, 0],
[0, 0, 3]])
print(A)
print(U @ S @ V.T)
Output:
[[ 3 2 2]
[ 2 3 -2]]
[[ 3. 2. 2.]
[ 2. 3. -2.]]
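One way to sidestep both the ordering and the sign problem is to compute only one set of eigenvectors and derive the other set from A itself, so the two are consistent by construction. The following is a sketch rather than the answer's method: it swaps in np.linalg.eigh (intended for symmetric matrices) for eig, and it assumes every singular value that is kept is nonzero, which holds for this A.

```python
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)

# A @ A.T is symmetric, so eigh applies; it returns eigenvalues in ascending order.
evals, U = np.linalg.eigh(A @ A.T)

# Reorder to the usual SVD convention: largest singular value first.
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]
sigma = np.sqrt(evals)

# From A = U @ diag(sigma) @ Vt it follows that Vt = diag(1/sigma) @ U.T @ A,
# so the signs of the rows of Vt automatically match the columns of U.
Vt = (U.T @ A) / sigma[:, None]

A_rebuilt = U @ np.diag(sigma) @ Vt
print(sigma)
print(A_rebuilt)
```

This produces the compact form of the decomposition (Vt has two rows), which is enough to reconstruct A.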

*Update* Creating an array for distance between two 2-D arrays

So I have two arrays that have x, y, z coordinates. I'm just trying to apply the 3D distance formula. The problem is that I can't find a post that handles arrays with multiple values in each column and returns an array.
print MW_FirstsubPos1
[[ 51618.7265625 106197.7578125 69647.6484375 ]
[ 33864.1953125 11757.29882812 11849.90332031]
[ 12750.09863281 58954.91015625 38067.0859375 ]
...,
[ 99002.6640625 96021.0546875 18798.44726562]
[ 27180.83984375 74350.421875 78075.78125 ]
[ 19297.88476562 82161.140625 1204.53503418]]
print MW_SecondsubPos1
[[ 51850.9140625 106004.0078125 69536.5234375 ]
[ 33989.9375 11847.11425781 12255.80859375]
[ 12526.203125 58372.3046875 37641.34765625]
...,
[ 98823.2734375 95837.1796875 18758.7734375 ]
[ 27047.19140625 74242.859375 78166.703125 ]
[ 19353.97851562 82375.8515625 1147.07556152]]
Yes, they are the same shape.
My attempt,
import numpy as np
xs1,ys1,zs1 = zip(*MW_FirstsubPos1)
xs11,ys11,zs11 = zip(*MW_SecondsubPos1)
squared_dist1 = (xs11 - xs1)**2 + (ys11 - ys1)**2 + (zs11 - zs1)**2
dist1 = np.sqrt(squared_dist1)
print dist1
This returns:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
I'm just wanting to return a 1-D array of the same shape.
* --------------------- Update --------------------- *
Using what Sнаđошƒаӽ said,
Distance1 = []
for Fir1, Sec1 in zip(MW_FirstsubVel1, MW_SecondsubPos1):
    dist1 = 0
    for i in range(3):
        dist1 += (Fir1[i]-Sec1[i])**2
    Distance1.append(dist1**0.5)
But when comparing the distance formula for one element in my original post such as,
squared_dist1 = (xs11[0] - xs1[0])**2 + (ys11[0] - ys1[0])**2 + (zs11[0] - zs1[0])**2
dist1 = np.sqrt(squared_dist1)
print dist1
returns 322.178309762
while
result = []
for a, b in zip(MW_FirstsubVel1, MW_SecondsubPos1):
    dist = 0
    for i in range(3):
        dist += (a[i]-b[i])**2
    result.append(dist**0.5)
print result[0]
returns 137163.203004
What's wrong here?
Your solutions look good to me.
A better idea is to use the linear algebra module in the scipy package, as it scales to data with more dimensions. Here is my code:
import scipy.linalg as LA
dist1 = LA.norm(MW_FirstsubPos1 - MW_SecondsubPos1, axis=1)
See if this works, assuming that aaa and bbb are normal Python lists of lists holding the x, y and z coordinates (or that you can convert your data to such, perhaps using tolist). result will hold the 1-D list of distances you are looking for.
Edit: aaa and bbb are Python lists of lists. Only the code for printing the output has been added.
aaa = [[51618.7265625, 106197.7578125, 69647.6484375],
[33864.1953125, 11757.29882812, 11849.90332031],
[12750.09863281, 58954.91015625, 38067.0859375],
[99002.6640625, 96021.0546875, 18798.44726562],
[27180.83984375, 74350.421875, 78075.78125],
[19297.88476562, 82161.140625, 1204.53503418]]
bbb = [[51850.9140625, 106004.0078125, 69536.5234375],
[33989.9375, 11847.11425781, 12255.80859375],
[12526.203125, 58372.3046875, 37641.34765625],
[98823.2734375, 95837.1796875, 18758.7734375],
[27047.19140625, 74242.859375, 78166.703125],
[19353.97851562, 82375.8515625, 1147.07556152]]
result = []
for a, b in zip(aaa, bbb):
    dist = 0
    for i in range(3):
        dist += (a[i]-b[i])**2
    result.append(dist**0.5)
for elem in result:
    print(elem)
Output:
322.178309762234
434.32361222259755
755.5206249710258
259.9327309143388
194.16071591842936
229.23543894772612
Here's a vectorized approach using np.einsum -
diffs = MW_FirstsubPos1 - MW_SecondsubPos1
dists = np.sqrt(np.einsum('ij,ij->i',diffs,diffs))
Sample run -
In [233]: MW_FirstsubPos1
Out[233]:
array([[2, 0, 0],
[8, 6, 1],
[0, 2, 8],
[7, 6, 3],
[3, 1, 7]])
In [234]: MW_SecondsubPos1
Out[234]:
array([[3, 4, 7],
[0, 8, 4],
[4, 7, 4],
[2, 5, 6],
[5, 0, 6]])
In [235]: diffs = MW_FirstsubPos1 - MW_SecondsubPos1
In [236]: np.sqrt(np.einsum('ij,ij->i',diffs,diffs))
Out[236]: array([ 8.1240384 , 8.77496439, 7.54983444, 5.91607978, 2.44948974])
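The TypeError in the question comes from zip(*array), which produces plain tuples, and tuples do not support subtraction. A minimal sketch of two ways around that follows, using the small sample arrays from the run above; the names first and second are stand-ins for MW_FirstsubPos1 and MW_SecondsubPos1. One thing that may be worth checking separately: the loops in the question's update zip MW_FirstsubVel1 with MW_SecondsubPos1, and mixing a velocity array with a position array could explain the very different value reported there.

```python
import numpy as np

# Stand-ins for MW_FirstsubPos1 and MW_SecondsubPos1 (values from the sample run).
first = np.array([[2, 0, 0], [8, 6, 1], [0, 2, 8], [7, 6, 3], [3, 1, 7]])
second = np.array([[3, 4, 7], [0, 8, 4], [4, 7, 4], [2, 5, 6], [5, 0, 6]])

# Unpacking the transpose yields ndarray columns (not tuples), so subtraction works.
xs1, ys1, zs1 = first.T
xs2, ys2, zs2 = second.T
dist_columns = np.sqrt((xs2 - xs1)**2 + (ys2 - ys1)**2 + (zs2 - zs1)**2)

# Equivalent one-liner: Euclidean norm of each row of the difference.
dist_norm = np.linalg.norm(first - second, axis=1)
print(dist_norm)
```

Both results are 1-D arrays with one distance per row, which is the shape the question asks for.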

Interpolating a 2D data grid in python

I have a 2D grid with radioactive beta-decay rates. Each value corresponds to a rate at a specific pair of temperature and density (both on a logarithmic scale). What I would like to do, when I have a temperature and density data pair (after taking their logarithms), is to find the matching value in the table. I tried using the scipy interpolate interpn function, but I got a little confused, so I would be grateful for the help.
What I have so far:
pointsx = np.array([7+0.2*i for i in range(0,16)]) #temperature range
pointsy = np.array([i for i in range(0,11) ]) #rho_el range
data = np.loadtxt(filename) #getting data from file
logT = np.log10(T) #wanted temperature logarithmic
logrho = np.log10(rho) #wanted rho logarithmic
The interpn function has the following arguments: points, values, xi, method='linear', bounds_error=True, fill_value=nan. I figure that the points will be the pointsx and pointsy I have, the data is quite obvious, and xi will be the (T,rho) I'm looking for. But I'm not sure, what dimensions they should have? The points is the same size, as the data? So I have to make an array of the corresponding pairs of T and rho, which will be the points part, and then have a (T, rho) pair as xi?
When you aren't certain about how a function works, it's always a good idea to open up a REPL and test it yourself. In this case, the function works exactly as expected, given your understanding of the documentation.
>>> points = [[1, 2, 3, 4], [1, 2, 3, 4]] # Input values for each grid dimension
>>> values = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6], [4, 5, 6, 7]] # The grid itself
>>> xi = (1, 1.5)
>>> scipy.interpolate.interpn(points, values, xi)
array([ 1.5])
>>> xi = [[1, 1.5], [2, 1.5], [2, 2.5], [3, 2.5], [3, 3.5], [4, 3.5]]
>>> scipy.interpolate.interpn(points, values, xi)
array([ 1.5, 2.5, 3.5, 4.5, 5.5, 6.5])
The only thing you missed was that points is supposed to be a tuple. But as you can see from the above, it works even if points isn't a tuple.
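Applied to the grid from the question, the call might look like the sketch below. The axis names are introduced here for illustration, and the table is a made-up placeholder (a function that is linear in both variables, so the interpolated values are easy to check by hand); in practice data would come from np.loadtxt(filename) and must have shape (16, 11) to match the two axes.

```python
import numpy as np
from scipy.interpolate import interpn

log_T_axis = 7 + 0.2 * np.arange(16)        # 16 temperature grid values, as in the question
log_rho_axis = np.arange(11, dtype=float)   # 11 density grid values, as in the question

# Placeholder table standing in for the file contents: data[i, j] = logT_i + 2 * logrho_j.
TT, RR = np.meshgrid(log_T_axis, log_rho_axis, indexing='ij')
data = TT + 2.0 * RR

# points is a tuple with one 1-D axis per grid dimension; xi is a (logT, logrho) pair.
rate = interpn((log_T_axis, log_rho_axis), data, (7.5, 3.25))

# Several query points at once: an array of shape (n, 2).
rates = interpn((log_T_axis, log_rho_axis), data, np.array([[7.5, 3.25], [9.1, 10.0]]))
print(rate, rates)
```

So points has one entry per dimension (lengths 16 and 11 here), values has the full grid shape (16, 11), and xi carries the already log-transformed coordinates. With the default bounds_error=True, a query outside the grid raises an error instead of extrapolating.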

Sorting an Array Alongside a 2d Array

So I'm using NumPy's linear algebra routines to do some basic computational quantum mechanics. Say I have a matrix, hamiltonian, and I want its eigenvalues and eigenvectors
import numpy as np
from numpy import linalg as la
hamiltonian = np.zeros((N, N)) # N is some constant I have defined
# fill up hamiltonian here
energies, states = la.eig(hamiltonian)
Now, I want to sort the energies in increasing order, and I want to sort the states along with them. For example, if I do:
groundStateEnergy = min(energies)
groundStateIndex = np.where(energies == groundStateEnergy)
groundState = states[groundStateIndex, :]
I correctly plot the ground state (eigenvector with the lowest eigenvalue). However, if I try something like this:
energies, states = zip(*sorted(zip(energies, states)))
or even
energies, states = zip(*sorted(zip(energies, states), key = lambda pair:pair[0]))
plotting in the same way no longer plots the correct state. So how can I sort states alongside energies, but only by row? (i.e., I want to associate each row of states with a value in energies, and I want to rearrange the rows so that the ordering of the rows corresponds to the sorted ordering of the values in energies)
You can use argsort as follows:
>>> x = np.random.random(10)
>>> x
array([ 0.69719108, 0.75828237, 0.79944838, 0.68245968, 0.36232211,
0.46565445, 0.76552493, 0.94967472, 0.43531813, 0.22913607])
>>> y = np.random.random((10))
>>> y
array([ 0.64332275, 0.34984653, 0.55240204, 0.31019789, 0.96354724,
0.76723872, 0.25721343, 0.51629662, 0.13096252, 0.86220311])
>>> idx = np.argsort(x)
>>> idx
array([9, 4, 8, 5, 3, 0, 1, 6, 2, 7])
>>> xsorted= x[idx]
>>> xsorted
array([ 0.22913607, 0.36232211, 0.43531813, 0.46565445, 0.68245968,
0.69719108, 0.75828237, 0.76552493, 0.79944838, 0.94967472])
>>> ysortedbyx = y[idx]
>>> ysortedbyx
array([ 0.86220311, 0.96354724, 0.13096252, 0.76723872, 0.31019789,
0.64332275, 0.34984653, 0.25721343, 0.55240204, 0.51629662])
And, as suggested in the comments, here is an example where we sort a 2D array by its first column:
>>> x=np.random.random((10,2))
>>> x
array([[ 0.72789275, 0.29404982],
[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.81916424, 0.32032827],
[ 0.52958012, 0.00417253],
[ 0.41587698, 0.32733306],
[ 0.79918377, 0.18465189],
[ 0.678948 , 0.55039723],
[ 0.8287709 , 0.54735691],
[ 0.74044999, 0.70688683]])
>>> idx = np.argsort(x[:,0])
>>> idx
array([1, 2, 5, 4, 7, 0, 9, 6, 3, 8])
>>> xsorted = x[idx,:]
>>> xsorted
array([[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.41587698, 0.32733306],
[ 0.52958012, 0.00417253],
[ 0.678948 , 0.55039723],
[ 0.72789275, 0.29404982],
[ 0.74044999, 0.70688683],
[ 0.79918377, 0.18465189],
[ 0.81916424, 0.32032827],
[ 0.8287709 , 0.54735691]])
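For the eigenvalue problem in the question specifically, one detail matters: np.linalg.eig returns the eigenvectors as the columns of its second output, so states[:, k] (not states[k, :]) belongs to energies[k]. A minimal sketch of sorting both together follows; the 3x3 matrix is a made-up stand-in for the Hamiltonian.

```python
import numpy as np

# Small symmetric stand-in for the Hamiltonian (hypothetical values).
hamiltonian = np.array([[2.0, 1.0, 0.0],
                        [1.0, 3.0, 1.0],
                        [0.0, 1.0, 1.0]])

energies, states = np.linalg.eig(hamiltonian)

# Sort the eigenvalues, then apply the same permutation to the columns of states.
idx = np.argsort(energies)
energies = energies[idx]
states = states[:, idx]

# The ground state is now the first column.
ground_state = states[:, 0]
print(energies)
print(ground_state)
```

If the Hamiltonian is Hermitian, np.linalg.eigh may be a simpler choice, since it returns the eigenvalues already in ascending order with matching eigenvector columns.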
