Sorting an Array Alongside a 2d Array - python

So I'm using NumPy's linear algebra routines to do some basic computational quantum mechanics. Say I have a matrix, hamiltonian, and I want its eigenvalues and eigenvectors:
import numpy as np
from numpy import linalg as la
hamiltonian = np.zeros((N, N)) # N is some constant I have defined
# fill up hamiltonian here
energies, states = la.eig(hamiltonian)
Now, I want to sort the energies in increasing order, and I want to sort the states along with them. For example, if I do:
groundStateEnergy = min(energies)
groundStateIndex = np.where(energies == groundStateEnergy)
groundState = states[groundStateIndex, :]
I correctly plot the ground state (eigenvector with the lowest eigenvalue). However, if I try something like this:
energies, states = zip(*sorted(zip(energies, states)))
or even
energies, states = zip(*sorted(zip(energies, states), key = lambda pair:pair[0]))
plotting in the same way no longer plots the correct state. So how can I sort states alongside energies, but only by row? (i.e., I want to associate each row of states with a value in energies, and I want to rearrange the rows so that their order matches the sorted order of the values in energies.)

You can use argsort as follows:
>>> x = np.random.random(10)
>>> x
array([ 0.69719108, 0.75828237, 0.79944838, 0.68245968, 0.36232211,
0.46565445, 0.76552493, 0.94967472, 0.43531813, 0.22913607])
>>> y = np.random.random((10))
>>> y
array([ 0.64332275, 0.34984653, 0.55240204, 0.31019789, 0.96354724,
0.76723872, 0.25721343, 0.51629662, 0.13096252, 0.86220311])
>>> idx = np.argsort(x)
>>> idx
array([9, 4, 8, 5, 3, 0, 1, 6, 2, 7])
>>> xsorted= x[idx]
>>> xsorted
array([ 0.22913607, 0.36232211, 0.43531813, 0.46565445, 0.68245968,
0.69719108, 0.75828237, 0.76552493, 0.79944838, 0.94967472])
>>> ysortedbyx = y[idx]
>>> ysortedbyx
array([ 0.86220311, 0.96354724, 0.13096252, 0.76723872, 0.31019789,
0.64332275, 0.34984653, 0.25721343, 0.55240204, 0.51629662])
and, as suggested in the comments, an example where we sort a 2D array by its first column:
>>> x=np.random.random((10,2))
>>> x
array([[ 0.72789275, 0.29404982],
[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.81916424, 0.32032827],
[ 0.52958012, 0.00417253],
[ 0.41587698, 0.32733306],
[ 0.79918377, 0.18465189],
[ 0.678948 , 0.55039723],
[ 0.8287709 , 0.54735691],
[ 0.74044999, 0.70688683]])
>>> idx = np.argsort(x[:,0])
>>> idx
array([1, 2, 5, 4, 7, 0, 9, 6, 3, 8])
>>> xsorted = x[idx,:]
>>> xsorted
array([[ 0.05149693, 0.24411234],
[ 0.34863983, 0.58950756],
[ 0.41587698, 0.32733306],
[ 0.52958012, 0.00417253],
[ 0.678948 , 0.55039723],
[ 0.72789275, 0.29404982],
[ 0.74044999, 0.70688683],
[ 0.79918377, 0.18465189],
[ 0.81916424, 0.32032827],
[ 0.8287709 , 0.54735691]])
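Applied to the eigenvalue problem in the question, the same argsort idea might look like the sketch below. It assumes a small made-up symmetric matrix in place of the real hamiltonian, and it relies on the documented convention that numpy.linalg.eig returns the eigenvectors as the columns of its second output, so the permutation is applied to the columns of states rather than the rows. (zip(energies, states) pairs each eigenvalue with a row, which is likely why the zip-based attempts lose the correspondence.)

```python
import numpy as np
from numpy import linalg as la

# Hypothetical 3x3 symmetric matrix standing in for the real hamiltonian.
hamiltonian = np.array([[2.0, 1.0, 0.0],
                        [1.0, 3.0, 1.0],
                        [0.0, 1.0, 4.0]])

energies, states = la.eig(hamiltonian)

# Indices that put the eigenvalues in increasing order.
idx = np.argsort(energies)

# Reorder the eigenvalues and the matching eigenvector columns together.
energies = energies[idx]
states = states[:, idx]

ground_state_energy = energies[0]
ground_state = states[:, 0]  # eigenvector belonging to the lowest eigenvalue
```

For a symmetric or Hermitian matrix, np.linalg.eigh is another option worth considering, since it is documented to return the eigenvalues already in ascending order.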

Related

Pick random samples from a 2D matrix and keep the indexes in Python

I have a NumPy 2D matrix with data in Python and I want to downsample it by keeping 25% of the initial samples. To do so, I am using np.random.randint as follows:
reduced_train_face = face_train[np.random.randint(face_train.shape[0], size=300), :]
However, I have a second matrix that contains the labels associated with the faces, and I want to reduce it in the same way. How can I keep the indexes from the reduced matrix and apply them to the train_lbls matrix?
You can fix the seed just before applying your extraction:
import numpy as np
# Each label corresponds to the first element of each row of face_train
labels_train = np.array(range(0,15,3))
face_train = np.array(range(15)).reshape(5,3)
np.random.seed(0)
reduced_train_face = face_train[np.random.randint(face_train.shape[0], size=3), :]
np.random.seed(0)
reduced_train_labels = labels_train[np.random.randint(labels_train.shape[0], size=3)]
print(reduced_train_face, reduced_train_labels)
# [[12, 13, 14], [ 0, 1, 2], [ 9, 10, 11]], [12, 0, 9]
With the same seed, both arrays are reduced in the same way.
Edit: I advise you to use np.random.choice(n_total_elem, n_reduce_elem, replace=False) to ensure that each sample is chosen at most once (by default, np.random.choice samples with replacement).
Why don't you keep the selected index and use them to select data from both matrices?
import numpy as np
# setting up matrices
np.random.seed(1234) # make example repeatable
# the seeding is optional, only for showing the
# same results as below!
face_train = np.random.rand(8,3)
train_lbls= np.random.rand(8)
print('face_train:\n', face_train)
print('labels:\n', train_lbls)
# Setting the random indexes
random_idxs= np.random.randint(face_train.shape[0], size=4)
print('random_idxs:\n', random_idxs)
# Using the indexes to slice the matrixes
reduced_train_face = face_train[random_idxs, :]
reduced_labels = train_lbls[random_idxs]
print('reduced_train_face:\n', reduced_train_face)
print('reduced_labels:\n', reduced_labels)
Gives as output:
face_train:
[[ 0.19151945 0.62210877 0.43772774]
[ 0.78535858 0.77997581 0.27259261]
[ 0.27646426 0.80187218 0.95813935]
[ 0.87593263 0.35781727 0.50099513]
[ 0.68346294 0.71270203 0.37025075]
[ 0.56119619 0.50308317 0.01376845]
[ 0.77282662 0.88264119 0.36488598]
[ 0.61539618 0.07538124 0.36882401]]
labels:
[ 0.9331401 0.65137814 0.39720258 0.78873014 0.31683612 0.56809865
0.86912739 0.43617342]
random_idxs:
[1 7 5 4]
reduced_train_face:
[[ 0.78535858 0.77997581 0.27259261]
[ 0.61539618 0.07538124 0.36882401]
[ 0.56119619 0.50308317 0.01376845]
[ 0.68346294 0.71270203 0.37025075]]
reduced_labels:
[ 0.65137814 0.43617342 0.56809865 0.31683612]
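If each sample should appear at most once, one option is to draw the row indexes without replacement and reuse them for both arrays. The sketch below assumes NumPy's Generator API (np.random.default_rng); the array contents, the seed, and the 25% fraction are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(1234)  # seed is optional, only for repeatability

# Placeholder data: 8 samples with 3 features each, plus one label per sample.
face_train = np.arange(24).reshape(8, 3)
train_lbls = np.arange(8)

# Keep 25% of the rows, each chosen at most once.
n_keep = face_train.shape[0] // 4
random_idxs = rng.choice(face_train.shape[0], size=n_keep, replace=False)

# The same indexes select matching rows and labels.
reduced_train_face = face_train[random_idxs, :]
reduced_labels = train_lbls[random_idxs]
```

Because the indexes are stored in random_idxs, there is no need to reseed the generator between the two selections.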

Scipy Multivariate Normal: How to draw deterministic samples?

I am using scipy.stats.multivariate_normal to draw samples from a multivariate normal distribution, like this:
from scipy.stats import multivariate_normal
# Assume we have means and covs
mn = multivariate_normal(mean = means, cov = covs)
# Generate some samples
samples = mn.rvs()
The samples are different on every run. How do I always get the same sample?
I was expecting something like:
mn = multivariate_normal(mean = means, cov = covs, seed = aNumber)
or
samples = mn.rvs(seed = aNumber)
There are two ways:
The rvs() method accepts a random_state argument. Its value can be an integer seed, or an instance of numpy.random.Generator or numpy.random.RandomState. In this example, I use an integer seed:
In [46]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [47]: mn.rvs(size=5, random_state=12345)
Out[47]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
In [48]: mn.rvs(size=5, random_state=12345)
Out[48]:
array([[-0.51943872, 1.07094986, -1.0235383 ],
[ 1.39340583, 4.39561899, -2.77865152],
[ 0.76902257, 0.63000355, 0.46453938],
[-1.29622111, 2.25214387, 6.23217368],
[ 1.35291684, 0.51186476, 1.37495817]])
This version uses an instance of numpy.random.Generator:
In [34]: rng = np.random.default_rng(438753948759384)
In [35]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [36]: mn.rvs(size=5, random_state=rng)
Out[36]:
array([[ 0.30626179, 0.60742839, 2.86919105],
[ 1.61859885, 2.63409111, 1.19018398],
[ 0.35469027, 0.85685011, 6.76892829],
[-0.88659459, -0.59922575, -5.43926698],
[ 0.94777687, -5.80057427, -2.16887719]])
Alternatively, you can set the seed for numpy's global random number generator. This is the generator that multivariate_normal.rvs() uses if random_state is not given:
In [54]: mn = multivariate_normal(mean=[0,0,0], cov=[1, 5, 25])
In [55]: np.random.seed(123)
In [56]: mn.rvs(size=5)
Out[56]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
In [57]: np.random.seed(123)
In [58]: mn.rvs(size=5)
Out[58]:
array([[ 0.2829785 , 2.23013222, -5.42815302],
[ 1.65143654, -1.2937895 , -7.53147357],
[ 1.26593626, -0.95907779, -12.13339622],
[ -0.09470897, -1.51803558, -4.33370201],
[ -0.44398196, -1.4286283 , 7.45694813]])
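To make the difference between these options concrete, here is a small sketch. It assumes SciPy and NumPy are installed, and the seed values are arbitrary. Note that a Generator instance advances every time it is used, so repeating a draw requires a freshly seeded generator rather than reusing the same one.

```python
import numpy as np
from scipy.stats import multivariate_normal

mn = multivariate_normal(mean=[0, 0, 0], cov=[1, 5, 25])

# An integer seed gives the same samples on every call.
a = mn.rvs(size=5, random_state=12345)
b = mn.rvs(size=5, random_state=12345)

# A Generator keeps its state between calls, so two successive draws differ...
rng = np.random.default_rng(2024)
c = mn.rvs(size=5, random_state=rng)
d = mn.rvs(size=5, random_state=rng)

# ...unless a new generator is created from the same seed.
e = mn.rvs(size=5, random_state=np.random.default_rng(2024))

print(np.array_equal(a, b), np.array_equal(c, d), np.array_equal(c, e))
```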

*Update* Creating an array for distance between two 2-D arrays

So I have two arrays that have x, y, z coordinates. I'm just trying to apply the 3D distance formula. The problem is that I can't find a post that handles arrays with multiple values in each column and returns an array.
print MW_FirstsubPos1
[[ 51618.7265625 106197.7578125 69647.6484375 ]
[ 33864.1953125 11757.29882812 11849.90332031]
[ 12750.09863281 58954.91015625 38067.0859375 ]
...,
[ 99002.6640625 96021.0546875 18798.44726562]
[ 27180.83984375 74350.421875 78075.78125 ]
[ 19297.88476562 82161.140625 1204.53503418]]
print MW_SecondsubPos1
[[ 51850.9140625 106004.0078125 69536.5234375 ]
[ 33989.9375 11847.11425781 12255.80859375]
[ 12526.203125 58372.3046875 37641.34765625]
...,
[ 98823.2734375 95837.1796875 18758.7734375 ]
[ 27047.19140625 74242.859375 78166.703125 ]
[ 19353.97851562 82375.8515625 1147.07556152]]
Yes, they are the same shape.
My attempt,
import numpy as np
xs1,ys1,zs1 = zip(*MW_FirstsubPos1)
xs11,ys11,zs11 = zip(*MW_SecondsubPos1)
squared_dist1 = (xs11 - xs1)**2 + (ys11 - ys1)**2 + (zs11 - zs1)**2
dist1 = np.sqrt(squared_dist1)
print dist1
This returns:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
I'm just wanting to return a 1-D array of the same shape.
* --------------------- Update --------------------- *
Using what Sнаđошƒаӽ said,
Distance1 = []
for Fir1, Sec1 in zip(MW_FirstsubVel1, MW_SecondsubPos1):
    dist1 = 0
    for i in range(3):
        dist1 += (Fir1[i]-Sec1[i])**2
    Distance1.append(dist1**0.5)
But when comparing the distance formula for one element in my original post such as,
squared_dist1 = (xs11[0] - xs1[0])**2 + (ys11[0] - ys1[0])**2 + (zs11[0] - zs1[0])**2
dist1 = np.sqrt(squared_dist1)
print dist1
returns 322.178309762
while
result = []
for a, b in zip(MW_FirstsubVel1, MW_SecondsubPos1):
    dist = 0
    for i in range(3):
        dist += (a[i]-b[i])**2
    result.append(dist**0.5)
print result[0]
returns 137163.203004
What's wrong here?
Your solutions look good to me.
A better idea is to use the linear algebra module in the scipy package, as it scales to multi-dimensional data. Here is my code:
import scipy.linalg as LA
dist1 = LA.norm(MW_FirstsubPos1 - MW_SecondsubPos1, axis=1)
See if this works, assuming that aaa and bbb are normal Python lists of lists holding the x, y and z coordinates (or that you can convert your arrays to such, perhaps using tolist). result will hold the 1-D list of distances you are looking for.
Edit: aaa and bbb are Python lists of lists. Only the code for printing the output has been added.
aaa = [[51618.7265625, 106197.7578125, 69647.6484375],
[33864.1953125, 11757.29882812, 11849.90332031],
[12750.09863281, 58954.91015625, 38067.0859375],
[99002.6640625, 96021.0546875, 18798.44726562],
[27180.83984375, 74350.421875, 78075.78125],
[19297.88476562, 82161.140625, 1204.53503418]]
bbb = [[51850.9140625, 106004.0078125, 69536.5234375],
[33989.9375, 11847.11425781, 12255.80859375],
[12526.203125, 58372.3046875, 37641.34765625],
[98823.2734375, 95837.1796875, 18758.7734375],
[27047.19140625, 74242.859375, 78166.703125],
[19353.97851562, 82375.8515625, 1147.07556152]]
result = []
for a, b in zip(aaa, bbb):
    dist = 0
    for i in range(3):
        dist += (a[i]-b[i])**2
    result.append(dist**0.5)
for elem in result:
    print(elem)
Output:
322.178309762234
434.32361222259755
755.5206249710258
259.9327309143388
194.16071591842936
229.23543894772612
Here's a vectorized approach using np.einsum -
diffs = MW_FirstsubPos1 - MW_SecondsubPos1
dists = np.sqrt(np.einsum('ij,ij->i',diffs,diffs))
Sample run -
In [233]: MW_FirstsubPos1
Out[233]:
array([[2, 0, 0],
[8, 6, 1],
[0, 2, 8],
[7, 6, 3],
[3, 1, 7]])
In [234]: MW_SecondsubPos1
Out[234]:
array([[3, 4, 7],
[0, 8, 4],
[4, 7, 4],
[2, 5, 6],
[5, 0, 6]])
In [235]: diffs = MW_FirstsubPos1 - MW_SecondsubPos1
In [236]: np.sqrt(np.einsum('ij,ij->i',diffs,diffs))
Out[236]: array([ 8.1240384 , 8.77496439, 7.54983444, 5.91607978, 2.44948974])
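As a cross-check, the loop, np.linalg.norm, and np.einsum versions can be compared on the first two rows printed in the question. This is only a sketch: the short names first and second are stand-ins for MW_FirstsubPos1 and MW_SecondsubPos1. Note also that the loops in the question's update iterate over MW_FirstsubVel1 rather than MW_FirstsubPos1, which would presumably explain the mismatched value reported there.

```python
import numpy as np

# First two rows of each array, copied from the question.
first = np.array([[51618.7265625, 106197.7578125, 69647.6484375],
                  [33864.1953125, 11757.29882812, 11849.90332031]])
second = np.array([[51850.9140625, 106004.0078125, 69536.5234375],
                   [33989.9375, 11847.11425781, 12255.80859375]])

diffs = first - second

# Three equivalent ways to get one Euclidean distance per row.
dist_norm = np.linalg.norm(diffs, axis=1)
dist_einsum = np.sqrt(np.einsum('ij,ij->i', diffs, diffs))
dist_loop = np.array([sum((a - b) ** 2 for a, b in zip(r1, r2)) ** 0.5
                      for r1, r2 in zip(first, second)])

print(dist_norm)
```

All three results should agree with the first two values in the list-of-lists output shown earlier (about 322.178 and 434.324).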

python all possible products between columns

I have a numpy matrix X and I would like to add to this matrix as new variables all the possible products between 2 columns.
So if X=(x1,x2,x3) I want X=(x1,x2,x3,x1x2,x2x3,x1x3)
Is there an elegant way to do that?
I think a combination of numpy and itertools should work
EDIT:
Very good answers, but do they take into account that X is a matrix? So x1, x2, x3 can themselves be arrays?
EDIT:
A Real example
a=array([[1,2,3],[4,5,6]])
Itertools should be the answer here.
import itertools
a = [1, 2, 3]
p = (x * y for x, y in itertools.combinations(a, 2))
print(list(itertools.chain(a, p)))
Result:
[1, 2, 3, 2, 3, 6] # 1, 2, 3, 1 x 2, 1 x 3, 2 x 3
I think Samy's solution is pretty good. If you need to use numpy, you could transform it a little like this:
from itertools import combinations
from numpy import prod
x = [1, 2, 3]
print(x + [int(prod(c)) for c in combinations(x, 2)])
Gives the same output as Samy's solution:
[1, 2, 3, 2, 3, 6]
If your arrays are small, then Samy's pure-Python solution using itertools.combinations should be fine:
from itertools import combinations, chain
def all_products1(a):
    p = (x * y for x, y in combinations(a, 2))
    return list(chain(a, p))
But if your arrays are large, then you'll get a substantial speedup by fully vectorizing the computation, using numpy.triu_indices, like this:
import numpy as np
def all_products2(a):
    x, y = np.triu_indices(len(a), 1)
    return np.r_[a, a[x] * a[y]]
Let's compare these:
>>> from timeit import timeit
>>> data = np.random.uniform(0, 100, (10000,))
>>> timeit(lambda:all_products1(data), number=1)
53.745754408999346
>>> timeit(lambda:all_products2(data), number=1)
12.26144006299728
The solution using numpy.triu_indices also works for multi-dimensional data:
>>> np.random.uniform(0, 100, (3,2))
array([[ 63.75071196, 15.19461254],
[ 94.33972762, 50.76916376],
[ 88.24056878, 90.36136808]])
>>> all_products2(_)
array([[ 63.75071196, 15.19461254],
[ 94.33972762, 50.76916376],
[ 88.24056878, 90.36136808],
[ 6014.22480172, 771.41777239],
[ 5625.39908354, 1373.00597677],
[ 8324.59122432, 4587.57109368]])
If you want to operate on columns rather than rows, use:
def all_products3(a):
    x, y = np.triu_indices(a.shape[1], 1)
    return np.c_[a, a[:,x] * a[:,y]]
For example:
>>> np.random.uniform(0, 100, (2,3))
array([[ 33.0062385 , 28.17575024, 20.42504351],
[ 40.84235995, 61.12417428, 58.74835028]])
>>> all_products3(_)
array([[ 33.0062385 , 28.17575024, 20.42504351, 929.97553238,
674.15385734, 575.4909246 ],
[ 40.84235995, 61.12417428, 58.74835028, 2496.45552756,
2399.42126888, 3590.94440122]])
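For the concrete matrix from the question, the column-wise version can be tried as follows. This is a sketch that repeats all_products3 so that it runs on its own; the new columns come out in the order x1x2, x1x3, x2x3.

```python
import numpy as np

def all_products3(a):
    # Index pairs (x, y) with x < y, one pair per combination of two columns.
    x, y = np.triu_indices(a.shape[1], 1)
    # Append the pairwise column products to the right of the original columns.
    return np.c_[a, a[:, x] * a[:, y]]

a = np.array([[1, 2, 3],
              [4, 5, 6]])
result = all_products3(a)
print(result)
# [[ 1  2  3  2  3  6]
#  [ 4  5  6 20 24 30]]
```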

Matrix Approximation and Predicting Timeseries in Python/R with SVD

I have an Excel file with 126 rows and 5 columns of numbers. I have to use that data and SVD methods to predict 5-10 more rows of data. I have implemented SVD in Python successfully using numpy:
import numpy as np
from numpy import genfromtxt
my_data = genfromtxt('data.csv', delimiter=',')
U, s, V = np.linalg.svd(my_data)
print ("U:")
print (U)
print ("\nSigma:")
print (s)
print ("\nVT:")
print (V)
which outputs:
U:
[[-0.03339497 0.10018171 0.01013636 ..., -0.10076323 -0.09740801
-0.08901366]
[-0.02881809 0.0992715 -0.01239945 ..., -0.02920558 -0.04133748
-0.06100236]
[-0.02501102 0.10637736 -0.0528663 ..., -0.0885227 -0.05408083
-0.01678337]
...,
[-0.02418483 0.10993637 0.05200962 ..., 0.9734676 -0.01866914
-0.00870467]
[-0.02944344 0.10238372 0.02009676 ..., -0.01948701 0.98455034
-0.00975614]
[-0.03109401 0.0973963 -0.0279125 ..., -0.01072974 -0.0109425
0.98929811]]
Sigma:
[ 252943.48015512 74965.29844851 15170.76769244 4357.38062076
3934.63212778]
VT:
[[-0.16143572 -0.22105626 -0.93558846 -0.14545156 -0.16908786]
[ 0.5073101 0.40240734 -0.34460639 0.45443181 0.50541365]
[-0.11561044 0.87141558 -0.07426656 -0.26914744 -0.38641073]
[ 0.63320943 -0.09361249 0.00794671 -0.75788695 0.12580436]
[-0.54977724 0.14516905 -0.01849291 -0.35426346 0.74217676]]
But I am not sure how to use this data to predict my values. I am using this link http://datascientistinsights.com/2013/02/17/single-value-decomposition-a-golfers-tutotial/ as a reference, but that is in R. At the end they predict values with this R command:
approxGolf_1 <- golfSVD$u[,1] %*% t(golfSVD$v[,1]) * golfSVD$d[1]
Here is the IdeOne link to the entire R code: http://ideone.com/Yj3y6j
I'm not really familiar with R so can anyone let me know if there is a similar function in Python to the command above or explain what that command is doing exactly?
Thanks.
I will use the golf course example data you linked, to set the stage:
import numpy as np
A=np.matrix((4,4,3,4,4,3,4,2,5,4,5,3,5,4,5,4,4,5,5,5,2,4,4,4,3,4,5))
A=A.reshape((3,9)).T
This gives you the original 9 rows, 3 columns table with scores of 9 holes for 3 players:
matrix([[4, 4, 5],
[4, 5, 5],
[3, 3, 2],
[4, 5, 4],
[4, 4, 4],
[3, 5, 4],
[4, 4, 3],
[2, 4, 4],
[5, 5, 5]])
Now the singular value decomposition:
U, s, V = np.linalg.svd(A)
The most important thing to investigate is the vector s of singular values:
array([ 21.11673273, 2.0140035 , 1.423864 ])
It shows that the first value is much bigger than the others, indicating that the corresponding Truncated SVD with only one value represents the original matrix A quite well. To calculate this representation, you take column 1 of U multiplied by the first row of V, multiplied by the first singular value. This is what the last cited command in R does. Here is the same in Python:
U[:,0]*s[0]*V[0,:]
And here is the result of this product:
matrix([[ 3.95411864, 4.64939923, 4.34718814],
[ 4.28153222, 5.03438425, 4.70714912],
[ 2.42985854, 2.85711772, 2.67140498],
[ 3.97540054, 4.67442327, 4.37058562],
[ 3.64798696, 4.28943826, 4.01062464],
[ 3.69694905, 4.3470097 , 4.06445393],
[ 3.34185528, 3.92947728, 3.67406114],
[ 3.09108399, 3.63461111, 3.39836128],
[ 4.5599837 , 5.36179782, 5.0132808 ]])
Concerning the vector factors U[:,0] and V[0,:]: Figuratively speaking, U can be seen as a representation of a hole's difficulty, while V encodes a player's strength.
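The same rank-1 approximation can also be written with plain arrays instead of np.matrix. The sketch below is one way to do it, assuming np.outer for the column-times-row product; the helper name rank_k_approx is made up for illustration and also covers keeping more than one singular value.

```python
import numpy as np

# Golf scores from the example above: 9 holes (rows) x 3 players (columns).
A = np.array([4, 4, 3, 4, 4, 3, 4, 2, 5,
              4, 5, 3, 5, 4, 5, 4, 4, 5,
              5, 5, 2, 4, 4, 4, 3, 4, 5]).reshape(3, 9).T

# full_matrices=False keeps U at shape (9, 3), which is all that is needed here.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(U, s, Vt, k):
    """Sum of the first k terms s[i] * outer(U[:, i], Vt[i, :])."""
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Rank-1 approximation: first column of U, first singular value, first row of Vt.
approx1 = s[0] * np.outer(U[:, 0], Vt[0, :])

# Keeping all three singular values reproduces A up to rounding error.
approx3 = rank_k_approx(U, s, Vt, 3)
```

The Frobenius-norm error of the rank-1 approximation equals the square root of the sum of the squared discarded singular values, which is one way to judge how many values are worth keeping.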
