Appending 2x2 covariance matrices in numpy - python

I have a numpy array such as:
gmm.sigma =
[[[ 4.64 -1.93]
  [-1.93  6.5 ]]

 [[ 3.65  2.89]
  [ 2.89 -1.26]]]
and I want to add another 2x2 matrix such as:
gauss.sigma =
[[-1.24  2.34]
 [ 2.34  4.76]]
to get:
gmm.sigma =
[[[ 4.64 -1.93]
  [-1.93  6.5 ]]

 [[ 3.65  2.89]
  [ 2.89 -1.26]]

 [[-1.24  2.34]
  [ 2.34  4.76]]]
I have tried: gmm.sigma = np.append(gmm.sigma, gauss.sigma, axis = 0),
but get this error:
Traceback (most recent call last):
  File "test1.py", line 40, in <module>
    gmm.sigma = np.append(gmm.sigma, gauss.sigma, axis = 0)
  File "/home/rowan/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4528, in append
    return concatenate((arr, values), axis=axis)
ValueError: all the input arrays must have same number of dimensions
Any help is appreciated.

Looks like you want to join the 2 arrays on the first axis - except that the second is only 2d. It needs an added dimension:
In [233]: arr = np.arange(8).reshape(2,2,2)
In [234]: arr1 = np.arange(10,14).reshape(2,2)
In [235]: np.concatenate((arr, arr1[None,:,:]), axis=0)
Out[235]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[10, 11],
        [12, 13]]])
dstack is a variation on concatenate that expands everything to 3d, and joins on the last axis. To use it we have to transpose everything:
In [236]: np.dstack((arr.T,arr1.T)).T
Out[236]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[10, 11],
        [12, 13]]])
index_tricks adds some classes that play similar tricks with dimensions:
In [241]: np.r_['0,3', arr, arr1]
Out[241]:
array([[[ 0,  1],
        [ 2,  3]],

       [[ 4,  5],
        [ 6,  7]],

       [[10, 11],
        [12, 13]]])
The docs of np.r_ require some reading if you want to get the most from it, but it might be worth using if you had to adjust the dimensions of several arrays, e.g. np.r_['0,3', arr1, arr, arr1], as sketched below.
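For instance, a quick sketch of that multi-array case (using the same arr and arr1 as above; each input is promoted to 3d before the join):
# '0,3' means: join on axis 0, after forcing every input to at least 3 dimensions
np.r_['0,3', arr1, arr, arr1].shape
# (4, 2, 2) -- each arr1 contributes one 2x2 block, arr contributes two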

You can use dstack, which stacks the arrays in sequence depth-wise (along the third axis), followed by a transpose. To get the desired output, you will have to stack gmm.T and gauss:
gmm = np.array([[[ 4.64, -1.93],
                 [-1.93,  6.5 ]],
                [[ 3.65,  2.89],
                 [ 2.89, -1.26]]])
gauss = np.array([[-1.24, 2.34],
                  [ 2.34, 4.76]])
result = np.dstack((gmm.T, gauss)).T
print(result)
print(result.shape)
# (3, 2, 2)
Output
array([[[ 4.64, -1.93],
        [-1.93,  6.5 ]],

       [[ 3.65,  2.89],
        [ 2.89, -1.26]],

       [[-1.24,  2.34],
        [ 2.34,  4.76]]])
Alternatively, you can use concatenate by properly reshaping your second array:
gmm = np.array([[[ 4.64, -1.93],
                 [-1.93,  6.5 ]],
                [[ 3.65,  2.89],
                 [ 2.89, -1.26]]])
gauss = np.array([[-1.24, 2.34],
                  [ 2.34, 4.76]]).reshape(1, 2, 2)
result = np.concatenate((gmm, gauss), axis=0)

As the error message states, the dimensions of gmm_sigma and gauss_sigma are not the same; you should reshape gauss_sigma before appending.
gmm_sigma = np.array([[[4.64, -1.93], [-1.93, 6.5]], [[3.65, 2.89], [ 2.89, -1.26]]])
gauss_sigma = np.array([[-1.24, 2.34], [2.34, 4.76]])
print(np.append(gmm_sigma, gauss_sigma.reshape(1, 2, 2), axis=0))
# array([[[ 4.64, -1.93],
#         [-1.93,  6.5 ]],
#
#        [[ 3.65,  2.89],
#         [ 2.89, -1.26]],
#
#        [[-1.24,  2.34],
#         [ 2.34,  4.76]]])
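Equivalently, indexing with None (np.newaxis) adds the leading axis without spelling out the shape; a small sketch along the same lines:
# gauss_sigma[None] has shape (1, 2, 2), so it concatenates cleanly on axis 0
print(np.append(gmm_sigma, gauss_sigma[None], axis=0).shape)
# (3, 2, 2)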

Related

Losing decimal when doing array operation in Python

I am trying to write a function that divides each column by its column sum, and here is what I have come up with.
A = np.array([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
print(A)
A = A.T
Asum = A.sum(axis=1)
print(Asum)
for i in range(len(Asum)):
    A[:, i] = A[:, i] / Asum[i]
I'm hoping for a decimal matrix, but it automatically turns into integers and gives me a zero matrix. Where did I go wrong?
You must change:
Asum = A.sum(axis=1)
to:
Asum = A.sum(axis=0)
to get the column-by-column sums.
You can also get the division easily with numpy.divide:
np.divide(A, Asum)
# array([[0.1, 0.1, 0.1],
#        [0.2, 0.2, 0.2],
#        [0.3, 0.3, 0.3],
#        [0.4, 0.4, 0.4]])
Or simply with:
A/Asum
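Putting it together, a minimal sketch (not from the original answer; note A must be float, or in-place integer assignment truncates the result):
import numpy as np

A = np.array([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]], dtype=float).T
Asum = A.sum(axis=0)   # column sums: [10., 10., 10.]
print(A / Asum)        # each column divided by its own sum
# [[0.1 0.1 0.1]
#  [0.2 0.2 0.2]
#  [0.3 0.3 0.3]
#  [0.4 0.4 0.4]]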
Your A is integer dtype; assigned floats get truncated. If A started as a float array your iteration would work. But you don't need to iterate to perform this calculation:
In [108]: A = np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]).T
In [109]: A
Out[109]:
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])
In [110]: Asum = A.sum(axis=1)
In [111]: Asum
Out[111]: array([ 3, 6, 9, 12])
A is (4,3), Asum is (4,). If we make it (4,1):
In [114]: Asum[:,None]
Out[114]:
array([[ 3],
       [ 6],
       [ 9],
       [12]])
we can perform the divide without iteration (review broadcasting if necessary):
In [115]: A/Asum[:,None]
Out[115]:
array([[0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333],
       [0.33333333, 0.33333333, 0.33333333]])
sum has a keepdims parameter that makes this kind of calculation easier:
In [117]: Asum = A.sum(axis=1, keepdims=True)
In [118]: Asum
Out[118]:
array([[ 3],
       [ 6],
       [ 9],
       [12]])
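With that (4,1) Asum, the division broadcasts without the explicit [:,None]; a sketch of the remaining step:
A/Asum
# array([[0.33333333, 0.33333333, 0.33333333],
#        [0.33333333, 0.33333333, 0.33333333],
#        [0.33333333, 0.33333333, 0.33333333],
#        [0.33333333, 0.33333333, 0.33333333]])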

Manual calculation of the covariance matrix is different from that of numpy method

Given a matrix of n samples with m features, I am calculating the covariance matrix of the features by hand, so it should be m by m. That can be done with numpy; for a small example,
arr_x = np.array([[ 1,  4],
                  [ 4, 10],
                  [ 3,  6],
                  [ 5, 11],
                  [ 2,  4]])
print(np.cov(arr_x, rowvar=False))
>>> [[ 2.5  5. ]
    [ 5.  11. ]]
I can get the same covariance matrix by doing
mean_x = np.mean(arr_x, 0)
np.dot(arr_x.T - mean_x[:, None], (arr_x.T - mean_x[:, None]).T) / (arr_x.shape[0] - 1)
>>> [[ 2.5,  5. ]
    [ 5. , 11. ]]
Yet the hand-derived method does not give the same result as numpy for a somewhat large matrix:
a = np.random.randint(0, 255, (5000, 200))
mean_a = np.mean(a, 0)
by_hand = np.dot(a.T-mean_a[:, None],(a.T-mean_a[:, None]).T)/(a.shape[0] - 1)
numpy_cov = np.cov(a, rowvar=False)
print((by_hand == numpy_cov).all())
What am I doing wrong or is there better way to get a covariance matrix by hand?
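A note not from the original thread: == on float arrays demands bit-for-bit equality, and np.cov orders its floating-point operations differently from the hand-rolled dot product, so the two results typically differ in the last few bits even though they agree mathematically. A sketch of a tolerance-based check:
# compare with a tolerance instead of exact equality
print(np.allclose(by_hand, numpy_cov))      # expected True
print(np.abs(by_hand - numpy_cov).max())    # tiny, rounding-level difference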

Sorting a huge matrix and finding the smallest elements with their indices

I have a matrix M that is rather large. I am trying to find the top 5 closest distances along with their indices.
M = csr_matrix(M)
dst = pairwise_distances(M,Y=None,metric='euclidean')
dst becomes a huge matrix and I am trying to sort it efficiently or use scipy or sklearn to find the closest 5 distances.
Here is an example of what I am trying to do:
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
I then calculate dst as:
[[ 0.  1.  3.  2.  1.]
 [ 1.  0.  2.  3.  2.]
 [ 3.  2.  0.  5.  4.]
 [ 2.  3.  5.  0.  1.]
 [ 1.  2.  4.  1.  0.]]
So, row 0 to itself has a distance of 0., row 0 to 1 has a distance of 1.,... row 2 to row 3 has a distance of 5., and so on. I want to find these closest 5 distances and put them in a list with the corresponding rows, maybe like [distance, row, row]. I don't want any diagonal elements or duplicate elements so I take the upper triangular matrix as follows:
[[ inf   1.   3.   2.   1.]
 [ nan  inf   2.   3.   2.]
 [ nan  nan  inf   5.   4.]
 [ nan  nan  nan  inf   1.]
 [ nan  nan  nan  nan  inf]]
Now, the smallest distances, from least to greatest, are:
[1, 0, 1], [1, 0, 4], [1, 3, 4], [2, 1, 2], [2, 0, 3], [2, 1, 4]
As you can see, there are three elements with distance 1 and three elements with distance 2. From the elements with distance 2 I want to randomly discard one, as I only want the top f elements, where f=5 in this case.
This is just a sample as this matrix could be very large. Is there an efficient way to do the above besides using a basic sorted function? I couldn't find any sklearn or scipy to help me with this.
Here's a fully vectorized solution to your problem:
import numpy as np
from scipy.spatial.distance import pdist

def smallest(M, f):
    # compute the condensed distance matrix
    dst = pdist(M, 'euclidean')
    # indices of the upper triangular matrix
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # indices of the f smallest distances
    idx = np.argsort(dst)[:f]
    # gather results in the specified format: distance, row, column
    return np.vstack((dst[idx], rows[idx], cols[idx])).T
Notice that np.argsort(dst)[:f] yields the indices of the smallest f elements of the condensed distance matrix dst sorted in ascending order.
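If the full sort is too slow for a very large condensed matrix, np.argpartition can select the f smallest entries first and sort only those; a sketch of a drop-in replacement for the argsort line:
# f smallest distances in arbitrary order, then sort just those f
idx = np.argpartition(dst, f)[:f]
idx = idx[np.argsort(dst[idx])]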
The following demo reproduces the result of your toy example and shows how the function smallest deals with a fairly large matrix of integers:
In [59]: X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
In [60]: smallest(X, 5)
Out[60]:
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  4.],
       [ 1.,  3.,  4.],
       [ 2.,  0.,  3.],
       [ 2.,  1.,  2.]])
In [61]: large_X = np.random.randint(100, size=(10000, 2000))
In [62]: large_X
Out[62]:
array([[ 8, 78, 97, ..., 23, 93, 90],
       [42,  2, 21, ..., 68, 45, 62],
       [28, 45, 30, ...,  0, 75, 48],
       ...,
       [26, 88, 78, ...,  0, 88, 43],
       [91, 53, 94, ..., 85, 44, 37],
       [39,  8, 10, ..., 46, 15, 67]])
In [63]: %time smallest(large_X, 5)
Wall time: 1min 32s
Out[63]:
array([[ 1676.12529365,  4815.        ,  5863.        ],
       [ 1692.97253374,  1628.        ,  2950.        ],
       [ 1693.558384  ,  5742.        ,  8240.        ],
       [ 1695.86408654,  2140.        ,  6969.        ],
       [ 1696.68853948,  5477.        ,  6641.        ]])

numpy array directional mean without dimension reduction

How would I do the following: with a 3D numpy array, I want to take the mean in one dimension and assign the values back to a 3D array with the same shape, with the mean values duplicated in the direction they were derived...
I'm struggling to work out an example in 3D, but in 2D (4x4) it would look a bit like this, I guess:
array([[1, 1, 2, 2],
       [2, 2, 1, 0],
       [1, 1, 2, 2],
       [4, 8, 3, 0]])
becomes
array([[2, 3, 2, 1],
       [2, 3, 2, 1],
       [2, 3, 2, 1],
       [2, 3, 2, 1]])
I'm struggling with np.mean and the loss of dimensions when taking an average.
You can use the keepdims keyword argument to keep that vanishing dimension, e.g.:
>>> a = np.random.randint(10, size=(4, 4)).astype(np.double)
>>> a
array([[ 7.,  9.,  9.,  7.],
       [ 7.,  1.,  3.,  4.],
       [ 9.,  5.,  9.,  0.],
       [ 6.,  9.,  1.,  5.]])
>>> a[:] = np.mean(a, axis=0, keepdims=True)
>>> a
array([[ 7.25,  6.  ,  5.5 ,  4.  ],
       [ 7.25,  6.  ,  5.5 ,  4.  ],
       [ 7.25,  6.  ,  5.5 ,  4.  ],
       [ 7.25,  6.  ,  5.5 ,  4.  ]])
You can resize the array after taking the mean:
In [24]: a = np.array([[1, 1, 2, 2],
                       [2, 2, 1, 0],
                       [2, 3, 2, 1],
                       [4, 8, 3, 0]])
In [25]: np.resize(a.mean(axis=0).astype(int), a.shape)
Out[25]:
array([[2, 3, 2, 0],
       [2, 3, 2, 0],
       [2, 3, 2, 0],
       [2, 3, 2, 0]])
In order to correctly satisfy the condition that duplicate values of the means appear in the direction they were derived, it's necessary to reshape the mean array to a shape which is broadcastable with the original array.
Specifically, the mean array should have the same shape as the original array except that the length of the dimension along which the mean was taken should be 1.
The following function should work for any shape of array and any number of dimensions:
def fill_mean(arr, axis):
    mean_arr = np.mean(arr, axis=axis)
    mean_shape = list(arr.shape)
    mean_shape[axis] = 1
    mean_arr = mean_arr.reshape(mean_shape)
    return np.zeros_like(arr) + mean_arr
Here's the function applied to your example array which I've called a:
>>> fill_mean(a, 0)
array([[ 2.25,  3.5 ,  2.  ,  0.75],
       [ 2.25,  3.5 ,  2.  ,  0.75],
       [ 2.25,  3.5 ,  2.  ,  0.75],
       [ 2.25,  3.5 ,  2.  ,  0.75]])
>>> fill_mean(a, 1)
array([[ 1.5 ,  1.5 ,  1.5 ,  1.5 ],
       [ 1.25,  1.25,  1.25,  1.25],
       [ 2.  ,  2.  ,  2.  ,  2.  ],
       [ 3.75,  3.75,  3.75,  3.75]])
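If a read-only result is acceptable, np.broadcast_to gives the same filled array without the zeros_like addition; an alternative sketch, not from the original answer:
def fill_mean_view(arr, axis):
    # keepdims keeps the reduced axis at length 1, so the mean broadcasts back
    mean_arr = np.mean(arr, axis=axis, keepdims=True)
    return np.broadcast_to(mean_arr, arr.shape)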
Construct the numpy array
import numpy as np

data = np.array([[1, 1, 2, 2],
                 [2, 2, 1, 0],
                 [1, 1, 2, 2],
                 [4, 8, 3, 0]])
Use the axis parameter to get means along a particular axis
>>> means = np.mean(data, axis=0)
>>> means
array([ 2., 3., 2., 1.])
Now tile that resulting array into the shape of the original
>>> print np.tile(means, (4,1))
[[ 2.  3.  2.  1.]
 [ 2.  3.  2.  1.]
 [ 2.  3.  2.  1.]
 [ 2.  3.  2.  1.]]
You can replace the (4,1) with parameters from data.shape, as sketched below.
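For example, a small sketch of the shape-driven version:
reps = (data.shape[0], 1)   # one copy of the mean row per row of data
print(np.tile(means, reps))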

How to get euclidean distance on a 3x3x3 array in numpy

Say I have a (3,3,3) array like this:
array([[[1, 1, 1],
        [1, 1, 1],
        [0, 0, 0]],

       [[2, 2, 2],
        [2, 2, 2],
        [2, 2, 2]],

       [[3, 3, 3],
        [3, 3, 3],
        [1, 1, 1]]])
How do I get the 9 values corresponding to the euclidean distance between each vector of 3 values and the corresponding vector in the zeroth block?
Such as doing numpy.linalg.norm([1,1,1] - [1,1,1]) 2 times, then norm([0,0,0] - [0,0,0]), then norm([2,2,2] - [1,1,1]) 2 times, norm([2,2,2] - [0,0,0]), then norm([3,3,3] - [1,1,1]) 2 times, and finally norm([1,1,1] - [0,0,0]).
Any good ways to vectorize this? I want to store the distances in a (3,3,1) matrix.
The result would be:
array([[[0.  ],
        [0.  ],
        [0.  ]],

       [[1.73],
        [1.73],
        [3.46]],

       [[3.46],
        [3.46],
        [1.73]]])
The keepdims argument was added in numpy 1.7; you can use it to keep the summed axis:
np.sum((x - [1, 1, 1])**2, axis=-1, keepdims=True)**0.5
the result is:
[[[ 0.        ]
  [ 0.        ]
  [ 1.73205081]]

 [[ 1.73205081]
  [ 1.73205081]
  [ 1.73205081]]

 [[ 3.46410162]
  [ 3.46410162]
  [ 0.        ]]]
which differs from your expected result because it subtracts the constant [1, 1, 1] rather than the zeroth block.
Edit
np.sum((x - x[0])**2, axis=-1, keepdims=True)**0.5
the result is:
array([[[ 0.        ],
        [ 0.        ],
        [ 0.        ]],

       [[ 1.73205081],
        [ 1.73205081],
        [ 3.46410162]],

       [[ 3.46410162],
        [ 3.46410162],
        [ 1.73205081]]])
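np.linalg.norm accepts the same axis and keepdims arguments (keepdims since numpy 1.10), so the whole computation can also be written as a one-liner; a sketch with the same x as above:
# euclidean norm over the last axis, keeping it as a length-1 dimension
np.linalg.norm(x - x[0], axis=-1, keepdims=True)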
You might want to consider scipy.spatial.distance.cdist(), which efficiently computes distances between pairs of points in two collections of inputs (with a standard euclidean metric, among others). Here's example code:
import numpy as np
import scipy.spatial.distance as dist

i = np.array([[[1, 1, 1],
               [1, 1, 1],
               [0, 0, 0]],
              [[2, 2, 2],
               [2, 2, 2],
               [2, 2, 2]],
              [[3, 3, 3],
               [3, 3, 3],
               [1, 1, 1]]])
n, m, o = i.shape
# compute euclidean distances of each vector to the vectors in i[0]:
# reshape the input array to 2-D, as required by cdist;
# keep only the diagonal, as cdist computes all pairwise distances;
# reshape the result to match the input array and the required output
d = dist.cdist(i.reshape(n*m, o), i[0]).reshape(n, m, o).diagonal(axis1=2).reshape(n, m, 1)
d holds:
array([[[ 0.        ],
        [ 0.        ],
        [ 0.        ]],

       [[ 1.73205081],
        [ 1.73205081],
        [ 3.46410162]],

       [[ 3.46410162],
        [ 3.46410162],
        [ 1.73205081]]])
The big caveat of this approach is that we're calculating n*m*o distances, when we only need n*m (and that it involves an insane amount of reshaping).
I'm doing something similar: computing the sum of squared distances (SSD) for each pair of frames in a video volume. I think it could be helpful for you.
video_volume is a single 4d numpy array. This array should have dimensions (time, rows, cols, 3) and dtype np.uint8.
Output is a square 2d numpy array of dtype float. output[i,j] should contain the SSD between frames i and j.
video_volume = video_volume.astype(float)
size_t = video_volume.shape[0]
output = np.zeros((size_t, size_t), dtype=float)
for i in range(size_t):
    for j in range(size_t):
        output[i, j] = np.square(video_volume[i, :, :, :] - video_volume[j, :, :, :]).sum()
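For what it's worth, the double loop can be vectorized with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b; a sketch under the same assumptions about video_volume:
# flatten each frame to one row, then compute all pairwise SSDs at once
flat = video_volume.reshape(size_t, -1).astype(float)
sq = (flat ** 2).sum(axis=1)                   # squared norm of each frame
output = sq[:, None] + sq[None, :] - 2 * flat @ flat.T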
