How would I do the following:
With a 3D numpy array I want to take the mean in one dimension and assign the values back to a 3D array with the same shape, with duplicate values of the means in the direction they were derived...
I'm struggling to work out an example in 3D but in 2D (4x4) it would look a bit like this I guess
array[[1, 1, 2, 2]
[2, 2, 1, 0]
[1, 1, 2, 2]
[4, 8, 3, 0]]
becomes
array[[2, 3, 2, 1]
[2, 3, 2, 1]
[2, 3, 2, 1]
[2, 3, 2, 1]]
I'm struggling with np.mean and the loss of a dimension when taking an average.
You can use the keepdims keyword argument to keep that vanishing dimension, e.g.:
>>> a = np.random.randint(10, size=(4, 4)).astype(np.double)
>>> a
array([[ 7., 9., 9., 7.],
[ 7., 1., 3., 4.],
[ 9., 5., 9., 0.],
[ 6., 9., 1., 5.]])
>>> a[:] = np.mean(a, axis=0, keepdims=True)
>>> a
array([[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ],
[ 7.25, 6. , 5.5 , 4. ]])
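The question asks about 3D arrays, and the same keepdims idea carries over. A minimal sketch, assuming a float array of shape (2, 3, 4) averaged along axis 1 (the shape, the axis and the name filled are only illustrative); np.broadcast_to expands the means back to the original shape without modifying a:

```python
import numpy as np

# Illustrative 3D input; any shape and axis should behave the same way.
a = np.arange(24, dtype=float).reshape(2, 3, 4)

# keepdims=True leaves a length-1 axis where the mean was taken: shape (2, 1, 4)
means = a.mean(axis=1, keepdims=True)

# Repeat the means along axis 1; copy() turns the read-only view into a normal array
filled = np.broadcast_to(means, a.shape).copy()

print(filled.shape)   # (2, 3, 4)
print(filled[0])      # every row of this slab holds the same means
```

The in-place form a[:] = ... stores the means in a's own dtype, so with an integer array they would be truncated; that is presumably why the example above converts to np.double first.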
You can resize the array after taking the mean:
In [24]: a = np.array([[1, 1, 2, 2],
[2, 2, 1, 0],
[2, 3, 2, 1],
[4, 8, 3, 0]])
In [25]: np.resize(a.mean(axis=0).astype(int), a.shape)
Out[25]:
array([[2, 3, 2, 0],
[2, 3, 2, 0],
[2, 3, 2, 0],
[2, 3, 2, 0]])
In order to correctly satisfy the condition that duplicate values of the means appear in the direction they were derived, it's necessary to reshape the mean array to a shape which is broadcastable with the original array.
Specifically, the mean array should have the same shape as the original array except that the length of the dimension along which the mean was taken should be 1.
The following function should work for any shape of array and any number of dimensions:
def fill_mean(arr, axis):
    mean_arr = np.mean(arr, axis=axis)
    mean_shape = list(arr.shape)
    mean_shape[axis] = 1
    mean_arr = mean_arr.reshape(mean_shape)
    return np.zeros_like(arr) + mean_arr
Here's the function applied to your example array which I've called a:
>>> fill_mean(a, 0)
array([[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75],
[ 2.25, 3.5 , 2. , 0.75]])
>>> fill_mean(a, 1)
array([[ 1.5 , 1.5 , 1.5 , 1.5 ],
[ 1.25, 1.25, 1.25, 1.25],
[ 2. , 2. , 2. , 2. ],
[ 3.75, 3.75, 3.75, 3.75]])
Construct the numpy array
import numpy as np
data = np.array(
[[1, 1, 2, 2],
[2, 2, 1, 0],
[1, 1, 2, 2],
[4, 8, 3, 0]]
)
Use the axis parameter to get means along a particular axis
>>> means = np.mean(data, axis=0)
>>> means
array([ 2., 3., 2., 1.])
Now tile that resulting array into the shape of the original
>>> print(np.tile(means, (4, 1)))
[[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]
[ 2. 3. 2. 1.]]
You can replace the 4,1 with parameters from data.shape
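A short sketch of that last remark, using data.shape instead of hard-coded repeat counts. The names tiled_cols and tiled_rows are made up for illustration, and the axis=1 case is an addition: there the means first have to become a column before tiling.

```python
import numpy as np

data = np.array(
    [[1, 1, 2, 2],
     [2, 2, 1, 0],
     [1, 1, 2, 2],
     [4, 8, 3, 0]]
)

# Means taken down the rows (axis=0): repeat the 1-D result once per row
col_means = data.mean(axis=0)
tiled_cols = np.tile(col_means, (data.shape[0], 1))

# Means taken across the columns (axis=1): make a column vector, then repeat it sideways
row_means = data.mean(axis=1)
tiled_rows = np.tile(row_means[:, np.newaxis], (1, data.shape[1]))

print(tiled_cols)
print(tiled_rows)
```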
I have a matrix M that is rather large. I am trying to find the top 5 closest distances along with their indices.
M = csr_matrix(M)
dst = pairwise_distances(M,Y=None,metric='euclidean')
dst becomes a huge matrix and I am trying to sort it efficiently or use scipy or sklearn to find the closest 5 distances.
Here is an example of what I am trying to do:
X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
I then calculate dst as:
[[ 0. 1. 3. 2. 1.]
[ 1. 0. 2. 3. 2.]
[ 3. 2. 0. 5. 4.]
[ 2. 3. 5. 0. 1.]
[ 1. 2. 4. 1. 0.]]
So, row 0 to itself has a distance of 0., row 0 to 1 has a distance of 1.,... row 2 to row 3 has a distance of 5., and so on. I want to find these closest 5 distances and put them in a list with the corresponding rows, maybe like [distance, row, row]. I don't want any diagonal elements or duplicate elements so I take the upper triangular matrix as follows:
[[ inf 1. 3. 2. 1.]
[ nan inf 2. 3. 2.]
[ nan nan inf 5. 4.]
[ nan nan nan inf 1.]
[ nan nan nan nan inf]]
Now, the candidates for the top 5 distances, least to greatest, are:
[1, 0, 1], [1, 0, 4], [1, 3, 4], [2, 1, 2], [2, 0, 3], [2, 1, 4]
As you can see, there are three elements that have distance 2 and three elements that have distance 1. From these I want to randomly choose two of the elements with distance 2 to keep, as I only want the top f elements, where f=5 in this case.
This is just a sample, as the matrix could be very large. Is there an efficient way to do the above besides using a basic sorted function? I couldn't find anything in sklearn or scipy to help me with this.
Here's a fully vectorized solution to your problem:
import numpy as np
from scipy.spatial.distance import pdist
def smallest(M, f):
    # compute the condensed distance matrix
    dst = pdist(M, 'euclidean')
    # indices of the upper triangular matrix
    rows, cols = np.triu_indices(M.shape[0], k=1)
    # indices of the f smallest distances
    idx = np.argsort(dst)[:f]
    # gather results in the specified format: distance, row, column
    return np.vstack((dst[idx], rows[idx], cols[idx])).T
Notice that np.argsort(dst)[:f] yields the indices of the smallest f elements of the condensed distance matrix dst sorted in ascending order.
The following demo reproduces the result of your toy example and shows how the function smallest deals with a fairly large matrix of integers:
In [59]: X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
In [60]: smallest(X, 5)
Out[60]:
array([[ 1., 0., 1.],
[ 1., 0., 4.],
[ 1., 3., 4.],
[ 2., 0., 3.],
[ 2., 1., 2.]])
In [61]: large_X = np.random.randint(100, size=(10000, 2000))
In [62]: large_X
Out[62]:
array([[ 8, 78, 97, ..., 23, 93, 90],
[42, 2, 21, ..., 68, 45, 62],
[28, 45, 30, ..., 0, 75, 48],
...,
[26, 88, 78, ..., 0, 88, 43],
[91, 53, 94, ..., 85, 44, 37],
[39, 8, 10, ..., 46, 15, 67]])
In [63]: %time smallest(large_X, 5)
Wall time: 1min 32s
Out[63]:
array([[ 1676.12529365, 4815. , 5863. ],
[ 1692.97253374, 1628. , 2950. ],
[ 1693.558384 , 5742. , 8240. ],
[ 1695.86408654, 2140. , 6969. ],
[ 1696.68853948, 5477. , 6641. ]])
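Since only the f smallest distances are needed, a full argsort of the condensed vector does more work than necessary. The following is a sketch of a possible variant, not part of the answer above: it uses np.argpartition to pull out the f smallest entries and then sorts only those. The function name smallest_from_condensed and its arguments are hypothetical; dst is assumed to be a condensed distance vector in the order pdist returns (built here with plain NumPy so the snippet has no SciPy dependency), and f is assumed to be at least 1.

```python
import numpy as np

def smallest_from_condensed(dst, n, f):
    """Return the f smallest entries of a condensed distance vector as [distance, row, col]."""
    # Row/column indices in the same order as the condensed vector
    rows, cols = np.triu_indices(n, k=1)
    f = min(f, dst.size)
    # argpartition puts the f smallest entries first, in no particular order
    part = np.argpartition(dst, f - 1)[:f]
    # sort just those f candidates in ascending order
    idx = part[np.argsort(dst[part])]
    return np.vstack((dst[idx], rows[idx], cols[idx])).T

X = np.array([[2, 3, 5], [2, 3, 6], [2, 3, 8], [2, 3, 3], [2, 3, 4]])
r, c = np.triu_indices(X.shape[0], k=1)
dst = np.linalg.norm(X[r] - X[c], axis=1)   # stands in for pdist(X, 'euclidean')
result = smallest_from_condensed(dst, X.shape[0], 5)
print(result)
```

Which of the tied distance-2 pairs end up in the result is not specified by argpartition, so the selection among ties is arbitrary rather than uniformly random.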
I have a 3x3 numpy array and I want to divide each column of this with a vector 3x1. I know how to divide each row by elements of the vector, but am unable to find a solution to divide each column.
You can transpose your array to divide on each column
(arr_3x3.T/arr_3x1).T
Let's try several things:
In [347]: A=np.arange(9.).reshape(3,3)
In [348]: A
Out[348]:
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.]])
In [349]: x=10**np.arange(3).reshape(3,1)
In [350]: A/x
Out[350]:
array([[ 0. , 1. , 2. ],
[ 0.3 , 0.4 , 0.5 ],
[ 0.06, 0.07, 0.08]])
So this has divided each row by a different value
In [351]: A/x.T
Out[351]:
array([[ 0. , 0.1 , 0.02],
[ 3. , 0.4 , 0.05],
[ 6. , 0.7 , 0.08]])
And this has divided each column by a different value
A (3,3) array divided by a (3,1) array replicates x across the columns.
With the transpose, the (1,3) array is replicated across the rows.
It's important that x be 2d when using .T (transpose). A (3,) array transposes to a (3,) array - that is, no change.
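A small sketch of the 1-D case, assuming a divisor vector v of shape (3,) (the names v, by_column and by_row are illustrative). Because .T does nothing to a 1-D array, an explicit new axis is used to switch between dividing columns and dividing rows:

```python
import numpy as np

A = np.arange(9.).reshape(3, 3)
v = np.array([1., 10., 100.])      # shape (3,)

# A 1-D divisor lines up with the last axis: column j is divided by v[j]
by_column = A / v

# Adding an axis gives shape (3, 1): row i is divided by v[i]
by_row = A / v[:, np.newaxis]

print(v.T.shape)    # (3,), the transpose of a 1-D array is unchanged
print(by_column)
print(by_row)
```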
The simplest seems to be
A = np.arange(1,10).reshape(3,3)
b=np.arange(1,4)
A/b
A will be
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
and b will be
array([1, 2, 3])
and the division will produce
array([[1. , 1. , 1. ],
[4. , 2.5, 2. ],
[7. , 4. , 3. ]])
The first column is divided by 1, the second column by 2, and the third by 3.
If I've mixed up your columns and rows, simply transpose with .T, as C_Z_ answered above.
Many questions have already been asked along the same lines.
I also read the official documentation (http://www.scipy.org/scipylib/faq.html#what-is-the-difference-between-matrices-and-arrays) regarding the differences. But I am still struggling to understand the philosophical difference between numpy arrays and matrices.
More precisely, I am looking for the reason behind the results shown below.
#using array
>>> A = np.array([[ 1, -1, 2],
[ 0, 1, -1],
[ 0, 0, 1]])
>>> b = np.array([5,-1,3])
>>> x = np.linalg.solve(A,b)
>>> x
array([ 1., 2., 3.])
#using matrix
>>> A=np.mat(A)
>>> b=np.mat(b)
>>> A
matrix([[ 1, -1, 2],
[ 0, 1, -1],
[ 0, 0, 1]])
>>> b
matrix([[ 5, -1, 3]])
>>> x = np.linalg.solve(A,b)
>>> x
matrix([[ 5., -1., 3.],
[ 10., -2., 6.],
[ 5., -1., 3.]])
Why does the system of linear equations yield the correct solution when represented with arrays, while the matrix representation yields a different, matrix-valued result?
Honestly, I also don't understand why I get a matrix as the solution in the second case.
Sorry if this question has already been answered and I failed to notice, and pardon me if my understanding of numpy arrays and matrices is wrong.
You have a transpose issue...when you go to matrix land, column-vectors and row-vectors are no longer interchangeable:
import numpy as np
A = np.array([[ 1, -1, 2],
[ 0, 1, -1],
[ 0, 0, 1]])
b = np.array([5,-1,3])
x = np.linalg.solve(A, b)
print 'arrays:'
print x
A = np.matrix(A)
b = np.matrix(b)
x = np.linalg.solve(A, b)
print 'matrix, wrong set up:'
print x
b = b.T
x = np.linalg.solve(A, b)
print 'matrix, right set up:'
print x
yields:
arrays:
[ 1. 2. 3.]
matrix, wrong set up:
[[ 5. -1. 3.]
[ 10. -2. 6.]
[ 5. -1. 3.]]
matrix, right set up:
[[ 1.]
[ 2.]
[ 3.]]
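The same point can be made with plain arrays. This is a sketch under the assumption that you stay with np.array rather than np.matrix (the names x_vec, b_col and x_col are illustrative): a 1-D b is treated as a single right-hand-side vector, and an explicit (3, 1) column gives the column-shaped solution.

```python
import numpy as np

A = np.array([[1, -1,  2],
              [0,  1, -1],
              [0,  0,  1]])
b = np.array([5, -1, 3])

# 1-D right-hand side: the solution comes back 1-D
x_vec = np.linalg.solve(A, b)

# Explicit column vector of shape (3, 1): the solution is a column too
b_col = b.reshape(-1, 1)
x_col = np.linalg.solve(A, b_col)

print(x_vec)
print(x_col)
print(np.allclose(A @ x_col, b_col))   # check that A x = b holds
```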
I have a very large 2D numpy array that contains 2x2 subsets that I need to take the average of. I am looking for a way to vectorize this operation. For example, given x:
# |- col 0 -| |- col 1 -| |- col 2 -|
x = np.array( [[ 0.0, 1.0, 2.0, 3.0, 4.0, 5.0], # row 0
[ 6.0, 7.0, 8.0, 9.0, 10.0, 11.0], # row 0
[12.0, 13.0, 14.0, 15.0, 16.0, 17.0], # row 1
[18.0, 19.0, 20.0, 21.0, 22.0, 23.0]]) # row 1
I need to end up with a 2x3 array which are the averages of each 2x2 sub array, i.e.:
result = np.array( [[ 3.5, 5.5, 7.5],
[15.5, 17.5, 19.5]])
so element [0,0] is calculated as the average of x[0:2,0:2], while element [0,1] would be the average of x[0:2, 2:4]. Does numpy have vectorized/efficient ways of doing aggregates on subsets like this?
If we form the reshaped matrix y = x.reshape(2,2,3,2), then the (i,j) 2x2 submatrix is given by y[i,:,j,:]. E.g.:
In [340]: x
Out[340]:
array([[ 0., 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10., 11.],
[ 12., 13., 14., 15., 16., 17.],
[ 18., 19., 20., 21., 22., 23.]])
In [341]: y = x.reshape(2,2,3,2)
In [342]: y[0,:,0,:]
Out[342]:
array([[ 0., 1.],
[ 6., 7.]])
In [343]: y[1,:,2,:]
Out[343]:
array([[ 16., 17.],
[ 22., 23.]])
To get the mean of the 2x2 submatrices, use the mean method, with axis=(1,3):
In [344]: y.mean(axis=(1,3))
Out[344]:
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5]])
If you are using an older version of numpy that doesn't support using a tuple for the axis, you could do:
In [345]: y.mean(axis=1).mean(axis=-1)
Out[345]:
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5]])
See the link given by @dashesy in a comment for more background on the reshaping "trick".
To generalize this to a 2-d array with shape (m, n), where m and n are even, use
y = x.reshape(x.shape[0]//2, 2, x.shape[1]//2, 2)
y can then be interpreted as an array of 2x2 arrays. The first and third index slots of the 4-d array act as the indices that select one of the 2x2 blocks. To get the upper left 2x2 block, use y[0, :, 0, :]; to get the block in the second row and third column of blocks, use y[1, :, 2, :]; and in general, to access block (j, k), use y[j, :, k, :].
To compute the reduced array of averages of these blocks, use the mean method, with axis=(1, 3) (i.e. average over axes 1 and 3):
avg = y.mean(axis=(1, 3))
Here's an example where x has shape (8, 10), so the array of averages of the 2x2 blocks has shape (4, 5):
In [10]: np.random.seed(123)
In [11]: x = np.random.randint(0, 4, size=(8, 10))
In [12]: x
Out[12]:
array([[2, 1, 2, 2, 0, 2, 2, 1, 3, 2],
[3, 1, 2, 1, 0, 1, 2, 3, 1, 0],
[2, 0, 3, 1, 3, 2, 1, 0, 0, 0],
[0, 1, 3, 3, 2, 0, 3, 2, 0, 3],
[0, 1, 0, 3, 1, 3, 0, 0, 0, 2],
[1, 1, 2, 2, 3, 2, 1, 0, 0, 3],
[2, 1, 0, 3, 2, 2, 2, 2, 1, 2],
[0, 3, 3, 3, 1, 0, 2, 0, 2, 1]])
In [13]: y = x.reshape(x.shape[0]//2, 2, x.shape[1]//2, 2)
Take a look at a couple of the 2x2 blocks:
In [14]: y[0, :, 0, :]
Out[14]:
array([[2, 1],
[3, 1]])
In [15]: y[1, :, 2, :]
Out[15]:
array([[3, 2],
[2, 0]])
Compute the averages of the blocks:
In [16]: avg = y.mean(axis=(1, 3))
In [17]: avg
Out[17]:
array([[ 1.75, 1.75, 0.75, 2. , 1.5 ],
[ 0.75, 2.5 , 1.75, 1.5 , 0.75],
[ 0.75, 1.75, 2.25, 0.25, 1.25],
[ 1.5 , 2.25, 1.25, 1.5 , 1.5 ]])
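The reshape trick can be wrapped in a small helper for arbitrary block sizes. This is a sketch, not something from the answers above: the name block_mean and its parameters bh and bw (block height and width) are made up for illustration, and it assumes the array shape is an exact multiple of the block shape.

```python
import numpy as np

def block_mean(x, bh, bw):
    """Average non-overlapping bh x bw blocks of a 2-D array."""
    m, n = x.shape
    if m % bh or n % bw:
        raise ValueError("array shape must be divisible by the block shape")
    # Axes 0 and 2 pick the block, axes 1 and 3 index inside the block
    return x.reshape(m // bh, bh, n // bw, bw).mean(axis=(1, 3))

x = np.arange(24.0).reshape(4, 6)
avg = block_mean(x, 2, 2)
print(avg)
```

Integer division (//) matters here because reshape only accepts integer dimensions.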
Right now I am doing this by iterating, but there has to be a way to accomplish this task using numpy functions. My goal is to take a 2D array and average J columns at a time, producing a new array with the same number of rows as the original, but with columns/J columns.
So I want to take this:
J = 2 // two columns averaged at a time
[[1 2 3 4]
[4 3 7 1]
[6 2 3 4]
[3 4 4 1]]
and produce this:
[[1.5 3.5]
[3.5 4.0]
[4.0 3.5]
[3.5 2.5]]
Is there a simple way to accomplish this task? I also need to make sure that I never end up with an unaveraged remainder column. So if, for example, I have an input array with 5 columns and J=2, I would average the first two columns, then the last three columns.
Any help you can provide would be great.
data.reshape(-1,j).mean(axis=1).reshape(data.shape[0],-1)
If your j divides data.shape[1], that is.
Example:
In [40]: data
Out[40]:
array([[7, 9, 7, 2],
[7, 6, 1, 5],
[8, 1, 0, 7],
[8, 3, 3, 2]])
In [41]: data.reshape(-1,j).mean(axis=1).reshape(data.shape[0],-1)
Out[41]:
array([[ 8. , 4.5],
[ 6.5, 3. ],
[ 4.5, 3.5],
[ 5.5, 2.5]])
First of all, it looks to me like you're not averaging the columns at all; you're just averaging two data points at a time. It seems to me you're best off reshaping the array so that you have an Nx2 data structure that you can feed directly to mean. You may have to pad it first if the number of columns isn't a multiple of j. Then at the end, just do a weighted average of the padded remainder column and the one before it. Finally, reshape back to the shape you want.
To play off of the example provided by TheodrosZelleke:
In [1]: data = np.concatenate((data, np.array([[5, 6, 7, 8]]).T), 1)
In [2]: data
Out[2]:
array([[7, 9, 7, 2, 5],
[7, 6, 1, 5, 6],
[8, 1, 0, 7, 7],
[8, 3, 3, 2, 8]])
In [3]: cols = data.shape[1]
In [4]: j = 2
In [5]: dataPadded = np.concatenate((data, np.zeros((data.shape[0], (-cols) % j))), 1)
In [6]: dataPadded
Out[6]:
array([[ 7., 9., 7., 2., 5., 0.],
[ 7., 6., 1., 5., 6., 0.],
[ 8., 1., 0., 7., 7., 0.],
[ 8., 3., 3., 2., 8., 0.]])
In [7]: dataAvg = dataPadded.reshape((-1,j)).mean(axis=1).reshape((data.shape[0], -1))
In [8]: dataAvg
Out[8]:
array([[ 8. , 4.5, 2.5],
[ 6.5, 3. , 3. ],
[ 4.5, 3.5, 3.5],
[ 5.5, 2.5, 4. ]])
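In [9]: if cols % j:
   ...:     # dataAvg[:, -1] is the remainder's sum divided by j (zeros were padded in),
   ...:     # so multiplying by j recovers the sum of the real remainder values
   ...:     dataAvg[:, -2] = (dataAvg[:, -2] * j + dataAvg[:, -1] * j) / (j + cols % j)
   ...:     dataAvg = dataAvg[:, :-1]
   ...:
In [10]: dataAvg
Out[10]:
array([[ 8. , 4.66666667],
[ 6.5 , 4. ],
[ 4.5 , 4.66666667],
[ 5.5 , 4.33333333]])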
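An alternative that avoids padding altogether is to sum each group of columns with np.add.reduceat and divide by the group sizes. This is only a sketch: the function name average_columns is hypothetical, and it assumes the leftover columns should be folded into the last group, as the question describes.

```python
import numpy as np

def average_columns(data, j):
    """Average j columns at a time; leftover columns join the last group."""
    n_cols = data.shape[1]
    n_groups = max(n_cols // j, 1)
    # Start index of each group; the last group runs to the end of the array
    starts = np.arange(n_groups) * j
    sums = np.add.reduceat(data, starts, axis=1)
    # Number of columns in each group, e.g. [2, 3] for 5 columns and j=2
    counts = np.diff(np.append(starts, n_cols))
    return sums / counts

data = np.array([[7, 9, 7, 2, 5],
                 [7, 6, 1, 5, 6],
                 [8, 1, 0, 7, 7],
                 [8, 3, 3, 2, 8]])
print(average_columns(data, 2))
```

With 5 columns and j=2 this averages columns 0-1 and columns 2-4, so the first row becomes 8 and 14/3.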