I have a numpy array as below:
[[3.4, 87]
[5.5, 11]
[22, 3]
[4, 9.8]
[41, 11.22]
[32, 7.6]]
and I want to:
compare the elements in column 2, three rows at a time
delete the row with the biggest value in column 2 within each group of three rows
For example, in the first 3 rows, the 3 values in column 2 are 87, 11 and 3, respectively, and I would like to keep 11 and 3.
The output numpy array I expected would be:
[[5.5, 11]
[22, 3]
[4, 9.8]
[32, 7.6]]
I am new to numpy arrays; please give me advice on how to achieve this.
import numpy as np

x = np.array([[3.4, 87],
              [5.5, 11],
              [22, 3],
              [4, 9.8],
              [41, 11.22],
              [32, 7.6]])

# Group the rows in threes: shape (2, 3, 2)
y = x.reshape(-1, 3, 2)
# Position of the maximum column-2 value within each group
idx = y[..., 1].argmax(axis=1)
# True for every row that does not hold its group's maximum
mask = np.arange(3)[None, :] != idx[:, None]
y = y[mask]
print(y)
# This might be helpful for the deleted part of your question
# y = y.reshape(-1,2,2)
# z = y[...,1]/y[...,1].sum(axis=1)
# result = np.dstack([y, z[...,None]])
yields
[[ 5.5 11. ]
[ 22. 3. ]
[ 4. 9.8]
[ 32. 7.6]]
"Grouping by three" with NumPy can be done by reshaping the array to create a new axis of length 3 -- provided the original number of rows is divisible by 3:
In [92]: y = x.reshape(-1,3,2); y
Out[92]:
array([[[ 3.4 , 87. ],
[ 5.5 , 11. ],
[ 22. , 3. ]],
[[ 4. , 9.8 ],
[ 41. , 11.22],
[ 32. , 7.6 ]]])
In [93]: y.shape
Out[93]: (2, 3, 2)
(2, 3, 2)
 |  |  |
 |  |  o--- 2 columns in each group
 |  o------ 3 rows in each group
 o--------- 2 groups
For each group, we can select the second column and find the row with the maximum value:
In [94]: idx = y[..., 1].argmax(axis=1); idx
Out[94]: array([0, 1])
array([0, 1]) indicates that in the first group, the row at index 0 contains the maximum (i.e. 87), and in the second group, the row at index 1 contains the maximum (i.e. 11.22).
Next, we can generate a 2D boolean selection mask which is True where the rows do not contain the maximum value:
In [95]: mask = np.arange(3)[None, :] != idx[:, None]; mask
Out[95]:
array([[False, True, True],
[ True, False, True]], dtype=bool)
In [96]: mask.shape
Out[96]: (2, 3)
mask has shape (2, 3) and y has shape (2, 3, 2). If mask is used to index y, as in y[mask], then the mask is aligned with the first two axes of y, and all the length-2 rows where mask is True are returned:
In [98]: y[mask]
Out[98]:
array([[ 5.5, 11. ],
[ 22. , 3. ],
[ 4. , 9.8],
[ 32. , 7.6]])
In [99]: y[mask].shape
Out[99]: (4, 2)
By the way, the same calculation could be done using Pandas like this:
import numpy as np
import pandas as pd
x = np.array([[3.4, 87],
[5.5, 11],
[22, 3],
[4, 9.8],
[41, 11.22],
[32, 7.6]])
df = pd.DataFrame(x)
idx = df.groupby(df.index // 3)[1].idxmax()
# drop the row with the maximum value in each group
df = df.drop(idx.values, axis=0)
which yields the DataFrame:
      0     1
1   5.5  11.0
2  22.0   3.0
3   4.0   9.8
5  32.0   7.6
You might find Pandas syntax easier to use, but for the above calculation NumPy
is faster.
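If you want to check that claim on your own machine, here is a rough timing sketch (the exact numbers will vary with machine, library versions, and array size):
import timeit

import numpy as np
import pandas as pd

x = np.array([[3.4, 87], [5.5, 11], [22, 3],
              [4, 9.8], [41, 11.22], [32, 7.6]])

def numpy_way(x):
    y = x.reshape(-1, 3, 2)
    idx = y[..., 1].argmax(axis=1)
    return y[np.arange(3)[None, :] != idx[:, None]]

def pandas_way(x):
    df = pd.DataFrame(x)
    return df.drop(df.groupby(df.index // 3)[1].idxmax().values, axis=0)

# Each call reports the total seconds for 10000 runs
print(timeit.timeit(lambda: numpy_way(x), number=10000))
print(timeit.timeit(lambda: pandas_way(x), number=10000))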
Related
I have a 3-dimensional numpy ndarray of shape (i, j, k), where each row is an array of multiple similarly sized vectors.
I would like to extract an (i, k)-shaped array such that each element of a row contains the first non-zero element of its group of j vectors.
So basically, given an array such as:
[
[
[0 , 10 , 12 , 0 , 4],
[0 , 0 , 13 , 1 , 2],
[12, 14 , 1 , 12 , 8]
],
[
[5 , 17 , 12 , 9 , 0],
[0 , 0 , 13 , 1 , 0],
[12, 14 , 1 , 12 , 8]
],
[
[0 , 0 , 19 , 0 , 9],
[2 , 6 , 13 , 0 , 2],
[12, 14 , 1 , 12 , 8]
]
]
I'm looking to find something like:
[
[12, 10, 12, 1, 4],
[5 , 17, 12, 9, 8],
[2 , 6, 19, 12, 9]
]
How can I find the result efficiently?
import numpy as np
i, j, k = 3, 5, 10  # in the real problem, these are pretty large
arr = np.random.randint(0, 10000, (i,j,k))
#????????
Using a simple transpose (https://docs.scipy.org/doc/numpy/reference/generated/numpy.transpose.html) and advanced indexing (https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing), the solution looks like this:
import numpy as np
a = np.array([
[
[0 , 10 , 12 , 0 , 4],
[0 , 0 , 13 , 1 , 2],
[12, 14 , 1 , 12 , 8]
],
[
[5 , 17 , 12 , 9 , 0],
[0 , 0 , 13 , 1 , 0],
[12, 14 , 1 , 12 , 8]
],
[
[0 , 0 , 19 , 0 , 9],
[2 , 6 , 13 , 0 , 2],
[12, 14 , 1 , 12 , 8]
]
])
# swapping the 0 and 1 axes, to make the rest of the code easier
a = a.transpose((1, 0, 2))
# initialising result to zeroes, same shape as a single layer of the transposed a
result = np.zeros(a[0].shape, np.int32)
# one layer at a time
for layer in a:
    # as soon as result contains no more zeroes, we are done
    if not (result == 0).any():
        break
    # replace all values of result that are still zero
    # with values from the same location in layer
    result[result == 0] = layer[result == 0]
print(result)
Prints:
[[12 10 12 1 4]
[ 5 17 12 9 8]
[ 2 6 19 12 9]]
This is not the most elegant way to do it, as I wrote it rather quickly on the fly. However, hopefully it is sufficient to get you started.
def find_non_zero(lst):
    # Return the first non-zero entry of a 1-D sequence
    # (returns None if every entry is zero).
    for num in lst:
        if num != 0:
            return num
First, we define a very simple helper function, find_non_zero, which receives a one-dimensional list as input and returns its first non-zero entry. This function is then used as we loop over each column of every two-dimensional array in arrays, the three-dimensional input provided.
import numpy as np
arrays = [
[
[0 , 10 , 12 , 0 , 4],
[0 , 0 , 13 , 1 , 2],
[12, 14 , 1 , 12 , 8]
],
[
[5 , 17 , 12 , 9 , 0],
[0 , 0 , 13 , 1 , 0],
[12, 14 , 1 , 12 , 8]
],
[
[0 , 0 , 19 , 0 , 9],
[2 , 6 , 13 , 0 , 2],
[12, 14 , 1 , 12 , 8]
]
]
final_result = []
for array in arrays:
    array = np.array(array).T
    final_result.append([find_non_zero(col) for col in array])
print(final_result)
This yields the following output.
[[12, 10, 12, 1, 4], [5, 17, 12, 9, 8], [2, 6, 19, 12, 9]]
NumPy's .T transpose attribute comes in handy because it allows us to loop through a two-dimensional array by columns instead of by rows, the default. For more information on the transpose operator, read the official documentation.
I'd be happy to answer any additional questions you might have.
Use np.apply_along_axis with a function that applies np.nonzero and takes the first element along the desired axis (x below is the 3-D input array):
np.apply_along_axis(lambda e: e[np.nonzero(e)][0], 1, x)
OR
np.apply_along_axis(lambda e: e[(e!=0).argmax()], 1, x)
Output
array([[12, 10, 12, 1, 4],
[ 5, 17, 12, 9, 8],
[ 2, 6, 19, 12, 9]])
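Note that np.apply_along_axis calls the lambda once per 1-D slice in a Python-level loop. As a fully vectorized alternative, here is a sketch using argmax on a boolean array, which returns the position of the first True (this assumes every (i, k) slot has at least one non-zero entry, and NumPy >= 1.15 for take_along_axis):
import numpy as np

a = np.array([[[0, 10, 12, 0, 4],
               [0, 0, 13, 1, 2],
               [12, 14, 1, 12, 8]],
              [[5, 17, 12, 9, 0],
               [0, 0, 13, 1, 0],
               [12, 14, 1, 12, 8]],
              [[0, 0, 19, 0, 9],
               [2, 6, 13, 0, 2],
               [12, 14, 1, 12, 8]]])

# Position (along axis 1) of the first non-zero element for each (i, k) slot;
# if a whole column group is zero, this silently picks index 0.
first_nz = (a != 0).argmax(axis=1)

# Gather those elements and drop the singleton axis
result = np.take_along_axis(a, first_nz[:, None, :], axis=1)[:, 0, :]
print(result)
# [[12 10 12  1  4]
#  [ 5 17 12  9  8]
#  [ 2  6 19 12  9]]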
I have a numpy array such as:
gmm.sigma =
[[[ 4.64 -1.93]
[-1.93 6.5 ]]
[[ 3.65 2.89]
[ 2.89 -1.26]]]
and I want to add another 2x2 matrix such as:
gauss.sigma=
[[ -1.24 2.34]
[ 2.34 4.76]]
to get:
gmm.sigma =
[[[ 4.64 -1.93]
[-1.93 6.5 ]]
[[ 3.65 2.89]
[ 2.89 -1.26]]
[[-1.24 2.34]
[ 2.34 4.76]]]
I have tried: gmm.sigma = np.append(gmm.sigma, gauss.sigma, axis = 0),
but get this error:
Traceback (most recent call last):
File "test1.py", line 40, in <module>
gmm.sigma = np.append(gmm.sigma, gauss.sigma, axis = 0)
File "/home/rowan/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4528, in append
return concatenate((arr, values), axis=axis)
ValueError: all the input arrays must have same number of dimensions
Any help is appreciated
Looks like you want to join the 2 arrays on the first axis - except that the second is only 2d. It needs an added dimension:
In [233]: arr = np.arange(8).reshape(2,2,2)
In [234]: arr1 = np.arange(10,14).reshape(2,2)
In [235]: np.concatenate((arr, arr1[None,:,:]), axis=0)
Out[235]:
array([[[ 0, 1],
[ 2, 3]],
[[ 4, 5],
[ 6, 7]],
[[10, 11],
[12, 13]]])
dstack is a variation on concatenate that expands everything to 3d, and joins on the last axis. To use it we have to transpose everything:
In [236]: np.dstack((arr.T,arr1.T)).T
Out[236]:
array([[[ 0, 1],
[ 2, 3]],
[[ 4, 5],
[ 6, 7]],
[[10, 11],
[12, 13]]])
index_tricks adds some classes that play similar tricks with dimensions:
In [241]: np.r_['0,3', arr, arr1]
Out[241]:
array([[[ 0, 1],
[ 2, 3]],
[[ 4, 5],
[ 6, 7]],
[[10, 11],
[12, 13]]])
The docs for np.r_ require some reading if you want to get the most from it, but it might be worth using if you have to adjust the dimensions of several arrays, e.g. np.r_['0,3', arr1, arr, arr1].
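For instance, with the arrays above, that three-array call would promote both 2-d arrays to 3-d before joining; a small sketch showing just the resulting shape:
In [242]: np.r_['0,3', arr1, arr, arr1].shape
Out[242]: (4, 2, 2)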
You can use dstack, which stacks the arrays in sequence depth-wise (along the third axis), followed by a transpose. To get the desired output, you will have to stack gmm.T and gauss:
gmm = np.array([[[4.64, -1.93],
[-1.93, 6.5 ]],
[[3.65, 2.89],
[2.89, -1.26]]])
gauss = np.array([[ -1.24, 2.34],
[2.34, 4.76]])
result = np.dstack((gmm.T, gauss)).T
print (result)
print (result.shape)
# (3, 2, 2)
Output
array([[[ 4.64, -1.93],
[-1.93, 6.5 ]],
[[ 3.65, 2.89],
[ 2.89, -1.26]],
[[-1.24, 2.34],
[ 2.34, 4.76]]])
Alternatively, you can also use concatenate by properly reshaping your second array:
gmm = np.array([[[4.64, -1.93],
[-1.93, 6.5 ]],
[[3.65, 2.89],
[2.89, -1.26]]])
gauss = np.array([[ -1.24, 2.34],
[2.34, 4.76]]).reshape(1,2,2)
result = np.concatenate((gmm, gauss), axis=0)
As the error message states, the dimensions of gmm_sigma and gauss_sigma are not the same; you should reshape gauss_sigma before appending.
gmm_sigma = np.array([[[4.64, -1.93], [-1.93, 6.5]], [[3.65, 2.89], [ 2.89, -1.26]]])
gauss_sigma = np.array([[-1.24, 2.34], [2.34, 4.76]])
print(np.append(gmm_sigma, gauss_sigma.reshape(1, 2, 2), axis=0))
# array([[[ 4.64, -1.93],
# [-1.93, 6.5 ]],
#
# [[ 3.65, 2.89],
# [ 2.89, -1.26]],
#
# [[-1.24, 2.34],
# [ 2.34, 4.76]]])
I have a numpy array of data where I need to keep only n highest values, and zero everything else.
My current solution:
import numpy as np
np.random.seed(30)
# keep only the n highest values
n = 3
# Simple 2x5 data field for this example; the real-life application will be extremely large
data = np.random.random((2,5))
#[[ 0.64414354 0.38074849 0.66304791 0.16365073 0.96260781]
# [ 0.34666184 0.99175099 0.2350579 0.58569427 0.4066901 ]]
# find indices of the n highest values per row
idx = np.argsort(data)[:,-n:]
#[[0 2 4]
# [4 3 1]]
# put those values back in a blank array
data_ = np.zeros(data.shape) # blank slate
for i in range(data.shape[0]):  # use xrange here on Python 2
    data_[i, idx[i]] = data[i, idx[i]]
# Each row contains only the 3 highest values per row or the original data
#[[ 0.64414354 0. 0.66304791 0. 0.96260781]
# [ 0. 0.99175099 0. 0.58569427 0.4066901 ]]
In the code above, data_ has the n highest values per row and everything else is zeroed out. This works out nicely even if data.shape[1] is smaller than n. The only issue is the for loop, which is slow because my actual use case involves very large arrays.
Is it possible to get rid of the for loop?
You could apply np.argsort twice -- the first call gives the index order and the second gives the ranks -- in a vectorized fashion, and then use either np.where or simple multiplication to zero out everything else:
In [116]: np.argsort(data)
Out[116]:
array([[3, 1, 0, 2, 4],
[2, 0, 4, 3, 1]])
In [117]: np.argsort(np.argsort(data)) # these are the ranks
Out[117]:
array([[2, 1, 3, 0, 4],
[1, 4, 0, 3, 2]])
In [118]: np.argsort(np.argsort(data)) >= data.shape[1] - 3
Out[118]:
array([[ True, False, True, False, True],
[False, True, False, True, True]], dtype=bool)
In [119]: data * (np.argsort(np.argsort(data)) >= data.shape[1] - 3)
Out[119]:
array([[ 0.64414354, 0. , 0.66304791, 0. , 0.96260781],
[ 0. , 0.99175099, 0. , 0.58569427, 0.4066901 ]])
In [120]: np.where(np.argsort(np.argsort(data)) >= data.shape[1]-3, data, 0)
Out[120]:
array([[ 0.64414354, 0. , 0.66304791, 0. , 0.96260781],
[ 0. , 0.99175099, 0. , 0.58569427, 0.4066901 ]])
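If only membership in the top n matters (and not a full sort), a possibly faster sketch uses np.argpartition, which partially sorts in O(n) per row rather than O(n log n). Unlike the rank trick above, argpartition raises an error when n exceeds the row length, and it assumes NumPy >= 1.15 for take_along_axis/put_along_axis:
import numpy as np

np.random.seed(30)
data = np.random.random((2, 5))
n = 3

# Column indices of the n largest values per row (unordered within the top n)
top_idx = np.argpartition(data, -n, axis=1)[:, -n:]

# Scatter the kept values into a zero array of the same shape
out = np.zeros_like(data)
np.put_along_axis(out, top_idx, np.take_along_axis(data, top_idx, axis=1), axis=1)
print(out)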
How can I divide two numpy matrices A and B in Python when the two matrices sometimes have 0 in the same cell?
Basically A[i,j] >= B[i,j] for all i, j. I need to calculate C = A/B, but sometimes A[i,j] == B[i,j] == 0, and when this happens I need A[i,j]/B[i,j] to be defined as 0.
Is there a simple pythonic way other than going through all the indexes?
You can use the where argument for ufuncs like np.true_divide:
np.true_divide(A, B, where=(A!=0) | (B!=0))
In case you have no negative values (as stated in the comments) and A >= B for each element (as stated in the question) you can simplify this to:
np.true_divide(A, B, where=(A!=0))
because A[i, j] == 0 implies B[i, j] == 0.
For example:
import numpy as np
A = np.random.randint(0, 3, (4, 4))
B = np.random.randint(0, 3, (4, 4))
print(A)
print(B)
print(np.true_divide(A, B, where=(A!=0) | (B!=0)))
[[1 0 2 1]
[1 0 0 0]
[2 1 0 0]
[2 2 0 2]]
[[1 0 1 1]
[2 2 1 2]
[2 1 0 1]
[2 0 1 2]]
[[ 1. 0. 2. 1. ]
[ 0.5 0. 0. 0. ]
[ 1. 1. 0. 0. ]
[ 1. inf 0. 1. ]]
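One caution: per the NumPy ufunc docs, when where is used without an out argument, the entries where the condition is False are left uninitialized, so the zeros in the output above are not guaranteed. A safer sketch initializes the output explicitly:
# Entries where the condition is False keep the 0 from the out array
C = np.true_divide(A, B, out=np.zeros(A.shape, dtype=float),
                   where=(A != 0) | (B != 0))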
As an alternative, just replace NaNs after the division:
C = A / B  # may print warnings; suppress them with np.errstate if you want
C[np.isnan(C)] = 0
You could use a mask with np.where to choose between the case where A and B are both zero and the general case, outputting 0 or the elementwise division respectively -
from __future__ import division # For Python 2.x
mask = (A == B) & (A==0)
C = np.where(mask, 0, A/B)
About the mask creation: (A==B) would be the mask of all elements that are equal between A and B, and with (A==0) we have a mask of all elements that are zero in A. Thus, with the combined mask (A == B) & (A==0), we have a mask of the places where both A and B are zero. A simpler version of the same task, and maybe easier to understand, would be to check for zeros in both A and B:
mask = (A==0) & (B==0)
About the use of np.where, its syntax is:
C = np.where(mask, array1, array2)
i.e. we select elements to assign into C based on the mask. If the corresponding mask element is True, we pick the corresponding element from array1, else from array2. This is done at the element level, and thus we have the output C.
Sample run -
In [48]: A
Out[48]:
array([[4, 1, 4, 0, 3],
[0, 4, 1, 4, 3],
[1, 0, 0, 4, 0]])
In [49]: B
Out[49]:
array([[4, 2, 2, 1, 4],
[2, 1, 2, 4, 2],
[4, 0, 2, 0, 3]])
In [50]: mask = (A == B) & (A==0)
In [51]: np.where(mask, 0, A/B)
Out[51]:
array([[ 1. , 0.5 , 2. , 0. , 0.75],
[ 0. , 4. , 0.5 , 1. , 1.5 ],
[ 0.25, 0. , 0. , inf, 0. ]])
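Keep in mind that np.where evaluates A/B for every element before the mask is applied, so divisions by zero (like the inf above) still happen and may emit runtime warnings. A small sketch to silence them:
# Suppress divide-by-zero and 0/0 warnings while computing A/B
with np.errstate(divide='ignore', invalid='ignore'):
    C = np.where(mask, 0, A/B)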
I have a set of data (X, Y). My independent variable values X are not unique, so there are multiple repeated values. I want to output a new array containing:
X_unique, which is a list of the unique values of X.
Y_mean, the mean of all of the Y values corresponding to each X_unique.
Y_std, the standard deviation of all the Y values corresponding to each X_unique.
x = data[:,0]
y = data[:,1]
You can use binned_statistic from scipy.stats, which supports various statistic functions applied in chunks across a 1D array. To get the chunks, we need to sort the data and find the positions where the chunks change, for which np.unique is useful. Putting all of that together, here's an implementation -
from scipy.stats import binned_statistic as bstat
# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]
# Unique col-1 elements and positions of breaks (elements are not identical)
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])
# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
From the docs of binned_statistic, one can also use a custom statistic function :
function : a user-defined function which takes a 1D array of values,
and outputs a single numerical statistic. This function will be called
on the values in each bin. Empty bins will be represented by
function([]), or NaN if this returns an error.
Sample input, output -
In [121]: data
Out[121]:
array([[2, 5],
[2, 2],
[1, 5],
[3, 8],
[0, 8],
[6, 7],
[8, 1],
[2, 5],
[6, 8],
[1, 8]])
In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]:
array([[ 0. , 8. , 0. ],
[ 1. , 6.5 , 1.5 ],
[ 2. , 4. , 1.41421356],
[ 3. , 8. , 0. ],
[ 6. , 7.5 , 0.5 ],
[ 8. , 1. , 0. ]])
x_unique = np.unique(x)
# For each unique x, take the mean and standard deviation of the matching y values
y_means = np.array([np.mean(y[x == u]) for u in x_unique])
y_stds = np.array([np.std(y[x == u]) for u in x_unique])
Pandas is made for such tasks:
import numpy as np
import pandas

data = np.random.randint(1, 5, 20).reshape(10, 2)
pandas.DataFrame(data).groupby(0).mean()
gives
          1
0
1  2.666667
2  3.000000
3  2.000000
4  1.500000
4 1.500000
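To get the standard deviation in the same pass, one possible sketch uses agg (note that pandas' std defaults to ddof=1, the sample standard deviation, whereas np.std defaults to ddof=0):
# Mean and standard deviation of column 1, grouped by the values in column 0
stats = pandas.DataFrame(data).groupby(0)[1].agg(['mean', 'std'])
print(stats)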