Set duplicate elements as zeros - python

How can I convert the duplicate elements in an array 'data' into 0? It has to be done row-wise.
data = np.array([[1,8,3,3,4],
[1,8,9,9,4]])
The answer should be as follows:
ans = array([[1,8,3,0,4],
[1,8,9,0,4]])

Approach #1
One approach with np.unique -
# Find out the unique elements and their starting positions
unq_data, idx = np.unique(data,return_index=True)
# Find out the positions for each unique element, their duplicate positions
dup_idx = np.setdiff1d(np.arange(data.size),idx)
# Set those duplicate positioned elements to 0s
data[dup_idx] = 0
Sample run -
In [46]: data
Out[46]: array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])
In [47]: unq_data, idx = np.unique(data,return_index=True)
...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
...: data[dup_idx] = 0
...:
In [48]: data
Out[48]: array([1, 8, 3, 0, 4, 0, 0, 0, 9, 0])
Approach #2
You can also use sorting and differentiation as a faster approach -
# Get indices for sorted data
sort_idx = np.argsort(data)
# Get duplicate indices and set those in data to 0s
dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
data[dup_idx] = 0
Runtime tests -
In [110]: data = np.random.randint(0,100,(10000))
...: data1 = data.copy()
...: data2 = data.copy()
...:
In [111]: def func1(data):
...:     unq_data, idx = np.unique(data,return_index=True)
...:     dup_idx = np.setdiff1d(np.arange(data.size),idx)
...:     data[dup_idx] = 0
...:
...: def func2(data):
...:     sort_idx = np.argsort(data)
...:     dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
...:     data[dup_idx] = 0
...:
In [112]: %timeit func1(data1)
1000 loops, best of 3: 1.36 ms per loop
In [113]: %timeit func2(data2)
1000 loops, best of 3: 467 µs per loop
Extending to a 2D case :
Approach #2 could be extended to work for a 2D array case, avoiding any loop like so -
# Get indices for sorted data
sort_idx = np.argsort(data,axis=1)
# Get sorted linear indices
row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
sort_lin_idx = sort_idx[:,1::] + row_offset
# Get duplicate linear indices and set those in data as 0s
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
data.ravel()[dup_lin_idx] = 0
Sample run -
In [6]: data
Out[6]:
array([[1, 8, 3, 3, 4, 0, 3, 3],
[1, 8, 9, 9, 4, 8, 7, 9],
[1, 8, 9, 9, 4, 8, 7, 3]])
In [7]: sort_idx = np.argsort(data,axis=1)
...: row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
...: sort_lin_idx = sort_idx[:,1::] + row_offset
...: dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
...: data.ravel()[dup_lin_idx] = 0
...:
In [8]: data
Out[8]:
array([[1, 8, 3, 0, 4, 0, 0, 0],
[1, 8, 9, 0, 4, 0, 7, 0],
[1, 8, 9, 0, 4, 0, 7, 3]])

Here's a simple pure-Python way to do it:
seen = set()
for i, x in enumerate(data):
    if x in seen:
        data[i] = 0
    else:
        seen.add(x)
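The loop above assumes a 1D array. For the row-wise 2D case in the question, one option is to run the same logic on each row with a fresh set. A minimal sketch (the function name `zero_dups_loop` is an illustrative choice, not part of the answer):

```python
import numpy as np

def zero_dups_loop(data):
    """Return a copy of a 2D array with repeated values in each row set to 0."""
    out = data.copy()
    for row in out:              # each row is a view, so writes land in `out`
        seen = set()             # values already met in this row
        for i, x in enumerate(row):
            if x in seen:
                row[i] = 0       # later occurrence: zero it
            else:
                seen.add(x)      # first occurrence: keep it
    return out

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])
result = zero_dups_loop(data)
print(result)
```

This keeps the first occurrence in each row and leaves the input untouched, at the cost of a Python-level loop over every element.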

You could use a nested for loop that compares each element of the array with every later element and zeroes the later one when they match. I am not very familiar with NumPy, so treat this as a sketch for a 1D array:
for x in range(len(data)):
    for y in range(x + 1, len(data)):
        if data[x] == data[y]:
            data[y] = 0

@Divakar has it almost right, but a few things can be optimized further, and they don't really fit in a comment. To begin:
rows, cols = data.shape
The first operation is to sort the array to identify the duplicates. Since we will want to undo the sorting, we need to use np.argsort, but if you want to make sure that it is the first occurrence of each repeated value that is kept, you need to use a stable sorting algorithm:
sort_idx = data.argsort(axis=1, kind='mergesort')
Once we have the indices to sort data, to get a sorted copy of the array it is faster to use the indices than to re-sort the array:
sorted_data = data[np.arange(rows)[:, None], sort_idx]
While the principle is similar to that in using np.diff, it is typically faster to use boolean operations. We want an array full of False where the first occurrences of each value happen, and True where the duplicates are:
sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
sorted_data[:, :-1] == sorted_data[:, 1:]),
axis=1)
We now use that mask to set all the duplicates to zero:
sorted_data[sorted_mask] = 0
And we finally undo the sorting. To revert a permutation you can sort the indices that define it, i.e. you could do:
invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
ans = sorted_data[np.arange(rows)[:, None], invert_idx]
But it is more efficient to use assignment, i.e.:
ans = np.empty_like(data)
ans[np.arange(rows)[:, None], sort_idx] = sorted_data
Putting it all together:
def zero_dups(data):
    rows, cols = data.shape
    sort_idx = data.argsort(axis=1, kind='mergesort')
    sorted_data = data[np.arange(rows)[:, None], sort_idx]
    sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                                  sorted_data[:, :-1] == sorted_data[:, 1:]),
                                 axis=1)
    sorted_data[sorted_mask] = 0
    ans = np.empty_like(data)
    ans[np.arange(rows)[:, None], sort_idx] = sorted_data
    return ans
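As a quick check, the function can be run on the array from the question. The snippet below repeats the definition so that it runs on its own; the `row_idx` helper variable is introduced here only for readability.

```python
import numpy as np

def zero_dups(data):
    rows, cols = data.shape
    row_idx = np.arange(rows)[:, None]
    # Stable sort so the first occurrence of each value stays first
    sort_idx = data.argsort(axis=1, kind='mergesort')
    sorted_data = data[row_idx, sort_idx]
    # True wherever a sorted value equals its left neighbour (a duplicate)
    sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                                  sorted_data[:, :-1] == sorted_data[:, 1:]),
                                 axis=1)
    sorted_data[sorted_mask] = 0
    # Scatter the values back to their original positions
    ans = np.empty_like(data)
    ans[row_idx, sort_idx] = sorted_data
    return ans

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])
result = zero_dups(data)
print(result)
```

The result matches the expected answer from the question, with the second 3 and the second 9 replaced by 0.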

Related

How can i add values to an array with numpy?

In this code I'm trying to compute a sum of values and add it to an array named array_values, but it doesn't work; it only prints [].
array_values = ([])
value = 0.0
for a in range(0, 8):
    for b in range (1, 5):
        value = value + float(klines[a][b])
        #print(value)
    np.append(array_values, value)#FIX array_values.append(value)
    print("adding: ",value)
    value = 0.0
print(array_values)
Does this solve your problem?
import numpy as np
array_values = ([])
value = 0.0
for a in range(0, 8):
    for b in range (1, 5):
        value = value + float(klines[a][b])
        #print(value)
    array_values = np.append(array_values, value)
    print("adding: ",value)
    value = 0.0
print(array_values)
np.append returns a new ndarray rather than modifying its argument. The documentation describes the return value as:
"A copy of arr with values appended to axis. Note that append does not occur in-place: a new array is allocated and filled. If axis is None, out is a flattened array."
See the Returns section for details:
https://numpy.org/doc/stable/reference/generated/numpy.append.html
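A small check makes the difference visible. This is only an illustration with made-up values: calling np.append without using its result leaves the original object unchanged.

```python
import numpy as np

values = np.array([])

# The return value is discarded, so `values` is still empty afterwards
np.append(values, 1.5)
size_after_discard = values.size

# Rebinding the name to the returned array is what makes the value "stick"
values = np.append(values, 1.5)
print(size_after_discard, values)
```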
Assuming klines is a 2d numeric dtype array:
In [231]: klines = np.arange(1,13).reshape(4,3)
In [232]: klines
Out[232]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
we can simply sum across rows with:
In [233]: klines.sum(axis=1)
Out[233]: array([ 6, 15, 24, 33])
the equivalent using your style of iteration:
In [234]: alist = []
...: value = 0
...: for i in range(4):
...:     for j in range(3):
...:         value += klines[i,j]
...:     alist.append(value)
...:     value = 0
...:
In [235]: alist
Out[235]: [6, 15, 24, 33]
Use of np.append is slower and harder to get right.
Even if klines is a list of lists, the sums can be easily done with:
In [236]: [sum(row) for row in klines]
Out[236]: [6, 15, 24, 33]
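Applied to the loop bounds in the question (rows 0 to 7, columns 1 to 4, values converted with float), the same idea might look like the sketch below. The contents of `klines` are not shown in the question, so the data here is a hypothetical stand-in made of numeric strings.

```python
import numpy as np

# Hypothetical stand-in for klines: 8 rows of 6 numeric strings each
klines = [[str(10 * r + c) for c in range(6)] for r in range(8)]

# Convert only the entries the original loop reads: rows 0..7, columns 1..4
values = np.array([[float(v) for v in row[1:5]] for row in klines[:8]])

# One sum per row, with no np.append calls
array_values = values.sum(axis=1)
print(array_values)
```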

Fast nonzero indices per row/column for (sparse) 2D numpy array

I am looking for the fastest way to obtain a list of the nonzero indices of a 2D array per row and per column. The following is a working piece of code:
preds = [matrix[:,v].nonzero()[0] for v in range(matrix.shape[1])]
descs = [matrix[v].nonzero()[0] for v in range(matrix.shape[0])]
Example input:
matrix = np.array([[0,0,0,0],[1,0,0,0],[1,1,0,0],[1,1,1,0]])
Example output
preds = [array([1, 2, 3]), array([2, 3]), array([3]), array([], dtype=int64)]
descs = [array([], dtype=int64), array([0]), array([0, 1]), array([0, 1, 2])]
(The lists are called preds and descs because they refer to the predecessors and descendants in a DAG when the matrix is interpreted as an adjacency matrix but this is not essential to the question.)
Timing example:
For timing purposes, the following matrix is a good representative:
test_matrix = np.zeros(shape=(4096,4096),dtype=np.float32)
for k in range(16):
    test_matrix[256*(k+1):256*(k+2),256*k:256*(k+1)]=1
Background: In my code, these two lines take 75% of the time for a 4000x4000 matrix, whereas the ensuing topological sort and DP algorithm take only the remaining quarter. Roughly 5% of the values in the matrix are nonzero, so a sparse-matrix solution may be applicable.
Thank you.
(On suggestion posted here as well: https://scicomp.stackexchange.com/questions/35242/fast-nonzero-indices-per-row-column-for-sparse-2d-numpy-array
There are also answers there to which I will provide timings in the comments. This link contains an accepted answer that is twice as fast.)
If you have enough motivation, Numba can do amazing things.
Here is a quick implementation of the logic you need.
Briefly, it computes the equivalent of np.nonzero() but it includes along the way the information to later dispatch the indices into the format you require.
The information is inspired by sparse.csr.indptr and sparse.csc.indptr.
import numpy as np
import numba as nb
@nb.jit
def cumsum(arr):
    result = np.empty_like(arr)
    cumsum = result[0] = arr[0]
    for i in range(1, len(arr)):
        cumsum += arr[i]
        result[i] = cumsum
    return result
@nb.jit
def count_nonzero(arr):
    arr = arr.ravel()
    n = 0
    for x in arr:
        if x != 0:
            n += 1
    return n
@nb.jit
def row_col_nonzero_nb(arr):
    n, m = arr.shape
    max_k = count_nonzero(arr)
    indices = np.empty((2, max_k), dtype=np.uint32)
    i_offset = np.zeros(n + 1, dtype=np.uint32)
    j_offset = np.zeros(m + 1, dtype=np.uint32)
    n, m = arr.shape
    k = 0
    for i in range(n):
        for j in range(m):
            if arr[i, j] != 0:
                indices[:, k] = i, j
                i_offset[i + 1] += 1
                j_offset[j + 1] += 1
                k += 1
    return indices, cumsum(i_offset), cumsum(j_offset)
def row_col_idx_nonzero_nb(arr):
    (ii, jj), jj_split, ii_split = row_col_nonzero_nb(arr)
    ii_ = np.argsort(jj)
    ii = ii[ii_]
    return np.split(ii, ii_split[1:-1]), np.split(jj, jj_split[1:-1])
Compared to your approach (row_col_idx_sep() below) and a few others, as per @hpaulj's answer (row_col_idx_sparse_lil()) and @knl's answer from scicomp.stackexchange.com (row_col_idx_sparse_coo()):
def row_col_idx_sep(arr):
    return (
        [arr[:, j].nonzero()[0] for j in range(arr.shape[1])],
        [arr[i, :].nonzero()[0] for i in range(arr.shape[0])],)
def row_col_idx_zip(arr):
    n, m = arr.shape
    ii = [[] for _ in range(n)]
    jj = [[] for _ in range(m)]
    x, y = np.nonzero(arr)
    for i, j in zip(x, y):
        ii[i].append(j)
        jj[j].append(i)
    return jj, ii
import scipy as sp
import scipy.sparse
def row_col_idx_sparse_coo(arr):
    coo_mat = sp.sparse.coo_matrix(arr)
    csr_mat = coo_mat.tocsr()
    csc_mat = coo_mat.tocsc()
    return (
        np.split(csc_mat.indices, csc_mat.indptr)[1:-1],
        np.split(csr_mat.indices, csr_mat.indptr)[1:-1],)
def row_col_idx_sparse_lil(arr):
    lil_mat = sp.sparse.lil_matrix(arr)
    return lil_mat.T.rows, lil_mat.rows
For inputs generated using:
def gen_input(n, density=0.1, dtype=np.float32):
    arr = np.zeros(shape=(n, n), dtype=dtype)
    indices = tuple(np.random.randint(0, n, (2, int(n * n * density))).tolist())
    arr[indices] = 1.0
    return arr
One would get (your test_matrix had approximately 0.06 non-zero density):
m = gen_input(4096, density=0.06)
%timeit row_col_idx_sep(m)
# 1 loop, best of 3: 767 ms per loop
%timeit row_col_idx_zip(m)
# 1 loop, best of 3: 660 ms per loop
%timeit row_col_idx_sparse_coo(m)
# 1 loop, best of 3: 205 ms per loop
%timeit row_col_idx_sparse_lil(m)
# 1 loop, best of 3: 498 ms per loop
%timeit row_col_idx_nonzero_nb(m)
# 10 loops, best of 3: 130 ms per loop
This indicates that the Numba version is roughly 1.6 times as fast as the fastest scipy.sparse-based approach (130 ms versus 205 ms).
In [182]: arr = np.array([[0,0,0,0],[1,0,0,0],[1,1,0,0],[1,1,1,0]])
The data is present in the whole-array nonzero, just not broken up into per row/column arrays:
In [183]: np.nonzero(arr)
Out[183]: (array([1, 2, 2, 3, 3, 3]), array([0, 0, 1, 0, 1, 2]))
In [184]: np.argwhere(arr)
Out[184]:
array([[1, 0],
[2, 0],
[2, 1],
[3, 0],
[3, 1],
[3, 2]])
It might be possible to break the array([1, 2, 2, 3, 3, 3]) into sublists, [1,2,3],[2,3],[3],[] based on the other array. But it may take some time to work out the logic for that, and there's no guarantee that it will be faster than your row/column iterations.
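One way that splitting logic could be worked out is sketched below, using np.bincount to count the entries per row or column and np.split to cut the index arrays at the cumulative counts. The function name `nonzero_per_row_col` is an illustrative choice, and no timing claim is made for it.

```python
import numpy as np

def nonzero_per_row_col(matrix):
    """Return (preds, descs): row indices per column and column indices per row."""
    n_rows, n_cols = matrix.shape
    rows, cols = np.nonzero(matrix)        # row-major order, so `rows` is sorted

    # Column indices grouped by row: cut `cols` where each row's run ends
    row_counts = np.bincount(rows, minlength=n_rows)
    descs = np.split(cols, np.cumsum(row_counts)[:-1])

    # Row indices grouped by column: a stable sort by column keeps rows ascending
    order = np.argsort(cols, kind='stable')
    col_counts = np.bincount(cols, minlength=n_cols)
    preds = np.split(rows[order], np.cumsum(col_counts)[:-1])
    return preds, descs

arr = np.array([[0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]])
preds, descs = nonzero_per_row_col(arr)
print(preds)
print(descs)
```

For the example matrix this reproduces the preds and descs lists from the question, including the empty arrays for the all-zero column and row.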
Logical operations can reduce the boolean array to column or row, giving the rows or columns where nonzero occurs, but again not ragged:
In [185]: arr!=0
Out[185]:
array([[False, False, False, False],
[ True, False, False, False],
[ True, True, False, False],
[ True, True, True, False]])
In [186]: (arr!=0).any(axis=0)
Out[186]: array([ True, True, True, False])
In [187]: np.nonzero((arr!=0).any(axis=0))
Out[187]: (array([0, 1, 2]),)
In [188]: np.nonzero((arr!=0).any(axis=1))
Out[188]: (array([1, 2, 3]),)
In [189]: arr
Out[189]:
array([[0, 0, 0, 0],
[1, 0, 0, 0],
[1, 1, 0, 0],
[1, 1, 1, 0]])
The scipy.sparse lil format does generate the data you want:
In [190]: sparse
Out[190]: <module 'scipy.sparse' from '/usr/local/lib/python3.6/dist-packages/scipy/sparse/__init__.py'>
In [191]: M = sparse.lil_matrix(arr)
In [192]: M
Out[192]:
<4x4 sparse matrix of type '<class 'numpy.longlong'>'
with 6 stored elements in List of Lists format>
In [193]: M.rows
Out[193]: array([list([]), list([0]), list([0, 1]), list([0, 1, 2])], dtype=object)
In [194]: M.T
Out[194]:
<4x4 sparse matrix of type '<class 'numpy.longlong'>'
with 6 stored elements in List of Lists format>
In [195]: M.T.rows
Out[195]: array([list([1, 2, 3]), list([2, 3]), list([3]), list([])], dtype=object)
But timing probably isn't any better than your row or column iteration.

Python - Convert the array in a tuple to just a normal array

I have a signal where I want to find the average height of the values. This is done by finding the zero crossings and calculating the max and min between each zero crossing, then averaging these values.
My problem occurs when I want to use np.where() to find where the signal is crossing zero. When I use np.where() I get the result in a tuple, but I want it in an array where I can count the number of times zero is crossed.
I am new to Python and coming from Matlab it is a bit confusing with all the different classes. As you can see, I get an error because nu = len(zero_u) gives 1 as a result, because the whole array is written in a tuple as one element.
Any ideas how to go around this?
The code looks like this:
import numpy as np
def averageheight(f):
    rms = np.std(f)
    f = f + (rms * 10**-6)
    # Find zero crossing
    fsign = np.sign(f)
    fdiff = np.diff(fsign)
    zero_u = np.asarray(np.where(fdiff > 0)) + 1
    zero_d = np.asarray(np.where(fdiff < 0)) + 1
    nu = len(zero_u)
    nd = len(zero_d)
    value_max = np.zeros((nu, 1))
    value_min = np.zeros((nu, 1))
    imaxvec = np.zeros((nu, 1))
    iminvec = np.zeros((nu, 1))
    if (nu > 2) and (nd > 2):
        if zero_u[0] > zero_d[0]:
            zero_d[0] = []
        nu = len(zero_u)
        nd = len(zero_d)
        ncross = np.fmin(nu, nd)
        # Find Maxima:
        for ic in range(0, ncross - 1):
            up = int(zero_u[ic])
            down = int(zero_d[ic])
            fvec = f[up:down]
            value_max[ic] = np.amax(fvec)
            index_max = value_max.argmax()
            imaxvec[ic] = up + index_max - 1
        # Find Minima:
        for ic in range(0, ncross - 2):
            down = int(zero_d[ic])
            up = int(zero_u[ic+1])
            fvec = f[down:up]
            value_min[ic] = np.amin(fvec)
            index_min = value_min.argmin()
            iminvec[ic] = down + index_min - 1
        # Remove spurious values, bumps and zero_d
        thr = rms/3
        maxfind = np.where(value_max < thr)
        for i in range(0, len(maxfind)):
            imaxfind = np.where(value_max == maxfind[i])
            imaxvec[imaxfind] = 0
            value_max[imaxfind] = 0
        minfind = np.where(value_min > -thr)
        for j in range(0, len(minfind)):
            iminfind = np.where(value_min == minfind[j])
            value_min[iminfind] = 0
            iminvec[iminfind] = 0
        # Find Average Height
        avh = np.mean(value_max) - np.mean(value_min)
    else:
        avh = 0
    return avh
The documentation for np.where, and even more so for np.nonzero, clearly explains that it returns a tuple, with one array for each dimension of the condition array:
In [71]: arr = np.random.randint(-5,5,10)
In [72]: arr
Out[72]: array([ 3, 4, 2, -3, -1, 0, -5, 4, 2, -3])
In [73]: arr.shape
Out[73]: (10,)
In [74]: np.where(arr>=0)
Out[74]: (array([0, 1, 2, 5, 7, 8]),)
In [75]: arr[_]
Out[75]: array([3, 4, 2, 0, 4, 2])
That Out[74] tuple can be used directly as an index.
You can also extract the array from the tuple:
In [76]: np.where(arr>=0)[0]
Out[76]: array([0, 1, 2, 5, 7, 8])
That, I think, is a better choice than np.asarray(np.where(...)).
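Applied to the zero-crossing step from the question, the difference shows up in len(). The signal below is made up for illustration:

```python
import numpy as np

# Made-up signal with one upward and two downward zero crossings
f = np.array([1.0, 2.0, -1.0, -2.0, 3.0, 1.0, -0.5])
fdiff = np.diff(np.sign(f))

# Wrapping the tuple gives a 2D array with a single row, so len() is always 1
wrapped = np.asarray(np.where(fdiff < 0)) + 1

# Taking element [0] of the tuple gives the 1D index array itself
zero_u = np.where(fdiff > 0)[0] + 1
zero_d = np.where(fdiff < 0)[0] + 1

print(wrapped.shape, len(wrapped))
print(zero_u, zero_d, len(zero_u), len(zero_d))
```

With the 1D arrays, len(zero_u) and len(zero_d) count the crossings directly.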
This convention for where becomes clearer when we use it on a 2d array
In [77]: arr2 = arr.reshape(2,5)
In [78]: np.where(arr2>=0)
Out[78]: (array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 2, 3]))
In [79]: arr2[_]
Out[79]: array([3, 4, 2, 0, 4, 2])
Again we are indexing with a tuple. arr2[1,3] is really arr2[(1,3)]. The values in [] indexing brackets are actually passed to the indexing function as a tuple of values.
np.argwhere applies transpose to the result of where, producing an array:
In [80]: np.transpose(np.where(arr2>=0))
Out[80]:
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 2],
[1, 3]])
Those are the same index arrays, but arranged as a 2D array with one column per dimension.
If you need the count of where without the actual values, a slightly faster function is
In [81]: np.count_nonzero(arr>=0)
Out[81]: 6
In fact np.nonzero uses the count to first determine the size of the arrays that it will return.

Removing max and min elements of array from mean calculation

I want to delete the highest and the lowest number from each row of a 3x4 array. Let's say the data looks like this:
a=np.array([[1,4,5,10],[2,6,5,0],[3,9,9,0]])
so I expected to see the result like this:
deleted_data=[4,5],[2,5],[3]
Could you advise me how to delete the max and min from each array?
To do so, I tried the following (UPDATE):
#to find out the max / min values:
b = np.max(a,1) #max
c = np.min(a,1) #min
#creating dataset after deleting max & min
d=(a!=b[:,None]) & (a!=c[:,None])
f=[i[j] for i,j in zip(a, d)]
output: [array([8, 7, 7, 9, 9, 8]), array([8, 7, 8, 6, 8, 8]), array([9, 8, 9, 9, 8]), array([6, 7, 7, 6, 6, 7]), array([7, 7, 7, 7, 6])]
Now I am not sure how to calculate the mean of each array in the list.
I would like to calculate the mean of each array, so I have tried this:
mean1=f.mean(axis=0)
but it did not work.
Another method is to use a Masked Array
import numpy.ma as ma
mask = np.logical_or(a == a.max(1, keepdims = 1), a == a.min(1, keepdims = 1))
a_masked = ma.masked_array(a, mask = mask)
from there if you want an average of the unmasked elements you can just do
a_masked.mean()
Or you could even do the mean of the rows
a_masked.mean(1).data
or columns (strange, but seems to be what you're asking for)
a_masked.mean(0).data
A python list has a remove method.
With a utility function we could remove the min and max elements from a row:
def foo(i,j,k):
    il = i.tolist()
    il.remove(j)
    il.remove(k)
    return il
In [230]: [foo(i,j,k) for i,j,k in zip(a,b,c)]
Out[230]: [[4, 5], [2, 5], [3, 9]]
This could be turned back into an array with np.array(...). Note that this removed just one of the 9 in the last row. If it had removed both, the last list would have just 1 value, and the result could not be turned back into a 2d array.
I'm sure we could come up with a pure-array method, possibly using argmax and argmin instead of max and min. But I think the list approach is a better starting point for a Python beginner.
An array masking approach
In [232]: bi = np.argmax(a,1)
In [233]: ci = np.argmin(a,1)
In [234]: bi
Out[234]: array([3, 1, 1], dtype=int32)
In [235]: ci
Out[235]: array([0, 3, 3], dtype=int32)
In [243]: mask = np.ones_like(a, bool)
In [244]: mask[np.arange(3),bi]=False
In [245]: mask[np.arange(3),ci]=False
In [246]: mask
Out[246]:
array([[False, True, True, False],
[ True, False, True, False],
[ True, False, True, False]], dtype=bool)
In [247]: a[mask]
Out[247]: array([4, 5, 2, 5, 3, 9])
In [248]: _.reshape(3,-1)
Out[248]:
array([[4, 5],
[2, 5],
[3, 9]])
Again this is better if we just delete one max and one min from each row.
Another masking approach:
In [257]: (a!=b[:,None]) & (a!=c[:,None])
Out[257]:
array([[False, True, True, False],
[ True, False, True, False],
[ True, False, False, False]], dtype=bool)
In [258]: a[(a!=b[:,None]) & (a!=c[:,None])]
Out[258]: array([4, 5, 2, 5, 3])
This does remove all '9's in the last row. But it does not preserve the row split.
This preserves the row structure, and allows variable lengths:
In [259]: mask=(a!=b[:,None]) & (a!=c[:,None])
In [260]: [i[j] for i,j in zip(a, mask)]
Out[260]: [array([4, 5]), array([2, 5]), array([3])]
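Since the result is a plain Python list of arrays, it has no .mean method of its own. One way to get the per-row means asked about in the question is to take the mean of each array in turn; a minimal sketch, with `kept` and `means` as illustrative names:

```python
import numpy as np

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])
b = a.max(axis=1)
c = a.min(axis=1)

# Keep everything in each row that is neither the row max nor the row min
mask = (a != b[:, None]) & (a != c[:, None])
kept = [row[m] for row, m in zip(a, mask)]

# A list has no .mean(), so average each ragged row separately
means = [float(r.mean()) for r in kept]
print(means)
```

A row where every value equals the max or the min would leave an empty array, and its mean would be nan, so such rows may need special handling.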
As @hpaulj predicted, there is an array-only method. And it's a doozy. As a one-liner:
a[np.arange(a.shape[0])[:, None], np.sort(np.argpartition(a, (0,-1), axis = 1)[:, 1:-1], axis = 1)]
Let's break that down:
y_ = np.argpartition(a, (0,-1), axis = 1)[:, 1:-1]
argpartition takes the indices of the 0th (smallest) and -1th (largest) elements of each row and moves them to the first and last positions respectively. [:,1:-1] selects everything else. Now argpartition can sometimes reorder the rest of the elements, so
y = np.sort(y_ , axis = 1)
We sort the rest of the indices back to their original positions. Now we have a y.shape -> (m, n-2) array of indices with the max and min removed, for your original (m, n) = a.shape array.
Now to use this, we need the row indices as well.
x = np.arange(a.shape[0])[:, None]
arange just gives the m row indices. To broadcast this x.shape -> (a.shape[0],) -> (m,) array to your index array, you need the [:, None] to make x.shape -> (m, 1). Now the m lines up for broadcasting and you have your two sets of indices.
a[x, y]
array([[4, 5],
[2, 5],
[3, 9]])
You could get to the final destination of average of elements that are not the max or min per row in two steps with masking -
In [140]: a # input array
Out[140]:
array([[ 1, 4, 5, 10],
[ 2, 6, 5, 0],
[ 3, 9, 9, 0]])
In [141]: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
In [142]: (a*m).sum(1)/m.sum(1).astype(float)
Out[142]: array([ 4.5, 3.5, 3. ])
This avoids the mess of creating the intermediate ragged arrays, which aren't the most convenient data format to work with in NumPy functions.
Alternatively, for performance boost, use np.einsum to get the equivalent of (a*m).sum(1) with np.einsum('ij,ij->i',a,m).
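A small check on the sample array suggests the two expressions agree. This is only an illustration of the equivalence and says nothing about speed:

```python
import numpy as np

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])
m = (a != a.min(1, keepdims=True)) & (a != a.max(1, keepdims=True))

# Row-wise sum of the kept elements, computed two ways
via_sum = (a * m).sum(1)
via_einsum = np.einsum('ij,ij->i', a, m)

# Divide by the number of kept elements per row to get the means
means = via_einsum / m.sum(1).astype(float)
print(via_sum, via_einsum, means)
```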
Runtime test on bigger array -
In [181]: np.random.seed(0)
In [182]: a = np.random.randint(0,10,(5000,5000))
# @Daniel F's solution from https://stackoverflow.com/a/47325431/
In [183]: %%timeit
...: mask = np.logical_or(a == a.max(1, keepdims = 1), a == a.min(1, keepdims = 1))
...: a_masked = ma.masked_array(a, mask = mask)
...: out = a_masked.mean(1).data
1 loop, best of 3: 251 ms per loop
# Posted in here
In [184]: %%timeit
...: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
...: out = (a*m).sum(1)/m.sum(1).astype(float)
10 loops, best of 3: 165 ms per loop
# Posted in here with additional einsum
In [185]: %%timeit
...: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
...: out = np.einsum('ij,ij->i',a,m)/m.sum(1).astype(float)
10 loops, best of 3: 124 ms per loop
If the question is to remove min and/or max elements from a numpy array arr then this is the easiest way in my opinion.
np.delete(arr, np.argmax(arr))
example
tmp = np.random.random(3)
print(tmp)
tmp = np.delete(tmp, np.argmax(tmp))
print(tmp)
returns
[0.7366768 0.65492774 0.93632866]
[0.7366768 0.65492774]
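To drop both the minimum and the maximum of a 1D array in one call, np.delete also accepts a list of indices. A short sketch with made-up values:

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5])

# Remove the first minimum and the first maximum in a single np.delete call
trimmed = np.delete(arr, [np.argmin(arr), np.argmax(arr)])
print(trimmed)
```

Only one occurrence of each extreme is removed (here the second 1 stays), and a constant array would lose just one element because argmin and argmax then point at the same index.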

Summing and removing repeated elements of Numpy Arrays

I have 4 1D Numpy arrays of equal length.
The first three act as an ID, uniquely identifying the 4th array.
The ID arrays contain repeated combinations, for which I need to sum the 4th array, and remove the repeating element from all 4 arrays.
x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])
In this case I need:
x = [1, 2, 4]
y = [1, 1, 4]
z = [1, 2, 2]
data = [6, 7, 3]
The arrays are rather long so loops really won't work. I'm sure there is a fairly simple way to do this, but for the life of me I can't figure it out.
To get started, we can stack the ID vectors into a matrix such that each ID is a row of three values:
XYZ = np.vstack((x,y,z)).T
Now, we just need to find the indices of repeated rows. Unfortunately, np.unique doesn't operate on rows, so we need to do some tricks:
order = np.lexsort(XYZ.T)
diff = np.diff(XYZ[order], axis=0)
uniq_mask = np.append(True, (diff != 0).any(axis=1))
This part is borrowed from the np.unique source code, and finds the unique indices as well as the "inverse index" mapping:
uniq_inds = order[uniq_mask]
inv_idx = np.zeros_like(order)
inv_idx[order] = np.cumsum(uniq_mask) - 1
Finally, sum over the unique indices:
data = np.bincount(inv_idx, weights=data)
x,y,z = XYZ[uniq_inds].T
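In newer NumPy releases (1.13 and later), np.unique accepts an axis argument, which makes the row-wise grouping shorter. The sketch below is an alternative to the lexsort steps above, not part of the original answer; the `uniq_rows`, `inv` and `sums` names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])

# One row per ID triple
XYZ = np.column_stack((x, y, z))

# Unique rows plus, for every original row, the index of its unique row
uniq_rows, inv = np.unique(XYZ, axis=0, return_inverse=True)
inv = inv.ravel()   # guard against shape differences between NumPy versions

# Sum the data values that share the same ID triple
sums = np.bincount(inv, weights=data)
x_out, y_out, z_out = uniq_rows.T
print(x_out, y_out, z_out, sums)
```

The unique rows come back sorted by x, then y, then z, which may differ from the ordering produced by the lexsort version, although the group sums are the same.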
You can use unique and sum as reptilicus suggested to do the following
import numpy as np
x = np.array([1, 2, 4, 1])
y = np.array([1, 1, 4, 1])
z = np.array([1, 2, 2, 1])
data = np.array([4, 7, 3, 2])
# N = len(x)
# ids = x + y*N + z*(N**2)
ids = np.array([hash((a, b, c)) for a, b, c in zip(x, y, z)]) # creates flat ids
_, idx, idx_rep = np.unique(ids, return_index=True, return_inverse=True)
x_out = x[idx]
y_out = y[idx]
z_out = z[idx]
# data_out = np.array([np.sum(data[idx_rep == i]) for i in idx])
data_out = np.bincount(idx_rep, weights=data)
print(x_out)
print(y_out)
print(z_out)
print(data_out)
