Delete subarray in numpy array if threshold is not met

Delete subarray in numpy array if threshold is not met - python

I have the following np array structure:
[[1, 2, 3 ,4]
[5, 7, 8 ,6]
.
.
[7, 5, 1 ,0]]
What is want to do is to remove a subarray if thresholds are not met.
for example in [5, 7, 8 ,6], i want to delete this array if position 0 is not between 2 and 4. I want to do this action for the whole numpy array and intend on having a threshold on all positions in the sub array.
My thought process is something that is shown below:
for arr in data:
if arr[0] < 2 or arr[0] > 4:
np.delete(data, arr)
However, printing data.shape before and after show no difference. Can someone help me?
Thanks!

Creating example data for testing:
>>> import numpy as np
>>> data = np.array([
... [1,2,3,4],
... [5,7,8,9],
... [7,5,1,0]
... ])
You can slice the array to get the first column:
>>> data[:, 0]
array([1, 5, 7])
Figure out which of these first-column values is in range by broadcasting the comparison operators across them (being careful that we can't chain these operators, and must combine them using a bitwise rather than logical AND, because of syntax limitations):
>>> first = data[:, 0]
>>> (4 <= first) & (first <= 6)
array([False, True, False])
Finally, we can use that to mask the original array:
>>> data[(4 <= first) & (first <= 6)]
array([[5, 7, 8, 9]])

Related

Delete array of values from numpy array

This post is an extension of this question.
I would like to delete multiple elements from a numpy array that have certain values. That is for
import numpy as np
a = np.array([1, 1, 2, 5, 6, 8, 8, 8, 9])
How do I delete one instance of each value of [1,5,8], such that the output is [1,2,6,8,8,9]. All I have found in the documentation for an array removal is the use of np.setdiff1d, but this removes all instances of each number. How can this be updated?

Using outer comparison and argmax to only remove once. For large arrays this will be memory intensive, since the created mask has a.shape * r.shape elements.
r = np.array([1, 5, 8])
m = (a == r[:, None]).argmax(1)
np.delete(a, m)
array([1, 2, 6, 8, 8, 9])
This does assume that each value in r appears in a at least once, otherwise the value at index 0 will get deleted since argmax will not find a match, and will return 0.

delNums = [np.where(a == x)[0][0] for x in [1,5,8]]
a = np.delete(a, delNums)
here, delNums contains the indexes of the values 1,5,8 and np.delete() will delete the values at those specified indexes
OUTPUT:
[1 2 6 8 8 9]

deleting rows based on value found in specififc column

I am attempting to write a code that searches a numpy array for cases where the value in the fifth column does not have 50. If it does not I wish to remove it.
This is what I have so far:
for rows in range(len(b)):
if b[:,4].any() != 50:
b = np.delete(b, b[rows])
However, I keep getting the following error:
too many indices for array

Lets run the calculation with some diagnositic prints. Note where the error occurs. That's important! (We shouldn't just keep trying things without isolating the problem!)
In [2]: b=np.array([[0,1,2],[1,2,3],[2,1,2]])
In [3]: for row in range(len(b)):
...: print(row)
...: if b[:,2].any() !=2:
...: print(b[row])
...: b = np.delete(b, b[row])
...:
0
[0 1 2]
1
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-3-04dc188d9a2b> in <module>()
1 for row in range(len(b)):
2 print(row)
----> 3 if b[:,2].any() !=2:
4 print(b[row])
5 b = np.delete(b, b[row])
IndexError: too many indices for array
So the error occurs on the 2nd iteration (row 1). Something is wrong with the b after the delete. What is the new value of b?
In [4]: b
Out[4]: array([1, 2, 3, 2, 1, 2])
b is a 1d array, not the 2d we started with. That explains the error, right? Something must be wrong with the use of delete. Maybe we need to check its documentation????
Look at the axis parameter:
axis : int, optional
The axis along which to delete the subarray defined by `obj`.
If `axis` is None, `obj` is applied to the flattened array.
We didn't specify an axis, so the delete was applied to the flattened array, and result was flattened - 1d.
But even if I specify an axis I get an error (I won't get into that), which prompts me to look more carefully at the if condition:
In [10]: b[:,2]
Out[10]: array([2, 3, 2])
In [11]: b[:,2].any()
Out[11]: True
In [12]: b[:,2]!=2
Out[12]: array([False, True, False])
Applying any to the column don't make sense - it just checks if any values in the column are not 0. Instead we want to test the column against the target, getting a boolean that matches the column in size.
We can use that boolean directly as row selection mask
In [13]: b[_,:]
Out[13]: array([[1, 2, 3]])
No need to iterate.
Another problem with your iteration. You iterate on the range(3), [0,1,2]. But inside the loop you try to remove a row from b, changing the size of b. That going to give problems when you try to index b[row] by number, right? When iterating, in Python or numpy, be careful about modifying the object that you are iterating over.
Sorry to be long winded about this, but it looks like you need some basic debugging guidance.
Here's a basic list approach:
In [15]: [row for row in b if row[2]!=2]
Out[15]: [array([1, 2, 3])]
I'm iterating on the rows, not their indices, and for each row checking the column value, and keeping that row if the check is True. We could do that with np.delete, but a list comprehension is clearer (and faster).

It would be better to provide b and desired output, but if i understand it correctly, you could use:
import numpy as np
b = np.array([[50, 2, 3, 4, 5, 6],
[4, 50, 6, 7, 8, 9],
[1, 1, 1, 1, 50, 9]])
array([[50, 2, 3, 4, 5, 6],
[ 4, 50, 6, 7, 8, 9],
[ 1, 1, 1, 1, 50, 9]])
Then you can check which rows contain 50 in the 5th column using
b[:, 4] == 50
array([False, False, True])
and feed this Boolean array back to b to select the desired columns:
b[b[:, 4] == 50]
which leaves you with one row in this case
array([[ 1, 1, 1, 1, 50, 9]])

find indeces of grouped-item matches between two arrays

a = np.array([5,8,3,4,2,5,7,8,1,9,1,3,4,7])
b = np.array ([3,4,7,8,1,3])
I have two lists of integers that each is grouped by every 2 consecutive items (ie indices [0, 1], [2, 3] and so on).
The pairs of items cannot be found as duplicates in either list, neither in the same or the reverse order.
One list is significantly larger and inclusive of the other.
I am trying to figure out an efficient way to get the indices
of the larger list's grouped items that are also in the smaller one.
The desired output in the example above should be:
[2,3,6,7,10,11] #indices
Notice that, as an example, the first group ([3,4]) should not get indices 11,12 as a match because in that case 3 is the second element of [1,3] and 4 the first element of [4,7].

Since you are grouping your arrays by pairs, you can reshape them into 2 columns for comparison. You can then compare each of the elements in the shorter array to the longer array, and reduce the boolean arrays. From there it is a simple matter to get the indices using a reshaped np.arange.
import numpy as np
from functools import reduce
a = np.array([5,8,3,4,2,5,7,8,1,9,1,3,4,7])
b = np.array ([3,4,7,8,1,3])
# reshape a and b into columns
a2 = a.reshape((-1,2))
b2 = b.reshape((-1,2))
# create a generator of bools for the row of a2 that holds b2
b_in_a_generator = (np.all(a2==row, axis=1) for row in b2)
# reduce the generator to get an array of boolean that is True for each row
# of a2 that equals one of the rows of b2
ix_bool = reduce(lambda x,y: x+y, b_in_a_generator)
# grab the indices by slicing a reshaped np.arange array
ix = np.arange(len(a)).reshape((-1,2))[ix_bool]
ix
# returns:
array([[ 2, 3],
[ 6, 7],
[10, 11]])
If you want a flat array, simply ravel ix
ix.ravel()
# returns
array([ 2, 3, 6, 7, 10, 11])

Here's one approach making use of NumPy view of group of elements -
# Taken from https://stackoverflow.com/a/45313353/
def view1D(a, b): # a, b are arrays
a = np.ascontiguousarray(a)
void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
return a.view(void_dt).ravel(), b.view(void_dt).ravel()
def grouped_indices(a, b):
a0v, b0v = view1D(a.reshape(-1,2), b.reshape(-1,2))
sidx = a0v.argsort()
idx = sidx[np.searchsorted(a0v,b0v, sorter=sidx)]
return ((idx*2)[:,None] + [0,1]).ravel()
If there isn't a membership between any group from b in a, we could filter that out using a mask : a0v[idx] == b0v.
Sample run -
In [345]: a
Out[345]: array([5, 8, 3, 4, 2, 5, 7, 8, 1, 9, 1, 3, 4, 7])
In [346]: b
Out[346]: array([3, 4, 7, 8, 1, 3])
In [347]: grouped_indices(a, b)
Out[347]: array([ 2, 3, 6, 7, 10, 11])
Another one using np.in1d to replace np.searchsorted -
def grouped_indices_v2(a, b):
a0v, b0v = view1D(a.reshape(-1,2), b.reshape(-1,2))
return (np.flatnonzero(np.in1d(a0v, b0v))[:,None]*2 + [0,1]).ravel()

Efficently multiply a matrix with itself after offsetting it by one in numpy

I am trying to write a function that takes a matrix A, then offsets it by one, and does element wise matrix multiplication on the shared area. Perhaps an example will help. Suppose I have the matrix:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
What i'd like returned is:
(1*2) + (4*5) + (7*8) = 78
The following code does it, but inefficently:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
Height = A.shape[0]
Width = A.shape[1]
Sum1 = 0
for y in range(0, Height):
for x in range(0,Width-2):
Sum1 = Sum1 + \
A.item(y,x)*A.item(y,x+1)
print("%d * %d"%( A.item(y,x),A.item(y,x+1)))
print(Sum1)
With output:
1 * 2
4 * 5
7 * 8
78
Here is my attempt to write the code more efficently with numpy:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(np.sum(np.multiply(A[:,0:-1], A[:,1:])))
Unfortunately, this time I get 186. I am at a loss where did I go wrong. i'd love someone to either correcty me or offer another way to implement this.
Thank you.

In this 3 column case, you are just multiplying the 1st 2 columns, and taking the sum:
A[:,:2].prod(1).sum()
Out[36]: 78
Same as (A[:,0]*A[:,1]).sum()
Now just how does that generalize to more columns?
In your original loop, you can cut out the row iteration by taking the sum of this list:
[A[:,x]*A[:,x+1] for x in range(0,A.shape[1]-2)]
Out[40]: [array([ 2, 20, 56])]
Your description talks about multiplying the shared area; what direction are you doing the offset? From the calculation it looks like the offset is negative.
A[:,:-1]
Out[47]:
array([[1, 2],
[4, 5],
[7, 8]])
If that is the offset logic, than I could rewrite my calculation as
A[:,:-1].prod(1).sum()
which should work for many more columns.
===================
Your 2nd try:
In [3]: [A[:,:-1],A[:,1:]]
Out[3]:
[array([[1, 2],
[4, 5],
[7, 8]]),
array([[2, 3],
[5, 6],
[8, 9]])]
In [6]: A[:,:-1]*A[:,1:]
Out[6]:
array([[ 2, 6],
[20, 30],
[56, 72]])
In [7]: _.sum()
Out[7]: 186
In other words instead of 1*2, you are calculating [1,2]*[2*3]=[2,6]. Nothing wrong with that, if that's you you really intend. The key is being clear about 'offset' and 'overlap'.

Way of easily finding the average of every nth element over a window of size k in a pandas.Series? (not the rolling mean)

The motivation here is to take a time series and get the average activity throughout a sub-period (day, week).
It is possible to reshape an array and take the mean over the y axis to achieve this, similar to this answer (but using axis=2):
Averaging over every n elements of a numpy array
but I'm looking for something which can handle arrays of length N%k != 0 and does not solve the issue by reshaping and padding with ones or zeros (e.g numpy.resize), i.e takes the average over the existing data only.
E.g Start with a sequence [2,2,3,2,2,3,2,2,3,6] of length N=10 which is not divisible by k=3. What I want is to take the average over columns of a reshaped array with mis-matched dimensions:
In: [[2,2,3],
[2,2,3],
[2,2,3],
[6]], k =3
Out: [3,2,3]
Instead of:
In: [[2,2,3],
[2,2,3],
[2,2,3],
[6,0,0]], k =3
Out: [3,1.5,2.25]
Thank you.

You can use a masked array to pad with special values that are ignored when finding the mean, instead of summing.
k = 3
# how long the array needs to be to be divisible by 3
padded_len = (len(in_arr) + (k - 1)) // k * k
# create a np.ma.MaskedArray with padded entries masked
padded = np.ma.empty(padded_len)
padded[:len(in_arr)] = in_arr
padded[len(in_arr):] = np.ma.masked
# now we can treat it an array divisible by k:
mean = padded.reshape((-1, k)).mean(axis=0)
# if you need to remove the masked-ness
assert not np.ma.is_masked(mean), "in_arr was too short to calculate all means"
mean = mean.data

You can easily do it by padding, reshaping and calculating by how many elements to divide each row:
>>> import numpy as np
>>> a = np.array([2,2,3,2,2,3,2,2,3,6])
>>> k = 3
Pad data
>>> b = np.pad(a, (0, k - a.size%k), mode='constant').reshape(-1, k)
>>> b
array([[2, 2, 3],
[2, 2, 3],
[2, 2, 3],
[6, 0, 0]])
Then create a mask:
>>> c = a.size // k # 3
>>> d = (np.arange(k) + c * k) < a.size # [True, False, False]
The first part of d will create an array that contains [9, 10, 11], and compare it to the size of a (10), generating the mentioned boolean mask.
And divide it:
>>> b.sum(0) / (c + 1.0 * d)
array([ 3., 2., 3.])
The above will divide the first column by 4 (c + 1 * True) and the rest by 3. This is vectorized numpy, thus, it scales very well to large arrays.
Everything can be written shorter, I just show all the steps to make it more clear.

Flatten the list In by unpacking and chaining. Create a new list that arranges the flattened list lst by columns, then use the map function to calculate the average of each column:
from itertools import chain
In = [[2, 2, 3], [2, 2, 3], [2, 2, 3], [6]]
lst = chain(*In)
k = 3
In_by_cols = [lst[i::k] for i in range(k)]
# [[2, 2, 2, 6], [2, 2, 2], [3, 3, 3]]
Out = map(lambda x: sum(x)/ float(len(x)), In_by_cols)
# [3.0, 2.0, 3.0]
Using float on the length of each sublist will provide a more accurate result on python 2.x as it won't do integer truncation.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Delete subarray in numpy array if threshold is not met - python

Related

Delete array of values from numpy array

deleting rows based on value found in specififc column

find indeces of grouped-item matches between two arrays

Efficently multiply a matrix with itself after offsetting it by one in numpy

Way of easily finding the average of every nth element over a window of size k in a pandas.Series? (not the rolling mean)

Categories

Resources