What is the fastest method to delete elements from a numpy array while retrieving their initial positions? The following code does not return all the elements that it should:
list = []
for pos, i in enumerate(ARRAY):
    if i < some_condition:
        list.append(pos) #This is where the loop fails
for _ in list:
    ARRAY = np.delete(ARRAY, _)
It really feels like you're going about this inefficiently. You should probably be using more builtin numpy capabilities -- e.g. np.where, or boolean indexing. Using np.delete in a loop like that is going to kill any performance gains you get from using numpy...
For example (with boolean indexing):
keep = np.ones(ARRAY.shape, dtype=bool)
for pos, val in enumerate(ARRAY):
    if val < some_condition:
        keep[pos] = False
ARRAY = ARRAY[keep]
Of course, this could possibly be simplified (and generalized) even further:
ARRAY = ARRAY[ARRAY >= some_condition]
EDIT
You've stated in the comments that you need the same mask to operate on other arrays as well -- That's not a problem. You can keep a handle on the mask and use it for other arrays:
mask = ARRAY >= some_condition
ARRAY = ARRAY[mask]
OTHER_ARRAY = OTHER_ARRAY[mask]
...
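Since the question also asks for the initial positions of the removed elements, the same mask can supply them. A minimal sketch, assuming 1-D arrays; the data, the name threshold (standing in for some_condition), and removed_positions are made up for illustration:

```python
import numpy as np

# Hypothetical data and threshold, chosen only for illustration.
ARRAY = np.array([5, 1, 7, 2, 9])
OTHER_ARRAY = np.array([50, 10, 70, 20, 90])
threshold = 3

mask = ARRAY >= threshold                   # True for the elements to keep
removed_positions = np.flatnonzero(~mask)   # original indices of the dropped elements

ARRAY = ARRAY[mask]
OTHER_ARRAY = OTHER_ARRAY[mask]

print(removed_positions.tolist())  # [1, 3]
print(ARRAY.tolist())              # [5, 7, 9]
print(OTHER_ARRAY.tolist())        # [50, 70, 90]
```

The positions are taken from the mask before any array is filtered, so they always refer to the original layout.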
Additionally (and perhaps this is the reason your original code isn't working), as soon as you delete the first index from the array in your loop, all of the other items shift one index to the left, so you're not actually deleting the same items that you "tagged" on the initial pass.
As an example, let's say that your original array was [a, b, c, d, e] and on the original pass, you tagged the elements at indexes [0, 2] for deletion (a, c). On the first pass through your delete loop, you'd remove the item at index 0, which would make your array:
[b, c, d, e]
Now, on the second iteration of your delete loop, you're going to delete the item at index 2 in the new array, which leaves:
[b, c, e]
But look, instead of removing c like we wanted, we actually removed d! Oh snap!
To fix that, you could probably write your loop over reversed(list), but that still won't result in a fast operation.
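The shifting problem described above can be reproduced in a few lines. This is only a sketch using the letters from the example as strings; the variable names are made up for illustration:

```python
import numpy as np

arr = np.array(['a', 'b', 'c', 'd', 'e'])
tagged = [0, 2]  # positions of 'a' and 'c' in the original array

# Deleting one index at a time in ascending order acts on the shifted array.
wrong = arr.copy()
for pos in tagged:
    wrong = np.delete(wrong, pos)
print(wrong.tolist())  # ['b', 'c', 'e']  ('d' was removed instead of 'c')

# Deleting from the highest index down keeps the earlier positions valid.
slow_but_right = arr.copy()
for pos in reversed(tagged):
    slow_but_right = np.delete(slow_but_right, pos)
print(slow_but_right.tolist())  # ['b', 'd', 'e']

# np.delete also accepts all the indices in a single call.
one_call = np.delete(arr, tagged)
print(one_call.tolist())  # ['b', 'd', 'e']
```

Each np.delete call builds a new array, so the reversed loop is correct but copies the data once per deleted element; the single call copies it once.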
You don't need to iterate, especially with a simple condition like this. And you don't really need to use delete:
A sample array:
In [693]: x=np.arange(10)
A mask, a boolean array that is true where a condition holds (and false elsewhere):
In [694]: msk = x%2==0
In [695]: msk
Out[695]: array([ True, False, True, False, True, False, True, False, True, False], dtype=bool)
where (or nonzero) converts it to indexes:
In [696]: ind=np.where(msk)
In [697]: ind
Out[697]: (array([0, 2, 4, 6, 8], dtype=int32),)
You use the whole ind in one call to delete (no need to iterate):
In [698]: np.delete(x,ind)
Out[698]: array([1, 3, 5, 7, 9])
You can use ind to retain those values instead:
In [699]: x[ind]
Out[699]: array([0, 2, 4, 6, 8])
Or you can use the boolean msk directly:
In [700]: x[msk]
Out[700]: array([0, 2, 4, 6, 8])
or use its inverse:
In [701]: x[~msk]
Out[701]: array([1, 3, 5, 7, 9])
delete doesn't do much more than this kind of boolean masking. It's all Python code, so you can easily study it.
I've created vector x and I need to create a vector z by removing the 3rd and 6th elements of x. I cannot just create a vector by simply typing in the elements that should be in z. I have to index them or use a separate function.
x = [5,2,0,6,-10,12]
np.array(x)
print x
z = np.delete(x,)
I am not sure if using np.delete is best or if there is a better approach. Help?
You can index and concatenate pieces of the list, excluding the elements you want to "delete":
x = [5,2,0,6,-10,12]
print ( x[0:2]+x[3:5] )
[5, 2, 6, -10]
If x is a numpy array, first convert it to a list:
x = list(x)
Once x is a list:
z = [x.pop(2), x.pop(-1)]
This removes the 3rd and 6th elements from x and places them in z, so x is left holding the remaining elements. Convert it back to a numpy array if needed.
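For a plain list, a comprehension over enumerate builds z directly from the kept elements without mutating x. A minimal sketch; the name drop is made up for illustration:

```python
x = [5, 2, 0, 6, -10, 12]
drop = {2, 5}  # zero-based positions of the 3rd and 6th elements

# Keep every element whose position is not in `drop`; x itself is unchanged.
z = [value for pos, value in enumerate(x) if pos not in drop]

print(z)  # [5, 2, 6, -10]
print(x)  # [5, 2, 0, 6, -10, 12]
```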
In [69]: x = np.array([5,2,0,6,-10,12])
Using delete is straightforward:
In [70]: np.delete(x,[2,5])
Out[70]: array([ 5, 2, 6, -10])
delete is a general function that takes various approaches based on the delete object, but in a case like this it uses a boolean mask:
In [71]: mask = np.ones(x.shape, bool); mask[[2,5]] = False; mask
Out[71]: array([ True, True, False, True, True, False])
In [72]: x[mask]
Out[72]: array([ 5, 2, 6, -10])
What is the most efficient way to remove items from a list based on a function in python (and any common library)?
For example, if I have the following function:
def func(a):
    return a % 2 == 1
And the following list:
arr = [1,4,5,8,20,24]
Then I would want the result:
new_arr = [1,5]
I know I could simply iterate over the list like such:
new_arr = [i for i in arr if func(i)]
Just wondering if this is an efficient approach (for large datasets), or if there might be a better way. I was thinking maybe using np to map and changing the function to return a if True and -1 if false, and then using np remove to remove all 0s?
Edit:
I decided to test it myself with the suggestions you all gave (I probably just should have tested runtime myself rather than being lazy and asking other people).
filter was by far the fastest alone. Though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method.
The numpy [np.vectorize(func)(arr)] was slightly faster than the other list methods.
This is the use case for the builtin filter function:
filtered_arr = filter(func, arr)
Note that this returns an iterator, not a list. If you want a list, you can create one with list(filtered_arr) or a list comprehension as you noted. But if you just want to iterate over the filtered items and don't need random access / indexing, it's more memory efficient to use the iterator.
This is a good general approach for filtering lists that are not especially large and contain elements with arbitrary (and possibly mixed) types. If you are working with a large amount of numerical data, you should use one of the NumPy solutions mentioned in other answers.
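To make the iterator behaviour concrete, here is a small sketch (assuming Python 3, where filter is lazy) that contrasts it with the eager list comprehension; the variable names are illustrative only:

```python
def func(a):
    return a % 2 == 1

arr = [1, 4, 5, 8, 20, 24]

lazy = filter(func, arr)   # no element has been tested yet
first = next(lazy)         # func runs only as items are requested
rest = list(lazy)          # consumes what is left of the iterator

as_list = [i for i in arr if func(i)]  # eager: builds the whole list at once

print(first, rest)  # 1 [5]
print(as_list)      # [1, 5]
```

Because filter defers the work, timing the filter call on its own mostly measures creating the iterator, which is probably why it looks very fast until the result is converted to a list.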
Since the numpy tag is present, let's use it. In the example we use a mask for elements that give remainder 1 when divided by 2.
>>> import numpy as np
>>>
>>> arr = np.array([1,4,5,8,20,24])
>>> arr
array([ 1, 4, 5, 8, 20, 24])
>>> arr[arr % 2 == 1]
array([1, 5])
Using numpy you could use numpy.vectorize to map your function across your array elements, then use that to keep the elements that evaluate to True. So starting with your function
def func(a):
    return a % 2 == 1
We could test keeping only the odd values from the range of [0, 19]
>>> import numpy as np
>>> arr = np.arange(20)
>>> arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.vectorize(func)(arr)
array([False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True])
>>> arr[np.vectorize(func)(arr)]
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
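For this particular func, np.vectorize may not be needed at all: % and == already operate element-wise on arrays, so func can be called on the whole array directly. np.vectorize is documented as a convenience rather than a performance tool, since it essentially loops in Python. A minimal sketch comparing the two; the variable names are made up:

```python
import numpy as np

def func(a):
    return a % 2 == 1

arr = np.arange(20)

via_vectorize = arr[np.vectorize(func)(arr)]  # func is applied element by element
direct = arr[func(arr)]                       # one call on the whole array

print(direct.tolist())                        # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
print(np.array_equal(via_vectorize, direct))  # True
```

The direct call only works when the function body consists of operations that NumPy can broadcast; a function with Python-level branching on its argument would still need np.vectorize or a rewrite.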
You could also rewrite func to make it take a list as an argument:
def func(lst):
    lst = [x for x in lst if x % 2 == 1]
    return lst
and then do:
new_arr = func(arr)
This would save you some lines in comparison to making func take a single number as argument and writing an iteration such as [i for i in arr if func(i)] every time you want to use it on (elements of) a list.
How can I determine the indices of elements in a numpy array that start with a certain string (e.g. using startswith)?
Example
Array:
test1234
testworld
hello
mynewcar
test5678
Now I need the indices where the value starts with test. My desired outcome is:
[0,1,4]
You could use np.char.startswith to get the mask of matches and then np.flatnonzero to get the matching indices -
np.flatnonzero(np.char.startswith(a, 'test'))
Sample run -
In [61]: a = np.array(['test1234', 'testworld','hello','mynewcar','test5678'])
In [62]: np.char.startswith(a, 'test')
Out[62]: array([ True, True, False, False, True], dtype=bool)
In [63]: np.flatnonzero(np.char.startswith(a, 'test'))
Out[63]: array([0, 1, 4])
@Divakar's answer is the way to go, but just as an alternative, you can also use a list comprehension:
a = np.array(['test1234', 'testworld', 'hello', 'mynewcar', 'test5678'])
[i for i, si in enumerate(a) if si.startswith('test')]
will give
[0, 1, 4]
This list you could also convert back to a numpy array:
np.array([i for i, si in enumerate(a) if si.startswith('test')])
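Either approach can be wrapped in a small helper, and the resulting indices can then be used to select or drop the matching entries. A sketch only; indices_with_prefix is a hypothetical name, not a NumPy function:

```python
import numpy as np

def indices_with_prefix(values, prefix):
    """Return the positions of the strings in `values` that start with `prefix`."""
    mask = np.char.startswith(values, prefix)  # boolean array, one entry per string
    return np.flatnonzero(mask)

a = np.array(['test1234', 'testworld', 'hello', 'mynewcar', 'test5678'])
idx = indices_with_prefix(a, 'test')

print(idx.tolist())                # [0, 1, 4]
print(a[idx].tolist())             # ['test1234', 'testworld', 'test5678']
print(np.delete(a, idx).tolist())  # ['hello', 'mynewcar']
```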
I want to count the occurrences of a specific value (in my case -1) in a numpy array and delete them at the same time.
I managed to do it; here is what I've done:
a = np.array([1, 2, 0, -1, 3, -1, -1])
b = len(a[a==-1])
a = np.delete(a, np.where(a==-1))
print("a -> ", a) # a -> [1 2 0 3]
print("b -> ", b) # b -> 3
Is there any more optimised way to do it ?
Something like this ?
Using numpy like you did is probably more optimized though.
a = [x for x in a if x != -1]
First, a list in-place count and delete operation:
In [100]: al=a.tolist(); cnt=0
In [101]: for i in range(len(a)-1,-1,-1):
     ...:     if al[i]==-1:
     ...:         del al[i]
     ...:         cnt += 1
In [102]: al
Out[102]: [1, 2, 0, 3]
In [103]: cnt
Out[103]: 3
It operates in place, but has to work from the end. The list comprehension alternative makes a new list, but often is easier to write and read.
The cleanest array operation uses a boolean mask.
In [104]: idx = a==-1
In [105]: idx
Out[105]: array([False, False, False, True, False, True, True], dtype=bool)
In [106]: np.sum(idx) # or np.count_nonzero(idx)
Out[106]: 3
In [107]: a[~idx]
Out[107]: array([1, 2, 0, 3])
You have to identify, in one way or other, all elements that match the target. The count is a trivial operation. Masking is also easy.
np.delete has to be told which items to delete; and in one way or other constructs a new array that contains all but the 'deleted' ones. Because of its generality it will almost always be slower than a direct action like this masking.
np.where (aka np.nonzero) uses count_nonzero to determine how many values it will return.
So I'm proposing the same actions as you are doing, but in a little more direct way.
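The count-and-remove steps above can be bundled into one helper that computes the mask once and reuses it. A minimal sketch; count_and_remove is a hypothetical name:

```python
import numpy as np

def count_and_remove(arr, value):
    """Return (array without `value`, number of occurrences removed)."""
    mask = arr == value  # computed once, reused for both results
    return arr[~mask], int(np.count_nonzero(mask))

a = np.array([1, 2, 0, -1, 3, -1, -1])
cleaned, count = count_and_remove(a, -1)

print(cleaned.tolist())  # [1, 2, 0, 3]
print(count)             # 3
```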
I am trying to remove all rows that only contain zeros from a NumPy array. For example, I want to remove [0,0] from
n = np.array([[1,2], [0,0], [5,6]])
and be left with:
np.array([[1,2], [5,6]])
To remove the second row from a numpy table:
import numpy
n = numpy.array([[1,2],[0,0],[5,6]])
new_n = numpy.delete(n, 1, axis=0)
To remove rows containing only 0:
import numpy
n = numpy.array([[1,2],[0,0],[5,6]])
idxs = numpy.any(n != 0, axis=1) # index of rows with at least one non zero value
n_non_zero = n[idxs, :] # selection of the wanted rows
If you want to delete any row that only contains zeros, the fastest way I can think of is:
n = numpy.array([[1,2], [0,0], [5,6]])
keep_row = n.any(axis=1) # Index of rows with at least one non-zero value
n_non_zero = n[keep_row] # Rows to keep, only
This runs much faster than Simon's answer, because n.any() stops checking the values of each row as soon as it encounters any non-zero value (in Simon's answer, all the elements of each row are compared to zero first, which results in unnecessary computations).
Here is a generalization of the answer, if you ever need to remove rows that have a specific value (instead of removing only rows that only contain zeros):
n = numpy.array([[1,2], [0,0], [5,6]])
to_be_removed = [0, 0] # Can be any row values: [5, 6], etc.
other_rows = (n != to_be_removed).any(axis=1) # Rows that have at least one element that differs
n_other_rows = n[other_rows] # New array with rows equal to to_be_removed removed.
Note that this solution is not fully optimized: even if the first element of to_be_removed does not match, the remaining row elements from n are compared to those of to_be_removed (as in Simon's answer).
I'd be curious to know if there is a simple efficient NumPy solution to the more general problem of deleting rows with a specific value.
Using cython loops might be a fast solution: for each row, element comparison could be stopped as soon as one element from the row differs from the corresponding element in to_be_removed.
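The early-exit logic described above can be sketched in plain Python to show the control flow. This is an illustration only: written like this it will be slower than the NumPy expression, and any speed benefit would come from compiling such a loop (with Cython or a similar tool). The function name rows_differing_from is made up:

```python
import numpy as np

def rows_differing_from(n, to_be_removed):
    """Boolean mask of rows that differ from `to_be_removed`, with early exit per row."""
    keep = np.zeros(len(n), dtype=bool)
    for r, row in enumerate(n):
        for value, target in zip(row, to_be_removed):
            if value != target:  # first mismatch decides the row; stop comparing
                keep[r] = True
                break
    return keep

n = np.array([[1, 2], [0, 0], [5, 6]])
keep = rows_differing_from(n, [0, 0])

print(keep.tolist())     # [True, False, True]
print(n[keep].tolist())  # [[1, 2], [5, 6]]
```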
You can use numpy.delete to remove specific rows or columns.
For example:
n = [[1,2], [0,0], [5,6]]
np.delete(n, 1, axis=0)
The output will be:
array([[1, 2],
[5, 6]])
To delete rows according to their value, you can do it like this:
>>> n
array([[1, 2],
[0, 0],
[5, 6]])
>>> bl=n==[0,0]
>>> bl
array([[False, False],
[ True, True],
[False, False]], dtype=bool)
>>> bl=np.all(bl,axis=1)
>>> bl
array([False, True, False], dtype=bool)
>>> ind=np.nonzero(bl)[0]
>>> ind
array([1])
>>> np.delete(n,ind,axis=0)
array([[1, 2],
[5, 6]])
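One caveat when matching whole rows: the element-wise comparison has to be reduced with np.all, because np.any would also flag rows that match in only one column. A small sketch with a hypothetical extra row [0, 7] added to make the difference visible:

```python
import numpy as np

n = np.array([[1, 2], [0, 0], [5, 6], [0, 7]])  # [0, 7] is a hypothetical extra row
eq = n == [0, 0]

partial = np.any(eq, axis=1)  # flags rows with at least one matching column
exact = np.all(eq, axis=1)    # flags rows where every column matches

print(partial.tolist())  # [False, True, False, True]
print(exact.tolist())    # [False, True, False, False]
print(np.delete(n, np.nonzero(exact)[0], axis=0).tolist())  # [[1, 2], [5, 6], [0, 7]]
```

With the three-row array from the question both reductions give the same answer, which is why the difference is easy to miss.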