Removing items from a list based on a function - python

What is the most efficient way to remove items from a list based on a function in Python (and any common library)?
For example, if I have the following function:
def func(a):
    return a % 2 == 1
And the following list:
arr = [1,4,5,8,20,24]
Then I would want the result:
new_arr = [1,5]
I know I could simply iterate over the list like such:
new_arr = [i for i in arr if func(i)]
Just wondering if this is an efficient approach (for large datasets), or if there might be a better way. I was thinking maybe using np to map the function over the array, changing it to return a if True and -1 if False, and then using np to remove all the -1s?
Edit:
I decided to test it myself with the suggestions you all gave (I probably just should have tested runtime myself rather than being lazy and asking other people).
filter was by far the fastest alone. Though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method.
The numpy [np.vectorize(func)(arr)] was slightly faster than the other list methods.

This is the use case for the builtin filter function:
filtered_arr = filter(func, arr)
Note that this returns an iterator, not a list. If you want a list, you can create one with list(filtered_arr) or a list comprehension as you noted. But if you just want to iterate over the filtered items and don't need random access / indexing, it's more memory efficient to use the iterator.
This is a good general approach for filtering lists that are not especially large and contain elements with arbitrary (and possibly mixed) types. If you are working with a large amount of numerical data, you should use one of the NumPy solutions mentioned in other answers.
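A minimal sketch of the two usages, with func and arr as in the question:

```python
def func(a):
    return a % 2 == 1

arr = [1, 4, 5, 8, 20, 24]

it = filter(func, arr)   # lazy iterator; nothing is computed yet
print(next(it))          # 1 -- items are produced one at a time

as_list = list(filter(func, arr))  # materialize when you need indexing
print(as_list)           # [1, 5]
```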

Since the numpy tag is present, let's use it. In the example we use a mask to select the elements that give remainder 1 when divided by 2.
>>> import numpy as np
>>>
>>> arr = np.array([1,4,5,8,20,24])
>>> arr
array([ 1, 4, 5, 8, 20, 24])
>>> arr[arr % 2 == 1]
array([1, 5])

Using numpy you could use numpy.vectorize to map your function across your array elements, then use that to keep the elements that evaluate to True. So starting with your function
def func(a):
    return a % 2 == 1
We could test keeping only the odd values from the range of [0, 19]
>>> import numpy as np
>>> arr = np.arange(20)
>>> arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.vectorize(func)(arr)
array([False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True])
>>> arr[np.vectorize(func)(arr)]
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])

You could also rewrite func to make it take a list as an argument:
def func(lst):
    lst = [x for x in lst if x % 2 == 1]
    return lst
and then do:
new_arr = func(arr)
This would save you some lines in comparison to making func take a single number as argument and writing an iteration such as [i for i in arr if func(i)] every time you want to use it on (elements of) a list.

Answering my own question with the results from the edit above: filter was by far the fastest alone, though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method, and np.vectorize(func)(arr) was slightly faster than the other list methods. Just wanted to post this in case anyone else runs across this in the future.

Related

Pythonic and fast way to create an array of values `[1, .., n]` that contain none of `[i_1, ..., i_r]`

What is a fast and pythonic way to create a list from [1, ..., n] that contains none of the numbers [i_1, ..., i_r]. For example, running this function on [1, 2, 3, 4] and [2,3] should return [1, 4].
I am currently using a for loop to test "if i is in [i_1, ..., i_r], then exclude it from the output array, else include it".
Is there a better and more pythonic way?
As a pythonic way you can go with:
n = 5
l = [k for k in range(1, n + 1) if k not in [2, 3]]
Not sure about speed though.
1. Use a more efficient data type
One simple way to accomplish this task efficiently is to convert the exclusion list into a Python set, which makes lookup quite a lot faster (at least, if the exclusion list is of significant size):
def range_excluding(limit, exclude):
    exclude = set(exclude)
    return (i for i in range(1, limit) if i not in exclude)
Note that the above returns a generator, not a list; you can turn it into a list by either calling list explicitly (list(range_excluding(n, [1, 7, 12]))) or by unpacking the generator ([*range_excluding(n, [1, 7, 12])]). The advantage of using the generator, though, is that it avoids excessive memory use if the range is very large and the results don't need to be stored.
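For instance (repeating the definition above so the snippet runs on its own):

```python
def range_excluding(limit, exclude):
    exclude = set(exclude)
    return (i for i in range(1, limit) if i not in exclude)

gen = range_excluding(10, [1, 7])
print(next(gen))    # 2 -- values are produced lazily, one at a time
print(list(gen))    # [3, 4, 5, 6, 8, 9] -- the rest of the range
```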
Another way to write this is to use itertools.filterfalse to create a baseline excluding function:
from itertools import filterfalse

def excluding(iterable, container):
    return filterfalse(container.__contains__, iterable)
This version depends on the caller to create the range iterable and to use an efficient datatype for exclusion lookup (which could be a set, a frozenset, or a dictionary, among other possibilities). I think that's better interface design because it gives the caller more flexibility; they can use any range (or other iterable), rather than insisting on a range starting at 1, and they don't incur the overhead of converting a lookup table (say, a dictionary) which is already adequate for the purpose. Of course, nothing stops you from defining a convenience function which uses excluding:
>>> print([*excluding(range(1, 20), {1, 7, 12})])
[2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19]
>>> def list_excluding(limit, exclusions):
... return [*excluding(range(1, limit), frozenset(exclusions))]
...
>>> list_excluding(20, [12, 1, 7])
[2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19]
2. Alternative: sort the exclusions and generate a sequence of ranges
If you know that the list to be filtered will always be a range, you could piece together a possibly more efficient solution [Note 1] based on sorting the exclusions, resulting in a sequence of subranges and then using itertools' convenient chain.from_iterable to combine the sequence into a single iterable. (I also used a number of other handy itertools functions, including pairwise which was added in 3.10; see the docs) [Note 2]:
from itertools import chain, pairwise, starmap

def range_excluding(start, stop, exclusions=None):
    '''Returns a generator over range(start, stop) which excludes
    the values in exclusions.

    If only two arguments are provided, the first is the end of
    the range, and the second is the list of exclusions.
    '''
    if exclusions is None:
        start, stop, exclusions = 0, start, stop
    return chain.from_iterable(
        starmap(lambda lo, hi: range(lo + 1, hi),
                pairwise(chain((start - 1,),
                               sorted(exclusions),
                               (stop,)))))
Notes
1. On the basis of some very rough microbenchmarks using Python v3.11, it seems like this solution is significantly faster than the first solution if the range is fairly large compared to the number of exclusions. For smaller ranges, the first solution wins out.
2. Although that function uses a lot of Python features, I'm not sure if there would be any consensus about it being "pythonic" :-)
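If pairwise is unavailable (it was added in Python 3.10), an equivalent sketch can be built with zip over an explicit boundary list (range_excluding_compat is an illustrative name, not from the thread):

```python
from itertools import chain

def range_excluding_compat(start, stop, exclusions):
    # boundaries: a sentinel below start, the sorted exclusions, then stop
    bounds = [start - 1] + sorted(exclusions) + [stop]
    return chain.from_iterable(
        range(lo + 1, hi) for lo, hi in zip(bounds, bounds[1:]))

print(list(range_excluding_compat(1, 20, [12, 1, 7])))
# [2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19]
```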
A pythonic and fast way to do it:
l = [k for k in range(1,n+1) if k < i_1 or k > i_r]
Note that this is O(n), while @Adrien Mau's answer is O(n²).
A numpy approach.
Step by step
import numpy as np
arr = np.arange( 1, 5 )
exclude = np.arange( 2,4 )
mask = np.equal.outer( arr, exclude )
# mask is a 2D array. true where arr (as rows) equals exclude ( as columns ).
mask
# array([[False, False],
# [ True, False],
# [False, True],
# [False, False]])
~np.logical_or.reduce( mask, axis = 1) # `not logical_or` across the columns
# array([ True, False, False, True])
arr[ ~np.logical_or.reduce( mask, axis = 1) ]
# array([1, 4])
As a function
def do_exclude( arr, exclude ):
    mask = np.equal.outer( arr, exclude )
    return arr[ ~np.logical_or.reduce( mask, axis = 1) ]
do_exclude( arr, exclude )
# array([1, 4])
arr = np.arange( 20 )
exclude = np.array( [ 1, 7, 12 ] )
do_exclude( arr, exclude )
# array([ 0, 2, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19])
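For comparison, np.isin computes the same membership mask in a single call; note this is a different technique than the outer/reduce combination above:

```python
import numpy as np

arr = np.arange(20)
exclude = np.array([1, 7, 12])

# True where an element of arr appears in exclude; invert to keep the rest
result = arr[~np.isin(arr, exclude)]
print(result)
# [ 0  2  3  4  5  6  8  9 10 11 13 14 15 16 17 18 19]
```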

Fancy Indexing vs Views in Numpy part II

Fancy Indexing vs Views in Numpy
In an answer to this question it is explained that different idioms will produce different results.
Using the idiom where fancy indexing is used to choose the values, and said values are set to a new value in the same line, means that the values in the original object will be changed in place.
However the final example below:
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html
"A final exercise"
The example appears to use the same idiom:
a[x, :][:, y] = 100
but it still produces a different result depending on whether x is a slice or a fancy index (see below):
a = np.arange(12).reshape(3,4)
ifancy = [0,2]
islice = slice(0,3,2)
a[islice, :][:, ifancy] = 100
a
#array([[100, 1, 100, 3],
# [ 4, 5, 6, 7],
# [100, 9, 100, 11]])
a = np.arange(12).reshape(3,4)
ifancy = [0,2]
islice = slice(0,3,2)
a[ifancy, :][:, islice] = 100 # note that ifancy and islice are interchanged here
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
My intuition is that if the first set of fancy indexes is a slice it treats the object like a view and therefore the values in the original object are changed.
Whereas in the second case the first set of fancy indexes is itself a fancy index so it treats the object as a fancy index creating a copy of the original object. This then means that the original object is not changed when the values of the copy object are changed.
Is my intuition correct?
The example hints that one should think of the sequence of getitem and setitem; can someone explain it to me properly in this way?
Python evaluates each set of [] separately. a[x, :][:, y] = 100 is 2 operations.
temp = a[x,:] # getitem step
temp[:,y] = 100 # setitem step
Whether the 2nd line ends up modifying a depends on whether temp is a view or copy.
Remember, numpy is an addon to Python. It does not modify basic Python syntax or interpretation.
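A quick way to check which getitem step yields a view and which yields a copy, sketched with np.shares_memory:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

temp_slice = a[slice(0, 3, 2), :]  # basic slicing -> view of a
temp_fancy = a[[0, 2], :]          # fancy indexing -> copy of a

print(np.shares_memory(a, temp_slice))  # True: writing to it modifies a
print(np.shares_memory(a, temp_fancy))  # False: writes never reach a
```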

Find whether an array is subset of another array hashtable way (Python)

I want to find whether an array is a subset of another array, and one of the methods I can think of is using a hashtable, but I want to implement it in Python. Attached in the thread is the C++ implementation. I'm not looking for built-in functions here like set etc.
Python only has the concept of a dictionary in terms of hashtables, but I'm not sure how to proceed from here. Any suggestions would help me solve it.
Below are couple of lists:
arr1 = [11, 1, 13, 21, 3, 7]
arr2 = [11, 3, 7, 1]
Method (C++, using hashing)
1) Create a Hash Table for all the elements of arr1[].
2) Traverse arr2[] and search for each element of arr2[] in the Hash Table. If element is not found then return 0.
3) If all elements are found then return 1.
Lists can be millions of numbers as well, so a scalable and efficient solution is expected.
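A dict can play the role of the hash table in the three steps above; a minimal sketch (is_subset is an illustrative name, not from the thread):

```python
def is_subset(arr1, arr2):
    table = {}
    for x in arr1:        # step 1: hash every element of arr1
        table[x] = True
    for y in arr2:        # step 2: probe the table for each element of arr2
        if y not in table:
            return False
    return True           # step 3: all elements were found

print(is_subset([11, 1, 13, 21, 3, 7], [11, 3, 7, 1]))  # True
print(is_subset([11, 1, 13, 21, 3, 7], [11, 3, 7, 2]))  # False
```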
In Python, you would use set objects for this:
>>> arr1 = [11, 1, 13, 21, 3, 7]
>>> arr2 = [11, 3, 7, 1]
>>> set(arr1).issuperset(arr2)
True
Or more efficiently, use:
>>> set(arr2).issubset(arr1)
True
If you expect arr2 to be much smaller...
Some quick timings; it seems that they are about the same in runtime, although creating a set from arr1 will require much more auxiliary memory:
>>> import numpy as np
>>> arr1 = np.random.randint(0, 100, (1000000,)).tolist()
>>> len(arr1)
1000000
>>> from timeit import timeit
>>> arr2 = [11, 3, 7, 1]
>>> timeit('set(arr1).issuperset(arr2)', 'from __main__ import arr1, arr2', number=1000)
14.337173405918293
>>> timeit('set(arr2).issubset(arr1)', 'from __main__ import arr1, arr2', number=1000)
14.459818648989312
I think you want set
e.g.
set(arr2).issubset(arr1)
Try this:
i = 0
allIn = True
while i < len(arr2) and allIn:
    if arr2[i] not in arr1:
        allIn = False
    i += 1
allIn will say whether the second list is in the first.
Note: The other solution using set() works equally well.
EDIT (In response to comments):
I'm not using a for loop as I don't know how to stop the loop from running once allIn is False (I don't know whether using break would work so I'm staying on the safe side).
I'm not using set() as the OP explicitly stated that they don't want to use in-built functions. I have posted my answer as an alternate solution to those answers already provided (but have also commended those as I believe they are better).
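For reference, break does stop a for loop early, so the same check can be sketched without indexing (arr1 and arr2 as in the question):

```python
arr1 = [11, 1, 13, 21, 3, 7]
arr2 = [11, 3, 7, 1]

allIn = True
for x in arr2:
    if x not in arr1:
        allIn = False
        break  # exits the loop as soon as a missing element is found

print(allIn)  # True
```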

Deleting elements from numpy array with iteration

What is the fastest method to delete elements from a numpy array while retrieving their initial positions? The following code does not return all elements that it should:
list = []
for pos, i in enumerate(ARRAY):
    if i < some_condition:
        list.append(pos)  # This is where the loop fails
for _ in list:
    ARRAY = np.delete(ARRAY, _)
It really feels like you're going about this inefficiently. You should probably be using more builtin numpy capabilities -- e.g. np.where, or boolean indexing. Using np.delete in a loop like that is going to kill any performance gains you get from using numpy...
For example (with boolean indexing):
keep = np.ones(ARRAY.shape, dtype=bool)
for pos, val in enumerate(ARRAY):
    if val < some_condition:
        keep[pos] = False
ARRAY = ARRAY[keep]
Of course, this could possibly be simplified (and generalized) even further:
ARRAY = ARRAY[ARRAY >= some_condition]
EDIT
You've stated in the comments that you need the same mask to operate on other arrays as well -- That's not a problem. You can keep a handle on the mask and use it for other arrays:
mask = ARRAY >= some_condition
ARRAY = ARRAY[mask]
OTHER_ARRAY = OTHER_ARRAY[mask]
...
Additionally (and perhaps this is the reason your original code isn't working), as soon as you delete the first index from the array in your loop, all of the other items shift one index to the left, so you're not actually deleting the same items that you "tagged" on the initial pass.
As an example, let's say that your original array was [a, b, c, d, e] and on the original pass, you tagged elements at indexes [0, 2] for deletion (a, c)... On the first pass through your delete loop, you'd remove the item at index 0, which would make your array:
[b, c, d, e]
Now on the second iteration of your delete loop, you're going to delete the item at index 2 in the new array:
[b, c, e]
But look, instead of removing c like we wanted, we actually removed d! Oh snap!
To fix that, you could probably write your loop over reversed(list), but that still won't result in a fast operation.
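A short demonstration of the shift described above, plus the single-call np.delete that sidesteps it (letters stand in for the values):

```python
import numpy as np

arr = np.array(['a', 'b', 'c', 'd', 'e'])

# deleting tagged indexes [0, 2] one at a time: later items shift left
wrong = arr
for i in [0, 2]:
    wrong = np.delete(wrong, i)
print(wrong)  # ['b' 'c' 'e'] -- 'd' was removed instead of 'c'

# passing all the indexes in one call avoids the shift entirely
right = np.delete(arr, [0, 2])
print(right)  # ['b' 'd' 'e']
```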
You don't need to iterate, especially with a simple condition like this. And you don't really need to use delete:
A sample array:
In [693]: x=np.arange(10)
A mask, a boolean array where a condition is true (or false):
In [694]: msk = x%2==0
In [695]: msk
Out[695]: array([ True, False, True, False, True, False, True, False, True, False], dtype=bool)
where (or nonzero) converts it to indexes
In [696]: ind=np.where(msk)
In [697]: ind
Out[697]: (array([0, 2, 4, 6, 8], dtype=int32),)
You use the whole ind in one call to delete (no need to iterate):
In [698]: np.delete(x,ind)
Out[698]: array([1, 3, 5, 7, 9])
You can use ind to retain those values instead:
In [699]: x[ind]
Out[699]: array([0, 2, 4, 6, 8])
Or you can use the boolean msk directly:
In [700]: x[msk]
Out[700]: array([0, 2, 4, 6, 8])
or use its inverse:
In [701]: x[~msk]
Out[701]: array([1, 3, 5, 7, 9])
delete doesn't do much more than this kind of boolean masking. It's all Python code, so you can easily study it.

Python: Elegant and efficient ways to mask a list

Example:
from __future__ import division
import numpy as np
n = 8
"""masking lists"""
lst = range(n)
print lst
# the mask (filter)
msk = [(el>3) and (el<=6) for el in lst]
print msk
# use of the mask
print [lst[i] for i in xrange(len(lst)) if msk[i]]
"""masking arrays"""
ary = np.arange(n)
print ary
# the mask (filter)
msk = (ary>3)&(ary<=6)
print msk
# use of the mask
print ary[msk] # very elegant
and the results are:
>>>
[0, 1, 2, 3, 4, 5, 6, 7]
[False, False, False, False, True, True, True, False]
[4, 5, 6]
[0 1 2 3 4 5 6 7]
[False False False False True True True False]
[4 5 6]
As you see, the operation of masking on an array is more elegant compared to a list. If you try to use the array masking scheme on a list you'll get an error:
>>> lst[msk]
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
TypeError: only integer arrays with one element can be converted to an index
The question is to find an elegant masking for lists.
Updates:
The answer by jamylak was accepted for introducing compress however the points mentioned by Joel Cornett made the solution complete to a desired form of my interest.
>>> mlist = MaskableList
>>> mlist(lst)[msk]
[4, 5, 6]
If you are using numpy:
>>> import numpy as np
>>> a = np.arange(8)
>>> mask = np.array([False, False, False, False, True, True, True, False], dtype=bool)
>>> a[mask]
array([4, 5, 6])
If you are not using numpy you are looking for itertools.compress
>>> from itertools import compress
>>> a = range(8)
>>> mask = [False, False, False, False, True, True, True, False]
>>> list(compress(a, mask))
[4, 5, 6]
If you are using Numpy, you can do it easily using Numpy array without installing any other library:
>>> import numpy as np
>>> a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> msk = [True, False, False, True, True, True, True, False, False, False]
>>> a = np.array(a)  # convert list to numpy array
>>> result = a[msk]  # mask a
>>> result.tolist()
[0, 3, 4, 5, 6]
Since jamylak already answered the question with a practical answer, here is my example of a list with builtin masking support (totally unnecessary, btw):
from itertools import compress

class MaskableList(list):
    def __getitem__(self, index):
        try: return super(MaskableList, self).__getitem__(index)
        except TypeError: return MaskableList(compress(self, index))
Usage:
>>> myList = MaskableList(range(10))
>>> myList
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mask = [0, 1, 1, 0]
>>> myList[mask]
[1, 2]
Note that compress stops when either the data or the mask runs out. If you wish to keep the portion of the list that extends past the length of the mask, you could try something like:
from itertools import izip_longest
[i[0] for i in izip_longest(myList, mask[:len(myList)], fillvalue=True) if i[1]]
I don't consider it elegant. It's compact, but tends to be confusing, as the construct is very different from most languages.
As van Rossum has said about language design, we spend more time reading code than writing it. The more obscure the construction of a line of code, the more confusing it becomes to others, who may lack familiarity with Python, even though they have full competency in any number of other languages.
Readability trumps short form notations everyday in the real world of servicing code. Just like fixing your car. Big drawings with lots of information make troubleshooting a lot easier.
For me, I would much rather troubleshoot someone's code that uses the long form
print [lst[i] for i in xrange(len(lst)) if msk[i]]
than the numpy short notation mask. I don't need to have any special knowledge of a specific Python package to interpret it.
The following works perfectly well in Python 3:
np.array(lst)[msk]
If you need a list back as the result:
np.array(lst)[msk].tolist()
You could also just use a list comprehension and zip.
Define a function:
def masklist(mylist, mymask):
    return [a for a, b in zip(mylist, mymask) if b]
Use it:
n = 8
lst = range(n)
msk = [(el>3) and (el<=6) for el in lst]
lst_msk = masklist(lst,msk)
print(lst_msk)
