Python: Elegant and efficient ways to mask a list

Example:
import numpy as np
n = 8
# masking lists
lst = list(range(n))
print(lst)
# the mask (filter)
msk = [(el > 3) and (el <= 6) for el in lst]
print(msk)
# use of the mask
print([lst[i] for i in range(len(lst)) if msk[i]])
# masking arrays
ary = np.arange(n)
print(ary)
# the mask (filter)
msk = (ary > 3) & (ary <= 6)
print(msk)
# use of the mask
print(ary[msk])  # very elegant
and the results are:
>>>
[0, 1, 2, 3, 4, 5, 6, 7]
[False, False, False, False, True, True, True, False]
[4, 5, 6]
[0 1 2 3 4 5 6 7]
[False False False False True True True False]
[4 5 6]
As you can see, masking an array is more elegant than masking a list. If you try to use the array masking scheme on a list, you get an error:
>>> lst[msk]
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
TypeError: only integer arrays with one element can be converted to an index
The question is to find an elegant masking for lists.
Updates:
The answer by jamylak was accepted for introducing compress; however, the points made by Joel Cornett completed the solution into the form I was after:
>>> mlist = MaskableList
>>> mlist(lst)[msk]
[4, 5, 6]

If you are using numpy:
>>> import numpy as np
>>> a = np.arange(8)
>>> mask = np.array([False, False, False, False, True, True, True, False], dtype=bool)
>>> a[mask]
array([4, 5, 6])
If you are not using numpy, you are looking for itertools.compress:
>>> from itertools import compress
>>> a = range(8)
>>> mask = [False, False, False, False, True, True, True, False]
>>> list(compress(a, mask))
[4, 5, 6]
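When the mask comes from a condition rather than a literal list, it can probably be passed to compress directly as a generator. The following is a minimal sketch assuming Python 3; the names data and selected are only illustrative.

```python
from itertools import compress

data = list(range(8))

# compress() is lazy: it returns an iterator, so wrap it in list() to materialize.
# The mask is a generator expression, so no intermediate mask list is built.
selected = list(compress(data, (3 < el <= 6 for el in data)))

print(selected)  # [4, 5, 6]
```

This avoids storing the mask at all, which may matter when the data is large and the mask is only needed once.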

If you are already using NumPy, you can do it with a NumPy array, without installing any other library:
>>> import numpy as np
>>> a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> msk = [True, False, False, True, True, True, True, False, False, False]
>>> a = np.array(a)  # convert list to numpy array
>>> result = a[msk]  # mask a
>>> result.tolist()
[0, 3, 4, 5, 6]

Since jamylak already answered the question with a practical answer, here is my example of a list with builtin masking support (totally unnecessary, btw):
from itertools import compress
class MaskableList(list):
    def __getitem__(self, index):
        try:
            return super(MaskableList, self).__getitem__(index)
        except TypeError:
            return MaskableList(compress(self, index))
Usage:
>>> myList = MaskableList(range(10))
>>> myList
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> mask = [0, 1, 1, 0]
>>> myList[mask]
[1, 2]
Note that compress stops when either the data or the mask runs out. If you wish to keep the portion of the list that extends past the length of the mask, you could try something like:
from itertools import zip_longest  # named izip_longest in Python 2
[i[0] for i in zip_longest(myList, mask[:len(myList)], fillvalue=True) if i[1]]
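The keep-the-tail idea can be sketched as a small Python 3 helper. This is an illustration only: the function name mask_keep_tail is hypothetical, and zip_longest is the Python 3 name for izip_longest.

```python
from itertools import zip_longest

def mask_keep_tail(items, mask):
    """Apply mask to items; items beyond the end of the mask are kept."""
    items = list(items)
    # Trim the mask so zip_longest never has to pad the items side.
    trimmed = list(mask)[:len(items)]
    # Missing mask entries are filled with True, which keeps the tail.
    return [item for item, keep in zip_longest(items, trimmed, fillvalue=True) if keep]

print(mask_keep_tail(range(10), [0, 1, 1, 0]))  # [1, 2, 4, 5, 6, 7, 8, 9]
```

Compare this with compress, which would return only [1, 2] for the same inputs.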

I don't consider it elegant. It's compact, but it tends to be confusing, because the construct is very different from what most languages offer.
As Guido van Rossum has said about language design, we spend more time reading code than writing it. The more obscure the construction of a line of code, the more confusing it becomes to others who may lack familiarity with Python, even though they are fully competent in any number of other languages.
Readability trumps short-form notation every day in the real world of servicing code. It's just like fixing your car: big drawings with lots of information make troubleshooting a lot easier.
For me, I would much rather troubleshoot someone's code that uses the long form
print([lst[i] for i in range(len(lst)) if msk[i]])
than the numpy short mask notation. I don't need any special knowledge of a specific Python package to interpret it.

The following works perfectly well in Python 3:
np.array(lst)[msk]
If you need a list back as the result:
np.array(lst)[msk].tolist()

You could also just use a list comprehension with zip.
Define a function:
def masklist(mylist, mymask):
    return [a for a, b in zip(mylist, mymask) if b]
Use it:
n = 8
lst = range(n)
msk = [(el>3) and (el<=6) for el in lst]
lst_msk = masklist(lst,msk)
print(lst_msk)
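One thing to keep in mind with the zip approach is that zip silently stops at the shorter input, so a mask of the wrong length goes unnoticed. A possible variant that checks the lengths first is sketched below; the name masklist_strict is hypothetical and not part of the answer above.

```python
def masklist_strict(items, mask):
    """Like masklist, but refuse a mask whose length differs from the data."""
    items, mask = list(items), list(mask)
    if len(items) != len(mask):
        raise ValueError(
            f"length mismatch: {len(items)} items, {len(mask)} mask entries"
        )
    return [item for item, keep in zip(items, mask) if keep]

lst = list(range(8))
msk = [3 < el <= 6 for el in lst]
print(masklist_strict(lst, msk))  # [4, 5, 6]
```

Whether to raise or to truncate quietly is a design choice; raising tends to surface bugs earlier.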

Related

What's the best way of comparing slices of a list in Python?

I attempted to compare slices of a list in Python, but to no avail. Is there a better way to do this?
My Code (Attempt to make slice return True)
a = [1,2,3]
# Slice Assignment
a[0:1] = [0,0]
print(a)
# Slice Comparisons???
print(a[0:2])
print(a[0:2] == True)
print(a[0:2] == [True, True])
My Results
[0, 0, 2, 3]
[0, 0]
False
False

Since slicing returns lists and lists automatically compare element-wise, all you need to do is use ==:
>>> a = [1, 2, 3, 1, 2, 3]
>>> a[:3] == a[3:]
True
To compare to a fixed value, you need a little more effort:
>>> b = [1, 1, 1, 3]
>>> all(e == 1 for e in b[:3])
True
>>> all(e == 1 for e in b[2:])
False
Bonus: if you are doing lots of array calculations, you might benefit from using numpy arrays:
>>> import numpy as np
>>> c = np.array(b)
>>> c[:3] == 1 # this automatically gets applied to all elements
array([ True, True, True])
>>> (c[:3] == 1).all()
True

It is not quite clear what exactly you're trying to do.
As you printed, a[0:2] is [0, 0]. In the first comparison you are comparing a list to a boolean; these are different types, so they are never equal.
In the second one you are comparing [0, 0] to [True, True]. Python compares lists element by element, and 0 equals False, not True, so [0, 0] is clearly not == to [True, True].
Could you edit your question and add what you want the code to do? I would add this in a comment, but I don't have enough rep yet :)
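The element-wise behaviour described here can be checked directly. The snippet below is a small sketch that reuses the list from the question after the slice assignment; the all() and any() calls are one guess at what the asker may have wanted.

```python
a = [0, 0, 2, 3]

# bool is a subclass of int, so 0 == False and 1 == True element by element.
print(a[0:2] == [False, False])  # True
print(a[0:2] == [True, True])    # False

# A list is never equal to a bare boolean, whatever it contains.
print(a[0:2] == True)            # False

# To ask whether all (or any) elements of a slice are truthy, use all()/any().
print(all(a[0:2]))  # False, both elements are 0
print(any(a[2:4]))  # True, 2 and 3 are truthy
```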

Removing items from a list based on a function

What is the most efficient way to remove items from a list based on a function in python (and any common library)?
For example, if I have the following function:
def func(a):
    return a % 2 == 1
And the following list:
arr = [1,4,5,8,20,24]
Then I would want the result:
new_arr = [1,5]
I know I could simply iterate over the list like this:
new_arr = [i for i in arr if func(i)]
Just wondering if this is an efficient approach (for large datasets), or if there might be a better way. I was thinking of maybe using np to map, changing the function to return a if True and -1 if False, and then using np to remove all the -1s?
Edit:
I decided to test it myself with the suggestions you all gave (I probably just should have tested runtime myself rather than being lazy and asking other people).
filter was by far the fastest alone. Though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method.
The numpy [np.vectorize(func)(arr)] was slightly faster than the other list methods.

This is the use case for the builtin filter function:
filtered_arr = filter(func, arr)
Note that this returns an iterator, not a list. If you want a list, you can create one with list(filtered_arr) or a list comprehension as you noted. But if you just want to iterate over the filtered items and don't need random access / indexing, it's more memory efficient to use the iterator.
This is a good general approach for filtering lists that are not especially large and contain elements with arbitrary (and possibly mixed) types. If you are working with a large amount of numerical data, you should use one of the NumPy solutions mentioned in other answers.
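The difference between the lazy iterator and an eager list can be seen in a short sketch. It assumes Python 3 and reuses func and arr from the question; the names lazy and as_list are illustrative.

```python
def func(a):
    return a % 2 == 1

arr = [1, 4, 5, 8, 20, 24]

lazy = filter(func, arr)               # nothing is evaluated yet
as_list = [i for i in arr if func(i)]  # evaluated immediately

print(next(lazy))  # 1
print(list(lazy))  # [5], because the iterator has already yielded 1
print(as_list)     # [1, 5]
```

Note that a filter object can be consumed only once; build a list if the result is needed more than once.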

Since the numpy tag is present, let's use it. In this example we use a mask that selects the elements that give remainder 1 when divided by 2.
>>> import numpy as np
>>>
>>> arr = np.array([1,4,5,8,20,24])
>>> arr
array([ 1, 4, 5, 8, 20, 24])
>>> arr[arr % 2 == 1]
array([1, 5])

Using numpy you could use numpy.vectorize to map your function across your array elements, then use that to keep the elements that evaluate to True. So starting with your function
def func(a):
    return a % 2 == 1
We could test keeping only the odd values from the range of [0, 19]
>>> import numpy as np
>>> arr = np.arange(20)
>>> arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> np.vectorize(func)(arr)
array([False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True, False, True])
>>> arr[np.vectorize(func)(arr)]
array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19])
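For a function built only from arithmetic and comparison operators, such as this func, np.vectorize may not be needed at all, because those operators already broadcast over arrays. The sketch below compares the two; the names mask_vectorize and mask_direct are illustrative. The NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so the direct call is likely the faster option.

```python
import numpy as np

def func(a):
    return a % 2 == 1

arr = np.arange(20)

mask_vectorize = np.vectorize(func)(arr)  # applies func element by element
mask_direct = func(arr)                   # % and == operate on the whole array

print(np.array_equal(mask_vectorize, mask_direct))  # True
print(arr[mask_direct])
```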

You could also rewrite func to make it take a list as an argument:
def func(lst):
    lst = [x for x in lst if x % 2 == 1]
    return lst
and then do:
new_arr = func(arr)
This would save you some lines compared with making func take a single number as its argument and writing an iteration such as [i for i in arr if func(i)] every time you want to use it on (the elements of) a list.

I decided to test it myself with the suggestions you all gave (I probably just should have tested runtime myself rather than being lazy and asking other people).
filter was by far the fastest alone. Though if you needed a list for random access it was comparable to the [i for i in arr if func(i)] method.
The numpy [np.vectorize(func)(arr)] was slightly faster than the other methods.
Just wanted to post this in case anyone else runs across this in the future.

Find all indices of numpy vector whose value is in a given set

I am getting more and more used to numpy's fancy indexing possibilities, but this time I hit an obstacle I cannot solve without resorting to ugly for loops.
My input is a pair of vectors, one large vector v and a smaller vector of indices e. What I want is to find all the indices i for which v[i] is equal to one of the values v[e[0]], v[e[1]],...v[e[n]]. At the moment, the code that does this for me (and it works) is
import numpy as np
v = np.array([0,0,0,0,1,1,1,2,2,2,2,2,2])
e = np.array([0,4])
n = len(v)  # n was not defined in the original snippet; assumed to be the length of v
# what I want to get is the vector [0,1,2,3,4,5,6]
values = v[e]
r = []
for i in range(n):
    if v[i] in values:
        r.append(i)
In the case when e is only one number, I am able to do this:
rr = np.arange(n)
r = v[rr] == v[e]
which is both nicer and quicker than a for loop. Is there a way of doing this when e is not a single number?

You could use where and in1d:
>>> v = np.array([0,0,0,0,1,1,1,2,2,2,2,2,2])
>>> e = [0,4]
>>> np.in1d(v, v[e])
array([ True, True, True, True, True, True, True, False, False,
False, False, False, False], dtype=bool)
>>> np.where(np.in1d(v, v[e]))
(array([0, 1, 2, 3, 4, 5, 6]),)
>>> np.where(np.in1d(v, v[e]))[0]
array([0, 1, 2, 3, 4, 5, 6])
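On NumPy 1.13 or newer, the same result can be obtained with np.isin, and np.flatnonzero returns the index array directly without the tuple unpacking. A minimal sketch, with mask and indices as illustrative names:

```python
import numpy as np

v = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2])
e = np.array([0, 4])

mask = np.isin(v, v[e])         # True where v[i] is one of the values v[e]
indices = np.flatnonzero(mask)  # positions of the True entries

print(indices)  # [0 1 2 3 4 5 6]
```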

Python equivalent of the R operator "%in%"

What is the python equivalent of this in operator? I am trying to filter down a pandas database by having rows only remain if a column in the row has a value found in my list.
I tried using any() and am having immense difficulty with this.

Pandas comparison with R docs are here.
s <- 0:4
s %in% c(2,4)
The isin method is similar to R %in% operator:
In [13]: s = pd.Series(np.arange(5),dtype=np.float32)
In [14]: s.isin([2, 4])
Out[14]:
0    False
1    False
2     True
3    False
4     True
dtype: bool
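Since the question is about keeping only certain rows, the boolean Series returned by isin can be used to index the DataFrame. The sketch below uses made-up data; the column names name and score and the list wanted are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [2, 3, 4, 5]})
wanted = [2, 4]

kept = df[df["score"].isin(wanted)]      # rows whose score is in the list
dropped = df[~df["score"].isin(wanted)]  # the complement, like !(x %in% y) in R

print(kept["name"].tolist())     # ['a', 'c']
print(dropped["name"].tolist())  # ['b', 'd']
```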

FWIW: without having to call pandas, here's the answer using a for loop and a list comprehension in pure Python:
x = [2, 3, 5]
y = [1, 2, 3]
# for loop
result = []
for i in x:
    result.append(i in y)
result
Out: [True, True, False]
# list comprehension
[i in y for i in x]
Out: [True, True, False]

If you want to use only numpy without pandas (like a use case I had), then you can do:
import numpy as np
x = np.array([1, 2, 3, 10])
y = np.array([10, 11, 2])
np.isin(y, x)
This is equivalent to:
c(10, 11, 2) %in% c(1, 2, 3, 10)
Note that np.isin works only for numpy >= 1.13.0; for older versions you'll need to use np.in1d.
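The result has the shape of the first argument, just as the R expression returns one value per element on the left-hand side. The sketch below shows the output for the arrays above, plus the invert parameter of np.isin for the negated test; the names found and missing are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 10])
y = np.array([10, 11, 2])

found = np.isin(y, x)                 # one boolean per element of y
missing = np.isin(y, x, invert=True)  # same as ~np.isin(y, x)

print(found)    # [ True False  True]
print(missing)  # [False  True False]
```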

As others indicate, the in operator of base Python works well for testing a single value:
myList = ["a00", "b000", "c0"]
"a00" in myList
# True
"a" in myList
# False

How to invert numpy.where (np.where) function

I frequently use the numpy.where function to gather a tuple of indices of a matrix having some property. For example
import numpy as np
X = np.random.rand(3,3)
>>> X
array([[ 0.51035326, 0.41536004, 0.37821622],
[ 0.32285063, 0.29847402, 0.82969935],
[ 0.74340225, 0.51553363, 0.22528989]])
>>> ix = np.where(X > 0.5)
>>> ix
(array([0, 1, 2, 2]), array([0, 2, 0, 1]))
ix is now a tuple of ndarray objects that contain the row and column indices, whereas the sub-expression X>0.5 contains a single boolean matrix indicating which cells had the >0.5 property. Each representation has its own advantages.
What is the best way to take ix object and convert it back to the boolean form later when it is desired? For example
>>> G = np.zeros(X.shape, dtype=bool)
>>> G[ix] = True
Is there a one-liner that accomplishes the same thing?

Something like this maybe?
mask = np.zeros(X.shape, dtype='bool')
mask[ix] = True
but if it's something simple like X > 0, you're probably better off doing mask = X > 0 unless mask is very sparse or you no longer have a reference to X.
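If this conversion is needed often, the two lines can be wrapped in a helper so that the call site becomes a one-liner. This is only a sketch: the function name indices_to_mask is hypothetical, and the random seed is arbitrary.

```python
import numpy as np

def indices_to_mask(ix, shape):
    """Rebuild a boolean mask from the index tuple returned by np.where."""
    mask = np.zeros(shape, dtype=bool)
    mask[ix] = True
    return mask

rng = np.random.default_rng(0)  # seed chosen arbitrarily
X = rng.random((3, 3))
ix = np.where(X > 0.5)

# Round trip: the rebuilt mask matches the original boolean expression.
print(np.array_equal(indices_to_mask(ix, X.shape), X > 0.5))  # True
```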

For example:
mask = X > 0
imask = np.logical_not(mask)
Edit: Sorry for being so concise before. I shouldn't be answering things on the phone :P
As I noted in the example, it's better to just invert the boolean mask. It's much more efficient and easier than going back from the result of where.

The bottom of the np.where docstring suggests using np.in1d for this.
>>> x = np.array([1, 3, 4, 1, 2, 7, 6])
>>> indices = np.where(x % 3 == 1)[0]
>>> indices
array([0, 2, 3, 5])
>>> np.in1d(np.arange(len(x)), indices)
array([ True, False, True, True, False, True, False], dtype=bool)
(While this is a nice one-liner, it is a lot slower than @Bi Rico's solution.)
