Convert a numpy boolean array to int one with distinct values - python

I've read through most of How to convert a boolean array to an int array , but I was still at a loss as to how to (most efficiently) convert a numpy bool array to an int array, but with distinct values. For instance, I have:
>>> k=np.array([True, False, False, True, False])
>>> print k
[ True False False True False]
I'd like this to be converted to an array, where True is say, 2, and False is say 5.
Of course, I can always set up a linear equation:
>>> print 5-k*3
[2 5 5 2 5]
... which, while vectorized, needlessly employs both addition (subtraction) and multiplication; furthermore, if I'd want the opposite values (5 for True and 2 for False), I basically have to use (and recalculate) a different equation:
>>> print 2+k*3
[5 2 2 5 2]
... which is a bit of a readability issue for me.
In essence, this is merely a selection/mapping operation - but one which I'd like done in the numpy domain. How could I do that?

Seems like numpy.where is exactly what you want:
>>> import numpy as np
>>> k = np.array([True, False, False, True, False])
>>> np.where(k, 2, 5)
array([2, 5, 5, 2, 5])

Well, it seems I have to make an np.array (not a Python list!) containing the two distinct values:
>>> z=np.array([2,5])
>>> print z
[2 5]
... and then I can simply cast the boolean array (k) to int, and use that as selection indices in the "distinct values array" (z):
>>> print z[k.astype(int)]
[5 2 2 5 2]
It is also trivial to change the [2,5] to [5,2], so that is good.
I'm just not sure if this is the right way to do it (that is, maybe there exists something like (pseudocode) k.asdistinctvalues([2,5]) or something like that?)
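For completeness, here is a minimal sketch (my own, not from the answers above) putting the two approaches side by side:

```python
import numpy as np

k = np.array([True, False, False, True, False])

# np.where picks 2 where k is True and 5 where it is False.
a = np.where(k, 2, 5)

# Fancy indexing: k.astype(int) is [1 0 0 1 0], so z[1]=2 is
# selected where k is True and z[0]=5 where it is False.
z = np.array([5, 2])
b = z[k.astype(int)]

print(a)   # [2 5 5 2 5]
print(b)   # [2 5 5 2 5]
```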

Related

NumPy: Make filter array from a (N,2) shaped array

I have the following array:
[[4 9]
[5 4]
...
[2 9]]
I want to filter this arr array so that I keep only the rows where both elements are between 0 and 7, and discard the rest. My solution, so far, has been to create a filter array to index it with:
filter_array = ((arr >= 0) & (arr <= 7))
My problem is, this returns an array of the same shape as arr:
[[ True False]
[ True True]
...
[ True False]]
Which I can't use to index the original array in the way that I want. I want to discard the entire line, if any of the elements are not between the values I want:
#desired output:
[ False
True
...
False ]
I want to solve this in a "numpy-ish" manner, since the array is quite large, so performance is important. (I don't want to just iterate over it with some for loops)
You can sum along axis=1 and check whether each row sums to 2:
filtered_array = (filter_array.sum(1) == 2)
Another way, using the & operator:
filtered_array = filter_array[:,0] & filter_array[:,1]
Lastly:
filter_array = filter_array.all(1)
The last one is, in my view, the best way, but choose whatever works for you.
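A minimal end-to-end sketch of this approach (my own, using a small stand-in array):

```python
import numpy as np

arr = np.array([[4, 9], [5, 4], [2, 9]])

filter_array = (arr >= 0) & (arr <= 7)   # shape (N, 2), per element
row_mask = filter_array.all(axis=1)      # one bool per row
result = arr[row_mask]                   # keeps only rows that pass

print(row_mask)   # [False  True False]
print(result)     # [[5 4]]
```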

Deleting elements from a list if that value is masked in another list

I have imported data that has the format of a numpy masked array of incrementing integers. The masked elements are irregular and not repeating, e.g. printing it yields:
masked = [0,1,--,3,--,5,6,--,--,9,--]
And I have another list of incrementing numbers that doesn't start from zero, and has irregular gaps and is a different size from masked:
data = [1,3,4,6,7,9,10]
I want to remove any element of data if its value is a masked element in masked
So that I get:
result = [1,3,6,9]
As 4, 7 and 10 were masked values in masked.
I think my pseudocode should look something like:
for i in len(masked):
if masked[i] = 'masked' && data[i] == [i]:
del data[i]
But I'm having trouble reconciling the different lengths and mis-matched indices of the two arrays.
Thanks for any help!
Make sure data is an array:
data = np.asarray(data)
Then:
data[~masked.mask[data]]
This will be extremely fast, though it does assume that your masked array contains all numbers from 0 to at least max(data).
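A small self-contained sketch of this approach (the example data is borrowed from the question):

```python
import numpy as np
import numpy.ma as ma

masked = ma.masked_array(np.arange(11),
                         mask=[0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])
data = np.asarray([1, 3, 4, 6, 7, 9, 10])

# masked.mask[data] looks up the mask flag at each value in data;
# ~ then keeps only the entries whose value is NOT masked.
result = data[~masked.mask[data]]

print(result)   # [1 3 6 9]
```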
You can use the set function to build sets from the two lists and take their intersection.
Here is a demo:
>>> import numpy as np
>>> import numpy.ma as ma
>>> arr = np.array([x for x in range(11)])
>>> masked = ma.masked_array(arr, mask=[0,0,1,0,1,0,0,1,1,0,1])
>>> masked
masked_array(data = [0 1 -- 3 -- 5 6 -- -- 9 --],
mask = [False False True False True False False True True False
True],
fill_value = 999999)
>>> data = np.array([1,3,4,6,7,9,10])
>>> result = list(set(data) & set(masked[~masked.mask]))
>>> result
[1, 3, 6, 9]

python numpy strange boolean arithmetic behaviour

Why is it, in python/numpy:
from numpy import asarray
bools=asarray([False,True])
print(bools)
[False True]
print(1*bools, 0+bools, 0-bools) # False, True are valued as 0, 1
[0 1] [0 1] [ 0 -1]
print(-2*bools, -bools*2) # !? expected same result! :-/
[0 -2] [2 0]
print(-bools) # this is the reason!
[True False]
I consider it weird that -bools returns logical_not(bools), because in all other cases the behaviour is "arithmetic", not "logical".
One who wants to use an array of booleans as a 0/1 mask (or "characteristic function") is forced to use somewhat convoluted expressions such as (0-bools) or (-1)*bools, and can easily run into bugs after forgetting this.
Why is it so, and what would be the best acceptable way to obtain the desired behaviour? (beside commenting of course)
It's all about operator order and data types.
>>> import numpy as np
>>> B = np.array([0, 1], dtype=np.bool)
>>> B
array([False, True], dtype=bool)
With numpy, boolean arrays are treated as just that: boolean arrays. Every operation applied to them will first try to maintain the data type. That is why:
>>> -B
array([ True, False], dtype=bool)
and
>>> ~B
array([ True, False], dtype=bool)
which are equivalent, return the element-wise negation of the elements. Note, however, that using -B on a boolean array is deprecated: it raises a warning (and newer numpy versions raise a TypeError instead).
When you use things like:
>>> B + 1
array([1, 2])
B and 1 are first cast under the hood to the same data type. In data-type promotion, the boolean array is always cast to a numeric array. In the above case, B is cast to int, which is equivalent to:
>>> B.astype(int) + 1
array([1, 2])
In your example:
>>> -B * 2
array([2, 0])
First the array B is negated by the operator - and then multiplied by 2. The desired behaviour can be achieved either by explicit data conversion, or by adding brackets to ensure the proper operation order:
>>> -(B * 2)
array([ 0, -2])
or
>>> -B.astype(int) * 2
array([ 0, -2])
Note that B.astype(int) can be replaced, without a data copy, by B.view(np.int8): booleans are stored as single bytes (8 bits), so the data can be reinterpreted as integers with the .view method, without converting it.
>>> B.view(np.int8)
array([0, 1], dtype=int8)
So, in short, B.view(np.int8) or B.astype(yourtype) will always ensure that B is treated as a [0, 1] numeric array.
Numpy arrays are homogeneous: all elements of a given array have the same type, and the array object stores what that type is. When you create an array with True and False, it is an array of type bool, and operators behave on the array as such. It's not surprising, then, that you get logical negation in situations that would be logical negation for an ordinary bool. When you use the arrays for integer math, they are converted to 1's and 0's. Of all your examples, those are the more anomalous cases; that is, behavior that shouldn't be relied upon in good code.
As suggested in the comments, if you want to do math with an array of 0's and 1's, it's better to just make an array of 0's and 1's. However, depending on what you want to do with them, you might be better served looking into functions like numpy.where().
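A small illustration of both suggestions, explicit conversion and np.where (a sketch of my own, not from the original answers):

```python
import numpy as np

bools = np.array([False, True])

# Convert explicitly before doing signed arithmetic:
ints = bools.astype(int)          # [0 1]
print(-ints * 2)                  # [ 0 -2]

# Or map the two truth values straight to the numbers you want:
print(np.where(bools, -2, 0))     # [ 0 -2]
```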

Numpy : The truth value of an array with more than one element is ambiguous

I am really confused on why this error is showing up. Here is my code:
import numpy as np
x = np.array([0, 0])
y = np.array([10, 10])
a = np.array([1, 6])
b = np.array([3, 7])
points = [x, y, a, b]
max_pair = [x, y]
other_pairs = [p for p in points if p not in max_pair]
>>>ValueError: The truth value of an array with more than one element is ambiguous.
Use a.any() or a.all()
(a not in max_pair)
>>>ValueError: The truth ...
What confuses me is that the following works fine:
points = [[1, 2], [3, 4], [5, 7]]
max_pair = [[1, 2], [5, 6]]
other_pairs = [p for p in points if p not in max_pair]
>>>[[3, 4], [5, 7]]
([5, 6] not in max_pair)
>>>False
Why is this happening when using numpy arrays? Is in/not in ambiguous for existence checks?
What is the correct syntax using any()\all()?
Numpy arrays define a custom equality operator, i.e. they are objects that implement the __eq__ magic function. Accordingly, the == operator and all other functions/operators that rely on such an equality call this custom equality function.
Numpy's equality is based on element-wise comparison of arrays. Thus, in return you get another numpy array with boolean values. For instance:
x = np.array([1,2,3])
y = np.array([1,4,5])
x == y
returns
array([ True, False, False], dtype=bool)
However, the in operator in combination with lists requires equality comparisons that only return a single boolean value. This is the reason why the error asks for all or any. For instance:
any(x==y)
returns True because at least one value of the resulting array is True.
In contrast
all(x==y)
returns False because not all values of the resulting array are True.
So in your case, a way around the problem would be the following:
other_pairs = [p for p in points if all(any(p!=q) for q in max_pair)]
and print other_pairs prints the expected result
[array([1, 6]), array([3, 7])]
Why so? Well, we keep an item p from points if, for every item q in max_pair, at least one of p's entries differs from q's.
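An alternative (my own variation, not from the original answers) is to compare whole arrays with np.array_equal, which returns a single boolean and so avoids the ambiguity entirely:

```python
import numpy as np

x = np.array([0, 0]); y = np.array([10, 10])
a = np.array([1, 6]); b = np.array([3, 7])
points = [x, y, a, b]
max_pair = [x, y]

# np.array_equal compares two whole arrays and returns a single
# bool, so it can stand in for the ambiguous `in` membership test.
other_pairs = [p for p in points
               if not any(np.array_equal(p, q) for q in max_pair)]

print(other_pairs)   # [array([1, 6]), array([3, 7])]
```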
The reason is that they're two completely different kinds of objects. np.array has its own methods, named the same as the built-in functions any and all, but they don't work exactly the same way, and that distinction is reflected in the fact that they are methods of np.array.
>>> x = np.array([0,9])
>>> x.any(axis=0)
True
>>> y = np.array([10, 10])
>>> y.all()
True
>>> y.all(axis=0)
True
meanwhile:
>>> bool([])
False
>>> bool([[]])
True
>>> bool([[]][0])
False
Notice how the first result is False: an empty list is deemed False (in Python 2 as well). However, a list containing another list, even an empty one, is truthy. Evaluating the inner list returns False again, because it is empty. Since any and all are defined in terms of conversion to bool, the results you see differ.
>>> help(all)
all(...)
all(iterable) -> bool
Return True if bool(x) is True for all values x in the iterable.
>>> help(any)
any(...)
any(iterable) -> bool
Return True if bool(x) is True for any x in the iterable.
See a better explanations for logical numpy operators here

Efficiently sum a small numpy array, broadcast across a ginormous numpy array?

I want to calculate an indexed weight sum across a large (1,000,000 x
3,000) boolean numpy array. The large boolean array changes
infrequently, but the weights come at query time, and I need answers
very fast, without copying the whole large array, or expanding the
small weight array to the size of the large array.
The result should be an array with 1,000,000 entries, each having the
sum of the weights array entries corresponding to that row's True
values.
I looked into using masked arrays, but they seem to require building a
weights array the size of my large boolean array.
The code below gives the correct results, but I can't afford that copy
during the multiply step. The multiply isn't even necessary, since
the values array is boolean, but at least it handles the broadcasting
properly.
I'm new to numpy, and loving it, but I'm about to give up on it for
this particular problem. I've learned enough numpy to know to stay
away from anything that loops in python.
My next step will be to write this routine in C (which has the added
benefit of letting me save memory by using bits instead of bytes, by
the way.)
Unless one of you numpy gurus can save me from cython?
from numpy import array, multiply, sum
# Construct an example values array, alternating True and False.
# This represents four records of three attributes each:
# array([[False, True, False],
# [ True, False, True],
# [False, True, False],
# [ True, False, True]], dtype=bool)
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
# Construct example weights, one for each attribute:
# array([1, 2, 3])
weights = array(range(1, 4))
# Create expensive NEW array with the weights for the True attributes.
# Broadcast the weights array into the values array.
# array([[0, 2, 0],
# [1, 0, 3],
# [0, 2, 0],
# [1, 0, 3]])
weighted = multiply(values, weights)
# Add up the weights:
# array([2, 4, 2, 4])
answers = sum(weighted, axis=1)
print answers
# Rejected masked_array solution is too expensive (and oddly inverts
# the results):
masked = numpy.ma.array([[1,2,3]] * 4, mask=values)
The dot product (or inner product) is what you want. It allows you to take a matrix of size m×n and a vector of length n and multiply them together, yielding a vector of length m, where each entry is the weighted sum of a row of the matrix, with the entries of the vector as weights.
Numpy implements this as array1.dot(array2) (or numpy.dot(array1, array2) in older versions). e.g.:
from numpy import array
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
weights = array(range(1, 4))
answers = values.dot(weights)
print answers
# output: [ 2 4 2 4 ]
(You should benchmark this though, using the timeit module.)
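A quick benchmarking sketch with the timeit module (my own; the 10,000 x 30 shape is just a stand-in for the real array):

```python
import numpy as np
import timeit

# Stand-in data: a random boolean array and integer weights.
values = np.random.rand(10000, 30) > 0.5
weights = np.arange(30)

# Time 100 calls of the dot-product solution.
t = timeit.timeit(lambda: values.dot(weights), number=100)
print('100 calls: %.3f s' % t)
```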
It seems likely that dbaupp's answer is the correct one. But just for the sake of diversity, here's another solution that saves memory. This will work even for operations that don't have a built-in numpy equivalent.
>>> values = numpy.array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
>>> weights = numpy.array(range(1, 4))
>>> weights_stretched = numpy.lib.stride_tricks.as_strided(weights, (4, 3), (0, 8))
numpy.lib.stride_tricks.as_strided is a wonderful little function! It allows you to specify shape and strides values that allow a small array to mimic a much larger array. Observe -- there aren't really four rows here; it just looks that way:
>>> weights_stretched[0][0] = 4
>>> weights_stretched
array([[4, 2, 3],
[4, 2, 3],
[4, 2, 3],
[4, 2, 3]])
So instead of passing a huge array to MaskedArray, you can pass a smaller one. (But as you've already noticed, numpy masking works the opposite way you might expect: a True entry hides a value rather than revealing it, so you'll have to invert your values to build the mask.) As you can see, MaskedArray doesn't copy any data; it just reflects whatever is in weights_stretched:
>>> masked = numpy.ma.MaskedArray(weights_stretched, numpy.logical_not(values))
>>> weights_stretched[0][0] = 1
>>> masked
masked_array(data =
[[-- 2 --]
[1 -- 3]
[-- 2 --]
[1 -- 3]],
mask =
[[ True False True]
[False True False]
[ True False True]
[False True False]],
fill_value=999999)
Now we can just pass it to sum:
>>> sum(masked, axis=1)
masked_array(data = [2 4 2 4],
mask = [False False False False],
fill_value=999999)
I benchmarked numpy.dot and the above against a 1,000,000 x 30 array. This is the result on a relatively modern MacBook Pro (numpy.dot is dot1; mine is dot2):
>>> %timeit dot1(values, weights)
1 loops, best of 3: 194 ms per loop
>>> %timeit dot2(values, weights)
1 loops, best of 3: 459 ms per loop
As you can see, the built-in numpy solution is faster. But stride_tricks is worth knowing about regardless, so I'm leaving this.
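As a side note (not part of the original answer), newer numpy versions offer np.broadcast_to, which builds the same zero-stride view without hand-computing strides:

```python
import numpy as np

weights = np.array(range(1, 4))

# Same zero-stride trick as as_strided, but numpy computes the
# strides itself; the result is a read-only view, not a copy.
weights_stretched = np.broadcast_to(weights, (4, 3))

print(weights_stretched)
# [[1 2 3]
#  [1 2 3]
#  [1 2 3]
#  [1 2 3]]
```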
Would this work for you?
a = np.array([sum(row * weights) for row in values])
This uses sum() to immediately sum the row * weights values, so you don't need the memory to store all the intermediate values. Then the list comprehension collects all the values.
You said you want to avoid anything that "loops in Python". This at least does the looping with the C guts of Python, rather than an explicit Python loop, but it can't be as fast as a NumPy solution because that uses compiled C or Fortran.
I don't think you need numpy for something like that. And 1000000 by 3000 is a huge array; this will not fit in your RAM, most likely.
I would do it this way:
Let's say that you data is originally in a text file:
False,True,False
True,False,True
False,True,False
True,False,True
My code:
weight = range(1,4)
dicto = {'True':1, 'False':0}
with open ('my_data.txt') as fin:
a = sum(sum(dicto[ele]*w for ele,w in zip(line.strip().split(','),weight)) for line in fin)
Result:
>>> a
12
EDIT:
I think I slightly misread the question the first time around and summed everything together. Here is the solution that gives the exact result the OP is after:
weight = range(1,4)
dicto = {'True':1, 'False':0}
with open ('my_data.txt') as fin:
a = [sum(dicto[ele]*w for ele,w in zip(line.strip().split(','),weight)) for line in fin]
Result:
>>> a
[2, 4, 2, 4]
