Suppose I have a very big 2D boolean array (for the sake of the example, let's take dimensions 4 rows x 3 columns):
toto = np.array([[True, True, False],
                 [False, True, False],
                 [True, False, False],
                 [False, True, False]])
I want to transform toto so that it contains at least one True value per column, leaving the other columns untouched.
EDIT: The rule is just this: if a column is all False, I want to introduce a True in a random row.
So in this example, one of the False in the 3rd column should become True.
How would you do that efficiently?
Thank you in advance
You can do it like this:
col_mask = ~np.any(toto, axis=0)
row_idx = np.random.randint(toto.shape[0], size=np.sum(col_mask))
toto[row_idx, col_mask] = True
col_mask is array([False, False, True]): a boolean mask of the columns that need to change.
row_idx is an array of randomly chosen row indices, one per column to change.
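Putting those pieces together, a runnable sketch (with a fixed seed so the random row choice is reproducible):

```python
import numpy as np

np.random.seed(0)  # for reproducibility
toto = np.array([[True, True, False],
                 [False, True, False],
                 [True, False, False],
                 [False, True, False]])

# Boolean mask of the columns that contain no True value
col_mask = ~np.any(toto, axis=0)
# One random row index per all-False column
row_idx = np.random.randint(toto.shape[0], size=np.sum(col_mask))
toto[row_idx, col_mask] = True

# Every column now contains at least one True
assert np.all(np.any(toto, axis=0))
```

The other columns are untouched, since the fancy-indexed assignment only writes to the positions selected by col_mask.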
import numpy as np
toto = np.array([[False, True, False], [False, True, False],
                 [False, False, False], [False, True, False]])
# First we get a boolean array indicating columns that have at least one True value
mask = np.any(toto, axis=0)
# Now we invert the mask to get columns indexes (as boolean array) with no True value
mask = np.logical_not(mask)
# Notice that if we index with this mask on the column dimension we get elements
# in all rows, but only in the columns containing no True value. The dimension is
# "num_rows x num_columns_without_true"
toto[:, mask]
# Now we need random indexes for rows in the columns containing only false. That
# means an array of integers from zero to `num_rows - 1` with
# `num_columns_without_true` elements
row_indexes = np.random.randint(toto.shape[0], size=np.sum(mask))
# Now we can use both masks to select one False element in each column
# containing only False elements and set them to True
toto[row_indexes, mask] = True
Disclaimer: mathfux was faster with essentially the same solution as the one I was writing (so accept his answer if this is what you were looking for), but since I was writing a version with more comments I decided to post anyway.
I have two boolean arrays a and b. I want a resulting boolean array c such that each element of a is inverted where b is True and keeps its original value where b is False.
a = np.array([True, False, True, True, False])
b = np.array([True, False, False, False, True])
c = np.invert(a, where=b)
Expected output:
c = np.array([False, False, True, True, True])
However this is the output I'm getting:
c = np.array([False, False, False, False, True])
Why is this so?
You need to pass an out array to specify the values for the not-where elements. Otherwise they are unpredictable.
In [242]: np.invert(a,where=b, out=a)
Out[242]: array([False, False, True, True, True])
Passing where=b to numpy.invert doesn't mean "keep the original a values for cells not selected by b". It means "don't write anything to the output array for cells not selected by b". Since you didn't pass an initialized out array, the unselected cells are filled with whatever garbage happened to be in that memory when it was allocated.
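One way to get the "keep the original where b is False" behavior is to pre-fill the output yourself, passing a copy of a as out (a sketch; the copy keeps a itself untouched):

```python
import numpy as np

a = np.array([True, False, True, True, False])
b = np.array([True, False, False, False, True])

# Pre-fill the output with a's values, so the not-where cells
# keep the original instead of uninitialized garbage
c = np.invert(a, where=b, out=a.copy())
# c is [False, False, True, True, True], and a is unchanged
```

Here the cells selected by b are inverted and written into the copy, while the unselected cells simply retain the values the copy started with.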
Since NumPy has some free lists for small array buffers, we can demonstrate that the output is uninitialized garbage by getting NumPy to reuse an allocation filled with whatever we want:
import numpy
a = numpy.zeros(4, dtype=bool)
numpy.array([True, False, True, False])  # created and immediately discarded
print(repr(numpy.invert(a, where=a)))  # where=a selects no cells at all
Output:
array([ True, False, True, False])
In this example, we can see that NumPy reused the buffer from the array we created but didn't save. Since where=a selected no cells, numpy.invert didn't write anything to the buffer, and the result is exactly the contents of the discarded array.
As for the operation you wanted to perform, that's just XOR: c = a ^ b
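Checking the XOR against the arrays from the question:

```python
import numpy as np

a = np.array([True, False, True, True, False])
b = np.array([True, False, False, False, True])

# XOR flips a exactly where b is True and leaves it untouched where b is False
c = a ^ b
# c is [False, False, True, True, True], matching the expected output
```

This works because x ^ True == not x and x ^ False == x, which is exactly the conditional inversion being asked for.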
The ultimate goal of my question is that I want to generate a new array 'output' by passing the subarrays of an array into a function, where the return of the function for each subarray generates a new element into 'output'.
My input array was generated as follows:
aggregate_predictors = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (i.e., an array of 5 floats) through my function condition, and I want the return value of each call to fill in a new array output. Basically, I want output to contain 100 values.
def condition(array):
    return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
    lister.append(condition(input[i]))
output = np.array(lister)
That initial split and index does nothing. It just wraps the array in a list and then takes it out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests whether the element at index 4 (the fifth) of each row is < .5:
In [81]: def condition(array):
...:
...: return array[4] < 0.5
...:
In [82]: lister = []
...:
...: for i in range(100):
...: lister.append(condition(x[i]))
...:
...: output = np.array(lister)
...:
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do this just as easily with column indexing:
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole 4th column of the array.
You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape (100, 5) (you can check that with aggregate_predictors.shape). Your function checks, for each row, whether the element in the fifth column is less than 0.5. You can do this with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to find the last column instead of the fifth, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this just means finding the correct index into the array. You should never have to convert anything to a list: numpy arrays are iterable as they are, and more complex iterators are available.
That being said, your original intent was probably to do something like
input = np.split(aggregate_predictors, len(aggregate_predictors), axis=0)
OR
input = np.split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row matrices.
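A quick shape check of that version (a sketch, reusing the variable name from the question):

```python
import numpy as np

aggregate_predictors = np.random.rand(100, 5)

# Splitting into as many sections as there are rows yields
# a list of 100 single-row matrices
chunks = np.split(aggregate_predictors, aggregate_predictors.shape[0])
print(len(chunks), chunks[0].shape)  # 100 (1, 5)
```

Note that each chunk keeps two dimensions, shape (1, 5), rather than being flattened to a plain length-5 vector.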
What is this operation technically called, and what other functionality does it allow for:
Z[1:-1,1:-1][birth|survive]=1, where Z is a 4x4 array and birth and survive are boolean arrays the same size as the inner slice. I understand what this code does, but I would like to know what this operation is called and what else I can do with it (referring to the latter part, [birth|survive]).
The pipe | is the bitwise or operator. Therefore, birth|survive is equivalent to np.bitwise_or(birth, survive). Presumably birth and survive are boolean arrays, so the output is a boolean array with the straightforward or behavior:
a = np.array([True, True, False, False])
b = np.array([True, False, False, True])
a|b
# array([ True, True, False, True], dtype=bool)
For integers, each bit is considered, and an integer array is returned in which each digit of the binary representations has been or'ed. There is a better explanation of its behavior and some examples on the documentation page.
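For instance, with plain integers:

```python
import numpy as np

# 12 is 0b1100 and 10 is 0b1010; or-ing bit by bit gives 0b1110 == 14
print(np.bitwise_or(12, 10))  # 14
print(12 | 10)                # same operator, same result: 14
```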
Once you've created the boolean array from birth|survive, you are using it to do a boolean index into the Z array. Most simply, this can be shown with:
a = np.array([1,2,3])
b = np.array([True, False, True])
a[b] # the elements of a where b is True
# array([1, 3])
Since it's on the left side of the assignment =, Python will assign the value 1 to every point in Z where birth or survive is True:
a[b] = 99
a
# array([99, 2, 99])
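Back to the original expression: Z[1:-1,1:-1] is produced by basic slicing, so it is a view into Z, and assigning through the boolean mask writes back into Z itself. A small sketch with hypothetical birth/survive masks:

```python
import numpy as np

Z = np.zeros((4, 4), dtype=int)
# Hypothetical 2x2 masks matching the inner region Z[1:-1, 1:-1]
birth = np.array([[True, False],
                  [False, False]])
survive = np.array([[False, False],
                    [False, True]])

# The slice is a view, so the masked assignment modifies Z in place
Z[1:-1, 1:-1][birth | survive] = 1
print(Z)
# [[0 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 0]]
```

This combination of a slice view with boolean mask assignment is the usual idiom for updating the interior of a grid, e.g. in Game of Life implementations.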
I have columns corresponding to a given day, month, and year in a numpy array called 'a', and I am comparing all three of these values against the corresponding day, month, and year columns of another array called 'b', to find the indices of 'a' that are equal to 'b'. So far I have tried:
a[:,3:6,1] == b[1,3:6]
array([[False, True, True],
[ True, True, True],
[False, True, True],
...,
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
which works fine, but I need the rows that correspond to [True, True, True].
I've also tried:
np.where(a[:,3:6,1] == b[1,3:6], a[:,3:6,1])
ValueError: either both or neither of x and y should be given
and
a[:,:,1].all(a[:,3:6,1] == b[1,3:6])
TypeError: only length-1 arrays can be converted to Python scalars
What is a quick and easy way to do this?
You can use np.all() along the last axis:
rows = np.where((a[:,3:6,1]==b[1,3:6]).all(axis=1))[0]
This stores in rows the indices of the rows whose values are all True.
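A minimal sketch with made-up (day, month, year) data:

```python
import numpy as np

# Hypothetical (day, month, year) triples standing in for a[:, 3:6, 1]
a_dates = np.array([[1, 2, 2020],
                    [3, 4, 2021],
                    [1, 2, 2020]])
b_date = np.array([1, 2, 2020])  # the row of b to match against

# Rows where day, month, and year all match
rows = np.where((a_dates == b_date).all(axis=1))[0]
print(rows)  # [0 2]
```

The .all(axis=1) collapses each [True, True, True]-style row to a single boolean, and np.where then turns those booleans into row indices.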
I am fairly new to numpy and scientific computing, and I have been struggling with a problem for several days, so I decided to post it here.
I am trying to get a count of a specific occurrence of a condition in a numpy array.
In [233]: import numpy as np
In [234]: a= np.random.random([5,5])
In [235]: a >.7
Out[235]: array([[False, True, True, False, False],
[ True, False, False, False, True],
[ True, False, True, True, False],
[False, False, False, False, False],
[False, False, True, False, False]], dtype=bool)
What I would like is to count the number of occurrences of True in each row and keep the rows where this count reaches a certain threshold:
ex :
results = []
threshold = 2
for i, row in enumerate(a > .7):
    if len([value for value in row if value == True]) > threshold:
        results.append(i)  # keep ids for each row that have more than 'threshold' times True
This is the non-optimized version of the code but I would love to achieve the same thing with numpy (I have a very large matrix to process).
I have been trying all sorts of things with np.where, but I can only get flattened results. I need the row numbers.
Thanks in advance!
To make results reproducible, use some seed:
>>> np.random.seed(100)
Then for a sample matrix
>>> a = np.random.random([5,5])
Count the number of occurrences along the axis with sum:
>>> (a >.7).sum(axis=1)
array([1, 0, 3, 1, 2])
You can get row numbers with np.where:
>>> np.where((a > .7).sum(axis=1) >= 2)
(array([2, 4]),)
To filter result, just use boolean indexing:
>>> a[(a > .7).sum(axis=1) >= 2]
array([[ 0.89041156, 0.98092086, 0.05994199, 0.89054594, 0.5769015 ],
[ 0.54468488, 0.76911517, 0.25069523, 0.28589569, 0.85239509]])
You can sum over an axis with sum, then use np.where on the resulting vector. Note that you need axis=1 to count per row (axis=0 would count per column), and the sum must be taken over the boolean mask a > .7:
results = np.where((a > .7).sum(axis=1) > threshold)
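With the boolean matrix shown in the question as a fixed stand-in for a > .7, the row-wise count-and-filter idea looks like:

```python
import numpy as np

# The boolean matrix from the question, standing in for `a > .7`
mask = np.array([[False, True,  True,  False, False],
                 [True,  False, False, False, True],
                 [True,  False, True,  True,  False],
                 [False, False, False, False, False],
                 [False, False, True,  False, False]])
threshold = 2

counts = mask.sum(axis=1)                # True count per row: [2, 2, 3, 0, 1]
results = np.where(counts > threshold)[0]
print(results)  # [2]
```

Only row 2 has strictly more than 2 True values, matching what the loop in the question would produce.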