Python: How to pass subarrays of array into array function - python

The ultimate goal of my question is to generate a new array 'output' by passing the subarrays of an array into a function, where the return value of the function for each subarray becomes a new element of 'output'.
My input array was generated as follows:
aggregate_predictors = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (so the array of 5 floats) through my function 'condition' and I want the return of each function call to fill in a new array 'output'. Basically, I want 'output' to contain 100 values.
def condition(array):
    return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
========
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
    lister.append(condition(input[i]))
output = np.array(lister)

That initial split and index does nothing. It just wraps the array in a list and then takes it out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests whether the element at index 4 (the fifth element) of each row is < 0.5:
In [81]: def condition(array):
    ...:     return array[4] < 0.5
    ...:
In [82]: lister = []
    ...: for i in range(100):
    ...:     lister.append(condition(x[i]))
    ...: output = np.array(lister)
    ...:
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do this just as easily with column indexing:
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole column at index 4 (the fifth column) of the array.
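To check that the loop and the column comparison really agree, here is a minimal sketch. The seed and the names rng, looped and vectorized are illustrative choices, not part of the original post.

```python
import numpy as np

# Illustrative data: 100 rows of 5 random floats (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.random((100, 5))

def condition(row):
    # Same test as in the question: is the value at index 4 below 0.5?
    return row[4] < 0.5

# Row-by-row loop, as in the question
looped = np.array([condition(row) for row in x])

# Vectorized comparison on the whole column at index 4
vectorized = x[:, 4] < 0.5

print(np.array_equal(looped, vectorized))  # True
```

Both results are boolean arrays of length 100; the vectorized form simply skips the Python-level loop.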

You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape (100, 5) (you can check that with aggregate_predictors.shape). Your function tests, for a given row, whether the value in the fifth column is less than 0.5. You can do this for all rows with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to test the last column regardless of how many columns there are, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this involves finding the correct index in the array. You should never have to convert anything to a list: numpy arrays are iterable as it is, and there are more complex iterators available.
That being said, your original intent was probably to do something like
input = np.split(aggregate_predictors, len(aggregate_predictors), axis=0)
OR
input = np.split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row arrays, each of shape (1, 5).
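As a rough illustration of the difference between the two kinds of split, the following sketch prints what each call returns. The names no_op and rows are made up for this example.

```python
import numpy as np

aggregate_predictors = np.random.rand(100, 5)

# Splitting into 1 section returns a one-element list holding the whole array
no_op = np.split(aggregate_predictors, 1, axis=1)
print(len(no_op), no_op[0].shape)  # 1 (100, 5)

# Splitting into as many sections as there are rows gives single-row arrays
rows = np.split(aggregate_predictors, aggregate_predictors.shape[0])
print(len(rows), rows[0].shape)  # 100 (1, 5)

# The vectorized test needs no split at all
output = aggregate_predictors[:, 4] < 0.5
print(output.shape)  # (100,)
```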

Related

Python: comparing numpy array with sub-numpy array without loop

My problem is quite simple but I cannot figure how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
SubArray = np.array([8, 3, 5])
I would like to create a bool array that has the same size as the full array and that contains True where a given value of FullArray is present in SubArray and False otherwise.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?
You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])
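If positions or the complement are needed rather than the mask itself, something along these lines should work. The names positions and not_in_sub are illustrative only.

```python
import numpy as np

FullArray = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
SubArray = np.array([8, 3, 5])

# Boolean membership mask, same length as FullArray
BoolArray = np.isin(FullArray, SubArray)

# Positions in FullArray whose values occur in SubArray
positions = np.flatnonzero(BoolArray)
print(positions.tolist())  # [3, 5, 8]

# invert=True flags the values that are NOT in SubArray
not_in_sub = np.isin(FullArray, SubArray, invert=True)
print(FullArray[not_in_sub].tolist())  # [0, 1, 2, 4, 6, 7, 9]
```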

Inverting boolean array using np.invert

I have two boolean arrays a and b. I want a resulting boolean array c in which each element of a is inverted where the corresponding element of b is True and kept as is where it is False.
a = np.array([True, False, True, True, False])
b = np.array([True, False, False, False, True])
c = np.invert(a, where=b)
Expected output:
c = np.array([False, False, True, True, True])
However this is the output I'm getting:
c = np.array([False, False, False, False, True])
Why is this so?
You need to pass an out array to supply the values for the elements where the condition is False. Otherwise those elements are unpredictable. Note that out=a overwrites a in place:
In [242]: np.invert(a,where=b, out=a)
Out[242]: array([False, False, True, True, True])
Passing where=b to numpy.invert doesn't mean "keep the original a values for cells not selected by b". It means "don't write anything to the output array for cells not selected by b". Since you didn't pass an initialized out array, the unselected cells are filled with whatever garbage happened to be in that memory when it was allocated.
Since NumPy has some free lists for small array buffers, we can demonstrate that the output is uninitialized garbage by getting NumPy to reuse an allocation filled with whatever we want:
import numpy
a = numpy.zeros(4, dtype=bool)
numpy.array([True, False, True, False])
print(repr(numpy.invert(a, where=a)))
Output:
array([ True, False, True, False])
In this example, we can see that NumPy reused the buffer from the array we created but didn't save. Since where=a selected no cells, numpy.invert didn't write anything to the buffer, and the result is exactly the contents of the discarded array.
As for the operation you wanted to perform, that's just XOR: c = a ^ b
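A small sketch comparing three ways to get the expected result, under the assumption that a should stay unmodified. The names c_xor, c_where and c_out are hypothetical.

```python
import numpy as np

a = np.array([True, False, True, True, False])
b = np.array([True, False, False, False, True])

# XOR flips a wherever b is True and leaves it alone elsewhere
c_xor = a ^ b

# np.where picks ~a where b is True, otherwise a
c_where = np.where(b, ~a, a)

# np.invert with where= needs an initialized out; a copy keeps a intact
c_out = np.invert(a, where=b, out=a.copy())

print(c_xor.tolist())  # [False, False, True, True, True]
```

All three produce the same array; the XOR form is the shortest and allocates only the result.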

At least one True value per column in numpy boolean array

Suppose I have a very big 2D boolean array (for the sake of the example, let's take dimensions 4 lines x 3 columns):
toto = np.array([[True, True, False],
[False, True, False],
[True, False, False],
[False, True, False]])
I want to transform toto so that it contains at least one True value per column, while leaving the other columns untouched.
EDIT: The rule is just this: if a column is all False, I want to introduce a True in a random line.
So in this example, one of the False in the 3rd column should become True.
How would you do that efficiently?
Thank you in advance
You can do it like this:
col_mask = ~np.any(toto, axis=0)
row_idx = np.random.randint(toto.shape[0], size=np.sum(col_mask))
toto[row_idx, col_mask]=True
col_mask is array([False, False, True]), marking the columns that need to be changed.
row_idx is an array of randomly chosen row indexes, one for each of those columns.
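The same idea as a self-contained sketch, with a check that only the all-False column changed. The seeded generator rng and the copy named before are additions for illustration; the column mask is converted to integer positions explicitly so the pairing with row_idx is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
toto = np.array([[True, True, False],
                 [False, True, False],
                 [True, False, False],
                 [False, True, False]])
before = toto.copy()

# Columns that contain no True at all
col_mask = ~toto.any(axis=0)

# One random row index per all-False column
row_idx = rng.integers(toto.shape[0], size=col_mask.sum())

# Pair each random row with its all-False column and set that cell
toto[row_idx, np.flatnonzero(col_mask)] = True

print(toto.any(axis=0).all())  # True
```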
import numpy as np
toto = np.array([[False, True, False], [False, True, False],
[False, False, False], [False, True, False]])
# First we get a boolean array indicating columns that have at least one True value
mask = np.any(toto, axis=0)
# Now we invert the mask to get columns indexes (as boolean array) with no True value
mask = np.logical_not(mask)
# Notice that if we index with this mask on the column dimension we get elements
# in all rows only in the columns containing no True value. The dimension is
# "num_rows x num_columns_without_true"
toto[:, mask]
# Now we need random indexes for rows in the columns containing only false. That
# means an array of integers from zero to `num_rows - 1` with
# `num_columns_without_true` elements
row_indexes = np.random.randint(toto.shape[0], size=np.sum(mask))
# Now we can use the row indexes together with the column mask to select one False element in each column containing only False elements and set them to True
toto[row_indexes, mask] = True
Disclaimer: mathfux was faster with essentially the same solution as the one I was writing (accept his answer if this is what you were looking for), but since I was writing with more comments I decided to post anyway.

Best practice to expand a list (efficiency) in python

I'm working with large data sets. I'm trying to use the NumPy library where I can or python features to process the data sets in an efficient way (e.g. LC).
First I find the relevant indexes:
dt_temp_idx = np.where(dt_diff > dt_temp_th)
Then I want to create a mask containing, for each index, a sequence running from the index to a stop value. I tried:
mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]
and:
mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]
but it gives me the exception:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Example input:
indexes = [0, 100, 1000]
Example output with stop values 10 integers after each index:
list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]
1) How can I solve it?
2) Is it the best practice to do it?
Using masks (boolean arrays) is both memory-efficient and fast. We will use SciPy's binary dilation to extend the thresholded mask.
Here's a step-by-step setup and solution run:
In [42]: # Random data setup
...: np.random.seed(0)
...: dt_diff = np.random.rand(20)
...: dt_temp_th = 0.9
In [43]: # Get mask of threshold crossings
...: mask = dt_diff > dt_temp_th
In [44]: mask
Out[44]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [45]: W = 3 # window size for extension (edit it according to your use-case)
In [46]: from scipy.ndimage import binary_dilation
In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))
In [48]: mask
Out[48]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [49]: extm
Out[49]:
array([False, False, False, False, False, False, False, False, True,
True, True, False, False, True, True, True, False, False,
False, False])
Compare mask against extm to see how the extension takes place.
As we can see, the thresholded mask is extended to the right so that each crossing covers W positions, which gives the expected output mask extm. It can be used to mask out those elements of the input array: dt_diff[~extm] simulates deleting or dropping them via boolean indexing, and conversely dt_diff[extm] simulates selecting them.
Alternatives with NumPy based functions
Alternative #1
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
Alternative #2
idx = np.flatnonzero(mask)
ext_idx = (idx[:,None]+ np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx<len(dt_diff)]] = False
# Get filtered o/p
out = dt_diff[ext_mask]
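One detail worth checking is that Alternative #2 builds a keep-mask, so it should be the complement of the mask from Alternative #1. A NumPy-only sketch of that check, reusing the setup values from the run above:

```python
import numpy as np

np.random.seed(0)
dt_diff = np.random.rand(20)
dt_temp_th = 0.9
W = 3

mask = dt_diff > dt_temp_th

# Alternative #1: convolution spreads each True over the next W positions
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)] > 0

# Alternative #2: build the indices to drop, then a keep-mask
idx = np.flatnonzero(mask)
ext_idx = (idx[:, None] + np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx < len(dt_diff)]] = False

# ext_mask marks elements to keep, so it is the complement of extm
print(np.array_equal(ext_mask, ~extm))  # True
```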
dt_temp_idx comes from np.where, which returns a tuple holding one index array per dimension, so iterate over its first element. That array is still a Python iterable, so you can use a good old Python list comprehension:
lst = [i for j in dt_temp_idx[0] for i in range(j, j+11)]
If you want to cope with overlapping sequences and turn the result back into a np.array, just do:
result = np.array(sorted({i for j in dt_temp_idx[0] for i in range(j, j+11)}))
But beware: a set is robust and guarantees no repetition, but it can be more expensive than a simple list.
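For large inputs, a broadcasting version avoids the Python-level loop entirely. This is only a sketch using the example input from the question; the names step, expanded, overlapping and merged are made up here, and step = 11 is chosen to match the example output (0 through 10 inclusive).

```python
import numpy as np

indexes = np.array([0, 100, 1000])
step = 11  # 0..10 inclusive, matching the example output in the question

# Add each start index to the offsets 0..step-1, then flatten row by row
expanded = (indexes[:, None] + np.arange(step)).ravel()
print(expanded[:12].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

# np.unique sorts and removes duplicates if the ranges overlap
overlapping = np.array([0, 5])
merged = np.unique((overlapping[:, None] + np.arange(step)).ravel())
print(merged.size)  # 16
```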

Find indices in an array that contain one of values from another array

How can I get the indices of the values in an array (a) that match any of several "markers" in another array (label)? For example, given
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
the goal is to find the indices of a with the value of 1 or 2; that is, 0, 1, 2, 3.
I tried several combinations. None of the following seems to work.
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
idx = where(a==label) # gives me only the index of the last value in label
idx = where(a==label[0] or label[1]) # Is confused by all or any?
idx = where(a==label[0] | label[1]) # gives me results as if nor. idx = [4,5]
idx = where(a==label[0] || label[1]) # syntax error
idx = where(a==bolean.or(label,0,1) # I know, this is not the correct form but I don`t remember it correctly but remember the error: also asks for a.all or a.any
idx = where(label[0] or label[1] in a) # gives me only the first appearance. index = 0. Also without where().
idx = where(a==label[0] or a==label[1]).all()) # syntax error
idx = where(a.any(0,label[0] or label[1])) # gives me only the first appearance. index=0. Also without where().
idx = where(a.any(0,label[0] | label[1])) # gives me only the first appearance. index=0. Also without where().
idx=where(a.any(0,label)) # Datatype not understood
Ok, I think you get my problem. Does anyone know how to do it correctly? Best would be a solution with a general label instead of label[x] so that the use of label is more variable for later changes.
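You can use numpy.in1d (in NumPy 2.0 and later it is deprecated in favor of numpy.isin, which takes the same arguments here):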
>>> a = numpy.array([1, 1, 2, 2, 3, 3])
>>> label = numpy.array([1, 2])
>>> numpy.in1d(a, label)
array([ True, True, True, True, False, False], dtype=bool)
The above returns a mask. If you want indices, you can call numpy.nonzero on the mask array.
Also, if the values in label array are unique, you can pass assume_unique=True to in1d to possibly speed it up.
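Putting the mask and the index lookup together, a minimal sketch might look like this. It uses numpy.isin rather than in1d, and the names mask and idx are illustrative.

```python
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

# Boolean mask of the elements of a that appear in label
mask = np.isin(a, label)

# Turn the mask into integer indices
idx = np.nonzero(mask)[0]
print(idx.tolist())  # [0, 1, 2, 3]
```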
np.where(a==label) is the same as np.nonzero(a==label). It tells us the coordinates (indexes) of all non-zero (or True) elements in the array a==label.
So instead of trying all these different where expressions, focus on the conditional array.
Without the where, here's what some of your expressions produce:
In [40]: a==label # 2 arrays don't match in size, scalar False
Out[40]: False
In [41]: a==label[0] # result is the size of a
Out[41]: array([ True, True, False, False, False, False], dtype=bool)
In [42]: a==label[0] or label[1] # or is a Python scalar operation
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In [43]: a==label[0] | label[1]
Out[43]: array([False, False, False, False, True, True], dtype=bool)
This last is the same as a==(label[0] | label[1]); the | is evaluated before the ==.
You need to understand how each of those arrays (or scalar or error) is produced before you can understand what where gives you.
The correct combination of 2 equality tests (the extra () are important):
In [44]: (a==label[1]) | (a==label[0])
Out[44]: array([ True, True, True, True, False, False], dtype=bool)
Use broadcasting to test the 2 elements of label separately. The result is a 2d array, which any(axis=0) then reduces to one value per element of a:
In [45]: a==label[:,None]
Out[45]:
array([[ True, True, False, False, False, False],
[False, False, True, True, False, False]], dtype=bool)
In [47]: (a==label[:,None]).any(axis=0)
Out[47]: array([ True, True, True, True, False, False], dtype=bool)
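To go from the broadcast comparison to the indices the question asks for, one possible sketch is the following. The names matches and idx are illustrative.

```python
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

# One row per label, one column per element of a
matches = a == label[:, None]
print(matches.shape)  # (2, 6)

# Collapse the label axis, then convert the mask to indices
idx = np.flatnonzero(matches.any(axis=0))
print(idx.tolist())  # [0, 1, 2, 3]
```

This works for a label array of any length, at the cost of a temporary array of shape (len(label), len(a)).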
As I understand it, you want the indices of 1 and 2 in array "a".
In that case, try
label = [1, 2]
a = [1, 1, 2, 2, 3, 3]
idx_list = list()
for x in label:
    for i in range(len(a)):  # cover every index, including the last
        if a[i] == x:
            idx_list.append(i)
I think what I'm reading as your intent is to get the indices in the second list, 'a', of the values in the first list, 'labels'. I think that a dictionary is a good way to store this information where the labels will be keys and indices will be the values.
Try this:
labels = [1, 2]
a = [1, 1, 2, 2, 3, 3]
results = {}
for label in labels:
    results[label] = [i for i, x in enumerate(a) if x == label]
If you want the indices of 1, just call results[1]. The list comprehension and the enumerate function are the real MVPs here.
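If a is already a NumPy array, the same dictionary can be built without the inner Python loop. This is a sketch rather than a drop-in replacement; it swaps the enumerate-based comprehension for np.flatnonzero.

```python
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
labels = [1, 2]

# Map each label to the positions where it occurs in a
results = {label: np.flatnonzero(a == label).tolist() for label in labels}
print(results)  # {1: [0, 1], 2: [2, 3]}
```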
