Count of specific case per row in matrix - python

I am fairly new to numpy and scientific computing and I have been struggling with a problem for several days, so I decided to post it here.
I am trying to count the occurrences of a specific condition in a numpy array.
In [233]: import numpy as np
In [234]: a= np.random.random([5,5])
In [235]: a >.7
Out[235]: array([[False, True, True, False, False],
[ True, False, False, False, True],
[ True, False, True, True, False],
[False, False, False, False, False],
[False, False, True, False, False]], dtype=bool)
I would like to count the number of occurrences of True in each row and keep the rows where this count reaches a certain threshold:
ex :
results = []
threshold = 2
for i, row in enumerate(a > .7):
    if len([value for value in row if value == True]) > threshold:
        results.append(i)  # keep ids for each row that has more than 'threshold' True values
This is the non-optimized version of the code but I would love to achieve the same thing with numpy (I have a very large matrix to process).
I have been trying all sorts of things with np.where but I can only get flattened results. I need the row number.
Thanks in advance !

To make results reproducible, use some seed:
>>> np.random.seed(100)
Then for a sample matrix
>>> a = np.random.random([5,5])
Count the number of occurrences along an axis with sum:
>>> (a >.7).sum(axis=1)
array([1, 0, 3, 1, 2])
You can get row numbers with np.where:
>>> np.where((a > .7).sum(axis=1) >= 2)
(array([2, 4]),)
To filter the rows, just use boolean indexing:
>>> a[(a > .7).sum(axis=1) >= 2]
array([[ 0.89041156, 0.98092086, 0.05994199, 0.89054594, 0.5769015 ],
[ 0.54468488, 0.76911517, 0.25069523, 0.28589569, 0.85239509]])
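The steps above can be wrapped into one small helper. This is only a minimal sketch: the function name `rows_with_enough_hits` and its parameters `cutoff` and `min_count` are made up for illustration, and the input is a small hand-written matrix so the result does not depend on the random generator.

```python
import numpy as np

def rows_with_enough_hits(a, cutoff, min_count):
    """Return indices of rows with at least `min_count` values above `cutoff`."""
    # Boolean mask summed per row: each True counts as 1
    counts = (a > cutoff).sum(axis=1)
    # np.flatnonzero returns the matching row indices as a 1-D array
    return np.flatnonzero(counts >= min_count)

# Hand-written example data (hypothetical)
a = np.array([[0.10, 0.80, 0.90],
              [0.20, 0.30, 0.40],
              [0.75, 0.10, 0.95]])

rows = rows_with_enough_hits(a, cutoff=0.7, min_count=2)
print(rows)     # indices of the rows that pass
print(a[rows])  # the rows themselves
```

Returning indices rather than the filtered rows keeps both options open: `a[rows]` selects the rows, while `rows` alone answers the "which row numbers" part of the question.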

You can sum the boolean mask over an axis with sum (axis=1 gives one count per row).
Then you can use np.where on the resulting vector.
results = np.where((a > .7).sum(axis=1) > threshold)[0]

Related

Best practice to expand a list (efficiency) in python

I'm working with large data sets. I'm trying to use the NumPy library where I can or python features to process the data sets in an efficient way (e.g. LC).
First I find the relevant indexes:
dt_temp_idx = np.where(dt_diff > dt_temp_th)
Then I want to create a mask containing for each index a sequence starting from the index to a stop value, I tried:
mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]
and:
mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]
but it gives me the exception:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Example input:
indexes = [0, 100, 1000]
Example output with stop values after 10 integers from each indexes:
list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]
1) How can I solve it?
2) Is it the best practice to do it?
Masks (boolean arrays) are both memory-efficient and fast. We will use SciPy's binary_dilation to extend the thresholded mask.
Here's a step-by-step setup and solution run-
In [42]: # Random data setup
...: np.random.seed(0)
...: dt_diff = np.random.rand(20)
...: dt_temp_th = 0.9
In [43]: # Get mask of threshold crossings
...: mask = dt_diff > dt_temp_th
In [44]: mask
Out[44]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [45]: W = 3 # window size for extension (edit it according to your use-case)
In [46]: from scipy.ndimage import binary_dilation
In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))
In [48]: mask
Out[48]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [49]: extm
Out[49]:
array([False, False, False, False, False, False, False, False, True,
True, True, False, False, True, True, True, False, False,
False, False])
Compare mask against extm to see how the extension takes place.
As we can see, the thresholded mask is extended to the right so that each True covers a window of size W, which gives the expected output mask extm. This can be used to mask out those elements in the input array: dt_diff[~extm] simulates deleting/dropping them via boolean indexing, and inversely dt_diff[extm] simulates selecting them.
Alternatives with NumPy based functions
Alternative #1
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
Alternative #2
idx = np.flatnonzero(mask)
ext_idx = (idx[:,None]+ np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx<len(dt_diff)]] = False
# Get filtered o/p
out = dt_diff[ext_mask]
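As a quick sanity check, the two NumPy alternatives can be compared on a hand-written mask. This is only a sketch: the mask below mirrors the threshold crossings at positions 8 and 13 from the run above, and the names `ext_conv` and `ext_idx_mask` are chosen for illustration. Note that it builds the extended mask itself (True inside each window), which is the inverse of ext_mask in Alternative #2.

```python
import numpy as np

n, W = 20, 3
mask = np.zeros(n, dtype=bool)
mask[[8, 13]] = True  # threshold crossings, as in the run above

# Alternative #1: convolution spreads each True over the next W positions
# (the mask is cast to int explicitly here to keep the dtypes obvious)
ext_conv = np.convolve(mask.astype(int), np.ones(W, dtype=int))[:n] > 0

# Alternative #2: build the window indices explicitly via broadcasting
idx = np.flatnonzero(mask)
ext_idx = (idx[:, None] + np.arange(W)).ravel()
ext_idx_mask = np.zeros(n, dtype=bool)
ext_idx_mask[ext_idx[ext_idx < n]] = True  # drop indices past the end

print(np.flatnonzero(ext_conv))
print(np.array_equal(ext_conv, ext_idx_mask))
```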
np.where returns a tuple of index arrays, so take its first element, dt_temp_idx[0]. That array is still a Python iterable, so you can use a good old Python list comprehension:
lst = [i for j in dt_temp_idx[0] for i in range(j, j+11)]
If you want to cope with overlapping sequences and turn the result back into a np.array, just do:
result = np.array(sorted({i for j in dt_temp_idx[0] for i in range(j, j+11)}))
But beware: a set is robust and guarantees no repetition, but it can be more expensive than a simple list.
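For large inputs, the same expansion can also be done without a Python-level loop by broadcasting. A minimal sketch using the example input from the question; the names `starts` and `step` are illustrative, and `step = 11` is an assumption based on the example output, which runs from each index to index + 10 inclusive.

```python
import numpy as np

starts = np.array([0, 100, 1000])  # example input from the question
step = 11                          # each run covers start, ..., start + 10

# Each row of the 2-D intermediate holds one run; ravel flattens them
expanded = (starts[:, None] + np.arange(step)).ravel()

# np.unique sorts and drops duplicates, which matters if runs overlap
expanded_unique = np.unique(expanded)

print(expanded[:12])
print(expanded.size)
```

With non-overlapping runs, as here, `expanded` and `expanded_unique` hold the same values, so the `np.unique` call can be skipped when overlaps are known not to occur.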

Python: How to pass subarrays of array into array function

The ultimate goal of my question is that I want to generate a new array 'output' by passing the subarrays of an array into a function, where the return of the function for each subarray generates a new element into 'output'.
My input array was generated as follows:
aggregate_predictors = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (so the array of 5 floats) through my function 'condition' and I want the return of each function call to fill in a new array 'output'. Basically, I want 'output' to contain 100 values.
def condition(array):
    return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
========
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
    lister.append(condition(input[i]))
output = np.array(lister)
That initial split and index does nothing. It just wraps the array in list, and then takes out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests whether the element at index 4 (the fifth) of each row is <.5:
In [81]: def condition(array):
    ...:     return array[4] < 0.5
    ...:
In [82]: lister = []
    ...: for i in range(100):
    ...:     lister.append(condition(x[i]))
    ...: output = np.array(lister)
    ...:
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do this just as easily with column indexing:
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole column at index 4 (the fifth column) of the array.
You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape 100, 5 (you can check that with aggregate_predictors.shape). Your function returns whether or not the fifth column contains a value less than 0.5. You can do this with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to find the last column instead of the fifth, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this involves finding the correct index in the array. You should never have to convert anything to a list: numpy arrays are iterable as it is, and there are more complex iterators available.
That being said, your original intent was probably to do something like
input = np.split(aggregate_predictors, len(aggregate_predictors), axis=0)
OR
input = np.split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row matrices.
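To check that the column expression matches the loop, and to see what a row-wise split yields, here is a small sketch. It uses a seeded generator so the run is repeatable; the names `rng`, `looped`, `vectorized`, and `rows` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for repeatability
aggregate_predictors = rng.random((100, 5))

def condition(array):
    return array[4] < 0.5

# Row-by-row version, as in the question
looped = np.array([condition(row) for row in aggregate_predictors])

# Vectorized version: compare the whole column at index 4 at once
vectorized = aggregate_predictors[:, 4] < 0.5
print(np.array_equal(looped, vectorized))

# Row-wise split: a list with one (1, 5) array per row
rows = np.split(aggregate_predictors, aggregate_predictors.shape[0])
print(len(rows), rows[0].shape)
```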

too many indices for array when using np.where

I have the code:
a=b=np.arange(9).reshape(3,3)
c=np.zeros(3)
for x in range(3):
    c[x]=np.average(b[np.where(a<x+3)])
The output of c is
>>>array([ 1. , 1.5, 2. ])
Instead of the for loop, I want to use array operations (vectorization), so I wrote the following code:
a=b=np.arange(9).reshape(3,3)
c=np.zeros(3)
i=np.arange(3)
c[i]=np.average(b[np.where(a<i[:,None,None]+3)])
But it shows IndexError: too many indices for array
As for a<i[:,None,None]+3
it correctly shows
array([[[ True, True, True],
[False, False, False],
[False, False, False]],
[[ True, True, True],
[ True, False, False],
[False, False, False]],
[[ True, True, True],
[ True, True, False],
[False, False, False]]], dtype=bool)
But when I use b[np.where(a<i[:,None,None]+3)], it again shows IndexError: too many indices for array. I cannot get the correct output of c.
I sense you are trying to vectorize things here, though it is not explicitly mentioned. I don't think you can index like that in a vectorized manner. To solve your question in a vectorized manner, I would suggest getting the sum-reduction more efficiently with matrix-multiplication, using np.tensordot together with the broadcasting you had already set up in your trials.
Thus, one solution would be -
from __future__ import division
i = np.arange(3)
mask = a<i[:,None,None]+3
c = np.tensordot(b,mask,axes=((0,1),(1,2)))/mask.sum((1,2))
Related post to understand tensordot.
Possible improvements on performance
Convert the mask to float dtype before feeding it to np.tensordot, as BLAS-based matrix-multiplication would be faster with it.
Use np.count_nonzero instead of np.sum for counting booleans. So, use it to replace mask.sum() part.
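Putting both suggestions together, a sketch might look like the following. It is checked against the loop from the question; `mask_f`, `sums`, `counts`, and `c_loop` are names introduced here for illustration.

```python
import numpy as np

a = b = np.arange(9).reshape(3, 3)
i = np.arange(3)

# One 3x3 boolean mask per threshold, stacked into shape (3, 3, 3)
mask = a < i[:, None, None] + 3

# Float mask so the underlying matrix product can use BLAS
mask_f = mask.astype(float)

# Sum of b over the True positions of each mask
sums = np.tensordot(b, mask_f, axes=((0, 1), (1, 2)))

# Number of True positions per mask
counts = np.count_nonzero(mask, axis=(1, 2))

c = sums / counts

# Reference: the loop from the question
c_loop = np.array([np.average(b[a < x + 3]) for x in range(3)])
print(c)
print(np.allclose(c, c_loop))
```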

Find indices in an array that contain one of values from another array

How to get the indices of values in an array (a) that match any value in another array (label) with more than one "marker"? For example, given
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
the goal is to find the indices of a with the value of 1 or 2; that is, 0, 1, 2, 3.
I tried several combinations. None of the following seems to work.
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
idx = where(a==label) # gives me only the index of the last value in label
idx = where(a==label[0] or label[1]) # Is confused by all or any?
idx = where(a==label[0] | label[1]) # gives me results as if nor. idx = [4,5]
idx = where(a==label[0] || label[1]) # syntax error
idx = where(a==bolean.or(label,0,1) # I know, this is not the correct form but I don`t remember it correctly but remember the error: also asks for a.all or a.any
idx = where(label[0] or label[1] in a) # gives me only the first appearance. index = 0. Also without where().
idx = where(a==label[0] or a==label[1]).all()) # syntax error
idx = where(a.any(0,label[0] or label[1])) # gives me only the first appearance. index=0. Also without where().
idx = where(a.any(0,label[0] | label[1])) # gives me only the first appearance. index=0. Also without where().
idx=where(a.any(0,label)) # Datatype not understood
Ok, I think you get my problem. Does anyone know how to do it correctly? Best would be a solution with a general label instead of label[x] so that the use of label is more variable for later changes.
You can use numpy.in1d (newer NumPy versions also provide numpy.isin, which serves the same purpose):
>>> a = numpy.array([1, 1, 2, 2, 3, 3])
>>> label = numpy.array([1, 2])
>>> numpy.in1d(a, label)
array([ True, True, True, True, False, False], dtype=bool)
The above returns a mask. If you want indices, you can call numpy.nonzero on the mask array.
Also, if the values in label array are unique, you can pass assume_unique=True to in1d to possibly speed it up.
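A minimal sketch of the mask-then-indices route, written with `np.isin` (assumed to be available, i.e. a reasonably recent NumPy) and `np.flatnonzero`, which returns the indices as a flat 1-D array:

```python
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

# Boolean mask: True where the element of a appears anywhere in label
mask = np.isin(a, label)

# Indices of the True entries
idx = np.flatnonzero(mask)
print(idx)
```

Because `label` is passed as a whole array, adding or removing markers later needs no change to the expression.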
np.where(a==label) is the same as np.nonzero(a==label). It tells us the coordinates (indexes) of all non-zero (or True) elements in the array, a==label.
So instead of trying all these different where expressions, focus on the conditional array.
Without the where, here's what some of your expressions produce:
In [40]: a==label # 2 arrays don't match in size, scalar False
Out[40]: False
In [41]: a==label[0] # result is the size of a
Out[41]: array([ True, True, False, False, False, False], dtype=bool)
In [42]: a==label[0] or label[1] # or is a Python scalar operation
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In [43]: a==label[0] | label[1]
Out[43]: array([False, False, False, False, True, True], dtype=bool)
This last is the same as a==(label[0] | label[1]), the | is evaluated before the ==.
You need to understand how each of those arrays (or scalars, or errors) is produced before you can understand what where gives you.
Correct combination of 2 equality tests (the extra () are important):
In [44]: (a==label[1]) | (a==label[0])
Out[44]: array([ True, True, True, True, False, False], dtype=bool)
Using broadcasting to separately test the 2 elements of label. Result is 2d array:
In [45]: a==label[:,None]
Out[45]:
array([[ True, True, False, False, False, False],
[False, False, True, True, False, False]], dtype=bool)
In [47]: (a==label[:,None]).any(axis=0)
Out[47]: array([ True, True, True, True, False, False], dtype=bool)
As I understand it, you want the indices of 1 and 2 in array "a".
In that case, try
label = [1, 2]
a = [1, 1, 2, 2, 3, 3]
idx_list = list()
for x in label:
    for i in range(len(a)):  # cover every index, including the last
        if a[i] == x:
            idx_list.append(i)
I think what I'm reading as your intent is to get the indices in the second list, 'a', of the values in the first list, 'labels'. I think that a dictionary is a good way to store this information where the labels will be keys and indices will be the values.
Try this:
labels = [1, 2]
a = [1, 1, 2, 2, 3, 3]
results = {}
for label in labels:
    results[label] = [i for i, x in enumerate(a) if x == label]
If you want the indices of 1, just call results[1]. The list comprehension and the enumerate function are the real MVPs here.
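If `a` is a NumPy array, the same per-label dictionary can be built with one vectorized comparison per label. This is a sketch rather than a drop-in replacement; it assumes the labels are hashable scalars.

```python
import numpy as np

labels = [1, 2]
a = np.array([1, 1, 2, 2, 3, 3])

# One vectorized comparison per label; values are index arrays
results = {label: np.flatnonzero(a == label) for label in labels}

print(results[1])
print(results[2])
```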

How do I search for indices that satisfy condition in numpy?

I have columns corresponding to a given day, month, and year in a numpy array called 'a'. I am comparing all three of these values to the columns of another array called 'b', which also correspond to day, month, and year, to find the index of 'a' that is equal to 'b'. So far I have tried:
a[:,3:6,1] == b[1,3:6]
array([[False, True, True],
[ True, True, True],
[False, True, True],
...,
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
which works fine but I need the row that corresponds to [True,True,True]
I've also tried:
np.where(a[:,3:6,1] == b[1,3:6], a[:,3:6,1])
ValueError: either both or neither of x and y should be given
and
a[:,:,1].all(a[:,3:6,1] == b[1,3:6])
TypeError: only length-1 arrays can be converted to Python scalars
What is a quick and easy way to do this?
You can use np.all() along the last axis:
rows = np.where((a[:,3:6,1]==b[1,3:6]).all(axis=1))[0]
This stores in rows the indices of the rows in which every value is True.
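The same pattern on a small, self-contained example. The arrays below are hypothetical stand-ins: they are 2-D for simplicity, whereas the question slices a 3-D array, and the names `a_dates` and `b_date` are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in data: each row is (day, month, year)
a_dates = np.array([[1, 5, 2014],
                    [2, 5, 2014],
                    [3, 5, 2014],
                    [2, 5, 2014]])
b_date = np.array([2, 5, 2014])

# Element-wise comparison, then require every column of a row to match
matches = (a_dates == b_date).all(axis=1)

# Indices of the fully matching rows
rows = np.where(matches)[0]
print(rows)
```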
