Best practice to expand a list (efficiency) in python - python

I'm working with large data sets. I'm trying to use the NumPy library where I can or python features to process the data sets in an efficient way (e.g. LC).
First I find the relevant indexes:
dt_temp_idx = np.where(dt_diff > dt_temp_th)
Then I want to create a mask containing for each index a sequence starting from the index to a stop value, I tried:
mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]
and:
mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]
but it gives me the exception:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Example input:
indexes = [0, 100, 1000]
Example output with stop values after 10 integers from each indexes:
list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]
1) How can I solve it?
2) Is it the best practice to do it?

Using masks (boolean arrays) are efficient being memory-efficient and performant too. We will make use of SciPy's binary-dilation to extend the thresholded mask.
Here's a step-by-step setup and solution run-
In [42]: # Random data setup
...: np.random.seed(0)
...: dt_diff = np.random.rand(20)
...: dt_temp_th = 0.9
In [43]: # Get mask of threshold crossings
...: mask = dt_diff > dt_temp_th
In [44]: mask
Out[44]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [45]: W = 3 # window size for extension (edit it according to your use-case)
In [46]: from scipy.ndimage.morphology import binary_dilation
In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))
In [48]: mask
Out[48]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [49]: extm
Out[49]:
array([False, False, False, False, False, False, False, False, True,
True, True, False, False, True, True, True, False, False,
False, False])
Compare mask against extm to see how the extension takes place.
As, we can see the thresholded mask is extended by window-size W on the right side, as is the expected output mask extm. This can be use to mask out those in the input array : dt_diff[~extm] to simulate the deleting/dropping of the elements from the input following boolean-indexing or inversely dt_diff[extm] to simulate selecting those.
Alternatives with NumPy based functions
Alternative #1
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
Alternative #2
idx = np.flatnonzero(mask)
ext_idx = (idx[:,None]+ np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx<len(dt_diff)]] = False
# Get filtered o/p
out = dt_diff[ext_mask]

dt_temp_idx is a numpy array, but still a Python iterable so you can use a good old Python list comprehension:
lst = [ i for j in dt_temp_idx for i in range(j, j+11)]
If you want to cope with sequence overlaps and make it back a np.array, just do:
result = np.array({i for j in dt_temp_idx for i in range(j, j+11)})
But beware the use of a set is robust and guarantee no repetition but it could be more expensive that a simple list.

Related

Python: comparing numpy array with sub-numpy array without loop

My problem is quite simple but I cannot figure how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
Sub array = np.array([8, 3, 5])
I would like to create a bool array that has the same size of the full array and that returns True if a given value of FullArray is present in the SubArray and False either way.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?
You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])

How can I shorten this 2D array code?

I have the following snippet which works fine.
P=im1.copy()
for i in range(P.shape[0]):
for j in range(P.shape[1]):
if (n_dens.data[i][j]==-5000 or T_k.data[i][j]==-5000):
P.data[i][j]=-5000
else :
P.data[i][j]=n_dens.data[i][j]*T_k.data[i][j]
where P is a 2D array.
I was wondering how to trim this down to something along the following lines:
P.data=n_dens.data*T_k.data
P.data=[foo-2.5005*10**7 if n_dens.data==-5000 or T_k.data==-5000 else foo for foo in P.data]
For my trial above I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
How can I correct the error? Or is there another method to trim it down?
The n_dens.data==-5000 produces an array of true/false values, not a single value. So, the if can't handle it. You are close to the idea though. You can use logical indexing in numpy.
Also logical operators cannot be overloaded in python. So, numpy does not handle them as you would wish. So, you have to do something like
index = np.logical_or(n_dens.data ==-5000, T_k.data==-5000)
P.data[index] = -5000
Similarly, P.data[np.logical_not(index)] = n_dens.data * T.data for the second branch of if-else.
You can try this:
P.data[(n_dens.data == -5000) | (T_k.data == -5000)] = -5000
cond = ~(n_dens.data == -5000) & ~(T_k.data == -5000) # 2D array of booleans
P.data[cond] = n_dens.data[cond] * T_k.data[cond]
A complete example:
import numpy as np
from copy import deepcopy
class IMAGE:
def __init__(self, data):
self.data = data
self.shape = self.data.shape
np.random.seed(0)
P, n_dens, T_k = IMAGE(np.zeros((5,5))), IMAGE(np.reshape(np.random.choice([-5000,1,2],25), (5,5))), IMAGE(3*np.ones((5,5)))
P1 = deepcopy(P)
# with loop
for i in range(P.shape[0]):
for j in range(P.shape[1]):
if (n_dens.data[i][j]==-5000 or T_k.data[i][j]==-5000):
P.data[i][j]=-5000
else :
P.data[i][j]=n_dens.data[i][j]*T_k.data[i][j]
# vectorized
P1.data[(n_dens.data == -5000) | (T_k.data == -5000)] = -5000
cond = ~(n_dens.data == -5000) & ~(T_k.data == -5000) # 2D array of booleans
P1.data[cond] = n_dens.data[cond] * T_k.data[cond]
cond
# array([[False, True, False, True, True],
# [ True, False, True, False, False],
# [False, True, True, True, True],
# [False, True, True, True, True],
# [False, True, False, False, True]], dtype=bool)
# with same output for both
P.data == P1.data
# array([[ True, True, True, True, True],
# [ True, True, True, True, True],
# [ True, True, True, True, True],
# [ True, True, True, True, True],
# [ True, True, True, True, True]], dtype=bool)

Python: How to pass subarrays of array into array function

The ultimate goal of my question is that I want to generate a new array 'output' by passing the subarrays of an array into a function, where the return of the function for each subarray generates a new element into 'output'.
My input array was generated as follows:
aggregate_input = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (so the array of 5 floats) through my function 'condition' and I want the return of each function call to fill in a new array 'output'. Basically, I want 'output' to contain 100 values.
def condition(array):
return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
========
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
lister.append(condition(input[i]))
output = np.array(lister)
That initial split and index does nothing. It just wraps the array in list, and then takes out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests if the 4th element of each row is <.5:
In [81]: def condition(array):
...:
...: return array[4] < 0.5
...:
In [82]: lister = []
...:
...: for i in range(100):
...: lister.append(condition(x[i]))
...:
...: output = np.array(lister)
...:
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do just as easily with column indexing
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole 4th column of the array.
You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape 100, 5 (you can check that with aggregate_predictors.shape). Your function returns whether or not the fifth column contains a value less than 0.5. You can do this with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to find the last column instead of the fifth, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this involves finding the correct index in the array. You should never have to convert anything to a list: numpy arrays are iterable as it is, and there are more complex iterators available.
That being said, your original intent was probably to do something like
input = split(aggregate_predictors, len(aggregate_predictors), axis=0)
OR
input = split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row matrices.

Count of specific case per row in matrix

I am fairly new to numpy and scientific computing and I struggle with a problem for several days, so I decided to post it here.
I am trying to get a count for a specific occurence of a condition in a numpy array.
In [233]: import numpy as np
In [234]: a= np.random.random([5,5])
In [235]: a >.7
Out[235]: array([[False, True, True, False, False],
[ True, False, False, False, True],
[ True, False, True, True, False],
[False, False, False, False, False],
[False, False, True, False, False]], dtype=bool)
What I would like to count the number of occurence of True in each row and keep the rows when this count reach a certain threshold:
ex :
results=[]
threshold = 2
for i,row in enumerate(a>.7):
if len([value for value in row if value==True]) > threshold:
results.append(i) # keep ids for each row that have more than 'threshold' times True
This is the non-optimized version of the code but I would love to achieve the same thing with numpy (I have a very large matrix to process).
I have been trying all sort of things with np.where but I only can get flatten results. I need the row number
Thanks in advance !
To make results reproducible, use some seed:
>>> np.random.seed(100)
Then for a sample matrix
>>> a = np.random.random([5,5])
Count number of occurences along axis with sum:
>>> (a >.7).sum(axis=1)
array([1, 0, 3, 1, 2])
You can get row numbers with np.where:
>>> np.where((a > .7).sum(axis=1) >= 2)
(array([2, 4]),)
To filter result, just use boolean indexing:
>>> a[(a > .7).sum(axis=1) >= 2]
array([[ 0.89041156, 0.98092086, 0.05994199, 0.89054594, 0.5769015 ],
[ 0.54468488, 0.76911517, 0.25069523, 0.28589569, 0.85239509]])
You can sum over axis with a.sum.
Then you can use where on the resulting vector.
results = np.where(a.sum(axis=0) < threshold))

python: convert ascii character to boolean array

I have a character. I want to represent its ascii value as a numpy array of booleans.
This works, but seems contorted. Is there a better way?
bin_str = bin(ord(mychar))
bool_array = array([int(x)>0 for x in list(bin_str[2:])], dtype=bool)
for
mychar = 'd'
the desired resulting value for bool_array is
array([ True, True, False, False, True, False, False], dtype=bool)
You can extract the bits from a uint8 array directly using np.unpackbits:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8))
EDIT: To get only the 7 relevant bits in a boolean array:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8)).astype(bool)[1:]
This is more or less the same thing:
>>> import numpy as np
>>> mychar = 'd'
>>> np.array(list(np.binary_repr(ord(mychar), width=4))).astype('bool')
array([ True, True, False, False, True, False, False], dtype=bool)
Is it less contorted?

Categories