Python: comparing numpy array with sub-numpy array without loop - python

My problem is quite simple but I cannot figure how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
SubArray = np.array([8, 3, 5])
I would like to create a bool array that has the same size as the full array and that is True where the corresponding value of FullArray is present in SubArray and False otherwise.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?

You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])
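The mask np.isin returns can also be used directly for boolean indexing. A minimal runnable sketch (variable names taken from the question):

```python
import numpy as np

FullArray = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
SubArray = np.array([8, 3, 5])

# True wherever the element of FullArray occurs anywhere in SubArray;
# the order of SubArray does not matter
mask = np.isin(FullArray, SubArray)
print(mask.tolist())
# [False, False, False, True, False, True, False, False, True, False]

# The same mask selects the matching elements, in FullArray order
print(FullArray[mask])  # [3 5 8]
```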

Related

What is the most memory/storage efficient encoding scheme for fixed length boolean arrays?

I've got 3 million boolean numpy ndarrays, each of length 773, currently stored in a pandas dataframe. When they're being used, they need to be in the form of a fixed-length array, but in memory and storage I can use whatever encoding scheme is smallest.
As of right now I'm just saving off the arrays directly into the dataframe, but I'm unsure if I should pack the booleans into a handful of integers and save them off or if there's a way to write arbitrary binary data into a dataframe and unpack that. In short, what's the smallest/easiest to use format for saving off these arrays?
Let's take a smaller minimal workable example (yeah, they help to attract more help!):
>>> X = np.random.randint(2, size=(3, 25), dtype=bool)
>>> X
array([[False, True, False, False, False, False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, False, True, True, True, False],
[False, True, True, False, False, True, True, True, True, False, True, False, False, True, True, False, True, False, True, True, True, True, False, False, True],
[ True, False, True, True, True, False, True, False, True, True, False, True, True, False, False, False, False, True, False, False, False, True, False, False, True]])
If you want to pack the elements of this array, use numpy.packbits:
>>> Y = np.packbits(X, axis=1)
>>> Y
array([[ 67, 222, 39, 0],
[103, 166, 188, 128],
[186, 216, 68, 128]], dtype=uint8)
It can be observed that elements of boolean type are indeed not memory-efficient (each one takes a full byte):
def inspect_array(arr):
    print('Number of elements in the array:', arr.size)
    print('Length of one array element in bytes:', arr.itemsize)
    print('Total bytes consumed by the elements of the array:', arr.nbytes)
>>> inspect_array(X)
Number of elements in the array: 75
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 75
>>> inspect_array(Y)
Number of elements in the array: 12
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 12
You can unpack the bits again using
Z = np.unpackbits(Y, axis=1).astype(bool)[:, :X.shape[1]]
and verify that the round trip is lossless:
>>> np.array_equal(X, Z)
True
It looks like the same problem remains in pandas, so you could turn your dataframe into a numpy array, pack/unpack the bits there, and then make it back into a dataframe.
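For the sizes in the question, a quick sketch of the expected savings (the 1000-row shape here is a stand-in for the 3 million rows):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 773)).astype(bool)

# Each 773-bit row packs into ceil(773 / 8) = 97 bytes
Y = np.packbits(X, axis=1)
print(X.nbytes, Y.nbytes)  # 773000 97000 -- roughly 8x smaller

# Unpack and trim the zero padding packbits added to reach a byte boundary
Z = np.unpackbits(Y, axis=1)[:, :X.shape[1]].astype(bool)
print(np.array_equal(X, Z))  # True
```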

Best practice to expand a list (efficiency) in python

I'm working with large data sets. I'm trying to use the NumPy library where I can, or Python features such as list comprehensions, to process the data sets efficiently.
First I find the relevant indexes:
dt_temp_idx = np.where(dt_diff > dt_temp_th)
Then I want to create a mask containing for each index a sequence starting from the index to a stop value, I tried:
mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]
and:
mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]
but it gives me the exception:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Example input:
indexes = [0, 100, 1000]
Example output with stop values after 10 integers from each indexes:
list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]
1) How can I solve it?
2) Is it the best practice to do it?
Using masks (boolean arrays) is efficient: they are memory-efficient and performant too. We will make use of SciPy's binary dilation to extend the thresholded mask.
Here's a step-by-step setup and solution run:
In [42]: # Random data setup
...: np.random.seed(0)
...: dt_diff = np.random.rand(20)
...: dt_temp_th = 0.9
In [43]: # Get mask of threshold crossings
...: mask = dt_diff > dt_temp_th
In [44]: mask
Out[44]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [45]: W = 3 # window size for extension (edit it according to your use-case)
In [46]: from scipy.ndimage import binary_dilation
In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))
In [48]: mask
Out[48]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [49]: extm
Out[49]:
array([False, False, False, False, False, False, False, False, True,
True, True, False, False, True, True, True, False, False,
False, False])
Compare mask against extm to see how the extension takes place.
As we can see, the thresholded mask is extended by the window size W on the right side, as in the expected output mask extm. This can be used to mask out elements of the input array: dt_diff[~extm] simulates deleting/dropping the flagged elements via boolean indexing, and conversely dt_diff[extm] selects them.
Alternatives with NumPy based functions
Alternative #1
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
Alternative #2
idx = np.flatnonzero(mask)
ext_idx = (idx[:,None]+ np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx<len(dt_diff)]] = False
# Get filtered o/p
out = dt_diff[ext_mask]
dt_temp_idx here is actually the tuple returned by np.where; grab its first element (or use np.flatnonzero instead) to get the index array. That array is still a Python iterable, so you can use a good old Python list comprehension:
lst = [i for j in dt_temp_idx[0] for i in range(j, j + 11)]
If you want to cope with sequence overlaps and make the result back into a np.array, deduplicate with a set first (sorted() is needed both to restore order and because np.array on a raw set would produce a 0-d object array):
result = np.array(sorted({i for j in dt_temp_idx[0] for i in range(j, j + 11)}))
But beware: the set guarantees no repetition, but it can be more expensive than a simple list.
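For the concrete example input in the question, the broadcasting idea from Alternative #2 produces the expected flat index list without a Python-level loop (a step of 11 is assumed, i.e. each index plus the 10 that follow it):

```python
import numpy as np

indexes = np.array([0, 100, 1000])
step = 11  # each start index plus the next 10 positions

# Broadcast one row of consecutive offsets per start index, then flatten
expanded = (indexes[:, None] + np.arange(step)).ravel()
print(len(expanded), expanded[0], expanded[-1])  # 33 0 1010
```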

Python: How to pass subarrays of array into array function

The ultimate goal of my question is that I want to generate a new array 'output' by passing the subarrays of an array into a function, where the return of the function for each subarray generates a new element into 'output'.
My input array was generated as follows:
aggregate_predictors = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (so the array of 5 floats) through my function 'condition' and I want the return of each function call to fill in a new array 'output'. Basically, I want 'output' to contain 100 values.
def condition(array):
    return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
========
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
    lister.append(condition(input[i]))
output = np.array(lister)
That initial split and index does nothing. It just wraps the array in a list and then takes it out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests whether the element at index 4 (the fifth) of each row is < 0.5:
In [81]: def condition(array):
    ...:     return array[4] < 0.5

In [82]: lister = []
    ...: for i in range(100):
    ...:     lister.append(condition(x[i]))
    ...: output = np.array(lister)
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do the same just as easily with column indexing:
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole 4th column of the array.
You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape 100, 5 (you can check that with aggregate_predictors.shape). Your function returns whether or not the fifth column contains a value less than 0.5. You can do this with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to find the last column instead of the fifth, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this involves finding the correct index in the array. You should never have to convert anything to a list: numpy arrays are iterable as it is, and there are more complex iterators available.
That being said, your original intent was probably to do something like
input = np.split(aggregate_predictors, len(aggregate_predictors), axis=0)
OR
input = np.split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row matrices.
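As a sanity check, a short runnable sketch (with made-up random data) showing that the vectorized comparison matches the original loop exactly:

```python
import numpy as np

np.random.seed(0)
aggregate_predictors = np.random.rand(100, 5)

def condition(array):
    return array[4] < 0.5

# The loop-based version from the question
lister = []
for i in range(100):
    lister.append(condition(aggregate_predictors[i]))
looped = np.array(lister)

# The single vectorized expression over the whole fifth column
vectorized = aggregate_predictors[:, 4] < 0.5

print(np.array_equal(looped, vectorized))  # True
```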

Check for array - is value contained in another array?

I'd like to return a boolean for each value in array A that indicates whether it's in array B. This should be a standard procedure I guess, but I can't find any information on how to do it. My attempt is below:
A = ['User0','User1','User2','User3','User4','User0','User1','User2','User3'
'User4','User0','User1','User2','User3','User4','User0','User1','User2'
'User3','User4','User0','User1','User2','User3','User4','User0','User1'
'User2','User3','User4','User0','User1']
B = ['User3', 'User2', 'User4']
contained = (A in B)
However, I get the error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I'm using numpy so any solution using numpy or standard Python would be preferred.
You can use np.in1d (or its newer equivalent, np.isin):
np.in1d(A, B)
For testing it without using numpy, try:
contained = [a in B for a in A]
result:
[False, False, True, True, True, False, False, True, False, False,
False, True, True, True, False, False, False, True, False, False,
True, True, True, False, False, True, True, False, False]
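If you stay in plain Python, converting B to a set once makes each membership test an O(1) hash lookup instead of a scan of the list, which matters when A is long. A small sketch with shortened inputs:

```python
A = ['User0', 'User1', 'User2', 'User3', 'User4']
B = ['User3', 'User2', 'User4']

# set() is built once; every `a in B_set` is then a constant-time lookup
B_set = set(B)
contained = [a in B_set for a in A]
print(contained)  # [False, False, True, True, True]
```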

python: convert ascii character to boolean array

I have a character. I want to represent its ascii value as a numpy array of booleans.
This works, but seems contorted. Is there a better way?
bin_str = bin(ord(mychar))
bool_array = np.array([int(x) > 0 for x in bin_str[2:]], dtype=bool)
for
mychar = 'd'
the desired resulting value for bool_array is
array([ True, True, False, False, True, False, False], dtype=bool)
You can extract the bits from a uint8 array directly using np.unpackbits (note the value is wrapped in a one-element array, since unpackbits needs at least a 1-D input):
np.unpackbits(np.array([ord(mychar)], dtype=np.uint8))
EDIT: To get only the 7 relevant bits in a boolean array:
np.unpackbits(np.array([ord(mychar)], dtype=np.uint8)).astype(bool)[1:]
This is more or less the same thing (width=7 covers any 7-bit ASCII character, and comparing against '1' avoids the pitfall that astype(bool) treats every non-empty string, including '0', as True):
>>> import numpy as np
>>> mychar = 'd'
>>> np.array(list(np.binary_repr(ord(mychar), width=7))) == '1'
array([ True, True, False, False, True, False, False], dtype=bool)
Is it less contorted?
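Both answers round-trip cleanly, since np.packbits is the inverse of np.unpackbits; a short sketch recovering the character from its bits:

```python
import numpy as np

mychar = 'd'

# char -> 8 bits, most significant bit first
bits = np.unpackbits(np.array([ord(mychar)], dtype=np.uint8)).astype(bool)

# Drop the leading bit for the 7-bit ASCII view shown in the question
print(bits[1:].tolist())  # [True, True, False, False, True, False, False]

# bits -> char: packbits reassembles the byte
restored = chr(np.packbits(bits)[0])
print(restored)  # d
```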
