Python: convert ASCII character to boolean array

I have a character. I want to represent its ascii value as a numpy array of booleans.
This works, but seems contorted. Is there a better way?
bin_str = bin(ord(mychar))
bool_array = array([int(x)>0 for x in list(bin_str[2:])], dtype=bool)
for
mychar = 'd'
the desired resulting value for bool_array is
array([ True, True, False, False, True, False, False], dtype=bool)

You can extract the bits from a uint8 array directly using np.unpackbits:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8))
EDIT: To get only the 7 relevant bits in a boolean array:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8)).astype(bool)[1:]
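A minimal sketch wrapping the approach above in a helper. The name char_to_bits and the width parameter are illustrative assumptions, not part of the original answer, and it assumes the character's code point fits in a single byte:

```python
import numpy as np

def char_to_bits(ch, width=7):
    """Return the lowest `width` bits (1 to 8) of ord(ch) as a boolean array, MSB first."""
    # Hypothetical helper: assumes ord(ch) < 256 so the value fits in a uint8.
    byte = np.array([ord(ch)], dtype=np.uint8)
    bits = np.unpackbits(byte).astype(bool)  # always 8 bits, most significant first
    return bits[-width:]

bits_d = char_to_bits('d')  # 'd' is 100, i.e. 0b1100100
print(bits_d)
```

Slicing from the end rather than with [1:] keeps the helper usable for other widths, for example width=8 to keep the leading zero bit.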

This is more or less the same thing:
>>> import numpy as np
>>> mychar = 'd'
>>> np.array(list(np.binary_repr(ord(mychar), width=7))) == '1'
array([ True, True, False, False, True, False, False], dtype=bool)
Is it less contorted?

Related

What is the most memory/storage efficient encoding scheme for fixed length boolean arrays?

I've got 3 million boolean numpy ndarrays, each of length 773, currently stored in a pandas dataframe. When they're being used, they need to be in the form of a fixed-length array, but in memory and in storage I can use whatever encoding scheme is smallest.
As of right now I'm just saving off the arrays directly into the dataframe, but I'm unsure if I should pack the booleans into a handful of integers and save them off or if there's a way to write arbitrary binary data into a dataframe and unpack that. In short, what's the smallest/easiest to use format for saving off these arrays?
Let's take a smaller minimal workable example (yeah, they help to attract more help!):
>>> X = np.random.randint(2, size=(3, 25), dtype=bool)
>>> X
array([[False, True, False, False, False, False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, False, True, True, True, False],
[False, True, True, False, False, True, True, True, True, False, True, False, False, True, True, False, True, False, True, True, True, True, False, False, True],
[ True, False, True, True, True, False, True, False, True, True, False, True, True, False, False, False, False, True, False, False, False, True, False, False, True]])
If you want to pack the elements of this array, use numpy.packbits:
>>> Y = np.packbits(X, axis=1)
>>> Y
array([[ 67, 222, 39, 0],
[103, 166, 188, 128],
[186, 216, 68, 128]], dtype=uint8)
It could be observed that elements of boolean type are indeed not memory-efficient:
def inspect_array(arr):
    print('Number of elements in the array:', arr.size)
    print('Length of one array element in bytes:', arr.itemsize)
    print('Total bytes consumed by the elements of the array:', arr.nbytes)
>>> inspect_array(X)
Number of elements in the array: 75
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 75
>>> inspect_array(Y)
Number of elements in the array: 12
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 12
You could also unpack bits using the following
Z = np.unpackbits(Y, axis=1).astype(bool)[:, :X.shape[1]]
and make sure this is right
>>> np.array_equal(X, Z)
True
It looks like the same problem remains in pandas. So you could convert your dataframe into a numpy array, pack/unpack the bits, and then turn it back into a dataframe.
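The pack/unpack round trip can be sketched for rows of length 773 as follows. The helper names pack_rows and unpack_rows and the random stand-in data are assumptions made for illustration; storing each packed row as a bytes object is one possible way to keep it in an object column, not the only one:

```python
import numpy as np

def pack_rows(bool_rows):
    """Pack a 2-D boolean array into uint8, eight flags per byte, row by row."""
    return np.packbits(bool_rows, axis=1)

def unpack_rows(packed_rows, n_bits):
    """Invert pack_rows; n_bits is the original row length (padding bits are dropped)."""
    return np.unpackbits(packed_rows, axis=1)[:, :n_bits].astype(bool)

rng = np.random.default_rng(0)
flags = rng.integers(0, 2, size=(1000, 773)).astype(bool)  # stand-in data

packed = pack_rows(flags)                       # 773 bits need 97 bytes per row
restored = unpack_rows(packed, flags.shape[1])

# A single packed row can also be kept as a plain bytes object and read back.
row_bytes = packed[0].tobytes()
row_again = np.frombuffer(row_bytes, dtype=np.uint8)

print(flags.nbytes, packed.nbytes)
```

The original row length has to be stored somewhere, because the last byte of each packed row is zero-padded.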

Python: comparing numpy array with sub-numpy array without loop

My problem is quite simple but I cannot figure how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
SubArray = np.array([8, 3, 5])
I would like to create a bool array that has the same size as the full array and that is True where a given value of FullArray is present in SubArray and False otherwise.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?
You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])
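For comparison, here is a sketch of the same membership test written with plain broadcasting. The snake_case variable names are illustrative; np.isin is usually the simpler choice, and this version builds a temporary array of size len(full_array) * len(sub_array):

```python
import numpy as np

full_array = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
sub_array = np.array([8, 3, 5])

# Compare every element of full_array with every element of sub_array
# (result shape (10, 3)), then check whether any comparison matched per row.
bool_array = (full_array[:, None] == sub_array).any(axis=1)
print(bool_array)
```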

Best practice to expand a list (efficiency) in python

I'm working with large data sets. I'm trying to use the NumPy library where I can or python features to process the data sets in an efficient way (e.g. LC).
First I find the relevant indexes:
dt_temp_idx = np.where(dt_diff > dt_temp_th)
Then I want to create a mask containing for each index a sequence starting from the index to a stop value, I tried:
mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]
and:
mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]
but it gives me the exception:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Example input:
indexes = [0, 100, 1000]
Example output with stop values after 10 integers from each indexes:
list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]
1) How can I solve it?
2) Is it the best practice to do it?
Masks (boolean arrays) are both memory-efficient and performant. We will use SciPy's binary_dilation to extend the thresholded mask.
Here's a step-by-step setup and solution run-
In [42]: # Random data setup
...: np.random.seed(0)
...: dt_diff = np.random.rand(20)
...: dt_temp_th = 0.9
In [43]: # Get mask of threshold crossings
...: mask = dt_diff > dt_temp_th
In [44]: mask
Out[44]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [45]: W = 3 # window size for extension (edit it according to your use-case)
In [46]: from scipy.ndimage import binary_dilation
In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))
In [48]: mask
Out[48]:
array([False, False, False, False, False, False, False, False, True,
False, False, False, False, True, False, False, False, False,
False, False])
In [49]: extm
Out[49]:
array([False, False, False, False, False, False, False, False, True,
True, True, False, False, True, True, True, False, False,
False, False])
Compare mask against extm to see how the extension takes place.
As we can see, the thresholded mask is extended by the window size W on the right side, giving the expected output mask extm. This can be used to mask out those elements of the input array: dt_diff[~extm] simulates deleting/dropping the elements through boolean indexing, or inversely dt_diff[extm] simulates selecting them.
Alternatives with NumPy based functions
Alternative #1
extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
Alternative #2
idx = np.flatnonzero(mask)
ext_idx = (idx[:,None]+ np.arange(W)).ravel()
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx<len(dt_diff)]] = False
# Get filtered o/p
out = dt_diff[ext_mask]
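As a rough check, the two NumPy alternatives can be run side by side on a hand-built mask with crossings at positions 8 and 13 (the same positions as in the session above). The names n and keep are assumptions introduced for this sketch:

```python
import numpy as np

W = 3   # window size
n = 20  # length of the input array
mask = np.zeros(n, dtype=bool)
mask[[8, 13]] = True  # threshold crossings

# Alternative #1: a crossing anywhere in the trailing window marks the position.
extm = np.convolve(mask.astype(int), np.ones(W, dtype=int))[:n] > 0

# Alternative #2: build the indices to drop, then clear them in a keep-mask.
idx = np.flatnonzero(mask)
ext_idx = (idx[:, None] + np.arange(W)).ravel()
keep = np.ones(n, dtype=bool)
keep[ext_idx[ext_idx < n]] = False

print(np.flatnonzero(extm))
print(np.array_equal(extm, ~keep))
```

Note that Alternative #2 produces the inverse mask: it is True for the elements to keep, whereas extm is True for the extended window.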
np.where returns a tuple of index arrays, so take its first element, dt_temp_idx[0]. That array is still a Python iterable, so you can use a good old Python list comprehension:
lst = [i for j in dt_temp_idx[0] for i in range(j, j+11)]
If you want to cope with overlapping sequences and turn the result back into a np.array, just do:
result = np.array(sorted({i for j in dt_temp_idx[0] for i in range(j, j+11)}))
But beware: using a set is robust and guarantees no repetition, but it could be more expensive than a simple list.
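A vectorized sketch of the same expansion, using the example input from the question. The step of 11 is an assumption chosen so that index 0 expands to 0..10 as in the example output:

```python
import numpy as np

indexes = np.array([0, 100, 1000])
step = 11  # 11 values per start index, so 0 expands to 0..10

# Add the offsets 0..step-1 to every start index via broadcasting, then flatten.
expanded = (indexes[:, None] + np.arange(step)).ravel()

# np.unique sorts and removes duplicates, which matters when windows overlap.
expanded_unique = np.unique(expanded)

print(expanded[:3], expanded[-3:])
```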

Python: numpy array larger and smaller than a value

How do I find the numbers that fall within a range?
c = np.array([2,3,4,5,6])
>>> c>3
array([False, False, True, True, True])
However, when I test c between two numbers, it returns an error
>>> 2<c<5
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The desired output is
array([False, True, True, False, False])
Try this,
(c > 2) & (c < 5)
Result
array([False, True, True, False, False], dtype=bool)
Python evaluates 2<c<5 as (2<c) and (c<5) which would be valid, except the and keyword doesn't work as we would want with numpy arrays. (It attempts to cast each array to a single boolean, and that behavior can't be overridden, as discussed here.) So for a vectorized and operation with numpy arrays you need to do this:
(2<c) & (c<5)
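A short runnable sketch of the element-wise version. np.logical_and is shown as an equivalent spelling, and the variable names are illustrative:

```python
import numpy as np

c = np.array([2, 3, 4, 5, 6])

# Parentheses are required: & binds more tightly than the comparison operators.
in_range = (c > 2) & (c < 5)

# np.logical_and is an equivalent spelling that avoids the precedence pitfall.
in_range_alt = np.logical_and(c > 2, c < 5)

print(in_range)
print(c[in_range])  # the values that fall strictly between 2 and 5
```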
You can do something like this :
import numpy as np
c = np.array([2,3,4,5,6])
output = [(i and j) for i, j in zip(c>2, c<5)]
Output :
[False, True, True, False, False]

Functional masking of numpy string array in Python

I'm trying to extract either the first (or only) floating point or integer from strings like these:
str1 = np.asarray('92834.1alksjdhaklsjh')
str2 = np.asarray('-987___-')
str3 = np.asarray('-234234.alskjhdasd')
where, if parsed correctly, we should get
var1 = 92834.1 #float
var2 = -987 #int
var3 = -234234.0 #float
Using the "masking" property of numpy arrays, I came up with something like this for any of the str_ variables, e.g.:
>> ma1 = np.asarray([not str.isalpha(c) for c in str1.tostring()],dtype=bool)
array([ True, True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False], dtype=bool)
>> str1[ma1]
IndexError: too many indices for array
Now I've read just about everything I can find about indexing using boolean arrays; but I can't get it to work.
It's simple enough that I don't think hunkering down to figure out a regex for it is worth it, but complex enough that it's been giving me trouble.
You cannot create an array of mixed types like that. If you want different types in one numpy array you could use a record array and specify the types, but a more direct way here is to convert your numpy object to a string and use re.search to get the number:
>>> import re
>>> float(re.search(r'[\d.-]+',str(str1)).group())
92834.1
>>> float(re.search(r'[\d.-]+',str(str2)).group())
-987.0
>>> float(re.search(r'[\d.-]+',str(str3)).group())
-234234.0
But if you want to use a numpy approach, you first need to create an array from your string, and the mask must itself be a boolean array so that ~ can invert it:
>>> st=str(str1)
>>> arr=np.array(list(st))
>>> mask=np.array([ch.isalpha() for ch in st])
>>> mask.tolist()
[False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]
>>> arr[~mask]
array(['9', '2', '8', '3', '4', '.', '1'], dtype='<U1')
And then use str.join method with float:
>>> float(''.join(arr[~mask]))
92834.1
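Since the question asks for an int when there is no decimal point and a float otherwise, here is a hedged sketch of a regex-based helper. The name first_number and the pattern are assumptions made for illustration; the pattern deliberately ignores forms such as ".5" or exponents:

```python
import re

# Optional sign, digits, then an optional decimal point with optional digits.
NUMBER_RE = re.compile(r'[-+]?\d+(\.\d*)?')

def first_number(text):
    """Return the first number in text as int or float, or None if there is none."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    token = match.group()
    # A decimal point means float (float('-234234.') is valid); otherwise int.
    return float(token) if '.' in token else int(token)

var1 = first_number('92834.1alksjdhaklsjh')
var2 = first_number('-987___-')
var3 = first_number('-234234.alskjhdasd')
print(var1, var2, var3)
```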
