Functional masking of numpy string array in Python - python

I'm trying to extract the first (or only) floating-point number or integer from strings like these:
str1 = np.asarray('92834.1alksjdhaklsjh')
str2 = np.asarray('-987___-')
str3 = np.asarray('-234234.alskjhdasd')
where, if parsed correctly, we should get
var1 = 92834.1 #float
var2 = -987 #int
var3 = -234234.0 #float
Using the "masking" property of numpy arrays, I came up with something like this for any of the str_ variables, e.g.:
>> ma1 = np.asarray([not str.isalpha(c) for c in str1.tostring()],dtype=bool)
array([ True, True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False], dtype=bool)
>> str1[ma1]
IndexError: too many indices for array
I've read just about everything I can find about indexing with boolean arrays, but I can't get it to work.
It's simple enough that I don't think hunkering down to figure out a regex for it is worth it, but complex enough that it's been giving me trouble.

You cannot create an array with mixed types like that. If you want to use different types in one numpy array object, you might use a record array and specify the field types. A more direct way here is to convert your numpy object to a string and use re.search (from the standard re module, so import re first) to get the number:
>>> float(re.search(r'[\d.-]+',str(str1)).group())
92834.1
>>> float(re.search(r'[\d.-]+',str(str2)).group())
-987.0
>>> float(re.search(r'[\d.-]+',str(str3)).group())
-234234.0
But if you want to use a numpy approach, you first need to create an array of characters from your string:
>>> st=str(str1)
>>> arr=np.array(list(st))
>>> mask=np.array([c.isalpha() for c in st])
>>> mask
array([False, False, False, False, False, False, False, True, True,
True, True, True, True, True, True, True, True, True,
True, True], dtype=bool)
>>> arr[~mask]
array(['9', '2', '8', '3', '4', '.', '1'],
dtype='|S1')
Then join the remaining characters with str.join and convert the result with float:
>>> float(''.join(arr[~mask]))
92834.1
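The question asked for an int when there is no decimal point and a float otherwise. A minimal sketch of that, assuming a stricter pattern than the `[\d.-]+` used above (the helper name `first_number` and the pattern are illustrative, not from the original answer):

```python
import re

# Optional sign, one or more digits, then an optional decimal point
# followed by optional digits. This pattern is an assumption for the sketch.
NUMBER_RE = re.compile(r'[-+]?\d+(\.\d*)?')

def first_number(text):
    """Return the first number found in text as int or float, or None."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    token = match.group()
    # Group 1 is only set when a decimal point was matched.
    return float(token) if match.group(1) else int(token)

print(first_number('92834.1alksjdhaklsjh'))  # 92834.1
print(first_number('-987___-'))              # -987
print(first_number('-234234.alskjhdasd'))    # -234234.0
```

Unlike `[\d.-]+`, this pattern does not accept stray runs such as `'--'` or `'1.2.3'` as a single token, so the conversion step should not raise on them.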

Related

What is the most memory/storage efficient encoding scheme for fixed length boolean arrays?

I've got 3 million boolean numpy ndarrays, each of length 773, currently stored in a pandas dataframe. When they're being used, they need to be in the form of a fixed-length array, but in memory and in storage I can use whatever encoding scheme is smallest.
As of right now I'm just saving off the arrays directly into the dataframe, but I'm unsure if I should pack the booleans into a handful of integers and save them off or if there's a way to write arbitrary binary data into a dataframe and unpack that. In short, what's the smallest/easiest to use format for saving off these arrays?
Let's take a smaller minimal workable example (yeah, they help to attract more help!):
>>> X = np.random.randint(2, size=(3, 25), dtype=bool)
>>> X
array([[False, True, False, False, False, False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, False, True, True, True, False],
[False, True, True, False, False, True, True, True, True, False, True, False, False, True, True, False, True, False, True, True, True, True, False, False, True],
[ True, False, True, True, True, False, True, False, True, True, False, True, True, False, False, False, False, True, False, False, False, True, False, False, True]])
If you want to pack the elements of this array, use numpy.packbits:
>>> Y = np.packbits(X, axis=1)
>>> Y
array([[ 67, 222, 39, 0],
[103, 166, 188, 128],
[186, 216, 68, 128]], dtype=uint8)
You can see that elements of boolean type are indeed not memory-efficient, since each one takes a full byte:
def inspect_array(arr):
    print('Number of elements in the array:', arr.size)
    print('Length of one array element in bytes:', arr.itemsize)
    print('Total bytes consumed by the elements of the array:', arr.nbytes)
>>> inspect_array(X)
Number of elements in the array: 75
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 75
>>> inspect_array(Y)
Number of elements in the array: 12
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 12
You could also unpack bits using the following
Z = np.unpackbits(Y, axis=1).astype(bool)[:, :X.shape[1]]
and make sure this is right
>>> np.array_equal(X, Z)
True
It looks like the same problem remains in pandas. So you could convert your dataframe to a numpy array, pack the bits for storage, unpack them when needed, and turn the result back into a dataframe.
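To put numbers on this for the length from the question, here is a small sketch using 1000 random rows of 773 booleans (the row count and the random data are assumptions for illustration only):

```python
import numpy as np

N_BITS = 773  # length of each boolean array, as in the question

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, N_BITS)).astype(bool)

# Pack each row into bytes; the last byte of a row is padded with zero bits.
packed = np.packbits(X, axis=1)
bytes_per_row = packed.shape[1]

# Unpack and drop the padding bits to get the original rows back.
restored = np.unpackbits(packed, axis=1)[:, :N_BITS].astype(bool)

print(X.nbytes, packed.nbytes, bytes_per_row)  # 773000 97000 97
print(np.array_equal(X, restored))             # True
```

At 97 bytes per row, 3 million rows would come to roughly 291 MB of raw data, against about 2.3 GB at one byte per boolean. These figures ignore any overhead from the container that holds the rows.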

Python: comparing numpy array with sub-numpy array without loop

My problem is quite simple but I cannot figure out how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
SubArray = np.array([8, 3, 5])
I would like to create a bool array that has the same size as the full array and that is True where a given value of FullArray is present in SubArray and False otherwise.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?
You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])
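As a runnable sketch of the same idea, with a broadcasting comparison added purely as a cross-check (the names `mask_broadcast` and `selected` are illustrative):

```python
import numpy as np

FullArray = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
SubArray = np.array([8, 3, 5])

# Element-wise membership test; the order of SubArray does not matter.
mask = np.isin(FullArray, SubArray)

# Alternative: compare every element of FullArray with every element of
# SubArray via broadcasting, then reduce over the SubArray axis.
mask_broadcast = (FullArray[:, None] == SubArray).any(axis=1)

# The boolean mask can be used directly for indexing.
selected = FullArray[mask]

print(mask.tolist())
print(selected)  # [3 5 8]
```

The broadcasting version builds a temporary array of shape (len(FullArray), len(SubArray)), so np.isin is usually the better choice when both arrays are large.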

Python: How to pass subarrays of array into array function

The ultimate goal of my question is that I want to generate a new array 'output' by passing the subarrays of an array into a function, where the return of the function for each subarray generates a new element into 'output'.
My input array was generated as follows:
aggregate_predictors = np.random.rand(100, 5)
input = np.split(aggregate_predictors, 1, axis=1)[0]
So now input appears as follows:
print(input[0:2])
>>[[ 0.61521025 0.07407679 0.92888063 0.66066605 0.95023826]
>> [ 0.0666379 0.20007622 0.84123138 0.94585421 0.81627862]]
Next, I want to pass each element of input (so the array of 5 floats) through my function 'condition' and I want the return of each function call to fill in a new array 'output'. Basically, I want 'output' to contain 100 values.
def condition(array):
    return array[4] < 0.5
How do I pass each element of input into condition without using any nasty loops?
========
Basically, I want to do this, but optimized:
lister = []
for i in range(100):
    lister.append(condition(input[i]))
output = np.array(lister)
That initial split and index does nothing. It just wraps the array in a list and then takes it out again:
In [76]: x=np.random.rand(100,5)
In [77]: y = np.split(x,1,axis=1)
In [78]: len(y)
Out[78]: 1
In [79]: y[0].shape
Out[79]: (100, 5)
The rest just tests whether the element at index 4 (the fifth one) of each row is < .5:
In [81]: def condition(array):
    ...:     return array[4] < 0.5
    ...:
In [82]: lister = []
    ...: for i in range(100):
    ...:     lister.append(condition(x[i]))
    ...: output = np.array(lister)
    ...:
In [83]: output
Out[83]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
We can do this just as easily with column indexing:
In [84]: x[:,4]<.5
Out[84]:
array([ True, False, False, True, False, True, True, False, False,
True, False, True, False, False, True, False, False, True,
False, True, False, True, False, False, False, True, False,
...], dtype=bool)
In other words, operate on the whole column at index 4 (the fifth column) of the array.
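A quick sketch to confirm that the loop and the column expression agree, using freshly generated random data (an assumption, since the original values are not available). The np.apply_along_axis line is included only as a fallback for row functions that cannot be expressed as array operations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 5))

def condition(row):
    # True when the fifth value of the row is below 0.5.
    return row[4] < 0.5

# Explicit Python loop over the rows.
looped = np.array([condition(row) for row in x])

# Vectorized: compare the whole column at index 4 in one step.
vectorized = x[:, 4] < 0.5

# Generic fallback: applies condition to each row, still looping internally.
applied = np.apply_along_axis(condition, 1, x)

print(np.array_equal(looped, vectorized))  # True
print(np.array_equal(looped, applied))     # True
```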
You are trying to make a very simple indexing expression very convoluted. If you read the docs for np.split very carefully, you will see that passing a second argument of 1 does absolutely nothing: it splits the array into one chunk. The following line is literally a no-op and should be removed:
input = np.split(aggregate_predictors, 1, axis=1)[0]
You have a 2D numpy array of shape (100, 5) (you can check that with aggregate_predictors.shape). Your function returns whether the value in the fifth column of a row is less than 0.5. You can do this for every row at once with a single vectorized expression:
output = aggregate_predictors[:, 4] < 0.5
If you want to find the last column instead of the fifth, use index -1 instead:
output = aggregate_predictors[:, -1] < 0.5
The important thing to remember here is that all the comparison operators are vectorized element-wise in numpy. Usually, vectorizing an operation like this involves finding the correct index in the array. You should never have to convert anything to a list: numpy arrays are iterable as it is, and there are more complex iterators available.
That being said, your original intent was probably to do something like
input = np.split(aggregate_predictors, len(aggregate_predictors), axis=0)
or
input = np.split(aggregate_predictors, aggregate_predictors.shape[0])
Both expressions are equivalent. They split aggregate_predictors into a list of 100 single-row arrays, each of shape (1, 5).
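To see what the different np.split calls return, here is a small sketch on a 4-by-5 stand-in for the (100, 5) array (the smaller size is an assumption to keep the output readable):

```python
import numpy as np

# Small stand-in for the (100, 5) array from the question.
aggregate_predictors = np.arange(20.0).reshape(4, 5)

# One section along axis 1: a list holding the whole array, unchanged.
one_chunk = np.split(aggregate_predictors, 1, axis=1)
print(len(one_chunk), one_chunk[0].shape)  # 1 (4, 5)

# One section per row along axis 0: a list of single-row 2D arrays.
rows = np.split(aggregate_predictors, aggregate_predictors.shape[0])
print(len(rows), rows[0].shape)            # 4 (1, 5)

# Plain iteration already yields the rows, as 1D arrays.
row_shapes = [row.shape for row in aggregate_predictors]
print(row_shapes[0])                       # (5,)
```

Note that the pieces returned by np.split keep both dimensions, while iterating over the array gives 1D rows, so `piece[4]` would fail on a split piece where `row[4]` works.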

Check for array - is value contained in another array?

I'd like to return a boolean for each value in array A that indicates whether it's in array B. This should be a standard procedure I guess, but I can't find any information on how to do it. My attempt is below:
A = ['User0','User1','User2','User3','User4','User0','User1','User2','User3',
'User4','User0','User1','User2','User3','User4','User0','User1','User2',
'User3','User4','User0','User1','User2','User3','User4','User0','User1',
'User2','User3','User4','User0','User1']
B = ['User3', 'User2', 'User4']
contained = (A in B)
However, I get the error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I'm using numpy so any solution using numpy or standard Python would be preferred.
You can use np.in1d, I believe (np.isin is the newer spelling of the same test):
np.in1d(A,B)
For testing it without using numpy, try:
contained = [a in B for a in A]
Result:
[False, False, True, True, True, False, False, True, True, True,
False, False, True, True, True, False, False, True, True, True,
False, False, True, True, True, False, False, True, True, True,
False, False]
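A short sketch showing that the same membership test works on string arrays, with a set-based pure-Python version for comparison (the shortened `A` and the name `B_set` are assumptions for the example):

```python
import numpy as np

# Shortened version of the list from the question.
A = np.array(['User0', 'User1', 'User2', 'User3', 'User4'] * 2)
B = np.array(['User3', 'User2', 'User4'])

# numpy: one boolean per element of A.
contained = np.isin(A, B)

# Pure Python: a set gives constant-time membership tests on average,
# which matters when B is long.
B_set = set(B.tolist())
contained_py = [a in B_set for a in A.tolist()]

print(contained.tolist())
# [False, False, True, True, True, False, False, True, True, True]
```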

python: convert ascii character to boolean array

I have a character. I want to represent its ascii value as a numpy array of booleans.
This works, but seems contorted. Is there a better way?
bin_str = bin(ord(mychar))
bool_array = np.array([int(x) > 0 for x in bin_str[2:]], dtype=bool)
for
mychar = 'd'
the desired resulting value for bool_array is
array([ True, True, False, False, True, False, False], dtype=bool)
You can extract the bits from a uint8 array directly using np.unpackbits:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8))
EDIT: To get only the 7 relevant bits in a boolean array:
np.unpackbits(np.array(ord(mychar), dtype=np.uint8)).astype(bool)[1:]
This is more or less the same thing:
>>> import numpy as np
>>> mychar = 'd'
>>> np.array(list(np.binary_repr(ord(mychar), width=7))) == '1'
array([ True, True, False, False, True, False, False], dtype=bool)
Is it less contorted?
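For comparison, here is a sketch that avoids string conversion entirely by shifting and masking the character code. The helper name `char_to_bits` and its `n_bits` parameter are illustrative, not part of numpy:

```python
import numpy as np

def char_to_bits(ch, n_bits=7):
    """Return the n_bits low-order bits of ord(ch), most significant first."""
    code = ord(ch)
    # Shift amounts n_bits-1 ... 0 bring each bit into the lowest position.
    shifts = np.arange(n_bits - 1, -1, -1)
    return ((code >> shifts) & 1).astype(bool)

bits = char_to_bits('d')

# Same result via np.unpackbits on a one-element uint8 array,
# dropping the leading (eighth) bit.
bits_unpacked = np.unpackbits(np.array([ord('d')], dtype=np.uint8))[1:].astype(bool)

print(bits.tolist())
# [True, True, False, False, True, False, False]
```

The np.unpackbits route only works for code points that fit in a uint8, whereas the shift-based helper accepts a larger `n_bits` for characters outside that range.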
