Check for array - is value contained in another array? - python

I'd like to return a boolean for each value in array A that indicates whether it's in array B. This should be a standard procedure I guess, but I can't find any information on how to do it. My attempt is below:
A = ['User0','User1','User2','User3','User4','User0','User1','User2','User3'
'User4','User0','User1','User2','User3','User4','User0','User1','User2'
'User3','User4','User0','User1','User2','User3','User4','User0','User1'
'User2','User3','User4','User0','User1']
B = ['User3', 'User2', 'User4']
contained = (A in B)
However, I get the error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I'm using numpy so any solution using numpy or standard Python would be preferred.

You can use np.in1d, I believe (the NumPy docs recommend np.isin for new code, which does the same job):
np.in1d(A,B)

For testing it without using numpy, try:
contained = [a in B for a in A]
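result (29 entries rather than 32, because the listing of A above has no commas at its line breaks, so Python joins the adjacent string literals, e.g. 'User3' 'User4' becomes 'User3User4'):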
[False, False, True, True, True, False, False, True, False, False,
False, True, True, True, False, False, False, True, False, False,
True, True, True, False, False, True, True, False, False]
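The two approaches above can be compared side by side. This is a minimal sketch on a shortened copy of A (the shortening and the names B_set, contained_list and contained_np are my own, not from the question). Converting B to a set first is an optional tweak that keeps the pure-Python version fast when B is large.

```python
import numpy as np

A = ['User0', 'User1', 'User2', 'User3', 'User4', 'User0', 'User1']
B = ['User3', 'User2', 'User4']

# Pure Python: build the set once so each lookup is O(1) on average.
B_set = set(B)
contained_list = [a in B_set for a in A]

# NumPy: element-wise membership test, returns a boolean array.
contained_np = np.isin(A, B)

print(contained_list)
print(contained_np.tolist())
```

Both lines print the same list of booleans, one per element of A.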

Related

What is the most memory/storage efficient encoding scheme for fixed length boolean arrays?

I've got 3 million boolean numpy ndarrays, each of length 773, currently stored in a pandas dataframe. When they're being used, they need to be in the form of a fixed-length array, but in memory and in storage I can use whatever encoding scheme is smallest.
As of right now I'm just saving off the arrays directly into the dataframe, but I'm unsure if I should pack the booleans into a handful of integers and save them off or if there's a way to write arbitrary binary data into a dataframe and unpack that. In short, what's the smallest/easiest to use format for saving off these arrays?
Let's take a smaller minimal workable example (yeah, they help to attract more help!):
>>> X = np.random.randint(2, size=(3, 25), dtype=bool)
>>> X
array([[False, True, False, False, False, False, True, True, True, True, False, True, True, True, True, False, False, False, True, False, False, True, True, True, False],
[False, True, True, False, False, True, True, True, True, False, True, False, False, True, True, False, True, False, True, True, True, True, False, False, True],
[ True, False, True, True, True, False, True, False, True, True, False, True, True, False, False, False, False, True, False, False, False, True, False, False, True]])
If you want to pack the elements of this array, use numpy.packbits:
>>> Y = np.packbits(X, axis=1)
>>> Y
array([[ 67, 222, 39, 0],
[103, 166, 188, 128],
[186, 216, 68, 128]], dtype=uint8)
It could be observed that elements of boolean type are indeed not memory-efficient:
def inspect_array(arr):
    print('Number of elements in the array:', arr.size)
    print('Length of one array element in bytes:', arr.itemsize)
    print('Total bytes consumed by the elements of the array:', arr.nbytes)
>>> inspect_array(X)
Number of elements in the array: 75
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 75
>>> inspect_array(Y)
Number of elements in the array: 12
Length of one array element in bytes: 1
Total bytes consumed by the elements of the array: 12
You could also unpack bits using the following
Z = np.unpackbits(Y, axis=1).astype(bool)[:, :X.shape[1]]
and make sure this is right
>>> np.array_equal(X, Z)
True
It looks like the same problem remains in pandas, which also spends a full byte per boolean. So you could convert your dataframe to a numpy array, pack the bits for storage, and unpack them and rebuild the dataframe when you need the values.
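For rows of length 773, as in the question, the round trip might look like the sketch below. The helper names pack_row and unpack_row are hypothetical, and the random 100-row array only stands in for the real data. Each packed row is a plain bytes object, which I assume could be stored in an object column of a dataframe; that part is not shown.

```python
import numpy as np

n_bits = 773  # length of each boolean array (from the question)
X = np.random.randint(2, size=(100, n_bits)).astype(bool)  # stand-in data

def pack_row(row):
    """Pack one boolean row into a bytes object (padded to a whole byte)."""
    return np.packbits(row).tobytes()

def unpack_row(blob, length):
    """Inverse of pack_row: recover the first `length` booleans."""
    bits = np.unpackbits(np.frombuffer(blob, dtype=np.uint8))
    return bits[:length].astype(bool)

packed = [pack_row(row) for row in X]
restored = np.array([unpack_row(blob, n_bits) for blob in packed])

print(len(packed[0]))               # bytes per packed row
print(np.array_equal(X, restored))  # round-trip check
```

Each packed row takes 97 bytes instead of 773, since 773 bits round up to 97 whole bytes.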

Python: comparing numpy array with sub-numpy array without loop

My problem is quite simple but I cannot figure how to solve it without a loop.
I have a first numpy array:
FullArray = np.array([0,1,2,3,4,5,6,7,8,9])
and a sub array (not necessarily ordered in the same way):
SubArray = np.array([8, 3, 5])
I would like to create a bool array that has the same size as the full array and that holds True where a given value of FullArray is present in SubArray and False otherwise.
For example here I expect to get:
BoolArray = np.array([False, False, False, True, False, True, False, False, True, False])
Is there a way to do this without using a loop?
You can use np.isin:
np.isin(FullArray, SubArray)
# array([False, False, False, True, False, True, False, False, True, False])
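Once you have the mask, it can be used for more than a yes/no answer. A small sketch, where the names mask, matches, positions and missing are my own:

```python
import numpy as np

FullArray = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
SubArray = np.array([8, 3, 5])

mask = np.isin(FullArray, SubArray)

matches = FullArray[mask]         # values of FullArray that occur in SubArray
positions = np.flatnonzero(mask)  # their indices in FullArray
missing = FullArray[~mask]        # values that do not occur in SubArray

print(matches, positions, missing)
```

The results come back in the order of FullArray, not in the order of SubArray.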

How can you find the index of a list within a list of lists

I know for 1d arrays there is a function called np.in1d that allows you to find the indices of an array that are present in another array, for example:
a = [0,0,0,24210,0,0,0,0,0,21220,0,0,0,0,0,24410]
b = [24210,24610,24410]
np.in1d(a,b)
yields [False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, True]
I was wondering if there was a command like this for finding lists in a list of lists?
c = [[1,0,1],[0,0,1],[0,0,0],[0,0,1],[1,1,1]]
d = [[0,0,1],[1,0,1]]
something like np.in2d(c,d)
would yield [True, True, False, True, False]
Edit: I should add, I tried this with in1d and it flattens the 2d lists so it does not give the correct output.
I did np.in1d(c,d) and the result was [ True, True, True,
True, True, True, True, True, True, True, True, True, True, True,
True]
What about this?
[x in d for x in c]
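If you would rather stay in NumPy, one possible approach is to broadcast a row-by-row comparison. This is only a sketch, assuming both lists convert to 2D integer arrays with the same number of columns. The names pairwise and rows_in_d are mine, and there is no built-in np.in2d.

```python
import numpy as np

c = np.array([[1, 0, 1], [0, 0, 1], [0, 0, 0], [0, 0, 1], [1, 1, 1]])
d = np.array([[0, 0, 1], [1, 0, 1]])

# Compare every row of c with every row of d: shape (len(c), len(d), 3).
pairwise = c[:, None, :] == d[None, :, :]

# A row matches when all of its columns are equal; keep rows matching any row of d.
rows_in_d = pairwise.all(axis=2).any(axis=1)

print(rows_in_d.tolist())  # [True, True, False, True, False]
```

The intermediate array holds len(c) * len(d) * 3 booleans, so for very large inputs the list comprehension may be the more memory-friendly choice.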

Changing a certain index of boolean list of lists change others, too [duplicate]

I have a boolean list of lists. When I change one index from True to False, it also changes other elements in the list of lists. Why is this happening? Is there an alternative?
test = [[True]*9]*9
test[0][1] = False
print(test)
The output:
[[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True],
[True, False, True, True, True, True, True, True, True]]
What you want to do is :
test = [[True for i in range(cols)] for j in range(rows)]
#OR
test = [[True]*cols for j in range(rows)]
The problem with doing
test = [[True]*9]*9
is that the outer multiplication does not copy the inner list; it builds an outer list whose 9 entries all reference the same inner list object.
It is something like having :
test = [ [True, True, True, True, True, True, True, True, True] repeated 9 times ]
where every row is the same list in memory. So when you change a value in one row, the change shows up in that column of every row.
So, with the desired way,
test = [[True for i in range(9)] for j in range(9)]
test[0][1] = False
print(test)
will print :
[[True, False, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True]]
This is caused by the fact that the star (*) operator doesn't create n new independent lists. It creates n references to the same list. Creating a list with all independent lists can be done using a list comprehension, as is already suggested:
test = [[True for i in range(cols)] for j in range(rows)]
Yes, that's a classic Python gotcha. In the inner multiplication, you build a list of 9 references to the same True object. However, the True object is immutable, so you can't change it; assigning to one position just replaces that reference without affecting the others.
By contrast, in the outer multiplication, you create 9 references to the same inner list. Lists are mutable, so when you change it through one reference, the list itself changes. Since the outer list consists of references to that one list, they will all appear to change.
You can only avoid it by creating 9 different lists
test = [[True]*9 for i in range(9)]
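You can check the aliasing directly with the is operator. A small sketch on a 3x3 grid (the names shared and independent are just for illustration):

```python
shared = [[True] * 3] * 3                     # outer * repeats one inner list
independent = [[True] * 3 for _ in range(3)]  # a new inner list per row

print(shared[0] is shared[1])            # True: same list object
print(independent[0] is independent[1])  # False: distinct list objects

shared[0][1] = False
independent[0][1] = False
print(shared)       # every row shows the change
print(independent)  # only the first row shows the change
```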

Functional masking of numpy string array in Python

I'm trying to extract either the first (or only) floating point or integer from strings like these:
str1 = np.asarray('92834.1alksjdhaklsjh')
str2 = np.asarray('-987___-')
str3 = np.asarray('-234234.alskjhdasd')
where, if parsed correctly, we should get
var1 = 92834.1 #float
var2 = -987 #int
var3 = -234234.0 #float
Using the "masking" property of numpy arrays I come up with something like for any of the str_ variables, e.g.:
>> ma1 = np.asarray([not str.isalpha(c) for c in str1.tostring()],dtype=bool)
array([ True, True, True, True, True, True, True, False, False,
False, False, False, False, False, False, False, False, False,
False, False], dtype=bool)
>> str1[ma1]
IndexError: too many indices for array
Now I've read just about everything I can find about indexing using boolean arrays; but I can't get it to work.
It's simple enough that I don't think hunkering down to figure out a regex for it is worth it, but complex enough that it's been giving me trouble.
You cannot index it like that: np.asarray on a single string gives a zero-dimensional array, not an array of characters. If you want different types in one numpy array object you might use a record array and specify the types, but here a more straightforward way is to convert your numpy object to a string and use re.search to get the number:
>>> import re
>>> float(re.search(r'[\d.-]+',str(str1)).group())
92834.1
>>> float(re.search(r'[\d.-]+',str(str2)).group())
-987.0
>>> float(re.search(r'[\d.-]+',str(str3)).group())
-234234.0
But if you want to use a numpy approach you need to first create an array from your string :
>>> st=str(str1)
>>> arr=np.array(list(st))
>>> mask=np.array([c.isalpha() for c in st])
>>> mask.tolist()
[False, False, False, False, False, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True]
>>> arr[~mask]
array(['9', '2', '8', '3', '4', '.', '1'],
dtype='|S1')
And then use str.join method with float:
>>> float(''.join(arr[~mask]))
92834.1
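The question asked for an int when there is no decimal point and a float otherwise, while float(...) always gives a float. One way to get that is sketched below. The helper first_number and its pattern are my own assumptions, and it expects a plain Python string, so a numpy object would need str(...) first. The pattern only covers an optional minus sign, digits, and an optional decimal part, so forms like '.5' or '1e3' are not handled.

```python
import re

# Optional minus sign, digits, then an optional "." with optional digits.
NUMBER = re.compile(r'-?\d+(?:\.\d*)?')

def first_number(text):
    """Return the first number in text as int or float, or None if absent."""
    match = NUMBER.search(text)
    if match is None:
        return None
    token = match.group()
    # A decimal point means the value should be parsed as a float.
    return float(token) if '.' in token else int(token)

print(first_number('92834.1alksjdhaklsjh'))
print(first_number('-987___-'))
print(first_number('-234234.alskjhdasd'))
```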
