Convert two boolean columns to class ID in Pandas - python

I have two boolean columns:
df = pd.DataFrame([[True, True],
                   [True, False],
                   [False, True],
                   [True, True],
                   [False, False]],
                  columns=['col1', 'col2'])
I need to generate a new column that identifies which unique combination they belong to:
result = pd.Series([0, 1, 2, 0, 3])
Seems like there should be a very simple way to do this but it's escaping me. Maybe something using sklearn.preprocessing? Simple Pandas or Numpy solutions are equally preferred.
EDIT: Would be really nice if the solution could scale to more than 2 columns

The simplest solution is to create tuples and pass them to factorize:
print (pd.Series(pd.factorize(df.apply(tuple, axis=1))[0]))
0    0
1    1
2    2
3    0
4    3
dtype: int64
Another solution is to cast to string and sum:
print (pd.Series(pd.factorize(df.astype(str).sum(axis=1))[0]))
0    0
1    1
2    2
3    0
4    3
dtype: int64
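If the columns are strictly boolean, a third option (my sketch, not from the original answers) is to pack each row into an integer code by treating the columns as bits, then factorize the codes; this stays vectorized for any number of columns:
import numpy as np
import pandas as pd

# Each row becomes a unique integer: [True, True] -> 3, [True, False] -> 1, ...
codes = df.astype(int).to_numpy() @ (2 ** np.arange(df.shape[1]))
print(pd.Series(pd.factorize(codes)[0]))  # 0, 1, 2, 0, 3 as above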

I've never used pandas before, but here is a solution in plain Python that I'm sure wouldn't be hard to adapt to pandas:
a = [[True, True],
     [True, False],
     [False, True],
     [True, True],
     [False, False]]

ids, result = [], []  # ids keeps a list of previously seen items; result keeps the result
for x in a:
    if x in ids:  # x has been seen before
        id = ids.index(x)  # find old id
        result.append(id)
    else:  # x hasn't been seen before
        id = len(ids)  # create new id
        result.append(id)
        ids.append(x)
print(result)  # [0, 1, 2, 0, 3]
This works with any number of columns. To get the result into a Series, just use:
result = pd.Series(result)
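Adapting it to pandas is as small a change as claimed; a minimal sketch (assuming the df from the question), feeding the DataFrame's rows in as lists:
import pandas as pd

ids, result = [], []
for x in df.values.tolist():  # one list per row, any number of columns
    if x not in ids:
        ids.append(x)
    result.append(ids.index(x))
result = pd.Series(result)  # 0, 1, 2, 0, 3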


How to use broadcast feature of numpy on a pandas dataframe with list columns of different lengths

I am trying to use the broadcasting feature of numpy on my large data. I have list columns that can have hundreds of elements in many rows. I need to filter rows based on the presence of one column's value in the other list column: if the number in col_a is present in col_b, I need to keep that row.
Sample data:
import pandas as pd
import numpy as np

dt = pd.DataFrame({'id'   : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [[1], [2], [5], [1], [2]],
                   'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
                   })
dt
  id col_a      col_b
0  a   [1]        [2]
1  a   [2]     [2, 4]
2  a   [5]  [2, 5, 7]
3  b   [1]        [4]
4  b   [2]     [3, 2]
I tried the code below to add a dimension to col_b and check if the value is present in col_a:
(dt['col_a'] == dt['col_b'][:,None]).any(axis = 1)
but I get the error below:
ValueError: ('Shapes must match', (5,), (5, 1))
Could someone please let me know what the correct approach is?
import pandas as pd
import numpy as np
from itertools import product
Expand the list columns so that each combination gets its own row:
dt2 = pd.DataFrame([j for i in dt.values for j in product(*i)], columns=dt.columns)
Filter to where col_a equals col_b:
dt2 = dt2[dt2['col_a'] == dt2['col_b']]
Results in:
   id  col_a  col_b
1   a      2      2
4   a      5      5
8   b      2      2
I think you've been told that numpy "vectorization" is the key to speeding up your code, but you don't have a good grasp of what this means. It isn't something magical that you can apply to any pandas task. It's just shorthand for making full use of numpy array methods, which means actually learning numpy.
But let's explore your task:
In [205]: dt = pd.DataFrame({'id'   : ['a', 'a', 'a', 'b', 'b'],
     ...:                    'col_a': [[1], [2], [5], [1], [2]],
     ...:                    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
     ...:                    })
In [206]: dt
Out[206]:
  id col_a      col_b
0  a   [1]        [2]
1  a   [2]     [2, 4]
2  a   [5]  [2, 5, 7]
3  b   [1]        [4]
4  b   [2]     [3, 2]
In [207]: dt.dtypes
Out[207]:
id       object
col_a    object
col_b    object
dtype: object
Because the columns contain lists, their dtype is object; they have references to lists.
Doing things like == on pandas columns (Series) is not the same as doing things with the arrays of their values.
But to focus on the numpy aspect, let's get the numpy arrays:
In [208]: a = dt['col_a'].to_numpy()
In [209]: b = dt['col_b'].to_numpy()
In [210]: a
Out[210]:
array([list([1]), list([2]), list([5]), list([1]), list([2])],
      dtype=object)
In [211]: b
Out[211]:
array([list([2]), list([2, 4]), list([2, 5, 7]), list([4]), list([3, 2])],
      dtype=object)
The fast numpy operations use compiled code and, for the most part, only work with numeric dtypes. Arrays like this, containing references to lists, are basically the same as lists. Math, and other operations like equality, operate at list-comprehension speeds. That may be faster than pandas speeds, but nowhere near the highly vaunted "vectorized" numpy speeds.
So let's do a list comprehension on the elements of these arrays. This is a lot like pandas apply, though I think it's faster (pandas apply is notoriously slow).
In [212]: [i in j for i,j in zip(a,b)]
Out[212]: [False, False, False, False, False]
Oops, no matches - that must be because each i from a is a list. Let's extract the number:
In [213]: [i[0] in j for i,j in zip(a,b)]
Out[213]: [False, True, True, False, True]
Making col_a contain lists instead of numbers does not help you.
Since a and b are arrays, we can use ==, but that's essentially the same operation as [212] (its timeit is slightly better):
In [214]: a==b
Out[214]: array([False, False, False, False, False])
We could make b into a (5,1) array, but why?
In [215]: b[:,None]
Out[215]:
array([[list([2])],
       [list([2, 4])],
       [list([2, 5, 7])],
       [list([4])],
       [list([3, 2])]], dtype=object)
I think you were trying to imitate an array comparison like this one, broadcasting a (5,) against a (3,1) to produce a (3,5) truth table:
In [216]: x = np.arange(5); y = np.array([3,5,1])
In [217]: x==y[:,None]
Out[217]:
array([[False, False, False,  True, False],
       [False, False, False, False, False],
       [False,  True, False, False, False]])
In [218]: (x==y[:,None]).any(axis=1)
Out[218]: array([ True, False, True])
isin can do the same sort of comparison:
In [219]: np.isin(x,y)
Out[219]: array([False, True, False, True, False])
In [220]: np.isin(y,x)
Out[220]: array([ True, False, True])
While this works for numbers, it does not work for the arrays of lists, especially not in your case, where you want to test each list in a against the corresponding list in b. You aren't testing all of a against all of b.
Since the lists in a are all the same size, we can join them into one number array:
In [225]: np.hstack(a)
Out[225]: array([1, 2, 5, 1, 2])
We cannot do the same for b because those lists vary in size. As a general rule, when you have lists (or arrays) that vary in size, you cannot do the fast numeric numpy math and comparisons.
We could test (5,) a against (5,1) b, producing a (5,5) truth table:
In [227]: a==b[:,None]
Out[227]:
array([[False,  True, False, False,  True],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])
But that is True for a couple of cells in the first row; that's where the list([2]) from b matches the same list in a.
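Putting that together, a minimal sketch of the working filter (assuming, as in the sample data, that every col_a list holds exactly one number):
mask = [i[0] in j for i, j in zip(dt['col_a'], dt['col_b'])]
dt[mask]  # keeps rows 1, 2 and 4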

How do I pass a list as changing condition in an array?

Let's say that I have a numpy array a = [1 2 3 4 5 6 7 8] and I want to change everything but 1, 2 and 3 to 0. With a list b = [1,2,3] I tried a[a not in b] = 0, but Python does not accept this. Currently I'm using a for loop like this:
c = np.unique(a)
for i in c:
    if i not in b:
        a[a == i] = 0
This works very slowly (around 900 different values in a 3D array of roughly 1000x1000x1000) and doesn't feel like the optimal solution for numpy. Is there a more optimal way of doing it in numpy?
You can use numpy.isin() to create a boolean mask to use as an index:
np.isin(a, b)
# array([ True, True, True, False, False, False, False, False])
Use ~ to do the opposite:
~np.isin(a, b)
# array([False, False, False, True, True, True, True, True])
Using this to index the original array lets you assign zero to the specific elements:
a = np.array([1,2,3,4,5,6,7,8])
b = np.array([1, 2, 3])
a[~np.isin(a, b)] = 0
print(a)
# [1 2 3 0 0 0 0 0]
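Note that np.isin compares elementwise over any shape, so the same masking applies directly to the 3D case from the question; a small sketch with a random volume standing in for the 1000x1000x1000 array:
vol = np.random.randint(0, 900, size=(10, 10, 10))  # stand-in for the large 3D array
vol[~np.isin(vol, b)] = 0                           # zero out everything not in b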

Apply numpy 'where' along one of axes

I have an array like that:
array = np.array([
    [True, False],
    [True, False],
    [True, False],
    [True, True],
])
I would like to find the last occurrence of True for each row of the array.
If it were a 1d array, I would do it this way:
np.where(array)[0][-1]
How do I do something similar in 2D? Kind of like:
np.where(array, axis = 1)[0][:,-1]
but there is no axis argument in np.where.
Since True is greater than False, find the position of the largest element in each row. Unfortunately, argmax finds the first largest element, not the last one. So, reverse the array sideways, find the first True from the end, and recalculate the indexes:
(array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)
# array([0, 0, 0, 1])
The method fails if there are no True values in a row. You can check whether that's the case by dividing by array.max(axis=1): a row with no Trues will have its last True at infinity :)
array[0, 0] = False
((array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)) / array.max(axis=1)
#array([inf, 0., 0., 1.])
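If you would rather have a plain sentinel such as -1 than inf for all-False rows, a variant (my sketch, not part of the original answer) swaps the division for np.where:
last = (array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)
np.where(array.any(axis=1), last, -1)
# array([-1,  0,  0,  1])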
I found an older answer but didn't like that it returns 0 for both a True in the first position, and for a row of False.
So here's a way to solve that problem, if it's important to you:
import numpy as np
arr = np.array([[False, False, False],  # -1
                [False, False, True],   #  2
                [True, False, False],   #  0
                [True, False, True],    #  2
                [True, True, False],    #  1
                [True, True, True],     #  2
                ])
# Make an adjustment for no Trues at all.
adj = np.sum(arr, axis=1) == 0
# Get the position and adjust.
x = np.argmax(np.cumsum(arr, axis=1), axis=1) - adj
# Compare to expected result:
assert np.all(x == np.array([-1, 2, 0, 2, 1, 2]))
print(x)
Gives [-1 2 0 2 1 2].

Python/Numpy: Combining boolean masks by row in grouped columns in multidimensional array

I have a 3D boolean array (a 2D numpy array of boolean mask arrays) with r rows and c cols. In the example below the array shape is (3, 6, 2); 3 rows and 6 columns, where each column contains 2 elements.
maskArr = np.array([
    [[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
    [[False, True], [False, True], [True, True], [False, True], [True, True], [True, True]],
    [[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
])
# If n=2: |<- AND these 2 cols ->|<- AND these 2 cols ->|<- AND these 2 cols ->|
# If n=3: |<----- AND these 3 cols ----->|<----- AND these 3 cols ----->|
I know I can use np.all(maskArr, axis=1) to AND together all the mask arrays in each row, as in a previous answer, but instead I would like to AND together the boolean arrays in each row in increments of n columns.
So if we start with 6 columns, as above, and n=2, I would like to apply the equivalent of np.all on every 2 columns for an end result of 3 columns, where:
The first column of the result array equals the rows of the first 2 columns of the original array ANDed together: result[:,0] = np.all(maskArr[:,0:2], axis=1)
The second column of the result array equals the rows of the next 2 columns ANDed together: result[:,1] = np.all(maskArr[:,2:4], axis=1)
And the third column of the result array equals the rows of the last 2 columns ANDed together: result[:,2] = np.all(maskArr[:,4:6], axis=1)
Is there a way to use np.all (or another vectorized approach) to get this result?
Expected result with n=2:
>>> np.array([
    [[True, False], [True, True], [True, True]],
    [[False, True], [False, True], [True, True]],
    [[True, False], [True, True], [True, True]],
])
Note: The array I'm working with is extremely large so I'm looking for a vectorized approach to minimize performance impact. The actual boolean arrays can be thousands of elements long.
I've tried:
n = 2
c = len(maskArr[0])          ## c = 6 (number of columns)
nResultColumns = int(c / n)  ## nResultColumns = 3
combinedMaskArr = [np.all(maskArr[:, i*n:i*n+n], axis=1) for i in range(nResultColumns)]
which gives me:
>>> [
    array([[ True, False], [False,  True], [ True, False]]),
    array([[ True,  True], [False,  True], [ True,  True]]),
    array([[ True,  True], [ True,  True], [ True,  True]])
]
The output above is not the expected format or values.
Any guidance or suggestions on how to get to the expected result?
Thank you in advance.
The following works, if I understood your problem correctly.
import math
import numpy as np

n = 2
mask_arr = maskArr  # the question's array
cols = mask_arr.shape[1]
chunks = math.ceil(cols / n)
groups = np.array_split(np.swapaxes(mask_arr, 0, 1), chunks)
combined = np.array([np.all(g, axis=0) for g in groups])
result = np.swapaxes(combined, 0, 1)
If cols is divisible by n, I think this works:
n = 2
rows, cols = mask_arr.shape[0:2]
result = np.all(mask_arr.reshape(rows, cols // n, n, -1), axis=2)
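A quick sanity check of the reshape approach against the expected n=2 output from the question (a sketch; mask_arr stands for the question's maskArr):
import numpy as np

mask_arr = maskArr  # the (3, 6, 2) array from the question
rows, cols = mask_arr.shape[0:2]
result = np.all(mask_arr.reshape(rows, cols // 2, 2, -1), axis=2)
expected = np.array([
    [[True, False], [True, True], [True, True]],
    [[False, True], [False, True], [True, True]],
    [[True, False], [True, True], [True, True]],
])
assert np.array_equal(result, expected)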

Delete rows from a multidimensional array in Python

Im trying to delete specific rows in my numpy array that following certain conditions.
This is an example:
a = np.array([[1, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [1, 0, 1, 0, 0]])
I want to be able to delete all rows where specific columns are zero; this array could be a lot bigger.
In this example, if the first two elements are zero, or if the last two elements are zero, the row is deleted.
It could be any combination, not only the first or the last elements.
This should be the final result:
a = np.array([[1, 1, 0, 0, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1]])
For example, if I try:
a[:,0:2] == 0
After reading:
Remove lines with empty values from multidimensional-array in php
and this question: How to delete specific rows from a numpy array using a condition?
But they don't seem to apply to my case, or probably I'm not understanding something here, as nothing works for my case.
This gives me True, True for every row where the first two elements are zero:
array([[False, False],
       [ True,  True],
       [ True, False],
       [False,  True],
       [ True,  True],
       [False,  True]])
and for the last two columns being zero, the last row should be deleted too. So at the end I will only be left with 3 rows.
a[:,3:5] == 0
array([[ True, False],
       [False, False],
       [False, False],
       [ True, False],
       [ True, False],
       [ True,  True]])
I'm trying something like this, but I don't understand how to tell it to only give me the rows that follow the condition:
(a[a[:,0:2]] == 0).all(axis=1)
array([[ True,  True, False, False, False],
       [False, False,  True,  True, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False,  True,  True, False],
       [False, False, False, False, False]])
(a[((a[:,0])& (a[:,1])) ] == 0).all(axis=1)
and this shows everything as False.
Could you please guide me a bit?
Thank you.
Just adding to the question: it won't always be the first 2 or the last 2 columns. If my matrix has 35 columns, it could be columns 6 to 10, and then columns 20 and 25. A user will be able to decide which columns should trigger the deletion.
Try this:
idx0 = (a[:,0:2] == 0).all(axis=1)
idx1 = (a[:,-2:] == 0).all(axis=1)
a[~(idx0 | idx1)]
The first two steps select the indices of the rows that match your filtering criteria. Then do an or (|) operation and a not (~) operation to obtain the final indices you want.
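Since the edit says the columns are user-chosen, the same idea generalizes to arbitrary groups; a sketch where the groups list is a hypothetical user input, one list of column indices per condition:
groups = [[0, 1], [3, 4]]  # hypothetical user-chosen column groups
drop = np.zeros(len(a), dtype=bool)
for g in groups:
    drop |= (a[:, g] == 0).all(axis=1)  # rows where the whole group is zero
a[~drop]  # array([[1, 1, 0, 0, 1], [0, 1, 0, 1, 1], [1, 0, 1, 0, 1]])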
If I understood correctly you could do something like this:
import numpy as np
a = np.array([[1, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [1, 0, 1, 0, 0]])
left = np.count_nonzero(a[:, :2], axis=1) != 0
a = a[left]
right = np.count_nonzero(a[:, -2:], axis=1) != 0
a = a[right]
print(a)
Output
[[1 1 0 0 1]
 [0 1 0 1 1]
 [1 0 1 0 1]]
Or, a shorter version:
left = np.count_nonzero(a[:, :2], axis=1) != 0
right = np.count_nonzero(a[:, -2:], axis=1) != 0
a = a[(left & right)]
Use the following mask:
np.any(a[:, :2], axis=1) & np.any(a[:, -2:], axis=1)
If you want to create a filtered view:
a[np.any(a[:, :2], axis=1) & np.any(a[:, -2:], axis=1)]
If you want to create a new array:
np.delete(a, np.where(~(np.any(a[:, :2], axis=1) & np.any(a[:, -2:], axis=1))), axis=0)
