Selecting pandas rows: why isin() and not "i in mylist"?

I have a dataframe with a multi-index and need to select only the rows where the first index is not in a list. This works:
df = df.iloc[~(df.index.get_level_values(0).isin(mylist))]
This doesn't:
df = df.iloc[df.index.get_level_values(0) not in mylist]
I get an error about the truth value of the array.
Why? What does it mean? Is it documented in the official docs?

Say you have a dataframe df as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(30).reshape((6,5)))
tuples = [(i//2, i%2) for i in range(6)]
df.index = pd.MultiIndex.from_tuples(tuples)
print(df)
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
2 0  0.212923  0.580401  0.024150  0.712987  0.803497
  1  0.804538  0.035597  0.611101  0.328159  0.140793
df.index.get_level_values(0) will return an Int64Index: Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
The error says that with the in operator it is not clear whether you want to check that all elements of that array are in the list, or that any element is: you are comparing the whole array against the whole list. What you want is an element-wise comparison, and in does not do that; even if the intent were clear, in would return a single value.
df.index.get_level_values(0).isin([0, 1]), on the other hand, returns an array of boolean values: array([ True, True, True, True, False, False], dtype=bool). It checks whether the first 0 is in the list, whether the second 0 is in the list, whether the first 1 is in the list, and so on. Those boolean values are then used to slice the dataframe (i.e. show only the rows where the array has a True value).
In [12]: df.iloc[[True, True, True, True, False, False]]
Out[12]:
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
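To see the ambiguity error itself, here is a minimal standalone sketch (hypothetical values) of what happens when an array is tested with in against a list:
import numpy as np

idx = np.array([0, 0, 1, 1, 2, 2])
try:
    idx not in [0, 1]  # list containment compares the whole array to each list element
except ValueError as err:
    print(err)  # "The truth value of an array with more than one element is ambiguous..."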

Why does the numpy .isin function give incorrect output?

My requirement: I have a large dataframe with millions of rows. I encoded all strings to numeric values in order to use numpy's vectorization to increase processing speed.
So I was looking for a way to quickly check whether a number exists in another list column. Previously I was using a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to use numpy.isin, I tried running the code below:
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 5, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
  id  col_a            col_b
0  a      1                2
1  a      2                2
2  a      5        [2, 5, 4]
3  b      1                4
4  b      2  [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
Which is incorrect, as the 3rd row has 5 in both columns col_a and col_b.
Whereas if I change the value to 4 as below:
dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [1, 2, 4, 1, 2],
                   'col_b': [2, 2, [2, 5, 4], 4, [1, 5, 6, 3, 2]]})
dt
  id  col_a            col_b
0  a      1                2
1  a      2                2
2  a      4        [2, 5, 4]
3  b      1                4
4  b      2  [1, 5, 6, 3, 2]
and execute the same code:
np.isin(dt['col_a'], dt['col_b'])
I get the correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results?
Since col_b not only has lists but also integers, you may need to use apply and treat them differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
         else x['col_a'] == x['col_b'], axis=1)
Output:
0    False
1     True
2     True
3    False
4     True
dtype: bool
For each element of dt['col_a'], np.isin checks whether it is present in the whole dt['col_b'] column, i.e.:
[
    1 in dt['col_b'],
    2 in dt['col_b'],
    5 in dt['col_b'],
    ...
]
There's no 5 among the top-level values of dt['col_b'] (5 only appears inside the nested lists), but there is a 4.
From the docs:
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
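A quick way to see that equivalence with small (hypothetical) 1-D inputs:
import numpy as np

a = np.array([1, 2, 5])
b = np.array([2, 4, 7])
print(np.isin(a, b))                        # [False  True False]
print(np.array([item in b for item in a]))  # the same: [False  True False]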
Also, your issue is that you have an inconsistent dt['col_b'] column (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)
Output:
0    False
1     True
2     True
3    False
4     True
dtype: bool
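An alternative sketch, assuming it is acceptable to normalize the column first: wrap the scalar entries of col_b in one-element lists so every row is treated uniformly (works for either version of dt above):
col_b_lists = dt['col_b'].apply(lambda v: v if isinstance(v, list) else [v])
print([a in b for a, b in zip(dt['col_a'], col_b_lists)])
# [False, True, True, False, True]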

How can I use pandas agg to sum booleans and always obtain numbers as result?

I have a data frame with a bool type column. I would like to obtain the number of True values per id using pandas' groupby and agg functions. I've done this a bunch of times, but it seems the resulting column's type depends on the data frame. Here is an example:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id']).agg({'bool': 'sum'}))
The output I get from this code is:
   id   bool
0   1   True
1   2  False
2   3   True
Which is not what I want. Now, if agg tries to sum two True values:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id']).agg({'bool': 'sum'}))
Then I get:
   id  bool
0   1  2.00
1   2  0.00
2   3  1.00
Which is what I want.
I've seen situations in which a few rows are of type bool, whereas others are of type float. It seems to be related to the number of rows grouped: if only one row is grouped, it shows the bool value; if more than one, the resulting type is float. I would like the resulting aggregated columns to always be of type float.
Pandas version is 1.0.1
You can sum and keep the output as a float in all cases with the following:
import pandas as pd
d = {'id': [1, 1, 2, 3], 'bool': [True, False, False, True]}
df = pd.DataFrame(data=d)
print(df.groupby(['id'])['bool'].sum().astype(float))
Yields the output
id
1 1.0
2 0.0
3 1.0
Name: bool, dtype: float64
You can just use the max function
df.groupby(['id']).agg({'bool': 'max'})
You can typecast to float. Use reset_index if you want a separate column for your index at the end:
df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()
Example:
>>> import pandas as pd
>>> d = {'id': [1, 1, 2, 3], 'bool': [True, True, False, True]}
>>> df = pd.DataFrame(data=d)
>>>
>>> df.groupby(['id']).agg({'bool': 'sum'}).astype(float).reset_index()
   id  bool
0   1   2.0
1   2   0.0
2   3   1.0
>>>
There is a dedicated numpy function to count non-zero cells
(True is counted as 1, False as 0). Selecting the column with ['bool'] rather than attribute access avoids clashing with the DataFrame.bool method, and np.count_nonzero can be passed to agg directly:
import numpy as np

df.groupby(['id'])['bool'].agg(np.count_nonzero)
I assume that you want an integer count of True values.
Otherwise, append .astype(float).
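Applied to the question's first data frame, this gives integer counts per id, e.g.:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3], 'bool': [True, False, False, True]})
print(df.groupby('id')['bool'].agg(np.count_nonzero))
# id
# 1    1
# 2    0
# 3    1
# Name: bool, dtype: int64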

Get a boolean vector from comparing columns using conditions

Given a matrix whose columns 0 and 1 need to be compared:
>>> import pandas as pd
>>> df = pd.DataFrame([[3,4,'a'],[3,5,'b'],[9,2,'a']])
>>> df
   0  1  2
0  3  4  a
1  3  5  b
2  9  2  a
The goal is to compare the first and second columns to check whether they fit a certain condition, e.g. whether the values in column df[0] are lower than those in df[1]. The desired output would look like this:
[True, True, False]
I've tried using np.where with a condition, but it seems to return the values that fit the condition instead:
>>> import numpy as np
>>> np.where(df[0] < df[1], df[0], df[1])
array([3, 3, 2])
I could do this, but I'm sure there's a simpler numpy or pandas function to get the boolean vector:
[row[0] < row[1] for idx, row in df.iterrows()]
Is that what you want?
import numpy as np
df = np.array([[3,4,'a'],[3,5,'b'],[9,2,'a']])
df[0, :] == df[1, :]
#output array([ True, False, False], dtype=bool)
df[0, :] < df[1, :]
#output array([False, True, True], dtype=bool)
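For the original DataFrame (rather than the numpy-array version above), the element-wise column comparison the question asks for can also be done directly; a minimal sketch:
import pandas as pd

df = pd.DataFrame([[3, 4, 'a'], [3, 5, 'b'], [9, 2, 'a']])
print((df[0] < df[1]).tolist())  # [True, True, False], the desired output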

Compare a matrix against a column vector

Array A and vector B below are part of a pandas dataframe.
I have a large array A of form:
28 39 52
77 80 66
7 18 24
9 97 68
I have a vector B of form:
32
5
42
17
How do I pythonically compare each column of A against B? I am trying to get True/False values for the A < B comparison, to get the following result:
TRUE FALSE FALSE
FALSE FALSE FALSE
TRUE TRUE TRUE
TRUE FALSE FALSE
I can do this with list comprehension syntax, but is there a better way to pull this off? My arrays A and B are very large.
Consider the pd.DataFrame and pd.Series, A and B
A = pd.DataFrame([
    [28, 39, 52],
    [77, 80, 66],
    [7, 18, 24],
    [9, 97, 68]
])
B = pd.Series([32, 5, 42, 17])
pandas
By default, when you compare a pd.DataFrame with a pd.Series, pandas aligns each index value from the series with the column names of the dataframe. This is what happens when you use A < B. In this case, you have 4 rows in your dataframe and 4 elements in your series, so I'm going to assume you want to align the index values of the series with the index values of the dataframe. In order to specify the axis you want to align with, you need to use the comparison method rather than the operator. That's because when you use the method, you can use the axis parameter and specify that you want axis=0 rather than the default axis=1.
A.lt(B, axis=0)
       0      1      2
0   True  False  False
1  False  False  False
2   True   True   True
3   True  False  False
I often just write this as A.lt(B, 0)
numpy
In numpy you also have to pay attention to the dimensionality of the arrays, and you are assuming that the positions are already lined up. The positions are taken care of if the arrays come from the same dataframe.
print(A.values)
[[28 39 52]
[77 80 66]
[ 7 18 24]
[ 9 97 68]]
print(B.values)
[32 5 42 17]
Notice that B is a 1-dimensional array while A is a 2-dimensional array. In order to compare B along the rows of A, we need to reshape B into a 2-dimensional array. The most obvious way to do this is with reshape:
print(A.values < B.values.reshape(4, 1))
[[ True False False]
[False False False]
[ True True True]
[ True False False]]
However, these are ways in which you will commonly see others do the same reshaping
A.values < B.values.reshape(-1, 1)
Or
A.values < B.values[:, None]
timed back test
To get a handle on how fast these comparisons are, I've constructed the following back test.
def pd_cmp(df, s):
    return df.lt(s, 0)

def np_cmp_a2a(df, s):
    """To get an apples to apples comparison
    I return the same thing in both functions"""
    return pd.DataFrame(
        df.values < s.values[:, None],
        df.index, df.columns
    )

def np_cmp_a2o(df, s):
    """To get an apples to oranges comparison
    I return a numpy array"""
    return df.values < s.values[:, None]

results = pd.DataFrame(
    index=pd.Index([10, 1000, 100000], name='group size'),
    columns=pd.Index(['pd_cmp', 'np_cmp_a2a', 'np_cmp_a2o'], name='method'),
)

from timeit import timeit

for i in results.index:
    df = pd.concat([A] * i, ignore_index=True)
    s = pd.concat([B] * i, ignore_index=True)
    for j in results.columns:
        # set_value was removed in pandas 1.0; .at is the modern equivalent
        results.at[i, j] = timeit(
            '{}(df, s)'.format(j),
            'from __main__ import {}, df, s'.format(j),
            number=100
        )

results.plot()
I can conclude that the numpy-based solutions are faster, but not by much. They all scale the same.
You can do this using lt and calling squeeze on B so it flattens the df to a 1-D Series:
In [107]:
A.lt(B.squeeze(),axis=0)
Out[107]:
       0      1      2
0   True  False  False
1  False  False  False
2   True   True   True
3   True  False  False
The problem is that without squeeze it will try to align on the column labels, which we don't want; we want to broadcast the comparison along the column axis.
The more efficient approach is to go down to the numpy level (A and B are DataFrames here):
A.values<B.values
Yet another option using numpy is numpy.newaxis:
In [99]: B = B[:, np.newaxis]
In [100]: B
Out[100]:
array([[32],
[ 5],
[42],
[17]])
In [101]: A < B
Out[101]:
array([[ True, False, False],
[False, False, False],
[ True, True, True],
[ True, False, False]], dtype=bool)
Essentially, we're converting the vector B into a 2D array so that numpy can broadcast when comparing two arrays of different shapes.
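A quick sketch of the shapes involved, using the values above as plain numpy arrays: the (4, 1) column broadcasts against the (4, 3) matrix:
import numpy as np

A = np.array([[28, 39, 52], [77, 80, 66], [7, 18, 24], [9, 97, 68]])
B = np.array([32, 5, 42, 17])
print(B[:, np.newaxis].shape)        # (4, 1)
print((A < B[:, np.newaxis]).shape)  # (4, 3) after broadcasting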

Select DataFrame data based on series value

I have a pandas' DataFrame and when I perform an operation on the dataframe, I get back a series. How can I use that series to select out only records where I find a match?
Right now I'm appending the column onto the DataFrame and doing a query against the dataframe then dropping the column. I really do not like this solution though, so I'm hoping I can get a better solution.
data = [[1,2,3], [1,3,4], [3,4,5]]
columns = ['a', 'b', 'c']
df = pd.DataFrame(data, columns=columns)
series = df.myoperation()
df['myoperation'] = series
res = df[df['myoperation'] == True]
del res['myoperation']
The series object will produce a 1-1 match, so index item 1 will match item 1 in the dataframe object.
Above is my hacky code to get it done, but I'm afraid that when the dataframe has many columns, or much more data than this simple example, it will be slow.
Thank you
If series is a boolean Series with the same index and the same length as df, you can use it for selection directly; this is called boolean indexing:
series = pd.Series([True, False, True], index=df.index)
res = df[series]
print (res)
   a  b  c
0  1  2  3
2  3  4  5
It always works with a boolean list or numpy array too; only the length has to be the same as df:
L = [True, False, True]
res = df[L]
print (res)
   a  b  c
0  1  2  3
2  3  4  5
arr = np.array([True, False, True])
res = df[arr]
print (res)
   a  b  c
0  1  2  3
2  3  4  5
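Applied to the question's pattern, the intermediate column becomes unnecessary; a one-line sketch, assuming df.myoperation() (the asker's placeholder) returns a boolean Series aligned with df:
res = df[df.myoperation()]  # boolean indexing directly on the operation's result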
