While working on an answer to another question, I stumbled upon an unexpected behaviour:
Consider the following DataFrame:
df = pd.DataFrame({
'A':list('AAcdef'),
'B':[4,5,4,5,5,4],
'E':[5,3,6,9,2,4],
'F':list('BaaBbA')
})
print(df)
A B E F
0 A 4 5 B #<— row contains 'A' and 5
1 A 5 3 a #<— row contains 'A' and 5
2 c 4 6 a
3 d 5 9 B
4 e 5 2 b
5 f 4 4 A
If we try to find all rows that contain both 'A' and 5, we can use jezrael's answer:
cond = [['A'],[5]]
print(np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]))
which (correctly) yields: [ True True False False False False]
If, however, we use:
cond = [['A'],[5]]
print( df.apply(lambda x: np.isin([cond],[x]).all(),axis=1) )
this yields:
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
Closer inspection of the second attempt reveals that:
np.isin(['A',5], df.loc[0]) "wrongly" yields array([ True, False]), likely because numpy coerces ['A',5] to a unicode string dtype (so 5 becomes '5'), while the Series keeps dtype object, and consequently 5 != '5'
np.isin(['A',5], ['A',4,5,'B']) "correctly" yields array([ True, True]), because here both lists get coerced to strings, which means we can (and should) use df.loc[0].values.tolist() in the .apply() method above
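The coercion is easy to see directly; a minimal sketch using only numpy and pandas:

import numpy as np
import pandas as pd

# Building an ndarray from a mixed list coerces everything to strings:
print(np.array(['A', 5]))             # ['A' '5'], the int 5 became the string '5'

# A Series built from the same kind of mix keeps dtype object, so ints stay ints:
s = pd.Series(['A', 4, 5, 'B'])
print(s.dtype)                        # object
print(np.isin(['A', 5], s))           # [ True False], since '5' != 5
print(np.isin(['A', 5], s.tolist()))  # [ True  True], both sides coerced to strings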
The question, simplified:
Why do I need to specify x.values.tolist() in one case, but can use x directly in the other?
print(np.logical_and.reduce([df.isin(x).any(axis=1) for x in cond]))
print(df.apply(lambda x: np.isin([cond], x.values.tolist()).all(), axis=1))
Edit:
Even worse is what happens if we search for [4,5]:
cond = [[4],[5]]
# this returns False for row 0
print(df.apply(lambda x: np.isin([cond], x.values.tolist()).all(), axis=1))
# this returns True for row 0
print(df.apply(lambda x: np.isin([cond], x.values).all(), axis=1))
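The difference traces back to the same string coercion; a minimal sketch on row 0 of the example frame:

row = df.loc[0]
print(row.values)                     # ['A' 4 5 'B'] (dtype object), ints preserved
print(np.array(row.values.tolist()))  # ['A' '4' '5' 'B'], rebuilt array, ints stringified

# With .values the object dtype survives, so np.isin([cond], row.values) compares
# ints with ints; .values.tolist() hands numpy a plain list, which it rebuilds into
# a string array, so 4 != '4'.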
I think the DataFrame mixes numeric and string columns, so looping by rows yields Series with mixed types, and numpy coerces them to strings.
A possible solution is to convert cond to an array and then to string values:
cond = [[4],[5]]
print(df.apply(lambda x: np.isin(np.array(cond).astype(str), x.values.tolist()).all(),axis=1))
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
Unfortunately, for a general solution (a row may contain only numeric values) you need to convert both cond and the Series:
f = lambda x: np.isin(np.array(cond).astype(str), x.astype(str).tolist()).all()
print (df.apply(f, axis=1))
Or convert all the data:
f = lambda x: np.isin(np.array(cond).astype(str), x.tolist()).all()
print (df.astype(str).apply(f, axis=1))
If you use sets in pure Python, it works nicely, because membership testing uses Python's == element by element, so no dtype coercion occurs:
print(df.apply(lambda x: set([4,5]).issubset(x),axis=1) )
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
print(df.apply(lambda x: set(['A',5]).issubset(x),axis=1) )
0 True
1 True
2 False
3 False
4 False
5 False
dtype: bool
Because
df.isin operates on a pd.Series, while np.isin does not.
df.loc returns a pd.Series.
To transform a pd.Series into an array-like, your x.values.tolist() should work.
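A minimal illustration with the df from the question (numpy/pandas imported as above):

print(type(df.loc[0]))            # <class 'pandas.core.series.Series'>
print(df.loc[0].values.tolist())  # ['A', 4, 5, 'B'], a plain list, Python types intact
print(np.isin(['A', 5], df.loc[0].values.tolist()))  # [ True  True]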
Related
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):

In [1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
In [2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
In [3]: print(data * 1)
   0  1  2
0  1  0  1
1  0  0  1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
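For example, a sketch with a small frame mixing bool, int and string columns (names made up for illustration):

import pandas as pd

df = pd.DataFrame({"flag": [True, False], "n": [1, 2], "s": ["x", "y"]})
df.replace({False: 0, True: 1}, inplace=True)
print(df)
#    flag  n  s
# 0     1  1  x
# 1     0  2  y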
You can also do this directly on DataFrames:
In [104]: df = pd.DataFrame(dict(A=True, B=False), index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation on your DataFrame. Given a frame of booleans (for example, one produced by a condition on your data), transforming True/False into 1/0 is just:
df = df*1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map the column 'type', which has the values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
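A more compact equivalent using Series.map, as other answers here do (same hypothetical 'type' column):

df['type'] = df['type'].map({'FAKE': 0, 'REAL': 1})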
This is a reproducible example based on some of the existing answers:
import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert booleans to a binary representation, maintaining NaN values."""
    return s.replace({True: 1, False: 0})

# generate a sample dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (or replace the existing one);
# note the c=c default argument, which avoids the late-binding pitfall where
# every lambda would otherwise see only the last value of c
df_new = df.assign(**{c: (lambda df, c=c: df[c].pipe(bool_to_int)) for c in bool_cols})
Tried and tested (note the string keys: this assumes the column literally contains the strings 'True'/'False'; for real booleans use {True: 1, False: 0}):
df[col] = df[col].map({'True': 1, 'False': 0})
If more than one column contains True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({'True': 1, 'False': 0})
@AMC wrote this in a comment:
If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
How can I identify which column(s) in my DataFrame contain a specific string 'foo'?
Sample DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[10,20,42], 'B':['foo','bar','blah'],'C':[3,4,5], 'D':['some','foo','thing']})
I want to find B and D here.
I can search for numbers:
If I'm looking for a number (e.g. 42) instead of a string, I can generate a boolean mask like this:
>>> ~(df.where(df==42)).isnull().all()
A True
B False
C False
D False
dtype: bool
but not strings:
>>> ~(df.where(df=='foo')).isnull().all()
TypeError: Could not compare ['foo'] with block values
I don't want to iterate over each column and row if possible (my actual data is much larger than this example). It feels like there should be a simple and efficient way.
How can I do this?
One way with underlying array data -
df.columns[(df.values=='foo').any(0)].tolist()
Sample run -
In [209]: df
Out[209]:
A B C D
0 10 foo 3 some
1 20 bar 4 foo
2 42 blah 5 thing
In [210]: df.columns[(df.values=='foo').any(0)].tolist()
Out[210]: ['B', 'D']
If you are looking for just the column-mask -
In [205]: (df.values=='foo').any(0)
Out[205]: array([False, True, False, True], dtype=bool)
Option 1: df.values
~(df.where(df.values=='foo')).isnull().all()
Out[91]:
A False
B True
C False
D True
dtype: bool
Option 2: isin
~(df.where(df.isin(['foo']))).isnull().all()
Out[94]:
A False
B True
C False
D True
dtype: bool
Unfortunately, the syntax you gave won't compare against a str; the comparison has to run on string-typed values, unless I am missing something. Try this:
~df101.where(df101.isin(['foo'])).isnull().all()
A False
B True
C False
D True
dtype: bool
I have two DataFrames, and one is made up of lists.
in [00]: table01
out[00]:
a b
0 1 2
1 2 3
in [01]: table02
out[01]:
a b
0 [2] [3]
1 [1,2] [1,2]
Now I want to compare the two tables: if the element in table01 is also in the list at the same position in table02, return True, otherwise return False. So the table I want to have is:
a b
0 False False
1 True False
I have tried table01 in table02 but got an error message: 'DataFrame' objects are mutable, thus they cannot be hashed.
Please share the correct solution to this problem. Thanks a lot!
Using sets and df.applymap:
df3 = df1.applymap(lambda x: {x})
df4 = df2.applymap(set)
df3 & df4
a b
0 {} {}
1 {2} {}
(df3 & df4).astype(bool)
a b
0 False False
1 True False
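Put together as a runnable sketch, with df1/df2 standing in for table01/table02 from the question (applymap is called DataFrame.map in newer pandas):

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [2, 3]})
df2 = pd.DataFrame({'a': [[2], [1, 2]], 'b': [[3], [1, 2]]})

# singleton sets for df1, element sets for df2; intersect, then truth-test
result = (df1.applymap(lambda x: {x}) & df2.applymap(set)).astype(bool)
print(result)
#        a      b
# 0  False  False
# 1   True  False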
user3847943's solution is a good alternative, but could be improved using a set membership test.
def find_in_array(a, b):
    return a in b

for c in df2.columns:
    df2[c] = df2[c].map(set)

vfunc = np.vectorize(find_in_array)
df = pd.DataFrame(vfunc(df1, df2), index=df1.index, columns=df1.columns)
df
a b
0 False False
1 True False
You can easily do this using numpy.vectorize. Sample code below:
import numpy as np
import pandas as pd
t1 = pd.DataFrame([[1, 2],[2,3]])
t2 = pd.DataFrame([[[2],[3]],[[1,2],[1,2]]])
def find_in_array(a, b):
    return a in b
vfunc = np.vectorize(find_in_array)
print(vfunc(t1, t2))
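which should print (each cell tests membership of the t1 value in the t2 list at the same position):

[[False False]
 [ True False]]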
Try this:
df = pd.melt(df1.reset_index(), 'index')
df['v2'] = pd.melt(df2.reset_index(), 'index').value
df['BOOL'] = df.apply(lambda x: x.value in x.v2, axis=1)
df.pivot('index', 'variable', 'BOOL')
Out[491]:
variable a b
index
0 False False
1 True False
Finally:
df1.apply(lambda x: [(x==df2.loc[y,x.name])[y] for y in x.index])
Out[668]:
a b
0 False False
1 True False
In a CSV file that I read using pandas, there's a boolean column stored in string format as 'F' or 'T'. How can I convert it to a real bool when filtering? There is no need to change the source file, only the filtering:
# how it is now
if something:
df1 = df1[df1['str_bool_column'] == 'F']
# how I want
if something:
df1 = df1[df1['str_bool_column'] == False]
I probably should use apply, but how exactly?
You can convert that column to a column of True/False values after reading it from the CSV. One method is to use Series.map to map 'F' to False and 'T' to True. Example:
df1['str_bool_column'] = df1['str_bool_column'].map({'F':False,'T':True})
Demo -
In [9]: df = pd.DataFrame([[1,'F'],[2,'T'],[3,'F'],[4,'T']],columns=['A','B'])
In [10]: df
Out[10]:
A B
0 1 F
1 2 T
2 3 F
3 4 T
In [11]: df['B'] = df['B'].map({'F':False,'T':True})
In [12]: df
Out[12]:
A B
0 1 False
1 2 True
2 3 False
3 4 True
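An equivalent one-off conversion, if you'd rather not spell out the mapping (this assumes the column only ever holds 'T' and 'F'):

df['B'] = df['B'] == 'T'  # 'T' -> True, everything else -> False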