Find the index of a string value in a pandas DataFrame - python

How can I identify which column(s) in my DataFrame contain a specific string 'foo'?
Sample DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[10,20,42], 'B':['foo','bar','blah'],'C':[3,4,5], 'D':['some','foo','thing']})
I want to find B and D here.
Searching for a number works: if I'm looking for a number (e.g. 42) instead of a string, I can generate a boolean mask like this:
>>> ~(df.where(df==42)).isnull().all()
A True
B False
C False
D False
dtype: bool
but the same approach fails for strings:
>>> ~(df.where(df=='foo')).isnull().all()
TypeError: Could not compare ['foo'] with block values
I don't want to iterate over each column and row if possible (my actual data is much larger than this example). It feels like there should be a simple and efficient way.
How can I do this?

One way, using the underlying array data:
df.columns[(df.values=='foo').any(0)].tolist()
Sample run:
In [209]: df
Out[209]:
A B C D
0 10 foo 3 some
1 20 bar 4 foo
2 42 blah 5 thing
In [210]: df.columns[(df.values=='foo').any(0)].tolist()
Out[210]: ['B', 'D']
If you are looking for just the column mask:
In [205]: (df.values=='foo').any(0)
Out[205]: array([False, True, False, True], dtype=bool)
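For what it's worth, here is a minimal sketch of a column-level alternative that stays within pandas, assuming a reasonably recent pandas/numpy where comparing a string against numeric columns simply yields False rather than raising. DataFrame.eq compares element-wise, so the where/TypeError issue never comes up:
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 42], 'B': ['foo', 'bar', 'blah'],
                   'C': [3, 4, 5], 'D': ['some', 'foo', 'thing']})
mask = df.eq('foo').any()    # column mask: A False, B True, C False, D True
df.columns[mask].tolist()    # ['B', 'D']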

Option 1: df.values
~(df.where(df.values=='foo')).isnull().all()
Out[91]:
A False
B True
C False
D True
dtype: bool
Option 2: isin
~(df.where(df.isin(['foo']))).isnull().all()
Out[94]:
A False
B True
C False
D True
dtype: bool
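If the original TypeError comes from comparing a string against numeric blocks in an older pandas, one workaround (a sketch under that assumption) is to restrict the comparison to the object (string) columns first with select_dtypes:
obj = df.select_dtypes(include=['object'])   # only the string-like columns
~obj.where(obj == 'foo').isnull().all()      # B True, D True (numeric columns excluded)
obj.columns[(obj == 'foo').any()].tolist()   # ['B', 'D']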

Unfortunately, the syntax you gave won't index on a str; it has to be run against string data to compare string with string, unless I am missing something. Try this:
~df.where(df.isin(['foo'])).isnull().all()
A False
B True
C False
D True
dtype: bool

Strings in pandas series in list

I have a series where I would like to check if any string exists in a list. For example:
Series A:
A, B
A, B, C
E, D, F
List = ['A', 'B']
I would like to return whether any element of List is in Series A, something like:
True
True
False
Thanks!
Assuming your series consists of strings, you can use set.intersection (&):
L = ['A', 'B']
s = pd.Series(['A, B', 'A, B, C', 'E, D, F'])
res = s.str.split(', ').map(set) & set(L)
print(res)
0 True
1 True
2 False
dtype: bool
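An explicit alternative that avoids relying on how & coerces object dtype (a sketch using the same s and L): check whether each row's tokens are disjoint from the lookup set.
lookup = set(L)
res = s.str.split(', ').map(lambda parts: not lookup.isdisjoint(parts))
print(res)
0 True
1 True
2 False
dtype: bool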
You can also use np.isin on the split values:
s.str.split(', ').apply(lambda k: np.isin(k, List).any())
0 True
1 True
2 False
dtype: bool
This filter should give you a T/F vector once the strings are split, so membership is checked per token: A.str.split(', ').apply(lambda parts: any(x in List for x in parts))
Also, you mention Series, but it looks like you're referring to a DataFrame.
Using issubset
s.str.split(', ').apply(lambda x : {'A','B'}.issubset(tuple(x)))
Out[615]:
0 True
1 True
2 False
Name: c, dtype: bool
You can use str.contains and join the list with | to look for any of the elements of the list, such as:
print(s.str.contains('|'.join(L)))  # with jpp's definitions of s and L
0 True
1 True
2 False
dtype: bool
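One caveat worth adding about the contains approach (my note, not part of the answer above): it does substring and regex matching, so list items should be escaped and anchored if you only want whole tokens:
import re
pattern = r'\b(?:' + '|'.join(map(re.escape, L)) + r')\b'
s.str.contains(pattern)   # matches 'A' or 'B' as whole tokens only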

Converting a column value when filtering in pandas

In a CSV file that I read using pandas, there's a column that is conceptually boolean but stored as the strings 'F' or 'T'. How can I convert it to a real bool when filtering? There's no need to change the source file, only the filtering:
# how it is now
if something:
df1 = df1[df1['str_bool_column'] == 'F']
# how I want
if something:
df1 = df1[df1['str_bool_column'] == False]
I probably should use apply, but how exactly?
You can convert that column to a column of True/False values after reading it from the CSV. One method is to use Series.map to map 'F' to False and 'T' to True. Example:
df1['str_bool_column'] = df1['str_bool_column'].map({'F':False,'T':True})
Demo:
In [9]: df = pd.DataFrame([[1,'F'],[2,'T'],[3,'F'],[4,'T']],columns=['A','B'])
In [10]: df
Out[10]:
A B
0 1 F
1 2 T
2 3 F
3 4 T
In [11]: df['B'] = df['B'].map({'F':False,'T':True})
In [12]: df
Out[12]:
A B
0 1 False
1 2 True
2 3 False
3 4 True
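If you would rather do the conversion while reading the file, read_csv can map the strings to booleans at load time via its true_values/false_values parameters (a sketch; it assumes the column contains only 'T' and 'F'):
import pandas as pd
from io import StringIO

csv_data = StringIO("A,str_bool_column\n1,F\n2,T\n3,F\n4,T\n")
df1 = pd.read_csv(csv_data, true_values=['T'], false_values=['F'])
df1[df1['str_bool_column'] == False]   # the column is now a real bool dtype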

Comparing pandas Series for equality when they contain nan?

My application needs to compare Series instances that sometimes contain nans. That causes ordinary comparison using == to fail, since nan != nan:
import numpy as np
from pandas import Series
s1 = Series([1,np.nan])
s2 = Series([1,np.nan])
>>> (s1 == s2).all()
False
What's the proper way to compare such Series?
How about this. First check the NaNs are in the same place (using isnull):
In [11]: s1.isnull()
Out[11]:
0 False
1 True
dtype: bool
In [12]: s1.isnull() == s2.isnull()
Out[12]:
0 True
1 True
dtype: bool
Then check the values which aren't NaN are equal (using notnull):
In [13]: s1[s1.notnull()]
Out[13]:
0 1
dtype: float64
In [14]: s1[s1.notnull()] == s2[s2.notnull()]
Out[14]:
0 True
dtype: bool
In order to be equal we need both to be True:
In [15]: (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
Out[15]: True
You could also check name etc. if this wasn't sufficient.
If you want to raise if they are different, use assert_series_equal from pandas.util.testing:
In [21]: from pandas.util.testing import assert_series_equal
In [22]: assert_series_equal(s1, s2)
Currently one should just use series1.equals(series2) (see the docs). This also checks that NaNs are in the same positions.
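A minimal sketch with the Series from the question:
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, np.nan])
s1.equals(s2)   # True -- NaNs in matching positions count as equal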
I came here looking for a similar answer, and I think @Sam's answer is the neatest if you just want a single value back. But I wanted a truth array back from an element-wise (yet null-safe) comparison.
So finally I ended up with:
import numpy as np
import pandas as pd
s1 = pd.Series([1, np.nan, 2, np.nan])
s2 = pd.Series([1, np.nan, np.nan, 2])
(s1 == s2) | (s1.isnull() & s2.isnull())
The result:
0 True
1 True
2 False
3 False
dtype: bool
Comparing this to s1 == s2:
0 True
1 False
2 False
3 False
dtype: bool
In [16]: s1 = Series([1,np.nan])
In [17]: s2 = Series([1,np.nan])
In [18]: (s1.dropna()==s2.dropna()).all()
Out[18]: True
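As a follow-up to the assert_series_equal suggestion above: in current pandas the helper lives under pandas.testing (pandas.util.testing is the older location), as far as I know:
import numpy as np
import pandas as pd
from pandas.testing import assert_series_equal

assert_series_equal(pd.Series([1, np.nan]), pd.Series([1, np.nan]))  # passes silently; raises AssertionError on mismatch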

How can I map True/False to 1/0 in a Pandas DataFrame?

I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
In [1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
In [2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
In [3]: print(data*1)
   0  1  2
0  1  0  1
1  0  0  1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note that I use "is" as an English word here, not the Python keyword is; True will not be the same object as an arbitrary 1.
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
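A small sketch of that on a mixed-type frame (my example, not from the answer): only the boolean values are touched, other columns pass through unchanged.
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True],
                   "name": ["x", "y", "z"],
                   "n": [7, 8, 9]})
df.replace({False: 0, True: 1}, inplace=True)
print(df)   # flag is now 0/1; name and n are unchanged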
You can also do this directly on DataFrames:
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can apply a transformation to your data frame. Given a boolean DataFrame df (for example, one built from my_data by applying a condition), transform True/False into 1/0 with:
df = df*1
I had to map FAKE/REAL to 0/1 and couldn't find a suitable answer. Below is how to map a column 'type' whose values are FAKE/REAL to 0/1 (the same approach applies to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
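The same mapping can also be written with Series.map, which may be easier to extend to more labels (a sketch using the same 'type' column as above):
df['type'] = df['type'].map({'FAKE': 0, 'REAL': 1})   # unmapped values become NaN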
This is a reproducible example based on some of the existing answers:
import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert booleans to a binary representation, keeping NaN values."""
    return s.replace({True: 1, False: 0})

# generate an example dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (or replace the existing one);
# bind c as a default argument so each lambda keeps its own column name
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested (this assumes the values are the strings 'True'/'False', not actual booleans):
df[col] = df[col].map({'True': 1, 'False': 0})
If more than one column contains True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({'True': 1, 'False': 0})
@AMC wrote this in a comment:
If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
