I am trying to merge two columns into a third column based on the NaN values
df['code2'] = np.where(df['code']==np.nan, df['code'], df['code1'])
I am getting only the values of the code1 column in code2. The result is as shown in the image:
Output image
Please tell me what is wrong with the code I am writing. Thanks
I think you need isnull for comparing NaN:
df['code2'] = np.where(df['code'].isnull(), df['code'], df['code1'])
Docs:
Warning
One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None
Out[11]: True
In [12]: np.nan == np.nan
Out[12]: False
So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.
In [13]: df2['one'] == np.nan
Out[13]:
a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
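To see the difference between the two masks, here is a minimal sketch. The sample values are made up, and it assumes the goal is to keep code where it is present and fall back to code1 where code is NaN, which the question does not state explicitly; swap the last two arguments of np.where for the opposite behaviour.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"code": [1.0, np.nan, 3.0], "code1": [10.0, 20.0, 30.0]})

# Comparing with == never matches NaN, so this mask is False everywhere.
eq_mask = df["code"] == np.nan

# isnull() flags the missing entries.
null_mask = df["code"].isnull()

# Assumed intent: take code1 where code is missing, otherwise keep code.
df["code2"] = np.where(null_mask, df["code1"], df["code"])
```

With the == mask, np.where always takes its third argument, which is why only one column's values show up in the result.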
For numeric data, the way to check whether a value is NaN is np.isnan(val):
>>> np.nan == np.nan
False
>>> np.isnan(np.nan)
True
>>> df = pd.DataFrame({'a': [np.nan, 1, 2]})
>>> np.isnan(df.a)
0     True
1    False
2    False
Name: a, dtype: bool
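One caveat, sketched below with made-up data: np.isnan only has loops for numeric types, so on an object-dtype Series (strings, None) it raises a TypeError, while pd.isna handles both cases. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([np.nan, 1.0, 2.0])
mixed = pd.Series(["x", np.nan, None], dtype=object)

# Works on float data.
numeric_mask = np.isnan(numeric)

# Fails on object data, because np.isnan has no object loop.
try:
    np.isnan(mixed)
    isnan_failed = False
except TypeError:
    isnan_failed = True

# pd.isna treats both NaN and None as missing, whatever the dtype.
mixed_mask = pd.isna(mixed)
```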
I have a dataframe that might contain NaN values.
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
df.iloc[1,3] = np.nan
df.isna().apply(lambda x: any(x), axis = 0)
Output:
0 False
1 False
2 False
3 True
4 False
dtype: bool
When I run:
any(df.isna())
It returns:
True
If there are no NaNs:
array = np.empty((4,5))
array[:] = 10
df = pd.DataFrame(array)
#df.iloc[1,3] = np.nan
df.isna().apply(lambda x: any(x), axis = 0)
0 False
1 False
2 False
3 False
4 False
dtype: bool
However when I run:
any(df.isna())
It returns:
True
Why is this the case? Am I misunderstanding the function any()?
When you loop over a DataFrame you are actually iterating over its column labels, not its rows or values as you might think. More precisely, the for loop calls DataFrame.__iter__, which returns an iterator over the column labels of the DataFrame.
For instance, in the following
df = pd.DataFrame(columns=['a', 'b', 'c'])
for x in df:
    print(x)
# Output:
#
# a
# b
# c
x holds the name of each column of df in turn. You can see the same thing in the output of list(df).
This means that when you call any(df.isna()), under the hood any is iterating over the column labels of df and checking their truthiness. If at least one is truthy, it returns True.
In both of your examples the column labels are numbers, list(df.isna()) = list(df.columns) = [0, 1, 2, 3, 4], of which only 0 is falsy. Therefore, in both cases any(df.isna()) = True.
Solution
The solution is to use DataFrame.any with axis=None instead of using the built-in any function.
df.isna().any(axis=None)
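The difference can be checked with a short sketch that rebuilds the 4x5 frame from the question (np.full is used here only as a shorter way to fill it with 10):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.full((4, 5), 10.0))  # no NaNs yet

# Iterating a DataFrame yields its column labels.
labels = list(df.isna())

# Built-in any() tests the labels 0..4, and 1..4 are truthy.
builtin_without_nan = any(df.isna())

# DataFrame.any(axis=None) reduces over the actual values.
pandas_without_nan = bool(df.isna().any(axis=None))

df.iloc[1, 3] = np.nan
pandas_with_nan = bool(df.isna().any(axis=None))

# Without axis=None you get one answer per column instead.
per_column = df.isna().any()
```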
I have a column in python pandas DataFrame that has boolean True/False values, but for further calculations I need 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print(data)
       0      1     2
0   True  False  True
1  False  False  True
[3]: print(data * 1)
   0  1  2
0  1  0  1
1  0  0  1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
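The same holds for a boolean Series, as this small sketch with made-up values suggests: arithmetic and reductions treat True as 1 and False as 0 without any explicit conversion.

```python
import pandas as pd

flags = pd.Series([True, False, True, True])

true_count = flags.sum()   # number of True entries
true_share = flags.mean()  # fraction of True entries
scaled = flags * 5         # integer Series
```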
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
You also can do this directly on Frames
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
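When the frame also holds non-boolean columns, calling astype(int) on the whole frame would try to convert those too. One possible sketch, using a made-up frame, is to pick the boolean columns with select_dtypes and pass a per-column mapping to astype:

```python
import pandas as pd

# Hypothetical mixed-type frame.
df = pd.DataFrame(
    {"flag": [True, False, True], "name": ["a", "b", "c"], "n": [1, 2, 3]}
)

# Column labels whose dtype is bool.
bool_cols = df.select_dtypes(include="bool").columns

# Convert only those columns; the others keep their dtype.
converted = df.astype({c: int for c in bool_cols})
```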
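Use Series.view to convert booleans to integers (note that Series.view is deprecated in recent pandas releases, where astype is the recommended replacement):
df["somecolumn"] = df["somecolumn"].view('i1')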
You can apply a transformation to your data frame:
df = pd.DataFrame(my_data condition)  # pseudocode: a boolean frame built from some condition on my_data
# transform True/False into 1/0
df = df * 1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map a column named 'type', which holds the values FAKE/REAL, to 0/1 (note: the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
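An alternative sketch does the same mapping in one step with Series.map and a dict. The sample rows and the type_code column name are made up for illustration; values missing from the dict would come out as NaN.

```python
import pandas as pd

df = pd.DataFrame({"type": ["FAKE", "REAL", "REAL", "FAKE"]})

# Look every value up in the dict; write to a new column to keep the original.
df["type_code"] = df["type"].map({"FAKE": 0, "REAL": 1})
```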
This is a reproducible example based on some of the existing answers:
import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert booleans to their 1/0 representation, keeping NaN values."""
    return s.replace({True: 1, False: 0})

# generate a sample dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)
# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]
# apply the new coding to a new dataframe (or can replace the existing one);
# c=c binds each column name now, otherwise every lambda would see the last one
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
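Tried and tested (note that the keys here are the strings 'True' and 'False'; for a genuinely boolean column use the keys True and False instead):
df[col] = df[col].map({'True': 1, 'False': 0})
If there is more than one column with True/False, use the following.
for col in bool_cols:
    df[col] = df[col].map({'True': 1, 'False': 0})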
@AMC wrote this in a comment:
If the column is of type object:
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
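A caveat worth checking before using this on an object column, sketched with made-up data: astype(bool) goes by truthiness, so it behaves as expected when the column holds real booleans, but any non-empty string, including 'False', becomes True. For string data an explicit mapping is one option.

```python
import pandas as pd

bool_objects = pd.Series([True, False, True], dtype=object)
bool_strings = pd.Series(["True", "False"], dtype=object)

# Real booleans stored as objects convert as expected.
converted = bool_objects.astype(bool).astype(int)

# Non-empty strings are all truthy, so 'False' turns into True.
truthy = bool_strings.astype(bool)

# For strings, map the text values explicitly instead.
mapped = bool_strings.map({"True": 1, "False": 0})
```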
How can I identify which column(s) in my DataFrame contain a specific string 'foo'?
Sample DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[10,20,42], 'B':['foo','bar','blah'],'C':[3,4,5], 'D':['some','foo','thing']})
I want to find B and D here.
I can search for numbers:
If I'm looking for a number (e.g. 42) instead of a string, I can generate a boolean mask like this:
>>> ~(df.where(df==42)).isnull().all()
A True
B False
C False
D False
dtype: bool
but not strings:
>>> ~(df.where(df=='foo')).isnull().all()
TypeError: Could not compare ['foo'] with block values
I don't want to iterate over each column and row if possible (my actual data is much larger than this example). It feels like there should be a simple and efficient way.
How can I do this?
One way with underlying array data -
df.columns[(df.values=='foo').any(0)].tolist()
Sample run -
In [209]: df
Out[209]:
A B C D
0 10 foo 3 some
1 20 bar 4 foo
2 42 blah 5 thing
In [210]: df.columns[(df.values=='foo').any(0)].tolist()
Out[210]: ['B', 'D']
If you are looking for just the column-mask -
In [205]: (df.values=='foo').any(0)
Out[205]: array([False, True, False, True], dtype=bool)
Option 1 df.values
~(df.where(df.values=='foo')).isnull().all()
Out[91]:
A False
B True
C False
D True
dtype: bool
Option 2 isin
~(df.where(df.isin(['foo']))).isnull().all()
Out[94]:
A False
B True
C False
D True
dtype: bool
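Since isin already returns a boolean frame, the where/isnull round trip can be skipped. A minimal sketch on the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [10, 20, 42],
        "B": ["foo", "bar", "blah"],
        "C": [3, 4, 5],
        "D": ["some", "foo", "thing"],
    }
)

# True for every cell equal to 'foo', then reduced per column.
mask = df.isin(["foo"]).any()

# Column labels where at least one cell matched.
cols = df.columns[mask].tolist()
```

This matches whole cell values only, the same as the == comparison in the other answers.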
Unfortunately, the syntax you gave cannot compare the whole frame against a str. Unless I am missing something, the values have to be compared as strings.
Try this:
~df101.where(df101.isin(['foo'])).isnull().all()
A False
B True
C False
D True
dtype: bool
My application needs to compare Series instances that sometimes contain nans. That causes ordinary comparison using == to fail, since nan != nan:
import numpy as np
from pandas import Series
s1 = Series([1,np.nan])
s2 = Series([1,np.nan])
>>> (s1 == s2).all()
False
What's the proper way to compare such Series?
How about this. First check the NaNs are in the same place (using isnull):
In [11]: s1.isnull()
Out[11]:
0 False
1 True
dtype: bool
In [12]: s1.isnull() == s2.isnull()
Out[12]:
0 True
1 True
dtype: bool
Then check the values which aren't NaN are equal (using notnull):
In [13]: s1[s1.notnull()]
Out[13]:
0 1
dtype: float64
In [14]: s1[s1.notnull()] == s2[s2.notnull()]
Out[14]:
0 True
dtype: bool
In order to be equal we need both to be True:
In [15]: (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
Out[15]: True
You could also check name etc. if this wasn't sufficient.
If you want to raise if they are different, use assert_series_equal from pandas.testing (it lived in pandas.util.testing in older releases):
In [21]: from pandas.testing import assert_series_equal
In [22]: assert_series_equal(s1, s2)
Currently one should just use series1.equals(series2); see the docs. This also checks that the NaNs are in the same positions.
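A short sketch of how equals behaves; s3 is a made-up Series with the NaN in a different position:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan])
s2 = pd.Series([1, np.nan])
s3 = pd.Series([np.nan, 1])

# Element-wise == never matches NaN against NaN.
naive = bool((s1 == s2).all())

# equals treats NaNs in the same position as equal.
same = s1.equals(s2)
different = s1.equals(s3)
```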
I came looking here for a similar answer, and I think @Sam's answer is the neatest if you just want a single value back. But I wanted a truth array back, with an element-wise (but null-safe) comparison.
So finally I ended up with:
import numpy as np
import pandas as pd
s1 = pd.Series([1, np.nan, 2, np.nan])
s2 = pd.Series([1, np.nan, np.nan, 2])
(s1 == s2) | (s1.isnull() & s2.isnull())
The result:
0 True
1 True
2 False
3 False
dtype: bool
Comparing this to s1 == s2:
0 True
1 False
2 False
3 False
dtype: bool
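The same idea can be wrapped in a small helper. This is only a sketch: nan_safe_eq is a hypothetical name, and the sample data adds a fifth pair of unequal, non-null values to check that such pairs still compare as False.

```python
import numpy as np
import pandas as pd


def nan_safe_eq(a: pd.Series, b: pd.Series) -> pd.Series:
    """Element-wise equality that also counts NaN == NaN as True."""
    return (a == b) | (a.isnull() & b.isnull())


s1 = pd.Series([1, np.nan, 2, np.nan, 3])
s2 = pd.Series([1, np.nan, np.nan, 2, 4])

result = nan_safe_eq(s1, s2)
```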
In [16]: s1 = Series([1,np.nan])
In [17]: s2 = Series([1,np.nan])
In [18]: (s1.dropna()==s2.dropna()).all()
Out[18]: True