Selecting None values from a DataFrame - python

I'm trying to select the values of one column for the rows where another column contains None.
My dataframe looks as follows:
tdf = pandas.DataFrame([
    {'a': 'val', 'b': 'abc'},
    {'a': None, 'b': 'def'}])
Since the following works for values:
In [112]: tdf[tdf['a']=='val']
Out[112]:
a b
0 val abc
I was expecting the same to work for None, but it doesn't:
In [111]: tdf[tdf['a']==None]
Out[111]:
Empty DataFrame
Columns: [a, b]
Index: []
In the end I'd like to use something like tdf[tdf['a']==None]['b'], but how do I handle those None values properly?

Use isnull to test for NaN:
In [71]:
tdf[tdf.isnull()]
Out[71]:
a b
0 NaN NaN
1 None NaN
NaN has the property that it is not equal to itself, which is why your comparison failed:
In [72]:
np.NaN == np.NaN
Out[72]:
False
In [73]:
np.NaN != np.NaN
Out[73]:
True
isnull is also available as a method on a Series:
In [74]:
tdf[tdf['a'].isnull()]
Out[74]:
a b
1 None def
So to do what you specifically want, you can pass the boolean mask from isnull to loc and select column 'b':
In [75]:
tdf.loc[tdf['a'].isnull(), 'b']
Out[75]:
1 def
Name: b, dtype: object
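For completeness, a minimal end-to-end sketch of the above (my own, assuming a recent pandas where isna is an alias for isnull):

import pandas as pd

tdf = pd.DataFrame([
    {'a': 'val', 'b': 'abc'},
    {'a': None, 'b': 'def'}])

# boolean mask of rows where column 'a' is missing (None is treated as missing)
mask = tdf['a'].isna()

# select column 'b' for those rows
print(tdf.loc[mask, 'b'])   # -> 1    def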

Related

Boolean Indexing along the row axis of a DataFrame in pandas

a = [ [1,2,3,4,5], [6,np.nan,8,np.nan,10]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])
In [5]: df
Out[5]:
a b c d e
foo 1 2.0 3 4.0 5
bar 6 NaN 8 NaN 10
I understand how normal boolean indexing works, for example if I want to select the rows that have c > 3 I would write df[df.c > 3]. However, what if I want to do that along the row axis. Say I want only the columns that have 'bar' == np.nan.
I would have assumed that the following should do it, due to the similarity of df['a'] and df.loc['bar']:
df.loc[df.loc['bar'].isnull()]
But it doesn't, and neither does results[results.loc['hl'].isnull()]; both give the same error: *** pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
So how would I do it?
IIUC you want to use the boolean mask to mask the columns:
In [135]:
df[df.columns[df.loc['bar'].isnull()]]
Out[135]:
b d
foo 2.0 4.0
bar NaN NaN
Or you can use ix and decay the Series to a NumPy array:
In [138]:
df.ix[:,df.loc['bar'].isnull().values]
Out[138]:
b d
foo 2.0 4.0
bar NaN NaN
The problem here is that the boolean series returned is a mask on the columns:
In [136]:
df.loc['bar'].isnull()
Out[136]:
a False
b True
c False
d True
e False
Name: bar, dtype: bool
but your row index contains none of these column values as labels, hence the error. So you need to apply the mask against the columns, or pass a NumPy array to mask the columns in ix.
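Note that ix has since been removed from pandas; on current versions the aligned boolean Series can be passed straight to loc on the column axis. A small sketch (my own, assuming numpy and pandas imported as np and pd):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, np.nan, 8, np.nan, 10]],
                  columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])

# the boolean Series is indexed by the column labels, so it aligns along axis=1
mask = df.loc['bar'].isnull()
print(df.loc[:, mask])   # keeps only columns 'b' and 'd'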

pandas.DataFrame.apply() using index as an arg

I'm trying to apply a function to every row in a pandas DataFrame. The number of columns is variable, but I want to use the row's index value inside the function as well.
def pretend(np_array, index):
    sum(np_array)*index

df = pd.DataFrame(np.arange(16).reshape(8,2))
answer = df.apply(pretend, axis=1, args=(df.index))
I shaped it to 8x2 but I'd like it to work on any shape I pass it.
The index values can be accessed via the .name attribute:
In [3]:
df = pd.DataFrame(data = np.random.randn(5,3), columns=list('abc'))
df
Out[3]:
a b c
0 -1.662047 0.794483 0.672300
1 -0.812412 -0.325160 -0.026990
2 -0.334991 0.412977 -2.016004
3 -1.337757 -1.328030 -1.005114
4 0.699106 -1.527408 -1.288385
In [8]:
def pretend(np_array):
    return (np_array.sum())*np_array.name

df.apply(lambda x: pretend(x), axis=1)
Out[8]:
0 -0.000000
1 -1.164561
2 -3.876037
3 -11.012701
4 -8.466748
dtype: float64
You can see that the first row becomes 0, as its index value is 0.
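Applied to the 8x2 frame from the question, the same idea works regardless of the number of columns; a quick sketch (my own):

import numpy as np
import pandas as pd

def pretend(row):
    # row is a Series; row.name is the row's index label
    return row.sum() * row.name

df = pd.DataFrame(np.arange(16).reshape(8, 2))
print(df.apply(pretend, axis=1))   # e.g. row 1 -> (2 + 3) * 1 = 5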

Is there a better, more readable way to coalesce columns in pandas

I often need a new column that is the best value I can take from other columns, according to a specific list of priorities. I am willing to take the first non-null value.
def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print df
Results
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
This code works (and the results are what I want), but it is not very fast.
I get to pick my priorities if I need to, e.g. [['second','third','first']].
coalesce works somewhat like the function of the same name in T-SQL.
I suspect that I may have overlooked an easy way to achieve this with good performance on large DataFrames (400,000+ rows).
I know there are lots of ways to fill in missing data, which I often use on axis=0; this is what makes me think I may have missed an easy option for axis=1.
Can you suggest something nicer/faster, or confirm that this is as good as it gets?
The Pandas equivalent to COALESCE is the method fillna():
result = column_a.fillna(column_b)
The result is a column where each value is taken from column_a if that column provides a non-null value, otherwise the value is taken from column_b. So your combo1 can be produced with:
df['first'].fillna(df['second']).fillna(df['third'])
giving:
0 A
1 C
2 B
3 None
4 A
And your combo2 can be produced with:
df['second'].fillna(df['third']).fillna(df['first'])
which returns the new column:
0 C
1 C
2 B
3 None
4 B
If you wanted an efficient operation called coalesce, it could simply combine columns with fillna() from left to right and then return the result:
def coalesce(df, column_names):
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print coalesce(df, ['first', 'second', 'third'])
print coalesce(df, ['second', 'third', 'first'])
which gives:
0 A
1 C
2 B
3 None
4 A
0 C
1 C
2 B
3 None
4 B
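As a design note, the same left-to-right fillna chain can be written more compactly with functools.reduce; a sketch of an alternative, not part of the answer above:

from functools import reduce

def coalesce(df, column_names):
    # start from the first column and fill its gaps from each following column in turn
    return reduce(lambda acc, name: acc.fillna(df[name]),
                  column_names[1:], df[column_names[0]])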
You could use pd.isnull to find the null -- in this case None -- values:
In [169]: pd.isnull(df)
Out[169]:
first second third
0 False False False
1 True False False
2 True True False
3 True True True
4 False True False
and then use np.argmin to find the index of the first non-null value. If all the values are null, np.argmin returns 0:
In [186]: np.argmin(pd.isnull(df).values, axis=1)
Out[186]: array([0, 1, 2, 0, 0])
Then you could select the desired values from df using NumPy integer-indexing:
In [193]: df.values[np.arange(len(df)), np.argmin(pd.isnull(df).values, axis=1)]
Out[193]: array(['A', 'C', 'B', None, 'A'], dtype=object)
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

mask = pd.isnull(df).values
df['combo1'] = df.values[np.arange(len(df)), np.argmin(mask, axis=1)]

order = np.array([1, 2, 0])
mask = mask[:, order]
df['combo2'] = df.values[np.arange(len(df)), order[np.argmin(mask, axis=1)]]
yields
  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B
Using argmin instead of df2.apply(coalesce, ...) is significantly quicker if the DataFrame has a lot of rows:
df2 = pd.concat([df]*1000)
In [230]: %timeit mask = pd.isnull(df2).values; df2.values[np.arange(len(df2)), np.argmin(mask, axis=1)]
1000 loops, best of 3: 617 µs per loop
In [231]: %timeit df2.apply(coalesce, axis=1)
10 loops, best of 3: 84.1 ms per loop
df1 = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                    {'third':'B','first':None,'second':'C'},
                    {'third':'B','first':None,'second':None},
                    {'third':None,'first':None,'second':None},
                    {'third':'B','first':'A','second':None}])
df1['combo'] = df1[['second','third','first']].bfill(axis='columns')["second"]
print(df1)
Results
  third first second combo
0     B     A      C     C
1     B  None      C     C
2     B  None   None     B
3  None  None   None  None
4     B     A   None     B
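A brief note on why this works (my reading, not spelled out in the answer): bfill(axis='columns') back-fills along each row, so after selecting the columns in priority order, the first selected column ('second' here) ends up holding the first non-null value in that order. A tiny sketch:

import pandas as pd

row = pd.DataFrame([{'second': None, 'third': 'B', 'first': 'A'}])
# back-fill along the row: None in 'second' is replaced by the next non-null value to its right
print(row[['second', 'third', 'first']].bfill(axis='columns')['second'])   # -> 0    B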

Check if Pandas column contains value from another column

If df['col'] = 'a','b','c' and df2['col'] = 'a123','b456','d789', how do I create df2['is_contained'] = 'a','b','no_match', where the df['col'] value is returned if it is found within the corresponding df2['col'] value, and 'no_match' is returned otherwise? I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
With this toy data set, we want to add a new column to df2 which will contain no_match for the first three rows, while the last row will contain the value 'd' because that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123', 'b456', 'd789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where, which is more or less the same as ifelse in R, if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]:
0 NO_MATCH
1 NO_MATCH
2 NO_MATCH
3 d
Name: col, dtype: object
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used.
You must first guarantee that the indexes match. To simplify, I'll show it as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a123', 'b456', 'd789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
  col1  col2  contained
0    a  a123       True
1    b  b456       True
2    c  d789      False
3    d     a      False
In pandas 0.13+, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
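Building on that (my own sketch, not part of the answer above), the NaN rows can then be filled with the asker's 'no_match' default:

import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col': ['d23', 'b456', 'a789']})

pattern = '(%s)' % '|'.join(df1.col)
# extract returns NaN where no df1 value occurs inside the df2 value
df2['is_contained'] = df2.col.str.extract(pattern, expand=False).fillna('no_match')
print(df2)
#     col is_contained
# 0   d23     no_match
# 1  b456            b
# 2  a789            a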

How can I map True/False to 1/0 in a Pandas DataFrame?

I have a column in a pandas DataFrame that has boolean True/False values, but for further calculations I need a 1/0 representation. Is there a quick pandas/numpy way to do that?
A succinct way to convert a single column of boolean values to a column of integers 1 or 0:
df["somecolumn"] = df["somecolumn"].astype(int)
Just multiply your DataFrame by 1 (int):
[1]: data = pd.DataFrame([[True, False, True], [False, False, True]])
[2]: print data
0 1 2
0 True False True
1 False False True
[3]: print data*1
0 1 2
0 1 0 1
1 0 0 1
True is 1 in Python, and likewise False is 0*:
>>> True == 1
True
>>> False == 0
True
You should be able to perform any operations you want on them by just treating them as though they were numbers, as they are numbers:
>>> issubclass(bool, int)
True
>>> True * 5
5
So to answer your question, no work necessary - you already have what you are looking for.
* Note I use is as an English word, not the Python keyword is - True will not be the same object as any random 1.
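The same holds inside pandas, since a bool column already behaves as 0/1 in arithmetic; a small sketch:

import pandas as pd

s = pd.Series([True, False, True])
print(s.sum())    # 2 -- True counts as 1
print(s * 5)      # 5, 0, 5
print(s.mean())   # 0.666... -- the fraction of True values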
This question specifically mentions a single column, so the currently accepted answer works. However, it doesn't generalize to multiple columns. For those interested in a general solution, use the following:
df.replace({False: 0, True: 1}, inplace=True)
This works for a DataFrame that contains columns of many different types, regardless of how many are boolean.
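For instance (a small sketch with made-up column names), on a frame mixing bool and non-bool columns:

import pandas as pd

df = pd.DataFrame({'flag': [True, False], 'other_flag': [False, False], 'name': ['x', 'y']})
df.replace({False: 0, True: 1}, inplace=True)
print(df)
#    flag  other_flag name
# 0     1           0    x
# 1     0           0    y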
You can also do this directly on DataFrames:
In [104]: df = DataFrame(dict(A = True, B = False),index=range(3))
In [105]: df
Out[105]:
A B
0 True False
1 True False
2 True False
In [106]: df.dtypes
Out[106]:
A bool
B bool
dtype: object
In [107]: df.astype(int)
Out[107]:
A B
0 1 0
1 1 0
2 1 0
In [108]: df.astype(int).dtypes
Out[108]:
A int64
B int64
dtype: object
Use Series.view to convert booleans to integers:
df["somecolumn"] = df["somecolumn"].view('i1')
You can use a transformation for your data frame: build the boolean frame first, e.g.
df = pd.DataFrame(my_data_condition)
then transform True/False into 1/0:
df = df*1
I had to map FAKE/REAL to 0/1 but couldn't find a proper answer.
Below is how to map a column named 'type', which has the values FAKE/REAL, to 0/1 (the same approach can be applied to any column name and values):
df.loc[df['type'] == 'FAKE', 'type'] = 0
df.loc[df['type'] == 'REAL', 'type'] = 1
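Equivalently (a sketch of an alternative, not part of the answer above), the same mapping can be written in one line with Series.map:

df['type'] = df['type'].map({'FAKE': 0, 'REAL': 1})   # values not in the dict become NaN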
This is a reproducible example based on some of the existing answers:
import pandas as pd

def bool_to_int(s: pd.Series) -> pd.Series:
    """Convert the boolean to binary representation, maintain NaN values."""
    return s.replace({True: 1, False: 0})

# generate a random dataframe
df = pd.DataFrame({"a": range(10), "b": range(10, 0, -1)}).assign(
    a_bool=lambda df: df["a"] > 5,
    b_bool=lambda df: df["b"] % 2 == 0,
)

# select all bool columns (or specify which cols to use)
bool_cols = [c for c, d in df.dtypes.items() if d == "bool"]

# apply the new coding to a new dataframe (the c=c default binds each column name,
# otherwise every lambda would refer to the last column)
df_new = df.assign(**{c: lambda df, c=c: df[c].pipe(bool_to_int) for c in bool_cols})
Tried and tested:
df[col] = df[col].map({True: 1, False: 0})
If there is more than one column with True/False values, use the following:
for col in bool_cols:
    df[col] = df[col].map({True: 1, False: 0})
#AMC wrote this in a comment: if the column is of type object,
df["somecolumn"] = df["somecolumn"].astype(bool).astype(int)
