Boolean Indexing along the row axis of a DataFrame in pandas - python

import numpy as np
import pandas as pd

a = [[1, 2, 3, 4, 5], [6, np.nan, 8, np.nan, 10]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])
In [5]: df
Out[5]:
a b c d e
foo 1 2.0 3 4.0 5
bar 6 NaN 8 NaN 10
I understand how normal boolean indexing works; for example, if I want to select the rows that have c > 3 I would write df[df.c > 3]. However, what if I want to do that along the row axis? Say I want only the columns where the 'bar' row is np.nan.
I would have assumed the following would do it, given the similarity of df['a'] and df.loc['bar']:
df.loc[df.loc['bar'].isnull()]
But it doesn't, and neither does results[results.loc['hl'].isnull()] on my real data; both give the same error: *** pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
So how would I do it?

IIUC you want to use the boolean mask to mask the columns:
In [135]:
df[df.columns[df.loc['bar'].isnull()]]
Out[135]:
b d
foo 2.0 4.0
bar NaN NaN
Or you can use ix and decay the Series to a NumPy array:
In [138]:
df.ix[:,df.loc['bar'].isnull().values]
Out[138]:
b d
foo 2.0 4.0
bar NaN NaN
The problem here is that the boolean series returned is a mask on the columns:
In [136]:
df.loc['bar'].isnull()
Out[136]:
a False
b True
c False
d True
e False
Name: bar, dtype: bool
but your index contains none of these column values as labels, hence the error. So you need to use the mask against the columns, or pass a NumPy array to mask the columns in ix.
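For what it's worth, ix has since been deprecated and removed in current pandas, but loc accepts a boolean Series aligned with the column axis, so the same column masking works without it. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, np.nan, 8, np.nan, 10]],
                  columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])

# the mask's index is the columns, so it aligns with the column axis of loc
print(df.loc[:, df.loc['bar'].isnull()])   # keeps only columns b and d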

Related

delete columns where values are not increasing in pandas

I have a df where some columns have increasing values and other columns have values that are either decreasing or not changing, and I want to delete those columns. I tried is_monotonic, but that returns True if the values are increasing and does not distinguish the case where the values stay the same.
import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 33}, {'a': 10, 'b': 2, 'c': 30}]
df = pd.DataFrame(data)
In the above example I want to keep only column 'a', since the other two columns have values that stay the same or decrease. Can anyone help me please?
Take the difference of all columns with diff, drop the first row (which is all NaN), and check whether all remaining values are greater than 0:
df = df.loc[:, df.diff().iloc[1:].gt(0).all()]
print (df)
a
0 1
1 10
Details:
print (df.diff())
a b c
0 NaN NaN NaN
1 9.0 0.0 -3.0
print (df.diff().iloc[1:])
a b c
1 9.0 0.0 -3.0
print (df.diff().iloc[1:].gt(0))
a b c
1 True False False
print (df.diff().iloc[1:].gt(0).all())
a True
b False
c False
dtype: bool
Or, as mentioned in the comments, invert the logic - find columns where any difference is less than or equal to 0 and negate the mask with ~:
df = df.loc[:, ~df.diff().le(0).any()]
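Since the question mentions is_monotonic, here is a minimal alternative sketch along those lines: a strictly increasing column is monotonic increasing and has no repeated values, so the two properties can be combined (this assumes numeric columns):

import pandas as pd

data = [{'a': 1, 'b': 2, 'c': 33}, {'a': 10, 'b': 2, 'c': 30}]
df = pd.DataFrame(data)

# keep a column only if it is monotonic increasing *and* has no repeats
keep = df.apply(lambda s: s.is_monotonic_increasing and s.is_unique)
print(df.loc[:, keep])   # only column 'a' survives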

How to drop rows not containing string type in a column in Pandas?

I have a csv file with four columns. I read it like this:
df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])
Now, field C contains string values, but some rows contain non-string values (floats or other numbers). How do I drop those rows? I'm using pandas version 0.18.1.
Setup
df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print df
A B C D
0 a b c d
1 e f 1.2 g
Notice that you can inspect what the individual cell types are:
print type(df.loc[0, 'C']), type(df.loc[1, 'C'])
<type 'str'> <type 'float'>
mask and slice
print df.loc[df.C.apply(type) != float]
A B C D
0 a b c d
more general
print df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))]
A B C D
0 a b c d
You could also try converting with float to determine whether a value can be interpreted as one:
def try_float(x):
    try:
        float(x)
        return True
    except:
        return False
print df.loc[~df.C.apply(try_float)]
A B C D
0 a b c d
The problem with this approach is that you'll exclude strings that can be interpreted as floats.
Comparing times for the few options I've provided, and also jezrael's solution, on small dataframes and on a dataframe with 500,000 rows:
Checking whether the type is float seems to be the most performant, with the to_numeric approach right behind it. If you need to check for both int and float, I'd go with jezrael's answer. If you can get away with checking only for float, use that one.
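The timing numbers themselves aren't reproduced here, but a rough sketch of how such a comparison could be run (the column name and sizes are just illustrative):

import timeit
import pandas as pd

# object column alternating real strings and floats, 500,000 rows
df = pd.DataFrame({'C': ['a', 1.2] * 250000})

def by_type():
    return df.loc[df.C.apply(type) != float]

def by_numeric():
    return df[pd.to_numeric(df.C, errors='coerce').isnull()]

print(timeit.timeit(by_type, number=10))
print(timeit.timeit(by_numeric, number=10))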
You can use boolean indexing with a mask created by to_numeric with errors='coerce' - you get NaN wherever there is a string value. Then check isnull:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': ['a', 8, 9],
                   'D': [1, 3, 5]})
print (df)
A B C D
0 1 4 a 1
1 2 5 8 3
2 3 6 9 5
print (pd.to_numeric(df.C, errors='coerce'))
0 NaN
1 8.0
2 9.0
Name: C, dtype: float64
print (pd.to_numeric(df.C, errors='coerce').isnull())
0 True
1 False
2 False
Name: C, dtype: bool
print (df[pd.to_numeric(df.C, errors='coerce').isnull()])
A B C D
0 1 4 a 1
Use the pandas.DataFrame.select_dtypes method.
For example:
df.select_dtypes(exclude='object')
or
df.select_dtypes(include=['int64','float','int'])
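Note that select_dtypes filters whole columns by their dtype rather than individual rows, so a mixed object column is kept or dropped as a unit. A minimal sketch (the frame is just illustrative):

import pandas as pd

df = pd.DataFrame({'A': ['a', 'e'], 'C': ['c', 1.2], 'D': [1, 3]})

# C is an object column (mixed str/float), D is int64
print(df.select_dtypes(exclude='object'))                   # keeps only D
print(df.select_dtypes(include=['int64', 'float', 'int']))  # here also keeps only D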

Selecting None-s from dataframe

I'm trying to select values from one column for the rows where another column contains None.
My dataframe looks as follows:
import pandas

tdf = pandas.DataFrame([
    {'a': 'val', 'b': 'abc'},
    {'a': None, 'b': 'def'}])
Since the following works for values:
In [112]: tdf[tdf['a']=='val']
Out[112]:
a b
0 val abc
I was expecting the same to work for None, but it doesn't:
In [111]: tdf[tdf['a']==None]
Out[111]:
Empty DataFrame
Columns: [a, b]
Index: []
In the end I'd like to use something like tdf[tdf['a']==None]['b'], but how do I handle those None values properly?
Use isnull to test for NaN:
In [71]:
tdf[tdf.isnull()]
Out[71]:
a b
0 NaN NaN
1 None NaN
NaN has the property that it is not equal to itself, which is why your comparison failed:
In [72]:
np.NaN == np.NaN
Out[72]:
False
In [73]:
np.NaN != np.NaN
Out[73]:
True
It is also available as a method on a Series:
In [74]:
tdf[tdf['a'].isnull()]
Out[74]:
a b
1 None def
So to do what you specifically want, you can pass the boolean mask from isnull to loc and select column 'b':
In [75]:
tdf.loc[tdf['a'].isnull(), 'b']
Out[75]:
1 def
Name: b, dtype: object
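On pandas 0.21 and later, isna is an equivalent alias for isnull, so the same selection can also be written as:
tdf.loc[tdf['a'].isna(), 'b']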

Replace values in a dataframe column based on condition

I have a seemingly easy task: a dataframe with two columns, A and B. If a value in B is larger than the value in A, replace it with the value from A. I used to do this with df.B[df.B > df.A] = df.A, but a recent upgrade of pandas started giving a SettingWithCopyWarning for this chained assignment, and the official documentation recommends using .loc.
Okay, I said, and switched to df.loc[df.B > df.A, 'B'] = df.A. It all works fine, unless column B contains only NaN values. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfy the condition, then all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, array, or scalar, and lhs could be a slice, tuple, scalar, or array.
There is also a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs (this is a bit complicated). For example, say you don't set all of the elements on the lhs and the column was integer - then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the dtype of the rhs, but this case degenerates if we have an unsafe conversion (int -> float).
Suffice to say this was a missing edge case.
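As for how the replacement could be done safely in the meantime, one approach that sidesteps the boolean-indexing assignment is Series.mask: wherever the condition is True the value is taken from A, and since NaN > A evaluates to False the NaNs are left untouched. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 4, np.nan]})

# replace B with A only where B > A; NaN comparisons are False, so NaN stays NaN
df['B'] = df['B'].mask(df['B'] > df['A'], df['A'])
print(df)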

Check if Pandas column contains value from another column

If df['col'] = 'a', 'b', 'c' and df2['col'] = 'a123', 'b456', 'd789', how do I create df2['is_contained'] = 'a', 'b', 'no_match'? That is, when a value from df['col'] is found within a value from df2['col'], that df['col'] value should be returned, and when no match is found, 'no_match' should be returned. I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
With this toy data set, we want to add a new column to df2 which will contain no_match for the first three rows, and the last row will contain the value 'd' because that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123','b456','d789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where which is more or less the same as ifelse in R if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]:
0 NO_MATCH
1 NO_MATCH
2 NO_MATCH
3 d
Name: col, dtype: object
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used.
You must first guarantee that the indexes match. To simplify, I'll show it as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis (axis=1):
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a123', 'b456', 'd789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
col1 col2 contained
0 a a123 True
1 b b456 True
2 c d789 False
3 d a False
In 0.13, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
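To also get the 'no_match' / 'Multiple Matches' behaviour asked for in the question, here is a minimal sketch based on plain substring checks (the extra 'ab12' row is added only to exercise the multiple-match case; this is just one way to do it):

import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
df2 = pd.DataFrame({'col': ['a123', 'b456', 'd789', 'ab12']})

# for each df2 value, collect every df1 value contained in it
matches = df2['col'].apply(lambda s: [v for v in df1['col'] if v in s])

df2['is_contained'] = matches.apply(
    lambda m: m[0] if len(m) == 1 else ('no_match' if not m else 'Multiple Matches'))
print(df2)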
