pandas element-wise comparison and create selection - python

In a DataFrame I would like to compare the elements of a column with a value and put the elements that pass the comparison into a new column.
df = pandas.DataFrame([{'A': 3, 'B': 10},
                       {'A': 2, 'B': 30},
                       {'A': 1, 'B': 20},
                       {'A': 2, 'B': 15},
                       {'A': 2, 'B': 100}])
df['C'] = [x for x in df['B'] if x > 18]
I can't figure out what's wrong and why I get:
ValueError: Length of values does not match length of index

I think you can use loc with boolean indexing:
print (df)
A B
0 3 10
1 2 30
2 1 20
3 2 15
4 2 100
print (df['B'] > 18)
0 False
1 True
2 True
3 False
4 True
Name: B, dtype: bool
df.loc[df['B'] > 18, 'C'] = df['B']
print (df)
A B C
0 3 10 NaN
1 2 30 30.0
2 1 20 20.0
3 2 15 NaN
4 2 100 100.0
If you need to select rows by condition, use boolean indexing:
print (df[df['B'] > 18])
A B
1 2 30
2 1 20
4 2 100
If you need something faster, use where:
df['C'] = df.B.where(df['B'] > 18)
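A hedged usage note (not from the original answer): where keeps the values for which the condition is True and fills NaN elsewhere, and it also accepts an other argument if you prefer a different fill value:
df['C'] = df['B'].where(df['B'] > 18, other=0)   # 0 instead of NaN where B <= 18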
Timings (len(df)=50k):
In [1367]: %timeit (a(df))
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.14 ms per loop
In [1368]: %timeit (b(df1))
100 loops, best of 3: 15.5 ms per loop
In [1369]: %timeit (c(df2))
100 loops, best of 3: 2.93 ms per loop
Code for timings:
import pandas as pd

df = pd.DataFrame([{'A': 3, 'B': 10},
                   {'A': 2, 'B': 30},
                   {'A': 1, 'B': 20},
                   {'A': 2, 'B': 15},
                   {'A': 2, 'B': 100}])
print (df)

df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def a(df):
    df['C'] = df.B.where(df['B'] > 18)
    return df

def b(df1):
    df1['C'] = [x if x > 18 else None for x in df1['B']]
    return df1

def c(df2):
    df2.loc[df2['B'] > 18, 'C'] = df2['B']
    return df2

print (a(df))
print (b(df1))
print (c(df2))

As Darren mentioned, all columns in a DataFrame must have the same length.
When you print [x for x in df['B'] if x > 18], you get only three values, [30, 20, 100], but the DataFrame's index has five rows. That is why you get the Length of values does not match length of index error.
You can change your code as follows:
df['C'] = [x if x > 18 else None for x in df['B']]
print (df)
You will get:
A B C
0 3 10 NaN
1 2 30 30.0
2 1 20 20.0
3 2 15 NaN
4 2 100 100.0

All columns in a DataFrame have to be the same length. Because you are filtering away some values, you are trying to insert fewer values into column C than are in columns A and B.
So, your two options are to keep the filtered values in a separate object:
dfC = [x for x in df['B'] if x > 18]
or put some dummy value in the column for the rows where x is not greater than 18. E.g.:
df['C'] = np.where(df['B'] > 18, True, False)
Or even:
df['C'] = np.where(df['B'] > 18, 'Yay', 'Nay')
P.S. Also take a look at: Pandas conditional creation of a series/dataframe column for other ways to do this.
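For completeness, here is a hedged sketch of one such alternative using numpy.select, which generalizes np.where to more than two labels (the 'high'/'mid'/'low' labels are made up for illustration):
import numpy as np

conditions = [df['B'] > 50, df['B'] > 18]
choices = ['high', 'mid']
df['C'] = np.select(conditions, choices, default='low')  # first matching condition wins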

Related

Pandas one liner to filter rows by nunique count on a specific column

In pandas, I regularly use the following to filter a dataframe by number of occurrences
df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)
Assume df has another column 'B' and I want to filter the dataframe this time by the count of unique values in that column. I would expect something like
df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)
But that doesn't seem to work. What would be the right approach?
It should work nicely with nunique:
df = pd.DataFrame({'B': list('abccee'),
                   'E': [5,3,6,9,2,4],
                   'A': list('aabbcc')})
print (df)
A B E
0 a a 5
1 a b 3
2 b c 6
3 b c 9
4 c e 2
5 c e 4
THRESHOLD2 = 2
df1 = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
print (df1)
A B E
0 a a 5
1 a b 3
But if you need a faster solution, use transform and filter by boolean indexing:
df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
A B E
0 a a 5
1 a b 3
Timings:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75, 0.0001, 0.0005, 0.0005, 0.2489)),
                   'A': np.random.randint(10000, size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print (df)
THRESHOLD2 = 3
In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop
In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop
Caveat
The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.
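If you want to probe that caveat yourself, one possible sketch (make_df and n_groups are assumptions for illustration, not part of the answer) is to rebuild the benchmark frame with a varying number of groups and re-run the %timeit lines above:
import numpy as np
import pandas as pd

def make_df(n_rows=1000000, n_groups=10000, seed=123):
    # more groups means more (smaller) groups for filter/transform to visit
    rng = np.random.RandomState(seed)
    return pd.DataFrame({'A': rng.randint(n_groups, size=n_rows),
                         'B': rng.choice(list('abcde'), n_rows)})

df_few = make_df(n_groups=100)      # few, large groups
df_many = make_df(n_groups=100000)  # many, small groups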

Filter dataframe matching all values of a vector in Python

I am trying to solve this question with Python.
ID = np.concatenate((np.repeat("A",5),
np.repeat("B",4),
np.repeat("C",2)))
Hour = np.array([0,2,5,6,9,0,2,5,6,0,2])
testVector = [0,2,5]
df = pd.DataFrame({'ID' : ID, 'Hour': Hour})
We group the rows by ID, then we want to remove all rows from df where not all values in testVector are found in the column Hour of that group. We could achieve that as follows:
def all_in(x, y):
    return all([z in list(x) for z in y])
to_keep = df.groupby(by='ID')['Hour'].aggregate(lambda x: all_in(x,testVector))
to_keep = list(to_keep[to_keep].index)
df = df[df['ID'].isin(to_keep)]
I want to make this code as short and efficient as possible. Any suggestions for improvements or alternative solution approaches?
In [99]: test_set = set(testVector)
In [100]: df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)]
Out[100]:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Explanation:
In the lambda x: set(x) & test_set == test_set function we create a set of Hour values for each group:
In [104]: df.groupby('ID').Hour.apply(lambda x: set(x))
Out[104]:
ID
A {0, 2, 5, 6, 9}
B {0, 2, 5, 6}
C {0, 2}
Name: Hour, dtype: object
Then we do set intersection with the test_set:
In [105]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set)
Out[105]:
ID
A {0, 2, 5}
B {0, 2, 5}
C {0, 2}
Name: Hour, dtype: object
and compare it with the test_set again:
In [106]: df.groupby('ID').Hour.apply(lambda x: set(x) & test_set == test_set)
Out[106]:
ID
A True
B True
C False
Name: Hour, dtype: bool
PS: I used .apply() instead of .transform() just to show how it works.
But we need to use transform in order to use boolean indexing later on:
In [107]: df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)
Out[107]:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 False
10 False
Name: Hour, dtype: bool
Similar to MaxU's solution but I used a Series instead of a set:
testVector = pd.Series(testVector)
df[df.groupby('ID')['Hour'].transform(lambda x: testVector.isin(x).all())]
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Filter might be more idiomatic here though:
df.groupby('ID').filter(lambda x: testVector.isin(x['Hour']).all())
Out:
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
Create sets for each ID from the Hour column first. Then map them to a new Series, which is compared with the vector:
df = df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)]
print (df)
Hour ID
0 0 A
1 2 A
2 5 A
3 6 A
4 9 A
5 0 B
6 2 B
7 5 B
8 6 B
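The trick in the one-liner above is that Python sets support >= as a superset test. A step-by-step sketch of the same idea (the explicit apply for the per-element comparison is my own addition, to avoid relying on comparing a Series directly against a set):
test_set = set(testVector)
hours_per_id = df.groupby(by='ID')['Hour'].apply(set)             # e.g. A -> {0, 2, 5, 6, 9}
mask = df['ID'].map(hours_per_id).apply(lambda s: s >= test_set)  # superset check per row
print (df[mask])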
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ID': np.random.randint(200, size=N),
                   'Hour': np.random.choice(range(10000), N)})
print (df)
testVector = [0,2,5]
test_set = set(testVector)
s = pd.Series(testVector)
#maxu sol
In [259]: %timeit (df.loc[df.groupby('ID').Hour.transform(lambda x: set(x) & test_set == test_set)])
1 loop, best of 3: 356 ms per loop
#jez sol
In [260]: %timeit (df[df['ID'].map(df.groupby(by='ID')['Hour'].apply(set)) >= set(testVector)])
1 loop, best of 3: 462 ms per loop
#ayhan sol1
In [261]: %timeit (df[df.groupby('ID')['Hour'].transform(lambda x: s.isin(x).all())])
1 loop, best of 3: 300 ms per loop
#ayhan sol2
In [263]: %timeit (df.groupby('ID').filter(lambda x: s.isin(x['Hour']).all()))
1 loop, best of 3: 211 ms per loop

How to delete columns with at least 20% missing values

Is there an efficient way to delete columns that have at least 20% missing values?
Suppose my dataframe is like:
A B C D
0 sg hh 1 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
After removing the columns, the dataframe becomes like this:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
You can use boolean indexing on the columns, keeping those where the count of notnull values is larger than 80% of the number of rows:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
This approach is useful in many cases; e.g., keeping only the columns where more than 80% of the values are larger than 1 would be:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna(), as illustrated by @EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
    df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
You can call dropna and pass a thresh value to drop the columns that don't meet your threshold criteria:
In [10]:
frac = len(df) * 0.8
df.dropna(thresh=frac, axis=1)
Out[10]:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 NaN 6
5 y 8
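One hedged note on the thresh examples above: the pandas documentation describes thresh as an integer count of required non-NA values, so passing the 80% cutoff as an explicit int is the safer form on newer versions:
df.dropna(thresh=int(len(df) * 0.8), axis=1)  # keep columns with at least 80% non-NA values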

Pandas - count if multiple conditions

Having a dataframe in python:
CASE TYPE
1 A
1 A
1 A
2 A
2 B
3 B
3 B
3 B
how can I create a result dataframe which would yield all cases and either an "A" if the case had only "A's" assigned, "B" if it was only "B's" or "MIXED" if the case had both A and B?
Result would be then:
Case Type
1 A
2 MIXED
3 B
Here is an option where we first collect the unique TYPE values per CASE group and then check how many there are: if more than one, return MIXED, otherwise return the TYPE itself:
import pandas as pd
import numpy as np
groups = (df.groupby('CASE')
            .agg(lambda g: [g.TYPE.unique()])
            .apply(lambda row: np.where(len(row.TYPE) > 1, 'MIXED', row.TYPE[0]), axis=1))
groups
# CASE
# 1 A
# 2 MIXED
# 3 B
# dtype: object
df['NTYPES'] = df.groupby('CASE').transform(lambda x: x.nunique())
df.loc[df.NTYPES > 1, 'TYPE'] = 'MIXED'
df.groupby('TYPE', as_index=False).first().drop('NTYPES', 1)
TYPE CASE
0 A 1
1 B 3
2 MIXED 2
Here is an (admittedly over-engineered) solution that avoids looping over groups and DataFrame.apply (both are slow, so avoiding them can matter if your dataset gets sufficiently large).
import pandas as pd
df = pd.DataFrame({'CASE': [1]*3 + [2]*2 + [3]*3,
                   'TYPE': ['A']*4 + ['B']*4})
We group by CASE and compute the relative frequencies of TYPE being A or B:
grouped = df.groupby('CASE')
vc = (grouped['TYPE'].value_counts(normalize=True)
                     .unstack(level=0)
                     .fillna(0))
Here's what vc looks like
CASE    1    2    3
TYPE
A     1.0  0.5  0.0
B     0.0  0.5  1.0
Notice that all the information is contained in the first row. Cutting said row into bins with pd.cut gives the desired result:
tolerance = 1e-10
bins = [-tolerance, tolerance, 1-tolerance, 1+tolerance]
types = pd.cut(vc.loc['A'], bins=bins, labels=['B', 'MIXED', 'A'])
We get:
CASE
1 A
2 MIXED
3 B
Name: A, dtype: category
Categories (3, object): [B < MIXED < A]
For good measure, we can rename the types series:
types.name = 'TYPE'
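If you then want the result as a two-column frame like the desired output (a small follow-up sketch, not part of the original answer), resetting the index works because the Series is named TYPE and its index is named CASE:
result = types.reset_index()  # columns: CASE, TYPE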
Here is a slightly ugly, but not that slow, solution:
In [154]: df
Out[154]:
CASE TYPE
0 1 A
1 1 A
2 1 A
3 2 A
4 2 B
5 3 B
6 3 B
7 3 B
8 4 C
9 4 C
10 4 B
In [155]: (df.groupby('CASE')['TYPE']
             .apply(lambda x: x.head(1) if x.nunique() == 1 else pd.Series(['MIX']))
             .reset_index()
             .drop('level_1', 1)
          )
Out[155]:
CASE TYPE
0 1 A
1 2 MIX
2 3 B
3 4 MIX
Timing against an 800K-row DF:
In [191]: df = pd.concat([df] * 10**5, ignore_index=True)
In [192]: df.shape
Out[192]: (800000, 3)
In [193]: %timeit Psidom(df)
1 loop, best of 3: 235 ms per loop
In [194]: %timeit capitalistpug(df)
1 loop, best of 3: 419 ms per loop
In [195]: %timeit Alberto_Garcia_Raboso(df)
10 loops, best of 3: 112 ms per loop
In [196]: %timeit MaxU(df)
10 loops, best of 3: 80.4 ms per loop

getting the index of a row in a pandas apply function

I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
and I'll define a function that accesses elements within a given row
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
I can apply it like so:
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
The index of any given row in this DataFrame before adding d would be Index([u'a', u'b', u'c'], dtype='object'), but I want the 0 and 1. So I can't just access row.index.
I know I could create a temporary column in the table where I store the index, but I'm wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])

def rowFunc(row):
    return row['a'] + row['b'] * row['c']

def rowIndex(row):
    return row.name

df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really all you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn't as trivial as this, whatever your rowFunc is really doing, you should look to use the vectorised functions, and then use them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Either:
1. with row.name inside the apply(..., axis=1) call:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])
a b c
x 1 2 3
y 4 5 6
df.apply(lambda row: row.name, axis=1)
x x
y y
2. with iterrows() (slower)
DataFrame.iterrows() allows you to iterate over rows, and access their index:
for idx, row in df.iterrows():
    ...
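A minimal, self-contained sketch of that variant, reusing the question's toy frame (adding idx to the row arithmetic is purely illustrative):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
d = []
for idx, row in df.iterrows():                      # idx is the row label (0, 1, ...)
    d.append(row['a'] + row['b'] * row['c'] + idx)
df['d'] = d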
To answer the original question: yes, you can access the index value of a row in apply(). It is available as the row's name attribute, and it requires that you specify axis=1 (so that the lambda receives rows rather than columns).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
