Conditionally select dataframes from a list of dfs - python

I'm trying to subset a list of dataframes with a function. This function would need to return only the df's which for example have a Z-column-total of > 14 and X-column-values (rows 0-4) which are 30% below or above the average of those 5 values. So, in the example below df1 would be returned and df2 not.
Can this be done, evaluating every dataframe with these kinds of conditions? Could anyone point me in the right direction?
N = 5
np.random.seed(0)
df1 = pd.DataFrame(
{'X':np.random.uniform(0,5,N),
'Y':np.random.uniform(0,5,N),
'Z':np.random.uniform(0,5,N),
})
df2 = pd.DataFrame(
{'X':np.random.uniform(0,5,N),
'Y':np.random.uniform(0,5,N),
'Z':np.random.uniform(0,5,N),
})
df1.loc['total'] = df1.sum()
df2.loc['total'] = df2.sum()
df_list = (df1, df2)
X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486
--------------------------------------
X Y Z
0 0.435646 4.893092 3.199605
1 0.101092 3.995793 0.716766
2 4.163099 2.307397 4.723345
3 3.890784 3.902646 2.609242
4 4.350061 0.591372 2.073310
total 12.940682 15.690299 13.322267

List comprehension can be used, with the 2 stated conditions.
The Z condition is pretty straightforward and easy to implement. Regarding the X condition, I created a little function that returns True if the dataframe matches the condition, else False.
In [156]: def check_X(df):
...: avg = df.drop('total')['X'].mean()
...: for val in df.drop('total')['X']:
...: if val/avg < 0.7 or val/avg > 1.3: #30% more or less
...: return False
...: return True
...:
Therefore, we can get the expected result by doing:
In [157]: [df for df in df_list if df.drop('total')['Z'].sum() > 14 and check_X(df)]
Out[157]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176522 16.611794 14.426486]
Edit: a better, one-liner solution that doesn't use any user-defined function:
In [205]: [df for df in df_list if df['Z'].sum() > 14 and ((df['X'] > df['X'].mean()*0.7) & (df['X'] < df['X'].mean()*1.3)).all()]
Out[205]:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180]
For simplicity, I dropped the 'total' row from both df before processing:
In [204]: df_list = [df.drop('total') for df in df_list]

If you have a list of dataframes then conditionally select the dataframe using list comprehension and you can use slicing (iloc[0:-1] for excluding last row).
new_list= [x for x in df_list if (x.loc['total','Z']>14) and
((x.iloc[0:-1]['X'] > x.iloc[0:-1]['X'].mean()*0.7) & (x.iloc[0:-1]['X'] < x.iloc[0:-1]['X'].mean()*1.3)).all()]
Output:
[ X Y Z
0 2.744068 3.229471 3.958625
1 3.575947 2.187936 2.644475
2 3.013817 4.458865 2.840223
3 2.724416 4.818314 4.627983
4 2.118274 1.917208 0.355180
total 14.176521 16.611793 14.426486]

Related

Python. Change numeric data into categorical [duplicate]

I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if rowA > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if else cases I do np.where(df.A > df.B, 1, -1), does pandas provide a special syntax for solving my problem with one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
if row['A'] == row['B']:
val = 0
elif row['A'] > row['B']:
val = 1
else:
val = -1
return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable. Especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
df['A'] == df['B'], 0, np.where(
df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
Easy to solve using indexing. The first line of code reads like so, if column A is equal to column B then create and set column C equal to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if
conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Lets say above one is your original dataframe and you want to add a new column 'old'
If age greater than 50 then we consider as older=yes otherwise False
step 1: Get the indexes of rows whose age greater than 50
row_indexes=df[df['age']>=50].index
step 2:
Using .loc we can assign a new value to column
df.loc[row_indexes,'elderly']="yes"
same for age below less than 50
row_indexes=df[df['age']<50].index
df[row_indexes,'elderly']="no"
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)

trim last rows of a pandas dataframe based on a condition

let's assume a dataframe like this:
idx x y
0 a 3
1 b 2
2 c 0
3 d 2
4 e 5
how can I trim the bottom rows, based on a condition, so that any row after the last one matching the condition would be removed?
for example:
with the following condition: y == 0
the output would be
idx x y
0 a 3
1 b 2
2 c 0
the condition can happen many times, but the last one is the one that triggers the cut.
Method 1:
Usng index.max & iloc:
index.max to get the last row with condition y==0
iloc to slice of the dataframe on the index found with df['y'].eq(0)
idx = df.query('y.eq(0)').index.max()+1
# idx = df.query('y==0').index.max()+1 -- if pandas < 0.25
df.iloc[:idx]
Output
x y
0 a 3
1 b 2
2 c 0
Method 2:
Using np.where
idx = np.where(df['y'].eq(0), df.index, 0).max()+1
df.iloc[:idx]
Output
x y
0 a 3
1 b 2
2 c 0
you could do, here np.wherereturns a tuple, so we access the value of the indexes as the first element of the tuple using np.where(df.y == 0), the first occurence is then returned as the last element of this vector, finaly we add 1 to the index so we can include this index of the last occurence while slicing
df_cond = df.iloc[:np.where(df.y == 0)[0][-1]+1, :]
or you could do :
df_cond = df[ :df.y.eq(0).cumsum().idxmax()+1 ]
Set up your dataframe:
data = [
[ 'a', 3],
[ 'b' , 2],
[ 'c' , 0],
[ 'd', 2],
[ 'e' , 5]
]
df = pd.DataFrame(data, columns=['x', 'y']).reset_index().rename(columns={'index':'idx'}).sort_values('idx')
Then find your cutoff (assuming the idx column is already sorted):
cutoff = df[df['y'] == 0].idx.min()
The df['y'] == 0 is your condition. Then get the min idx that meets that condition and save it as our cutoff.
Finally, create a new dataframe using your cutoff:
df_new = df[df.idx <= cutoff].copy()
Output:
df_new
idx x y
0 0 a 3
1 1 b 2
2 2 c 0
I would do something like this:
df.iloc[:df['y'].eq(0).idxmax()+1]
Just look for the largest index where your condition is true.
EDIT
So the above code wont work because idxmax() still only takes the first index where the value is true. So we we can do the following to trick it:
df.iloc[:df['y'].eq(0).sort_index(ascending = False).idxmax()+1]
Flip the index, so the last index is the first index that idxmax picks up.

Show complete rows highliting difference between dataframes df1 , df2 , but only when difference in row's cell exists

i have two dataframes df1 and df2.
Same index and same column names.
how to construct a dataframe which shows difference, but only rows which have at least one different cell?
if row has different cells, but some are same, keep same cells intact.
example:
df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})
output needed:
pd.DataFrame({1:['a'],2:['c->d']},index=[0])
output in this example should be one row dataframe, not dataframe including same rows
NB: output should only contain full rows which has at least one difference in cell
i'd like an efficient solution without iterating by rows , and without creating special-strings in DataFrame
You can use this brilliant solution:
def report_diff(x):
return x[0] if x[0] == x[1] else '{}->{}'.format(*x)
In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[70]:
1 2
0 a c->d
1 a c
For bit more complex DataFrames:
In [73]: df1
Out[73]:
A B C
0 a c 1
1 a c 2
2 1 2 3
In [74]: df2
Out[74]:
A B C
0 a d 1
1 a c 2
2 1 2 4
In [75]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[75]:
A B C
0 a c->d 1
1 a c 2
2 1 2 3->4
UPDATE: showing only changed/different rows:
In [54]: mask = df1.ne(df2).any(1)
In [55]: mask
Out[55]:
0 True
1 False
2 True
dtype: bool
In [56]: pd.Panel(dict(df1=df1[mask],df2=df2[mask])).apply(report_diff, axis=0)
Out[56]:
A B C
0 a c->d 1
2 1 2 3->4
How about a good ole list comprehension on the flattened contents...
import pandas as pd
import numpy as np
df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})
rows_different_mask = (df1 != df2).any(axis=1)
pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0])
new_elems = ["%s->%s" %(old, new) if (old != new) else new for old, new in pairs]
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape))
print df3
0 1
0 a c->d
1 a c

Update row values where certain condition is met in pandas

Say I have the following dataframe:
What is the most efficient way to update the values of the columns feat and another_feat where the stream is number 2?
Is this it?
for index, row in df.iterrows():
if df1.loc[index,'stream'] == 2:
# do something
How do I do it if there are more than 100 columns? I don't want to explicitly name the columns that I want to update. I want to divide the value of each column by 2 (except for the stream column).
So to be clear, my goal is:
Dividing all values by 2 of all rows that have stream 2, but not changing the stream column.
I think you can use loc if you need update two columns to same value:
df1.loc[df1['stream'] == 2, ['feat','another_feat']] = 'aaaa'
print df1
stream feat another_feat
a 1 some_value some_value
b 2 aaaa aaaa
c 2 aaaa aaaa
d 3 some_value some_value
If you need update separate, one option is use:
df1.loc[df1['stream'] == 2, 'feat'] = 10
print df1
stream feat another_feat
a 1 some_value some_value
b 2 10 some_value
c 2 10 some_value
d 3 some_value some_value
Another common option is use numpy.where:
df1['feat'] = np.where(df1['stream'] == 2, 10,20)
print df1
stream feat another_feat
a 1 20 some_value
b 2 10 some_value
c 2 10 some_value
d 3 20 some_value
EDIT: If you need divide all columns without stream where condition is True, use:
print df1
stream feat another_feat
a 1 4 5
b 2 4 5
c 2 2 9
d 3 1 7
#filter columns all without stream
cols = [col for col in df1.columns if col != 'stream']
print cols
['feat', 'another_feat']
df1.loc[df1['stream'] == 2, cols ] = df1 / 2
print df1
stream feat another_feat
a 1 4.0 5.0
b 2 2.0 2.5
c 2 1.0 4.5
d 3 1.0 7.0
If working with multiple conditions is possible use multiple numpy.where
or numpy.select:
df0 = pd.DataFrame({'Col':[5,0,-6]})
df0['New Col1'] = np.where((df0['Col'] > 0), 'Increasing',
np.where((df0['Col'] < 0), 'Decreasing', 'No Change'))
df0['New Col2'] = np.select([df0['Col'] > 0, df0['Col'] < 0],
['Increasing', 'Decreasing'],
default='No Change')
print (df0)
Col New Col1 New Col2
0 5 Increasing Increasing
1 0 No Change No Change
2 -6 Decreasing Decreasing
You can do the same with .ix, like this:
In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))
In [2]: df
Out[2]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 -0.905302 -0.435821 1.934512
3 0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
In [3]: df.ix[df.a>0, ['b','c']] = 0
In [4]: df
Out[4]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 0.000000 0.000000 1.934512
3 0.266113 0.000000 0.000000 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
EDIT
After the extra information, the following will return all columns - where some condition is met - with halved values:
>> condition = df.a > 0
>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)
Another vectorized solution is to use the mask() method to halve the rows corresponding to stream=2 and join() these columns to a dataframe that consists only of the stream column:
cols = ['feat', 'another_feat']
df[['stream']].join(df[cols].mask(df['stream'] == 2, lambda x: x/2))
or you can also update() the original dataframe:
df.update(df[cols].mask(df['stream'] == 2, lambda x: x/2))
Both of the above codes do the following:
mask() is even simpler to use if the value to replace is a constant (not derived using a function); e.g. the following code replaces all feat values corresponding to stream equal to 1 or 3 by 100.1
df[['stream']].join(df.filter(like='feat').mask(df['stream'].isin([1,3]), 100))
1: feat columns can be selected using filter() method as well.

Creating a new column based on if-elif-else condition

I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if rowA > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if else cases I do np.where(df.A > df.B, 1, -1), does pandas provide a special syntax for solving my problem with one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
if row['A'] == row['B']:
val = 0
elif row['A'] > row['B']:
val = 1
else:
val = -1
return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable. Especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
df['A'] == df['B'], 0, np.where(
df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
Easy to solve using indexing. The first line of code reads like so, if column A is equal to column B then create and set column C equal to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if
conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Lets say above one is your original dataframe and you want to add a new column 'old'
If age greater than 50 then we consider as older=yes otherwise False
step 1: Get the indexes of rows whose age greater than 50
row_indexes=df[df['age']>=50].index
step 2:
Using .loc we can assign a new value to column
df.loc[row_indexes,'elderly']="yes"
same for age below less than 50
row_indexes=df[df['age']<50].index
df[row_indexes,'elderly']="no"
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)

Categories