Update row values where a certain condition is met in pandas - python

Say I have the following dataframe:
   stream        feat another_feat
a       1  some_value   some_value
b       2  some_value   some_value
c       2  some_value   some_value
d       3  some_value   some_value
What is the most efficient way to update the values of the columns feat and another_feat where stream is 2?
Is this it?
for index, row in df1.iterrows():
    if df1.loc[index, 'stream'] == 2:
        # do something
How do I do it if there are more than 100 columns? I don't want to explicitly name the columns that I want to update. I want to divide the value of each column by 2 (except for the stream column).
So to be clear, my goal is:
Dividing all values by 2 of all rows that have stream 2, but not changing the stream column.

I think you can use loc if you need to update two columns with the same value:
df1.loc[df1['stream'] == 2, ['feat', 'another_feat']] = 'aaaa'
print(df1)
   stream        feat another_feat
a       1  some_value   some_value
b       2        aaaa         aaaa
c       2        aaaa         aaaa
d       3  some_value   some_value
If you need to update the columns separately, one option is:
df1.loc[df1['stream'] == 2, 'feat'] = 10
print(df1)
   stream        feat another_feat
a       1  some_value   some_value
b       2          10   some_value
c       2          10   some_value
d       3  some_value   some_value
Another common option is to use numpy.where:
df1['feat'] = np.where(df1['stream'] == 2, 10, 20)
print(df1)
   stream  feat another_feat
a       1    20   some_value
b       2    10   some_value
c       2    10   some_value
d       3    20   some_value
EDIT: If you need to divide all columns except stream where the condition is True, use:
print(df1)
   stream  feat another_feat
a       1     4            5
b       2     4            5
c       2     2            9
d       3     1            7
# select all columns except 'stream'
cols = [col for col in df1.columns if col != 'stream']
print(cols)
['feat', 'another_feat']
df1.loc[df1['stream'] == 2, cols] = df1[cols] / 2
print(df1)
   stream  feat another_feat
a       1   4.0          5.0
b       2   2.0          2.5
c       2   1.0          4.5
d       3   1.0          7.0
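An equivalent in-place spelling of the division step uses an augmented assignment, which avoids building the full right-hand frame (a small sketch; note that integer columns are upcast to float):
df1.loc[df1['stream'] == 2, cols] /= 2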
If you are working with multiple conditions, it is possible to use multiple nested numpy.where calls, or numpy.select:
df0 = pd.DataFrame({'Col': [5, 0, -6]})
df0['New Col1'] = np.where(df0['Col'] > 0, 'Increasing',
                  np.where(df0['Col'] < 0, 'Decreasing', 'No Change'))
df0['New Col2'] = np.select([df0['Col'] > 0, df0['Col'] < 0],
                            ['Increasing', 'Decreasing'],
                            default='No Change')
print(df0)
   Col    New Col1    New Col2
0    5  Increasing  Increasing
1    0   No Change   No Change
2   -6  Decreasing  Decreasing

You can do the same with .ix (note: .ix has since been deprecated and was removed in pandas 1.0; use .loc instead), like this:
In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))
In [2]: df
Out[2]:
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484 -0.905302 -0.435821  1.934512
3  0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315
In [3]: df.ix[df.a>0, ['b','c']] = 0
In [4]: df
Out[4]:
          a         b         c         d
0 -0.323772  0.839542  0.173414 -1.341793
1 -1.001287  0.676910  0.465536  0.229544
2  0.963484  0.000000  0.000000  1.934512
3  0.266113  0.000000  0.000000 -0.720599
4 -0.522134 -0.913792  1.862832  0.314315
EDIT
After the extra information, the following will return all columns except 'a', with halved values, for the rows where the condition is met:
>>> condition = df.a > 0
>>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x / 2)
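Note that this expression only returns the halved copy. Since .ix is gone from modern pandas, an in-place variant with .loc might look like this (a sketch, using the same df and condition as above):
cols = [c for c in df.columns if c != 'a']
df.loc[df.a > 0, cols] = df.loc[df.a > 0, cols] / 2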

Another vectorized solution is to use the mask() method to halve the rows corresponding to stream=2 and join() these columns to a dataframe that consists only of the stream column:
cols = ['feat', 'another_feat']
df[['stream']].join(df[cols].mask(df['stream'] == 2, lambda x: x/2))
or you can also update() the original dataframe:
df.update(df[cols].mask(df['stream'] == 2, lambda x: x/2))
Both of the above snippets halve feat and another_feat in the rows where stream equals 2, and leave the stream column unchanged.
mask() is even simpler to use if the replacement value is a constant (not derived using a function); e.g. the following code replaces all feat values corresponding to stream equal to 1 or 3 with 100¹:
df[['stream']].join(df.filter(like='feat').mask(df['stream'].isin([1,3]), 100))
¹ feat columns can be selected using the filter() method as well.
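A minimal end-to-end run of the update() variant, assuming a df shaped like the numeric example from the EDIT above:
import pandas as pd

df = pd.DataFrame({'stream': [1, 2, 2, 3],
                   'feat': [4, 4, 2, 1],
                   'another_feat': [5, 5, 9, 7]},
                  index=list('abcd'))
cols = ['feat', 'another_feat']
# halve the masked rows, then write the result back into df
df.update(df[cols].mask(df['stream'] == 2, lambda x: x / 2))
print(df)  # rows b and c have feat/another_feat halved; stream is intact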

Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, when they do not match, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
where I get the error 'The truth value of a Series is ambiguous'.
I believe this is because the comparison produces a Series of boolean values, which causes the ambiguity. I also tried various forms of for loops, which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
   col1
2     3
4     5
   col1
2     9
4     7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the following and see if this helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row; the resulting boolean mask is passed to df.loc[mask, [col1, col2]], which returns the required set of rows where col1 != col2.
Option-2: We build the boolean mask directly with df[col1] != df[col2]; the rest of the logic is the same as in Option-1.
Dummy Data
I made the dummy data such that at indices 2, 6 and 8 columns 'a' and 'c' differ. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
   a   b   c
0  0   0   0
1  1   1   1
2  2   4   0
3  3   9   3
4  4  16   4
5  5  25   5
6  6  36  20
7  7  49   7
8  8  64  40
9  9  81   9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
   a   c
2  2   0
6  6  20
8  8  40
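As a side note, pandas 1.1 added Series.compare / DataFrame.compare, which return only the mismatching positions with both values side by side. A sketch on the df1/df2 pair from the first answer:
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
# compare() keeps only the rows where the values differ,
# labelled 'self' (df1) and 'other' (df2)
print(df1['col1'].compare(df2['col1']))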

Python. Change numeric data into categorical [duplicate]

I have a DataFrame df:
   A  B
a  2  2
b  3  1
c  1  3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
For typical if/else cases I do np.where(df.A > df.B, 1, -1). Does pandas provide a special syntax for solving my problem in one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Of course, this is not vectorized so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable. Especially coming from a SAS background.
Edit
Here is the vectorized version:
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
This is easy to solve using indexing. The first line of code reads: if column A is equal to column B, then create column C and set it to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
   A  B  C
a  2  2  0
b  3  1  1
c  1  3 -1
Let's say the above is your original dataframe and you want to add a new column 'elderly'.
If age is greater than or equal to 50, we mark the person as elderly ('yes'), otherwise not ('no').
Step 1: Get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: Using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
Do the same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
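The two steps can also be collapsed into a single vectorized assignment with numpy.where (a sketch, assuming the same hypothetical age column):
import numpy as np
df['elderly'] = np.where(df['age'] >= 50, 'yes', 'no')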
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)

Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 if the value in column 1 is greater than 1.
No need to loop through the entire dataframe.
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
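Equivalently, the replacement can be written with Series.mask, avoiding the intermediate index variable (a sketch using the same column names):
df['col1'] = df['col1'].mask(df['col1'] > 1, df['col2'])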
Using a for loop (note that the row yielded by iterrows() is a copy, so write any changes back through the dataframe itself):
for i, row in df.iterrows():
    if row['col1'] > 1:
        df.at[i, 'col1'] = row['col2']
    elif condition:
        pass  # put assignment here
    else:
        pass  # put assignment here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
Output:
   A  B
0  1  4
1  2  4
2  3  6
----------
   A  B
0  1  4
1  4  4
2  3  6

Show complete rows highlighting differences between dataframes df1 and df2, but only when a difference in a row's cell exists

I have two dataframes, df1 and df2, with the same index and the same column names.
How do I construct a dataframe which shows the difference, but only for rows which have at least one differing cell?
If a row has differing cells but some are the same, keep the same cells intact.
example:
df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})
Output needed:
pd.DataFrame({1: ['a'], 2: ['c->d']}, index=[0])
The output in this example should be a one-row dataframe, not a dataframe including the unchanged rows.
NB: the output should only contain full rows which have at least one difference in a cell.
I'd like an efficient solution without iterating over rows, and without creating special strings in the DataFrame.
You can use this brilliant solution (note: pd.Panel, used below, was removed in pandas 1.0; a Panel-free sketch follows at the end of this answer):
def report_diff(x):
    return x[0] if x[0] == x[1] else '{}->{}'.format(*x)
In [70]: pd.Panel(dict(df1=df1,df2=df2)).apply(report_diff, axis=0)
Out[70]:
   1     2
0  a  c->d
1  a     c
For a bit more complex DataFrames:
In [73]: df1
Out[73]:
   A  B  C
0  a  c  1
1  a  c  2
2  1  2  3
In [74]: df2
Out[74]:
   A  B  C
0  a  d  1
1  a  c  2
2  1  2  4
In [75]: pd.Panel(dict(df1=df1, df2=df2)).apply(report_diff, axis=0)
Out[75]:
   A     B     C
0  a  c->d     1
1  a     c     2
2  1     2  3->4
UPDATE: showing only the changed/different rows:
In [54]: mask = df1.ne(df2).any(axis=1)
In [55]: mask
Out[55]:
0     True
1    False
2     True
dtype: bool
In [56]: pd.Panel(dict(df1=df1[mask], df2=df2[mask])).apply(report_diff, axis=0)
Out[56]:
   A     B     C
0  a  c->d     1
2  1     2  3->4
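Since pd.Panel is unavailable on recent pandas, here is a minimal Panel-free sketch of the same report, assuming df1 and df2 share index and columns:
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'a', 1], 'B': ['c', 'c', 2], 'C': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['a', 'a', 1], 'B': ['d', 'c', 2], 'C': [1, 2, 4]})

mask = df1.ne(df2)  # True where cells differ
diff = df1.where(~mask, df1.astype(str) + '->' + df2.astype(str))
print(diff[mask.any(axis=1)])  # only the rows with at least one change
#    A     B     C
# 0  a  c->d     1
# 2  1     2  3->4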
How about a good ole list comprehension on the flattened contents...
import pandas as pd
import numpy as np
df1=pd.DataFrame({1:['a','a'],2:['c','c']})
df2=pd.DataFrame({1:['a','a'],2:['d','c']})
rows_different_mask = (df1 != df2).any(axis=1)  # rows with at least one change
pairs = zip(df1.values.reshape(1, -1)[0], df2.values.reshape(1, -1)[0])
new_elems = ["%s->%s" % (old, new) if (old != new) else new for old, new in pairs]
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape))
print(df3)
   0     1
0  a  c->d
1  a     c
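To keep df1's index and column labels, and to actually apply the row mask computed above, the last two lines could become (a small extension of the same sketch):
df3 = pd.DataFrame(np.reshape(new_elems, df1.values.shape),
                   index=df1.index, columns=df1.columns)
print(df3[rows_different_mask])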

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing NaN. I'd like to drop the columns with a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna; you just need to pass the length of your df minus the number of NaN values you allow as your threshold:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
          A         B
0  0.517199 -0.806304
1 -0.643074  0.229602
2  0.656728  0.535155
3       NaN -0.162345
4 -0.309663 -0.783539
5  1.244725 -0.274514
6 -0.254232       NaN
7 -1.242430  0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above will drop any column that does not have at least len(dff) - 2 non-NaN values, i.e. any column with more than 2 NaN.
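One caution: thresh counts the non-NaN values a column needs in order to survive, so thresh=len(dff) - 2 would still keep a column with exactly 2 NaN. To drop columns with 2 or more NaN strictly, a sketch:
dff.dropna(thresh=len(dff) - 1, axis=1)  # keep columns with at most 1 NaN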
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
          A         B
0 -0.819004  0.919190
1  0.922164  0.088111
2  0.188150  0.847099
3       NaN -0.053563
4  1.327250 -0.376076
5  3.724980  0.292757
6 -0.319342       NaN
7 -1.051529  0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().sum()  # count the number of NaN in each column
print(s)
A    1
B    1
C    3
dtype: int64
for col in list(dff.columns):  # copy the labels, since dff is mutated in the loop
    if s[col] >= 2:
        del dff[col]
Or:
for c in list(dff.columns):
    if dff[c].isnull().sum() >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution:
dff.drop(columns=dff.columns[dff.isnull().sum() >= 2])
Say you have to drop columns having more than 70% null values:
data.drop(data.loc[:, (100 * data.isnull().sum() / len(data.index)) > 70].columns, axis=1)
You can take another approach as well for dropping columns having a certain number of NaN values:
df = df.drop(columns=[x for x in df if df[x].isna().sum() > 5])
And for dropping columns having a certain percentage of NaN values:
df = df.drop(columns=[x for x in df if round(df[x].isna().sum() / len(df) * 100, 2) > 20])
