Creating a new column based on if-elif-else condition - python

I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if-else cases I do np.where(df.A > df.B, 1, -1). Does pandas provide a special syntax for solving my problem in one step (without the necessity of creating 3 new columns and then combining the result)?

To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized, so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))

df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
This is easy to solve with boolean indexing. The first line of code reads: where column A equals column B, create column C and set it to 0.
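Note that rows matching none of the conditions are left as NaN, since the first .loc line only creates column C for the matching rows. If your conditions were not exhaustive, one option (a sketch of mine, not from the original answer) is to pre-fill a default value that plays the role of the final else:
df['C'] = -1                          # default value, acts as the final "else"
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1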

For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
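One caveat (my addition, not part of the original answer): np.sign propagates NaN, so if A or B contain missing values the corresponding entries of C stay NaN rather than getting one of the three labels:
>>> df2 = pd.DataFrame({'A': [2, 3, np.nan], 'B': [2, 1, 3]})   # hypothetical frame with a missing value
>>> np.sign(df2.A - df2.B)
0    0.0
1    1.0
2    NaN
dtype: float64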

When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
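Note (my addition): np.select takes a default argument for rows matching none of the conditions; it is 0 when omitted, which here happens to collide with the first choice, so with non-exhaustive conditions an explicit default is safer:
In [4103]: df['C'] = np.select(conditions, choices, default=np.nan)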

Let's say the above is your original dataframe and you want to add a new column 'elderly'.
If age is 50 or above we label the row "yes", otherwise "no".
Step 1: get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
Same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
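The same logic fits in one vectorized line with np.where (a sketch of mine, assuming numpy is imported as np):
df['elderly'] = np.where(df['age'] >= 50, 'yes', 'no')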

You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
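On pandas 2.2 and newer there is also Series.case_when, which reads much like the chained masks above; a sketch assuming that version:
df['C'] = df['A'].case_when([
    (df.A.eq(df.B), 0),
    (df.A.gt(df.B), 1),
    (df.A.lt(df.B), -1),
])
The caller series (df['A'] here) only supplies fallback values for rows matching no condition; since these three conditions are exhaustive, every row is replaced.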

Related

Python. Change numeric data into categorical [duplicate]


Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 if the value in column 1 is greater than 1.
No need to loop through the entire dataframe.
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
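Equivalently (my sketch, assuming numpy is imported as np), the same conditional replacement fits in one np.where call:
df['col1'] = np.where(df['col1'] > 1, df['col2'], df['col1'])
# or keep it in pandas: df['col1'] = df['col1'].where(df['col1'] <= 1, df['col2'])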
Using a for loop:
for i, row in df.iterrows():
    if row['col1'] > 1:
        # assign through df.loc; mutating `row` here would not write back to df
        df.loc[i, 'col1'] = row['col2']
    # further elif / else branches would put their own assignments here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
A B
0 1 4
1 2 4
2 3 6
----------
A B
0 1 4
1 4 4
2 3 6

Swapping column values based on column conditions (Pandas DataFrame)

The DataFrame has two integer columns, a and b:
a b
1 3
4 2
2 0
6 1
...
I need to swap in the following way:
if df.a > df.b:
    temp = df.b
    df.b = df.a
    df.a = temp
expected output:
a b
1 3
2 4 <----
0 2 <----
1 6 <----
Basically always having in column A the smaller value of the twos.
I feel I should use loc but I couldn't find the right way yet.
In [443]: df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
In [444]: df
Out[444]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
or
pd.DataFrame(np.sort(df.values, axis=1), df.index, df.columns)
Using np.where you can do
In [21]: df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
In [23]: df
Out[23]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or, using .loc
In [35]: cond = df.a > df.b
In [36]: df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
In [37]: df
Out[37]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or use .apply(np.sort, axis=1) if you want the smaller values in a and the larger ones in b:
In [54]: df.apply(np.sort, axis=1)
Out[54]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Seeing the methods proposed by @JohnGait and @MaxU, I did a small speed comparison.
import time
import numpy as np
import pandas as pd

arr = np.random.randint(low=100, size=(10000000, 2))
# using np.where
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_0 = time.time()
df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
t_1 = time.time()
# using df.loc
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_2 = time.time()
cond = df.a > df.b
df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
t_3 = time.time()
# using df.min
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_4 = time.time()
df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
t_5 = time.time()
# using np.sort
t_6 = time.time()
df_ = pd.DataFrame(np.sort(arr, axis=1), df.index, df.columns)
t_7 = time.time()
t_1 - t_0 # using np.where: 5.759037971496582
t_3 - t_2 # using .loc: 0.12156987190246582
t_5 - t_4 # using df.min: 1.0503261089324951
t_7 - t_6 # using np.sort: 0.20351791381835938
Although the second approach is the fastest, the practical gain is insignificant; I am adding it here for pedantic reasons. I didn't time the row-wise .apply(np.sort, axis=1) method as I am convinced it would be a lot slower.
EDIT
I had wrongly reported the computation time of np.where due to a mistake I made. Corrected that (it turns out to be the slowest of the lot!) and added another method (following @MaxU's comment).
Solution
It's as simple as
df.values.sort(1)
df
a b
0 1 3
1 2 4
2 0 2
3 1 6
What Happened
I can sort a numpy.array in place with its sort method. I pass axis=1 to indicate that I want to sort along axis 1, i.e. within each row. The values attribute of a dataframe accesses the underlying numpy array, so df.values.sort(1) sorts the underlying values in place, row wise... done.
We can be a bit more explicit with
df.values[:] = np.sort(df.values, 1)
This allows us a lot of flexibility to perform this over subsets of columns or reverse sort
df.values[:, ::-1] = np.sort(df.values, 1)
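A caveat of mine, not from the original answer: whether df.values is a writable view into the frame depends on the dtypes and the pandas version (under copy-on-write it may be a copy), so the in-place trick can silently do nothing. Assigning the sorted array back avoids relying on view semantics:
df[:] = np.sort(df.to_numpy(), axis=1)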

Update row values where certain condition is met in pandas

Say I have the following dataframe:
  stream        feat another_feat
a      1  some_value   some_value
b      2  some_value   some_value
c      2  some_value   some_value
d      3  some_value   some_value
What is the most efficient way to update the values of the columns feat and another_feat where the stream is number 2?
Is this it?
for index, row in df.iterrows():
    if df1.loc[index, 'stream'] == 2:
        # do something
How do I do it if there are more than 100 columns? I don't want to explicitly name the columns that I want to update. I want to divide the value of each column by 2 (except for the stream column).
So to be clear, my goal is:
Dividing all values by 2 of all rows that have stream 2, but not changing the stream column.
I think you can use loc if you need to update two columns to the same value:
df1.loc[df1['stream'] == 2, ['feat', 'another_feat']] = 'aaaa'
print(df1)
stream feat another_feat
a 1 some_value some_value
b 2 aaaa aaaa
c 2 aaaa aaaa
d 3 some_value some_value
If you need to update them separately, one option is:
df1.loc[df1['stream'] == 2, 'feat'] = 10
print(df1)
stream feat another_feat
a 1 some_value some_value
b 2 10 some_value
c 2 10 some_value
d 3 some_value some_value
Another common option is numpy.where:
df1['feat'] = np.where(df1['stream'] == 2, 10, 20)
print(df1)
stream feat another_feat
a 1 20 some_value
b 2 10 some_value
c 2 10 some_value
d 3 20 some_value
EDIT: If you need to divide all columns except stream where the condition is True, use:
print(df1)
stream feat another_feat
a 1 4 5
b 2 4 5
c 2 2 9
d 3 1 7
# select all columns except stream
cols = [col for col in df1.columns if col != 'stream']
print(cols)
['feat', 'another_feat']
df1.loc[df1['stream'] == 2, cols] = df1 / 2
print(df1)
stream feat another_feat
a 1 4.0 5.0
b 2 2.0 2.5
c 2 1.0 4.5
d 3 1.0 7.0
If you are working with multiple conditions, you can use multiple numpy.where calls or numpy.select:
df0 = pd.DataFrame({'Col': [5, 0, -6]})
df0['New Col1'] = np.where(df0['Col'] > 0, 'Increasing',
                  np.where(df0['Col'] < 0, 'Decreasing', 'No Change'))
df0['New Col2'] = np.select([df0['Col'] > 0, df0['Col'] < 0],
                            ['Increasing', 'Decreasing'],
                            default='No Change')
print(df0)
Col New Col1 New Col2
0 5 Increasing Increasing
1 0 No Change No Change
2 -6 Decreasing Decreasing
You can do the same with .loc (.ix, used in older pandas, has since been deprecated and removed), like this:
In [1]: df = pd.DataFrame(np.random.randn(5,4), columns=list('abcd'))
In [2]: df
Out[2]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 -0.905302 -0.435821 1.934512
3 0.266113 -0.034305 -0.110272 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
In [3]: df.loc[df.a > 0, ['b', 'c']] = 0
In [4]: df
Out[4]:
a b c d
0 -0.323772 0.839542 0.173414 -1.341793
1 -1.001287 0.676910 0.465536 0.229544
2 0.963484 0.000000 0.000000 1.934512
3 0.266113 0.000000 0.000000 -0.720599
4 -0.522134 -0.913792 1.862832 0.314315
EDIT
After the extra information, the following will return all columns (for the rows where the condition is met) with halved values:
>>> condition = df.a > 0
>>> df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: x/2)
Another vectorized solution is to use the mask() method to halve the rows corresponding to stream=2 and join() these columns to a dataframe that consists only of the stream column:
cols = ['feat', 'another_feat']
df[['stream']].join(df[cols].mask(df['stream'] == 2, lambda x: x/2))
or you can also update() the original dataframe:
df.update(df[cols].mask(df['stream'] == 2, lambda x: x/2))
Both of the above snippets halve the feat columns only in the rows where stream equals 2, leaving the other rows untouched; the difference is that join() returns a new dataframe, while update() modifies df in place.
mask() is even simpler to use if the value to replace is a constant (not derived using a function); e.g. the following code replaces all feat values corresponding to stream equal to 1 or 3 with 100.[1]
df[['stream']].join(df.filter(like='feat').mask(df['stream'].isin([1,3]), 100))
[1]: feat columns can be selected using the filter() method as well.

retain function in python

Recently I have been converting from SAS to Python pandas. One question I have: does pandas have something like the RETAIN function in SAS, so that I can dynamically reference the last record? In the following code, I have to manually loop through each line and reference the last record. It seems pretty slow compared to the similar SAS program. Is there any way to make it more efficient in pandas? Thank you.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = np.nan
df['lag_C'] = np.nan
for row in df.index:
    if row == df.head(1).index:
        df.loc[row, 'C'] = (df.loc[row, 'A'] == 0) + 0
    else:
        if (df.loc[row, 'B'] == 1):
            df.loc[row, 'C'] = 1
        elif (df.loc[row, 'lag_C'] == 0):
            df.loc[row, 'C'] = 0
        elif (df.loc[row, 'lag_C'] != 0):
            df.loc[row, 'C'] = df.loc[row, 'lag_C'] + 1
    if row != df.tail(1).index:
        df.loc[row + 1, 'lag_C'] = df.loc[row, 'C']
A very complicated algorithm, but I'll try a vectorized approach.
If I understand it correctly, a cumulative sum can be used, as in this question. The last column, lag_C, is just column C shifted by one row.
But my algorithm can't be used for the first rows of df, because those rows are counted from the first value of column A and sometimes column B. So I created a column D, where the problematic rows are flagged, and later their values are copied to the output column C if the conditions are True.
I changed the input data to test the problematic first rows. I test all three possibilities of the first 3 rows of column B against the first row of column A.
My input conditions are:
Columns A and B contain only 1 or 0. Columns C and lag_C are helper columns containing only NaN.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
df1 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
# cumulative sum of column B
df1['C'] = df1['B'].cumsum()
df1['lag_C'] = 1
# the first 'group' (minimum value of C) is problematic; copy it to column D for later use
df1.loc[df1['C'] == df1['C'].min(), 'D'] = df1['B']
# cumulative sums within the groups, back into column C
df1['C'] = df1.groupby(['C'])['lag_C'].cumsum()
# correct the problematic states in column C, using the values from D
if df1['A'].loc[0] == 1:
    df1.loc[df1['D'].notnull(), 'C'] = df1['D']
if (df1['A'].loc[0] == 1) & (df1['B'].loc[0] == 1):
    df1.loc[df1['D'].notnull(), 'C'] = 0
del df1['D']
# lag_C is column C shifted by one row
df1['lag_C'] = df1['C'].shift(1)
print(df1)
# A B C lag_C
#0 1 0 0 NaN
#1 1 0 0 0
#2 1 1 1 0
#3 1 1 1 1
#4 1 0 2 1
#5 0 0 3 2
#6 0 0 4 3
#7 1 1 1 4
#8 1 0 2 1
#9 0 1 1 2
#10 0 0 2 1
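As a footnote of mine, the same output can be produced more compactly with groupby/cumcount, under the assumption (true for the sample data above) that the first value of B is 0:
import pandas as pd

df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0],
                   'B': [0,0,1,1,0,0,0,1,0,1,0]})
# each row with B == 1 starts a new group; within a group C counts 1, 2, 3, ...
grp = df['B'].cumsum()
df['C'] = df.groupby(grp).cumcount() + 1
# rows before the first B == 1 follow the question's first-row rule:
# all zeros when the first A is 1, otherwise they keep counting 1, 2, 3, ...
if df.loc[0, 'A'] == 1:
    df.loc[grp.eq(0), 'C'] = 0
df['lag_C'] = df['C'].shift(1)
print(df)   # reproduces the table above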
