Pandas modify column values based on another DataFrame - python

I am trying to add values to a column based on a couple of conditions. Here is the code example:
import pandas as pd
df1 = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'], 'Val': [20, -10, 20, -10, 30, -20, 40, -30]})
df2 = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'C', 'C'], 'Cat':['p', 'n', 'p', 'n','p', 'n'], 'Val': [30, -40, 20, -30, 10, -20]})
for index, _ in df1.iterrows():
    if df1.loc[index,'Val'] >=0:
        df1.loc[index,'Val'] = df1.loc[index,'Val'] + float(df2.loc[(df2['Type'] == df1.loc[index,'Type']) & (df2['Cat'] == 'p'), 'Val'])
    else:
        df1.loc[index,'Val'] = df1.loc[index,'Val'] + float(df2.loc[(df2['Type'] == df1.loc[index,'Type']) & (df2['Cat'] == 'n'), 'Val'])
For each value in the 'Val' column of df1, I want to add values from df2, based on the type and whether the original value was positive or negative.
The expected output for this example is alternating 50 and -50 in df1. The above code does the job, but it is too slow to be usable for a large data set. Is there a better way to do this?

Try adding a Cat column to df1, merging with df2 on Type and Cat, summing the two Val columns across axis 1, and then dropping the extra columns:
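import numpy as np
df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
# Pass the labels via columns=; the positional axis argument was removed in pandas 2.0
df1 = df1.drop(columns=['Cat', 'Val_x', 'Val_y'])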
Type Val
0 A 50
1 A -50
2 A 50
3 A -50
4 B 50
5 B -50
6 C 50
7 C -50
Add new column with np.where:
df1['Cat'] = np.where(df1['Val'].lt(0), 'n', 'p')
Type Val Cat
0 A 20 p
1 A -10 n
2 A 20 p
3 A -10 n
4 B 30 p
5 B -20 n
6 C 40 p
7 C -30 n
merge on Type and Cat:
df1 = df1.merge(df2, on=['Type', 'Cat'], how='left')
Type Val_x Cat Val_y
0 A 20 p 30
1 A -10 n -40
2 A 20 p 30
3 A -10 n -40
4 B 30 p 20
5 B -20 n -30
6 C 40 p 10
7 C -30 n -20
sum Val columns:
df1['Val'] = df1[['Val_x', 'Val_y']].sum(axis=1)
Type Val_x Cat Val_y Val
0 A 20 p 30 50
1 A -10 n -40 -50
2 A 20 p 30 50
3 A -10 n -40 -50
4 B 30 p 20 50
5 B -20 n -30 -50
6 C 40 p 10 50
7 C -30 n -20 -50
drop extra columns:
df1 = df1.drop(columns=['Cat', 'Val_x', 'Val_y'])
Type Val
0 A 50
1 A -50
2 A 50
3 A -50
4 B 50
5 B -50
6 C 50
7 C -50
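The steps above can be collected into one self-contained sketch. The function name add_matching_values is hypothetical, and the sketch assumes df2 holds at most one row per (Type, Cat) pair; validate='many_to_one' makes the merge raise if that assumption does not hold.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
                    'Val': [20, -10, 20, -10, 30, -20, 40, -30]})
df2 = pd.DataFrame({'Type': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'Cat': ['p', 'n', 'p', 'n', 'p', 'n'],
                    'Val': [30, -40, 20, -30, 10, -20]})


def add_matching_values(left, lookup):
    """Hypothetical helper: add lookup['Val'] to left['Val'] by Type and sign."""
    out = left.copy()
    # Negative values map to 'n'; zero and positive values map to 'p'
    out['Cat'] = np.where(out['Val'].lt(0), 'n', 'p')
    # A left merge keeps every row of `left` in its original order
    merged = out.merge(lookup, on=['Type', 'Cat'], how='left',
                       validate='many_to_one')
    merged['Val'] = merged['Val_x'] + merged['Val_y']
    return merged.drop(columns=['Cat', 'Val_x', 'Val_y'])


result = add_matching_values(df1, df2)
print(result)
```

Unlike sum(axis=1), the plain + leaves NaN in rows of df1 that have no match in df2, which may make missing lookups easier to spot.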

import numpy as np
df1['sign'] = np.sign(df1.Val)
df2['sign'] = np.sign(df2.Val)
df = pd.merge(df1, df2, on=['Type', 'sign'], suffixes=('_df1', '_df2'))
df['Val'] = df.Val_df1 + df.Val_df2
df = df.drop(columns=['Val_df1', 'sign', 'Val_df2'])
df

Related

How to efficiently reorder rows based on condition?

My dataframe:
df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 10 f
6 20 g
7 20 h
I don't want consecutive rows with col_1 = 10. Instead, the row below a repeated 10 should move up by one (in this case, index 6 should become index 5 and vice versa), so the order is always 10, 20, 10, 20...
My current solution:
for idx, row in df.iterrows():
    if row['col_1'] == 10 and df.iloc[idx + 1]['col_1'] != 20:
        df = df.rename({idx + 1:idx + 2, idx + 2: idx + 1})
        df = df.sort_index()
df
gives me:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
5 20 g
6 10 f
7 20 h
which is what I want, but it is very slow (2.34 s for a dataframe with just over 8000 rows).
Is there a way to avoid the loop here?
Thanks
You can use a custom key in sort_values with groupby.cumcount:
df.sort_values(by='col_1', kind='stable', key=lambda s: df.groupby(s).cumcount())
Output:
col_1 col_2
0 10 a
1 20 b
2 10 c
3 20 d
4 10 e
6 20 g
5 10 f
7 20 h
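The output above keeps the original index labels (6 now comes before 5). If a renumbered index like the one in the question is wanted, a minimal sketch is to sort the cumcount ranks directly and reset the index afterwards. The variable names are illustrative, and the sketch assumes the 10 of each pair appears before its 20 in the original data.

```python
import pandas as pd

df = pd.DataFrame({'col_1': [10, 20, 10, 20, 10, 10, 20, 20],
                   'col_2': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# Rank of each row within its col_1 group: first 10 -> 0, second 10 -> 1, ...
rank = df.groupby('col_1').cumcount()
# A stable sort keeps the original order among rows that share a rank
order = rank.sort_values(kind='stable').index
result = df.loc[order].reset_index(drop=True)
print(result)
```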

Pandas get nearest index with equivalent positive number

I have data similar to this
data = {'A': [10,20,30,10,-10, 20,-20, 10], 'B': [100,200,300,100,-100, 30,-30,100], 'C':[1000,2000,3000,1000, -1000, 40,-40, 1000]}
df = pd.DataFrame(data)
df
Index A B C
0 10 100 1000
1 20 200 2000
2 30 300 3000
3 10 100 1000
4 -10 -100 -1000
5 20 30 40
6 -20 -30 -40
7 10 100 1000
Here the sum of all columns is 1110 for indices 0, 3 and 7, and -1110 for index 4. The sums for indices 5 and 6 are 90 and -90. These rows are exact opposites, so in such cases I want a fourth column D to be populated with the value 'Exact opposite' for indices 3 and 4, and for 5 and 6 (the nearest index).
Similar to
Index A B C D
0 10 100 1000
1 20 200 2000
2 30 300 3000
3 10 100 1000 Exact opposite
4 -10 -100 -1000 Exact opposite
5 20 30 40 Exact opposite
6 -20 -30 -40 Exact opposite
7 10 100 1000
One approach I can think of is to add a column that holds the sum of all the other columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
Index A B C Sum Val
0 10 100 1000 1110
1 20 200 2000 2200
2 30 300 3000 3300
3 10 100 1000 1110
4 -10 -100 -1000 -1110
5 20 30 40 90
6 -20 -30 -40 -90
7 10 100 1000 1110
and then to check for negative values and look for the corresponding equal positive value, but I could not proceed from there.
This should do the trick:
data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)
value_cols = ['A', 'B', 'C']
# Compare each row with the one directly below it (the last row has no successor)
for i in df.index[:-1]:
    for col in value_cols:
        if not df[col][i] == -df[col][i+1]:
            break
    else:
        # No break: every column is the exact negative of the next row
        df.at[i, 'D'] = 'Exact opposite'
        df.at[i+1, 'D'] = 'Exact opposite'
print(df)
This solution only considers 2 adjacent lines.
The following code compares all lines so it also detects lines 0 and 6:
data = {'A': [10,20,30,10,-10, 30, 10], 'B': [100,200,300,100,-100, 300, 100], 'C':[1000,2000,3000,1000, -1000, 3000, 1000]}
df = pd.DataFrame(data)
print(df)
value_cols = ['A', 'B', 'C']
# Compare every row i with every row j at or below it
for i in df.index:
    for j in df.index[i:]:
        for col in value_cols:
            if not df[col][i] == -df[col][j]:
                break
        else:
            # No break: row j is the exact negative of row i in every column
            df.at[i, 'D'] = 'Exact opposite'
            df.at[j, 'D'] = 'Exact opposite'
print(df)
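For the adjacent-row case, the loops can probably be avoided altogether. The following is a minimal vectorized sketch, run on the data from the question rather than the answer's data. The names value_cols, opposite_of_next and in_pair are illustrative, and it only pairs a row with the row directly below it.

```python
import numpy as np
import pandas as pd

data = {'A': [10, 20, 30, 10, -10, 20, -20, 10],
        'B': [100, 200, 300, 100, -100, 30, -30, 100],
        'C': [1000, 2000, 3000, 1000, -1000, 40, -40, 1000]}
df = pd.DataFrame(data)
value_cols = ['A', 'B', 'C']

# True where a row is the exact negative of the row directly below it;
# the last row is compared against NaN and is therefore always False
opposite_of_next = (df[value_cols] == -df[value_cols].shift(-1)).all(axis=1)
# Mark both members of each pair: the row itself and the row below it
in_pair = opposite_of_next | opposite_of_next.shift(1, fill_value=False)
df['D'] = np.where(in_pair, 'Exact opposite', '')
print(df)
```

On this data the sketch marks indices 3, 4, 5 and 6, matching the table requested in the question.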

How can pd.get_dummies() be used to dummy-code a list of categories?

I understand that pd.get_dummies() works very well for creating a dummy set to represent a categorical variable (in my case for a decision tree algorithm). My question is, how can this be adapted to handle entries that are a list of categories?
MWE:
import pandas as pd
a = pd.DataFrame({
    'id': ['i', 'j', 'k', 'l'],
    'category': [['a', 'b'], 'b', 'c', ['b', 'c']],
    'x': ['p', 'q', 'r', 's'],
    'y': [10, 20, 30, 40]
})
...
a_dummied
id a b c x y
0 i 1 1 0 p 10
1 j 0 1 0 q 20
2 k 0 0 1 r 30
3 l 0 1 1 s 40
You can explode the category column and then call pd.get_dummies:
print( pd.get_dummies(a.explode('category').set_index('id'), prefix='', prefix_sep='').groupby(level=0).sum() )
Prints:
a b c
id
i 1 1 0
j 0 1 0
k 0 0 1
l 0 1 1
EDIT: To work with more columns, first make a pd.get_dummies() on category column and then .join with original dataframe:
c = pd.get_dummies( a[['id', 'category']].explode('category').set_index('id'), prefix='', prefix_sep='').groupby(level=0).sum()
print( a.set_index('id').drop(columns='category').join(c) )
Prints:
x y a b c
id
i p 10 1 1 0
j q 20 0 1 0
k r 30 0 0 1
l s 40 0 1 1
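A self-contained variant of the same idea, as a sketch: explode only the category column, dummy-code the resulting Series, and join back on id so that id stays a regular column as in the desired a_dummied. The dtype=int argument is used here on the assumption that 0/1 integers are wanted rather than booleans.

```python
import pandas as pd

a = pd.DataFrame({
    'id': ['i', 'j', 'k', 'l'],
    'category': [['a', 'b'], 'b', 'c', ['b', 'c']],
    'x': ['p', 'q', 'r', 's'],
    'y': [10, 20, 30, 40],
})

# One row per (id, category) pair; scalar entries such as 'b' pass through unchanged
exploded = a.set_index('id')['category'].explode()
# Dummy-code the exploded Series, then collapse back to one row per id
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()
# Match the 'id' column of `a` against the index of `dummies`
a_dummied = a.drop(columns='category').join(dummies, on='id')
print(a_dummied)
```

If a list could contain the same category twice, sum() would produce a 2; replacing it with max() would keep the columns strictly 0/1.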

pandas: groupby sum conditional on other column

I have a dataframe which looks like this:
df = pd.DataFrame({'a':['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b':['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c':[20, 5, 12, 8, 15, 10, 25, 13]})
a b c
0 A Y 20
1 B Y 5
2 B N 12
3 C Y 8
4 C Y 15
5 D N 10
6 D N 25
7 E N 13
I would like to group by column 'a', keep 'Y' in column 'b' if any row in the group is 'Y' (or True), and sum column 'c'.
The resulting dataframe should look like this:
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
I tried the code below but get an error:
df.groupby('a')['b'].max()['c'].sum()
You can use agg with max and sum. Taking the max of column 'b' works because strings compare lexicographically and 'Y' > 'N', so any group that contains a 'Y' yields 'Y'.
print(df.groupby('a', as_index=False).agg({'b': 'max', 'c': 'sum'}))
a b c
0 A Y 20
1 B Y 17
2 C Y 23
3 D N 35
4 E N 13
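The max trick depends on 'Y' sorting after 'N'. If the flag values were different strings, an explicit any-check may be safer. Here is a minimal sketch using named aggregation; the lambda is an illustrative stand-in for the "any row is 'Y'" rule, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
                   'b': ['Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N'],
                   'c': [20, 5, 12, 8, 15, 10, 25, 13]})

result = df.groupby('a', as_index=False).agg(
    # 'Y' if at least one row in the group is 'Y', otherwise 'N'
    b=('b', lambda s: 'Y' if s.eq('Y').any() else 'N'),
    c=('c', 'sum'),
)
print(result)
```

A Python lambda is slower than the built-in 'max' on large data, so this trades some speed for explicitness.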

Python - Pandas - Edit duplicate items keeping last

Let's say my df is:
import pandas as pd
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2':[10,20, 30, 10, 20, 10, 10, 20, 30]})
How can I make all numbers zero keeping the last one only? In this case the result should be:
col1 col2
a 0
a 0
a 30
b 0
b 20
c 10
d 0
d 0
d 30
Thanks!
Use loc and duplicated with the argument keep='last':
df.loc[df.duplicated(subset='col1',keep='last'), 'col2'] = 0
>>> df
col1 col2
0 a 0
1 a 0
2 a 30
3 b 0
4 b 20
5 c 10
6 d 0
7 d 0
8 d 30
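To see why this works, it may help to look at the mask on its own: with keep='last', duplicated marks every occurrence of a col1 value except the final one. The sketch below also shows a non-mutating variant using where; the names is_not_last and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'col2': [10, 20, 30, 10, 20, 10, 10, 20, 30]})

# True for every row that is followed by a later row with the same col1
is_not_last = df.duplicated(subset='col1', keep='last')
print(is_not_last.tolist())

# Keep col2 where the row is the last of its group, otherwise use 0;
# assign returns a new DataFrame and leaves df untouched
result = df.assign(col2=df['col2'].where(~is_not_last, 0))
print(result)
```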
