How to perform a conditional update of column values in a Pandas DataFrame? - python

I have the dataframe below. Is there any way to perform conditional addition of column values in pandas?
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 NaN 4 5 5 54 3 2
222 bbb pune 1 70 NaN 5 4 4 8 3 4
333 ccc mumbai 2 NaN NaN 9 3 4 8 4 3
444 ddd hyd 4 NaN NaN 3 8 6 4 2 7
What I want to achieve:
If City == pune, default_sal should be copied into total_sal. For example, for emp_id 111, total_sal should be 90.
If City != pune, total_sal should depend on the months_worked value. For example, for emp_id 333, months_worked = 2, so total_sal should be the sum of the jan and feb values, which is 9+3=12.
Desired output:
emp_id emp_name City months_worked default_sal total_sal jan feb mar apr may jun
111 aaa pune 2 90 90 4 5 5 54 3 2
222 bbb pune 1 70 70 5 4 4 8 3 4
333 ccc mumbai 2 NaN 12 9 3 4 8 4 3
444 ddd hyd 4 NaN 21 3 8 6 4 2 7

Use np.where after creating a helper series that sums the first months_worked month columns of each row:
s1=pd.Series([df.iloc[x,6:y+6].sum() for x,y in enumerate(df.months_worked)],index=df.index)
np.where(df.City=='pune',df.default_sal,s1 )
Out[429]: array([90., 70., 12., 21.])
#df['total']=np.where(df.City=='pune',df.default_sal,s1 )
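The list comprehension above loops over the rows in Python. As a rough sketch of a vectorized variant (not part of the original answer), a boolean mask can select the first months_worked month columns of each row. The frame is rebuilt from the question's table, and month_cols, mask and partial_sum are illustrative names only.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the table in the question.
df = pd.DataFrame({
    "emp_id": [111, 222, 333, 444],
    "emp_name": ["aaa", "bbb", "ccc", "ddd"],
    "City": ["pune", "pune", "mumbai", "hyd"],
    "months_worked": [2, 1, 2, 4],
    "default_sal": [90, 70, np.nan, np.nan],
    "total_sal": np.nan,
    "jan": [4, 5, 9, 3], "feb": [5, 4, 3, 8], "mar": [5, 4, 4, 6],
    "apr": [54, 8, 8, 4], "may": [3, 3, 4, 2], "jun": [2, 4, 3, 7],
})

month_cols = ["jan", "feb", "mar", "apr", "may", "jun"]
months = df[month_cols].to_numpy()

# mask[i, j] is True when month j falls within the first months_worked[i] months.
mask = np.arange(len(month_cols)) < df["months_worked"].to_numpy()[:, None]
partial_sum = (months * mask).sum(axis=1)

# pune rows take default_sal, all other rows take the partial month sum.
df["total_sal"] = np.where(df["City"].eq("pune"), df["default_sal"], partial_sum)
print(df[["emp_id", "City", "total_sal"]])
```

This assumes the month columns are listed in calendar order, as they are in the question.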

Related

Transpose/Pivot DataFrame but not all columns in the same row

I have a DataFrame and I want to transpose it.
import pandas as pd
df = pd.DataFrame({'ID':[111,111,222,222,333,333],'Month':['Jan','Feb','Jan','Feb','Jan','Feb'],
'Employees':[2,3,1,5,7,1],'Subsidy':[20,30,10,15,40,5]})
print(df)
ID Month Employees Subsidy
0 111 Jan 2 20
1 111 Feb 3 30
2 222 Jan 1 10
3 222 Feb 5 15
4 333 Jan 7 40
5 333 Feb 1 5
Desired output:
ID Var Jan Feb
0 111 Employees 2 3
1 111 Subsidy 20 30
0 222 Employees 1 5
1 222 Subsidy 10 15
0 333 Employees 7 1
1 333 Subsidy 40 5
My attempt: I tried using pivot_table(), but Employees and Subsidy naturally appear in the same rows, whereas I want them on separate rows.
df.pivot_table(index=['ID'],columns='Month',values=['Employees','Subsidy'])
Employees Subsidy
Month Feb Jan Feb Jan
ID
111 3 2 30 20
222 5 1 15 10
333 1 7 5 40
I tried using transpose(), but it transposes entire DataFrame, it seems there is no possibility to transpose by first fixing a column. Any suggestions?
After pivoting, use DataFrame.rename_axis to name the first column level Var and set the second level's name to None, so the Month name does not appear in the final DataFrame. Then reshape by the first level with DataFrame.stack, and convert the resulting MultiIndex to columns with DataFrame.reset_index:
df2 = (df.pivot_table(index='ID',
columns='Month',
values=['Employees','Subsidy'])
.rename_axis(['Var',None], axis=1)
.stack(level=0)
.reset_index()
)
print (df2)
ID Var Feb Jan
0 111 Employees 3 2
1 111 Subsidy 30 20
2 222 Employees 5 1
3 222 Subsidy 15 10
4 333 Employees 1 7
5 333 Subsidy 5 40
You were on point with your pivot_table approach. The only thing you missed is stack and reset_index:
df.pivot_table(index=['ID'],columns='Month',values=['Employees','Subsidy']).stack(0).reset_index()
Out[42]:
Month ID level_1 Feb Jan
0 111 Employees 3 2
1 111 Subsidy 30 20
2 222 Employees 5 1
3 222 Subsidy 15 10
4 333 Employees 1 7
5 333 Subsidy 5 40
You can rename the level_1 column to Var afterwards if needed.
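Both answers leave the month columns in alphabetical order (Feb, Jan), while the desired output lists Jan first. A possible alternative, sketched here with melt followed by pivot_table, ends with an explicit column selection to fix the order. The names long and wide are illustrative, and aggfunc="first" is an assumption that each (ID, Var, Month) combination occurs only once.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 222, 222, 333, 333],
    "Month": ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "Employees": [2, 3, 1, 5, 7, 1],
    "Subsidy": [20, 30, 10, 15, 40, 5],
})

# Long format: one row per (ID, Month, Var) with a single value column.
long = df.melt(id_vars=["ID", "Month"], var_name="Var", value_name="value")

# Spread Month back into columns, keeping Var as a row key.
wide = (long.pivot_table(index=["ID", "Var"], columns="Month",
                         values="value", aggfunc="first")
            .rename_axis(columns=None)   # drop the "Month" columns name
            .reset_index()[["ID", "Var", "Jan", "Feb"]])  # explicit order
print(wide)
```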

Problem while merging a specific multi level pivot table back to the original (single level) dataframe

My question is related to pivot table and merging.
I have a main dataframe that I use to create a pivot table. Later, I perform some calculations on that pivot and add a new column. Finally, I want to merge this new column back into the main dataframe, but I am not getting the desired result.
I will try to explain the steps that I performed:
Step 1.
df:
items cat section weight factor1
0 1 7 abc 3 80
1 1 7 abc 3 80
2 2 7 xyz 5 60
3 2 7 xyz 5 60
4 2 7 xyz 5 60
5 2 7 xyz 5 60
6 3 7 abc 3 80
7 3 7 abc 3 80
8 3 7 abc 3 80
9 1 8 abc 2 80
10 1 8 abc 2 60
11 2 8 xyz 6 60
12 2 8 xyz 6 60
12 2 8 xyz 6 60
13 2 8 xyz 6 60
14 3 8 abc 2 80
15 1 9 abc 4 80
16 2 9 xyz 9 60
17 2 9 xyz 9 60
18 3 9 abc 4 80
The main dataframe (df) holds a number of items, and each item is identified by a number.
Each item belongs to a dedicated section and has a weight that varies with the category (cat) and section. In addition, there is another column named 'factor' whose value is constant for a given section.
Step 2.
I need to create a pivot as follows from the above df.
pivot = pd.pivot_table(df, index=['section'],values=['weight','factor', 'items'],columns=['cat'],aggfunc={'weight':np.max,'factor':np.max, 'items':np.sum})
pivot:
weight factor items
cat 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2
xyz 5 6 9 60 60 60 4 4 2
Step 3:
Now I want to perform some calculations on that pivot then add the
result in a new column as follows:
pivot['w_n',7] = pivot['factor', 7]/pivot['items', 7]
pivot['w_n',8] = pivot['factor', 8]/pivot['items', 8]
pivot['w_n',9] = pivot['factor', 9]/pivot['items', 9]
pivot:
weight factor items w_n
cat 7 8 9 7 8 9 7 8 9 7 8 9
section
abc 3 2 4 80 80 80 5 3 2 16 27 40
xyz 5 6 9 60 60 60 4 4 2 15 15 30
Step 4:
Finally, I want to merge that new column back into the main df,
with the desired result being a single column 'w_n'. Instead, I am getting 3 columns, one for each cat.
Current result:
df:
items cat section weight factor1 w_n_7 w_n,8 w_n,9
0 1 7 abc 3 80 16 27 40
1 1 7 abc 3 80 16 27 40
2 2 7 xyz 5 60 15 15 30
3 2 7 xyz 5 60 15 15 30
4 2 7 xyz 5 60 15 15 30
5 2 7 xyz 5 60 15 15 30
6 3 7 abc 3 80 16 27 40
7 3 7 abc 3 80 16 27 40
8 3 7 abc 3 80 16 27 40
9 1 8 abc 2 80 16 27 40
10 1 8 abc 2 60 16 27 40
11 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
12 2 8 xyz 6 60 15 15 30
13 2 8 xyz 6 60 15 15 30
14 3 8 abc 2 80 16 27 40
15 1 9 abc 4 80 16 27 40
16 2 9 xyz 9 60 15 15 30
17 2 9 xyz 9 60 15 15 30
18 3 9 abc 4 80 16 27 40
Desired result:
------------------
df:
items cat section weight factor1 w_n
0 1 7 abc 3 80 16
1 1 7 abc 3 80 16
2 2 7 xyz 5 60 15
3 2 7 xyz 5 60 15
4 2 7 xyz 5 60 15
5 2 7 xyz 5 60 15
6 3 7 abc 3 80 16
7 3 7 abc 3 80 16
8 3 7 abc 3 80 16
9 1 8 abc 2 80 27
10 1 8 abc 2 60 27
11 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
12 2 8 xyz 6 60 15
13 2 8 xyz 6 60 15
14 3 8 abc 2 80 27
15 1 9 abc 4 80 40
16 2 9 xyz 9 60 30
17 2 9 xyz 9 60 30
18 3 9 abc 4 80 40
Use DataFrame.join with the MultiIndex Series created by DataFrame.unstack:
df = df.join(pivot['w_n'].unstack().rename('W_n'), on=['cat','section'])
print (df)
items cat section weight factor W_n
0 1 7 abc 3 80 7.272727
1 1 7 abc 3 80 7.272727
2 2 7 xyz 5 60 7.500000
3 2 7 xyz 5 60 7.500000
4 2 7 xyz 5 60 7.500000
5 2 7 xyz 5 60 7.500000
6 3 7 abc 3 80 7.272727
7 3 7 abc 3 80 7.272727
8 3 7 abc 3 80 7.272727
9 1 8 abc 2 80 16.000000
10 1 8 abc 2 60 16.000000
11 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
12 2 8 xyz 6 60 7.500000
13 2 8 xyz 6 60 7.500000
14 3 8 abc 2 80 16.000000
15 1 9 abc 4 80 20.000000
16 2 9 xyz 9 60 15.000000
17 2 9 xyz 9 60 15.000000
18 3 9 abc 4 80 20.000000
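To make the reshaping step easier to follow, here is a small self-contained sketch on a reduced, made-up sample (six rows, not the question's data). It shows that unstack on the w_n block yields a Series keyed by (cat, section), which join can then match against the two columns. The groupby/transform variant at the end is an alternative I am suggesting, not something from the answer above.

```python
import pandas as pd

# Reduced, made-up sample with the same column roles as in the question.
df = pd.DataFrame({
    "items":   [1, 1, 2, 2, 1, 2],
    "cat":     [7, 7, 7, 7, 8, 8],
    "section": ["abc", "abc", "xyz", "xyz", "abc", "xyz"],
    "factor":  [80, 80, 60, 60, 80, 60],
})

pivot = df.pivot_table(index="section", columns="cat",
                       values=["factor", "items"],
                       aggfunc={"factor": "max", "items": "sum"})

# w_n is a DataFrame: rows are sections, columns are cat values.
w_n = pivot["factor"] / pivot["items"]

# unstack turns it into a Series indexed by (cat, section).
lookup = w_n.unstack().rename("w_n")
joined = df.join(lookup, on=["cat", "section"])
print(joined)

# Alternative without a pivot: compute the same ratio per group directly.
g = df.groupby(["section", "cat"])
direct = g["factor"].transform("max") / g["items"].transform("sum")
```

The transform variant keeps the original row order and index, so the result can be assigned straight to a new column.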

Comparing Two Data Frames in python

I have two data frames. I have to compare the two data frames and get the position of the unmatched data using python.
Note:
The First column will always not be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output of (3,5)-(Indore - Chennai).
df1=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Indore'],'D':[6000.0,41500.0,4500.0]})
df2=pd.DataFrame({'A':['Dhoni','Raina','KL Rahul'],'B':[24,90,67],'C':['Kota','Delhi','Chennai'],'D':[6000.0,41500.0,4500.0]})
df1['df']='df1'
df2['df']='df2'
df=pd.concat([df1,df2],sort=False).drop_duplicates(subset=['A','B','C','D'],keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added the df column to show which dataframe each differing row comes from.
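Another way to get the same information, sketched here as an alternative rather than as the answer's method, is an outer merge with indicator=True. pandas then adds a _merge column marking each row as left_only, right_only or both, so no manual label column is needed. The sample frames are the same reduced ones used above.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["Dhoni", "Raina", "KL Rahul"],
                    "B": [24, 90, 67],
                    "C": ["Kota", "Delhi", "Indore"],
                    "D": [6000.0, 41500.0, 4500.0]})
df2 = pd.DataFrame({"A": ["Dhoni", "Raina", "KL Rahul"],
                    "B": [24, 90, 67],
                    "C": ["Kota", "Delhi", "Chennai"],
                    "D": [6000.0, 41500.0, 4500.0]})

# Outer merge on all shared columns; _merge records where each row was found.
merged = df1.merge(df2, how="outer", indicator=True)

# Rows present in only one of the two frames are the mismatches.
diff = merged[merged["_merge"] != "both"]
print(diff)
```

Because the merge matches on values and not on position, it also works when the two frames list the rows in a different order, as in the question.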

How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month

I have a pandas dataframe with 3 columns:
date (from 1/1/2018 up until 8/23/2019), column A and column B.
import pandas as pd
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will become a df with 20 rows ( 12 months of year 2018 and 8 months of year 2019 ) and 4 columns, representing monthly accumulated values of column A, monthly averaged values of column B, month number and year number just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg, grouping by DatetimeIndex.month and DatetimeIndex.year. Add sort_index to order the result by year and month, then use reset_index to turn the MultiIndex into columns:
import pandas as pd
import numpy as np
np.random.seed(2018)
#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')
df1 = (df.groupby([df.index.month.rename('month'),
df.index.year.rename('year')])
.agg({'A':'sum', 'B':'mean'})
.sort_index(level=['year', 'month'])
.reset_index())
print (df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
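Since the question title asks about Grouper, here is a hedged sketch of the same aggregation with pd.Grouper. It assumes the month-start alias "MS" and uses named aggregation to produce the column names from the question. The random data is seeded differently from the answer above, so the numbers will not match its output.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2018)
dates = pd.date_range(start="2018-01-01", end="2019-08-23")
df = pd.DataFrame(rng.randint(0, 10, size=(len(dates), 2)),
                  columns=list("AB"), index=dates)

# One group per calendar month; each group is labelled by its month start.
monthly = (df.groupby(pd.Grouper(freq="MS"))
             .agg(monthly_accumulated_of_A=("A", "sum"),
                  monthly_averaged_of_B=("B", "mean")))

# Derive month and year columns from the group labels, then drop the labels.
monthly.insert(0, "month", monthly.index.month)
monthly.insert(1, "year", monthly.index.year)
monthly = monthly.reset_index(drop=True)
print(monthly)
```

Grouping on a single datetime key keeps the rows in chronological order, so no separate sort_index call is needed here.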

Pandas: how to merge df with condition

I have df
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
456 10 18 2
456 42 34 2
789 13 71 3
789 19 108 3
789 234 560 4
and second df
number A B
123 18 27
456 32 19
789 234 560
If number, A and B in a row match a row in the second df, I need to add that row to a new df, together with every other row that has the same number and C as the matched row.
Desired output
number A B C
123 10 10 1
123 10 11 1
123 18 27 1
789 234 560 4
How can I write this condition?
One way is to give df2 a dummy column:
In [11]: df2["in_df2"] = True
then you can do the merge:
In [12]: df1.merge(df2, how="left")
Out[12]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
3 456 10 18 2 NaN
4 456 42 34 2 NaN
5 789 13 71 3 NaN
6 789 19 108 3 NaN
7 789 234 560 4 True
Now, we only want those groups which contain a True:
In [13]: df1.merge(df2, how="left").groupby(["number", "C"]).filter(lambda x: x["in_df2"].any())
Out[13]:
number A B C in_df2
0 123 10 10 1 NaN
1 123 10 11 1 NaN
2 123 18 27 1 True
7 789 234 560 4 True
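If the helper in_df2 column is not wanted in the result, one possible variant (an assumption on my part, not the approach shown above) is two inner merges: the first finds which (number, C) groups contain a match, and the second keeps every row of those groups. matched_groups and result are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    "number": [123, 123, 123, 456, 456, 789, 789, 789],
    "A": [10, 10, 18, 10, 42, 13, 19, 234],
    "B": [10, 11, 27, 18, 34, 71, 108, 560],
    "C": [1, 1, 1, 2, 2, 3, 3, 4],
})
df2 = pd.DataFrame({"number": [123, 456, 789],
                    "A": [18, 32, 234],
                    "B": [27, 19, 560]})

# Step 1: (number, C) pairs of the rows that match df2 on number, A and B.
matched_groups = (df1.merge(df2, on=["number", "A", "B"])
                     [["number", "C"]]
                     .drop_duplicates())

# Step 2: keep every df1 row that belongs to one of those groups.
result = df1.merge(matched_groups, on=["number", "C"])
print(result)
```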
