Add weighted average column with multiple column inputs - python

I have this dataframe:
Name      Gender  Amount  amount2  Amount3  Percent total of gender
Mike      Male        50      NaN        0  0.208333
Nancy     Female      30      NaN        0  0.428571
Bob       Male       100      NaN        0  0.416667
Terrance  Male        30      NaN        0  0.125000
Sara      Female      40      NaN        0  0.571429
Myo       Male        60      NaN        0  0.250000
Name Gender Rate Hours Amount3
Mike Male 20 30 3,000.00
Nancy Female 10 50 1,500.00
Bob Male 30 40 6,000.00
Terrance Male 40 60 12,000.00
Sara Female 35 32 3,360.00
Myo Male 15 80 6,000.00
I have this code for the simple average:
final_df['Weighted Average'] = final_df.groupby('Gender')['Amount3'].transform(lambda x: x/x.sum() if x.sum() > 0 else 0 )
I'm trying to add a weighted average column that will take (Rate * Hours) * (Amount3/groupby.sum())
My desired output would be:
Any ideas?

Given your numbers, it looks like the expected computation is:
df['Weighted Average'] = (
    df['Amount3'] / (df['Rate'] * df['Hours'])
    * df.groupby('Gender')['Amount3'].transform(lambda x: x/x.sum() if x.sum() > 0 else 0)
)
which is somewhat odd, as it is equivalent to an expression proportional to the square of Amount3:
df['Weighted Average'] = (
    df['Amount3']**2 / (df['Rate'] * df['Hours'])
    / df.groupby('Gender')['Amount3'].transform('sum').fillna(0)
)
output:
Name Gender Rate Hours Amount3 Weighted Average
0 Mike Male 20 30 3000.0 0.555556
1 Nancy Female 10 50 1500.0 0.925926
2 Bob Male 30 40 6000.0 1.111111
3 Terrance Male 40 60 12000.0 2.222222
4 Sara Female 35 32 3360.0 2.074074
5 Myo Male 15 80 6000.0 1.111111

If I understand correctly you need something like:
df['Weighted Average'] = (df.Rate * df.Hours) * (df.Amount3 / df.Gender.map(df.groupby('Gender').Amount3.sum()))
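For completeness, here is a self-contained sketch of that suggestion, with the input reconstructed from the question's table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Nancy', 'Bob', 'Terrance', 'Sara', 'Myo'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Rate': [20, 10, 30, 40, 35, 15],
    'Hours': [30, 50, 40, 60, 32, 80],
    'Amount3': [3000.0, 1500.0, 6000.0, 12000.0, 3360.0, 6000.0],
})

# (Rate * Hours) weighted by each row's share of its gender's Amount3 total
group_total = df['Gender'].map(df.groupby('Gender')['Amount3'].sum())
df['Weighted Average'] = (df['Rate'] * df['Hours']) * (df['Amount3'] / group_total)
print(df)
```

Mapping the per-gender sums back onto each row avoids a second groupby/transform pass.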

Replace certain values in df2 based on common values in df1 [duplicate]

I have two dataframes and need to update the first based on values in the second, where they exist. In the sample below, a student_Id that appears in the 'old_id' column should be replaced with the corresponding 'new_id'.
import pandas as pd
import numpy as np

student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)

# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)

# Method with dataframe.mask
for index, row in df_updated_id.iterrows():
    df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'], inplace=True)
print(df_student)
Both methods above work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Nonetheless, the actual student data has about 500,000 rows and updated_id has about 6,000 rows, so I run into performance issues: the loop is very slow.
A simple timer was placed to observe the time taken as the number of rows processed from df_updated_id grows:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this using a merge (join), or ditch iterrows altogether. I tried approaches from Replace dataframe column values based on matching id in another dataframe and How to iterate over rows in a DataFrame in Pandas, but failed to get them to work.
Please advise.
You can also try with map:
df_student['student_Id'] = (
    df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
    .fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Update
I believe the updated_id isn't unique, so I need to further pre-process the data.
In this case, you could drop duplicates first, keeping the last value (keep='last') as the most recent one for a given old_id:
sr = df_updated_id.drop_duplicates('old_id', keep='last').set_index('old_id')['new_id']
df_student['student_Id'] = df_student['student_Id'].map(sr).fillna(df_student['student_Id'])
Note: this is exactly what @BENY's answer does. Since it builds a mapping, only the last occurrence of each old_id is kept. However, if you want to keep the first occurrence instead, that approach doesn't allow it; with drop_duplicates you can adjust the keep parameter.
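A tiny sketch (with a hypothetical duplicated old_id, 'ok83' mapped twice) shows the difference the keep parameter makes:

```python
import pandas as pd

# hypothetical mapping table where 'ok83' appears twice
df_updated_id = pd.DataFrame({
    'old_id': ['ok83', '234v', 'ok83'],
    'new_id': ['83ko', 'v432', 'ZZZZ'],
})

last = df_updated_id.drop_duplicates('old_id', keep='last').set_index('old_id')['new_id']
first = df_updated_id.drop_duplicates('old_id', keep='first').set_index('old_id')['new_id']
print(last['ok83'])   # 'ZZZZ' -- the most recent value wins
print(first['ok83'])  # '83ko' -- the earliest value wins
```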
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
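Since the question also asks about a merge, here is a sketch of a left-join variant; it should behave like the map approach, at the cost of building a temporary merged frame:

```python
import pandas as pd

df_student = pd.DataFrame({
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
})
df_updated_id = pd.DataFrame({'old_id': ['ok83', '234v'],
                              'new_id': ['83ko', 'v432']})

# left-join on the id, then prefer new_id where it matched
merged = df_student.merge(df_updated_id, how='left',
                          left_on='student_Id', right_on='old_id')
df_student['student_Id'] = merged['new_id'].fillna(merged['student_Id'])
print(df_student)
```

This assumes old_id values are unique; with duplicates, the merge would add rows and the indexes would no longer align.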

Matching lists to dataframes

I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
Small input: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0, 2),
            df.Age.between(3, 12),
            df.Age.between(13, 18),
            df.Age.between(19, 30),
            df.Age.between(31, 50),
            df.Age.between(51, 65)]
choicelist = ['Baby', 'Child', 'Young',
              'Young Adult', 'Adult', 'Senior Adult']
df['Adult'] = np.select(condlist, choicelist)
Output:
Age Adult
0 3 Child
1 20 Young Adult
2 40 Adult
Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for binning, you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)
To make things clearer for beginners, you can define a function that returns the age group for each row, then use pandas.DataFrame.apply() to create the 'Group' column:
import pandas as pd

def age(row):
    a = row['Age']
    if 0 < a <= 2:
        return 'Baby'
    elif 2 < a <= 12:
        return 'Child'
    elif 12 < a <= 18:
        return 'Young'
    elif 18 < a <= 30:
        return 'Young Adult'
    elif 30 < a <= 50:
        return 'Adult'
    elif 50 < a <= 65:
        return 'Senior Adult'

df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young
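If you specifically want to reuse the named lists from the question, another option is to invert them into an age-to-group lookup dict and map the Age column (the group names and ranges below are taken from the question's description):

```python
import pandas as pd

# the lists described in the question, e.g. Adult = list(range(31, 51))
groups = {
    'Baby': list(range(0, 3)),
    'Child': list(range(3, 13)),
    'Young': list(range(13, 19)),
    'Young Adult': list(range(19, 31)),
    'Adult': list(range(31, 51)),
    'Senior Adult': list(range(51, 66)),
}

# invert: age -> group name
age_to_group = {age: name for name, ages in groups.items() for age in ages}

df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df['Age'].map(age_to_group)
print(df)
```

This only works for integer ages inside the listed ranges; pd.cut or np.select are more robust for arbitrary values.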

case_when function from R to Python

How I can implement the case_when function of R in a python code?
Here is the case_when function of R:
https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when
as a minimum working example suppose we have the following dataframe (python code follows):
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
Suppose than we want to create an new column called 'elderly' that looks at the 'age' column and does the following:
if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >=20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa
Can someone help on this ?
You want to use np.select:
conditions = [
    df["age"].lt(10),
    df["age"].ge(10) & df["age"].lt(20),
    df["age"].ge(20) & df["age"].lt(30),
    df["age"].ge(30) & df["age"].lt(50),
    df["age"].ge(50),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]
df["elderly"] = np.select(conditions, choices)
# Results in:
# name age preTestScore postTestScore elderly
# 0 Jason 42 4 25 mature
# 1 Molly 52 24 94 grandpa
# 2 Tina 36 31 57 mature
# 3 Jake 24 2 62 young
# 4 Amy 73 3 70 grandpa
The conditions and choices lists must be the same length.
There is also a default parameter that is used when all conditions evaluate to False.
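For example (a quick sketch; the 'unknown' label and the toy ages are made up), without a default, unmatched rows get np.select's fallback of 0:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 15, 999]})
conditions = [df['age'].lt(10),
              df['age'].ge(10) & df['age'].lt(20)]
choices = ['baby', 'kid']
# age 999 matches no condition, so it gets the default label
df['label'] = np.select(conditions, choices, default='unknown')
print(df)
```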
np.select is great because it's a general way to assign values to elements in choicelist depending on conditions.
However, for the particular problem OP tries to solve, there is a succinct way to achieve the same with the pandas' cut method.
bin_cond = [-np.inf, 10, 20, 30, 50, np.inf] # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab)
# name age preTestScore postTestScore elderly elderly2
# 0 Jason 42 4 25 mature mature
# 1 Molly 52 24 94 grandpa grandpa
# 2 Tina 36 31 57 mature mature
# 3 Jake 24 2 62 young young
# 4 Amy 73 3 70 grandpa grandpa
pyjanitor has a case_when implementation in dev that could be helpful in this case, the implementation idea is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn
df.case_when(
    df.age.lt(10), 'baby',                     # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',     # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',   # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature',  # 4th condition, result
    'grandpa',              # default if none of the conditions match
    column_name='elderly')  # column name to assign to
name age preTestScore postTestScore elderly
0 Jason 42 4 25 mature
1 Molly 52 24 94 grandpa
2 Tina 36 31 57 mature
3 Jake 24 2 62 young
4 Amy 73 3 70 grandpa
Alby's solution is more efficient for this use case than an if/else construct.
Instead of numpy, you can create a function and use map or apply with a lambda:
def elderly_function(age):
    if age < 10:
        return 'baby'
    if age < 20:
        return 'kid'
    if age < 30:
        return 'young'
    if age < 50:
        return 'mature'
    if age >= 50:
        return 'grandpa'

df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))
The solution with numpy is probably fast and might be preferable if your df is considerably large.
Just for future reference: nowadays you can use pandas cut or map with moderate to good speed. If you need something faster this might not suit your needs, but it is good enough for daily use and batches.
import pandas as pd
If you choose map or apply, build your ranges and return a label when the age falls in range:
def calc_grade(age):
    if 50 < age < 200:
        return 'Grandpa'
    elif 30 <= age <= 50:
        return 'Mature'
    elif 20 <= age < 30:
        return 'Young'
    elif 10 <= age < 20:
        return 'Kid'
    elif age < 10:
        return 'Baby'

%timeit df['elderly'] = df['age'].map(calc_grade)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa
393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you choose cut, there are many options. One approach: include the left edge and exclude the right, with one label per bin.
bins = [0, 10, 20, 30, 50, 200]  # 200-year vampires are people too, I guess... change this to an age you believe plausible
labels = ['Baby', 'Kid', 'Young', 'Mature', 'Grandpa']
%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels, include_lowest=True, right=False, ordered=False)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

Mean of grouped data

I have data in a dataframe regarding salaries of employees. Each employee also has data stored about their sex, discipline, years since earning phd, and years working at the current employer. An example of the data is as follows.
rank dsc phd srv sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 Asst B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 Assoc B 6 6 Male 97000
7 Prof B 30 23 Male 175000
8 Prof B 45 45 Male 147765
9 Prof B 21 20 Male 119250
10 Prof B 18 18 Female 129000
What I want is the mean salary of all employees grouped by both sex and ten-year ranges of service: for example, males with 0-10 years of service, females with 0-10 years of service, males with 11-20 years of service, and so on. I can get the mean over service-year ranges without separating by sex by doing:
serviceSalary = data.groupby(pd.cut(data['yrs.service'], np.arange(0, 70, 10)))['salary'].mean()
What further can I do to add a third grouping to this variable?
You can groupby multiple columns with a list as the first argument, so instead of just one:
In [11]: df.groupby(pd.cut(df['srv'], np.arange(0, 70, 10)))['salary'].mean()
Out[11]:
srv
(0, 10] 88375.0
(10, 20] 140300.0
(20, 30] 175000.0
(30, 40] 115000.0
(40, 50] 144632.5
(50, 60] NaN
Name: salary, dtype: float64
can pass 'sex' too:
In [12]: df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)), 'sex'])['salary'].mean()
Out[12]:
srv sex
(0, 10] Male 88375.000000
(10, 20] Female 129000.000000
Male 144066.666667
(20, 30] Male 175000.000000
(30, 40] Male 115000.000000
(40, 50] Male 144632.500000
Name: salary, dtype: float64
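As a readability tweak (a sketch reusing the sample rows above, not part of the original answer), the two-level result can be pivoted so each sex becomes a column via unstack:

```python
import pandas as pd
import numpy as np

# sample data reconstructed from the question's table
df = pd.DataFrame({
    'srv': [18, 16, 3, 39, 41, 6, 23, 45, 20, 18],
    'sex': ['Male'] * 9 + ['Female'],
    'salary': [139750, 173200, 79750, 115000, 141500,
               97000, 175000, 147765, 119250, 129000],
})

# group by service-year bin and sex, then pivot sex into columns
result = (
    df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)), 'sex'])['salary']
      .mean()
      .unstack('sex')
)
print(result)
```

Each cell is then the mean salary for one (service range, sex) combination, with NaN where a group has no employees.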
