I have a dataframe of employee salaries. Each employee also has data stored about their sex, discipline, years since earning their PhD, and years of service at their current employer. An example of the data follows:
rank dsc phd srv sex salary
1 Prof B 19 18 Male 139750
2 Prof B 20 16 Male 173200
3 Asst B 4 3 Male 79750
4 Prof B 45 39 Male 115000
5 Prof B 40 41 Male 141500
6 Assoc B 6 6 Male 97000
7 Prof B 30 23 Male 175000
8 Prof B 45 45 Male 147765
9 Prof B 21 20 Male 119250
10 Prof B 18 18 Female 129000
What I want is the mean salary of all employees, grouped by both sex and ten-year ranges of service: for example, males with 0-10 years of service, females with 0-10 years of service, males with 11-20 years of service, and so on. I can get the mean salary bucketed by years of service, without separating by sex, with:
serviceSalary = data.groupby(pd.cut(data['srv'], np.arange(0, 70, 10)))['salary'].mean()
What further can I do to add sex as a second grouping?
You can group by multiple keys by passing a list as the first argument to groupby. So instead of grouping by just the service bins:
In [11]: df.groupby(pd.cut(df['srv'], np.arange(0, 70, 10)))['salary'].mean()
Out[11]:
srv
(0, 10] 88375.0
(10, 20] 140300.0
(20, 30] 175000.0
(30, 40] 115000.0
(40, 50] 144632.5
(50, 60] NaN
Name: salary, dtype: float64
you can pass 'sex' too:
In [12]: df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)), 'sex'])['salary'].mean()
Out[12]:
srv sex
(0, 10] Male 88375.000000
(10, 20] Female 129000.000000
Male 144066.666667
(20, 30] Male 175000.000000
(30, 40] Male 115000.000000
(40, 50] Male 144632.500000
Name: salary, dtype: float64
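If you prefer the sexes side by side as columns, one option (a small follow-up sketch on the same data) is to unstack the second grouping level:
means = df.groupby([pd.cut(df['srv'], np.arange(0, 70, 10)), 'sex'])['salary'].mean()
means.unstack('sex')  # one row per service bin, one column per sex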
I have two dataframes and need to update the first one based on values from the second one, where they exist. The sample below replaces a student_Id with the corresponding new_id whenever it appears in the update table's old_id column.
import pandas as pd
import numpy as np
student = {
    'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
    'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
    'math score': [50, 100, 70, 80, 75, 40],
    'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
    'old_id': ['ok83', '234v'],
    'new_id': ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)
# Method with np.where
for index, row in df_updated_id.iterrows():
    df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'],
                                        row['new_id'], df_student['student_Id'])
# print(df_student)

# Method with DataFrame.mask
for index, row in df_updated_id.iterrows():
    df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'],
                                  row['new_id'], inplace=True)
print(df_student)
Both methods above work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
However, the actual student data has about 500,000 rows and updated_id has about 6,000 rows, so I run into performance issues: the loop is very slow.
A simple timer was added to observe how runtime grows with the number of rows processed from df_updated_id:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this with a merge (join), or by ditching iterrows altogether. I tried approaches from Replace dataframe column values based on matching id in another dataframe and How to iterate over rows in a DataFrame in Pandas, but failed to get them to work.
Please advise.
You can also try with map:
df_student['student_Id'] = (
df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
.fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
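At the scale described in the question (roughly 500,000 students and 6,000 id updates), this stays fast because map does a single vectorized lookup instead of one pass over the data per update row. A rough, self-contained timing sketch (synthetic data; the sizes come from the question, everything else is illustrative):
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = pd.Index(rng.integers(0, 1_000_000, size=500_000).astype(str))
df_student = pd.DataFrame({'student_Id': ids})
old = ids.unique()[:6000]  # 6,000 distinct ids to update
df_updated_id = pd.DataFrame({'old_id': old, 'new_id': old + '_new'})

start = time.perf_counter()
mapping = df_updated_id.set_index('old_id')['new_id']
df_student['student_Id'] = (
    df_student['student_Id'].map(mapping).fillna(df_student['student_Id'])
)
print(f"map approach: {time.perf_counter() - start:.3f}s")  # typically well under a second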
Update
I believe the updated_id isn't unique, so I need to further pre-process the data.
In this case, you could drop duplicates first, treating the last value (keep='last') as the most recent one for a given old_id:
sr = df_updated_id.drop_duplicates('old_id', keep='last').set_index('old_id')['new_id']
df_student['student_Id'] = df_student['student_Id'].map(sr).fillna(df_student['student_Id'])
Note: this is effectively what @BENY's answer does: because it builds a mapping, only the last occurrence of each old_id is kept. However, if you want to keep the first value that appears, his code doesn't work, whereas with drop_duplicates you can adjust the keep parameter.
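For instance, with a hypothetical duplicated old_id (both rows invented for illustration):
dup = pd.DataFrame({'old_id': ['ok83', 'ok83'],
                    'new_id': ['83ko', 'ko38']})
print(dup.drop_duplicates('old_id', keep='first'))  # keeps 83ko
print(dup.drop_duplicates('old_id', keep='last'))   # keeps ko38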
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
I have this dataframe:
final_df = pd.DataFrame({
    'Name': ['Mike', 'Nancy', 'Bob', 'Terrance', 'Sara', 'Myo'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
    'Rate': [20, 10, 30, 40, 35, 15],
    'Hours': [30, 50, 40, 60, 32, 80],
    'Amount3': [3000.00, 1500.00, 6000.00, 12000.00, 3360.00, 6000.00],
})
Name Gender Rate Hours Amount3
Mike Male 20 30 3,000.00
Nancy Female 10 50 1,500.00
Bob Male 30 40 6,000.00
Terrance Male 40 60 12,000.00
Sara Female 35 32 3,360.00
Myo Male 15 80 6,000.00
I have this code for the simple average:
final_df['Weighted Average'] = final_df.groupby('Gender')['Amount3'].transform(lambda x: x/x.sum() if x.sum() > 0 else 0 )
I'm trying to add a weighted average column that will take (Rate * Hours) * (Amount3/groupby.sum())
My desired output would be the table above with that 'Weighted Average' column added.
Any ideas?
Given your numbers, it looks like the expected computation is:
df['Weighted Average'] = (
df['Amount3']/(df['Rate']*df['Hours'])
*df.groupby('Gender')['Amount3'].transform(lambda x: x/x.sum() if x.sum() > 0 else 0 )
)
which is somewhat odd, as it is equivalent to an expression proportional to the square of Amount3:
df['Weighted Average'] = (
df['Amount3']**2/(df['Rate']*df['Hours'])
/df.groupby('Gender')['Amount3'].transform('sum').fillna(0)
)
output:
Name Gender Rate Hours Amount3 Weighted Average
0 Mike Male 20 30 3000.0 0.555556
1 Nancy Female 10 50 1500.0 0.925926
2 Bob Male 30 40 6000.0 1.111111
3 Terrance Male 40 60 12000.0 2.222222
4 Sara Female 35 32 3360.0 2.074074
5 Myo Male 15 80 6000.0 1.111111
If I understand correctly you need something like:
df['Weighted Average'] = (df.Rate * df.Hours) * (df.Amount3 / df.Gender.map(df.groupby('Gender').Amount3.sum()))
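For reference, applied to the sample table this gives about 66.67 for Mike and 774.32 for Sara (a quick self-contained check; the frame below is rebuilt from the table above, and the values in the comment are computed by hand):
df = pd.DataFrame({'Name': ['Mike', 'Nancy', 'Bob', 'Terrance', 'Sara', 'Myo'],
                   'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Male'],
                   'Rate': [20, 10, 30, 40, 35, 15],
                   'Hours': [30, 50, 40, 60, 32, 80],
                   'Amount3': [3000.0, 1500.0, 6000.0, 12000.0, 3360.0, 6000.0]})
df['Weighted Average'] = (df.Rate * df.Hours) * (df.Amount3 / df.Gender.map(df.groupby('Gender').Amount3.sum()))
# Mike: (20*30) * (3000/27000) = 66.67; Sara: (35*32) * (3360/4860) = 774.32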
I have a dataframe of people with Age as a column. I would like to match this age to a group, i.e. Baby=0-2 years old, Child=3-12 years old, Young=13-18 years old, Young Adult=19-30 years old, Adult=31-50 years old, Senior Adult=51-65 years old.
I created the lists that define these year groups, e.g. Adult=list(range(31,51)) etc.
How do I match the name of the list 'Adult' to the dataframe by creating a new column?
For context: the dataframe is made up of three columns: df['Name'], df['Country'], df['Age'].
Name Country Age
Anthony France 15
Albert Belgium 54
.
.
.
Zahra Tunisia 14
So I need to match the age column with lists that I already have. The output should look like:
Name Country Age Group
Anthony France 15 Young
Albert Belgium 54 Senior Adult
.
.
.
Zahra Tunisia 14 Young
Thanks!
IIUC I would go with np.select:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Age': [3, 20, 40]})
condlist = [df.Age.between(0,2),
df.Age.between(3,12),
df.Age.between(13,18),
df.Age.between(19,30),
df.Age.between(31,50),
df.Age.between(51,65)]
choicelist = ['Baby', 'Child', 'Young',
'Young Adult', 'Adult', 'Senior Adult']
df['Group'] = np.select(condlist, choicelist)
Output:
Age Group
0 3 Child
1 20 Young Adult
2 40 Adult
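Note that ages matching none of the conditions (here, anyone over 65) fall through to np.select's default, which is 0. If that matters, you can pass an explicit default label ('Unknown' below is an arbitrary placeholder):
df['Group'] = np.select(condlist, choicelist, default='Unknown')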
Here's a way to do that using pd.cut:
df = pd.DataFrame({"person_id": range(25), "age": np.random.randint(0, 100, 25)})
print(df.head(10))
==>
person_id age
0 0 30
1 1 42
2 2 78
3 3 2
4 4 44
5 5 43
6 6 92
7 7 3
8 8 13
9 9 76
df["group"] = pd.cut(df.age, [0, 18, 50, 100], labels=["child", "adult", "senior"])
print(df.head(10))
==>
person_id age group
0 0 30 adult
1 1 42 adult
2 2 78 senior
3 3 2 child
4 4 44 adult
5 5 43 adult
6 6 92 senior
7 7 3 child
8 8 13 child
9 9 76 senior
Per your question, if you have a few lists (like the ones below) and would like to use them for binning, you can do:
# for example, these are the lists
Adult = list(range(18,50))
Child = list(range(0, 18))
Senior = list(range(50, 100))
# Creating bins out of the lists.
bins = [min(l) for l in [Child, Adult, Senior]]
bins.append(max([max(l) for l in [Child, Adult, Senior]]))
labels = ["Child", "Adult", "Senior"]
# using the bins:
df["group"] = pd.cut(df.age, bins, labels=labels)
To make things clearer for beginners, you can define a function that returns the age group for a given row, then use DataFrame.apply() row-wise to build the 'Group' column:
import pandas as pd
def age(row):
    a = row['Age']
    if 0 < a <= 2:
        return 'Baby'
    elif 2 < a <= 12:
        return 'Child'
    elif 12 < a <= 18:
        return 'Young'
    elif 18 < a <= 30:
        return 'Young Adult'
    elif 30 < a <= 50:
        return 'Adult'
    elif 50 < a <= 65:
        return 'Senior Adult'
df = pd.DataFrame({'Name': ['Anthony', 'Albert', 'Zahra'],
                   'Country': ['France', 'Belgium', 'Tunisia'],
                   'Age': [15, 54, 14]})
df['Group'] = df.apply(age, axis=1)
print(df)
Output:
Name Country Age Group
0 Anthony France 15 Young
1 Albert Belgium 54 Senior Adult
2 Zahra Tunisia 14 Young
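Note that row-wise apply runs Python code once per row, so on large frames it will generally be the slowest of the three approaches here; np.select and pd.cut vectorize the same logic.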
I'm trying to find the wage gap between genders given a set of majors.
Here is a text version of my table:
gender field group logwage
0 male BUSINESS 7.229572
10 female BUSINESS 7.072464
1 male COMM/JOURN 7.108538
11 female COMM/JOURN 7.015018
2 male COMPSCI/STAT 7.340410
12 female COMPSCI/STAT 7.169401
3 male EDUCATION 6.888829
13 female EDUCATION 6.770255
4 male ENGINEERING 7.397082
14 female ENGINEERING 7.323996
5 male HUMANITIES 7.053048
15 female HUMANITIES 6.920830
6 male MEDICINE 7.319011
16 female MEDICINE 7.193518
17 female NATSCI 6.993337
7 male NATSCI 7.089232
18 female OTHER 6.881126
8 male OTHER 7.091698
9 male SOCSCI/PSYCH 7.197572
19 female SOCSCI/PSYCH 6.968322
diff hasn't worked for me, as it takes the difference between every pair of consecutive rows, crossing from one major into the next.
Here is the code as it is now:
for row in sorted_mfield:
    if sorted_mfield['field group'] == sorted_mfield['field group'].shift(1):
        diff = lambda x: x[0] - x[1]
My next strategy would be to go back to the unsorted dataframe, where male and female were their own columns, and take the difference from there. But since I've spent an hour trying to do this and am pretty new to pandas, I thought I would ask how this works. Thanks.
A solution using pandas.DataFrame.shift() on a sorted version of the data:
df.sort_values(by=['field group', 'gender'], inplace=True)
df['gap'] = df.logwage - df.logwage.shift(1)
df[df.gender =='male'][['field group', 'gap']]
Producing the following output with the sample data:
field group gap
0 BUSINESS 0.157108
2 COMM/JOURN 0.093520
4 COMPSCI/STAT 0.171009
6 EDUCATION 0.118574
8 ENGINEERING 0.073086
10 HUMANITIES 0.132218
12 MEDICINE 0.125493
15 NATSCI 0.095895
17 OTHER 0.210572
18 SOCSCI/PSYCH 0.229250
Note: this assumes you always have a pair of values (one male, one female) for each field group. If you want to validate that, or drop field groups without such a pair, the code below does the filtering:
df_grouped = df.groupby('field group')
df_filtered = df_grouped.filter(lambda x: len(x) == 2)
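For instance, with a hypothetical unpaired group (data invented for illustration):
demo = pd.DataFrame({'field group': ['A', 'A', 'B'],
                     'gender': ['male', 'female', 'male'],
                     'logwage': [7.1, 7.0, 6.9]})
print(demo.groupby('field group').filter(lambda x: len(x) == 2))
# only group A survives, since B lacks a female row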
I'd consider reshaping your DataFrame with pivot, which makes the difference easy to compute.
Code:
piv = df.pivot(index='field group', columns='gender', values='logwage').rename_axis([None], axis=1)
piv
# female male
#field group
#BUSINESS 7.072464 7.229572
#COMM/JOURN 7.015018 7.108538
#COMPSCI/STAT 7.169401 7.340410
#EDUCATION 6.770255 6.888829
#ENGINEERING 7.323996 7.397082
#HUMANITIES 6.920830 7.053048
#MEDICINE 7.193518 7.319011
#NATSCI 6.993337 7.089232
#OTHER 6.881126 7.091698
#SOCSCI/PSYCH 6.968322 7.197572
piv.male - piv.female
#field group
#BUSINESS 0.157108
#COMM/JOURN 0.093520
#COMPSCI/STAT 0.171009
#EDUCATION 0.118574
#ENGINEERING 0.073086
#HUMANITIES 0.132218
#MEDICINE 0.125493
#NATSCI 0.095895
#OTHER 0.210572
#SOCSCI/PSYCH 0.229250
#dtype: float64
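If you want the gaps back as a regular dataframe, a small follow-up sketch (the column name 'gap' is my choice):
gap = (piv['male'] - piv['female']).rename('gap').reset_index()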