I have two dataframes, and I need to update the first one based on values from the second where a match exists. In the sample below, the goal is to replace each student_Id that appears in the 'old_id' column with the corresponding 'new_id'.
import pandas as pd
import numpy as np
student = {
'Name': ['John', 'Jay', 'sachin', 'Geetha', 'Amutha', 'ganesh'],
'gender': ['male', 'male', 'male', 'female', 'female', 'male'],
'math score': [50, 100, 70, 80, 75, 40],
'student_Id': ['1234', '6788', 'xyz', 'abcd', 'ok83', '234v'],
}
updatedId = {
'old_id' : ['ok83', '234v'],
'new_id' : ['83ko', 'v432'],
}
df_student = pd.DataFrame(student)
df_updated_id = pd.DataFrame(updatedId)
print(df_student)
print(df_updated_id)
# Method with np.where
for index, row in df_updated_id.iterrows():
df_student['student_Id'] = np.where(df_student['student_Id'] == row['old_id'], row['new_id'], df_student['student_Id'])
# print(df_student)
# Method with dataframe.mask
for index, row in df_updated_id.iterrows():
df_student['student_Id'].mask(df_student['student_Id'] == row['old_id'], row['new_id'], inplace=True)
print(df_student)
Both methods above work and yield the correct result:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 ok83
5 ganesh male 40 234v
old_id new_id
0 ok83 83ko
1 234v v432
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
Nonetheless, the actual student data has about 500,000 rows and updated_id has 6,000 rows, so I run into performance issues: the loop is very slow.
A simple timer was placed to observe the time taken for different numbers of rows in df_updated_id:
100 rows - numpy Time=3.9020769596099854; mask Time=3.9169061183929443
500 rows - numpy Time=20.42293930053711; mask Time=19.768696784973145
1000 rows - numpy Time=40.06309795379639; mask Time=37.26559829711914
My question is whether I can optimize this using a merge (join), or by ditching iterrows entirely. I tried something like the approaches in Replace dataframe column values based on matching id in another dataframe and How to iterate over rows in a DataFrame in Pandas, but failed to get them to work.
Please advise.
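For reference, a working merge-based version might look like this (a sketch of the join idea above, not the original failed attempt; it assumes old_id values are unique):
merged = df_student.merge(df_updated_id, how='left',
                          left_on='student_Id', right_on='old_id')
# Prefer new_id where a match was found, otherwise keep the existing id.
df_student['student_Id'] = merged['new_id'].fillna(merged['student_Id'])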
You can also try with map:
df_student['student_Id'] = (
df_student['student_Id'].map(df_updated_id.set_index('old_id')['new_id'])
.fillna(df_student['student_Id'])
)
print(df_student)
# Output
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
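This works because map does one index lookup per student row instead of scanning the whole column once per mapping row, and any unmatched ids come back as NaN, which fillna restores from the original column.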
Update
I believe the updated_id isn't unique, so I need to further pre-process the data.
In this case, maybe you could drop duplicates first, treating the last value (keep='last') as the most recent for a given old_id:
sr = df_updated_id.drop_duplicates('old_id', keep='last').set_index('old_id')['new_id']
df_student['student_Id'] = df_student['student_Id'].map(sr).fillna(df_student['student_Id'])
Note: this is exactly what @BENY's answer does. Since he creates a dict, only the last occurrence of each old_id is kept. However, if you want to keep the first value that appears, his code doesn't work; with drop_duplicates, you can adjust the keep parameter.
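For example, to prefer the first mapping seen for each old_id instead:
sr = df_updated_id.drop_duplicates('old_id', keep='first').set_index('old_id')['new_id']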
We can just use replace:
df_student.replace({'student_Id':df_updated_id.set_index('old_id')['new_id']},inplace=True)
df_student
Out[337]:
Name gender math score student_Id
0 John male 50 1234
1 Jay male 100 6788
2 sachin male 70 xyz
3 Geetha female 80 abcd
4 Amutha female 75 83ko
5 ganesh male 40 v432
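For reference, a minimal timing harness to compare the two vectorized answers on your own data (a sketch; absolute numbers will vary by machine):
import time
mapping = df_updated_id.set_index('old_id')['new_id']
start = time.perf_counter()
result_map = df_student['student_Id'].map(mapping).fillna(df_student['student_Id'])
print(f"map:     {time.perf_counter() - start:.4f}s")
start = time.perf_counter()
result_replace = df_student['student_Id'].replace(mapping.to_dict())
print(f"replace: {time.perf_counter() - start:.4f}s")
Both avoid the per-row loop entirely, so the cost scales with the number of students rather than students times mappings.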
I have the following data frame:
import pandas as pd
pandas_df = pd.DataFrame([
["SEX", "Male"],
["SEX", "Female"],
["EXACT_AGE", None],
["Country", "Afghanistan"],
["Country", "Albania"]],
columns=['FullName', 'ResponseLabel'])
Now what I need to do is add a sort order to this dataframe. Each new "FullName" increments it by 100, and each consecutive "ResponseLabel" within a given "FullName" increments it by 1. So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
FullName ResponseLabel SortOrderResponseLabel index SortOrderFullName SortOrder
0 SEX Male 1 0 100 101
1 SEX Female 2 0 100 102
2 EXACT_AGE NULL 1 2 200 201
3 Country Afghanistan 1 3 300 301
4 Country Albania 2 3 300 302
The "SortOrder" column I get is correct, but I wonder if there is a better, more pandas-idiomatic approach?
Thank you!
The best way to do this would be to use ngroup and cumcount:
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
name_group.ngroup(ascending=False).add(1).mul(100) +
name_group.cumcount().add(1)
)
Output
FullName ResponseLabel sort_order
0 SEX Male 101
1 SEX Female 102
2 EXACT_AGE None 201
3 Country Afghanistan 301
4 Country Albania 302
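Note that ascending=False works here only because the alphabetical order of the group keys happens to be the reverse of their order of appearance. To number groups strictly by order of appearance, sort=False is more robust (a sketch):
name_group = pandas_df.groupby('FullName', sort=False)
pandas_df['sort_order'] = (
    name_group.ngroup().add(1).mul(100) +
    name_group.cumcount().add(1)
)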
I have a txt file that I read in through Python that looks like this:
Text File:
18|Male|66|180|Brown
23|Female|67|120|Brown
16|71|192|Brown
22|Male|68|185|Brown
24|Female|62|100|Blue
One of the rows has missing data, and the problem is that when I read it into a dataframe it appears like this:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 71 192 Brown NaN
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
I'm wondering if there is a way to shift the row that has missing data over without shifting all columns.
Here is what I have so far:
import pandas as pd
df = pd.read_csv('C:/Documents/file.txt', sep='|', names=['Age','Gender', 'Height', 'Weight', 'Eyes'])
df_full = df.loc[df['Gender'].isin(['Male','Female'])]
df_missing = df.loc[~df['Gender'].isin(['Male','Female'])]
df_missing = df_missing.shift(1,axis=1)
df_final = pd.concat([df_full, df_missing])
I was hoping to just separate out the rows with missing data, shift their columns over by one, and then concatenate them back with the data that has no missing values. But I'm not sure how to shift the columns starting at a certain point. This is the result I'm trying to get to:
Age Gender Height Weight Eyes
0 18 Male 66 180 Brown
1 23 Female 67 120 Brown
2 16 NaN 71 192 Brown
3 22 Male 68 185 Brown
4 24 Female 62 100 Blue
It doesn't really matter how I get it done, but the files I'm using have thousands of rows, so I cannot fix them individually. Any help is appreciated. Thanks!
Selectively shift a portion of each row that has missing values.
df.apply(lambda r: pd.concat([r.iloc[:1], r.iloc[1:].shift()])
         if r['Gender'] not in ['Male', 'Female']
         else r, axis=1)
The misaligned column data for each affected record will be realigned, with NaN inserted where the missing value was in the input text.
Age Gender Height Weight Eyes Age Gender Height Weight Eyes
1 23 Female 67 120 Brown 1 23 Female 67 120 Brown
2 16 71 192 Brown NaN ======> 2 16 NaN 71 192 Brown
For a single record, this'll do it:
df.loc[2] = pd.concat([df.loc[2].iloc[:1], df.loc[2].iloc[1:].shift()])
Starting at the 'Gender' column, data is shifted right; the default fill is NaN, and the 'Age' column is preserved.
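For files with thousands of rows, a vectorized variant of the same idea avoids the per-row apply (a sketch, assuming 'Gender' is the only column whose absence misaligns a row):
bad = ~df['Gender'].isin(['Male', 'Female'])
cols = df.columns[1:]  # everything after 'Age'
# Shift the misaligned rows one column to the right; 'Gender' becomes NaN.
df.loc[bad, cols] = df.loc[bad, cols].shift(axis=1)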
RegEx could help here.
Search for ^(\d+\|)(\d) and replace with $1|$2 (this just adds one vertical bar where Gender is missing: group 1 + | + group 2).
This can be done in almost any text editor (Notepad++, VS Code, Sublime, etc.).
See the example at the following link: https://regexr.com/50gkh
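The same substitution can also be applied in Python before parsing, so large files don't need a manual editor pass (a sketch, reusing the path from the question):
import io
import re
import pandas as pd
with open('C:/Documents/file.txt') as f:
    # Insert an empty field after the age wherever Gender is missing.
    fixed = re.sub(r'^(\d+\|)(\d)', r'\1|\2', f.read(), flags=re.MULTILINE)
df = pd.read_csv(io.StringIO(fixed), sep='|',
                 names=['Age', 'Gender', 'Height', 'Weight', 'Eyes'])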
I was doing some chained operations on a pandas dataframe where I need to include a column-rename step in the chain. The situation is like this:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
g = (df.groupby(['sex','time','smoker'])
.agg({'tip': ['count','sum'],
'total_bill': ['count','mean']})
.reset_index()
)
print(g.head())
This gives:
sex time smoker tip total_bill
count sum count mean
0 Male Lunch Yes 13 36.28 13 17.374615
1 Male Lunch No 20 58.83 20 18.486500
2 Male Dinner Yes 47 146.79 47 23.642553
3 Male Dinner No 77 243.17 77 20.130130
4 Female Lunch Yes 10 28.91 10 17.431000
Without chaining
I can do it manually on a separate line:
g.columns = [i[0] + '_' + i[1] if i[1] else i[0]
for i in g.columns.ravel()]
It works fine, but I would like to chain this column-renaming step so that I can chain further operations.
But I want it inside the chain
How to do so?
Required output:
g = (df.groupby(['sex','time','smoker'])
.agg({'tip': ['count','sum'],
'total_bill': ['count','mean']})
.reset_index()
.rename(something here)
# or .set_axis(something here)
# or, .pipe(something here) I am not sure.
) # If i could do this this, i can do further chaining
sex time smoker tip_count tip_sum total_bill_count total_bill_mean
0 Male Lunch Yes 13 36.28 13 17.374615
1 Male Lunch No 20 58.83 20 18.486500
2 Male Dinner Yes 47 146.79 47 23.642553
3 Male Dinner No 77 243.17 77 20.130130
4 Female Lunch Yes 10 28.91 10 17.431000
You can use pipe to handle this:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
g = (df.groupby(['sex','time','smoker'])
.agg({'tip': ['count','sum'],
'total_bill': ['count','mean']})
.reset_index()
.pipe(lambda x: x.set_axis([f'{a}_{b}' if b else a for a, b in x.columns], axis=1))
)
print(g.head())
Output:
sex time smoker tip_count tip_sum total_bill_count total_bill_mean
0 Male Lunch Yes 13 36.28 13 17.374615
1 Male Lunch No 20 58.83 20 18.486500
2 Male Dinner Yes 47 146.79 47 23.642553
3 Male Dinner No 77 243.17 77 20.130130
4 Female Lunch Yes 10 28.91 10 17.431000
Note: I am using f-string formatting, which requires Python 3.6+.
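If you flatten MultiIndex columns like this often, a small reusable helper keeps the chain readable (a sketch; flatten_cols is my own helper name, not a pandas function, and it assumes the columns are a MultiIndex):
def flatten_cols(df, sep='_'):
    # Join MultiIndex column levels with `sep`, skipping empty levels.
    out = df.copy()
    out.columns = [sep.join(str(level) for level in col if level != '')
                   for col in out.columns]
    return out

g = (df.groupby(['sex','time','smoker'])
       .agg({'tip': ['count','sum'],
             'total_bill': ['count','mean']})
       .reset_index()
       .pipe(flatten_cols))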
I'm trying to find the wage gap between genders given a set of majors.
Here is a text version of my table:
gender field group logwage
0 male BUSINESS 7.229572
10 female BUSINESS 7.072464
1 male COMM/JOURN 7.108538
11 female COMM/JOURN 7.015018
2 male COMPSCI/STAT 7.340410
12 female COMPSCI/STAT 7.169401
3 male EDUCATION 6.888829
13 female EDUCATION 6.770255
4 male ENGINEERING 7.397082
14 female ENGINEERING 7.323996
5 male HUMANITIES 7.053048
15 female HUMANITIES 6.920830
6 male MEDICINE 7.319011
16 female MEDICINE 7.193518
17 female NATSCI 6.993337
7 male NATSCI 7.089232
18 female OTHER 6.881126
8 male OTHER 7.091698
9 male SOCSCI/PSYCH 7.197572
19 female SOCSCI/PSYCH 6.968322
diff hasn't worked for me, as it takes the difference between every consecutive row, regardless of major.
And here is the code as it stands now:
for row in sorted_mfield:
if sorted_mfield['field group']==sorted_mfield['field group'].shift(1):
diff= lambda x: x[0]-x[1]
My next strategy would be to go back to the unsorted dataframe, where male and female were their own columns, and take the difference from there. But since I've spent an hour trying to do this and am pretty new to pandas, I thought I would ask how this works. Thanks.
Solution using pandas.DataFrame.shift() on a sorted version of the data:
df.sort_values(by=['field group', 'gender'], inplace=True)
df['gap'] = df.logwage - df.logwage.shift(1)
df[df.gender =='male'][['field group', 'gap']]
Producing the following output with the sample data:
field group gap
0 BUSINESS 0.157108
2 COMM/JOURN 0.093520
4 COMPSCI/STAT 0.171009
6 EDUCATION 0.118574
8 ENGINEERING 0.073086
10 HUMANITIES 0.132218
12 MEDICINE 0.125493
15 NATSCI 0.095895
17 OTHER 0.210572
18 SOCSCI/PSYCH 0.229250
Note: it assumes that you will always have a pair of values for each field group. If you want to validate that, or eliminate field groups without such a pair, the code below does the filtering:
df_grouped = df.groupby('field group')
df_filtered = df_grouped.filter(lambda x: len(x) == 2)
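With the filtered frame, groupby().diff() expresses the same per-group difference a little more directly than a global shift (a sketch):
df_filtered = df_filtered.sort_values(['field group', 'gender'])
df_filtered['gap'] = df_filtered.groupby('field group')['logwage'].diff()
print(df_filtered[df_filtered.gender == 'male'][['field group', 'gap']])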
I'd consider reshaping your DataFrame with pivot, which makes the difference easy to compute.
Code:
df.pivot(index='field group', columns='gender', values='logwage').rename_axis([None], axis=1)
# female male
#field group
#BUSINESS 7.072464 7.229572
#COMM/JOURN 7.015018 7.108538
#COMPSCI/STAT 7.169401 7.340410
#EDUCATION 6.770255 6.888829
#ENGINEERING 7.323996 7.397082
#HUMANITIES 6.920830 7.053048
#MEDICINE 7.193518 7.319011
#NATSCI 6.993337 7.089232
#OTHER 6.881126 7.091698
#SOCSCI/PSYCH 6.968322 7.197572
pivoted = df.pivot(index='field group', columns='gender', values='logwage').rename_axis(None, axis=1)
pivoted.male - pivoted.female
#field group
#BUSINESS 0.157108
#COMM/JOURN 0.093520
#COMPSCI/STAT 0.171009
#EDUCATION 0.118574
#ENGINEERING 0.073086
#HUMANITIES 0.132218
#MEDICINE 0.125493
#NATSCI 0.095895
#OTHER 0.210572
#SOCSCI/PSYCH 0.229250
#dtype: float64
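Both steps can be combined into one chain, keeping the gap as a named column (a sketch):
gap = (df.pivot(index='field group', columns='gender', values='logwage')
         .rename_axis(None, axis=1)
         .assign(gap=lambda d: d['male'] - d['female']))
print(gap)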