How can we handle dupes in a merged dataframe? - python

I am trying to merge two dataframes and I'm struggling to get this setup right. I Googled for a solution before posting here, but I'm still stuck. This is what I'm working with.
import pandas as pd
# Initialise the data as lists of dicts
data1 = [{'ID': 577878, 'Year': 2020, 'Type': 'IB', 'Expense': 6500},
         {'ID': 577878, 'Year': 2019, 'Type': 'IB', 'Expense': 16500}]
df1 = pd.DataFrame(data1)
df1
data2 = [{'ID': 577878, 'Year':2020, 'Type': 'IB', 'Expense':23000}]
df2 = pd.DataFrame(data2)
df2
df_final = pd.merge(df1,
                    df2,
                    left_on=['ID'],
                    right_on=['ID'],
                    how='inner')
df_final
This makes sense, but I don't want the 23000 duplicated.
If I do the merge like this:
df_final = pd.merge(df1,
                    df2,
                    left_on=['ID', 'Year'],
                    right_on=['ID', 'Year'],
                    how='inner')
df_final
This also makes sense, but now the 16500 is dropped off because there is no 2019 in df2.
How can I keep both records, but not duplicate the 23000?

My interpretation is that you just don't want to see 2 entries of 23000 for both 2019 and 2020. It should be for 2020 only.
You can use outer merge (with parameter how='outer') on 2 columns ID and Year, as follows:
df_final = pd.merge(df1,
                    df2,
                    on=['ID', 'Year'],
                    how='outer')
Result:
print(df_final)
ID Year Type_x Expense_x Type_y Expense_y
0 577878 2020 IB 6500 IB 23000.0
1 577878 2019 IB 16500 NaN NaN
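If the duplicated Type columns bother you, a possible cleanup of the merged result (a sketch; the combined Type column and the renamed expense columns are names I'm choosing here, not from the question):
df_final['Type'] = df_final['Type_x'].combine_first(df_final['Type_y'])
df_final = (df_final.drop(columns=['Type_x', 'Type_y'])
                    .rename(columns={'Expense_x': 'Expense_df1',
                                     'Expense_y': 'Expense_df2'}))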

Try column-filtering df2 so the Expense column isn't part of the merge:
df1.merge(df2[['ID', 'Year', 'Type']], on=['ID'])
Output:
ID Year_x Type_x Expense Year_y Type_y
0 577878 2020 IB 6500 2020 IB
1 577878 2019 IB 16500 2020 IB

Related

How can I map multiple row entries in df1 (name, category, amount) to one row entry per name under the respective category columns of another df2?

A first dataframe has a column containing categories which are the same as the headers of the second. There are multiple row entries for one name in df1. df2 will have 1 row entry per name. df1 has 1 row entry per category per name. All rows for one name occur in sequence in df1.
df1, the (empty) headers of df2, and the desired output are reproduced in the code below.
How can I map data from df1 to df2?
More specifically, how can I map multiple rows from df1 to 1 row and the respective columns of df2 in a more efficient way than looping twice to check for each category under each name?
Any help is appreciated,
Have a great day
Code:
import pandas as pds
df1 = pds.DataFrame({'Client': ['Rick', 'Rick', 'John'], 'Category': ['Service1', 'Service2', 'Service1'], 'Amount': [250, 6, 79]})
df2 = pds.DataFrame(columns = ['Client', 'Due_Date', 'Service1', 'Service2'])
output = pds.DataFrame({'Client': ['Rick', 'John'], 'Due_Date': [None,None] , 'Service1': [250, 79], 'Service2': [6, 0]})
This is an alternative approach using .pivot() and .assign()
df1_pivot = (df1.pivot(index='Client', columns='Category', values='Amount')
                .reset_index()
                .assign(Due_Date=None))
df_out = df2.assign(**df1_pivot)
print(df_out)
Client Due_Date Service1 Service2
0 John None 79.0 NaN
1 Rick None 250.0 6.0
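If you want 0 rather than NaN for a client with no entry in a category, as in the desired output, a possible follow-up (a sketch) is:
df_out[['Service1', 'Service2']] = df_out[['Service1', 'Service2']].fillna(0)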
You're looking for pandas.DataFrame.pivot:
out = (df1.pivot(index="Client", columns="Category")
          .reset_index()
          .set_axis(["Client", "Service1", "Service2"], axis=1)
          .assign(Due_Date=None))
NB: I suggest you use import pandas as pd, as per the usual import convention.
Output:
print(out)
Client Service1 Service2 Due_Date
0 John 79.0 NaN None
1 Rick 250.0 6.0 None
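If there are more categories than the two shown and you'd rather not hard-code the output column names, a variant (a sketch) is to reuse df2's column order instead of set_axis:
out = (df1.pivot(index="Client", columns="Category", values="Amount")
          .reset_index()
          .assign(Due_Date=None)
          .reindex(columns=df2.columns))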

merging dataframes with multiple conditions

I am trying to merge two large data frames based on two common columns. There is a small attempt and debate here, but no promising solution. The merge conditions are:
df1.year <= df2.year (same or later manufacturing year)
df1.make == df2.make AND df1.location == df2.location
I prepared a small mock data to explain:
first data frame:
import numpy as np
import pandas as pd

data = np.array([[2014, "toyota", "california", "corolla"],
                 [2015, "honda", "california", "civic"],
                 [2020, "hyndai", "florida", "accent"],
                 [2017, "nissan", "NaN", "sentra"]])
df = pd.DataFrame(data, columns=['year', 'make', 'location', 'model'])
df
second data frame:
data2 = np.array([[2012, "toyota", "california", "airbag"],
                  [2017, "toyota", "california", "wheel"],
                  [2022, "hyndai", "newyork", "seat"],
                  [2017, "nissan", "london", "light"]])
df2 = pd.DataFrame(data2, columns=['year', 'make', 'location', 'id'])
df2
desired output:
data3 = np.array([[2017, "toyota", "corolla", "california", "wheel"]])
df3 = pd.DataFrame(data3, columns=['year', 'make', 'model', 'location', 'id'])
df3
I tried the approach below, but it is too slow and also not very accurate:
df4 = pd.merge(df, df2, on=['location', 'make'], how='outer')
df4 = df4.dropna()
df4['year'] = df4.apply(lambda x: x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
You can achieve it with a merge_asof (a one-to-one left merge) and dropna:
# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])
(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year')
                  .assign(year2=df2['year']),
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset='id')
   .convert_dtypes()
)
NB. The intermediate is the size of df.
Output:
year make location model id year2
0 2014 toyota california corolla wheel 2017
one to many
As merge_asof is a one-to-one left join, if you want a one-to-many left join (or a right join), you can invert the inputs and the direction.
I added an extra row for 2017 to demonstrate the difference.
year make location id
0 2012 toyota california airbag
1 2017 toyota california wheel
2 2017 toyota california windshield
3 2022 hyndai newyork seat
4 2017 nissan london light
Right join:
(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset='model')
)
NB. The intermediate is the size of df2.
Output:
year make location id model
1 2017 toyota california wheel corolla
2 2017 toyota california windshield corolla
This should work:
df4 = pd.merge(df, df2, on=['location', 'make'], how='inner')
df4.where(df4.year_x <= df4.year_y).dropna()
Output:
year_x make location model year_y id
1 2014 toyota california corolla 2017 wheel
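Equivalently, boolean indexing keeps only the rows that satisfy the year condition without first creating NaN rows (a sketch on the same merged frame):
df4[df4.year_x <= df4.year_y]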
Try this code (here 'make' and 'location' are common columns):
df_outer = pd.merge(df, df2, on=['make', 'location'], how='inner')
df3 = df_outer[df_outer['year_x'] <= df_outer['year_y']]

Pandas Merge function creating incorrect data

I am attempting to join two data frames together. However, when I use the Merge function in Pandas, in some cases it is creating rows that should not be present and are not accurate.
import pandas as pd
df = pd.DataFrame({'TimeAway Type': ['Sick', 'Vacation', 'Late'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Hours Requested': [0.04, 3, 5]})
df2 = pd.DataFrame({'Schedule Segment': ['Sick', 'VTO', 'Tardy'], 'Date': ['2022-03-09', '2022-03-09', '2022-03-15'], 'Duration': [2, 3, 1]})
merged = pd.merge(df, df2, on=['Date'])
print(merged)
Output:
TimeAway Type Date Hours Requested Schedule Segment Duration
0 Sick 2022-03-09 0.04 Sick 2
1 Sick 2022-03-09 0.04 VTO 3
2 Vacation 2022-03-09 3.00 Sick 2
3 Vacation 2022-03-09 3.00 VTO 3
4 Late 2022-03-15 5.00 Tardy 1
As you can see from the output above, on the date where there is only one row in each DF, everything works perfectly fine. However, on the dates where there is more than one row in each DF, it produces extra rows. It's almost as if it's saying "I don't know how to match this data so here's every single possible combination."
Desired Output:
TimeAway Type Date Hours Requested Schedule Segment Duration
0 Sick 2022-03-09 0.04 Sick 2
1 Vacation 2022-03-09 3.00 VTO 3
2 Late 2022-03-15 5.00 Tardy 1
This is just a subset of the data. The DF is quite large and there are a lot more instances of where this occurs with different values. Is there anything that can be done to stop this from happening?
Update, per a comment on the question, to join on the order of records within each date:
df_o=df.assign(date_order=df.groupby('Date').cumcount())
df2_o=df2.assign(date_order=df2.groupby('Date').cumcount())
df_o.merge(df2_o, on=['Date','date_order'])
Output:
TimeAway Type Date Hours Requested date_order Schedule Segment Duration
0 Sick 2022-03-09 0.04 0 Sick 2
1 Vacation 2022-03-09 3.00 1 VTO 3
2 Late 2022-03-15 5.00 0 Tardy 1
This creates a pseudo-key (the cumulative count within each day) to join on, so rows are matched by their order within each date.
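If you don't want the helper column in the result, you could drop it afterwards (a sketch):
df_o.merge(df2_o, on=['Date', 'date_order']).drop(columns='date_order')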
Try this:
df2['TimeAway Type'] = (df2['Schedule Segment'].map({'VTO': 'Vacation',
                                                     'Tardy': 'Late'})
                        .fillna(df2['Schedule Segment']))
merged = pd.merge(df, df2, on=['TimeAway Type', 'Date'])
merged
Output:
TimeAway Type Date Hours Requested Schedule Segment Duration
0 Sick 2022-03-09 0.04 Sick 2
1 Vacation 2022-03-09 3.00 VTO 3
2 Late 2022-03-15 5.00 Tardy 1
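One caveat (an assumption about the rest of your data): any Schedule Segment value with no mapping and no identically named TimeAway Type will be dropped by the default inner merge; pass how='left' to pd.merge if those rows need to be kept.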

Merge two data frames and retain unique columns

I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of #df1 as it is cleaner and retain all other columns. I don't need the city, country info on the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide this still won't work, because you have different kinds of dashes in the two data frames. You could solve this in a similar way (no regex needed this time):
df2.location = df2.location.str.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
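Putting the two fixes and the merge together (a sketch, same frames as above):
df2['location'] = (df2['location']
                   .replace(r',.*', '', regex=True)
                   .str.replace('—', '-'))
df3 = pd.merge(df1, df2, on='location', how='left')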
We can use findall with the df1 locations to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. it doesn't find the corresponding values in df2.location to merge to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
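Note that, with the sample data shown, you would still need to normalize the dashes (e.g. with .str.replace('—', '-'), as in the first answer) before the keys actually match.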

How to get distinct count of keys along with other aggregations in pandas

My data frame (DF) looks like this
Customer_number Store_number year month last_buying_date1 amount
1 20 2014 10 2015-10-07 100
1 20 2014 10 2015-10-09 200
2 20 2014 10 2015-10-20 100
2 10 2014 10 2015-10-13 500
and I want to get an output like this
year month sum_purchase count_purchases distinct customers
2014 10 900 4 3
How do I get an output like this using agg and groupby? I am currently using a two-step groupby, but I'm struggling to get the distinct customers. Here's my approach:
#### Step 1 - Aggregating everything at customer_number, store_number level
aggregations = {
    'amount': 'sum',
    'last_buying_date1': 'count',
}
grouped_at_Cust = DF.groupby(['customer_number', 'store_number', 'month', 'year']).agg(aggregations).reset_index()
grouped_at_Cust.columns = ['customer_number', 'store_number', 'month', 'year', 'total_purchase', 'num_purchase']
#### Step 2 - Aggregating at year, month level
aggregations = {
    'total_purchase': 'sum',
    'num_purchase': 'sum',
    size  # <-- this is the part I can't get right
}
Monthly_customers = grouped_at_Cust.groupby(['year', 'month']).agg(aggregations).reset_index()
Monthly_customers.columns = ['year', 'month', 'sum_purchase', 'count_purchase', 'distinct_customers']
My struggle is in the 2nd step. How do I include size in the 2nd aggregation step?
You could use groupby.agg, supplying the function nunique to return the number of unique customer IDs in each group.
df_grp = df.groupby(['year', 'month'], as_index=False) \
           .agg({'purchase_amt': ['sum', 'count'], 'Customer_number': ['nunique']})
df_grp.columns = map('_'.join, df_grp.columns.values)
df_grp
In case you are trying to group them differently (omitting certain columns) when performing the groupby operation:
df_grp_1 = df.groupby(['year', 'month']).agg({'purchase_amt':['sum','count']})
df_grp_2 = df.groupby(['Store_number', 'month', 'year'])['Customer_number'].agg('nunique')
Keep only the level of the MultiIndex columns that contains the agg operation performed:
df_grp_1.columns = df_grp_1.columns.get_level_values(1)
Merge them back on the intersection of the columns used to group them:
df_grp = df_grp_1.reset_index().merge(df_grp_2.reset_index().drop(['Store_number'], axis=1),
                                      on=['year', 'month'], how='outer')
Rename the columns to new ones:
d = {'sum': 'sum_purchase', 'count': 'count_purchase', 'nunique': 'distinct_customers'}
df_grp.columns = [d.get(x, x) for x in df_grp.columns]
df_grp
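Alternatively, recent pandas versions let you do the whole thing in one step with named aggregation. A sketch using the column names from the question (assuming they match your real frame):
Monthly_customers = (DF.groupby(['year', 'month'], as_index=False)
                       .agg(sum_purchase=('amount', 'sum'),
                            count_purchases=('last_buying_date1', 'count'),
                            distinct_customers=('Customer_number', 'nunique')))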
