I am trying to merge two large data frames based on two common columns. There has been some discussion of this elsewhere, but no promising solution. The conditions are:
df1.year <= df2.year (the same or a later year of manufacture)
df1.make == df2.make AND df1.location == df2.location
I prepared some small mock data to explain:
first data frame:
import numpy as np
import pandas as pd

data = np.array([[2014, "toyota", "california", "corolla"],
                 [2015, "honda", "california", "civic"],
                 [2020, "hyndai", "florida", "accent"],
                 [2017, "nissan", "NaN", "sentra"]])
df = pd.DataFrame(data, columns=['year', 'make', 'location', 'model'])
df
second data frame:
data2 = np.array([[2012, "toyota", "california", "airbag"],
                  [2017, "toyota", "california", "wheel"],
                  [2022, "hyndai", "newyork", "seat"],
                  [2017, "nissan", "london", "light"]])
df2 = pd.DataFrame(data2, columns=['year', 'make', 'location', 'id'])
df2
desired output:
data3 = np.array([[2017, "toyota", "corolla", "california", "wheel"]])
df3 = pd.DataFrame(data3, columns=['year', 'make', 'model', 'location', 'id'])
df3
I tried the approach below, but it is too slow and also not accurate:
df4 = pd.merge(df, df2, on=['location', 'make'], how='outer')
df4 = df4.dropna()
df4['year'] = df4.apply(lambda x: x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
You can achieve it with a merge_asof (a one-to-one left merge) and dropna:
# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])
(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year')
                  .assign(year2=df2['year']),  # keep the right frame's year (aligned on index)
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset=['id'])
   .astype({'year2': int})  # restore an integer year after the merge introduced NaNs
)
NB. The intermediate is the size of df.
Output:
year make location model id year2
0 2014 toyota california corolla wheel 2017
One to many
As merge_asof is a one-to-one left join, if you want a one-to-many left join (or a right join), you can invert the inputs and the direction.
I added an extra row for 2017 to df2 to demonstrate the difference (a sketch of how to append it follows the table):
year make location id
0 2012 toyota california airbag
1 2017 toyota california wheel
2 2017 toyota california windshield
3 2022 hyndai newyork seat
4 2017 nissan london light
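The answer doesn't show how the extra row was added; one possible way, assuming the numeric-year df2 from above (the resulting index order differs from the table, which doesn't matter since merge_asof sorts by year anyway):
# append the extra 2017 row shown in the table above
extra = pd.DataFrame([[2017, "toyota", "california", "windshield"]],
                     columns=['year', 'make', 'location', 'id'])
df2 = pd.concat([df2, extra], ignore_index=True)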
Right join:
(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset=['model'])
)
NB. The intermediate is the size of df2.
Output:
year make location id model
1 2017 toyota california wheel corolla
2 2017 toyota california windshield corolla
This should work:
df4 = pd.merge(df, df2, on=['location', 'make'], how='inner')
df4.where(df4.year_x <= df4.year_y).dropna()
Output:
year_x make location model year_y id
1 2014 toyota california corolla 2017 wheel
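Note that the mock frames store year as strings (np.array coerces every cell to str), so the comparison above is lexicographic; it happens to work for four-digit years, but converting first is safer. A minimal sketch:
# convert the merged year columns to numbers before comparing
df4['year_x'] = pd.to_numeric(df4['year_x'])
df4['year_y'] = pd.to_numeric(df4['year_y'])
df4.where(df4.year_x <= df4.year_y).dropna()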
Try this code (here 'make' and 'location' are the common columns):
df_merged = pd.merge(df, df2, on=['make', 'location'], how='inner')
df3 = df_merged[df_merged['year_x'] <= df_merged['year_y']]
Related
I have accident data, and part of this data includes the year of the accident, the degree of injury, and the age of the injured person. This is an example of the DataFrame:
df = pd.DataFrame({'Year': ['2010', '2010', '2010', '2010', '2010', '2011', '2011', '2011', '2011'],
                   'Degree_injury': ['no_injury', 'death', 'first_aid', 'minor_injury', 'disability',
                                     'disability', 'disability', 'death', 'first_aid'],
                   'Age': [50, 31, 40, 20, 45, 29, 60, 18, 48]})
print(df)
I want three output variables grouped in a table by year, where age is less than 40: counts of the number of disabilities, deaths, and minor injuries.
The output should be like this:
I generated the three variables (num_disability, num_death, num_minor_injury) for age < 40 as shown below.
disability_filt = (df['Degree_injury'] == 'disability') & (df['Age'] < 40)
num_disability = df[disability_filt].groupby('Year')['Degree_injury'].count()

death_filt = (df['Degree_injury'] == 'death') & (df['Age'] < 40)
num_death = df[death_filt].groupby('Year')['Degree_injury'].count()

minor_injury_filt = (df['Degree_injury'] == 'minor_injury') & (df['Age'] < 40)
num_minor_injury = df[minor_injury_filt].groupby('Year')['Degree_injury'].count()
How can I combine these variables into one table as illustrated above? Thank you in advance.
Use pivot_table after filtering your rows according to your condition:
out = df[df['Age'].lt(40)].pivot_table(index='Year', columns='Degree_injury',
                                       values='Age', aggfunc='count', fill_value=0)
print(out)
# Output:
Degree_injury death disability minor_injury
Year
2010 1 0 1
2011 1 1 0
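For reference, pd.crosstab on the filtered rows gives the same counts; a minimal equivalent sketch:
# cross-tabulate Year against Degree_injury for the under-40 rows
under40 = df[df['Age'].lt(40)]
out = pd.crosstab(under40['Year'], under40['Degree_injury'])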
# prep data
df2 = (df.loc[df.Age < 40]
         .groupby("Year").Degree_injury.value_counts()
         .to_frame()
         .reset_index(level=0))
# note: in newer pandas the counted column may be named 'count' rather than 'Degree_injury'
df2 = df2.rename(columns={'Degree_injury': 'Count'})
df2['Degree_injury'] = df2.index
df2
# Year Count Degree_injury
# death 2010 1 death
# minor_injury 2010 1 minor_injury
# death 2011 1 death
# disability 2011 1 disability
# pivot result
df2.pivot(index='Year',columns='Degree_injury')
# death disability minor_injury
# Year
# 2010 1.0 NaN 1.0
# 2011 1.0 1.0 NaN
I am trying to merge two dataframes and I'm struggling to get this setup right. I Googled for a solution before posting here, but I'm still stuck. This is what I'm working with.
import pandas as pd
# Initialise data as lists of dicts
data1 = [{'ID': 577878, 'Year': 2020, 'Type': 'IB', 'Expense': 6500},
         {'ID': 577878, 'Year': 2019, 'Type': 'IB', 'Expense': 16500}]
df1 = pd.DataFrame(data1)
df1
data2 = [{'ID': 577878, 'Year': 2020, 'Type': 'IB', 'Expense': 23000}]
df2 = pd.DataFrame(data2)
df2
df_final = pd.merge(df1,
                    df2,
                    left_on=['ID'],
                    right_on=['ID'],
                    how='inner')
df_final
This makes sense, but I don't want the 23000 duplicated.
If I do the merge like this:
df_final = pd.merge(df1,
                    df2,
                    left_on=['ID', 'Year'],
                    right_on=['ID', 'Year'],
                    how='inner')
df_final
This also makes sense, but now the 16500 is dropped off because there is no 2019 in df2.
How can I keep both records, but not duplicate the 23000?
My interpretation is that you just don't want to see 2 entries of 23000 for both 2019 and 2020. It should be for 2020 only.
You can use outer merge (with parameter how='outer') on 2 columns ID and Year, as follows:
df_final = pd.merge(df1,
                    df2,
                    on=['ID', 'Year'],
                    how='outer')
Result:
print(df_final)
ID Year Type_x Expense_x Type_y Expense_y
0 577878 2020 IB 6500 IB 23000.0
1 577878 2019 IB 16500 NaN NaN
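If a single Type column is preferred in that result, the _x/_y pair can be coalesced; a sketch against the column names shown above (keeping both Expense columns, since collapsing them would hide which frame each amount came from):
# take Type from df1 where present, else from df2, and label the expenses
df_final['Type'] = df_final['Type_x'].combine_first(df_final['Type_y'])
df_final = (df_final.drop(columns=['Type_x', 'Type_y'])
                    .rename(columns={'Expense_x': 'Expense_df1',
                                     'Expense_y': 'Expense_df2'}))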
Alternatively, column-filter df2 so that its Expense column is not merged in:
df1.merge(df2[['ID', 'Year', 'Type']], on=['ID'])
Output:
ID Year_x Type_x Expense Year_y Type_y
0 577878 2020 IB 6500 2020 IB
1 577878 2019 IB 16500 2020 IB
I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores product_1 discount_1 product_2 discount_2 product_3 discount_3
Westminster 102141 T 102142 F
City of London 102141 T 102142 F 102143 T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape; it is only necessary to create a counter with GroupBy.cumcount, then change the ordering of the second level and flatten the MultiIndex columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
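A roughly equivalent alternative, starting again from the original long-format df, is to build the counter as a column and pivot on it; a sketch:
# number each store's rows, then pivot product/discount against that counter
out = (df.assign(n=df.groupby('stores').cumcount().add(1))
         .pivot(index='stores', columns='n'))
out.columns = [f'{col}_{n}' for col, n in out.columns]
out = out.reset_index()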
I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of df1, as it is cleaner, and retain all other columns. I don't need the city/country info in the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because you have different kinds of dashes in the two data frames. You could solve this in a similar way (no regex needed this time):
df2.location = df2.location.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can use findall to create the key:
# match each df2 location against the df1 names and take the first hit
df2['location'] = df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
Note that with the exact sample data the dash mismatch described above still applies: only Davenport matches unless the dashes are normalized first.
I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. the corresponding values in df2.location aren't found to merge with df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
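As with the findall approach, the split alone still won't match the dashed rows in the sample data; a sketch combining both clean-ups:
# take everything before the first comma, then normalize the em dashes
df2["location"] = df2["location"].str.split(",").str[0].str.replace("—", "-")
df3 = pd.merge(df1, df2, on="location", how="left")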
I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...
My dataframe consists of all United States and territories for years 2013-2016 with several attributes.
>>> df.head(2)
state enrollees utilizing enrol_age65 util_age65 year
1 Alabama 637247 635431 473376 474334 2013
2 Alaska 30486 28514 21721 20457 2013
>>> df.tail(2)
state enrollees utilizing enrol_age65 util_age65 year
214 Puerto Rico 581861 579514 453181 450150 2016
215 U.S. Territories 24329 16979 22608 15921 2016
I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.
Desired Output:
enrollees utilizing
year state
2013 California 3933310 3823455
New York 3133980 3002948
Florida 2984799 2847574
...
2016 California 4516216 4365896
Florida 4186823 3984756
New York 4009829 3874682
So far I've tried the following:
df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)
Which yields just the first 3 rows in the GroupBy object:
enrollees utilizing
year state
2013 Alabama 637247 635431
Alaska 30486 28514
Arizona 707683 683273
I've also tried a lambda function:
df.groupby(['year','state'])['enrollees','utilizing']\
.apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
Which yields the absolute largest 3 in the GroupBy object:
enrollees utilizing
year state
2016 California 4516216 4365896
2015 California 4324304 4191704
2014 California 4133532 4011208
I think it may have to do with the indexing of the GroupBy object, but I am not sure...Any guidance would be appreciated!
Well, you could do something not that pretty.
First, get a list of unique years using set():
years_list = list(set(df.year))
Create a dummy dataframe and a function for concatenating that I've made in the past:
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoid retyping the same lines of code for every df.
    The parameters should be the temporary df created at each loop and the
    concatenated df that will contain all values, which must first be
    initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Creating the dummy final df:
df_final = pd.DataFrame()
Now loop over each year, concatenating into the new DF:
for year in years_list:
    # query filters to the rows where year matches;
    # @year refers to the external variable (the loop input)
    df2 = df.query("year == @year")
    # temporary DF with only that year: sum, sort, and take the top 3
    df_temp = (df2.groupby(['year', 'state'])[['enrollees', 'utilizing']]
                  .sum()
                  .sort_values(by='enrollees', ascending=False)
                  .head(3))
    # finally, call our function, which keeps concatenating the temp DFs
    df_final = concatenate_loop_dfs(df_temp, df_final)
And done:
print(df_final)
You then need to sort your GroupBy object with .sort_values('enrollees', ascending=False).
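For comparison, a more direct alternative without the loop; a sketch (the second groupby keeps the top 3 rows per year of the already-sorted sums):
top3 = (df.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum()
          .sort_values('enrollees', ascending=False)
          .groupby(level='year').head(3)
          .sort_index(level='year', sort_remaining=False))
print(top3)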