Merge two data frames and retain unique columns - python

I have these two data frames:
df1:
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
df2:
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of df1, as it is cleaner, and retain all other columns. I don't need the city and country info in the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them, but to no avail.
This returns NaN for every lat and lng value:
df3 = pd.merge(df1, df2, on="location", how="left")
This returns NaN for every Ethnic Origins and Percent(1) value:
df3 = pd.merge(df1, df2, on="location", how="right")

As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work, because you have different kinds of dashes in the two data frames (an em-dash in df2, a plain hyphen in df1). You can fix that with a plain string replacement on the character (note that Series.replace without regex=True only matches whole values, so str.replace is the right tool here):
df2.location = df2.location.str.replace('—', '-', regex=False)
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")

We can use findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
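One caveat with the findall approach: if any df1 location contains regex metacharacters (a dot, parentheses), the joined pattern will misfire. A safer variant escapes each alternative first:

import re
df2['location'] = df2.location.str.findall('|'.join(map(re.escape, df1.location))).str[0]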

I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. it doesn't find the corresponding values in df2.location to merge to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
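Putting the answers together, a minimal end-to-end sketch (assuming df1 and df2 as shown above):

import pandas as pd

# normalise df2's location to match df1: drop everything from the first
# comma onward, then swap em-dashes for plain hyphens
df2['location'] = (df2['location']
                   .str.replace(r',.*', '', regex=True)
                   .str.replace('—', '-', regex=False))

df3 = pd.merge(df1, df2, on='location', how='left')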

Related

merging dataframes with multiple conditions

I am trying to merge two large data frames based on two common columns. There is a small attempt and debate here, but no promising solution. The conditions are:
df1.year <= df2.year (same or later year of manufacture)
df1.make == df2.make AND df1.location == df2.location
I prepared a small mock data to explain:
first data frame:
import numpy as np
import pandas as pd

data = np.array([[2014, "toyota", "california", "corolla"],
                 [2015, "honda", "california", "civic"],
                 [2020, "hyndai", "florida", "accent"],
                 [2017, "nissan", "NaN", "sentra"]])
df = pd.DataFrame(data, columns=['year', 'make', 'location', 'model'])
df
second data frame:
data2 = np.array([[2012, "toyota", "california", "airbag"],
                  [2017, "toyota", "california", "wheel"],
                  [2022, "hyndai", "newyork", "seat"],
                  [2017, "nissan", "london", "light"]])
df2 = pd.DataFrame(data2, columns=['year', 'make', 'location', 'id'])
df2
desired output:
data3 = np.array([[2017, "toyota", "corolla", "california", "wheel"]])
df3 = pd.DataFrame(data3, columns=['year', 'make', 'model', 'location', 'id'])
df3
I tried the approach below, but it is too slow and also not very accurate:
df4 = pd.merge(df, df2, on=['location', 'make'], how='outer')
df4 = df4.dropna()
df4['year'] = df4.apply(lambda x: x['year_y'] if x['year_y'] >= x['year_x'] else "0", axis=1)
You can achieve it with a merge_asof (one to one left merge) and dropna:
# ensure numeric year
df['year'] = pd.to_numeric(df['year'])
df2['year'] = pd.to_numeric(df2['year'])
(pd.merge_asof(df.sort_values('year'),
               df2.sort_values('year').assign(year2=df2['year']),
               on='year', by=['make', 'location'],
               direction='forward')
   .dropna(subset='id')
   .astype({'year2': 'Int64'})  # nullable integer, so year2 displays as 2017, not 2017.0
)
NB. The intermediate is the size of df.
Output:
year make location model id year2
0 2014 toyota california corolla wheel 2017
one to many
As merge_asof is a one to one left join, if you want a one to many left join (or right join), you can invert the inputs and the direction.
I added an extra row for 2017 to demonstrate the difference.
year make location id
0 2012 toyota california airbag
1 2017 toyota california wheel
2 2017 toyota california windshield
3 2022 hyndai newyork seat
4 2017 nissan london light
Right join:
(pd.merge_asof(df2.sort_values('year'),
               df.sort_values('year'),
               on='year', by=['make', 'location'],
               direction='backward')
   .dropna(subset='model')
)
NB. The intermediate is the size of df2.
Output:
year make location id model
1 2017 toyota california wheel corolla
2 2017 toyota california windshield corolla
This should work:
df4 = pd.merge(df, df2, on=['location', 'make'], how='inner')
df4.where(df4.year_x <= df4.year_y).dropna()
Output:
year_x make location model year_y id
1 2014 toyota california corolla 2017 wheel
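A minor variant: boolean indexing expresses the same filter a bit more directly (the years are strings in the mock data, but lexicographic comparison still orders four-digit years correctly):

df4 = pd.merge(df, df2, on=['location', 'make'], how='inner')
df4 = df4[df4.year_x <= df4.year_y]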
Try this code (here 'make' and 'location' are the common columns; note the filter must use the merged frame's year_x/year_y columns, not the original frames):
df_merged = pd.merge(df, df2, on=['make', 'location'], how='inner')
df3 = df_merged[df_merged['year_x'] <= df_merged['year_y']]

How to search a substring from one df in another df?

I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num  city      address_line
1         Toronto   123 Fake St
2         Montreal  456 Sample Ave
df2:
DB_Num  Address
AB1     Toronto 123 Fake St
AB3     789 Random Drive, Toronto
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match came from.
My ideal output is:
file_num  city     address_line  DB_Num  Address
1         Toronto  123 Fake St   AB1     Toronto 123 Fake St
Based on the above linked post, I have made a look-ahead regex, and am searching using the insert and str.extract methods.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ")", expand=False))  # insert works in place and returns None
Since the addresses in df2 are entered manually, the parts are sometimes out of order, which is why I am using the look-ahead method of regex.
The look-ahead method causes str.extract to output only empty matches, since look-aheads consume no characters and the capture group has nothing left to capture. I can still filter out nulls to keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new series that gives, for each "Address" in df2, the "address_line" in df1 it corresponds to, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0 123 Fake St
1 NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
   file_num  city     address_line  DB_Num  Address
0  1.0       Toronto  123 Fake St   AB1     Toronto 123 Fake St
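The extract-then-merge above only checks address_line, while the question also wants the city to match. A hedged sketch that adds the city check on the (much smaller) merged result — address_key is a hypothetical helper column, and re.escape guards against regex metacharacters in the addresses:

import re

# build one alternation of all known address lines, escaped
pat = '({})'.format('|'.join(map(re.escape, df1['address_line'])))

# address_key holds the matched address_line, or NaN if none matched
df2['address_key'] = df2['Address'].str.extract(pat, expand=False)

out = df1.merge(df2.dropna(subset=['address_key']),
                left_on='address_line', right_on='address_key')

# keep only rows where the city also appears in the free-text Address
out = out[[city in addr for city, addr in zip(out['city'], out['Address'])]]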

Merging/concatenating two datasets on a specific column (different lengths) [duplicate]

I have two different datasets
df1 (1200 rows):
Name    Surname  Age  Address
Julian  Ross     34   Main Street
Mary    Jane     52   Cook Road
df2 (800 rows):
Name    Country  Telephone
Julian  US       NA
df1 contains the full list of unique names; df2 contains fewer rows, as many names were not added.
I would like to get a final dataset with the full list of names in df1 (and all the fields that are there) plus the fields in df2. I would then expect a final dataset of length 1200, with some empty fields for the names missing from df2.
I have tried as follows:
pd.concat([df1.set_index('Name'),df2.set_index('Name')], axis=1, join='inner')
but it returns the length of the smaller dataset (i.e. 800).
I have also tried
df1.merge(df2, how = 'inner', on = ['Name'])
... same result.
I am not totally familiar with joining/merging/concatenating functions, even after reading the document https://pandas.pydata.org/docs/user_guide/merging.html .
I know this question is probably a duplicate of some others, and I will be happy to delete it if necessary, but I would be really grateful if you could provide some help and explain how to get the expected result:
df
Name Surname Age Address Country Telephone
Julian Ross 34 Main Street US NA
Mary Jane 52 Cook Road
IIUC, use pd.merge like below:
>>> df1.merge(df2, how='left', on='Name')
Name Surname Age Address Country Telephone
0 Julian Ross 34 Main Street US NaN
1 Mary Jane 52 Cook Road NaN NaN
If you want to keep the number of rows of df1, you have to use how='left'; this preserves df1's length provided there are no duplicate names in df2.
Read Pandas Merging 101
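For completeness, the set_index route from the question also works once join='inner' is dropped: DataFrame.join defaults to a left join on the index (same no-duplicate-names caveat as above):

df = df1.set_index('Name').join(df2.set_index('Name')).reset_index()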

Issues converting wide to long on multiple columns [duplicate]

I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores          product_1  discount_1  product_2  discount_2  product_3  discount_3
Westminster     102141     T           102142     F
City of London  102141     T           102142     F           102143     T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape. The only extra step needed is to create a per-group counter with GroupBy.cumcount; afterwards, change the ordering of the second level and flatten the MultiIndex in columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
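An alternative sketch using pivot with the same cumcount counter (equivalent result, assuming the df above):

out = (df.assign(n=df.groupby('stores').cumcount().add(1))
         .pivot(index='stores', columns='n', values=['product', 'discount'])
         .sort_index(axis=1, level=1))
# flatten the (value, counter) MultiIndex into product_1, discount_1, ...
out.columns = [f'{col}_{n}' for col, n in out.columns]
out = out.reset_index()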

Pandas not merging on different columns - Key error or NaN

I am trying to mimic my issue with my current data. I am trying to use pandas to merge two dataframes on differently named columns (Code and Number) and bring over only one column from df2 (Location). I get either a KeyError or NaN.
Both were imported from CSV files as data frames;
both column names have no white space;
both columns have the same dtype.
I have tried looking at other answers here, literally copying and pasting the code from the answers and filling in my parts, and I still get errors or NaN.
df1:
[['Name', 'Income', 'Favourite superhero', 'Code', 'Colour'],
 ['Joe', '80000', 'Batman', '10004', 'Red'],
 ['Christine', '50000', 'Superman', '10005', 'Brown'],
 ['Joey', '90000', 'Aquaman', '10002', 'Blue']]
df2:
[['Number', 'Language', 'Location'],
 ['10005', 'English', 'Sudbury'],
 ['10002', 'French', 'Ottawa'],
 ['10004', 'German', 'New York']]
what I tried:
data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_on='Number',
                how='left')
data = pd.merge(CSV1,
                CSV2[['Location']],
                left_on='Code',
                right_index=True,
                how='left')
I am trying to have df1 with the location column from df2 for each instance where Number
and Code are the same.
For both of your commands to work, you need Number to be present in the right-side dataframe; your CSV2[['Location']] slice drops it, which is why you get the KeyError. For the first command, you then need to drop Number after the merge. For the second, you need to set_index on the sliced right dataframe, and there is no need to drop Number. I modified your commands accordingly:
CSV1.merge(CSV2[['Number', 'Location']], left_on='Code', right_on='Number', how='left').drop(columns='Number')
Or
CSV1.merge(CSV2[['Number', 'Location']].set_index('Number'), left_on='Code', right_index=True, how='left')
Out[892]:
Name Income Favourite superhero Code Colour Location
0 Joe 80000 Batman 10004 Red New York
1 Christine 50000 Superman 10005 Brown Sudbury
2 Joey 90000 Aquaman 10002 Blue Ottawa
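An equivalent sketch that sidesteps the left_on/right_on pair entirely by renaming Number to Code before merging:

data = CSV1.merge(CSV2.rename(columns={'Number': 'Code'})[['Code', 'Location']],
                  on='Code', how='left')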
