Conditional merge for CSV files using Python (pandas)

I am trying to merge >=2 files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:
file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111
file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282
Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282
As you may have noticed:
9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999 are duplicates based on store_id and phone, but I picked the row from file2 since its address is more descriptive.
store_id+phone is my composite primary key to look up a location and find duplicates (store_id alone is enough in the example above, but I need a key based on multiple column values).
Question:
- I need to merge multiple CSV files with the same schema that contain duplicate rows.
- The row-level merge should have logic to pick a specific value for each column, e.g. phone picked from file1 and address picked from file2.
- A combination of one or more column values defines whether rows are duplicates.
Can this be achieved using pandas?
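(For reference, a minimal sketch of loading the two example files into DataFrames; the file names file1.csv and file2.csv are hypothetical, and the answers below start from df1 and df2.)
import pandas as pd

# hypothetical file names matching the examples above
df1 = pd.read_csv('file1.csv')  # store_id, address, phone
df2 = pd.read_csv('file2.csv')  # same schema, overlapping store_ids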

One way to smash them together is to use merge (on store_id and phone; if these were the index this would be a join rather than a merge):
In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')
In [12]: res
Out[12]:
   store_id     address_x        phone           address_y
0      9191  9827 Park st    999999999  9827 Park st Apt82
1      8181  543 Hello st   1111111111                 NaN
2      7171           NaN  87282728282         912 John st
You can then use where to select address_y if it exists, otherwise address_x:
In [13]: res['address'] = res.address_y.where(res.address_y.notnull(), res.address_x)
In [14]: del res['address_x'], res['address_y']
In [15]: res
Out[15]:
   store_id        phone             address
0      9191    999999999  9827 Park st Apt82
1      8181   1111111111        543 Hello st
2      7171  87282728282         912 John st
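If there are more than two files, the same idea can be chained; a minimal sketch, assuming every frame shares the store_id/phone key and that the later file should win whenever both have an address (the same rule the where call above applies):
from functools import reduce
import pandas as pd

def merge_pair(left, right):
    # outer-join on the composite key, then prefer the right-hand (later file)
    # address wherever it is not null, falling back to the left-hand one
    out = left.merge(right, on=['store_id', 'phone'], how='outer',
                     suffixes=('_x', '_y'))
    out['address'] = out['address_y'].combine_first(out['address_x'])
    return out.drop(columns=['address_x', 'address_y'])

res = reduce(merge_pair, [df1, df2])  # append df3, df4, ... for more files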

How about using concat, groupby and agg? Then you can write an agg function to choose the right value:
import pandas as pd
import io
t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""
t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""
df1 = pd.read_csv(io.StringIO(t1))  # StringIO, since t1 is a str, not bytes
df2 = pd.read_csv(io.StringIO(t2))
df = pd.concat([df1, df2]).reset_index(drop=True)
def f(s):
    # within each (store_id, phone) group, keep the longest (most descriptive) value
    loc = s.str.len().idxmax()
    return s[loc]

df.groupby(["store_id", "phone"]).agg(f)
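To get back a flat frame shaped like the expected output, and to apply the same idea to real CSV files on disk, something along these lines should work (the glob pattern is a hypothetical placeholder):
import glob
import pandas as pd

# read every CSV that shares the schema (hypothetical file name pattern)
frames = [pd.read_csv(path) for path in glob.glob('stores_*.csv')]
df = pd.concat(frames, ignore_index=True)

# per duplicate key, keep the longest (most descriptive) value in each remaining column
result = (df.groupby(['store_id', 'phone'], as_index=False)
            .agg(lambda s: s.loc[s.astype(str).str.len().idxmax()]))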

Related

How to search a substring from one df in another df?

I have read this post and would like to do something similar.
I have 2 dfs:
df1:

file_num  city      address_line
1         Toronto   123 Fake St
2         Montreal  456 Sample Ave

df2:

DB_Num  Address
AB1     Toronto 123 Fake St
AB3     789 Random Drive, Toronto
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match came from.
My ideal output is:
file_num  city     address_line  DB_Num  Address
1         Toronto  123 Fake St   AB1     Toronto 123 Fake St
Based on the above linked post, I have made a look ahead regex, and am searching using the insert and str.extract method.
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df2.insert(0, 'search_field', df2['Address'].str.extract('(' + pat + ')', expand=False))
Since the addresses in df2 are entered manually, they are sometimes out of order, which is why I am using the regex look-ahead approach.
The look-ahead causes str.extract to output no captured value, although I can still filter out the nulls and keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!
Start by creating a new series which maps each "Address" in df2 to the "address_line" in df1 it corresponds to, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
#output:
0 123 Fake St
1 NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
   file_num     City address_line DB_num              Address
0       1.0  Toronto  123 Fake St    AB1  Toronto 123 Fake St
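If the city also has to match (the reason for the look-ahead in the question), one option, sketched here with the column names from the question's df1, is to filter after the extract-based merge rather than inside the regex; only rows that already matched on address_line are checked, so the loop stays small:
out = df1.merge(df2, left_on='address_line', right_on=merge_df)

# keep only rows whose free-text Address also mentions the city
mask = [city in address for city, address in zip(out['city'], out['Address'])]
out = out[mask]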

How to apply a function to sub selections in a pandas DataFrame object efficiently?

I have a dataframe of people's addresses and names. I have a function that processes names that I want to apply. I am creating sub selections of people with matching addresses and applying the function to those groups.
Up to this point I have been using .loc as follows:
for x in df['address'].unique():
    sub_selection = df.loc[df['address'] == x]
    sub_selection.apply(lambda x: function(x), axis=1)
Is there a more efficient way to approach this? I am looking into pandas' .groupby() functionality, but I am struggling to get it to work.
df.groupby('address').agg(lambda x: function(x['names']))
Here is some sample data:
address, name, Unique_ID
1022 Boogie Woogie Ave, John Smith, np.nan
1022 Boogie Woogie Ave, Frederick Smith, np.nan
1022 Boogie Woogie Ave, John Jacob Smith, np.nan
3030 Sesame Street, Big Bird, np.nan
3030 Sesame Street, Elmo, np.nan
3030 Sesame Street, Big Yellow Bird, np.nan
My function itself has some moving parts, but basically I check the name against a reference dictionary I create. This process goes through a few other steps, but returns a list of indexes where the name matches. I use those indexes to assign a shared unique id to matching names. In my example, Big Bird and Big Yellow Bird would match.
def function(x):
    match_list = []
    if x['name'] in __lookup_dict[0]:
        match_list.append(__lookup_dict[0][x['name']])
    # reduce the matching lists to a single set of place ids matching all elements
    result = set(match_list[0])
    for s in match_list[1:]:
        if len(result.intersection(s)) != 0:
            result.intersection_update(s)
    # take the reduced set and assign each place id a unique id.
    # note we are working with place ids, not the sub df's index. They don't match
    if pd.isnull(x['Unique_ID']):
        uid = str(uuid.uuid4())
        for g in result:
            df.at[df.index[df.index == g].tolist()[0], 'Unique_ID'] = uid
    else:
        pass
    return result
Try using
df.groupby('address').apply(lambda x: function(x['names']))
Edited:
Check this example. I've used a dataframe from another StackOverflow question
import pandas as pd

df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Lahore", "Lahore"],
    "Points": [90.1, 90.3, 94.1, 95, 89, 90.5],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"]
})
d = {k: v for v, k in enumerate(df.City.unique())}
df['idx'] = df['City'].replace(d)
print(df)
Output:
City Points Gender idx
0 Delhi 90.1 Male 0
1 Delhi 90.3 Female 0
2 Mumbai 94.1 Female 1
3 Mumbai 95.0 Male 1
4 Lahore 89.0 Female 2
5 Lahore 90.5 Male 2
So, try using
d = {k:v for v,k in enumerate(df['address'].unique())}
df['idx'] = df['address'].replace(d)
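As a side note, pandas can produce the same group index directly with groupby().ngroup(), which is equivalent to building the dictionary by hand:
df['idx'] = df.groupby('address', sort=False).ngroup()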

Python / Pandas - Consider 'empty string' as a match during merge using multiple columns

I'm trying to merge 2 dataframes on multiple columns: ['Unit','Geo','Region']. The condition is: when a value from right_df encounters an empty string in left_df, it should be considered a match.
E.g., when the first row of right_df joins with the first row of left_df, we have an empty string for the column 'Region', so the empty string needs to be treated as a match to 'AU', giving the final result 'DE'.
left_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','ACCTEST1','ACCTEST1'],
                        'Geo':['AP','JAPAN','NA','Europe','Europe','','','AP','Europe','NA'],
                        'Region':['','','','France','BENELUX','','','','',''],
                        'Resp':['DE','FG','BO','MD','KR','PM','NJ','JI','HN','FG']})
right_df = pd.DataFrame({'Unit':['DEV','DEV','DEV','DEV','TEST1','TEST2','ACCTEST1','DEV','ACCTEST1','TEST1','TEST2','DEV','TEST1','TEST2'],
                         'Geo':['AP','JAPAN','AP','NA','AP','Europe','Europe','Europe','AP','JAPAN','AP','Europe','Europe','Europe'],
                         'Region':['AU','JAPAN','ISA','USA','AU/NZ','France','CEE','France','ISA','JAPAN','ISA','BENELUX','CEE','CEE']})
I tried the code below, but it only works when the 'empty string' cells actually have values. I'm struggling to add a condition saying 'consider an empty string as a match' or 'ignore the empty string on the right_df side and continue with the available match'. Would appreciate any help. Thanks!!
result_df = pd.merge(left_df, right_df, how='inner', on=['Unit','Geo','Region'])
Use DataFrame.merge inside a list comprehension and perform the left merge operations in the following order:
Merge right_df with left_df on columns Unit, Geo and Region and select column Resp.
Merge right_df with left_df(drop duplicate values in Unit and Geo) on columns Unit, Geo and select column Resp.
Merge right_df with left_df(drop duplicate values in Unit) on column Unit and select column Resp.
Then use functools.reduce with Series.combine_first as the reducing function to combine all the series in the list s, and assign the result to the Resp column of right_df.
from functools import reduce
c = ['Unit', 'Geo', 'Region']
s = [right_df.merge(left_df.drop_duplicates(c[:len(c) - i]),
                    on=c[:len(c) - i], how='left')['Resp'] for i in range(len(c))]
right_df['Resp'] = reduce(pd.Series.combine_first, s)
Result:
print(right_df)
Unit Geo Region Resp
0 DEV AP AU DE
1 DEV JAPAN JAPAN FG
2 DEV AP ISA DE
3 DEV NA USA BO
4 TEST1 AP AU/NZ PM
5 TEST2 Europe France NJ
6 ACCTEST1 Europe CEE HN
7 DEV Europe France MD
8 ACCTEST1 AP ISA JI
9 TEST1 JAPAN JAPAN PM
10 TEST2 AP ISA NJ
11 DEV Europe BENELUX KR
12 TEST1 Europe CEE PM
13 TEST2 Europe CEE NJ
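The fallback works because combine_first fills NaN in the left series from the right one, so each row cascades from the exact (Unit, Geo, Region) match down to the Unit-only match; a small illustration with made-up values:
import pandas as pd

exact = pd.Series([None, 'FG', None])  # match on Unit, Geo, Region
geo   = pd.Series(['DE', 'FG', None])  # match on Unit, Geo
unit  = pd.Series(['DE', 'FG', 'BO'])  # match on Unit only
print(exact.combine_first(geo).combine_first(unit))
# 0    DE
# 1    FG
# 2    BO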
Looks like there's some mismatch in your mapping; however, you can use the update method to handle the empty strings:
import numpy as np

# replace empty strings with NaN
left_df = left_df.replace('', np.nan)
# replace NaN with values from the other dataframe
left_df.update(right_df, overwrite=False)
# merge
df = pd.merge(left_df, right_df, how='right', on=['Unit','Geo','Region'])
Hope this gives you some idea.

Update a value based on another dataframe pairing

I have a problem where I need to update a value if people were at the same table.
import pandas as pd
data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=["p1", "p2", "value"])

meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. So if two people were at the same table in the meeting dataframe, then update the df row accordingly.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about maybe doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
p1 p2 value
0 Jen John 110
1 Mark Jason 120
2 Carrie Rob 140
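Rather than a literal self join, mapping each person to their table does the same thing the question describes; a short sketch:
tables = meeting.set_index('person')['table']

# True where both people in a row sat at the same table
same_table = df['p1'].map(tables) == df['p2'].map(tables)
df.loc[same_table, 'value'] += 100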

Merge two data frames and retain unique columns

I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of df1 as it is cleaner, and retain all other columns. I don't need the city/country info in the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work because you have different kinds of dashes in the two data frames. You could solve this in a similar way (no regex needed this time):
df2.location = df2.location.str.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can use findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. it doesn't find the corresponding values in df2.location to merge to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
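Putting the two fixes together (the dash mismatch and the trailing city/country text), a combined sketch:
df2['location'] = (df2['location']
                   .str.replace('—', '-', regex=False)  # normalize the dash
                   .str.split(',').str[0]               # drop everything after the first comma
                   .str.strip())
df3 = df1.merge(df2, on='location', how='left')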
