How to search a substring from one df in another df? - python

I have read this post and would like to do something similar.
I have 2 dfs:
df1:

file_num  city      address_line
1         Toronto   123 Fake St
2         Montreal  456 Sample Ave
df2:

DB_Num  Address
AB1     Toronto 123 Fake St
AB3     789 Random Drive, Toronto
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match came from.
My ideal output is:

file_num  city     address_line  DB_Num  Address
1         Toronto  123 Fake St   AB1     Toronto 123 Fake St
Based on the linked post above, I have made a lookahead regex, and am searching using insert and str.extract:
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
df = df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ')', expand=False))
Since the addresses in df2 are entered manually, they are sometimes out of order, which is why I am using the lookahead approach. The lookahead pattern causes str.extract to not output any value, although I can still filter out the nulls and keep only the correct matches.
My main problem is I have no way to join back to df1 to get the file_num.
I can do this problem with a for loop and iterating each record to search, but it takes too long. df1 is actually around 5000 records, and df2 has millions, so it takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
Thanks!

Start by creating a new series that gives, for each "Address" in df2, the "address_line" from df1 it corresponds to, if such a row exists:
r = '({})'.format('|'.join(df1.address_line))
merge_df = df2.Address.str.extract(r, expand=False)
merge_df
# output:
0    123 Fake St
1            NaN
Name: Address, dtype: object
Now we merge our df1 on the "address_line" column, and our df2 on our "merge_df" series:
df1.merge(df2, left_on='address_line', right_on=merge_df)
   file_num     City address_line DB_num              Address
0       1.0  Toronto  123 Fake St    AB1  Toronto 123 Fake St
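Putting the two steps together, and also requiring the city to appear somewhere in the free-text Address (which the extract on address_line alone does not check), a minimal sketch assuming the column names above; the helper column name matched_line and the use of re.escape are my additions, not from the original answer:

import re
import pandas as pd

# Build one alternation pattern from df1's address lines (escaped in case they
# contain regex metacharacters), pull the matching address_line out of df2,
# then merge back to df1 to recover file_num.
pat = '({})'.format('|'.join(map(re.escape, df1['address_line'])))
df2['matched_line'] = df2['Address'].str.extract(pat, expand=False)

out = df1.merge(df2, left_on='address_line', right_on='matched_line', how='inner')

# Optionally also require the city to appear in the manually entered Address.
out = out[[city in addr for city, addr in zip(out['city'], out['Address'])]]

Because the pattern is built once and str.extract is vectorized over df2, this avoids the per-row Python loop that was taking hours.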


Update a value based on another dataframe pairing

I have a problem where I need to update a value if people were at the same table.
import pandas as pd

data = {"p1": ['Jen', 'Mark', 'Carrie'],
        "p2": ['John', 'Jason', 'Rob'],
        "value": [10, 20, 40]}
df = pd.DataFrame(data, columns=["p1", 'p2', 'value'])

meeting = {'person': ['Jen', 'Mark', 'Carrie', 'John', 'Jason', 'Rob'],
           'table': [1, 2, 3, 1, 2, 3]}
meeting = pd.DataFrame(meeting, columns=['person', 'table'])
df is a relationship table and value is the field I need to update. If two people were at the same table in the meeting dataframe, then the corresponding row in df should be updated accordingly.
For example: Jen and John were both at table 1, so I need to update the row in df that has Jen and John and set their value to value + 100, i.e. 110.
I thought about maybe doing a self join on meeting to get the format to match that of df, but I'm not sure if this is the easiest or fastest approach.
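A rough, untested sketch of what I mean by that self join (just to illustrate the idea; column suffixes are arbitrary):

# Pair every person with everyone else at the same table,
# then flag df rows whose (p1, p2) is one of those pairs.
pairs = meeting.merge(meeting, on='table', suffixes=('_1', '_2'))
pairs = pairs[pairs['person_1'] != pairs['person_2']]

hit = df.merge(pairs, left_on=['p1', 'p2'], right_on=['person_1', 'person_2'],
               how='left', indicator=True)['_merge'].eq('both')
df.loc[hit.values, 'value'] += 100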
IIUC you could set the person as index in the meeting dataframe, and use its table values to replace the names in df. Then if both mappings have the same value (table), replace with df.value+100:
m = df[['p1','p2']].replace(meeting.set_index('person').table).eval('p1==p2')
df['value'] = df.value.mask(m, df.value+100)
print(df)
       p1     p2  value
0     Jen   John    110
1    Mark  Jason    120
2  Carrie    Rob    140
This could be an approach, using df.to_records():
groups = meeting.groupby('table').agg(set)['person'].to_list()
df['value'] = [row[-1] + 100 if set(list(row)[1:3]) in groups else row[-1]
               for row in df.to_records()]
Output:
df
       p1     p2  value
0     Jen   John    110
1    Mark  Jason    120
2  Carrie    Rob    140

Issues converting wide to long on multiple columns [duplicate]

I've got a dataframe df in Pandas that looks like this:
stores product discount
Westminster 102141 T
Westminster 102142 F
City of London 102141 T
City of London 102142 F
City of London 102143 T
And I'd like to end up with a dataset that looks like this:
stores          product_1  discount_1  product_2  discount_2  product_3  discount_3
Westminster     102141     T           102142     F
City of London  102141     T           102142     F           102143     T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack to reshape; it is only necessary to first create a counter with GroupBy.cumcount, then change the ordering of the second level and flatten the MultiIndex columns with map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print(df)
           stores discount_1  product_1 discount_2  product_2 discount_3  \
0  City of London          T   102141.0          F   102142.0          T
1     Westminster          T   102141.0          F   102142.0        NaN

  product_3
0  102143.0
1       NaN
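If you are on a recent pandas (pivot only accepts a list for values from 1.1 onward), roughly the same reshape can be written with the pivot the question attempted, using the same per-store counter. A sketch starting from the original long-format df, not part of the answer above; note the columns come out grouped by value name (product_1..3, then discount_1..3):

# Number the products within each store, pivot both value columns on that
# counter, then flatten the resulting MultiIndex columns.
out = (df.assign(num=df.groupby('stores').cumcount().add(1))
         .pivot(index='stores', columns='num', values=['product', 'discount']))
out.columns = [f'{col}_{num}' for col, num in out.columns]
out = out.reset_index()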

Merge two data frames and retain unique columns

I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of df1, as it is cleaner, and retain all other columns. I don't need the city/country info in the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide this still won't work, because you have different kinds of dashes in the two data frames. You could solve this in a similar way (no regex needed this time):
df2.location = df2.location.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
We can use findall to create the key:
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on is not the same, i.e. it doesn't find the corresponding values in df2.location to merge to df1. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
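A vectorized variant of the same idea, using the .str accessor instead of apply and also normalizing the dashes as noted in the first answer (a sketch, assuming the same frames):

# Keep only the part before the first comma, unify the dash characters,
# then merge as before.
df2["location"] = (df2["location"].str.split(",").str[0]
                                   .str.replace("—", "-", regex=False))
df3 = pd.merge(df1, df2, on="location", how="left")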

Comparing columns from two data frames

I am relatively new to Python. Suppose I have the following two dataframes, let's say df1 and df2 respectively:
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support

df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of the person (in df2) does not exist in df1, then that row in df2 would be output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
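A quick check with the frames from the question (reconstructed here so the example is runnable):

import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

# Keep only df2 rows whose Name is not present in df1.
res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)
#   Name  Salary Location
# 2   Si     300       UK
# 3  Sue     400   France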

Conditional merge for CSV files using python (pandas)

I am trying to merge >=2 files with the same schema.
The files will contain duplicate entries but rows won't be identical, for example:
file1:
store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111
file2:
store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282
Expected output:
9191,9827 Park st Apt82,999999999
8181,543 Hello st,1111111111
7171,912 John st,87282728282
If you noticed:
9191,9827 Park st,999999999 and 9191,9827 Park st Apt82,999999999 are similar based on store_id and phone, but I picked the one from file2 since the address was more descriptive.
store_id + phone_number is my composite primary key to look up a location and find duplicates (store_id is enough to find it in the above example, but I need a key based on multiple column values).
Question:
- I need to merge multiple CSV files with the same schema but with duplicate rows.
- The row-level merge should have logic to pick a specific value for a column, like phone picked up from file1 and address picked up from file2.
- A combination of one or many column values defines whether rows are duplicates or not.
Can this be achieved using pandas?
One way to smash them together is to use merge on store_id and phone (if these were the index, this would be a join rather than a merge):
In [11]: res = df1.merge(df2, on=['store_id', 'phone'], how='outer')
In [12]: res
Out[12]:
store_id address_x phone address_y
0 9191 9827 Park st 999999999 9827 Park st Apt82
1 8181 543 Hello st 1111111111 NaN
2 7171 NaN 87282728282 912 John st
You can then use where to keep address_y where it is not null, otherwise falling back to address_x:
In [13]: res['address'] = res.address_y.where(res.address_y.notna(), res.address_x)
In [14]: del res['address_x'], res['address_y']
In [15]: res
Out[15]:
store_id phone address
0 9191 999999999 9827 Park st Apt82
1 8181 1111111111 543 Hello st
2 7171 87282728282 912 John st
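An equivalent way to coalesce the two address columns, if you prefer it to where (my alternative, not part of the original answer):

# Take address_y when present, fall back to address_x, then drop the helpers.
res['address'] = res['address_y'].combine_first(res['address_x'])
res = res.drop(columns=['address_x', 'address_y'])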
How about using concat, groupby, and agg? Then you can write an agg function to choose the right value:
import pandas as pd
import io

t1 = """store_id,address,phone
9191,9827 Park st,999999999
8181,543 Hello st,1111111111"""
t2 = """store_id,address,phone
9191,9827 Park st Apt82,999999999
7171,912 John st,87282728282"""

df1 = pd.read_csv(io.StringIO(t1))
df2 = pd.read_csv(io.StringIO(t2))
df = pd.concat([df1, df2]).reset_index(drop=True)

def f(s):
    # within each (store_id, phone) group, keep the longest (most descriptive) value
    loc = s.str.len().idxmax()
    return s[loc]

df.groupby(["store_id", "phone"]).agg(f)
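Note that agg here returns the chosen address indexed by (store_id, phone); if you want the keys back as ordinary columns, reset the index, e.g.:

result = df.groupby(["store_id", "phone"]).agg(f).reset_index()
print(result)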
