Partial word match between two columns of different pandas dataframes - python

I have two data-frames like:
df1:
df2:
I am trying to match any term from df1 against the text in df2.
My code:
import sys,os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import csv
import re
# data
data1 = {'termID': [1,55,341,41,5685], 'term':['Cardic Arrest','Headache','Chest Pain','Muscle Pain', 'Knee Pain']}
data2 = {'textID': [25,12,52,35],
         'text': ['Hello Mike, Good Morning!!',
                  'Oops!! My Knee pains!!',
                  'Stop Music!! my head pains',
                  'Arrest Innocent!!']}
#Dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Matching logic
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        if row_a.term.lower() in row_b.text.lower():
            # print(row_b.text, row_a.term)
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)
This gave me only a single row as output:
I have two issues to fix:
1. I'm not sure how I can get output for any partial match, like this:
2. df1 and df2 have around 0.4M and 13M records respectively.
What are the optimal ways to fix these two issues?

I have a quick fix for problem 1, but not an optimisation.
You only get one match because "Knee Pain" is the only term from df1 that appears in full in any df2 text.
I've modified the if statement to split the text from df2 and check if any of those words appear in the term.
Agree with @jakub that there are libraries that will do this quicker.
# Matching logic
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        if any(word in row_a.term.lower() for word in row_b.text.lower().split()):
            # print(row_b.text, row_a.term)
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)
Output
   textID                        text           term  termID
0      12      Oops!! My Knee pains!!      Knee Pain    5685
1      52  Stop Music!! my head pains       Headache      55
2      35           Arrest Innocent!!  Cardic Arrest       1
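For the scale mentioned in the question, a rough vectorized sketch (not part of the answer above) is to split both the terms and the texts into lowercase words, explode them, and merge on the shared word. Note this matches whole words only, so it is slightly stricter than the substring test above (e.g. "head" would no longer hit "Headache"):
# Sketch only: merge-based word matching instead of the double iterrows() loop
term_words = df1.assign(word=df1['term'].str.lower().str.split()).explode('word')
text_words = df2.assign(word=df2['text'].str.lower().str.split()).explode('word')
matches = (text_words.merge(term_words, on='word')
                     [['textID', 'text', 'term', 'termID']]
                     .drop_duplicates())
print(matches)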

Related

Join Pandas DataFrames on Fuzzy/Approximate Matches for Multiple Columns

I have two Pandas DataFrames that look like this. Trying to join the two data sets on 'Name','Longitude', and 'Latitude' but using a fuzzy/approximate match. Is there a way to join these together using a combination of the 'Name' strings being, for example, at least an 80% match and the 'Latitude' and 'Longitude' columns being the nearest value or within like 0.001 of each other? I tried using pd.merge_asof but couldn't figure out how to make it work. Thank you for the help!
import pandas as pd
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
merge_asof won't work here since it can only merge on a single numeric column, such as datetimelike, integer, or float (see doc).
Here you can compute the (euclidean) distance between the coordinates of df1 and df2 and pick up the best match:
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
data1 = [['Game Time Bar',42.3734,-71.1204,4.5],['Sports Grill',42.3739,-71.1214,4.6],['Sports Grill 2',42.3839,-71.1315,4.3]]
data2 = [['Game Time Sports Bar',42.3738,-71.1207,'$$'],['Sports Bar & Grill',42.3741,-71.1216,'$'],['Sports Grill',42.3841,-71.1316,'$$']]
df1 = pd.DataFrame(data1, columns=['Name', 'Latitude','Longitude','Rating'])
df2 = pd.DataFrame(data2, columns=['Name', 'Latitude','Longitude','Price'])
# Replacing 'Latitude' and 'Longitude' columns with a 'Coord' Tuple
df1['Coord'] = df1[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df1.drop(columns=['Latitude', 'Longitude'], inplace=True)
df2['Coord'] = df2[['Latitude', 'Longitude']].apply(lambda x: (x['Latitude'], x['Longitude']), axis=1)
df2.drop(columns=['Latitude', 'Longitude'], inplace=True)
# Creating a distance matrix between df1['Coord'] and df2['Coord']
distances_df1_df2 = cdist(df1['Coord'].to_list(), df2['Coord'].to_list())
# Creating df1['Price'] column from df2 and the distance matrix
for i in df1.index:
    # you can replace the following lines with a loop over distances_df1_df2[i]
    # and reject names that are too far from each other
    min_dist = np.amin(distances_df1_df2[i])
    if min_dist > 0.001:
        continue
    closest_match = np.argmin(distances_df1_df2[i])
    # df1.loc[i, 'df2_Name'] = df2.loc[closest_match, 'Name']  # keep track of the merged row
    df1.loc[i, 'Price'] = df2.loc[closest_match, 'Price']
print(df1)
Output:
Name Rating Coord Price
0 Game Time Bar 4.5 (42.3734, -71.1204) $$
1 Sports Grill 4.6 (42.3739, -71.1214) $
2 Sports Grill 2 4.3 (42.3839, -71.1315) $$
Edit: your requirement on 'Name' ("at least an 80% match") isn't really appropriate. Take a look at fuzzywuzzy to get a sense of how string distances can be measured.
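As a rough sketch only (assuming the fuzzywuzzy package is installed and reusing df1, df2 and distances_df1_df2 from the code above), the "at least an 80% match" requirement on 'Name' could be layered on top of the distance filter like this:
from fuzzywuzzy import fuzz
# Sketch: accept the closest coordinate match only when the names are also similar.
# The 0.001 and 80 thresholds are illustrative, not definitive.
for i in df1.index:
    j = np.argmin(distances_df1_df2[i])
    if distances_df1_df2[i, j] > 0.001:
        continue
    if fuzz.token_set_ratio(df1.loc[i, 'Name'], df2.loc[j, 'Name']) >= 80:
        df1.loc[i, 'Price'] = df2.loc[j, 'Price']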

How do you remove similar (not duplicated) rows in pandas dataframe using text similarity?

I have thousands of rows of data that may or may not be similar to each other. Using pandas' default drop_duplicates() doesn't really help, since it only detects exact duplicates. For example, what if my data contains something like these:
Hey, good morning!
Hey, good morning.
Python wouldn't detect them as duplicates. There are so many variations of this that simply cleaning the text wouldn't suffice, so I opted for text similarity.
I have tried the following code,
import textdistance
from tqdm import tqdm
tqdm.pandas()
all_sims = []
for id1, text1 in tqdm(enumerate(df1['cleaned'])):
    # start the inner enumerate at id1 so id2 lines up with df1's index
    for id2, text2 in enumerate(df1['cleaned'].iloc[id1:], start=id1):
        if id1 == id2:
            continue
        sim = textdistance.jaro_winkler(text1, text2)
        if sim >= 0.9:
            # print("similarity value: ", sim)
            # print("text 1 >> ", text1)
            # print("text 2 >> ", text2)
            # print("====><====")
            all_sims.append(id1)
Basically I tried to iterate over all the rows in the column and compare each one against the others. If the Jaro-Winkler value detected turns out to be >= 0.9, then the index will be saved to a list.
I will then remove all these similar indices with the following code.
df1[~df1.index.isin(all_sims)]
But my code is really slow and inefficient, and I am not sure if it's the right approach. Do you have any idea to improve this?
You could try this:
import pandas as pd
import textdistance
# Toy dataframe
df = pd.DataFrame(
    {
        "name": [
            "Mulligan Nick",
            "Hitt S C",
            "Breda Joy Mulligen",
            "David James Tsan",
            "Mulligan Nick",
            "Htti S C ",
            "Brenda Joy Mulligan",
            "Dave James Tsan",
        ],
    }
)
# Calculate similarities between rows
# and save corresponding indexes in a new column "match"
df["match"] = df["name"].map(
lambda x: [
i
for i, text in enumerate(df["name"])
if textdistance.jaro_winkler(x, text) >= 0.9
]
)
# Iterate to remove similar rows (keeping only the first one)
indices = []
for i, row in df.iterrows():
    indices.append(i)
    df = df.drop(
        index=[item for item in row["match"] if item not in indices], errors="ignore"
    )
# Clean up
df = df.drop(columns="match")
print(df)
# Outputs
name
0 Mulligan Nick
1 Hitt S C
2 Breda Joy Mulligen
3 David James Tsan
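A possible follow-up sketch (assuming df is re-created as the eight-row toy frame above, before any rows are dropped): compare each pair only once and skip rows already marked for removal, instead of re-scanning the whole column for every row:
from itertools import combinations
import textdistance

# Sketch only (assumes the original 8-row toy df): keep the earliest occurrence,
# drop later rows that are >= 0.9 similar to a kept one
to_drop = set()
names = df["name"].tolist()
for i, j in combinations(range(len(names)), 2):
    if i in to_drop or j in to_drop:
        continue
    if textdistance.jaro_winkler(names[i], names[j]) >= 0.9:
        to_drop.add(j)
print(df.drop(index=list(to_drop)))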

Calculating most frequently occurring row-specific combinations among dataframe in Python

I have a dataframe that contains text separated by a comma
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to have an output that prints the top 2 most common combinations of 2 letters + the # of occurrences among the entire dataframe. So based on the above dataframe the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes, there are more combos that occur 2 times, but we can cut it off here to keep it simple.) I am really stumped on how to even start this. I was thinking of maybe looping through each row and somehow storing all combinations, and at the end we can print out the top n combinations and how many times they occurred in the dataframe.
Below is what I have so far according to what I have in mind.
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
for index, row in df.iterrows():
    pass  # somehow get and store all possible 2-letter combos?
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
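Note that the snippet above only counts adjacent letter pairs within each row. If every unordered 2-letter combination per row should be counted, as the question describes, a short sketch (not from the answer above) using itertools.combinations and collections.Counter might look like this:
from collections import Counter
from itertools import combinations
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("Date\na,b,c,d\na,b,e,f\na,b,e,f\n"), sep=';')
counts = Counter(pair
                 for letters in df['Date'].str.split(',')
                 for pair in combinations(letters, 2))
print(counts.most_common(2))  # [(('a', 'b'), 3), (('a', 'e'), 2)] -- several pairs tie at 2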

Is there a way to dynamically find partial matching numbers between columns in pandas dataframes?

I'm looking for a way of comparing partial numeric values between columns from different dataframes. These columns are filled with something like social security numbers (they can't and won't repeat), so something like a dynamic isin() would be ideal.
These are representations of very large dataframes that I import from csv files.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
print(df1)
print(df2)
df2['Id_number_length']= df2['Id_number'].str.len()
df2.groupby('Id_number_length').count()
count_list = df2.groupby('Id_number_length')[['Id_number_length']].count()
print('count_list:\n', count_list)
df1 ['S_number'] = pd.to_numeric(df1['S_number'], downcast = 'integer')
df2['Id_number'] = pd.to_numeric(df2['Id_number'], downcast = 'integer')
inner_join = pd.merge(df1, df2, left_on =['S_number'], right_on = ['Id_number'] , how ='inner')
print('MATCH!:\n', inner_join)
outer_join = pd.merge(df1, df2, left_on =['S_number'], right_on = ['Id_number'] , how ='outer', indicator = True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
print('UNMATCHED:\n', anti_join)
What I need to get is something like the following as the result of the inner join (or whatever method works):
df3 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"],
                    "Id_number": ["027160", "60078", "342964", "763", "1544", "5303", "973637", "14452", "9930", "4205"]})
print('MATCH!:\n', df3)
I thought that something like this (very crude) pseudocode would work, using count_list to strip parts of the numbers in df1 so they fully match df2 instead of partially matching (notice that in df2 the missing or added digits are always at the beginning or the end):
for i in count_list:
    if i == 6:
        try inner join
        except empty output
    elif i == 5:
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[1:]
            inner join with df2
        except empty output
        try
            df1.loc[:, 'S_number'] = df_ib_c.loc[:, 'S_number'].str[:-1]
            inner join with df2
    elif i == 4:
        same as above...
But the lengths in count_list are variable, so this loop is an inefficient way to do it.
Any help with this will be much appreciated; I've been stuck on this for days. Thanks in advance.
You can 'explode' each line of df1 into up to 45 lines. For example, SSN 123456789 can be mapped to [1,2,3...9,12,23,34,45..89,...12345678,23456789,123456789]. While this looks bad, from an algorithmic standpoint it is O(1) for each row and therefore O(N) in total.
Using this new column as the key, a simple merge can combine the two DataFrames easily, which is usually O(N log N).
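A rough sketch of that idea (illustrative only, with a trimmed-down sample; it covers the case where Id_number is a piece of S_number, and the other direction would explode df2 instead):
import pandas as pd

df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "205303"]})
df2 = pd.DataFrame({"Id_number": ["60078", "342964", "5303"]})

def substrings(s):
    # every contiguous substring of s becomes a candidate merge key
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

exploded = df1.assign(key=df1["S_number"].map(substrings)).explode("key")
matches = exploded.merge(df2, left_on="key", right_on="Id_number")[["S_number", "Id_number"]]
print(matches)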
Here is an example of what I would do. I hope I've understood correctly. Feel free to ask if it's not clear.
import pandas as pd
import joblib
from joblib import Parallel,delayed
# Building the base
df1 = pd.DataFrame({"S_number": ["271600", "860078", "342964", "763261", "215446", "205303", "973637", "814452", "399304", "404205"]})
df2 = pd.DataFrame({"Id_number": ["14452", "9930", "1544", "5303", "973637", "4205", "0271600", "342964", "763", "60078"]})
# Initiate empty list for indexes
IDX = []
# Using a function so it can be parallelized if the database is big
def func(x, y):
    if all(c in df2.Id_number[y] for c in df1.S_number[x]):
        return (x, y)

# using the max number of processors
number_of_cpu = joblib.cpu_count()
# Preparing a delayed function to be parallelized
delayed_funcs = (delayed(func)(x, y) for x in range(len(df1)) for y in range(len(df2)))
# fitting it with processes and not threads
parallel_pool = Parallel(n_jobs=number_of_cpu, prefer="processes")
# Filling the IDX list
IDX.append(parallel_pool(delayed_funcs))
# Dropping the Nones
IDX = list(filter(None, IDX[0]))
# Making df3 with the tuples of indexes
df3 = pd.DataFrame(IDX)
# Making it readable
df3['df1'] = df1.S_number[df3[0]].to_list()
df3['df2'] = df2.Id_number[df3[1]].to_list()
df3
OUTPUT :

How to Copy the Matching Columns between CSV Files Using Pandas?

I have two dataframes(f1_df and f2_df):
f1_df looks like:
ID,Name,Gender
1,Smith,M
2,John,M
f2_df looks like:
name,gender,city,id
Problem:
I want the code to compare the headers of f1_df and f2_df by itself and copy the data of the matching columns using pandas.
Output:
the output should be like this:
name,gender,city,id # name,gender,and id are the only matching columns btw f1_df and f2_df
Smith,M, ,1 # the data copied for name, gender, and id columns
John,M, ,2
I am new to pandas and not sure how to handle this problem. I have tried to do an inner join on the matching columns, but that did not work.
Here is what I have so far:
import pandas as pd
f1_df = pd.read_csv("file1.csv")
f2_df = pd.read_csv("file2.csv")
for i in f1_df:
    for j in f2_df:
        i = i.lower()
        if i == j:
            joined = f1_df.join(f2_df)

print(joined)
Any idea how to solve this?
Try this if you want to merge/join your DFs on common columns.
First, let's convert all column names to lower case:
df1.columns = df1.columns.str.lower()
df2.columns = df2.columns.str.lower()
Now we can join on the common columns:
common_cols = df2.columns.intersection(df1.columns).tolist()
joined = df1.set_index(common_cols).join(df2.set_index(common_cols)).reset_index()
Output:
In [259]: joined
Out[259]:
id name gender city
0 1 Smith M NaN
1 2 John M NaN
export to CSV:
In [262]: joined.to_csv('c:/temp/joined.csv', index=False)
c:/temp/joined.csv:
id,name,gender,city
1,Smith,M,
2,John,M,
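If the output must follow f2_df's column order (name,gender,city,id) as in the question, one small follow-up sketch (with an illustrative file name) is to reorder the joined frame before exporting:
# reorder to df2's (lower-cased) column order, then write out
joined[df2.columns.tolist()].to_csv('joined_ordered.csv', index=False)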
