Background
I have 2 data frames that have no common key on which I can merge them. Both dataframes have a column that contains an "entity name". One df contains 8000+ entities and the other close to 2000 entities.
Sample Data:
vendor_df=
Name of Vendor                        City       State  ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE   St. Louis  MO     63101
CITYARCHRIVER 2015 FOUNDATION         St. Louis  MO     63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE   St. Louis  MO     63102
LACKEY SHEET METAL                    St. Louis  MO     63102
regulator_df =
Name of Entity                 Committees
LACKEY SHEET METAL             Private
PRIMUS STERILIZER COMPANY LLC  Private
HELGET GAS PRODUCTS INC        Autonomous
ORTHOQUEST LLC                 Government
Problem Statement:
I have to fuzzy match the entities in these two columns (Name of Vendor & Name of Entity) and get a score. So I need to know whether the first value of dataframe 1 (vendor_df) matches any of the 2000 entities of dataframe 2 (regulator_df).
StackOverflow Links I checked:
fuzzy match between 2 columns (Python)
create new column in dataframe using fuzzywuzzy
Apply fuzzy matching across a dataframe column and save results in a new column
Code
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)
regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

# compare.apply(metrics)  -- either this works, or the line below
result = compare.apply(metrics).unstack().idxmax().unstack(0)
Problems with Above Code:
The code works if the two dataframes are small, but it takes forever when I run it on the complete dataset. The code above is taken from the third link.
Is there a solution so that the same thing can run fast, or can work with the large dataset?
UPDATE 1
Can the above code be made faster if we pass or hard-code a score, say 80, so that only rows of the series/dataframe with a fuzzy score > 80 are kept?
The solution below is faster than what I posted, but if someone has an even faster approach, please share it:
matched_vendors = []
for row in vendor_df.index:
    vendor_name = vendor_df.at[row, "Name of vendor"]   # .at replaces the long-deprecated get_value
    for col in regulator_df.index:
        regulated_vendor_name = regulator_df.at[col, "Name of Entity"]
        matched_token = fuzz.partial_ratio(vendor_name, regulated_vendor_name)
        if matched_token > 80:
            matched_vendors.append([vendor_name, regulated_vendor_name, matched_token])
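For reference, the same cutoff idea can be run in one vectorized pass. The sketch below uses RapidFuzz's process.cdist; RapidFuzz is not used anywhere in the post above, so treat the library and its parameters as an assumption rather than part of the original code:

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

vendors = vendor_df['Name of vendor'].tolist()
entities = regulator_df['Name of Entity'].tolist()

# score every vendor against every regulated entity in one multi-threaded call
scores = process.cdist(vendors, entities, scorer=fuzz.partial_ratio, workers=-1)

rows, cols = np.where(scores > 80)          # indices of the pairs scoring above 80
matched_pairs = pd.DataFrame({
    'vendor': [vendors[i] for i in rows],
    'regulated_vendor': [entities[j] for j in cols],
    'score': scores[rows, cols]})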
I've implemented the code in Python with parallel processing, which is much faster than serial computation. Furthermore, only the computations where a fuzzy metric score exceeds a threshold are performed in parallel. Please see the link below for the code:
https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py
Version Compatibility:
pandas version :: 1.1.5,
numpy version :: 1.19.5,
fuzzywuzzy version :: 1.1.0,
joblib version :: 0.18.0
In my case I also need to look only at scores above 80. I modified your code for my use case; hope it helps.
compare = compare.apply(metrics)
compare_80 = compare[(compare['ratio'] > 80) & (compare['token'] > 80)]
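As a small follow-up sketch (assuming compare was built with pd.MultiIndex.from_product exactly as in the question), the filtered scores can be flattened back into a table of candidate pairs:

# the two index levels carry the vendor and entity names
matches = compare_80.reset_index()
matches.columns = ['Name of vendor', 'Name of Entity', 'ratio', 'token']
print(matches.head())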
Related
Details about the goal: I'm learning basic ML and I'm tasked with finding the best match between some raw city names and some normalized city names.
Expected result: The idea is to find items that are similar according to the Levenshtein distance, and to output the best match in the rightmost column of the raw data df.
What I did: Originally, I made a nested loop that compares each raw row with the 36k normalized rows, outputs the smallest distance and its index, and stores that in the rightmost column. I quickly concluded that this is not best practice, because you're not supposed to loop over a pandas df, and 10000 x 36k comparisons were just way too much. After some searching, I found the following code, which is supposed to work properly:
rawdata["Best match"]=rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
Sadly, this algorithm has been running for an hour on my computer, so I don't think it does the job either. What would you all recommend to do this more quickly?
Thank you for any time you would spend on this.
#import libraries
import pandas as pd
import sys
!pip install fuzzywuzzy
!pip install python-Levenshtein
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def import_data(file):
    return pd.read_csv(file, header=0, dtype=str)
rawdata = import_data("raw_cities.csv")
rawdata['city']=rawdata['city'].map(str)
normadata = import_data("normalized_cities.csv")
normadata['city']=normadata['city'].map(str)
rawdata["Best match"]=rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
#my tables look as such
city
0 cleron
1 aveillans
2 paray-vieille-poste
3 issac
4 rians
9995 neuville les dieppe
9996 saint andre de vezines
9997 saint-germain-de-lusignan
9998 bergues-sur-sambre
9999 santa-maria-figaniella
[10000 rows x 1 columns]
city
0 abergement clemenciat
1 abergement de varey
2 amberieu en bugey
3 amberieux en dombes
4 ambleon
35352 m'tsangamouji
35353 ouangani
35354 pamandzi
35355 sada
35356 tsingoni
[35357 rows x 1 columns]
Python 3.9.5/Pandas 1.1.3
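One commonly suggested speed-up is to swap FuzzyWuzzy for RapidFuzz, whose scorers are implemented in C++. A minimal sketch of the same lookup, assuming rapidfuzz is installed (the post itself uses FuzzyWuzzy, so this is an alternative rather than the original approach):

from rapidfuzz import process, fuzz

choices = normadata['city'].tolist()

# extractOne returns a (match, score, index) tuple; [0] keeps just the matched name
rawdata['Best match'] = rawdata['city'].map(
    lambda x: process.extractOne(x, choices, scorer=fuzz.WRatio)[0])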
I have a very large csv file with values that look like:
Ac\\Nme Products Inc.
and all the values are different company names with double backslashes in random places throughout.
I'm attempting to get rid of all the double backslashes. It's not working in Pandas, but a simple test against the standalone value using str.replace does work.
Example:
org = "Ac\\Nme Products Inc."
result = org.replace("\\","")
print(result)
returns AcNme Products Inc. as the output, as I would expect.
However, using Pandas with the names in a csv file:
import pandas as pd
csv_input = pd.read_csv('/Users/me/file.csv')
csv_input.replace("\\", "")
csv_input.to_csv('/Users/me/file_revised.csv', index=False)
When I open the new file_revised.csv file, the value still shows as Ac\\Nme Products Inc.
EDIT 1:
Here is a snippet of file.csv as requested:
id,company_name,address,country
1000566,A1 Comm\\Nodity Traders,LEVEL 28 THREE PACIFIC PLACE 1 QUEEN'S RD EAST HK,TH
1000579,"A2 A Mf\\g. Co., Ltd.",53 YONG-AN 2ND ST. TAINAN TAIWAN,CA
1000585,"A2 Z Logisitcs Indi\\Na Pvt., Ltd.",114A/1 1ST FLOOR SOUTH RAJA ST TUTICORIN - 628 001 TAMILNADU - INDIA,PE
Pandas doesn't expose the .str string methods at the DataFrame level, but the frame can be updated per column:
for col in csv_input.columns:
    if col == 'that_int_column':
        continue
    csv_input[col] = csv_input[col].str.replace(r"\\N", "")
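A frame-wide alternative is DataFrame.replace with regex=True, which applies the substitution across every string column at once. A sketch only; whether to strip the whole "\\N" or just the backslashes depends on what the raw file actually contains:

import pandas as pd

csv_input = pd.read_csv('/Users/me/file.csv')
# regex=True makes replace operate on substrings instead of whole cell values
csv_input = csv_input.replace(r'\\N', '', regex=True)
csv_input.to_csv('/Users/me/file_revised.csv', index=False)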
I appreciate the help in advance!
The question may seem weird at first so let me illustrate what I am trying to accomplish:
I have this df of cities and abbreviations:
I need to add another column called 'Queries' and those queries are on a list as follows:
queries = ['Document Management','Document Imaging','Imaging Services']
The trick though is that I need to duplicate my df rows for each query in the list. For instance, for row 0 I have PHOENIX, AZ. I now need 3 rows saying PHOENIX, AZ, 'query[n]'.
Something that would look like this:
Of course I created that manually but I need to scale it for a large number of cities and a large list of queries.
This sounds simple, but I've been trying for some hours now and I don't see how to engineer any code for it. Again, thanks for the help!
Here is one way, using .explode():
import pandas as pd
df = pd.DataFrame({'City_Name': ['Phoenix', 'Tucson', 'Mesa', 'Los Angeles'],
                   'State': ['AZ', 'AZ', 'AZ', 'CA']})
# 'Query' is a column of tuples
df['Query'] = [('Doc Mgmt', 'Imaging', 'Services')] * len(df.index)
# ... and explode 'unpacks' the tuples, putting one item on each line
df = df.explode('Query')
print(df)
City_Name State Query
0 Phoenix AZ Doc Mgmt
0 Phoenix AZ Imaging
0 Phoenix AZ Services
1 Tucson AZ Doc Mgmt
1 Tucson AZ Imaging
1 Tucson AZ Services
2 Mesa AZ Doc Mgmt
2 Mesa AZ Imaging
2 Mesa AZ Services
3 Los Angeles CA Doc Mgmt
3 Los Angeles CA Imaging
3 Los Angeles CA Services
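If the repeated index values (0, 0, 0, 1, ...) are unwanted, explode can renumber the rows; the ignore_index flag is available from pandas 1.1 onward:

# same unpacking, but with a fresh 0..n-1 index
df = df.explode('Query', ignore_index=True)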
You should definitely go with jsmart's answer, but posting this as an exercise.
This can also be achieved by exporting the original cities/towns dataframe (df) to a list of records, manually duplicating each one for each query, then reconstructing the final dataframe.
The entire thing can fit in a single line, and is even relatively readable if you can follow what's going on ;)
pd.DataFrame([{**record, 'query': query}
              for query in queries
              for record in df.to_dict(orient='records')])
I'm new to Python myself, but I would get around it by creating n (n = number of unique query values) identical data frames without "Query". Then, for each data frame, create a new column with one of the "Query" values. Finally, stack all the data frames together using append. A short example:
adf1 = pd.DataFrame([['city1', 'state1'], ['city2', 'state2']])
adf2 = adf1.copy()                # .copy() so the two frames get independent 'query' columns
adf1['query'] = 'doc management'
adf2['query'] = 'doc imaging'
df = adf1.append(adf2)
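Note that DataFrame.append was removed in pandas 2.0, so on newer versions the last step would use pd.concat instead (a one-line adjustment, not part of the original answer):

df = pd.concat([adf1, adf2], ignore_index=True)   # same stacking as append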
Another method if there are many types of queries.
Create a dummy column, say 'key', in both the original data frame and the query data frame, and merge the two on 'key'.
adf = pd.DataFrame([['city1','state1'],['city2','state2']])
q = pd.DataFrame([['doc management'],['doc imaging']])
adf['key'] = 'key'
q['key'] = 'key'
df = pd.merge(adf, q, on='key', how='outer')
More advanced users should have better ways. This is a temporary solution if you are in a hurry.
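On pandas 1.2+ the same pairing can also be written as a cross join directly, which avoids the dummy column entirely (a sketch along the same lines as the answer above):

adf = pd.DataFrame([['city1', 'state1'], ['city2', 'state2']], columns=['City_Name', 'State'])
q = pd.DataFrame({'query': ['doc management', 'doc imaging']})

df = adf.merge(q, how='cross')   # every city/state row paired with every query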
I am facing a problem applying fuzzy logic for data cleansing in Python. My data looks something like this:
data=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "Count":['140','120','50','45','30','20','10','5']})
data
I am using fuzzy logic to compare the values in the data frame. The final output should have a third column with a result like this:
data_out=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "New_Column":["Deloitte",'Accenture','Accenture','Accenture','Ernst & young','Ernst & young','Tata Consultancy Services','Deloitte']})
data_out
So, as you can see, I want the less frequent values to get an entry in a new column containing the most frequent value of their group. That is where fuzzy logic is helpful.
Most of your duplicate companies can be detected quite easily using fuzzy string matching. However, the replacement Ernst & young <-> EY is not really similar at all, which is why I am going to ignore that replacement here. This solution uses my library RapidFuzz, but you could implement something similar using FuzzyWuzzy as well (with a little more code, since it does not have the extractIndices processor).
import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_colum(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]

    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
                           for employer in employers]

    deduped_employers = employers.copy()
    replaced = []

    for (i, (employer, processed_employer)) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue

        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)

        for (c, _) in duplicates:
            deduped_employers[i+c+1] = employer
            """
            by replacing the element with an empty string the index from
            extractIndices stays correct but it can be skipped a lot
            faster, since the compared strings will have very different
            lengths
            """
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)

    data['New_Column'] = deduped_employers

data = pd.DataFrame({
    'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd', 'Accenture USA',
                 'Ernst & young', ' EY', 'Tata Consultancy Services', 'Deloitte Uk'],
    "Count": ['140', '120', '50', '45', '30', '20', '10', '5']})

add_deduped_employer_colum(data)
print(data)
which results in the following dataframe:
Employer Count New_Column
0 Deloitte 140 Deloitte
1 Accenture 120 Accenture
2 Accenture Solutions Ltd 50 Accenture
3 Accenture USA 45 Accenture
4 Ernst & young 30 Ernst & young
5 EY 20 EY
6 Tata Consultancy Services 10 Tata Consultancy Services
7 Deloitte Uk 5 Deloitte
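Note that extractIndices has since been removed from RapidFuzz; in recent versions (2.x and later) process.extract returns (match, score, index) tuples instead. A small self-contained sketch of the equivalent lookup (the WRatio scorer here is an assumption, not taken from the answer above):

from rapidfuzz import process, fuzz, utils

employers = ['Accenture', 'Accenture Solutions Ltd', 'Accenture USA', 'Deloitte Uk']
processed = [utils.default_process(e) for e in employers]

# each result is a (match, score, index) tuple in RapidFuzz 2.x+
matches = process.extract(processed[0], processed[1:], scorer=fuzz.WRatio,
                          processor=None, score_cutoff=90, limit=None)
for match, score, index in matches:
    print(employers[index + 1], score)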
I have not used fuzzy matching, but I can assist as follows.
Data
df=pd.DataFrame({'Employer':['Accenture','Accenture Solutions Ltd','Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df
You did not explain why Tata keeps its full name, hence I assume it is special and mask it.
m=df.Employer.str.contains('Tata')
I then use np.where to strip everything after the first word for the rest:
import numpy as np

# regex=True is needed explicitly on pandas >= 2.0, where plain-string replacement became the default
df['New_Column'] = np.where(m, df['Employer'], df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))
df
Output
I'm new to Python and I'm trying to analyse this CSV file. It has a lot of different countries (as an example below).
country iso2 iso3 iso_numeric g_whoregion year e_pop_num e_inc_100k e_inc_100k_lo
Afghanistan AF AFG 4 EMR 2000 20093756 190 123
American Samoa AS ASM 16 WPR 2003 59117 5.8 5 6.7 3 3 4
Gambia GM GMB 270 AFR 2010 1692149 178 115 254 3000 1900 4300
I want to try to obtain only specific data, so only specific countries and only specific columns (like "e_pop_num"). How would I go about doing that?
The only basic code I have is:
import csv
import itertools

f = csv.reader(open('TB_burden_countries_2018-03-06.csv'))
for row in itertools.islice(f, 0, 10):
    print(row)
Which just lets me choose specific rows I want, but not necessarily the country I want to look at, or the specific columns I want.
If you can help me or point me to a guide so I can do my own learning, I'd very much appreciate that! Thank you.
I recommend you use the pandas Python library. Please follow the article linked below; here is a code snippet to illuminate your thoughts.
import pandas as pd

df1 = pd.read_csv("https://pythonhow.com/wp-content/uploads/2016/01/Income_data.csv")
# label-based slicing of rows and columns (assumes the state names are set as the index)
df1.loc["Alaska":"Arkansas", "2005":"2007"]
source of this information: https://pythonhow.com/accessing-dataframe-columns-rows-and-cells/
Pandas will probably be the easiest way. https://pandas.pydata.org/pandas-docs/stable/
To get it run
pip install pandas
Then read the csv into a dataframe and filter it
import pandas as pd

df = pd.read_csv('TB_burden_countries_2018-03-06.csv')
df = df[df['country'] == 'Gambia']
print(df)
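To also keep only specific columns after filtering the rows, column selection can be chained on (a short follow-up sketch using column names from the sample header in the question):

import pandas as pd

df = pd.read_csv('TB_burden_countries_2018-03-06.csv')
# rows for one country, then only the columns of interest
subset = df[df['country'] == 'Gambia'][['country', 'year', 'e_pop_num']]
print(subset)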
with open('file') as f:
    fields = f.readline().split("\t")
    print(fields)
If you supply more details about what you want to see, the answer would differ.
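Staying with the csv module from the question, csv.DictReader gives access to each row by column name, which also covers the "specific country, specific columns" case without pandas (a sketch using the filename and column names shown in the question):

import csv

with open('TB_burden_countries_2018-03-06.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row['country'] == 'Gambia':
            print(row['country'], row['year'], row['e_pop_num'])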