Rapidfuzz match merge - python

Very new to this, would appreciate any advice on the following:
I have a dataset 'Projects' showing list of institutions with project IDs:
project_id institution_name
0 somali national university
1 aarhus university
2 bath spa
3 aa school of architecture
4 actionaid uk
I would like to fuzzy match merge this with the following dataset of 'Universities' and their country codes:
institution_name country_code
a tan kapuja buddhista foiskola HU
aa school of architecture UK
bath spa university UK
aalto-yliopisto FI
aarhus universitet DK
And get back this:
project_id institution_name Match organisation country_code
0 somali national university [] NaN NaN
1 aarhus university [(91)] aarhus universitet DK
2 bath spa [(90)] bath spa university UK
3 aa school of architecture [(100)] aa school of architecture UK
4 actionaid uk [] NaN NaN
Using rapidfuzz:
import pandas as pd
import numpy as np
from rapidfuzz import process, utils as fuzz_utils
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    # baseFrame: the left table to join
    # compareFrame: the right table to join
    # baseKey: key column of the left table
    # compareKey: key column of the right table
    # threshold: how close the matches should be to return a match, based on Levenshtein distance
    # limit: the number of matches that will get returned, sorted from high to low
    # return: dataframe with both keys and matches
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
        fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1
    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisation'] = m2
    return baseFrame.merge(compareFrame, on=baseKey, how=how)
Merged = fuzzy_merge(Projects, Universities, 'institution_name', 'institution_name')
Merged
I got this (with some extra text in the match column but won't go into that now). It's nearly what I want, but the country code only matches up when it's a 100% match:
project_id institution_name Match organisation country_code
0 somali national university [] NaN NaN
1 aarhus university [(91)] aarhus universitet NaN
2 bath spa [(90)] bath spa university NaN
3 aa school of architecture [(100)] aa school of architecture UK
4 actionaid uk [] NaN NaN
I reckon this is an issue with how I'm comparing my baseKey to the compareFrame to create my merged dataset. I can't work out how to merge on 'organisation' instead, though - attempts to plug it in result in varying errors.

Never mind, figured it out - I didn't account for the empty matches! Replacing the empty strings with NaN, and merging on the matched 'organisations' column instead of the base key, worked out perfectly.
def fuzzy_merge(baseFrame, compareFrame, baseKey, compareKey, threshold=90, limit=1, how='left'):
    s_mapping = {x: fuzz_utils.default_process(x) for x in compareFrame[compareKey]}
    m1 = baseFrame[baseKey].apply(lambda x: process.extract(
        fuzz_utils.default_process(x), s_mapping, limit=limit, score_cutoff=threshold, processor=None
    ))
    baseFrame['Match'] = m1
    m2 = baseFrame['Match'].apply(lambda x: ', '.join(i[2] for i in x))
    baseFrame['organisations'] = m2.replace("", np.nan)
    return baseFrame.merge(compareFrame, left_on='organisations', right_on=compareKey, how=how)

Related

isin only returning first line from csv

I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
Where am I going wrong here?
You need to convert the CSV file to a Series. Here a column name is added, and then that column is selected as a Series (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
And then remove possible trailing spaces in both with Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
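To see why the original call matched only one row: iterating a DataFrame yields its column labels, and without `names=` the first CSV row becomes the header. A small sketch of the difference, with the file contents inlined via `io.StringIO`:

```python
import io
import pandas as pd

csv_text = "AKAM\nAKRO\n"

# Without names=, the first row becomes the header, so the frame holds one
# column labelled "AKAM" -- and iterating a DataFrame yields its column labels,
# which is what isin() receives.
as_frame = pd.read_csv(io.StringIO(csv_text))
assert list(as_frame) == ['AKAM']

# With names=, every row is data, and selecting the column gives a Series
as_series = pd.read_csv(io.StringIO(csv_text), names=['tmp'])['tmp']
assert list(as_series) == ['AKAM', 'AKRO']
```

So `isin()` was effectively testing membership against the single column label "AKAM", which is why only that symbol survived the filter.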

return gender by the country from my dataframe

I have a dataframe as follow:
name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1
I want to assign the gender based on the country code 'us'. If the name is the same, then all rows for that name should get the gender assigned in the 'us' row. For a name that has no duplicates, we return the same row.
The returned result should be:
name code gender
wang ca 1
jay us 1
lisa us 1
I used
df.groupby(['name', 'country'])['gender'].transform()
Any suggestions on how to fix this?
# Get country and gender as separate lists for each name
a = df.groupby('name')['country'].apply(list).reset_index(name='country_list')
b = df.groupby('name')['gender'].apply(list).reset_index(name='gender_list')
# Merge
df2 = a.merge(b, on='name', how='left')
# Using apply, get the final required values
def get_val(x):
    cl, gl = x
    # default to the first row; overwrite if a 'us' row exists
    final = [cl[0], gl[0]]
    for c, g in zip(cl, gl):
        if c == 'us':
            final = [c, g]
    return final
df2['final_col'] = df2[['country_list', 'gender_list']].apply(get_val, axis=1)
df2['code'] = df2['final_col'].apply(lambda l: l[0])
df2['gender'] = df2['final_col'].apply(lambda l: l[1])
print(df2)
The approach I've used is a merge() followed by a conditional replace (np.where())
It's a bit more sophisticated, but it will also work for conditions not in your sample data.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1"""), sep=r"\s+")

# use "us" as the basis for the lookup; left merge on name only
df2 = (df.merge(df.query("country=='us'"),
                on=["name"], how="left", suffixes=("", "_new"))
       # replace only where it's not "us" and "us" has a different value
       .assign(gender=lambda x: np.where((x["country"] != "us") &
                                         (x["gender"] != x["gender_new"]) &
                                         ~(x["gender_new"].isna()),
                                         # force type casting so it doesn't become float64 because of NaN
                                         x["gender_new"].fillna(-1).astype("int64"),
                                         x["gender"]))
       # remove the columns inserted by the merge
       .drop(columns=["country_new", "gender_new"]))
output
name country gender
wang ca 1
jay us 1
jay ca 1
jay ca 1
lisa en 1
lisa us 1
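If the goal is simply "keep the 'us' row for a name when one exists, otherwise keep the first row", a shorter alternative is to sort each name's 'us' row to the front and drop duplicates. A sketch of that reading of the requirement, not the answer's own code (`sort_values` with `key=` needs pandas >= 1.1):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""name country gender
wang ca 1
jay us 1
jay ca 0
jay ca 1
lisa en 0
lisa us 1"""), sep=r"\s+")

# Sort so each name's 'us' row (if any) comes first; mergesort is stable,
# so ties keep their original order. Then keep one row per name.
out = (df.sort_values('country', key=lambda s: s != 'us', kind='mergesort')
         .drop_duplicates('name')
         .sort_index()
         .reset_index(drop=True))
print(out)
```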

How to group categories in a variable using numpy and dictionary

I want to group multiple categories in a pandas column using numpy.where and a dictionary.
Currently I am doing this with just numpy.where, which bloats my code considerably when there are many categories. I want to create a map using a dictionary and then feed that map to numpy.where.
Sample Data frame:
dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
'DIRECTOR','MANAGER',np.nan]})
dataF
TITLE
0 CEO
1 CHIEF EXECUTIVE
2 EXECUTIVE OFFICER
3 FOUNDER
4 CHIEF OP
5 TECH OFFICER
6 CHIEF TECH
7 VICE PRES
8 PRESIDENT
9 PRESIDANTE
10 OWNER
11 CO OWNER
12 DIRECTOR
13 MANAGER
14 NaN
Numpy operation
dataF['TITLE_GRP'] = np.where(dataF['TITLE'].isna(),'NOTAVAILABLE',
np.where(dataF['TITLE'].str.contains('CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN'),'CEO_FOUNDER',
np.where(dataF['TITLE'].str.contains('CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$'),'OTHER_OFFICERS',
np.where(dataF['TITLE'].str.contains('VICE|VP'),'VP',
np.where(dataF['TITLE'].str.contains('PRESIDENT|PRES'),'PRESIDENT',
np.where(dataF['TITLE'].str.contains('OWNER'),'OWNER_CO_OWN',
np.where(dataF['TITLE'].str.contains('MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'),'DIR_MGR_HEAD'
,dataF['TITLE'])))))))
Transformed Data
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
What I want to do is create some mapping like below:
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
And then feed it to some function which applies the stepwise numpy operation and gives me the same result as above.
I am doing this because I have to parameterize my code in such a way that all parameters for data manipulation are provided from a JSON file.
I tried pandas.replace, since it accepts a dictionary, but it doesn't preserve the hierarchical structure enforced by the nested np.where; it's also unable to replace the whole title, as it only replaces the substring where it finds a match.
In case you are able to provide a solution for the above, I would also like to know how to solve the following 2 other scenarios:
This scenario uses the .isin operation instead of a regex:
dataF['INDUSTRY'] = np.where(dataF['INDUSTRY'].isin(['AEROSPACE','AGRICULTURE/MINING','EDUCATION','ENERGY']),'AER_AGR_MIN_EDU_ENER',
np.where(dataF['INDUSTRY'].isin(['TRAVEL','INSURANCE','GOVERNMENT','FINANCIAL SERVICES','AUTO','PHARMACEUTICALS']),'TRA_INS_GOVT_FIN_AUT_PHAR',
np.where(dataF['INDUSTRY'].isin(['BUSINESS GOODS/SERVICES','CHEMICALS ','TELECOM','TRANSPORTATION']),'BS_CHEM_TELE_TRANSP',
np.where(dataF['INDUSTRY'].isin(['CONSUMER GOODS','ENTERTAINMENT','FOOD AND BEVERAGE','HEALTHCARE','INDUSTRIAL/MANUFACTURING','TECHNOLOGY']),'CG_ENTER_FB_HLTH_IND_TECH',
np.where(dataF['INDUSTRY'].isin(['ADVERTISING','ASSOCIATION','CONSULTING/ACCOUNTING','PUBLISHING/MEDIA','TECHNOLOGY']),'ADV_ASS_CONS_ACC_PUBL_MED_TECH',
np.where(dataF['INDUSTRY'].isin(['RESTAURANT','SOFTWARE']),'REST_SOFT',
'NOTAVAILABLE'))))))
This scenario uses the .between operation:
dataF['annual_revn'] = np.where(dataF['annual_revn'].between(1000000,10000000),'1_10_MILLION',
np.where(dataF['annual_revn'].between(10000000,15000000),'10_15_MILLION',
np.where(dataF['annual_revn'].between(15000000,20000000),'15_20_MILLION',
np.where(dataF['annual_revn'].between(20000000,50000000),'20_50_MILLION',
np.where(dataF['annual_revn'].between(50000000,1000000000),'50_1000_MILLION',
'NOTAVAILABLE_OUTLIER')))))
The below method works, but it isn't particularly elegant, and it may not be that fast.
import pandas as pd
import numpy as np
import re
dataF = pd.DataFrame({'TITLE':['CEO','CHIEF EXECUTIVE','EXECUTIVE OFFICER','FOUNDER',
'CHIEF OP','TECH OFFICER','CHIEF TECH','VICE PRES','PRESIDENT','PRESIDANTE','OWNER','CO OWNER',
'DIRECTOR','MANAGER',np.nan]})
TITLE_REPLACE = {'CEO_FOUNDER':'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
'OTHER_OFFICERS':'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
'VP':'VICE|VP',
'PRESIDENT':'PRESIDENT|PRES',
'OWNER_CO_OWN':'OWNER',
'DIR_MGR_HEAD':'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}
# Swap the keys and values from the raw data, splitting each regex on '|'
reverse_replace = {}
for key, value in TITLE_REPLACE.items():
    for value_single in value.split('|'):
        reverse_replace[value_single] = key

def mapping_func(x):
    if x is not np.nan:
        for key, value in reverse_replace.items():
            if re.compile(key).search(x):
                return value
    return 'NOTAVAILABLE'
dataF['TITLE_GRP'] = dataF['TITLE'].apply(mapping_func)
TITLE TITLE_GRP
0 CEO CEO_FOUNDER
1 CHIEF EXECUTIVE CEO_FOUNDER
2 EXECUTIVE OFFICER CEO_FOUNDER
3 FOUNDER CEO_FOUNDER
4 CHIEF OP OTHER_OFFICERS
5 TECH OFFICER OTHER_OFFICERS
6 CHIEF TECH OTHER_OFFICERS
7 VICE PRES VP
8 PRESIDENT PRESIDENT
9 PRESIDANTE PRESIDENT
10 OWNER OWNER_CO_OWN
11 CO OWNER OWNER_CO_OWN
12 DIRECTOR DIR_MGR_HEAD
13 MANAGER DIR_MGR_HEAD
14 NaN NOTAVAILABLE
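The dictionary-driven version the question asks for can also be written with np.select, which evaluates an ordered list of conditions and takes the first hit, so the nested np.where hierarchy is preserved by dict order (Python 3.7+ dicts keep insertion order). A sketch over a trimmed sample:

```python
import numpy as np
import pandas as pd

dataF = pd.DataFrame({'TITLE': ['CEO', 'CHIEF OP', 'VICE PRES', 'OWNER', np.nan]})

TITLE_REPLACE = {'CEO_FOUNDER': 'CEO|CHIEF EXECUTIVE|EXECUTIVE OFFICER|FOUN',
                 'OTHER_OFFICERS': 'CHIEF|OFFICER|^CFO$|^COO$|^CIO$|^CTO$|^CMO$',
                 'VP': 'VICE|VP',
                 'PRESIDENT': 'PRESIDENT|PRES',
                 'OWNER_CO_OWN': 'OWNER',
                 'DIR_MGR_HEAD': 'MANAGER|GM|MGR|MNGR|DIR|HEAD|CHAIR'}

# One boolean mask per group, in dict order; np.select picks the first True,
# and the default keeps the original title (NaN becomes NOTAVAILABLE)
conditions = [dataF['TITLE'].str.contains(pat, na=False) for pat in TITLE_REPLACE.values()]
dataF['TITLE_GRP'] = np.select(conditions, list(TITLE_REPLACE),
                               default=dataF['TITLE'].fillna('NOTAVAILABLE'))
```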
For your additional scenarios, it may make sense to construct a df holding the industry mapping data, then use df.merge to determine the grouping from the industry.
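That mapping-frame idea might look like the sketch below; the mapping rows here are hypothetical stand-ins for values loaded from the JSON config:

```python
import pandas as pd

# Hypothetical long-form mapping: one row per raw industry value
mapping = pd.DataFrame({
    'INDUSTRY': ['AEROSPACE', 'EDUCATION', 'TRAVEL', 'RESTAURANT'],
    'INDUSTRY_GRP': ['AER_AGR_MIN_EDU_ENER', 'AER_AGR_MIN_EDU_ENER',
                     'TRA_INS_GOVT_FIN_AUT_PHAR', 'REST_SOFT'],
})

df = pd.DataFrame({'INDUSTRY': ['EDUCATION', 'RESTAURANT', 'UNKNOWN']})

# A left merge keeps unmapped rows, which then receive the fallback label
out = df.merge(mapping, on='INDUSTRY', how='left')
out['INDUSTRY_GRP'] = out['INDUSTRY_GRP'].fillna('NOTAVAILABLE')
```

For the .between scenario, pd.cut with bin edges and labels taken from the same config is the analogous replacement for the nested np.where.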

How to use dynamic string to filter data frame using Python Pandas

DataFrame
PROJECT CLUSTER_x MARKET_x CLUSTER_y MARKET_y Exist
0 P17 A CHINA C CHINA both
1 P18 P INDIA P INDIA both
2 P16 P AMERICA P AMERICA both
3 P19 P INDIA P JAPAN both
The code below works and returns index 0 and 3 as output:
df_mismatched = df_common[ (df_common['MARKET_x'] != df_common['MARKET_y']) | (df_common['CLUSTER_x'] != df_common['CLUSTER_y']) ]
How can we dynamically build such filter criteria? Something like the code below, so that hardcoding won't be necessary next time:
str_common = '(df_common["MARKET_x"] != df_common["MARKET_y"]) | (df_common["CLUSTER_x"] != df_common["CLUSTER_y"])'
df_mismatched = df_common[str_common]
For the dynamic case, you can use query:
con = "(MARKET_x!=MARKET_y)|(CLUSTER_x!=CLUSTER_y)"
print(df.query(con))
PROJECT CLUSTER_x MARKET_x CLUSTER_y MARKET_y Exist
0 P17 A CHINA C CHINA both
3 P19 P INDIA P JAPAN both
Remember that if the column names have spaces or special characters, this will fail to produce the right results.
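That caveat has a workaround: query() accepts backtick-quoted column names (pandas >= 0.25). A small sketch with hypothetical space-containing columns:

```python
import pandas as pd

df = pd.DataFrame({'MARKET x': ['CHINA', 'INDIA'],
                   'MARKET y': ['CHINA', 'JAPAN']})

# Backticks let query() reference names that aren't valid Python identifiers
out = df.query("`MARKET x` != `MARKET y`")
```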

pandas: extract specific text before or after hyphen, that ends in given substrings

I am very new to pandas and have a data frame similar to the one below:
import pandas as pd
df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
"Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
"Company Not Special – R Mill","Greatest Company – Great World POM"]})
id mill
0 1 Company A Palm Oil Mill – Special Company A of...
1 2 Company X POM – Company X Ltd
2 3 DDDD Mill – Company New and Old Ltd
3 4 Company Not Special – R Mill
4 5 Greatest Company – Great World POM
What I would like to get from the above data frame is something like the below:
Is there an easy way to extract those substrings into the same column? The mill name can sometimes be before and other times after the '–', but it will almost always end with Palm Oil Mill, POM or Mill.
IIUC, you can use str.contains with those key words Palm Oil Mill, POM, Mill:
s = df.mill.str.split(' – ', expand=True)
df['Name']=s[s.apply(lambda x : x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(1)
df
Out[230]:
id mill \
0 1 Company A Palm Oil Mill – Special Company A of...
1 2 Company X POM – Company X Ltd
2 3 DDDD Mill – Company New and Old Ltd
3 4 Company Not Special – R Mill
4 5 Greatest Company – Great World POM
Name
0 Company A Palm Oil Mill
1 Company X POM
2 DDDD Mill
3 R Mill
4 Great World POM
Previous solution: You could use .str.split() and do this:
df.mill = df.mill.str.split(' –').str[0]
Update: seeing you've got a few constraints, you can build your own function (called func below) and put any logic you want inside it. It loops through the parts split on '–' and returns the first part that ends with 'Mill' or 'POM'.
Otherwise, I recommend Wen's solution.
import pandas as pd
df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
"Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
"Company Not Special – R Mill","Greatest Company – Great World POM"]})
def func(x):
    # Split the string into its hyphen-separated parts
    ar = x.split(' – ')
    # If there are fewer than 2 parts, return the value as is
    if len(ar) < 2:
        return x
    # Else loop through the parts and apply the logic here
    for part in ar:
        if part.lower().endswith(('mill', 'pom')):
            return part
    # Nothing found, return the original value
    return x
df.mill = df.mill.apply(func)
print(df)
Returns:
id mill
0 1 Company A Palm Oil Mill
1 2 Company X POM
2 3 DDDD Mill
3 4 R Mill
4 5 Great World POM
You want to split on the hyphen (if any), and return the substring ending in 'Mill' or 'POM':
def extract_mill_name(s):
    """Extract the substring which ends in 'Mill' or 'POM'."""
    for subs in s.split('–'):
        subs = subs.strip(' ')
        if subs.endswith('Mill') or subs.endswith('POM'):
            return subs
    return None  # parsing error; could raise an exception instead
df.mill.apply(extract_mill_name)
0 Company A Palm Oil Mill
1 Company X POM
2 DDDD Mill
3 R Mill
4 Great World POM
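A vectorized take on the same split-and-filter idea, sketched on a two-row sample: explode the hyphen-separated parts into their own rows, then keep the parts ending in the mill-style suffixes. It assumes exactly one matching part per id:

```python
import pandas as pd

df = pd.DataFrame({'id': ["1", "2"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company Not Special – R Mill"]})

# One row per hyphen-separated part, indexed by id
parts = df.set_index('id')['mill'].str.split(' – ').explode()
# Keep only parts that end in 'Mill' or 'POM', then map them back by id
keep = parts.str.contains(r'(?:Mill|POM)\s*$')
df['Name'] = df['id'].map(parts[keep])
```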
