fuzzy matching for SQL using fuzzy wuzzy and pandas - python

i have the following table in SQL and want to use Fuzzy Wuzzy to compare all the records in the table for any potential duplicates which in this instance line 1 is a duplicate of line 2 (or vice versa). can someone explain how i can add two additional columns to this table (Highest Score and Record Line Num) using Fuzzy Wuzzy and pandas? thanks.
Input:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount
ABC 5/12/2019 5/10/2019 ABCDE56. 56
ABC 5/13/2019 5/10/2019 ABCDE56 56
TIM 4/15/2019 4/10/2019 RTET5SDF 100
Desired Output:
Vendor Doc Date Invoice Date Invoice Ref Num Invoice Amount Highest Score Record Line Num
ABC 5/12/2019 5/10/2019 ABCDE56. 56 96 2
ABC 5/13/2019 5/10/2019 ABCDE56 56 96 1
TIM 4/15/2019 4/10/2019 RTET5SDF 100 0 N/A

Since you are looking for duplicates, you should filter your data frame first using the vendor name. This is to ensure it doesn't match with invoices of other vendors and reduce the processing time. However, since you didn't mention anything about it, you can skip it.
Decide on a threshold for duplicates based on the length of your invoice references. For example if the average is 5 chars, make the threshould 80%. Then, use fuzzywuzzy to get the best match.
from fuzzywuzzy import fuzz, process
# Assuming no NaNs in invoices references
inv_list = df['Invoice Ref'].to_list()
for i, inv in enumerate(inv_list)
result = process.extractOne(inv, inv_list, scorer=fuzz.token_sort_ratio)
if result[1] >= your_threshould:
df.loc[i, 'Highest Score'] = result[1]
df.loc[i, 'Record Line Num'] = inv_list.index(result[0])

Related

Python - Matching and extracting data from excel with pandas

I am working on a python script that automates some phone calls for me. I have a tool to test with that I can interact with REST API. I need to select a specific carrier based on which country code is entered. So let's say my user enters 12145221414 in my excel document, I want to choose AT&T as the carrier. How would I accept input from the first column of the table and then output what's in the 2nd column?
Obviously this can get a little tricky, since I would need to match up to 3-4 digits on the front of a phone number. My plan is to write a function that then takes the initial number and then plugs the carrier that needs to be used for that country.
Any idea how I could extract this data from the table? How would I make it so that if you entered Barbados (1246), then Lime is selected instead of AT&T?
Here's my code thus far and tables. I'm not sure how I can read one table and then pull data from that table to use for my matching function.
testlist.xlsx
| Number |
|:------------|
|8155555555|
|12465555555|
|12135555555|
|96655555555|
|525555555555|
carriers.xlsx
| countryCode | Carrier |
|:------------|:--------|
|1246|LIME|
|1|AT&T|
|81|Softbank|
|52|Telmex|
|966|Zain|
import pandas as pd
import os
FILE_PATH = "C:/temp/testlist.xlsx"
xl_1 = pd.ExcelFile(FILE_PATH)
num_df = xl_1.parse('Numbers')
FILE_PATH = "C:/temp/carriers.xlsx"
xl_2 = pd.ExcelFile(FILE_PATH)
car_df = xl_2.parse('Carriers')
for index, row in num_df.iterrows():
Any idea how I could extract this data from the table? How would I
make it so that if you entered Barbados (1246), then Lime is selected
instead of AT&T?
carriers.xlsx
countryCode
Carrier
1246
LIME
1
AT&T
81
Softbank
52
Telmex
966
Zain
script.py
import pandas as pd
FILE_PATH = "./carriers.xlsx"
df = pd.read_excel(FILE_PATH)
rows_list = df.to_dict('records')
code_carrier_map = {}
for row in rows_list:
code_carrier_map[row["countryCode"]] = row["Carrier"]
print(type(code_carrier_map), code_carrier_map)
print(f"{code_carrier_map.get(1)=}")
print(f"{code_carrier_map.get(1246)=}")
print(f"{code_carrier_map.get(52)=}")
print(f"{code_carrier_map.get(81)=}")
print(f"{code_carrier_map.get(966)=}")
Output
$ python3 script.py
<class 'dict'> {1246: 'LIME', 1: 'AT&T', 81: 'Softbank', 52: 'Telmex', 966: 'Zain'}
code_carrier_map.get(1)='AT&T'
code_carrier_map.get(1246)='LIME'
code_carrier_map.get(52)='Telmex'
code_carrier_map.get(81)='Softbank'
code_carrier_map.get(966)='Zain'
Then if you want to parse phone numbers, don't reinvent the wheel, just use this phonenumbers library.
Code
import phonenumbers
num = "+12145221414"
phone_number = phonenumbers.parse(num)
print(f"{num=}")
print(f"{phone_number.country_code=}")
print(f"{code_carrier_map.get(phone_number.country_code)=}")
Output
num='+12145221414'
phone_number.country_code=1
code_carrier_map.get(phone_number.country_code)='AT&T'
Let's assume the following input:
>>> df1
Number
0 8155555555
1 12465555555
2 12135555555
3 96655555555
4 525555555555
>>> df2
countryCode Carrier
0 1246 LIME
1 1 AT&T
2 81 Softbank
3 52 Telmex
4 966 Zain
First we need to rework a bit df2 to sort the countryCode in descending order, make it as string and set it to index.
The trick for later is to sort countryCode in descending order. This will ensure that a longer country codes, such as "1246" is matched before a shorter one like "1".
>>> df2 = df2.sort_values(by='countryCode', ascending=False).astype(str).set_index('countryCode')
>>> df2
Carrier
countryCode
1246 LIME
966 Zain
81 Softbank
52 Telmex
1 AT&T
Finally, we use a regex (here '1246|966|81|52|1' using '|'.join(df2.index)) made from the country codes in descending order to extract the longest code, and we map it to the carrier:
(df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
)
output:
0 Softbank
1 LIME
2 AT&T
3 Zain
4 Telmex
Name: 0, dtype: object
NB. to add it to the initial dataframe:
df1['carrier'] = (df1.astype(str)['Number']
.str.extract('^(%s)'%'|'.join(df2.index))[0]
.map(df2['Carrier'])
).to_clipboard(0)
output:
Number carrier
0 8155555555 Softbank
1 12465555555 LIME
2 12135555555 AT&T
3 96655555555 Zain
4 525555555555 Telmex
If I understand it correctly, you just want to get the first characters from the input column (Number) and then match this with the second dataframe from carriers.xlsx.
Extract first characters of a Number column. Hint: The nbr_of_chars variable should be based on the maximum character length of the column countryCode in the carriers.xlsx
nbr_of_chars = 4
df.loc[df['Number'].notnull(), 'FirstCharsColumn'] = df['Number'].str[:nbr_of_chars]
Then the matching should be fairly easy with dataframe joins.
I can think only of an inefficient solution.
First, sort the data frame of carriers in the reverse alphabetical order of country codes. That way, longer prefixes will be closer to the beginning.
codes = xl_2.sort_values('countryCode', ascending=False)
Next, define a function that matches a number with each country code in the second data frame and finds the index of the first match, if any (remember, that match is the longest).
def cc2carrier(num):
matches = codes['countryCode'].apply(lambda x: num.startswith(x))
if not matches.any(): #Not found
return np.nan
return codes.loc[matches.idxmax()]['Carrier']
Now, apply the function to the numbers dataframe:
xl_1['Number'].apply(cc2carrier)
#1 Softbank
#2 LIME
#3 AT&T
#4 Zain
#5 Telmex
#Name: Number, dtype: object

Pandas map two dataframes using regex

I've two dataframes, one with text information and another with regex and patterns, what I need to do is to map a column from the second dataframe using regex
edit: What I need to do is to apply each regex on all df['text'] rows, and if there is a match, add the Pattern into a new column
Sample data
text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
'sales for the overseas customers',
'marketing approach is driving strong play from top tier customers',
'employees in India have been the continuance of remote work will impact productivity',
'sales due to higher customer']}
regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
'(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
'(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
tried the following, created a function that returns the Pattern in case there is a match, then I iterate over all the columns in the regex dataframe
def finding_keywords(regex, match, keyword):
if re.search(regex, match):
return keyword
else:
pass
for index, row in regex.iterrows():
df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['Regex'][index], x, regex['Pattern'][index]))
the problem with this is that in every iteration, it erases the previous mappings, as you can see below. As I'm foo foo was the last iteration, is the only one remaining with a pattern
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to run the iteration over regex dataframe, and then iterate over df, this way I avoid loosing information, but I'm looking for a fastest solution
You can loop through the unique values of the regex dataframe and apply to the text of the df frame and return the pattern in a new regex column. Then, merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna with each iteration so the columns didn't get overwritten.
import re
import numpy as np
srs = regex['regex'].unique()
df['regex'] = np.nan
for reg in srs:
df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg
if re.search(reg, x) else np.NaN))
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer

Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning more and more about python and using it for my day to day operation analyst position. One of the tasks I have is sorting through approx 60k unique record identifiers, and searching through another dataframe that has approx 120k records of interactions, the employee who authored the interaction and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifer(known as Case File Identifier), Note Text, Created On.
My set up currently runs it at approximately sorting through and matching my data at 2500 rows per minute, so approximately 25-30 minutes or so for a run. What I am curious is are there any steps I performed that are:
Redundant and inefficient overall slowing my process
A poor use of syntax to work around my lack of knowledge.
Below is my code:
nok_data = pd.read_csv("raw nok data.csv") #Data set from warehouse
main_data = pd.read_csv("exampledata.csv") #Data set taken from iTx ids from referral view
row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist()) #Commented out, used to grab header titles if needed.
data_length = len(main_data) #used for counting how many records left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"]) #convert all dates to datetime at beginning.
for row in main_data["iTx Case ID"]:
list_data = []
nok = nok_data["Case File Identifier"] == row
matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True] #takes created on date only if nok shows row was true
if len(matching_dates) > 0:
try:
min_dates = matching_dates.min(axis=0)
earliest_nok[row] = [min_dates[0], min_dates[1]]
except ValueError:
error_count += 1
earliest_nok[row] = None
row_count += 1
print("{} out of {} records").format(row_count, data_length)
with open('finaloutput.csv','wb') as csv_file:
writer = csv.writer(csv_file)
for key, value in earliest_nok.items():
writer.writerow([key, value])
Looking for any advice or expertise from those performing code like this much longer then I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there not including any data type.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By: Case File Identifier: Note Text: Authored on
John Doe 2017-023594 Random Text 4/1/2017 13:24:35
John Doe 2017-023594 Random Text 4/1/2017 13:11:20
Jane Doe 2017-023590 Random Text 4/3/2017 09:32:00
Jane Doe 2017-023590 Random Text 4/3/2017 07:43:23
Jane Doe 2017-023590 Random Text 4/3/2017 7:41:00
John Doe 2017-023592 Random Text 4/5/2017 23:32:35
John Doe 2017-023592 Random Text 4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
.groupby('Case File Identifier:')['Authored on', 'Authored By:'].first()
)
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').iteritems()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.

Python3, Pandas - New Column Value based on Column To Left Data (Dynamic)

I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged into others and I will then have duplicate rows similar to the ones below. I will then need to take all questions with the same text and calculate the percentages of the answers based on the entirety of the merged document.
Example Excel Data
**Poll Question** **Poll Responses**
The content was clear and effectively delivered  37 Total Votes
Strongly Agree 24.30%
Agree 70.30%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
The Instructor(s) were engaging and motivating  37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 2.70%
Disagree 2.70%
Strongly Disagree 0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree 21.60%
Agree 73.00%
Neutral 5.40%
Disagree 0.00%
Strongly Disagree 0.00%
This was a good format for my training  37 Total Votes
Strongly Agree 24.30%
Agree 62.20%
Neutral 8.10%
Disagree 2.70%
Strongly Disagree 2.70%
Any comments/suggestions about this training course?  5 Total Votes
My method for calculating a non-percent number of votes will be to convert the percentages to a number. E.G. find and extract 37 from 37 Total Votes, then use the following formula to get the number of users that voted on that particular answer: percent * total / 100.
So 24.30 * 37 / 100 = 8.99 rounded up means 9 out of 37 people voted for "Strongly Agree".
Here's an example spreadsheet of what I'd like to be able to do:
**Poll Question** **Poll Responses** **non-percent** **subtotal**
... 37 Total Votes 0 37
... 24.30% 9 37
... 70.30% 26 37
... 2.70% 1 37
... 2.70% 1 37
... 0.00% 0 37
(note: non-percent and subtotal would be newly created columns)
Currently I take a folder full of .xls files and I loop through that folder, saving them to another in an .xlsx format. Inside that loop, I've added a comment block that contains my # NEW test CODE where I'm trying to put the logic to do this.
As you can see, I'm trying to target the cell and get the value, then get some regex and extract the number from it, (then add it to the subtotal column in that row. I then want to add it till I see a new instance of a row containing x Total Votes.
Here's my current code:
import numpy as np
import pandas as pd
files = get_files('/excels/', '.xls')
df_array = []
for i, f in enumerate(files, start=1):
sheet = pd.read_html(f, attrs={'class' : 'reportData'}, flavor='bs4')
event_id = get_event_id(pd.read_html(f, attrs={'id' : 'eventSummary'}))
event_title= get_event_title(pd.read_html(f, attrs={'id' : 'eventSummary'}))
filename = event_id + '.xlsx'
rel_path = 'xlsx/' + filename
writer = pd.ExcelWriter(rel_path)
for df in sheet:
# NEW test CODE
q_total = 0
df.columns = df.columns.str.strip()
if df[df['Poll Responses'].str.contains("Total Votes")]:
# if df['Poll Responses'].str.contains("Total Votes"):
q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
print(q_total)
# df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
# END NEW test Code
df.insert(0, 'Event ID', event_id)
df.insert(1, 'Event Title', event_title)
df.to_excel(writer,'sheet')
writer.save()
# progress of entire list
if i <= len(files):
print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')
print('\n')
TL;DR
This seems very convoluted, but if I can get the two new columns that contain the total votes for a question and the number (not percentage) of votes for an answer, then I can do some VLOOKUP magic for this on the merged document. Any help or methodology suggestions would be greatly appreciated. Thanks!
I solved this, I'll post the pseudo code below:
I loop through each sheet. Inside that loop, I loop through each row using for n, row in enumerate(df.itertuples(), 1):.
I get the value of the field that might contain "Poll Response" poll_response = str(row[3])
Using an if / else I check if the poll_response contains the text "Total Votes". If it does, it must be a question, otherwise it must be a row with an answer.
In the if for the question I get the cells that contain the data I need. I then have a function that compares the question text with all objects question text in the array. If it's a match, then I simply update the fields of the object, otherwise I create a new question object.
else the row is an answer row, and I use the question text to find the object in the array and update/add the answers or data.
This process loops through all the rows in each spreadsheet, and now I have my array full of unique question objects.

Pandas: search multiple columns and return column with found value

I'm try to do some auditing for our purchase orders, and I created this dataframe (here's a csv sample of it):
ProductName,Qty,LineCost,BuyQty1,BuyQty1Cost,BuyQty2,BuyQty2Cost,BuyQty3,BuyQty3Cost
SIGN2WH,48,40.63,5,43.64,48,40.63,72,39.11
SIGN2BK,144,39.11,5,43.64,48,40.63,72,39.11
In my data source, some products get different breaks, depending on the quantity purchased. Hence the columns BuyQty1 and BuyQty1Cost. Qty and LineCost are the values I need to audit. So, what I'm trying to do is:
Check what quantity break corresponds to the value on the column
Qty. Example a Qty of 48 implies that the break is BuyQty2,
and the corresponding price should be BuyQty2Cost.
Then add a column with the ratio of LineCost/BuyQty2Cost. It would be BuyQty3Cost in the case of SIGN2BK (2nd line).
How should I tackle this?
import pandas as pd
def calculate_break_level(row):
if row.Qty >= row.BuyQty3:
return row.BuyQty3Cost
elif row.Qty >= row.BuyQty2:
return row.BuyQty2Cost
else:
return row.BuyQty1Cost
# apply the function row-by-row by specifying axis=1
# the newly produced Line_Cost is in the last column.
df['Line_Cost'] = df.apply(calculate_break_level, axis=1)
Out[58]:
ProductName Qty LineCost BuyQty1 BuyQty1Cost BuyQty2 BuyQty2Cost BuyQty3 BuyQty3Cost Line_Cost
0 SIGN2WH 48 40.63 5 43.64 48 40.63 72 39.11 40.63
1 SIGN2BK 144 39.11 5 43.64 48 40.63 72 39.11 39.11

Categories