Matching strings in a pandas dataframe using fuzzywuzzy - python

I have two dataframes: Instructor_Info and Operator_Info.
Instructor_Info contains columns called Names and OperatorName, and Operator_Info also has a column called Names. Every name in Instructor_Info has an associated name in Operator_Info. I want to use fuzz.token_sort_ratio() to find these matches by comparing each name in Instructor_Info to every name in Operator_Info and storing the string with the highest score in the OperatorName column.
This is what I have so far:
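from fuzzywuzzy import fuzz

for index, row in Instructor_Info.iterrows():
    match = 0
    for index1, row1 in Operator_Info.iterrows():
        score = fuzz.token_sort_ratio(row['Names'], row1['Names'])
        if score > match:
            match = score  # remember the best score seen so far
            Instructor_Info.at[index, 'OperatorName'] = row1['Names']  # write to the frame, not the row copy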
This code runs extremely slowly and produces a couple of false matches (I can fix these manually, so speed is the main issue). If anyone has any faster ideas, it would be much appreciated. Thanks in advance.
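One possible way to speed this up, offered as a sketch rather than a definitive answer: part of the cost likely comes from iterrows() and from rescoring names that occur more than once. Pulling the names into plain Python lists, scoring each distinct name once, and writing the results back in a single step avoids both. The scorer below is a difflib-based stand-in for fuzz.token_sort_ratio (an assumption, so that the example runs without fuzzywuzzy; unlike the real scorer it does not strip punctuation). The helper best_matches and the sample names are hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_ratio(a, b):
    """Stand-in scorer: sort the lowercased tokens, then compare with difflib (0-100)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def best_matches(names, candidates, scorer=token_sort_ratio):
    """Map each distinct name to its highest-scoring candidate."""
    unique_candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    lookup = {}
    for name in dict.fromkeys(names):  # score every distinct name only once
        lookup[name] = max(unique_candidates, key=lambda c: scorer(name, c))
    return lookup


# Made-up sample data standing in for the two Names columns.
instructor_names = ["Smith John", "Jane Doe", "Smith John"]
operator_names = ["John Smith", "Doe, Jane", "Alice Jones"]

lookup = best_matches(instructor_names, operator_names)
print(lookup)
```

With the real data this would presumably become lookup = best_matches(Instructor_Info['Names'], Operator_Info['Names'], scorer=fuzz.token_sort_ratio), followed by Instructor_Info['OperatorName'] = Instructor_Info['Names'].map(lookup). The number of comparisons is still (distinct names) x (candidates), so the gain depends on how many duplicates there are; installing python-Levenshtein alongside fuzzywuzzy may also speed up the scoring itself.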

Related

Printing and counting unique values from an .xlsx file

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program that does not use any functions. I understand how to count through a column of unknown length and output the quantity. For this program, however, I'm trying to loop through a column, pick out the unique numbers, and count how often each one appears.
So I have an Excel file with random numbers down column A. I only put in 20 numbers, but let's pretend the range is unknown. How would I go about extracting the unique numbers and writing them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique,1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume that, since you can't use functions, this may be homework. So, at a high level:
First, go through the column and put all the values in a list.
Second, take the first value from the list and go through the rest of the list: is it in there? If so, it is not unique. Remove the duplicate you found from the list. Keep going, and if you find another, remove that too.
Then take the second value, and so on.
You would just need a list comprehension, some loops and perhaps .pop().
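A rough sketch of a closely related variant, in case it helps: instead of removing duplicates from the list, it keeps two parallel lists (the distinct numbers and their counts) and uses only loops, with no function definitions. The values list is made-up sample data standing in for whatever was read from column A.

```python
# Made-up sample standing in for the values read from column A.
values = [3, 7, 3, 1, 7, 3]

unique_values = []  # each distinct number, in order of first appearance
frequencies = []    # frequencies[i] is how often unique_values[i] occurs

for value in values:
    position = -1
    # Look for the value among the numbers already seen.
    for i in range(len(unique_values)):
        if unique_values[i] == value:
            position = i
            break
    if position == -1:
        unique_values.append(value)  # first time this number appears
        frequencies.append(1)
    else:
        frequencies[position] += 1   # seen before: bump its count

# Print each distinct number next to its count.
for i in range(len(unique_values)):
    print(unique_values[i], frequencies[i])
```

The two lists can then be written to a separate pair of columns, one row per index.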
Using the pandas library would be the easiest way to do this. I created a sample Excel sheet with only one column, called "Random_num":
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
Thanks

Best way to fuzzy match values in a data frame and then replace the value?

I'm working with a dataframe containing various data points of customer data. I want to replace any junk phone numbers with a blank value. Right now I'm struggling to find an efficient way to find potential junk values, such as a phone number like 111-111-1111, and replace that specific value with a blank entry.
I currently have a fairly ugly solution where I go through three fields (home phone, cell phone and work phone), locate the index values of the rows in question and the respective column, and then replace those values.
With regard to actually finding junk values in a dataframe, is there a better approach than what I am currently doing?
row_index = dataset[dataset['phone'].str.contains('11111')].index
column_index = dataset.columns.get_loc('phone')
Afterwards, I would zip these up and cycle through a for loop, using dataset.iat[row_index, column_index] = ''. The row and column index variables would also have the junk values in the 'cellphone' and 'workphone' columns appended on as well.
Pandas 'where' function tends to be quick:
dataset['phone'] = dataset['phone'].where(~dataset['phone'].str.contains('11111'), None)
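To cover all three phone fields in one pass, something along these lines might work. It is only a sketch: the rule for what counts as junk (a single digit repeated for the whole number once separators are removed) is an assumption, the sample rows are made up, and the column names follow the ones mentioned in the question.

```python
import pandas as pd

# Made-up sample rows; column names taken from the question.
dataset = pd.DataFrame({
    "phone": ["555-123-4567", "111-111-1111", None],
    "cellphone": ["111-111-1111", "555-987-6543", "555-000-1234"],
    "workphone": ["555-222-3333", "000-000-0000", "111-111-1111"],
})

phone_columns = ["phone", "cellphone", "workphone"]

for column in phone_columns:
    # Keep only the digits, e.g. "111-111-1111" -> "1111111111".
    digits = dataset[column].str.replace(r"\D", "", regex=True)
    # Assumed junk rule: the whole number is one digit repeated; missing values are not junk.
    is_junk = digits.str.fullmatch(r"(\d)\1+", na=False)
    # Blank out the junk entries and leave everything else untouched.
    dataset[column] = dataset[column].mask(is_junk, "")

print(dataset)
```

Swapping the fullmatch rule for str.contains('11111', na=False) would reproduce the original check on each column.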

Match similar column elements using pandas and fuzzywuzzy

I have an Excel file that contains 1000+ company names in one column and about 20,000 company names in another column.
The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that a "Company Name" string can look something like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this (only 12 names match exactly).
After trying different methods, I've ended up attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm struggling to find the most efficient way to do this.
For instance:
Column 1
1. 9Com(panynAm9e00
2. NikE4
3. Mitrosof2
Column 2
1. Microsoft
2. Company Name
3. Nike
Take first element in Column 1 and look for a match in Column 2. If no exact match, then look for a string with 4-5 same characters.
Any suggestions?
I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:
import pandas as pd
from fuzzywuzzy import process, fuzz
df = pd.DataFrame([['9Com(panynAm9e00'],
                   ['NikE4'],
                   ['Mitrosof2']],
                  columns=['Name'])
known_list = ['Microsoft', 'Company Name', 'Nike']

def find_match(x):
    match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
    return match

df['match found'] = [find_match(row) for row in df['Name']]
Yields:
               Name   match found
0  9Com(panynAm9e00  Company Name
1             NikE4          Nike
2         Mitrosof2     Microsoft
I imagine numbers are not very common in actual company names, so an initial filter step will help immensely going forward, but here is one implementation that should work relatively well even without this. A bag-of-letters (bag-of-words) approach, if you will:
1. Convert everything (col 1 and 2) to lowercase.
2. For each known company in column 2, store each unique letter and how many times it appears (count) in a dictionary.
3. Do the same (step 2) for each entry in column 1.
4. For each entry in col 1, find the closest bag-of-letters (dictionary from step 2) from the list of real company names.
The dictionary-distance implementation is up to you.
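As a rough illustration of the steps above, here is a minimal sketch using collections.Counter as the letter dictionary. The distance used (sum of absolute differences in letter counts) is just one possible choice, and dropping non-letters is an assumed filtering step. The helper names letter_bag and bag_distance are hypothetical.

```python
from collections import Counter


def letter_bag(name):
    """Count each letter in the lowercased name, ignoring digits and punctuation."""
    return Counter(ch for ch in name.lower() if ch.isalpha())


def bag_distance(bag_a, bag_b):
    """Sum of absolute count differences over all letters (one possible distance)."""
    return sum(abs(bag_a[ch] - bag_b[ch]) for ch in set(bag_a) | set(bag_b))


known_companies = ["Microsoft", "Company Name", "Nike"]
messy_names = ["9Com(panynAm9e00", "NikE4", "Mitrosof2"]

# Step 2: one bag of letters per known company.
known_bags = {company: letter_bag(company) for company in known_companies}

# Steps 3 and 4: build a bag for each messy name and pick the closest known bag.
matches = {}
for messy in messy_names:
    bag = letter_bag(messy)
    matches[messy] = min(known_bags, key=lambda company: bag_distance(bag, known_bags[company]))

print(matches)
```

Because letter order is ignored, anagrams are indistinguishable under this distance, so it is probably best used as a first pass before a stricter comparison.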

Incrementing Values in Column Based on Values in Another (Pandas)

I'd like to find a simple way to increment the values in one column that correspond to a particular date in Pandas. This is what I have so far:
old_casual_value = tstDF['casual'].loc[tstDF['ds'] == '2012-10-08'].values[0]
old_registered_value = tstDF['registered'].loc[tstDF['ds'] == '2012-10-08'].values[0]
# Adjusting the numbers of customers for that day.
# DataFrame.set_value was removed in pandas 1.0; .at is the equivalent scalar setter.
tstDF.at[406, 'casual'] = old_casual_value*1.05
tstDF.at[406, 'registered'] = old_registered_value*1.05
If I could find a better and simpler way to do this (a one-liner), I'd greatly appreciate it.
Thanks for your help.
The following one-liner should work, based on your limited description of the problem. If not, please provide more information.
# The code below first selects the rows for the specified date and then multiplies the casual and registered column values by 1.05.
tstDF.loc[tstDF['ds'] == '2012-10-08',['casual','registered']]*=1.05

how to write if condition for column2 , column 3 of a csv file

I have a CSV file which has 3 columns.
Here is what I have to do:
I want to write an if condition (or whatever works) such that if Divi == 'core', I get the count of distinct tags without redundancy, i.e. two sand1 entries in the tag column for the core division should be counted only once.
One more if condition: if Div == saturn or core, and type == dev, then the same thing applies and I need the number of distinct tags.
Can anyone help me out with this? The above is just my own idea, so any other approach is welcome as long as it satisfies the requirement.
First, load up your data with pandas.
import pandas as pd
dataframe = pd.read_csv(path_to_csv)
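Second, format your data properly (you might have mixed lower-case and upper-case data, as in column 'Division' from your example):
# Assumes every column holds strings; a Series has no .lower(), so use the .str accessor.
for column in dataframe.columns:
    dataframe[column] = dataframe[column].str.lower()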
If you want to count frequency just by one column you can:
dataframe['Division'].value_counts()
If you want to count by two columns you can:
dataframe.groupby(['Division','tag']).count()
Hope that helps
edit:
While this won't give you just the count for when the two conditions are met, which is what you asked for, it will give you a more 'complete' answer, showing the count for every combination of the two columns.
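For the two specific counts asked about, a sketch along the following lines might do, using a boolean mask followed by nunique(). The column names Division, type and tag are assumed from the question and the answer above, and the rows are made-up sample data.

```python
import pandas as pd

# Made-up rows; the column names Division, type and tag are assumptions.
dataframe = pd.DataFrame({
    "Division": ["core", "Core", "saturn", "core", "saturn"],
    "type": ["dev", "prod", "dev", "dev", "prod"],
    "tag": ["sand1", "sand1", "sand2", "sand3", "sand2"],
})

# Compare case-insensitively without modifying the original columns.
division = dataframe["Division"].str.lower()
kind = dataframe["type"].str.lower()

# Distinct tags where Division is core (two sand1 rows count once).
core_tags = dataframe.loc[division == "core", "tag"].nunique()

# Distinct tags where Division is saturn or core and type is dev.
mask = division.isin(["saturn", "core"]) & (kind == "dev")
dev_tags = dataframe.loc[mask, "tag"].nunique()

print(core_tags, dev_tags)
```

Replacing nunique() with unique() would return the distinct tags themselves instead of their count.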
