I have a dataset as shown below. I need to group a subset of the rows and fill their missing values with the group's mode. Specifically, the missing values for Tom from UK need to be filled: within the group of rows where Name is Tom and location is UK, the most frequent Value should replace each NaN.
The desired output fills every NaN in that group with the group's mode.
Here is the dataset:
Name location Value
Tom USA 20
Tom UK Nan
Tom USA Nan
Tom UK 20
Jack India Nan
Nihal Africa 30
Tom UK Nan
Tom UK 20
Tom UK 30
Tom UK 20
Tom UK 30
Sam UK 30
Sam UK 30
try:
df = df\
    .set_index(['Name', 'location'])\
    .fillna(
        df[df.Name.eq('Tom') & df.location.eq('UK')]
          .groupby(['Name', 'location'])
          .agg(pd.Series.mode)
          .to_dict()
    )\
    .reset_index()
Output:
Name location Value
0 Tom USA 20
1 Tom UK 20
2 Tom USA NaN
3 Tom UK 20
4 Jack India NaN
5 Nihal Africa 30
6 Tom UK 20
7 Tom UK 20
8 Tom UK 30
9 Tom UK 20
10 Tom UK 30
11 Sam UK 30
12 Sam UK 30
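If you need to fill the NaNs of every (Name, location) group with that group's mode, not only Tom/UK, here is a minimal sketch using groupby.transform (my generalization, assuming the column names above; note it also fills groups like Tom/USA):
import pandas as pd

# Replace each group's NaNs with the group's most frequent value.
# mode() can return several values on a tie, so take the first;
# groups whose values are all NaN are left unchanged.
df['Value'] = df.groupby(['Name', 'location'])['Value'].transform(
    lambda s: s if s.mode().empty else s.fillna(s.mode().iloc[0])
)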
So I have data in a CSV file. Here is my code.
data = pd.read_csv('cast.csv')  # read_csv already returns a DataFrame, no pd.DataFrame() wrap needed
print(data)
The result looks like this.
title year name type \
0 Closet Monster 2015 Buffy #1 actor
1 Suuri illusioni 1985 Homo $ actor
2 Battle of the Sexes 2017 $hutter actor
3 Secret in Their Eyes 2015 $hutter actor
4 Steve Jobs 2015 $hutter actor
... ... ... ... ...
74996 Mia fora kai ena... moro 2011 Penelope Anastasopoulou actress
74997 The Magician King 2004 Tiannah Anastassiades actress
74998 Festival of Lights 2010 Zoe Anastassiou actress
74999 Toxic Tutu 2016 Zoe Anastassiou actress
75000 Fugitive Pieces 2007 Anastassia Anastassopoulou actress
character n
0 Buffy 4 31.0
1 Guests 22.0
2 Bobby Riggs Fan 10.0
3 2002 Dodger Fan NaN
4 1988 Opera House Patron NaN
... ... ...
74996 Popi voulkanizater 11.0
74997 Unicycle Race Attendant NaN
74998 Guidance Counselor 20.0
74999 Demon of Toxicity NaN
75000 Laundry Girl 25.0
[75001 rows x 6 columns]
I want to group the data by year and type. Then I want to know the size of each type in a specific year. So here is my code.
grouped = data.groupby(['year', 'type']).size()
print(grouped)
The result looks like this.
year type
1912 actor 1
actress 2
1913 actor 9
actress 1
1914 actor 38
..
2019 actress 3
2020 actor 3
actress 1
2023 actor 1
actress 2
Length: 220, dtype: int64
The problem is: what if I want to get the size data from 1910 until 2020 in steps of 10 years (per decade)? So the year index would be 1910, 1920, 1930, 1940, and so on until 2020.
I see two simple options.
1- round the years down to the lower 10:
group = data['year'] // 10 * 10  # floor to the lower multiple of 10 (.round(-1) would round to the *nearest* 10 instead)
grouped = data.groupby([group, 'type']).size()
2- use pandas.cut:
years = list(range(1910, 2031, 10))
# right=False makes the bins left-closed ([1910, 1920), ...),
# so that e.g. 1910 itself falls into the 1910 bin
group = pd.cut(data['year'], bins=years, labels=years[:-1], right=False)
grouped = data.groupby([group, 'type']).size()
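If you then want one row per decade and one column per type, the grouped Series can be unstacked; a small usage sketch (not part of the original answer):
# pivot the 'type' level into columns; missing decade/type pairs become 0
decade_table = grouped.unstack('type', fill_value=0)
print(decade_table)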
I have built a dataframe by extracting data with a scraper. One of the columns contains job titles, which currently look like this:
Title Research Number \
1 Dean NaN
2 Professor of Law NaN
3 Associate Dean for Information & Technology Se... NaN
4 Professor of Law\n NaN
5 Associate Dean for Faculty Development\nCharle... NaN
6 Associate Dean for Faculty Development\nCharle... NaN
7 Assistant Professor of Clinical Education & Di... NaN
8 Judge George Howard, Jr., Distinguished Profes... NaN
9 Visiting Assistant Professor of Law NaN
10 Associate Dean for Academic Affairs\nArkansas ... NaN
11 Distinguished Professor in Constitutional Law NaN
12 Assistant Professor of Law NaN
13 Instructor of Clinical Education; Supervising ... NaN
14 Associate Professor of Law NaN
15 Assistant Professor of Law\n NaN
16 Assistant Professor of Clinical Education; Tax... NaN
17 Assistant Professor of Law Librarianship; NaN
18 Byron M. Eiseman Distinguished Professor of Ta... NaN
19 Professor of Law\n NaN
20 Associate Professor of Law; Mediation Clinic D... NaN
21 Assistant Professor of Clinical Education; Fam... NaN
22 Assistant Professor of Clinical Education; Co... NaN
23 Associate Professor of Law\n NaN
24 Professor of Law Librarianship; Electronic Res... NaN
25 Professor of Law\n NaN
26 Professor of Law\n NaN
27 Associate Dean for Experiential Learning & Cli... NaN
28 Associate Professor of Law\n NaN
29 Assistant Professor of Clinical Education; Bus... NaN
30 Associate Professor of Law Librarianship; NaN
I would like to replace these titles with the following titles:
titles=["Adjunct Professor","Professor Emeritus","Associate Professor","Assistant Professor","Professor"]
How can I look for partial text and replace it? I don't want to fully replace the text if it's not a 100% match.
For example 'Visiting Assistant Professor of Law' should be replaced with 'Assistant Professor'
Thank you!
Use str.extract:
df['Title2'] = df['Title'].str.extract(f'({"|".join(titles)})')
output:
Title2
1 NaN
2 Professor
3 NaN
...
29 Assistant Professor
30 Associate Professor
If you want to keep the original Title in case of no match, use:
df['Title'] = df['Title'].str.extract(f'({"|".join(titles)})', expand=False).fillna(df['Title'])
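Two caveats worth adding (my notes, not part of the original answer): Python's re module returns the leftmost match, and among alternatives matching at the same position the first listed wins, so 'Professor Emeritus' must stay before the bare 'Professor' in titles, or those rows would extract just 'Professor'. And if the titles could ever contain regex metacharacters, escaping them first is safer:
import re

# escape any regex metacharacters before building the alternation
pattern = '|'.join(map(re.escape, titles))
df['Title2'] = df['Title'].str.extract(f'({pattern})', expand=False)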
Hope the title is not misleading.
I load an Excel file into a pandas dataframe as usual
df = pd.read_excel('complete.xlsx')
and this is what's inside (it is usually already sorted; this is a really small sample)
df
Out[21]:
Country City First Name Last Name Ref
0 England London John Smith 34
1 England London Bill Owen 332
2 England Brighton Max Crowe 25
3 England Brighton Steve Grant 55
4 France Paris Roland Tomas 44
5 France Paris Anatole Donnet 534
6 France Lyon Paulin Botrel 234
7 Spain Madrid Oriol Abarquero 34
8 Spain Madrid Alberto Olloqui 534
9 Spain Barcelona Ander Moreno 254
10 Spain Barcelona Cesar Aranda 222
What I need to do is automate an export of the data, creating a sqlite db for every country (e.g. 'England.sqlite'), which will contain a table for every city (e.g. London and Brighton), and every table will have the related personnel info.
The sqlite part is not a problem; I'm only trying to figure out how to "unpack" the dataframe in the most rapid and "pythonic" way.
Thanks
You can loop over the DataFrame.groupby object:
for i, subdf in df.groupby('Country'):
    print(i)
    print(subdf)
    # processing
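Building on that loop, a minimal sketch of the full export, assuming the column names shown above and that the Country and City values are safe to use as file and table names:
import sqlite3
import pandas as pd

# one sqlite database per country, one table per city inside it
for country, country_df in df.groupby('Country'):
    conn = sqlite3.connect(f'{country}.sqlite')
    for city, city_df in country_df.groupby('City'):
        # drop the grouping columns; they are implied by the db/table names
        city_df.drop(columns=['Country', 'City'])\
               .to_sql(city, conn, if_exists='replace', index=False)
    conn.close()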
How can I create new rows from an existing DataFrame by grouping by certain fields (in the example "Country" and "Industry") and applying some math to another field (in the example "Field" and "Value")?
Source DataFrame
df = pd.DataFrame({'Country': ['USA', 'USA', 'USA', 'USA', 'USA', 'USA', 'Canada', 'Canada'],
                   'Industry': ['Finance', 'Finance', 'Retail', 'Retail', 'Energy', 'Energy', 'Retail', 'Retail'],
                   'Field': ['Import', 'Export', 'Import', 'Export', 'Import', 'Export', 'Import', 'Export'],
                   'Value': [100, 50, 80, 10, 20, 5, 30, 10]})
Country Industry Field Value
0 USA Finance Import 100
1 USA Finance Export 50
2 USA Retail Import 80
3 USA Retail Export 10
4 USA Energy Import 20
5 USA Energy Export 5
6 Canada Retail Import 30
7 Canada Retail Export 10
Target DataFrame
Net = Import - Export
Country Industry Field Value
0 USA Finance Net 50
1 USA Retail Net 70
2 USA Energy Net 15
3 Canada Retail Net 20
There are quite possibly many ways. Here's one using groupby and unstack:
(df.groupby(['Country', 'Industry', 'Field'], sort=False)['Value']
   .sum()
   .unstack('Field')
   .eval('Import - Export')
   .reset_index(name='Value'))
Country Industry Value
0 USA Finance 50
1 USA Retail 70
2 USA Energy 15
3 Canada Retail 20
IIUC, subtract the Export values from the Import values on an aligned index (the question defines Net = Import - Export):
df = df.set_index(['Country', 'Industry'])
Newdf = (df.loc[df.Field == 'Import', 'Value'] - df.loc[df.Field == 'Export', 'Value'])\
        .reset_index().assign(Field='Net')
Newdf
  Country Industry  Value Field
0     USA  Finance     50   Net
1     USA   Retail     70   Net
2     USA   Energy     15   Net
3  Canada   Retail     20   Net
Use pivot_table:
df.pivot_table(index=['Country', 'Industry'], columns='Field', values='Value', aggfunc='sum').\
    diff(axis=1).\
    dropna(axis=1).\
    rename(columns={'Import': 'Value'}).\
    reset_index()
Out[112]:
Field Country Industry Value
0 Canada Retail 20.0
1 USA Energy 15.0
2 USA Finance 50.0
3 USA Retail 70.0
You can do it this way to add those rows to your original dataframe:
df.set_index(['Country', 'Industry', 'Field'])\
  .unstack()['Value']\
  .eval('Net = Import - Export')\
  .stack().rename('Value').reset_index()
Output:
Country Industry Field Value
0 Canada Retail Export 10
1 Canada Retail Import 30
2 Canada Retail Net 20
3 USA Energy Export 5
4 USA Energy Import 20
5 USA Energy Net 15
6 USA Finance Export 50
7 USA Finance Import 100
8 USA Finance Net 50
9 USA Retail Export 10
10 USA Retail Import 80
11 USA Retail Net 70
You can use GroupBy.diff() to take the within-group difference, then recreate the Field column, and finally drop the leftover rows with DataFrame.dropna:
# diff() gives Export - Import within each (Country, Industry) pair, so
# abs() flips the sign; this relies on each group having exactly one
# Import row followed by one Export row
df['Value'] = df.groupby(['Country', 'Industry'])['Value'].diff().abs()
df['Field'] = 'Net'
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Country Industry Field Value
0 USA Finance Net 50.0
1 USA Retail Net 70.0
2 USA Energy Net 15.0
3 Canada Retail Net 20.0
This answer takes advantage of the fact that pandas puts the group keys in the MultiIndex of the resulting Series. (If there were only one group key, you could use loc.)
>>> s = df.groupby(['Country', 'Industry', 'Field'])['Value'].sum()
>>> s.xs('Import', axis=0, level='Field') - s.xs('Export', axis=0, level='Field')
Country Industry
Canada Retail 20
USA Energy 15
Finance 50
Retail 70
Name: Value, dtype: int64
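To reshape that Series into the target DataFrame layout, a short follow-up sketch (my addition, building on s above):
net = s.xs('Import', level='Field') - s.xs('Export', level='Field')
result = net.rename('Value').reset_index().assign(Field='Net')
result = result[['Country', 'Industry', 'Field', 'Value']]  # match the target column order
print(result)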