I have two Excel sheets that I have loaded. I need to add information from one to the other. See the example below.
table 1:
cust_id  fname  lname  date_registered
1        bob    holly  1/1/80
2        terri  jones  2/3/90
table 2:
fname     lname   date_registered  cust_id  zip
lawrence  fisher  2/3/12           3        12345
So I need to add cust_id 3 from table 2 to table 1, along with the other information: fname, lname, and date_registered. I don't need all of the columns, though, such as zip.
I am thinking I can use pandas merge, but I am new to all this and not sure how it works. I need to populate the next row in table 1 with the corresponding row information from table 2. Any information would be helpful. Thanks!
With concat:
In [1]: import pandas as pd
In [2]: table_1 = pd.DataFrame({'cust_id':[1,2], 'fname':['bob', 'terri'], 'lname':['holly', 'jones'], 'date_registered':['1/1/80', '2/3/90']})
In [3]: table_2 = pd.DataFrame({'cust_id':[3], 'fname':['lawrence'], 'lname':['fisher'], 'date_registered':['2/3/12'], 'zip':[12345]})
In [4]: final_table = pd.concat([table_1, table_2])
In [5]: final_table
Out[5]:
   cust_id date_registered     fname   lname      zip
0        1          1/1/80       bob   holly      NaN
1        2          2/3/90     terri   jones      NaN
0        3          2/3/12  lawrence  fisher  12345.0
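Note the repeated row label 0 in the output above. If you would rather have a fresh 0..n-1 index, pass ignore_index=True:
In [6]: final_table = pd.concat([table_1, table_2], ignore_index=True)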
Use append (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0):
appended = table_1.append(table_2[table_1.columns])
or concat:
concated = pd.concat([table_1, table_2], join='inner')
Both result in
   cust_id     fname   lname date_registered
0        1       bob   holly          1/1/80
1        2     terri   jones          2/3/90
0        3  lawrence  fisher          2/3/12
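Since append is gone in modern pandas, the column-filtered variant above can be written with concat alone; an equivalent sketch (ignore_index just renumbers the rows):
appended = pd.concat([table_1, table_2[table_1.columns]], ignore_index=True)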
Related
I have two CSV files: one containing vendor data and one containing employee data. Similar to what "Fuzzy Lookup" does in Excel, I'm looking to do two types of matches and output all columns from both CSV files, plus a new column as the similarity ratio for each row. In Excel I would use a 0.80 threshold. Below is sample data; my actual data has 2 million rows in one of the files, which is going to be a nightmare if done in Excel.
Output 1:
From the Vendor file, fuzzy match "Vendor Name" with "Employee Name" from the Employee file. Display all columns from both files and a new column for the Similarity Ratio.
Output 2:
From the Vendor file, fuzzy match "SSN" with "SSN" from the Employee file. Display all columns from both files and a new column for the Similarity Ratio.
These are two separate outputs.
Dataframe 1: Vendor Data

| Company | Vendor ID | Vendor Name    | Invoice Number | Transaction Amt | Vendor Type | SSN         |
|---------|-----------|----------------|----------------|-----------------|-------------|-------------|
| 15      | 58421     | CLIFFORD BROWN | 854            | 500             | Misc        | 668419628   |
| 150     | 9675      | GREEN          | 7412           | 70              | One Time    | 774801971   |
| 200     | 15789     | SMITH, JOHN    | 80             | 40              | Employee    | 965214872   |
| 200     | 69997     | HAROON, SIMAN  | 964            | 100             | Misc        | 741-98-7821 |
Dataframe 2: Employee Data

| Employee Name   | Employee ID | Manager   | SSN         |
|-----------------|-------------|-----------|-------------|
| BROWN, CLIFFORD | 1           | Manager 1 | 668-419-628 |
| BLUE, CITY      | 2           | Manager 2 | 874126487   |
| SMITH, JOHN     | 3           | Manager 3 | 965-21-4872 |
| HAROON, SIMON   | 4           | Manager 4 | 741-98-7820 |
Expected output 1 - Match Name

| Employee Name   | Employee ID | Manager   | SSN         | Company | Vendor ID | Vendor Name    | Invoice Number | Transaction Amt | Vendor Type | SSN         | Similarity Ratio |
|-----------------|-------------|-----------|-------------|---------|-----------|----------------|----------------|-----------------|-------------|-------------|------------------|
| BROWN, CLIFFORD | 1           | Manager 1 | 668-419-628 | 150     | 58421     | CLIFFORD BROWN | 854            | 500             | Misc        | 668419628   | 1.00             |
| SMITH, JOHN     | 3           | Manager 3 | 965-21-4872 | 200     | 15789     | SMITH, JOHN    | 80             | 40              | Employee    | 965214872   | 1.00             |
| HAROON, SIMON   | 4           | Manager 4 | 741-98-7820 | 200     | 69997     | HAROON, SIMAN  | 964            | 100             | Misc        | 741-98-7821 | 0.96             |
| BLUE, CITY      | 2           | Manager 2 | 874126487   |         |           |                |                |                 |             |             | 0.00             |
Expected output 2 - Match SSN

| Employee Name   | Employee ID | Manager   | SSN         | Company | Vendor ID | Vendor Name     | Invoice Number | Transaction Amt | Vendor Type | SSN         | Similarity Ratio |
|-----------------|-------------|-----------|-------------|---------|-----------|-----------------|----------------|-----------------|-------------|-------------|------------------|
| BROWN, CLIFFORD | 1           | Manager 1 | 668-419-628 | 150     | 58421     | CLIFFORD, BROWN | 854            | 500             | Misc        | 668419628   | 0.97             |
| SMITH, JOHN     | 3           | Manager 3 | 965-21-4872 | 200     | 15789     | SMITH, JOHN     | 80             | 40              | Employee    | 965214872   | 0.97             |
| BLUE, CITY      | 2           | Manager 2 | 874126487   |         |           |                 |                |                 |             |             | 0.00             |
| HAROON, SIMON   | 4           | Manager 4 | 741-98-7820 |         |           |                 |                |                 |             |             | 0.00             |
I've tried the code below:
import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')

matched_names = []
for row1 in df1.index:
    name1 = df1.at[row1, 'Vendor Name']
    for row2 in df2.index:
        name2 = df2.at[row2, 'Employee Name']
        match = fuzz.ratio(name1, name2)
        if match > 80:  # this is the threshold
            matched_names.append([name1, name2, match])

df_ratio = pd.DataFrame(columns=['Vendor Name', 'Employee Name', 'match'], data=matched_names)
df_ratio.to_csv(r'directory\MatchingResults.csv', encoding='utf-8')
I'm just not getting the results I want and am ready to rewrite the whole script. Any suggestions would help. Please note that I'm fairly new to Python, so be gentle; I am totally open to a new approach to this example.
September 23 Update:
Still having trouble... I can get the similarity ratio now, but I'm not getting all the columns from both CSV files. The issue is that the two files are completely different, so when I concat them it gives NaN values. Any suggestions? New code below:
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Workday Data.xlsx')

df1['full_name'] = df1['Vendor Name']
df2['full_name'] = df2['Employee Name']
df1_name = df1['full_name']
df2_name = df2['full_name']

frames = [df1, df2]
df = pd.concat(frames).reset_index(drop=True)

# pairwise similarity of every name against every name
dist = [fuzz.ratio(*x) for x in product(df.full_name, repeat=2)]
dfresult = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]),
                        columns=df.full_name.values.tolist())

# create a list of dataframes, one per row of the score matrix
listOfDfs = [dfresult.loc[idx] for idx in np.split(dfresult.index, df.shape[0])]
DataFrameDict = {df['full_name'][i]: listOfDfs[i] for i in range(dfresult.shape[0])}
for name in DataFrameDict.keys():
    print(name)
    # print(DataFrameDict[name])

pd.DataFrame(list(DataFrameDict.items())).to_excel(r'Directory\TestOutput.xlsx', index=False)
To concatenate the two DataFrames horizontally, I aligned the Employees DataFrame by the index of the matched Vendor Name. If no Vendor Name was matched, I just put an empty row instead.
In more detail:
I iterated over the vendor names, and for each one I appended the index of the best-scoring employee name to a list of indices. Note that I added at most one matched employee record to each vendor name.
If no match was found (the score was too low), I added the index of an empty record that I had appended manually to the Employees DataFrame.
This list of indices is then used to reorder the Employees DataFrame.
Finally, I just merge the two DataFrames horizontally. Note that the two DataFrames don't have to be the same size at this point; concat simply fills the gap by padding the smaller DataFrame with NaN rows.
The code is as follows:
import numpy as np
import pandas as pd
from thefuzz import process as fuzzy_process  # thefuzz is the renamed successor of fuzzywuzzy

# import dataframes
...

# add an empty all-NaN row to point unmatched vendors at
# (DataFrame.append was removed in pandas 2.0, so concat is used here instead)
employees_df = pd.concat([employees_df, pd.DataFrame(index=[0])], ignore_index=True)
index_of_empty = len(employees_df) - 1

# matching between vendor and employee names
indexed_employee_names_dict = dict(enumerate(employees_df["Employee Name"]))
matched_employees = set()
ordered_employees = []
scores = []
for vendor_name in vendors_df["Vendor Name"]:
    match = fuzzy_process.extractOne(
        query=vendor_name,
        choices=indexed_employee_names_dict,
        score_cutoff=80
    )
    # extractOne returns (choice, score, key) for dict choices, or None below the cutoff
    score, index = match[1:] if match is not None else (0.0, index_of_empty)
    matched_employees.add(index)
    ordered_employees.append(index)
    scores.append(score)

# detect unmatched employees to be positioned at the end of the dataframe
missing_employees = [i for i in range(len(employees_df)) if i not in matched_employees]
ordered_employees.extend(missing_employees)
ordered_employees_df = employees_df.iloc[ordered_employees].reset_index()

merged_df = pd.concat([vendors_df, ordered_employees_df], axis=1)

# adding the scores column and sorting by its values
scores.extend([0] * len(missing_employees))
merged_df["Similarity Ratio"] = pd.Series(scores) / 100
merged_df = merged_df.sort_values("Similarity Ratio", ascending=False)
Matching on the SSN columns can be done in exactly the same way, by just replacing the column names in the above code. Moreover, the process can be generalized into a function that accepts DataFrames and column names:
def match_and_merge(df1: pd.DataFrame, df2: pd.DataFrame, col1: str, col2: str, cutoff: int = 80):
    # add an empty all-NaN row to point unmatched rows at
    # (DataFrame.append was removed in pandas 2.0, so concat is used here instead)
    df2 = pd.concat([df2, pd.DataFrame(index=[0])], ignore_index=True)
    index_of_empty = len(df2) - 1
    # matching between the values of the two columns
    indexed_strings_dict = dict(enumerate(df2[col2]))
    matched_indices = set()
    ordered_indices = []
    scores = []
    for s1 in df1[col1]:
        match = fuzzy_process.extractOne(
            query=s1,
            choices=indexed_strings_dict,
            score_cutoff=cutoff
        )
        score, index = match[1:] if match is not None else (0.0, index_of_empty)
        matched_indices.add(index)
        ordered_indices.append(index)
        scores.append(score)
    # detect unmatched rows of df2 to be positioned at the end of the dataframe
    missing_indices = [i for i in range(len(df2)) if i not in matched_indices]
    ordered_indices.extend(missing_indices)
    ordered_df2 = df2.iloc[ordered_indices].reset_index()
    # merge rows of the dataframes
    merged_df = pd.concat([df1, ordered_df2], axis=1)
    # adding the scores column and sorting by its values
    scores.extend([0] * len(missing_indices))
    merged_df["Similarity Ratio"] = pd.Series(scores) / 100
    return merged_df.sort_values("Similarity Ratio", ascending=False)
if __name__ == "__main__":
    vendors_df = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
    employees_df = pd.read_excel(r'Directory\Sample Workday Data.xlsx')
    merged_df = match_and_merge(vendors_df, employees_df, "Vendor Name", "Employee Name")
    merged_df.to_excel("merged_by_names.xlsx", index=False)
    merged_df = match_and_merge(vendors_df, employees_df, "SSN", "SSN")
    merged_df.to_excel("merged_by_ssn.xlsx", index=False)
The above code produces the following outputs:
merged_by_names.xlsx

| Company | Vendor ID | Vendor Name    | Invoice Number | Transaction Amt | Vendor Type | SSN         | index | Employee Name   | Employee ID | Manager   | SSN         | Similarity Ratio |
|---------|-----------|----------------|----------------|-----------------|-------------|-------------|-------|-----------------|-------------|-----------|-------------|------------------|
| 200     | 15789     | SMITH, JOHN    | 80             | 40              | Employee    | 965214872   | 2     | SMITH, JOHN     | 3           | Manager 3 | 965-21-4872 | 1                |
| 15      | 58421     | CLIFFORD BROWN | 854            | 500             | Misc        | 668419628   | 0     | BROWN, CLIFFORD | 1           | Manager 1 | 668-419-628 | 0.95             |
| 200     | 69997     | HAROON, SIMAN  | 964            | 100             | Misc        | 741-98-7821 | 3     | HAROON, SIMON   | 4           | Manager 4 | 741-98-7820 | 0.92             |
| 150     | 9675      | GREEN          | 7412           | 70              | One Time    | 774801971   | 4     | nan             | nan         | nan       | nan         | 0                |
| nan     | nan       | nan            | nan            | nan             | nan         | nan         | 1     | BLUE, CITY      | 2           | Manager 2 | 874126487   | 0                |
merged_by_ssn.xlsx

| Company | Vendor ID | Vendor Name    | Invoice Number | Transaction Amt | Vendor Type | SSN         | index | Employee Name   | Employee ID | Manager   | SSN         | Similarity Ratio |
|---------|-----------|----------------|----------------|-----------------|-------------|-------------|-------|-----------------|-------------|-----------|-------------|------------------|
| 200     | 69997     | HAROON, SIMAN  | 964            | 100             | Misc        | 741-98-7821 | 3     | HAROON, SIMON   | 4           | Manager 4 | 741-98-7820 | 0.91             |
| 15      | 58421     | CLIFFORD BROWN | 854            | 500             | Misc        | 668419628   | 0     | BROWN, CLIFFORD | 1           | Manager 1 | 668-419-628 | 0.9              |
| 200     | 15789     | SMITH, JOHN    | 80             | 40              | Employee    | 965214872   | 2     | SMITH, JOHN     | 3           | Manager 3 | 965-21-4872 | 0.9              |
| 150     | 9675      | GREEN          | 7412           | 70              | One Time    | 774801971   | 4     | nan             | nan         | nan       | nan         | 0                |
| nan     | nan       | nan            | nan            | nan             | nan         | nan         | 1     | BLUE, CITY      | 2           | Manager 2 | 874126487   | 0                |
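A side note on scale: the question mentions ~2 million rows, and a pure-Python loop over thefuzz scorers can be slow at that size. Below is a minimal sketch (not part of the answer above) using rapidfuzz, a faster and largely API-compatible successor to thefuzz; it assumes rapidfuzz is installed (pip install rapidfuzz) and that vendors_df and employees_df are loaded as in the code above:

from rapidfuzz import fuzz, process

vendor_names = vendors_df["Vendor Name"].fillna("").tolist()
employee_names = employees_df["Employee Name"].fillna("").tolist()

# compute the full similarity matrix in parallel (workers=-1 uses all CPU cores)
scores = process.cdist(vendor_names, employee_names, scorer=fuzz.ratio, workers=-1)

best_idx = scores.argmax(axis=1)   # index of the best-matching employee per vendor
best_score = scores.max(axis=1)    # its score, on a 0-100 scale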
I have two dataframes of a format similar to below:
df1:
   fname  lname    note
1  abby   ross     note1
2  rob    william  note2
3  abby   ross     note3
4  john   doe      note4
5  bob    dole     note5
df2:
   fname  lname    note
1  abby   ross     note6
2  rob    william  note4
I want to merge, finding matches based on fname and lname, and then update the note column in the first DataFrame with the note column from the second DataFrame.
The result I am trying to achieve would look like this:
   fname  lname    note
1  abby   ross     note6
2  rob    william  note4
3  abby   ross     note6
4  john   doe      note4
5  bob    dole     note5
This is the code I was working with so far:
pd.merge(df1, df2, on=['fname', 'lname'], how='left')
but it just creates a new note column with _y appended to it. How can I get it to just update that column?
Any help would be greatly appreciated, thanks!
You can merge and then correct the values:
df_3 = pd.merge(df1, df2, on=['fname', 'lname'], how='outer')
df_3['note'] = df_3['note_y']  # take the updated note where a match was found
df_3.loc[df_3['note'].isna(), 'note'] = df_3['note_x']  # otherwise keep the original note
df_3 = df_3.drop(['note_x', 'note_y'], axis=1)
Do what you are doing (out_df = pd.merge(df1, df2, on=['fname', 'lname'], how='left')), then:
# fill NaN values in note_y with the original note
out_df['note_y'] = out_df['note_y'].fillna(out_df['note_x'])
# keep the columns you want
out_df = out_df[['fname', 'lname', 'note_y']]
# rename the columns
out_df.columns = ['fname', 'lname', 'note']
I don't like this approach a whole lot, as it won't be very scalable or generalizable. Waiting for a stellar answer to this question.
Try with update
df1=df1.set_index(['fname','lname'])
df1.update(df2.set_index(['fname','lname']))
df1=df1.reset_index()
df1
Out[55]:
fname lname 0 note
0 abby ross 1.0 note6
1 rob william 2.0 note4
2 john doe 3.0 note3
3 bob dole 4.0 note4
I have the following DataFrame with firstname and lastname columns, and I want to create a fullname column.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'firstname': ['jack', 'john', 'donald'],
                    'lastname': [np.nan, 'obrien', 'trump']})
print(df1)
firstname lastname
0 jack NaN
1 john obrien
2 donald trump
This works if there are no NaN values:
df1['fullname'] = df1['firstname']+df1['lastname']
However, since there are NaNs in my DataFrame, I decided to cast to string first, but that causes a problem in the fullname column:
df1['fullname'] = str(df1['firstname'])+str(df1['lastname'])
firstname lastname fullname
0 jack NaN 0 jack\n1 john\n2 donald\nName: f...
1 john obrien 0 jack\n1 john\n2 donald\nName: f...
2 donald trump 0 jack\n1 john\n2 donald\nName: f...
I can write some function that checks for nans and inserts the data into the new frame, but before I do that - is there another fast method to combine these strings into one column?
You need to handle the NaNs with .fillna(); here, you can fill them with the empty string '':
df1['fullname'] = df1['firstname'] + ' ' + df1['lastname'].fillna('')
Output:
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
You may also use .add and specify a fill_value
df1.firstname.add(" ").add(df1.lastname, fill_value="")
PS: Chaining too many adds or + is not recommended for strings, but for one or two columns you should be fine
df1['fullname'] = df1['firstname']+df1['lastname'].fillna('')
There is also Series.str.cat, which can handle NaN and includes the separator.
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="")
firstname lastname fullname
0 jack NaN jack
1 john obrien john obrien
2 donald trump donald trump
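Minor caveat: na_rep replaces the missing value but keeps the separator, so "jack" actually comes back as "jack " with a trailing space. Chaining .str.strip() (my addition, not part of the answer above) tidies that up:
df1["fullname"] = df1["firstname"].str.cat(df1["lastname"], sep=" ", na_rep="").str.strip()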
What I would do for the case where more than two columns need to be joined (stack drops NaN by default, so missing names are skipped automatically):
df1.stack().groupby(level=0).agg(' '.join)
Out[57]:
0 jack
1 john obrien
2 donald trump
dtype: object
So I have data like this:
Id  Title                 Fname   lname  email
1   meeting with Jay, Aj  Jay     kay    jk#something.com
1   meeting with Jay, Aj  Aj      xyz    aj#something.com
2   call with Steve       Steve   Jack   st#something.com
2   call with Steve       Harvey  Ray    h#something.com
3   lunch                 Mike    Mil    m#something.com
I want to remove the first name and last name of each unique Id from the Title.
I tried grouping by Id, which gives Series objects for Title, Fname, lname, etc.:
df.groupby('Id')
I've concatenated Fname with .agg(lambda x: x.sum() if x.dtype == 'float64' else ','.join(x)) and kept the result in a concated DataFrame; all the other columns get aggregated the same way. The question is how to replace the values in Title based on this aggregated series.
concated['newTitle'] = [concated.Title.str.replace(e[0], '').replace(e[1], '')
                        for e in zip(concated.FName.str.split(','),
                                     concated.LName.str.split(','))]
I want something like this, or some other way by which, for each Id, I can get a newTitle with the name values removed.
The output should be like:
Id Title
1 Meeting with ,
2 call with
3 lunch
Create a mapper series by joining Fname and lname, then replace:
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(x.stack()))
df.set_index('Id')['Title'].replace(s, '', regex = True).drop_duplicates()
Id
1 meeting with ,
2 call with
3 lunch
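One caveat on this trick: with regex=True the joined names are treated as regular expressions (that is what makes the '|' alternation work), so names containing regex metacharacters should be escaped when building the mapper. A hedged variant of the same line:
import re
s = df.groupby('Id')[['Fname', 'lname']].apply(lambda x: '|'.join(map(re.escape, x.stack())))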
I have a large list of names and I am trying to cull the duplicates. I am grouping them by name and consolidating the info if need be.
When two people don't have the same name there is no problem: we can just ffill and bfill. However, if two people have the same name, we need to do some extra checks.
This is an example of a group:
      name       code         id          country  yob
1137  Bobby Joe  USA19921111  NaN         NaN      NaN
2367  Bobby Joe  NaN          1223133121  USA      1992
4398  Bobby Joe  USA19981111  NaN         NaN      NaN
The code contains the person's country and birthdate. Looking at it, we can see that the first and second rows are the same person, so we need to fill the info from the second row into the first row:
      name       code         id          country  yob
1137  Bobby Joe  USA19921111  1223133121  USA      1992
4398  Bobby Joe  USA19981111  NaN         NaN      NaN
Here is what I have:
# Get create a dictionry of all of the rows that contain
# codes and their indexes
code_rows = dict(zip(list(group['code'].dropna().index),
group['code'].dropna().unique()))
no_code_rows = group.loc[pd.isnull(group['code']), :]
if no_code_rows.empty or len(code_rows) == group.shape[0]:
# No info to consolidate
return group
for group_idx, code in code_rows.items():
for row_idx, row in no_code_rows.iterrows():
country_yob = row['country'] + str(int(row['yob']))
if country_yob in code:
group.loc[group_idx, 'id'] = row['id']
group.loc[group_idx, 'country'] = row['country']
group.loc[group_idx, 'yob'] = row['yob']
group.drop(row_idx, inplace=True)
# Drop from temp table so we don't have to iterate
# over an extra row
no_code_rows.drop(row_idx, inplace=True)'''
break
return group
This works, but I have a feeling I am missing something. I feel like I shouldn't need two loops for this, and that maybe there is a pandas function?
EDIT
We don't know the order or how many rows we will have in each group
i.e.
      name       code         id          country  yob
1137  Bobby Joe  USA19921111  NaN         NaN      NaN
2367  Bobby Joe  USA19981111  NaN         NaN      NaN
4398  Bobby Joe  NaN          1223133121  USA      1992
I think you need:
m = df['code'].isnull()
df1 = df[~m]  # rows that have a code
df2 = df[m]   # rows without a code

# pair every coded row with the code-less rows of the same name
df = df1.merge(df2, on='name', suffixes=('', '_'))

# rebuild country + yob and test whether it appears in the code
df['a_'] = df['country_'] + df['yob_'].astype(str)
m = df.apply(lambda x: x['a_'] in x['code'], axis=1)

# where it matches, copy id/country/yob over from the code-less row
df.loc[m, ['id', 'country', 'yob']] = df.loc[m, ['id_', 'country_', 'yob_']].rename(columns=lambda x: x.strip('_'))

# drop the helper columns
df = df.loc[:, ~df.columns.str.endswith('_')]
print(df)

        name         code          id country   yob
0  Bobby Joe  USA19921111  1223133121     USA  1992
1  Bobby Joe  USA19981111         NaN     NaN   NaN