I used pandas to get a list of all Email duplicates, but not every email duplicate is actually a duplicated contact: in a small company, for example, all employees may share the same email address.
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
a#company-a.com  Adam       Smith     34223  46456   Company A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
b#company-b.com  Elon       Musk      42342  34234   Company B
That's why I came up with the following condition to filter my email-duplicate list further:
I want to extract all the cases where Email, FirstName and LastName are equal, because that almost certainly means this is a real duplicate. The extracted dataframe should look like this in the end:
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
How can I get there? Is it possible to check multiple equality conditions at once?
I would appreciate any feedback regarding best practices.
Thank you!
Use DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first')
output
Email FirstName LastName Phone Mobile Company
0 a#company-a.com John Doe 12342 65464 Company_a
2 a#company-a.com Adam Smith 34223 46456 Company A
3 b#company-b.com Bill Gates 23423 63453 Company B
5 b#company-b.com Elon Musk 42342 34234 Company B
To get the duplicates
df[~df.index.isin(df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first').index)]
output
Email FirstName LastName Phone Mobile Company
1 a#company-a.com John Doe 43214 45645 Comp_ny A
4 b#company-b.com Bill Gates 32421 43244 Comp B
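Since your expected output keeps both rows of every real duplicate pair, a slightly more direct route (a sketch of the same idea, using DataFrame.duplicated with keep=False) is:

# Flag every row whose (Email, FirstName, LastName) combination occurs
# more than once; keep=False marks all members of each duplicate group.
mask = df.duplicated(subset=['Email', 'FirstName', 'LastName'], keep=False)
real_duplicates = df[mask]

This returns both John Doe rows and both Bill Gates rows in one step, matching the expected dataframe above.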
I have a csv file with multiple rows that I have converted to a pandas dataframe. However, some rows have columns 'name' and 'business' containing multiple names and businesses; these need to be split up and placed into individual rows, while keeping the data from the other columns the same for each split row.
Here is the example data:
input:
software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J.                    Unspecified, Unspecified
Output I'd like to get:
software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column; the business column and a few other columns still need to be split up in step with the name column.
# split the name column into one entry per name and stack into long form
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))

records = df.to_dict('records')  # 'records' (not 'record'); also avoids shadowing the dict builtin
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    # df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop(columns='name2')
Since pandas 1.3.0 it's possible to explode on multiple columns. So a simple solution would be to:
Split name on comma and business on commas after 'employees' or 'Unspecified' (implemented with regex below)
Explode on both name and business
This gives:
import pandas as pd
import io
data = '''software name business
abc Andrew Johnson, Steve Martin Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz Jack Jones, Rick Paul, Johnny Jones Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def Tom D., Connie J. Unspecified, Unspecified'''
df = pd.read_csv(io.StringIO(data), sep='\t')  # the fields in data are tab-separated
df['name'] = df['name'].str.split(', ')
df['business'] = df['business'].str.split(r'(?<=employees)\,\s*|(?<=Unspecified)\,\s*')
df = df.explode(['name','business']).reset_index(drop=True)
result:
  software  name            business
0 abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
1 abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
2 xyz       Jack Jones      Banking, 1001-5000 employees
3 xyz       Rick Paul       Construction, 51-200 employees
4 xyz       Johnny Jones    Consumer Goods, 10,001+ employees
5 def       Tom D.          Unspecified
6 def       Connie J.       Unspecified
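For pandas versions older than 1.3.0, a commonly used workaround (a sketch, assuming name and business have already been split into equal-length lists per row as above) is to replace the explode call: move the scalar software column into the index and explode every remaining list column with apply:

# 'software' goes into the index so it is repeated rather than exploded;
# each remaining list-valued column is then exploded in lockstep.
df = df.set_index('software').apply(pd.Series.explode).reset_index()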
I am trying to create a relationship between two data frames that are related, but there is no key that creates a relationship. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I had thought about approaching this problem is to iterate over the visits and add the visit id to a ride if the name matches, the ride occurred during or after the visit, and the time delta is the smallest difference seen so far (starting from a large initial time delta and updating the smallest difference as I go). If those conditions are not met, the existing value is kept. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where(
        (rides['name'] == row['name']) &
        (rides['date'] >= row['date']) &
        ((rides['date'] - row['date']) < rides['min_diff']),
        (row['id'], rides['date'] - row['date']),
        (rides['id'], rides['min_diff']))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1["date"] = pd.to_datetime(df1["date"])  # asof needs real datetimes, not strings
df2["date"] = pd.to_datetime(df2["date"])
df1 = df1.set_index("date").sort_index()  # asof requires a sorted index
# for each ride, take the id of the most recent earlier visit by the same person
df2["id"] = df2.apply(lambda x: df1[df1["name"] == x["name"]]["id"].asof(x["date"]), axis=1)
>>> df2
             name          ride                date  id
0      John Smith      Insanity 2020-07-01 13:53:07   0
1      John Smith   Bumper Cars 2020-07-01 16:37:29   0
2      John Smith  Tilt-A-Whirl 2020-07-02 08:21:18   0
3      John Smith      Insanity 2020-07-22 11:44:32   1
4        Jane Doe   Bumper Cars 2020-06-13 14:14:41   2
5        Jane Doe       Teacups 2020-06-13 17:31:56   2
6  Thomas Wallace      Insanity 2020-07-08 13:20:23   3
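A vectorized alternative (a sketch, not part of the original answer) is pd.merge_asof, which does exactly this "most recent earlier row" lookup. Starting again from the original df1 (visits) and df2 (rides), it assumes both date columns are already datetimes and requires both frames to be sorted by date:

# merge_asof needs both frames sorted by the 'on' key
visits = df1.sort_values('date')
rides = df2.sort_values('date')

# For each ride, take the latest visit (direction='backward') of the same
# person whose date is at or before the ride's date.
result = pd.merge_asof(rides, visits[['name', 'date', 'id']],
                       on='date', by='name', direction='backward')

Sorting changes the row order, so you may want to sort the result back by name or date afterwards.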
I think this does what you need. The ids aren't in the order you specified but they do represent visit ids with the logic you requested.
# The merge keys pair date with name and name with ride, which never match for
# this data, so the right join simply keeps every df2 row (with NaN for the df1
# columns); only the df2 columns name_y, ride and date_y are used afterwards.
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
# Group rides by person and calendar day; ngroup() numbers the groups, so
# each (name, day) pair gets its own visit id.
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4
I have a dataframe with two columns Person Name and Company Name. I want to create two more columns called Name and Name_Type. Name would be concat of Person and Company Name and Name_Type column would determine if the name is Person type or Company type. Some rows have empty strings, which creates four possibilities:
1) Empty Person + Empty Company = Can be left blank.
2) Empty Person + Company Name = Company Name Value
3) Person Name + Empty Company = Person Name Value
4) Both Names = Split them into two rows. Cannot figure out how to do that.
I am a Python and pandas beginner and I haven't come across an answer online. Hoping to find something here. Please excuse formatting or other errors.
Input:
df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
"Company_name": ["", "XYZ Inc", "ABC LLC", ""]})
  Company_name Person_name
0                    Aaron
1      XYZ Inc
2      ABC LLC        Phil
3                      Joe
Expected output:
  Company_name Person_name     Name    Name_Type
0                    Aaron    Aaron  Person_name
1      XYZ Inc              XYZ Inc Company_name
2      ABC LLC        Phil     Phil  Person_name
2      ABC LLC        Phil  ABC LLC Company_name
3                      Joe      Joe  Person_name
You can use apply, unstack and merge:

df = pd.DataFrame({"Person_name": ["Aaron", "", "Phil", "Joe"],
                   "Company_name": ["", "XYZ Inc", "ABC LLC", ""]})

def logic(row):
    # both names present: return two (name, type) pairs so the row is split in two
    if row.Company_name and row.Person_name:
        return pd.Series([[row.Person_name, "Person_name"], [row.Company_name, "Company_name"]])
    # otherwise return the single available name with its type
    else:
        return pd.Series([[row.Person_name, "Person_name"] if row.Person_name else [row.Company_name, "Company_name"]])

df2 = df.apply(logic, axis=1).unstack().apply(pd.Series).dropna().reset_index().set_index("level_1").sort_index()
dff = pd.merge(df, df2, left_index=True, right_index=True).iloc[:, [0, 1, 3, 4]]
dff.columns = ["Company_name", "Person_name", "Name", "Name_Type"]
Output
  Company_name Person_name     Name    Name_Type
0                    Aaron    Aaron  Person_name
1      XYZ Inc              XYZ Inc Company_name
2      ABC LLC        Phil     Phil  Person_name
2      ABC LLC        Phil  ABC LLC Company_name
3                      Joe      Joe  Person_name
Use:
# 'index' must be a real column for melt/merge, so make the index explicit first
df1 = df.reset_index()

(df1.melt('index', var_name='Name_Type', value_name='Name')
    .replace('', np.nan).dropna()
    .merge(df1, on='index').sort_values('index')
    .set_index('index'))
Output:
          Name_Type     Name Person_name Company_name
index
0       Person_name    Aaron       Aaron
1      Company_name  XYZ Inc                  XYZ Inc
2       Person_name     Phil        Phil      ABC LLC
2      Company_name  ABC LLC        Phil      ABC LLC
3       Person_name      Joe         Joe
I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks if the value != null, then do the command listed above, otherwise enter 'NA' for First_Name and 'Last_Name'
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure NULL is the issue. I have some names that are 3-4 words long, e.g.:
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split and take, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter, because it splits on whitespace by default) with str indexing to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
# data borrowed from A-Za-z's answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n to split only on the first whitespace, so everything after the first name becomes the last name:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should do it and fix your problem.
Setup
data = pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
# use a lambda function to check for NaN before splitting the column
data[['First_Name', 'Last_Name']] = data.apply(
    lambda x: pd.Series([np.nan, np.nan] if pd.isnull(x.director_name)
                        else x.director_name.split()),
    axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(
    lambda x: pd.Series([np.nan, np.nan] if pd.isnull(x.director_name)
                        else x.director_name.split()).iloc[:2],
    axis=1)
I have a Pandas Dataframe with three columns that follows this structure:
Employee email Manager
Smith asmith#example.com Johnson
Doe jdoe#example.com Smith
Johnson jjohnson#example.com Doe
... ... ...
And I need to add a column named "Manager's Email", containing the email of the Employee's Manager.
Employee email Manager Manager's Email
Smith asmith#example.com Johnson jjohnson#example.com
Doe jdoe#example.com Smith asmith#example.com
Johnson jjohnson#example.com Doe jdoe#example.com
... ... ... ...
So for example, for the Employee 'Smith', since his Manager is 'Johnson', the value of 'Manager's Email' for that row would be the email of the Employee 'Johnson'.
You can use pandas.DataFrame.merge:
>>> dfN = pd.merge(df, df, how='left', left_on='Manager', right_on='Employee', suffixes=['',"_m"])
>>> del dfN['Employee_m']
>>> del dfN['Manager_m']
>>> dfN = dfN.rename(columns={'email_m':"Manager's email"})
>>> dfN
Employee Manager email Manager's email
0 Smith Johnson asmith#example.com jjohnson#example.com
1 Doe Smith jdoe#example.com asmith#example.com
2 Johnson Doe jjohnson#example.com jdoe#example.com
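An alternative without a merge (a sketch of the same lookup using Series.map, assuming Employee values are unique) is to index the email column by Employee and map the Manager column through it:

# Build an Employee -> email lookup, then map each Manager name through it
email_by_employee = df.set_index('Employee')['email']
df["Manager's email"] = df['Manager'].map(email_by_employee)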