Mapping one column with another two - python

I have a Pandas Dataframe with three columns that follows this structure:
Employee  email                 Manager
Smith     asmith#example.com    Johnson
Doe       jdoe#example.com      Smith
Johnson   jjohnson#example.com  Doe
...       ...                   ...
And I need to add a column named "Manager's Email", containing the email of the Employee's Manager.
Employee  email                 Manager  Manager's Email
Smith     asmith#example.com    Johnson  jjohnson#example.com
Doe       jdoe#example.com      Smith    asmith#example.com
Johnson   jjohnson#example.com  Doe      jdoe#example.com
...       ...                   ...      ...
So, for example, for the Employee 'Smith', since his Manager is 'Johnson', the value of 'Manager's Email' for that row would be the email of the Employee 'Johnson'.

You can use pandas.DataFrame.merge:
>>> dfN = pd.merge(df, df, how='left', left_on='Manager', right_on='Employee', suffixes=('', '_m'))
>>> del dfN['Employee_m']
>>> del dfN['Manager_m']
>>> dfN = dfN.rename(columns={'email_m':"Manager's email"})
>>> dfN
  Employee                 email  Manager       Manager's email
0    Smith    asmith#example.com  Johnson  jjohnson#example.com
1      Doe      jdoe#example.com    Smith    asmith#example.com
2  Johnson  jjohnson#example.com      Doe      jdoe#example.com
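An alternative that avoids the merge and the column cleanup is a lookup with Series.map (a minimal sketch, assuming each Employee appears exactly once, as in the example):
# Build an Employee -> email lookup, then translate each Manager through it.
lookup = df.set_index('Employee')['email']
df["Manager's email"] = df['Manager'].map(lookup)
Rows whose Manager does not appear in the Employee column get NaN rather than raising.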


Assign actual value after groupby.transform to all rows associated with it. Python

I need to assign a name to a new 'responsible' column for all rows associated with a customer.
If part of the string in 'codes' contains 'manager', the manager's name should be assigned to the 'responsible' column. If there is no 'manager' in the codes column, the 'responsible' column should be populated with the 'empl_name' associated with the row.
original df:
df = pd.DataFrame({'cust_name': ['john', 'liza', 'john', 'john', 'liza', 'david', 'john', 'liza', 'david', 'chris'],
                   'empl_name': ['mike', 'nick', 'kate', 'mike', 'mike', 'kate', 'mike', 'mike', 'mike', 'jennifer'],
                   'codes': ['empl, office', 'manager_1, remote', 'empl, remote', 'empl, remote', 'empl, office',
                             'empl, remote', 'empl, remote', 'empl, office', 'empl, remote', 'manager_2, office']})
looks like:
cust_name  empl_name  codes
john       mike       empl, office
liza       nick       manager_1, remote
john       kate       empl, remote
john       mike       empl, remote
liza       mike       empl, office
david      kate       empl, remote
john       mike       empl, remote
liza       mike       empl, office
david      mike       empl, remote
chris      jennifer   manager_2, office
output should be:
cust_name  empl_name  codes              responsible
john       mike       empl, office       mike
liza       nick       manager_1, remote  nick
john       kate       empl, remote       kate
john       mike       empl, remote       mike
liza       mike       empl, office       nick
david      kate       empl, remote       kate
john       mike       empl, remote       mike
liza       mike       empl, office       nick
david      mike       empl, remote       mike
chris      jennifer   manager_2, office  jennifer
Just assign a value to a new column (this addressed an earlier revision of the question, which had a single 'code' column with values like 'empl' and 'manager'):
df['manager_name'] = df.loc[df['code']=='manager', 'empl_name'].iloc[0]
Added a case where there is no manager name:
names = df[['empl_name', 'code']].drop_duplicates()
df['manager_name'] = (names.loc[names['code']=='manager', 'empl_name'].iloc[0]
                      if len(names) > 1 else names.loc[0, 'empl_name'])
Output:
  cust_name empl_name     code manager_name
0      john      mike     empl         nick
1      john      mike     empl         nick
2      john      nick  manager         nick
3      john      mike     empl         nick
4      john      nick  manager         nick
Edit:
You can groupby cust_name and apply a custom function that does what you want:
def assign_responsible(x):
    mask = x['codes'].str.contains('manager')
    if sum(mask) > 0:
        x['responsible'] = x.loc[mask, 'empl_name'].iloc[0]
    else:
        x['responsible'] = x['empl_name']
    return x

df = df.groupby('cust_name').apply(assign_responsible)
Output:
  cust_name empl_name              codes responsible
0      john      mike       empl, office        mike
1      liza      nick  manager_1, remote        nick
2      john      kate       empl, remote        kate
3      john      mike       empl, remote        mike
4      liza      mike       empl, office        nick
5     david      kate       empl, remote        kate
6      john      mike       empl, remote        mike
7      liza      mike       empl, office        nick
8     david      mike       empl, remote        mike
9     chris  jennifer  manager_2, office    jennifer
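If you prefer to avoid groupby.apply, the same result can be obtained with a vectorized groupby.transform (a minimal sketch, using the same df as above):
# Keep empl_name only on manager rows, broadcast the first manager name per
# customer with transform('first'), then fall back to the row's own empl_name.
has_mgr = df['codes'].str.contains('manager')
mgr = df['empl_name'].where(has_mgr).groupby(df['cust_name']).transform('first')
df['responsible'] = mgr.fillna(df['empl_name'])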

Splitting up string contents of 2 or more columns in a pandas dataframe and appending to new rows

I have multiple rows in a csv file that I have converted to a pandas data frame. In some rows the columns 'name' and 'business' contain multiple names and businesses that need to be split up and placed into individual rows, while keeping the data from the other columns the same for each row that is split.
Here is the example data:
input:
software  name                                 business
abc       Andrew Johnson, Steve Martin         Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones, Rick Paul, Johnny Jones  Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def       Tom D., Connie J.                    Unspecified, Unspecified
output I'd like to get:
software  name            business
abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
xyz       Jack Jones      Banking, 1001-5000 employees
xyz       Rick Paul       Construction, 51-200 employees
xyz       Johnny Jones    Consumer Goods, 10,001+ employees
def       Tom D           Unspecified
def       Connie J        Unspecified
There are additional columns similar to 'name' and 'business' that contain multiple pieces of information that need to be split up just like 'name' and 'business'. Cells that contain multiple pieces of information are in sequence (ordered).
Here's the code I have so far. It creates new rows, but it only splits up the contents of the name column, which leaves the business column and a few other columns still needing to be split up along with the contents from the name column.
name2 = df.name.str.split(',', expand=True).stack()
df = df.join(pd.Series(index=name2.index.droplevel(1), data=name2.values, name='name2'))
records = df.to_dict('records')
for row in records:
    new_segment = {}
    new_segment['name'] = str(row['name2'])
    #df['name'] = str(row['name2'])
    for col, content in new_segment.items():
        row[col] = content
df = pd.DataFrame.from_dict(records)
df = df.drop('name2', axis=1)
Since pandas 1.3.0 it's possible to explode on multiple columns. So a simple solution would be to:
1. Split name on comma, and business on commas that follow 'employees' or 'Unspecified' (implemented with regex below)
2. Explode on both name and business
This gives:
import pandas as pd
import io
data = '''software name business
abc Andrew Johnson, Steve Martin Outsourcing/Offshoring, 201-500 employees,Health, Wellness and Fitness, 5001-10,000 employees
xyz Jack Jones, Rick Paul, Johnny Jones Banking, 1001-5000 employees,Construction, 51-200 employees,Consumer Goods, 10,001+ employees
def Tom D., Connie J. Unspecified, Unspecified'''
df = pd.read_csv(io.StringIO(data), sep = '\t')
df['name'] = df['name'].str.split(', ')
df['business'] = df['business'].str.split(r'(?<=employees),\s*|(?<=Unspecified),\s*')
df = df.explode(['name','business']).reset_index(drop=True)
result:
   software  name            business
0  abc       Andrew Johnson  Outsourcing/Offshoring, 201-500 employees
1  abc       Steve Martin    Health, Wellness and Fitness, 5001-10,000 employees
2  xyz       Jack Jones      Banking, 1001-5000 employees
3  xyz       Rick Paul       Construction, 51-200 employees
4  xyz       Johnny Jones    Consumer Goods, 10,001+ employees
5  def       Tom D.          Unspecified
6  def       Connie J.       Unspecified
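Since the question mentions additional columns that need the same treatment, the approach generalizes by splitting every list-like column before a single multi-column explode (a sketch; the patterns dict and its entries are illustrative, not from the original data):
# Map each column to the pattern that splits it, then explode all at once.
patterns = {
    'name': r',\s*',
    'business': r'(?<=employees),\s*|(?<=Unspecified),\s*',
}
for col, pat in patterns.items():
    df[col] = df[col].str.split(pat)
df = df.explode(list(patterns)).reset_index(drop=True)
Note that exploding multiple columns requires the lists in each row to have equal lengths, which holds here because the cells are in the same order.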

Relationship based on time

I am trying to create a relationship between two data frames that are related, but there is no key that creates a relationship. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I had thought about approaching this problem is to iterate over the visits and assign the visit id to a ride if the name matches, the ride occurred during/after the visit, and the time delta is the smallest difference seen so far (starting from a large initial time delta and updating it as smaller differences are found). If those conditions are not met, just keep the existing value. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where(
        (rides['name'] == row['name']) &
        (rides['date'] >= visits['date']) &
        ((rides['date'] - row['date']) < rides['min_diff']),
        (row['id'], rides['date'] - row['date']),
        (rides['id'], rides['min_diff']))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
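For larger frames, pandas.merge_asof does the same backward-looking match without a per-row apply (a minimal sketch, assuming the lowercase column names from the question and dates already parsed with pd.to_datetime):
# For each ride, take the most recent visit by the same person at or before
# the ride's timestamp; merge_asof requires both frames sorted on the key.
out = pd.merge_asof(rides.sort_values('date'),
                    visits.sort_values('date'),
                    on='date', by='name', direction='backward')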
I think this does what you need. The ids aren't in the order you specified, but they do represent visit ids grouped by person and calendar day (note that a multi-day visit is split: John Smith's 07-02 ride gets its own id).
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4

Finding Duplicates based on equal values in multiple columns

I used pandas to get a list of all Email duplicates, but not all email duplicates are in fact duplicates of a contact, because a company may be small enough that all employees share the same email address, for example.
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
a#company-a.com  Adam       Smith     34223  46456   Company A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
b#company-b.com  Elon       Musk      42342  34234   Company B
That's why I came up with the following condition to filter my Email duplicate list further down:
I want to extract all the cases where the Email, FirstName and LastName are equal in a dataframe because that almost certainly would mean that this is a real duplicate. The extracted dataframe should look like this in the end:
Email            FirstName  LastName  Phone  Mobile  Company
a#company-a.com  John       Doe       12342  65464   Company_a
a#company-a.com  John       Doe       43214  45645   Comp_ny A
b#company-b.com  Bill       Gates     23423  63453   Company B
b#company-b.com  Bill       Gates     32421  43244   Comp B
How can I get there? Is it possible to check for multiple equal conditions?
I would appreciate any feedback regarding the best practices.
Thank you!
Use DataFrame.drop_duplicates:
df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first')
output
Email FirstName LastName Phone Mobile Company
0 a#company-a.com John Doe 12342 65464 Company_a
2 a#company-a.com Adam Smith 34223 46456 Company A
3 b#company-b.com Bill Gates 23423 63453 Company B
5 b#company-b.com Elon Musk 42342 34234 Company B
To get the duplicates
df[~df.index.isin(df.drop_duplicates(subset=['Email', 'FirstName', 'LastName'], keep='first').index)]
output
Email FirstName LastName Phone Mobile Company
1 a#company-a.com John Doe 43214 45645 Comp_ny A
4 b#company-b.com Bill Gates 32421 43244 Comp B

Adding a function to a string split command in Pandas

I have a dataframe that has 20 or so columns in it. One of the columns is called 'director_name' and has values such as 'John Doe' or 'Jane Doe'. I want to split this into 2 columns, 'First_Name' and 'Last_Name'. When I run the following it works as expected and splits the string into 2 columns:
data[['First_Name', 'Last_Name']] = data.director_name.str.split(' ', expand=True)
data
First_Name Last_Name
John Doe
It works great, however it does NOT work when I have NULL (NaN) values under 'director_name'. It throws the following error:
'Columns must be same length as key'
I'd like to add a function which checks whether the value is not null, then runs the command listed above, and otherwise enters 'NA' for 'First_Name' and 'Last_Name'.
Any ideas how I would go about that?
EDIT:
I just checked the file and I'm not sure if NULL is the issue. I have some names that are 3-4 words long, e.g.
John Allen Doe
John Allen Doe Jr
Maybe I can't split this into First_Name and Last_Name.
Hmmmm
Here is a way: split and choose, say, the first two values as first name and last name.
Id name
0 1 James Cameron
1 2 Martin Sheen
2 3 John Allen Doe
3 4 NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
You get
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN None
Use str.split (no parameter, because it splits on whitespace by default) with str indexing to select list elements by position:
print (df.name.str.split())
0 [James, Cameron]
1 [Martin, Sheen]
2 [John, Allen, Doe]
3 NaN
Name: name, dtype: object
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split().str[1]
#data borrow from A-Za-z answer
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen
3 4 NaN NaN NaN
It is also possible to use the parameter n for selecting the second name or the first 2 names:
df['First_Name'] = df.name.str.split().str[0]
df['Last_Name'] = df.name.str.split(n=1).str[1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
Solution with str.rsplit:
df['First_Name'] = df.name.str.rsplit(n=1).str[0]
df['Last_Name'] = df.name.str.rsplit().str[-1]
print (df)
Id name First_Name Last_Name
0 1 James Cameron James Cameron
1 2 Martin Sheen Martin Sheen
2 3 John Allen Doe John Allen Doe
3 4 NaN NaN NaN
df['First_Name'] = df.name.str.split(' ', expand = True)[0]
df['Last_Name'] = df.name.str.split(' ', expand = True)[1]
This should fix your problem.
Setup
data = pd.DataFrame({'director_name': {0: 'John Doe', 1: np.nan, 2: 'Alan Smith'}})
data
Out[457]:
director_name
0 John Doe
1 NaN
2 Alan Smith
Solution
#use a lambda function to check nan before splitting the column.
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()), axis=1)
data
Out[446]:
director_name First_Name Last_Name
0 John Doe John Doe
1 NaN NaN NaN
2 Alan Smith Alan Smith
If you need to take only the first 2 names, you can do:
data[['First_Name', 'Last_Name']] = data.apply(lambda x: pd.Series([np.nan,np.nan] if pd.isnull(x.director_name) else x.director_name.split()).iloc[:2], axis=1)
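A vectorized alternative avoids apply entirely: splitting on the first space only always yields at most two columns, and NaN rows propagate as NaN instead of raising (a minimal sketch, using the same data frame as the setup above):
# n=1 splits once, so 'John Allen Doe' becomes ['John', 'Allen Doe'];
# NaN values pass through as NaN rather than raising a length error.
data[['First_Name', 'Last_Name']] = data['director_name'].str.split(' ', n=1, expand=True)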
