I have two data frames with two different name formats.
In the first data frame, names are stored as First Name Last Name (e.g. Jeff Robinson). The second data frame has the same names but in a different format: Last Name, First Name Middle Initial (e.g. Robinson, Jeff D). Not everyone has a middle initial. The second format is considered the correct one.
DF1
Name
Dave Manno
Jane Shirt
Dhruv Patel
Richa Sharma
DF2
Name
Sharma, Richa D
Shirt, Jane M
Patel, Dhruv
Manno, David
I need to merge the two datasets so that each name from the first data frame ends up side by side with the corresponding name from the second. I tried merging on the last names, but they are not unique and can repeat, e.g. you can have two people with the same last name.
Output:
Richa Sharma Sharma, Richa D
Dave Manno Manno, David
Dhruv Patel Patel, Dhruv
Jane Shirt Shirt, Jane M
This is what I currently have, but I'm not sure what to do after this:
df1['first_name'] = df1['employee_name'].str.split().str[0]
df1['last_name'] = df1['employee_name'].str.split().str[-1]
df2[['lastname', 'firstname']] = df2['Employee_Name'].str.split(",", expand=True)
Suppose you have these dataframes:
df1:
Name
0 Dave Manno
1 Jane Shirt
2 Dhruv Patel
3 Richa Sharma
df2:
Name
0 Sharma, Richa D
1 Shirt, Jane M
2 Patel, Dhruv
3 Manno, David
Then:
df1["tmp"] = df1["Name"].str.split().str[-1]
df2["tmp"] = df2["Name"].str.extract(r"(.*?),")
print(pd.merge(df1, df2, on="tmp")[["Name_x", "Name_y"]])
Prints:
Name_x Name_y
0 Dave Manno Manno, David
1 Jane Shirt Shirt, Jane M
2 Dhruv Patel Patel, Dhruv
3 Richa Sharma Sharma, Richa D
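If last names alone can collide (the concern raised in the question), one option is to match on the given name as well. A rough sketch, assuming the given name in df2 is the first token after the comma (the last/first column names are just illustrative):
df1["last"] = df1["Name"].str.split().str[-1]        # last word of "First Last"
df1["first"] = df1["Name"].str.split().str[0]        # first word of "First Last"
df2["last"] = df2["Name"].str.extract(r"(.*?),")     # text before the comma
df2["first"] = df2["Name"].str.extract(r",\s*(\S+)") # first token after the comma
print(pd.merge(df1, df2, on=["last", "first"])[["Name_x", "Name_y"]])
Note that this only pairs exact first-name matches, so spelling variants such as Dave vs. David in the sample data would still need a fuzzy-matching step.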
I am trying to create a relationship between two data frames that are related, but there is no key that links them. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once or multiple times (assume that name is a unique identifier for people). The other data frame records which rides they went on during their time at the park. The structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I thought about approaching this problem is to iterate over the visits and assign the visit id to a ride if the name matches, the ride occurred during or after the visit, and the time delta is the smallest difference seen so far (starting with a large initial time delta and updating it whenever a smaller difference is found). If those conditions are not met, just keep the existing value. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) & (
        rides['date'] >= visits['date']) & (
        (rides['date'] - row['date']) < rides['min_diff']),
        (row['id'], rides['date'] - row['date']),
        (rides['id'], rides['min_diff']))
This unfortunately does not execute because the shapes do not match (and I am also not sure how to assign values across multiple columns at once), but this is the general idea. I am not sure how exactly this could be accomplished, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
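For larger frames, the row-wise apply can get slow; a vectorized alternative is pd.merge_asof, which pairs each ride with the most recent earlier visit for the same person. A sketch, assuming the frames are named rides and visits with the lowercase column names from the question and the date columns already parsed as datetimes:
rides = rides.sort_values("date")    # merge_asof needs both frames sorted on the key
visits = visits.sort_values("date")
out = pd.merge_asof(rides, visits, on="date", by="name", direction="backward")
direction="backward" (the default) takes the last visit whose date is at or before the ride's date, which matches the "during/after the visit" logic in the question.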
I think this does what you need. The ids aren't in the order you specified but they do represent visit ids with the logic you requested.
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4
I have two dataframes. One contains contact information for constituents. The other was created to pair up constituents that might be part of the same household.
Sample:
data1 = {'Household_0': ['1234567', '2345678', '3456789', '4567890'],
         'Individual_0': ['1111111', '2222222', '3333333', '4444444'],
         'Individual_1': ['5555555', '6666666', '7777777', '']}
df1 = pd.DataFrame(data1)
data2 = {'Constituent Id': ['1234567', '2345678', '3456789', '4567890',
                            '1111111', '2222222', '3333333', '4444444',
                            '5555555', '6666666', '7777777'],
         'Display Name': ['Clark Kent and Lois Lane', 'Bruce Banner and Betty Ross',
                          'Tony Stark and Pepper Pots', 'Steve Rogers', 'Clark Kent', 'Bruce Banner',
                          'Tony Stark', 'Steve Rogers', 'Lois Lane', 'Betty Ross', 'Pepper Pots']}
df2 = pd.DataFrame(data2)
Resulting in:
df1
Household_0 Individual_0 Individual_1
0 1234567 1111111 5555555
1 2345678 2222222 6666666
2 3456789 3333333 7777777
3 4567890 4444444
df2
Constituent Id Display Name
0 1234567 Clark Kent and Lois Lane
1 2345678 Bruce Banner and Betty Ross
2 3456789 Tony Stark and Pepper Pots
3 4567890 Steve Rogers
4 1111111 Clark Kent
5 2222222 Bruce Banner
6 3333333 Tony Stark
7 4444444 Steve Rogers
8 5555555 Lois Lane
9 6666666 Betty Ross
10 7777777 Pepper Pots
I would like to take df1, reference the Constituent Id out of df2, and create a new dataframe that has the names of the constituents instead of their IDs, so that we can ensure they are truly family/household members.
I believe I can do this by iterating, but that seems like the wrong approach. Is there a straightforward way to do this?
You can map each column of df1 with a Series built from df2 by setting Constituent Id as the index and selecting the Display Name column. Use apply to repeat the operation on each column.
print (df1.apply(lambda x: x.map(df2.set_index('Constituent Id')['Display Name'])))
Household_0 Individual_0 Individual_1
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
You can pipeline melt, merge and pivot_table.
df3 = (
    df1
    .reset_index()
    .melt('index')
    .merge(df2, left_on='value', right_on='Constituent Id')
    .pivot_table(values='Display Name', index='index', columns='variable', aggfunc='last')
)
print(df3)
outputs
variable Household_0 Individual_0 Individual_1
index
0 Clark Kent and Lois Lane Clark Kent Lois Lane
1 Bruce Banner and Betty Ross Bruce Banner Betty Ross
2 Tony Stark and Pepper Pots Tony Stark Pepper Pots
3 Steve Rogers Steve Rogers NaN
You can also try using .applymap() to link the two together.
reference = df2.set_index('Constituent Id')['Display Name'].to_dict()
df1[df1.columns] = df1[df1.columns].applymap(reference.get)
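As a side note, in newer pandas releases (2.1 and later) DataFrame.applymap has been deprecated in favour of DataFrame.map, so on those versions the second line would become:
df1[df1.columns] = df1[df1.columns].map(reference.get)  # same elementwise lookup, new method name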
I'm finding this problem easy to write out, but difficult to implement with my pandas DataFrame.
When searching for anything like 'unique values' and 'list', I only get answers about getting the unique values in a list.
There is a brute-force solution with a double for loop, but there must be a faster pandas solution than O(n^2).
I have a DataFrame with two columns: Name and Likes Food.
As output, I want a list of unique Likes Food values for each unique Name.
Example DataFrame df
Index Name Likes Food
0 Tim Pizza
1 Marie Pizza
2 Tim Pasta
3 Tim Pizza
4 John Pizza
5 Amy Pizza
6 Amy Sweet Potatoes
7 Marie Sushi
8 Tim Sushi
I know how to aggregate and groupby the unique count of Likes Food:
df.groupby( by='Name', as_index=False ) \
  .agg( {'Likes Food': pandas.Series.nunique} ) \
  .sort_values( by='Likes Food', ascending=False ) \
  .reset_index( drop=True )
>>>
Index Name Likes Food
0 Tim 3
1 Marie 2
2 Amy 2
3 John 1
But given that, what ARE the foods for each Name in that DataFrame? For readability, expressing them as a list makes good sense. The sort order of the list doesn't matter (and is probably easy to fix).
Example Output
<code here>
>>>
Index Name Likes Food Food List
0 Tim 3 [Pizza, Pasta, Sushi]
1 Marie 2 [Pizza, Sushi]
2 Amy 2 [Pizza, Sweet Potatoes]
3 John 1 [Pizza]
To obtain the output without the counts, just try unique:
df.groupby("Name")["Likes Food"].unique()
Name
Amy [Pizza, Sweet Potatoes]
John [Pizza]
Marie [Pizza, Sushi]
Tim [Pizza, Pasta, Sushi]
Name: Likes Food, dtype: object
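If you want that back as a regular DataFrame with a named column, a small follow-up (the Food List column name is just the one from the desired output):
df.groupby("Name")["Likes Food"].unique().reset_index(name="Food List")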
Additionally, you can use named aggregation:
df.groupby("Name").agg(**{"Likes Food": pd.NamedAgg(column='Likes Food', aggfunc="nunique"),
                          "Food List": pd.NamedAgg(column='Likes Food', aggfunc="unique")}).reset_index()
Name Likes Food Food List
0 Amy 2 [Pizza, Sweet Potatoes]
1 John 1 [Pizza]
2 Marie 2 [Pizza, Sushi]
3 Tim 3 [Pizza, Pasta, Sushi]
To get both columns, also sorted, try this:
df = (df.groupby("Name")["Likes_Food"]
        .agg(counts='nunique', food_list='unique')
        .reset_index()
        .sort_values(by='counts', ascending=False))
df
Name counts food_list
3 Tim 3 [Pizza, Pasta, Sushi]
0 Amy 2 [Pizza, SweetPotatoes]
2 Marie 2 [Pizza, Sushi]
1 John 1 [Pizza]
I have two dataframes, DfMaster and DfError.
DfMaster looks like:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans A
4 3643 Kevin Franks S
5 244 Stella Howard D
and DfError looks like
Id Name Building
0 4567 John Evans A
1 244 Stella Howard D
In DfMaster, I would like to change the Building value of a record to DD if that record appears in the DfError data frame. So my desired output would be:
Id Name Building
0 4653 Jane Smith A
1 3467 Steve Jones B
2 34 Kim Lee F
3 4567 John Evans DD
4 3643 Kevin Franks S
5 244 Stella Howard DD
I am trying to use the following:
DfMaster.loc[DfError['Id'], 'Building'] = 'DD'
however I get an error:
KeyError: "None of [Int64Index([4567,244], dtype='int64')] are in the [index]"
What have I done wrong?
Try this using np.where:
import numpy as np
errors = list(DfError['Id'].unique())
DfMaster['Building'] = np.where(DfMaster['Id'].isin(errors), 'DD', DfMaster['Building'])
DataFrame.loc expects index labels or a Boolean series as input, not values from a column (the Id values are not in DfMaster's index, hence the KeyError).
I believe this should do the trick:
DfMaster.loc[DfMaster['Id'].isin(DfError['Id']), 'Building'] = 'DD'
Basically, it says: for all rows whose Id value is present in DfError['Id'], set the value of 'Building' to 'DD'.
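If you prefer to avoid assignment through .loc, an equivalent variant of the same Boolean-mask idea uses Series.mask:
DfMaster['Building'] = DfMaster['Building'].mask(DfMaster['Id'].isin(DfError['Id']), 'DD')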
I have three tsv files named file1.tsv, file2.tsv and file3.tsv.
The three tsv files have the following column names:
ID
Comment
Now I want to create a tsv file where each ID gets a concatenated 'Comment' string built by checking the three files.
For example:
file1.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
Peter Smith Comment 1 of peter smith
file2.tsv
ID Comment
John Cena Comment 2 of john cena
Peter Smith Comment 2 of peter smith
file3.tsv
ID Comment
John Cena Comment 3 of john cena
Peter Smith Comment 3 of peter smith
The results file should be:
results.tsv
ID Comment
Anne Smith Comment 1 of Anne smith
John Cena Comment 2 of john cena. Comment 3 of john cena.
Peter Smith Comment 1 of peter smith. Comment 2 of peter smith. Comment 3 of peter smith
I am new to pandas. Just wondering if we can use pandas or any other suitable library to perform the concatenation rather than writing it from scratch.
Assuming you read your tsv files into df1, df2 and df3:
df = pd.concat([df1, df2, df3]).groupby('ID').Comment.apply('. '.join)
You can just use Pandas' read_csv function, but with the sep argument set to \t.
If you use this on all three TSV files, you should end up with three dataframes. You can then use the merge function to combine them how you wish.
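A rough sketch of that idea, using the file names from the question and combining the frames with concat and groupby (rather than merge, since the goal is to stack rows and join each ID's comments):
import pandas as pd

df1 = pd.read_csv('file1.tsv', sep='\t')
df2 = pd.read_csv('file2.tsv', sep='\t')
df3 = pd.read_csv('file3.tsv', sep='\t')

# stack the three frames, then join each ID's comments with ". "
result = (pd.concat([df1, df2, df3])
            .groupby('ID', as_index=False)['Comment']
            .agg('. '.join))
result.to_csv('results.tsv', sep='\t', index=False)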
To further expand on Wen's answer: the last loop is not very pandas-idiomatic, but it works...
file1 = '''ID\tComment
Anne Smith\tComment 1 of Anne smith
Peter Smith\tComment 1 of peter smith
'''
file2 = '''ID\tComment
John Cena\tComment 2 of john cena
Peter Smith\tComment 2 of peter smith
'''
file3 = '''ID\tComment
John Cena\tComment 3 of john cena
Peter Smith\tComment 3 of peter smith
'''
flist = []
for i, r in enumerate([file1, file2, file3], start=1):
    fname = f'file{i}.tsv'   # write each string out under its own file name
    with open(fname, 'w') as f:
        f.write(r)
    flist.append(fname)
import pandas as pd
dflist = []
for fname in flist:
    df = pd.read_csv(fname, delimiter='\t')
    dflist.append(df)

grouped = pd.concat(dflist).groupby('ID')

data = []
for row in grouped:
    data.append({'ID': row[0], 'Comments': '. '.join(row[1].Comment)})

pd.DataFrame(data, columns=['ID', 'Comments']).to_csv('concat.tsv', sep='\t', index=False)