I have a dataframe with thousands of rows like this:
city   zip_code  name
paris  1         John
paris  1         Eric
paris  2         David
LA     3         David
LA     4         David
LA     4         NaN
How can I group by city and zip_code and get the names that belong to each (city, zip_code) group?
Expected output: a dataframe with one row per name, where the city and zip_code columns identify each unique group:
city   zip_code  name
paris  1         John
                 Eric
paris  2         David
LA     3         David
LA     4         David
IIUC, you want to know the existing combinations of city and zip_code?
[k for k,_ in df.groupby(['city', 'zip_code'])]
output: [('LA', 3), ('LA', 4), ('paris', 1), ('paris', 2)]
Edit following your change to the question:
It looks like you want:
df.drop_duplicates().dropna()
output:
    city  zip_code   name
0  paris         1   John
1  paris         1   Eric
2  paris         2  David
3     LA         3  David
4     LA         4  David
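If you also want the grouped view from your expected output (one list of names per (city, zip_code) pair), a small sketch along these lines should do it, assuming the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'city': ['paris', 'paris', 'paris', 'LA', 'LA', 'LA'],
                   'zip_code': [1, 1, 2, 3, 4, 4],
                   'name': ['John', 'Eric', 'David', 'David', 'David', None]})

# drop exact duplicates and the missing name, then collect names per group
out = (df.drop_duplicates()
         .dropna()
         .groupby(['city', 'zip_code'])['name']
         .agg(list))
print(out)
# city   zip_code
# LA     3                [David]
#        4                [David]
# paris  1           [John, Eric]
#        2                [David]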
I am trying to create a relationship between two data frames that are related, but there is no key linking them. Here is the layout of my problem:
The first data frame holds information about when people entered an amusement park. In this amusement park, people can stay for multiple days. So the structure of this data frame is:
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once or multiple times (assume that name is a unique identifier for people). The other data frame records which rides they went on during their time at the park. Its structure is:
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides each person went on during that visit. So the desired output in this example would be:
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I had thought about approaching this problem is to iterate over the visits and assign the visit's id to a ride if the name matches, the ride occurred during or after the visit, and the time delta is the smallest difference seen so far (starting from a large initial time delta and updating it whenever a smaller difference is found). If those conditions are not met, just keep the existing value. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) &
                                              (rides['date'] >= visits['date']) &
                                              ((rides['date'] - row['date']) < rides['min_diff']),
                                              (row['id'], rides['date'] - row['date']),
                                              (rides['id'], rides['min_diff']))
This unfortunately does not execute, because the shapes do not match (and it also tries to assign values across multiple columns at once, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
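For reference, a minimal repair of that loop, as a sketch rather than a definitive fix; it assumes both date columns are already parsed with pd.to_datetime, and that rides['date'] should be compared to row['date'] (comparing it to the whole visits['date'] column is what causes the shape mismatch):
import pandas as pd

rides['id'] = -1
rides['min_diff'] = pd.Timedelta(days=365)
for _, row in visits.iterrows():
    mask = ((rides['name'] == row['name'])
            & (rides['date'] >= row['date'])
            & ((rides['date'] - row['date']) < rides['min_diff']))
    # assign each column separately instead of one two-column np.where
    rides.loc[mask, 'id'] = row['id']
    rides.loc[mask, 'min_diff'] = rides.loc[mask, 'date'] - row['date']
rides = rides.drop(columns='min_diff')  # helper column no longer needed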
Try with apply() and asof():
df1 = df1.set_index("date").sort_index()  # asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["name"] == x["name"]]["id"].asof(x["date"]), axis=1)
>>> df2
             name          ride                date  id
0      John Smith      Insanity 2020-07-01 13:53:07   0
1      John Smith   Bumper Cars 2020-07-01 16:37:29   0
2      John Smith  Tilt-A-Whirl 2020-07-02 08:21:18   0
3      John Smith      Insanity 2020-07-22 11:44:32   1
4        Jane Doe   Bumper Cars 2020-06-13 14:14:41   2
5        Jane Doe       Teacups 2020-06-13 17:31:56   2
6  Thomas Wallace      Insanity 2020-07-08 13:20:23   3
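Note that asof compares against the index values, so both date columns need to be real datetimes before the set_index above; a minimal setup sketch, assuming the frames still hold the string dates from the question:
df1["date"] = pd.to_datetime(df1["date"], format="%m-%d-%Y %H:%M:%S")
df2["date"] = pd.to_datetime(df2["date"], format="%m-%d-%Y %H:%M:%S")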
I think this does what you need. The ids aren't in the order you specified, and note one caveat: the id is derived from each (name, calendar day) pair, so a visit that spans multiple days (like John Smith's 07-01/07-02 stay) gets split into separate ids.
# right join keeps every ride row; only df2's columns are used afterwards
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
# bucket each ride by calendar day, then number the (name, day) groups
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
             name          ride                 date  id
4        Jane Doe   Bumper Cars  06-13-2020 14:14:41   0
5        Jane Doe       Teacups  06-13-2020 17:31:56   0
0      John Smith      Insanity  07-01-2020 13:53:07   1
1      John Smith   Bumper Cars  07-01-2020 16:37:29   1
2      John Smith  Tilt-A-Whirl  07-02-2020 08:21:18   2
3      John Smith      Insanity  07-22-2020 11:44:32   3
6  Thomas Wallace      Insanity  07-08-2020 13:20:23   4
Let's say I have this sample of a mixed dataset:
df:
Property              Name   Date of entry  Old data         Updated data
City                  Jim    1/7/2021       Jacksonville     Miami
State                 Jack   1/8/2021       TX               CA
Zip                   Joe    2/2/2021       11111            22222
Address               Harry  2/3/2021       123 lane         123 street
Telephone             Lisa   3/1/2021       111-111-11111    333-333-3333
Email                 Tammy  3/2/2021       tammy#yahoo.com  tammy#gmail.com
Date Product Ordered  Lisa   3/3/2021       2/1/2021         2/10/2021
Order count           Tammy  3/4/2021       2                3
I'd like to group all this data, starting with Property, and have it look like this:
grouped:
Property  Name    Date of entry  Old data  Updated Data
City      names1  date 1         data 1    data 2
          names2  date 2         data 1    data 2
          names3  date 3         data 1    data 2
State     names1  date 1         data 1    data 2
          names2  date 2         data 1    data 2
          names3  date 3         data 1    data 2
grouped = pd.DataFrame(df.groupby(['Property', 'Name', 'Date of entry', 'Old data', 'Updated data'])
                         .size(), columns=['Count'])
grouped
and I get a type error saying: '<' not supported between instances of 'int' and 'datetime.datetime'
Is there some sort of formatting that I need to do to the df['Old data'] & df['Updated data'] columns to allow them to be added to the groupby?
added data types:
Property: Object
Name: Object
Date of entry: datetime
Old data: Object
Updated data: Object
*I modified your initial data to get a better view of the output.
You can try with pivot_table instead of groupby:
df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x)
Output:
                                           Old data         Updated data
Property             Name   Date of entry
Address              Harry  2/3/2021       123 lane         123 street
                     Lisa   2/3/2021       123 lane         123 street
City                 Jack   1/8/2021       TX               Miami
                     Jim    1/7/2021       Jacksonville     Miami
                     Tammy  1/8/2021       TX               Miami
Date Product Ordered Lisa   3/3/2021       2/1/2021         2/10/2021
Email                Tammy  3/2/2021       tammy#yahoo.com  tammy#gmail.com
Order count          Jack   3/4/2021       2                3
                     Tammy  3/4/2021       2                3
State                Jack   1/8/2021       TX               CA
Telephone            Lisa   3/1/2021       111-111-11111    333-333-3333
Zip                  Joe    2/2/2021       11111            22222
The whole code:
import pandas as pd
from io import StringIO
txt = '''Property  Name  Date of entry  Old data  Updated data
City  Jim  1/7/2021  Jacksonville  Miami
City  Jack  1/8/2021  TX  Miami
State  Jack  1/8/2021  TX  CA
Zip  Joe  2/2/2021  11111  22222
Order count  Jack  3/4/2021  2  3
Address  Harry  2/3/2021  123 lane  123 street
Telephone  Lisa  3/1/2021  111-111-11111  333-333-3333
Address  Lisa  2/3/2021  123 lane  123 street
Email  Tammy  3/2/2021  tammy#yahoo.com  tammy#gmail.com
Date Product Ordered  Lisa  3/3/2021  2/1/2021  2/10/2021
Order count  Tammy  3/4/2021  2  3
City  Tammy  1/8/2021  TX  Miami
'''
df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s{2,}', engine='python')
print(df.pivot_table(index=['Property', 'Name', 'Date of entry'], aggfunc=lambda x: x))
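If you'd rather keep your original groupby, here is a hedged alternative based on my reading of the error: the TypeError comes from the mixed int/str/datetime values inside the two object columns, which groupby tries to sort as keys, so casting them to plain strings first should let it run:
# cast the mixed-type object columns to str so the group keys are comparable
for col in ['Old data', 'Updated data']:
    df[col] = df[col].astype(str)
grouped = pd.DataFrame(df.groupby(['Property', 'Name', 'Date of entry', 'Old data', 'Updated data'])
                         .size(), columns=['Count'])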
I have a large df called data which looks like:
   Identifier  Surname  First names(s)  Date change  Work Pattern  Region
0  12233.0     Smith    Bob                          FT            NW
1  54213.0     Jones    Sally           15/04/15     FT            NW
2  12237.0     Evans    Steve           26/08/14     FT            SE
3  10610.0     Cooper   Amy             16/08/12     FT            SE
I have another dataframe called updates. In this example the dataframe has updated information for data for a couple of records and looks like:
   Identifier  Surname  First names(s)  Date change
0  12233.0     Smith    Bob             05/09/14
1  10610.0     Cooper   Amy             16/08/12
I'm trying to find a way to update data with the updates df so the resulting dataframe looks like:
   Identifier  Surname  First names(s)  Date change  Work Pattern  Region
0  12233.0     Smith    Bob             15/09/14     FT            NW
1  54213.0     Jones    Sally           15/04/15     FT            NW
2  12237.0     Evans    Steve           26/08/14     FT            SE
3  10610.0     Cooper   Amy             16/08/12     FT            SE
As you can see the Date change field for Bob in the data df has been updated with the Date change from the updates df.
What can I try next?
A while back I was dealing with this too. The straight-up .update was giving me issues (sorry, I can't remember the exact issue I had; I think it was that .update relies on the indexes matching, and they didn't match in my two separate dataframes, so I wanted to use certain columns as my index to update on). But I made a function to deal with it. It might be way more than what's needed here, but try it and see if it works.
I'm also assuming the date you want to update from the updates dataframe should be 15/09/14, not 05/09/14, so I have it that way in my sample data below.
Also, I'm assuming Identifier is a unique key. If not, you'll need to include multiple columns as your unique key.
import pandas as pd

data = pd.DataFrame([[12233.0, 'Smith', 'Bob', '', 'FT', 'NW'],
                     [54213.0, 'Jones', 'Sally', '15/04/15', 'FT', 'NW'],
                     [12237.0, 'Evans', 'Steve', '26/08/14', 'FT', 'SE'],
                     [10610.0, 'Cooper', 'Amy', '16/08/12', 'FT', 'SE']],
                    columns=['Identifier', 'Surname', 'First names(s)', 'Date change', 'Work Pattern', 'Region'])

updates = pd.DataFrame([[12233.0, 'Smith', 'Bob', '15/09/14'],
                        [10610.0, 'Cooper', 'Amy', '16/08/12']],
                       columns=['Identifier', 'Surname', 'First names(s)', 'Date change'])

def update(df1, df2, keys_list):
    df1 = df1.set_index(keys_list)
    df2 = df2.set_index(keys_list)
    # update() aligns on the index, so duplicate keys would make it ambiguous
    dup_idx1 = df1.index[df1.index.duplicated()]
    dup_idx2 = df2.index[df2.index.duplicated()]
    if len(dup_idx1) > 0 or len(dup_idx2) > 0:
        print('\n' + '#' * 50 + '\nError! Duplicate indices:')
        for element in dup_idx1:
            print('df1: %s' % (element,))
        for element in dup_idx2:
            print('df2: %s' % (element,))
        print('#' * 50 + '\n\n')
    df1.update(df2, overwrite=True)
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    return df1

# the 3rd argument is a list, in case you need multiple columns as your unique key
df = update(data, updates, ['Identifier'])
Output:
print(data)
   Identifier  Surname  First names(s)  Date change  Work Pattern  Region
0  12233.0     Smith    Bob                          FT            NW
1  54213.0     Jones    Sally           15/04/15     FT            NW
2  12237.0     Evans    Steve           26/08/14     FT            SE
3  10610.0     Cooper   Amy             16/08/12     FT            SE
print(updates)
   Identifier  Surname  First names(s)  Date change
0  12233.0     Smith    Bob             15/09/14
1  10610.0     Cooper   Amy             16/08/12
print(df)
   Identifier  Surname  First names(s)  Date change  Work Pattern  Region
0  12233.0     Smith    Bob             15/09/14     FT            NW
1  54213.0     Jones    Sally           15/04/15     FT            NW
2  12237.0     Evans    Steve           26/08/14     FT            SE
3  10610.0     Cooper   Amy             16/08/12     FT            SE
Using DataFrame.update.
First set index:
data.set_index('Identifier', inplace=True)
updates.set_index('Identifier', inplace=True)
Then update:
data.update(updates)
print(data)
            Surname  First names(s)  Date change  Work Pattern  Region
Identifier
12233.0     Smith    Bob             15/09/14     FT            NW
54213.0     Jones    Sally           15/04/15     FT            NW
12237.0     Evans    Steve           26/08/14     FT            SE
10610.0     Cooper   Amy             16/08/12     FT            SE
If you need multiple columns to create a unique index you can just set them with a list. For example:
data.set_index(['Identifier', 'Surname'], inplace=True)
updates.set_index(['Identifier', 'Surname'], inplace=True)
data.update(updates)
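One detail worth knowing about DataFrame.update: it only copies non-NA values from the other frame, so a missing value in updates never wipes out existing data. A tiny self-contained illustration with hypothetical frames:
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'x': [np.nan, 20]}, index=['a', 'b'])
left.update(right)
print(left)  # 'a' keeps 1; only 'b' is overwritten with 20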
I'm trying to concatenate a dataframe with new observations. I got an answer that I think is right, but the system still comes back with "ValueError: Can only compare identically-labeled DataFrame objects". Can anyone tell me why there's a ValueError when I think I got the right result?
Here is the question:
Assume the data frame Employee is as below:
       Department  Title      Year  Education  Sex
Name
Bob    IT          analyst    1     Bachelor   M
Sam    Trade       associate  3     PHD        M
Peter  HR          VP         8     Master     M
Jake   IT          analyst    2     Master     M
and another data frame new_observations is:
          Department  Education  Sex  Title      Year
Mary      IT                     F    VP         9.0
Amy       ?           PHD        F    associate  5.0
Jennifer  Trade       Master     F    associate  NaN
John      HR          Master     M    analyst    2.0
Judy      HR          Bachelor   F    analyst    2.0
Update Employee with these new observations.
Here is my code:
import pandas as pd

Employee = pd.DataFrame({"Name": ["Bob", "Sam", "Peter", "Jake"],
                         "Education": ["Bachelor", "PHD", "Master", "Master"],
                         "Sex": ["M", "M", "M", "M"],
                         "Year": [1, 3, 8, 2],
                         "Department": ["IT", "Trade", "HR", "IT"],
                         "Title": ["analyst", "associate", "VP", "analyst"]})
Employee = Employee.set_index('Name')

new_observations = pd.DataFrame({"Name": ["Mary", "Amy", "Jennifer", "John", "Judy"],
                                 "Department": ["IT", "?", "Trade", "HR", "HR"],
                                 "Education": ["", "PHD", "Master", "Master", "Bachelor"],
                                 "Sex": ["F", "F", "F", "M", "F"],
                                 "Title": ["VP", "associate", "associate", "analyst", "analyst"],
                                 "Year": [9.0, 5.0, "NaN", 2.0, 2.0]},
                                columns=["Name", "Department", "Education", "Sex", "Title", "Year"])
new_observations = new_observations.set_index('Name')

Employee = Employee.append(new_observations, sort=False)
I also tried
Employee = pd.concat([Employee, new_observations], axis = 1, sort=False)
Use pd.concat on axis=0, which is the default, so you don't need to pass axis at all:
pd.concat([Employee, new_observations], sort=False)
Output:
          Education  Sex  Year  Department  Title
Name
Bob       Bachelor   M    1     IT          analyst
Sam       PHD        M    3     Trade       associate
Peter     Master     M    8     HR          VP
Jake      Master     M    2     IT          analyst
Mary                 F    9     IT          VP
Amy       PHD        F    5     ?           associate
Jennifer  Master     F    NaN   Trade       associate
John      Master     M    2     HR          analyst
Judy      Bachelor   F    2     HR          analyst
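One side note on the question's code, not on concat itself: Year is built with the string "NaN" rather than a real missing value, which forces the column to object dtype and can make equality checks against an expected frame fail. A small sketch of the cleaner construction:
import numpy as np
import pandas as pd

# a real missing value keeps the Year column numeric (float64), not object
new_observations = pd.DataFrame({"Name": ["Mary", "Amy", "Jennifer", "John", "Judy"],
                                 "Year": [9.0, 5.0, np.nan, 2.0, 2.0]}).set_index("Name")
print(new_observations["Year"].dtype)  # float64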