Pandas dataframe groupby and combine multiple row values - python

I apologize if the title isn't clear, but I had difficulty phrasing the question. It's probably best if I just show what I would like to do.
Some context: I parsed a document for names and stored each name with the page number where it appears. I need to transform the DataFrame so that there is a single row for each name, and the page-number column combines all the pages where that name appears. I figured this would require groupby, but I'm not entirely sure how.
My data currently:
data = np.array([['John', 'Smith', 1], ['John', 'Smith', 7], ['Eric', 'Adams', 9], ['Jane', 'Doe', 14], ['Jane', 'Doe', 16], ['John', 'Smith', 19]])
df = pd.DataFrame(data, columns=['FIRST_NM', 'LAST_NM', 'PAGE_NUM'])
FIRST_NM LAST_NM PAGE_NUM
0 John Smith 1
1 John Smith 7
2 Eric Adams 9
3 Jane Doe 14
4 Jane Doe 16
5 John Smith 19
Desired dataframe:
FIRST_NM LAST_NM PAGE_NUM
0 John Smith 1,7,19
1 Eric Adams 9
2 Jane Doe 14,16

You can do this with groupby and apply:
df.groupby(['FIRST_NM', 'LAST_NM']).apply(lambda group: ','.join(group['PAGE_NUM']))
Out[23]:
FIRST_NM LAST_NM
Eric Adams 9
Jane Doe 14,16
John Smith 1,7,19
dtype: object
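The one-liner above returns a Series with a MultiIndex rather than the desired frame. A minimal runnable sketch that finishes the job with reset_index (note the numpy array stores every value as a string, which is why the pages can be joined directly; sort=False preserves first-appearance order like the desired output):

```python
import numpy as np
import pandas as pd

data = np.array([['John', 'Smith', 1], ['John', 'Smith', 7], ['Eric', 'Adams', 9],
                 ['Jane', 'Doe', 14], ['Jane', 'Doe', 16], ['John', 'Smith', 19]])
df = pd.DataFrame(data, columns=['FIRST_NM', 'LAST_NM', 'PAGE_NUM'])

# Every value is a string (numpy object/str array), so PAGE_NUM can be joined directly;
# sort=False keeps groups in the order each name first appears
result = (df.groupby(['FIRST_NM', 'LAST_NM'], sort=False)['PAGE_NUM']
            .agg(','.join)
            .reset_index())
print(result)
```

If the page numbers were real integers, the aggregation would become `.agg(lambda s: ','.join(s.astype(str)))` instead.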

Related

Write list of dictionaries to multiple rows in a Pandas Dataframe

So I have a list of dictionaries, that itself has lists of dictionaries within it like this:
myDict = [{'First_Name': 'Jack', 'Last_Name': 'Smith', 'Job_Data': [{'Company': 'Amazon'}, {'Hire_Date': '2011-04-01', 'Company': 'Target'}]},
{'First_Name': 'Jill', 'Last_Name': 'Smith', 'Job_Data': [{'Hire_Date': '2009-11-16', 'Company': 'Sears'}, {'Hire_Date': '2011-04-01'}]}]
However, as you can see, some of the keys are repeated, and data elements are sometimes missing, like Jack's Hire_Date and Jill's Company. So what I want to do is preserve the data and write it to multiple rows so that my final output looks like this:
First_Name Last_Name Hire_Date Company
0 Jack Smith NaN Amazon
1 Jack Smith 2011-04-01 Target
2 Jill Smith 2009-11-16 Sears
3 Jill Smith 2011-04-01 NaN
Edit: Follow-up question. Say now that I have a dictionary that adds in an extra key, and I want to produce a similar output with the new data included:
myDict = [{'First_Name': 'Jack', 'Last_Name': 'Smith', 'Job_Data': [{'Company': 'Amazon'}, {'Hire_Date': '2011-04-01', 'Company': 'Target'}], 'Dependent_data': [{'Dependent': 'Susan Smith'}, {'Dependent': 'Will Smith'}]},
{'First_Name': 'Jill', 'Last_Name': 'Smith', 'Job_Data': [{'Hire_Date': '2009-11-16', 'Company': 'Sears'}, {'Hire_Date': '2011-04-01'}]}]
Output:
First_Name Last_Name Hire_Date Company Dependent
0 Jack Smith NaN Amazon Susan Smith
1 Jack Smith 2011-04-01 Target Will Smith
2 Jill Smith 2009-11-16 Sears NaN
3 Jill Smith 2011-04-01 NaN NaN
Using json_normalize
df = pd.json_normalize(data=myDict, meta=["First_Name", "Last_Name"], record_path="Job_Data")
print(df)
Company Hire_Date First_Name Last_Name
0 Amazon NaN Jack Smith
1 Target 2011-04-01 Jack Smith
2 Sears 2009-11-16 Jill Smith
3 NaN 2011-04-01 Jill Smith
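For the follow-up with Dependent_data, one option (a sketch, not the only way) is to normalize the two record paths separately and then line them up per person. Pairing the n-th dependent with the n-th job row is an assumption about how the two lists correspond:

```python
import pandas as pd

myDict = [{'First_Name': 'Jack', 'Last_Name': 'Smith',
           'Job_Data': [{'Company': 'Amazon'},
                        {'Hire_Date': '2011-04-01', 'Company': 'Target'}],
           'Dependent_data': [{'Dependent': 'Susan Smith'},
                              {'Dependent': 'Will Smith'}]},
          {'First_Name': 'Jill', 'Last_Name': 'Smith',
           'Job_Data': [{'Hire_Date': '2009-11-16', 'Company': 'Sears'},
                        {'Hire_Date': '2011-04-01'}]}]

jobs = pd.json_normalize(myDict, record_path='Job_Data', meta=['First_Name', 'Last_Name'])
# Only records that actually carry Dependent_data can be normalized on that path
deps = pd.json_normalize([d for d in myDict if 'Dependent_data' in d],
                         record_path='Dependent_data', meta=['First_Name', 'Last_Name'])

# Positional pairing: join the n-th job row to the n-th dependent row per person
jobs['n'] = jobs.groupby(['First_Name', 'Last_Name']).cumcount()
deps['n'] = deps.groupby(['First_Name', 'Last_Name']).cumcount()
out = jobs.merge(deps, on=['First_Name', 'Last_Name', 'n'], how='left').drop(columns='n')
print(out)
```

The left merge keeps Jill's job rows and leaves her Dependent column as NaN, matching the desired output.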

Relationship based on time

I am trying to relate two data frames, but there is no key that links them. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I thought about approaching this problem is to iterate over the visits and add the visit id to a ride when the name matches, the ride occurred during/after the visit, and the time delta is the smallest difference seen so far (starting from a large initial time delta and updating it whenever a smaller difference is found). If those conditions are not met, the existing value is kept. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) &
                                              (rides['date'] >= visits['date']) &
                                              ((rides['date'] - row['date']) < rides['min_diff']),
                                              (row['id'], rides['date'] - row['date']),
                                              (rides['id'], rides['min_diff']))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
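A self-contained sketch of the same asof() approach, using the lowercase column names from the question's frames (the date strings must be parsed to datetimes before asof() can compare them):

```python
import pandas as pd

visits = pd.DataFrame({
    "id": [0, 1, 4, 2, 3],
    "name": ["John Smith", "John Smith", "Jane Doe", "Jane Doe", "Thomas Wallace"],
    "date": ["07-01-2020 10:13:24", "07-22-2020 09:47:04", "07-22-2020 09:47:04",
             "06-13-2020 13:27:53", "07-08-2020 11:15:28"],
})
rides = pd.DataFrame({
    "name": ["John Smith", "John Smith", "John Smith", "John Smith",
             "Jane Doe", "Jane Doe", "Thomas Wallace"],
    "ride": ["Insanity", "Bumper Cars", "Tilt-A-Whirl", "Insanity",
             "Bumper Cars", "Teacups", "Insanity"],
    "date": ["07-01-2020 13:53:07", "07-01-2020 16:37:29", "07-02-2020 08:21:18",
             "07-22-2020 11:44:32", "06-13-2020 14:14:41", "06-13-2020 17:31:56",
             "07-08-2020 13:20:23"],
})

# Parse the date strings so timestamps can be compared
visits["date"] = pd.to_datetime(visits["date"], format="%m-%d-%Y %H:%M:%S")
rides["date"] = pd.to_datetime(rides["date"], format="%m-%d-%Y %H:%M:%S")

# asof() needs a sorted index: for each ride, take the id of the most recent
# visit by the same person at or before the ride's timestamp
visits = visits.set_index("date").sort_index()
rides["id"] = rides.apply(
    lambda r: visits.loc[visits["name"] == r["name"], "id"].asof(r["date"]),
    axis=1,
)
print(rides)
```

This row-by-row apply is O(rides × visits); for large frames, pd.merge_asof with by="name" would be the vectorized equivalent.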
I think this does what you need. The ids aren't in the order you specified but they do represent visit ids with the logic you requested.
merged = df2.copy()
merged['ymd'] = pd.to_datetime(merged['date']).dt.strftime('%Y-%m-%d')
merged['id'] = merged.groupby(['name', 'ymd']).ngroup()
merged = merged.drop('ymd', axis=1)
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4

Replace column based on string

I'm trying to replace the column "Name" with a new variable "Gender", based on the title at the start of each name.
INPUT:
df['Name'].value_counts()
OUTPUT:
Mr. Gordon Hemmings 1
Miss Jane Wilkins 1
Mrs. Audrey North 1
Mrs. Wanda Sharp 1
Mr. Victor Hemmings 1
..
Miss Heather Abraham 1
Mrs. Kylie Hart 1
Mr. Ian Langdon 1
Mr. Gordon Watson 1
Miss Irene Vance 1
Name: Name, Length: 4999, dtype: int64
Now, see the Mr., Mrs., and Miss? The first question that comes to mind is: how many distinct titles are there?
INPUT
df.Name.str.split().str[0].value_counts(dropna=False)
Mr. 3351
Mrs. 937
Miss 711
NaN 1
Name: Name, dtype: int64
Now I'm trying to:
#Replace missing value
df['Name'].fillna('Mr.', inplace=True)
# Create Column Gender
df['Gender'] = df['Name']
for i in range(0, df[0]):
    A = df['Name'].values[i][0:3]=="Mr."
    df['Gender'].values[i] = A
df.loc[df['Gender']==True, 'Gender']="Male"
df.loc[df['Gender']==False, 'Gender']="Female"
del df['Name'] #Delete column 'Name'
df
But I'm missing something since I get the following error:
KeyError: 0
The KeyError is because you don't have a column called 0 (df[0] looks up a column label, not the number of rows). However, I would ditch that code and try something more efficient.
You can use np.where with str.contains to search for names containing Mr. after using fillna(). Then just drop the Name column:
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
Full example:
df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
1: 'Miss Jane Wilkins',
2: 'Mrs. Audrey North',
3: 'Mrs. Wanda Sharp',
4: 'Mr. Victor Hemmings'},
'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)
df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)
Name Value
0 Mr. Gordon Hemmings 1
1 Miss Jane Wilkins 1
2 Mrs. Audrey North 1
3 Mrs. Wanda Sharp 1
4 Mr. Victor Hemmings 1
Value Gender
0 1 Male
1 1 Female
2 1 Female
3 1 Female
4 1 Male

Groupby/cumsum for dataframe with duplicate names

I'm trying to perform a cumulative sum on a dataframe that contains multiple identical names. I'd like to create another df that has a cumulative sum of the points scored per player, while also recognizing that names sometimes are not unique. The school would be the 2nd criteria. Here's an example of what I'm looking at:
df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
'Points Scored':['20', '30', '15', '8', '9']})
print(df)
Player School Date Points Scored
0 John Smith Duke 1-1-20 20
1 John Smith Duke 1-3-20 30
2 John Smith Duke 1-7-20 15
3 John Smith Kentucky 1-3-20 8
4 John Smith Kentucky 1-08-20 9
I've tried using df.groupby(by=['Player', 'School', 'Date']).sum().groupby(level=[0]).cumsum(), but that doesn't seem to differentiate on the second criterion. I've also tried to sort_values by School but had no luck there. The expected output would look like the table below:
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17
Thanks in advance for the help!
import numpy as np
import pandas as pd
df = pd.DataFrame({'Player':['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
'School':['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
'Date':['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
'Points Scored':[20, 30, 15, 8, 9]}) # change to integer here
df['Cumulative Sum Points Scored'] = df.groupby(['Player', 'School'])['Points Scored'].cumsum()
Output:
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17

Change names in pandas column to start with uppercase letters

Background
I have a toy df
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Mmith is Here',
'Mary Lisa Hder found here',
'Jane A Doe is also here',
'Tom T Tcker is here too'],
'P_ID': [1,2,3,4],
'P_Name' : ['MMITH, JON J', 'HDER, MARY LISA', 'DOE, JANE A', 'TCKER, TOM T'],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name']]
df
Text N_ID P_ID P_Name
0 Jon J Mmith is Here A1 1 MMITH, JON J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA
2 Jane A Doe is also here A3 3 DOE, JANE A
3 Tom T Tcker is here too A4 4 TCKER, TOM T
Goal
1) Change the P_Name column from df into the format shown in my desired output; that is, change the current format (e.g. MMITH, JON J) to one (e.g. Mmith, Jon J) where the last name, first name, and middle initial all start with a capital letter
2) Create this in a new column P_Name_New
Desired Output
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
Question
How do I achieve my desired goal?
Simply use the str.title() function:
In [98]: df['P_Name_New'] = df['P_Name'].str.title()
In [99]: df
Out[99]:
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
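Put together with the question's toy frame, the one-liner reproduces the desired column:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['Jon J Mmith is Here',
                            'Mary Lisa Hder found here',
                            'Jane A Doe is also here',
                            'Tom T Tcker is here too'],
                   'P_ID': [1, 2, 3, 4],
                   'P_Name': ['MMITH, JON J', 'HDER, MARY LISA',
                              'DOE, JANE A', 'TCKER, TOM T'],
                   'N_ID': ['A1', 'A2', 'A3', 'A4']})
df = df[['Text', 'N_ID', 'P_ID', 'P_Name']]

# str.title() capitalizes the first letter of every word and lowercases the rest
df['P_Name_New'] = df['P_Name'].str.title()
print(df)
```

Note that str.title() would also lowercase an all-caps suffix like "III"; for such cases str.capitalize() applied per word would need more care.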
