I'm trying to perform a cumulative sum on a dataframe that contains multiple identical names. I'd like to create another df with a cumulative sum of the points scored per player, while recognizing that names are not always unique; the school serves as the second criterion. Here's an example of what I'm looking at:
df = pd.DataFrame({'Player': ['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
                   'School': ['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
                   'Date': ['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
                   'Points Scored': ['20', '30', '15', '8', '9']})
print(df)
Player School Date Points Scored
0 John Smith Duke 1-1-20 20
1 John Smith Duke 1-3-20 30
2 John Smith Duke 1-7-20 15
3 John Smith Kentucky 1-3-20 8
4 John Smith Kentucky 1-08-20 9
I've tried using df.groupby(by=['Player', 'School', 'Date']).sum().groupby(level=[0]).cumsum()... but that doesn't seem to differentiate by the second criterion. I've also tried to sort_values by School, but had no luck there. The expected output would look like the table below:
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17
Thanks in advance for the help!
import pandas as pd

df = pd.DataFrame({'Player': ['John Smith', 'John Smith', 'John Smith', 'John Smith', 'John Smith'],
                   'School': ['Duke', 'Duke', 'Duke', 'Kentucky', 'Kentucky'],
                   'Date': ['1-1-20', '1-3-20', '1-7-20', '1-3-20', '1-08-20'],
                   'Points Scored': [20, 30, 15, 8, 9]})  # changed to integers here
# cumsum within each (Player, School) group keeps the two John Smiths separate
df['Cumulative Sum Points Scored'] = df.groupby(['Player', 'School'])['Points Scored'].cumsum()
Output:
Player School Date Points Scored Cumulative Sum Points Scored
0 John Smith Duke 1-1-20 20 20
1 John Smith Duke 1-3-20 30 50
2 John Smith Duke 1-7-20 15 65
3 John Smith Kentucky 1-3-20 8 8
4 John Smith Kentucky 1-08-20 9 17
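One caveat: cumsum follows row order, so if the rows are not already chronological within each group, you may want to parse Date and sort first. A minimal sketch, assuming the dates are month-day-year strings:

df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%y')
df = df.sort_values(['Player', 'School', 'Date']).reset_index(drop=True)
df['Cumulative Sum Points Scored'] = df.groupby(['Player', 'School'])['Points Scored'].cumsum()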
Related
I am trying to create a relationship between two data frames that are related, but there is no key that creates a relationship. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
My idea was to iterate over the visits and assign a visit's id to a ride when the name matches, the ride occurred at or after the visit, and the time delta is the smallest difference seen so far (starting from a large initial time delta and shrinking it whenever a smaller difference is found). If those conditions are not met, the ride keeps its current value. With that process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where((rides['name'] == row['name']) &
                                              (rides['date'] >= visits['date']) &
                                              ((rides['date'] - row['date']) < rides['min_diff']),
                                              (row['id'], rides['date'] - row['date']),
                                              (rides['id'], rides['min_diff']))
This unfortunately does not execute because the shapes don't match (and because it tries to assign values across multiple columns at once, which I am not sure how to do), but this is the general idea. I am not sure how exactly this could be accomplished, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
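asof() compares real timestamps, so both date columns should be datetimes first. A minimal setup sketch, assuming the frames were built with the string dates shown in the question:

df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])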
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
I think this does what you need. The ids aren't in the order you specified, but they do identify visits. Note that a visit is keyed here by person and calendar day, so a stay spanning several days (John Smith's 07-01 and 07-02 rides) gets a different id each day.
# the right join keeps every ride row; with these keys nothing from df1 matches,
# so only df2's columns survive the selection
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
# group rides by person and calendar day; each group number becomes a visit id
merged['ymd'] = pd.to_datetime(merged.date_y).dt.strftime('%Y-%m-%d')
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4
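Another route worth noting (a sketch, not part of the answers above; it assumes both date columns have already been converted with pd.to_datetime): pd.merge_asof can attach each ride to the most recent visit at or before it, per person, which handles multi-day visits correctly.

rides_with_id = pd.merge_asof(
    df2.sort_values('date'),   # rides, sorted by timestamp (merge_asof requires this)
    df1.sort_values('date'),   # visits, sorted by timestamp
    on='date',                 # match on the timestamps
    by='name',                 # only match rows for the same person
    direction='backward')      # take the latest visit at or before each ride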
I want to extract only the full words of a string.
I have this df:
Students Age
0 Boston Terry Emma 23
1 Tommy Julien Cambridge 20
2 London 21
3 New York Liu 30
4 Anna-Madrid+ Pauline 26
5 Mozart Cambridge 27
6 Gigi Tokyo Lily 18
7 Paris Diane Marie Dive 22
And I want to extract the FULL words from the string, NOT parts of them (e.g. 'iu' should only match where 'iu' stands as a whole word in the name, not inside 'Liu', because 'Liu' is not 'iu').
cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']
Desired df:
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN
I tried this code:
pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)
My code for cities works, I just need to repair the issue for the 'Liked Names'.
How to make this work? Thanks a lot!!!
I think what you are looking for are word boundaries. In a regular expression they can be expressed with a \b. An ugly (albeit working) solution is to modify the liked_names list to include word boundaries and then run the code:
import pandas as pd

l = [
["Boston Terry Emma", 23],
["Tommy Julien Cambridge", 20],
["London", 21],
["New York Liu", 30],
["Anna-Madrid+ Pauline", 26],
["Mozart Cambridge", 27],
["Gigi Tokyo Lily", 18],
["Paris Diane Marie Dive", 22],
]
cities = [
"Boston",
"Cambridge",
"Bruxelles",
"New York",
"London",
"Amsterdam",
"Madrid",
"Tokyo",
"Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])
pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)
print(df)
A nicer solution would be to include the word boundaries in the creation of the regular expression.
I first tried using \s, i.e. whitespace, but that did not work for a name at the end of the string, so \b was the solution. You can check https://regular-expressions.mobi/wordboundaries.html?wlr=1 for some details.
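For example, something like this (a sketch of that nicer variant; it assumes the names contain no regex metacharacters, otherwise pass them through re.escape first):

pat = r"\b(" + "|".join(liked_names) + r")\b"   # liked_names here is the original, unmodified list
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)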
You can try this regex:
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
pat = (
"(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)
df["Liked Names"] = df["Students"].str.extract(pat)
print(df)
Prints:
Students Age Liked Names
0 Boston Terry Emma 23 Emma
1 Tommy Julien Cambridge 20 Tommy Julien
2 London 21 NaN
3 New York Liu 30 Liu
4 Anna-Madrid+ Pauline 26 Pauline
5 Mozart Cambridge 27 NaN
6 Gigi Tokyo Lily 18 NaN
7 Paris Diane Marie Dive 22 NaN
You can do an additional check to see if matched name is in Students column.
import numpy as np
def check(row):
if row['Liked Names'] == row['Liked Names']:
# If `Liked Names` is not nan
# Get all possible names
patterns = row['Students'].split(' ')
# If matched `Liked Names` in `Students`
isAllMatched = all([name in patterns for name in row['Liked Names'].split(' ')])
if not isAllMatched:
return np.nan
else:
return row['Liked Names']
else:
# If `Liked Names` is nan, still return nan
return np.nan
df['Liked Names'] = df.apply(check, axis=1)
# print(df)
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN
I have two dataframes: df1 contains partial names of persons, and df2 contains full names of persons, their dob, etc. I want to partially match df1['Partial_names'] against df2['Full_names']. For example, to match Martin Lues, all rows in Full_names having Martin in them should be fetched.
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df1['Partial_names'] = ['John Smith', 'Leo Lauana', 'Adam Marry', 'Martin Lues']
df2['Full_names'] = ['John Smith Wade', 'Adam Blake Marry', 'Riley Leo Lauana',
                     'Martin Smith', 'Martin Flex Leo']
Partial_names
1 John Smith
2 Leo Lauana
3 Adam Marry
4 Martin Lues
5 Martin Author
Full_names
1 Martin Smith
2 Riley Leo Lauana
3 Adam Blake Marry
4 Jeff Hard Jin
5 Martin Flex Leo
The resulting dataframe should be:
  Partial_names  Resulting_Column_with_full_names
1 John Smith     John Smith Wade
2 Leo Lauana     Riley Leo Lauana
3 Adam Marry     Adam Blake Marry
4 Martin Lues    Martin Smith
                 Martin Flex Leo
In actuality, both dataframes have many more rows.
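One way to attack this (a sketch, with the matching rule inferred from the example: prefer full names that contain every token of the partial name, and fall back to first-name matches when none do):

import pandas as pd

df1 = pd.DataFrame({'Partial_names': ['John Smith', 'Leo Lauana', 'Adam Marry', 'Martin Lues']})
df2 = pd.DataFrame({'Full_names': ['John Smith Wade', 'Adam Blake Marry', 'Riley Leo Lauana',
                                   'Martin Smith', 'Martin Flex Leo']})

def match_full_names(partial):
    tokens = partial.split()
    # prefer full names containing every token of the partial name
    hits = [f for f in df2['Full_names'] if all(t in f.split() for t in tokens)]
    if not hits:
        # otherwise fall back to matching on the first name alone
        hits = [f for f in df2['Full_names'] if tokens[0] in f.split()]
    return ', '.join(hits)

df1['Resulting_Column_with_full_names'] = df1['Partial_names'].apply(match_full_names)
print(df1)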
Background
I have a toy df
import pandas as pd
df = pd.DataFrame({'Text' : ['Jon J Mmith is Here',
'Mary Lisa Hder found here',
'Jane A Doe is also here',
'Tom T Tcker is here too'],
'P_ID': [1,2,3,4],
'P_Name' : ['MMITH, JON J', 'HDER, MARY LISA', 'DOE, JANE A', 'TCKER, TOM T'],
'N_ID' : ['A1', 'A2', 'A3', 'A4']
})
#rearrange columns
df = df[['Text','N_ID', 'P_ID', 'P_Name']]
df
Text N_ID P_ID P_Name
0 Jon J Mmith is Here A1 1 MMITH, JON J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA
2 Jane A Doe is also here A3 3 DOE, JANE A
3 Tom T Tcker is here too A4 4 TCKER, TOM T
Goal
1) Change the P_Name column from df into a format that looks like my desired output; that is, change the current format (e.g. MMITH, JON J) to one (e.g. Mmith, Jon J) where the first name, last name, and middle initial all start with a capital letter
2) Create this in a new column P_Name_New
Desired Output
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
Question
How do I achieve my desired goal?
Simply use the str.title() function:
In [98]: df['P_Name_New'] = df['P_Name'].str.title()
In [99]: df
Out[99]:
Text N_ID P_ID P_Name P_Name_New
0 Jon J Mmith is Here A1 1 MMITH, JON J Mmith, Jon J
1 Mary Lisa Hder found here A2 2 HDER, MARY LISA Hder, Mary Lisa
2 Jane A Doe is also here A3 3 DOE, JANE A Doe, Jane A
3 Tom T Tcker is here too A4 4 TCKER, TOM T Tcker, Tom T
I apologize if the title isn't clear, but I had difficulty phrasing the question. It's probably best if I just show what I would like to do.
Some context: I parsed a document for names and stored each name with the page number where it appears. I need to transform the DataFrame so that there is a single row for each name and the page number column combines all the pages where the name appears. I figured that this would require GroupBy, but I'm not entirely sure.
My data currently:
import numpy as np
import pandas as pd

data = np.array([['John', 'Smith', 1], ['John', 'Smith', 7], ['Eric', 'Adams', 9], ['Jane', 'Doe', 14], ['Jane', 'Doe', 16], ['John', 'Smith', 19]])
df = pd.DataFrame(data, columns=['FIRST_NM', 'LAST_NM', 'PAGE_NUM'])
FIRST_NM LAST_NM PAGE_NUM
0 John Smith 1
1 John Smith 7
2 Eric Adams 9
3 Jane Doe 14
4 Jane Doe 16
5 John Smith 19
Desired dataframe:
FIRST_NM LAST_NM PAGE_NUM
0 John Smith 1,7,19
1 Eric Adams 9
2 Jane Doe 14,16
You can do this with groupby and apply:
df.groupby(['FIRST_NM', 'LAST_NM']).apply(lambda group: ','.join(group['PAGE_NUM']))
Out[23]:
FIRST_NM LAST_NM
Eric Adams 9
Jane Doe 14,16
John Smith 1,7,19
dtype: object
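To get the result back as a regular three-column frame in the original row order, a small variant (note that PAGE_NUM holds strings here, since np.array coerces everything to one dtype, so ','.join works directly):

out = (df.groupby(['FIRST_NM', 'LAST_NM'], sort=False)['PAGE_NUM']
         .agg(','.join)
         .reset_index())
print(out)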