DataFrame:
account_id plan_id policy_group_nbr plan_type split_eff_date splits
0 470804 8739131 Conversion732 Onsite Medical Center 1/19/2022 Bob Smith (28.2) | John Doe (35.9) | A...
1 470804 8739131 Conversion732 Onsite Medical Center 1/21/2022 Bob Smith (19.2) | John Doe (34.6) | A...
2 470809 2644045 801790 401(k) 1/18/2022 Jim Jones (100)
3 470809 2644045 801790 401(k) 1/5/2022 Other Name (50) | Jim Jones (50)
4 470809 2738854 801789 401(k) 1/18/2022 Jim Jones (100)
... ... ... ... ... ... ...
1720 3848482 18026734 24794 Accident 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1721 3848482 18026781 BCSC FSA Admin 1/20/2022 Bill Underwood (50) | Jim Jones (50)
1722 3927880 19602958 Consulting Other 1/20/2022 Bill Brown (50) | Tim Scott (50)
1723 3927880 19863300 Producer Expense 5500 Filing 1/20/2022 Bill Brown (50) | Tim Scott (50)
1724 3927880 19863300 Producer Expense 5500 Filing 1/21/2022 Bill Brown (50) | Tim Scott (50)
I need to group by (account_id, plan_id, policy_group_nbr, plan_type), sorted by split_eff_date (descending), in order to remove all rows in each group except the most recent date while keeping all columns. I can get a rank; however, when attempting to pass an argument to the lambda function, I'm receiving a TypeError.
Working as expected:
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values().rank())
This raises TypeError: incompatible index of inserted column with frame index:
splits['rank'] = splits.groupby(['account_id', 'plan_id', 'policy_group_nbr', 'plan_type'])['split_eff_date'].apply(lambda x: x.sort_values(ascending=False).rank())
Passing the axis argument didn't seem to help either... is this a simple syntax issue, or am I not understanding the function properly?
It's easier -- and typically faster -- to do this with .transform().
Easier because when you sort descending, the index no longer matches when you try to assign back to the original DataFrame. I tried not using an index in the .groupby(), but wasn't able to get that working.
link to documentation about .transform(): https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.SeriesGroupBy.transform.html
I recommend using .transform() like this; be sure to supply the ascending=False kwarg to .rank() as well:
df["rank2"] = df.groupby(["account_id", "plan_id", "policy_group_nbr", "plan_type"])[
"split_eff_date"
].transform(
lambda x: x.sort_values(ascending=False).rank(ascending=False, method="first")
)
Result with both kinds of ranking -- I took just the first 5 rows from your sample data:
In [93]: df
Out[93]:
account_id plan_id policy_group_nbr plan_type split_eff_date rank rank2
3 470809 2644045 801790 401(k) 2022-01-05 1.0 2.0
2 470809 2644045 801790 401(k) 2022-01-18 2.0 1.0
4 470809 2738854 801789 401(k) 2022-01-18 1.0 1.0
0 470804 8739131 Conversion732 Onsite Medical Center 2022-01-19 1.0 2.0
1 470804 8739131 Conversion732 Onsite Medical Center 2022-01-21 2.0 1.0
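If the end goal is just to keep the most recent split_eff_date row per group, you don't strictly need a rank column at all; a minimal sketch (assuming split_eff_date is, or has been converted to, a datetime so it sorts chronologically):
df["split_eff_date"] = pd.to_datetime(df["split_eff_date"])
latest = (
    df.sort_values("split_eff_date", ascending=False)
      .drop_duplicates(
          subset=["account_id", "plan_id", "policy_group_nbr", "plan_type"],
          keep="first",
      )
)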
I am trying to create a relationship between two related data frames, but there is no key linking them. Here is the layout of my problem:
The first data frame that I am using is information about when people entered an amusement park. In this amusement park, people can stay at the park for multiple days. So the structure of this data frame is
id  name            date
0   John Smith      07-01-2020 10:13:24
1   John Smith      07-22-2020 09:47:04
4   Jane Doe        07-22-2020 09:47:04
2   Jane Doe        06-13-2020 13:27:53
3   Thomas Wallace  07-08-2020 11:15:28
So people may visit the park once, or multiple times (assume that name is a unique identifier for people). For the other data frame, the data is what rides they went on during their time at the park. So the structure of this data frame is
name            ride          date
John Smith      Insanity      07-01-2020 13:53:07
John Smith      Bumper Cars   07-01-2020 16:37:29
John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
John Smith      Insanity      07-22-2020 11:44:32
Jane Doe        Bumper Cars   06-13-2020 14:14:41
Jane Doe        Teacups       06-13-2020 17:31:56
Thomas Wallace  Insanity      07-08-2020 13:20:23
With these two data frames, I want to get the id of the visit associated with the rides that they went on during that visit. So the desired output in this example would be
id  name            ride          date
0   John Smith      Insanity      07-01-2020 13:53:07
0   John Smith      Bumper Cars   07-01-2020 16:37:29
0   John Smith      Tilt-A-Whirl  07-02-2020 08:21:18
1   John Smith      Insanity      07-22-2020 11:44:32
2   Jane Doe        Bumper Cars   06-13-2020 14:14:41
2   Jane Doe        Teacups       06-13-2020 17:31:56
3   Thomas Wallace  Insanity      07-08-2020 13:20:23
The way I had thought about approaching this problem is by iterating over the visits and then adding the id to the ride if the name matched, the ride occurred during/after the visit, and the time delta is the smallest difference (using a large initial time delta and then updating the smallest difference to that difference). If those conditions are not met, then just keep the same value. With this process in mind, here is my thought process in code:
rides['min_diff'] = pd.to_timedelta(365, unit='day')
rides['id'] = -1
for index, row in visits.iterrows():
    rides['id'], rides['min_diff'] = np.where(
        (rides['name'] == row['name']) &
        (rides['date'] >= visits['date']) &
        ((rides['date'] - row['date']) < rides['min_diff']),
        (row['id'], rides['date'] - row['date']),
        (rides['id'], rides['min_diff']))
This unfortunately does not execute because of the shapes not matching (as well as trying to assign values across multiple columns, which I am not sure how to do), but this is the general idea. I am not sure how this could be accomplished exactly, so if anyone has a solution, I would appreciate it.
Try with apply() and asof():
df1 = df1.set_index("date").sort_index() #asof requires a sorted index
df2["id"] = df2.apply(lambda x: df1[df1["Name"]==x["Name"]]["id"].asof(x["date"]), axis=1)
>>> df2
Name ride date id
0 John Smith Insanity 2020-07-01 13:53:07 0
1 John Smith Bumper Cars 2020-07-01 16:37:29 0
2 John Smith Tilt-A-Whirl 2020-07-02 08:21:18 0
3 John Smith Insanity 2020-07-22 11:44:32 1
4 Jane Doe Bumper Cars 2020-06-13 14:14:41 2
5 Jane Doe Teacups 2020-06-13 17:31:56 2
6 Thomas Wallace Insanity 2020-07-08 13:20:23 3
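Note that asof works off the index values, so if the date columns start out as strings (as in the sample data), they would most likely need converting to real datetimes first; a small setup sketch under that assumption:
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])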
I think this does what you need. The ids aren't in the order you specified but they do represent visit ids with the logic you requested.
merged = pd.merge(df1, df2, how="right", left_on=['date', 'name'], right_on=['name', 'ride'])[['name_y', 'ride', 'date_y']]
merged['ymd'] = pd.to_datetime(merged.date_y).apply(lambda x: x.strftime('%Y-%m-%d'))
merged['id'] = merged.groupby(['name_y', 'ymd']).ngroup()
merged.drop('ymd', axis=1, inplace=True)
merged.columns = ['name', 'ride', 'date', 'id']
merged.sort_values(by='id', inplace=True)
print(merged)
OUT:
name ride date id
4 Jane Doe Bumper Cars 06-13-2020 14:14:41 0
5 Jane Doe Teacups 06-13-2020 17:31:56 0
0 John Smith Insanity 07-01-2020 13:53:07 1
1 John Smith Bumper Cars 07-01-2020 16:37:29 1
2 John Smith Tilt-A-Whirl 07-02-2020 08:21:18 2
3 John Smith Insanity 07-22-2020 11:44:32 3
6 Thomas Wallace Insanity 07-08-2020 13:20:23 4
I have a dataframe with 2 columns i.e. UserId in integer format and Actors in string format as shown below:
Userid Actors
u1 Tony Ward,Bruce LaBruce,Kevin P. Scott,Ivar Johnson, Naomi Watts, Tony Ward,.......
u2 Tony Ward,Bruce LaBruce,Kevin P. Scott, Luke Wilson, Owen Wilson, Lumi Cavazos,......
It represents actors from all movies watched by a particular user of the platform
I want an output where we have the count of each actor for each user as shown below:
UserId Tony Ward Bruce LaBruce Kevin P. Scott Ivar Johnson Luke Wilson Owen Wilson Lumi Cavazos
u1 2 1 1 1 0 0 0
u2 1 1 1 0 1 1 1
It is something similar to CountVectorizer, I reckon, but I just have nouns here.
Kindly help
Assuming it's a pandas DataFrame, try this: DataFrame.explode transforms each element of a list-like (the result of split) into a row, DataFrame.groupby aggregates the data, and DataFrame.unstack pivots into the required format.
df['Actors'] = df['Actors'].str.replace(r",\s*", ",", regex=True).str.split(",")
(
    df.explode('Actors')
      .groupby(['Userid', 'Actors'])
      .size()
      .unstack()
      .fillna(0)
)
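As a sketch of an alternative, once Actors has been split into lists as above, pd.crosstab on the exploded frame gives the same count matrix:
exploded = df.explode('Actors')
counts = pd.crosstab(exploded['Userid'], exploded['Actors'])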
I'm working with an example dataset:
date name point
0 4/24/2019 Martha 3617138
1 4/25/2019 Martha 3961918
2 4/26/2019 Martha 4774966
3 4/27/2019 Martha 5217946
4 4/24/2019 Alex 62700321
5 4/25/2019 Alex 66721020
6 4/26/2019 Alex 71745138
7 4/27/2019 Alex 88762943
8 4/28/2019 Alex 102772578
9 4/29/2019 Alex 129089274
10 3/1/2019 Josh 1063259
11 3/3/2019 Josh 1063259
12 3/4/2019 Josh 1063259
13 3/5/2019 Josh 1063259
14 3/6/2019 Josh 1063259
and a list of name values
nameslist = ['Martha', 'Alex', 'Josh']
I want to calculate the percent change of all rows, based on the identifier in the name column.
expected output:
name percent change
Martha 30.7
Alex 51.4
Josh 0
I initially tried to iterate through my list and table, add all rows that match the list value, append a list with the calculated change, then move to the next value of my list, but I can't articulate my code properly to make that happen.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
growthlist=[]
temptable=[]
for i in nameslist:
    for j in df:
        temptable.append(df[df['name'].str.match(nameslist[i])])
    length=[]
    growth=temptable[0]-temptable[length-1]
    growthlist.append(i,growth)
but that generates error:
TypeError: list indices must be integers or slices, not str
I also wouldn't mind using .groupby() and .pct_change() to accomplish this goal, but
growth = df.groupby('name').pct_change()
generates a long traceback that ends with:
TypeError: unsupported operand type(s) for /: 'str' and 'float'
Ultimately, I would like to nest this within a function so I could use it on other datasets and be able to choose my column name (the actual datasets I'm working with are not standardized, so the target column names are often different):
def calc_growth(dataset,colname):
but I'm not sure if that's too much to ask for this one question.
Unfortunately, I'm quite lost with this question, so any help would be appreciated. I'm also wondering if a transformation is an easier way to go with this, because at least I will always know the exact location of the two figures I need to calculate, but I don't even know how I would start something like that.
Thanks
You can use apply with the last and first values, accessed through .values, to calculate the percentage change over the whole group:
df.groupby('name',sort=False).apply(lambda x: (x['point'].values[-1] - x['point'].values[0]) / x['point'].values[-1] * 100)\
.reset_index(name='pct change')
name pct change
0 Martha 30.67889165583545363347
1 Alex 51.42871358932579539669
2 Josh 0.00000000000000000000
Explanation
First we use groupby on name which will give us a group (read: a dataframe) based on each unique name:
for _, d in df.groupby('name', sort=False):
print(d, '\n')
date name point
0 2019-04-24 Martha 3617138
1 2019-04-25 Martha 3961918
2 2019-04-26 Martha 4774966
3 2019-04-27 Martha 5217946
date name point
4 2019-04-24 Alex 62700321
5 2019-04-25 Alex 66721020
6 2019-04-26 Alex 71745138
7 2019-04-27 Alex 88762943
8 2019-04-28 Alex 102772578
9 2019-04-29 Alex 129089274
date name point
10 2019-03-01 Josh 1063259
11 2019-03-03 Josh 1063259
12 2019-03-04 Josh 1063259
13 2019-03-05 Josh 1063259
14 2019-03-06 Josh 1063259
Then we apply our own lambda function to each separate group, which performs the following calculation:
percentage change = (point last value - point first value) / point last value * 100
Then we use reset_index to get our name column out of the index, since groupby puts it as index.
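To address the follow-up about reusing this on other datasets, a minimal sketch that wraps the same calculation in the calc_growth signature from the question (the groupcol parameter is an assumption, defaulting to 'name'):
def calc_growth(dataset, colname, groupcol='name'):
    # percent change from the first to the last row of each group,
    # relative to the last value, as in the answer above
    return (dataset.groupby(groupcol, sort=False)
                   .apply(lambda x: (x[colname].values[-1] - x[colname].values[0])
                                    / x[colname].values[-1] * 100)
                   .reset_index(name='pct change'))

calc_growth(df, 'point')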
Assuming there is a fourth column, maybe description, as in:
date name point descr
0 4/24/2019 Martha 3617138 12g of ecg
1 4/25/2019 Martha 3961918 12g of eg
2 4/26/2019 Martha 4774966 43m of grams
3 4/27/2019 Martha 5217946 13cm of dose
4 4/24/2019 Alex 62700321 32m of grams
5 4/25/2019 Alex 66721020 12g of egc
6 4/26/2019 Alex 71745138 43m of grams
7 4/27/2019 Alex 88762943 30cm of dose
8 4/28/2019 Alex 102772578 12g of egc
9 4/29/2019 Alex 129089274 43m of grams
10 3/1/2019 Josh 1063259 13cm of dose
11 3/3/2019 Josh 1063259 12g of eg
12 3/4/2019 Josh 1063259 12g of eg
13 3/5/2019 Josh 1063259 43m of grams
14 3/6/2019 Josh 1063259 43m of grams
Can you re-write the code to
df.groupby('name',sort=False).orderby('descr').apply(lambda x: (x['point'].values[-1] - x['point'].values[0]) / x['point'].values[-1] * 100)\
    .reset_index(name='pct change').reset_index(name='descr')
or what would you think is the right approach to incorporate the description column?
I have defined a simple function to replace missing values in numerical columns with the average of the non-missing values for those columns. The function is syntactically correct and generates correct values. However, the missing values are not getting replaced.
Below is the code snippet:
def fillmissing_with_mean(df1):
    df2 = df1._get_numeric_data()
    for i in range(len(df2.columns)):
        df2[df2.iloc[:,i].isnull()].iloc[:,i] = df2.iloc[:,i].mean()
    return df2
fillmissing_with_mean(df)
The data frame which is passed looks like this:
age gender job name height
NaN F student alice 165.0
26.0 None student john 180.0
NaN M student eric 175.0
58.0 None manager paul NaN
33.0 M engineer julie 171.0
34.0 F scientist peter NaN
You do not need to worry about selecting the numeric columns: when you take the mean, it only applies to the numeric columns, and fillna can accept a pd.Series.
df.fillna(df.mean())
Out[1398]:
age gender job name height
0 37.75 F student alice 165.00
1 26.00 None student john 180.00
2 37.75 M student eric 175.00
3 58.00 None manager paul 172.75
4 33.00 M engineer julie 171.00
5 34.00 F scientist peter 172.75
More Info
df.mean()
Out[1399]:
age 37.75
height 172.75
dtype: float64
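Note that fillna returns a new DataFrame unless the result is assigned back (or inplace=True is passed), and on recent pandas versions mean on a mixed-dtype frame may need numeric_only=True; a small sketch of that variant:
df = df.fillna(df.mean(numeric_only=True))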
This may be what you need. skipna=True by default, but I've included it here explicitly so you know what it's doing.
for col in ['age', 'height']:
    df[col] = df[col].fillna(df[col].mean(skipna=True))
# age gender job name height
# 0 37.75 F student alice 165.00
# 1 26.00 None student john 180.00
# 2 37.75 M student eric 175.00
# 3 58.00 None manager paul 172.75
# 4 33.00 M engineer julie 171.00
# 5 34.00 F scientist peter 172.75
I have two data frames
df1 =
actorID actorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
df2 =
directorID directorName
0 john_lasseter John Lasseter
1 joe_johnston Joe Johnston
2 donald_petrie Donald Petrie
3 forest_whitaker Forest Whitaker
4 charles_shyer Charles Shyer
What I ideally want is a concatenation of these two dataframes, like pd.concat((df1, df2)):
actorID-directorID actorName-directorName
0 annie_potts Annie Potts
1 bill_farmer Bill Farmer
2 don_rickles Don Rickles
3 erik_von_detten Erik von Detten
4 greg-berg Greg Berg
5 john_lasseter John Lasseter
6 joe_johnston Joe Johnston
7 donald_petrie Donald Petrie
8 forest_whitaker Forest Whitaker
9 charles_shyer Charles Shyer
However, I want an easy way to specify that df1.actorName and df2.directorName should be joined together, and likewise actorID / directorID. How can I do this?
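One possible way to do this, as a sketch: give both frames the same column labels before concatenating, e.g. with set_axis (rename works just as well):
cols = ['actorID-directorID', 'actorName-directorName']
combined = pd.concat(
    [df1.set_axis(cols, axis=1), df2.set_axis(cols, axis=1)],
    ignore_index=True,
)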