I have a dataframe like this:
person action_type time
A 4 2014-11-10
A 4 2014-11-15
A 3 2014-11-16
A 1 2014-11-18
A 4 2014-11-19
B 4 2014-11-13
B 2 2014-11-15
B 4 2014-11-19
So I want to add a new column named 'action_4' which represents, for each row, how many times that person had action_type == 4 in the past 7 days (not including the current row).
The result like this:
person action_type time action_4
A 4 2014-11-10 0
A 4 2014-11-15 1
A 3 2014-11-16 2
A 1 2014-11-18 1
A 4 2014-11-19 1
B 4 2014-11-13 0
B 2 2014-11-15 1
B 4 2014-11-19 1
As the shape of my dataframe is 21649900 x 3, please avoid using for ... in ... loops.
Here is my approach.
I think checking an interval based on time (e.g. 7 days) is always very expensive, so it is better to rely on the number of observations. (Actually, recent pandas versions introduced a "time-aware" rolling, but I have no experience with it; a sketch using it is shown after the output below.)
So my approach is, for each person, to force daily frequency and then simply count the number of action 4 occurrences in the last 7 days, excluding today. I have added comments to the code that should make it clear, but feel free to ask for more explanation.
import pandas as pd
from io import StringIO
inp_str = u"""
person action_type time
A 4 2014-11-10
A 4 2014-11-15
A 3 2014-11-16
A 1 2014-11-18
A 4 2014-11-19
B 4 2014-11-13
B 2 2014-11-15
B 4 2014-11-19
"""
or_df = pd.read_csv(StringIO(inp_str), sep = " ").set_index('time')
or_df.index = pd.to_datetime(or_df.index)
# Find first and last date for each person
min_dates = or_df.groupby('person').apply(lambda x: x.index[0])
max_dates = or_df.groupby('person').apply(lambda x: x.index[-1])
# Resample each person to daily frequency so that 1 obs = 1 day
transf_df = pd.concat([el.reindex(pd.date_range(min_dates[pp], max_dates[pp], freq = 'D')) for pp, el in or_df.groupby('person')])
# Forward fill person
transf_df.loc[:, 'person'] = transf_df['person'].ffill()
# Set a null value for action_type (possibly integer so you preserve the column type)
transf_df = transf_df.fillna({'action_type' : -1})
# For each person count the number of action 4, excluding today
result = transf_df.groupby('person').transform(lambda x: x.rolling(7, 1).apply(lambda y: len(y[y==4])).shift(1).fillna(0))
result.columns = ['action_4']
# Bring back to original index
pd.concat([transf_df, result], axis = 1).set_index('person', append = True).loc[or_df.set_index('person', append = True).index, :]
This gives the expected output:
action_type action_4
time person
2014-11-10 A 4.0 0.0
2014-11-15 A 4.0 1.0
2014-11-16 A 3.0 2.0
2014-11-18 A 1.0 1.0
2014-11-19 A 4.0 1.0
2014-11-13 B 4.0 0.0
2014-11-15 B 2.0 1.0
2014-11-19 B 4.0 1.0
I don't take your action_type column into account, but this may help you find the correct answer:
import datetime
import pandas as pd

# assumes df['time'] is already a datetime column
df2 = pd.DataFrame(None, columns=["person", "action_type", "time", "action_4"])
df["action_4"] = 0
for index, table in df.groupby('person'):
    table["action_4"] = table.time.apply(
        lambda x: table[(table.time > (x - datetime.timedelta(days=7))) & (table.time < x)].shape[0])
    df2 = pd.concat([df2, table])
Related
I have a dataframe with ID, date and number columns and would like to create a new column that takes the mean of all numbers for this specific ID BUT only includes the numbers in the mean where date is smaller than the date of this row. How would I do this?
df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
.assign(date = lambda x: pd.to_datetime(x.date))
.assign(mean_no_from_previous_dts = lambda x: x[x.date<??].groupby('ID').number.transform('mean'))
)
this is what i would like to get as output
ID number date mean_no_from_previous_dts
0 1 1 2021-10-19 3.0 = mean(4+1+4)
1 1 4 2021-10-16 2.5 = mean(4+1)
2 1 1 2021-10-16 4.0 = mean(4)
3 1 4 2021-10-15 0.0 = 0 (as it's the first entry for this date and ID - this number doesn't matter, it can be something else)
4 2 2 2021-10-19 5.0 = mean(5)
5 2 5 2021-10-10 0.0 = 0 (as it's the first entry for this date and ID)
so for example the first entry of the column mean_no_from_previous_dts is the mean of (4+1+4): the first 4 comes from the number column of the 2nd row, because 2021-10-16 (the date in the 2nd row) is smaller than 2021-10-19 (the date in the 1st row). The 1 comes from the 3rd row because 2021-10-16 is smaller than 2021-10-19. The second 4 comes from the 4th row because 2021-10-15 is smaller than 2021-10-19. This is for ID = 1; the same applies for ID = 2.
Here is a solution with numpy broadcasting per group:
import numpy as np
import pandas as pd

df = (pd.DataFrame({'ID':['1','1','1','1','2','2'],'number':['1','4','1','4','2','5'],
                    'date':['2021-10-19','2021-10-16','2021-10-16','2021-10-15','2021-10-19','2021-10-10']})
        .assign(date = lambda x: pd.to_datetime(x.date), number = lambda x: x['number'].astype(int))
     )
def f(x):
    arr = x['date'].to_numpy()
    m = arr <= arr[:, None]
    #exclude the row itself - set the diagonal of the mask to False
    np.fill_diagonal(m, False)
    #set values outside the mask to NaN and take the mean ignoring NaNs
    m = np.nanmean(np.where(m, x['number'].to_numpy(), np.nan).astype(float), axis=1)
    #assign to new column
    x['no_of_previous_dts'] = m
    return x
#the earliest date per group has no previous dates, so its NaN is set to 0
df = df.groupby('ID').apply(f).fillna({'no_of_previous_dts':0})
print (df)
ID number date no_of_previous_dts
0 1 1 2021-10-19 3.0
1 1 4 2021-10-16 2.5
2 1 1 2021-10-16 4.0
3 1 4 2021-10-15 0.0
4 2 2 2021-10-19 5.0
5 2 5 2021-10-10 0.0
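For reference, the same pairwise rule can also be written as a self-merge on ID instead of the dense per-group numpy matrix (it still generates every within-group pair, so it is not asymptotically cheaper). A sketch, starting from df as constructed above (with number already cast to int) before the groupby/apply:
# pair every row with every other row of the same ID
tmp = df.reset_index().rename(columns={'index': 'pos'})
pairs = tmp.merge(tmp, on='ID', suffixes=('', '_other'))
# keep "other" rows dated on or before this row, excluding the row itself
pairs = pairs[(pairs['date_other'] <= pairs['date']) & (pairs['pos_other'] != pairs['pos'])]
means = pairs.groupby('pos')['number_other'].mean()
df['no_of_previous_dts'] = means.reindex(df.index, fill_value=0)
The <= comparison plus the exclusion of the row itself gives the same tie handling as the broadcast mask above.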
There is a dataframe like below:
Category  Time (s)
A         1
B         2
B         3
B         3
B         4
B         4
C         5
C         6
C         7
C         8
How can I group this data frame and get the mean 'Time (s)' value of the last X (for example 2) rows of each group?
The output should be like:
Category  Time (s)
A         1
B         4
C         7.5
Try:
out=df.groupby('Category',as_index=False)['Time (s)'].agg(lambda x:x.tail(2).mean())
OR
grouped=df.groupby('Category')['Time (s)']
out=grouped.nth([-1,-2]).groupby(level=0).mean().reset_index()
output of out:
Category Time (s)
0 A 1.0
1 B 4.0
2 C 7.5
Use groupby with tail:
print (df.groupby("Category").apply(lambda d: d.tail(2).mean()))
Time (s)
Category
A 1.0
B 4.0
C 7.5
Try with positional slicing (requires numpy):
import numpy as np
df.groupby('Category')['Time (s)'].apply(lambda x : np.mean(x[-2:]))
Out[7]:
Category
A 1.0
B 4.0
C 7.5
Name: Time (s), dtype: float64
One additional note to address the problem better: in the actual data set there is also a column called store, and the table can be grouped by store, date & product. When I tried the pivot solution and the cartesian product solution they did not work. Is there a solution that could work for 3 grouping columns? Also, the table has millions of rows. (See the sketch after the cartesian product answer below.)
Assuming a data frame with the following format:
import pandas as pd

d = {'product': ['a', 'b', 'c', 'a', 'b'], 'amount': [1, 2, 3, 5, 2],
     'date': ['2020-6-6', '2020-6-6', '2020-6-6', '2020-6-7', '2020-6-7']}
df = pd.DataFrame(data=d)
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
Product c is no longer present on the date 2020-6-7, and I want to be able to calculate things like the percent change or difference in the amount of each product.
For example: df['diff'] = df.groupby('product')['amount'].diff()
But in order for this to work and show, for example, that the difference for c is -3 and -100%, c would need to be present on the next date with the amount set to 0.
This is the results I am looking for:
print(df)
product amount date
0 a 1 2020-6-6
1 b 2 2020-6-6
2 c 3 2020-6-6
3 a 5 2020-6-7
4 b 2 2020-6-7
5 c 0 2020-6-7
Please note this is just a snippet of the data frame; in reality there might be many date periods. I am only looking to fill in the product and amount for the first date after it has been removed, not all dates after.
What is the best way to go about this?
Let us try pivot then unstack
out = df.pivot(index='product', columns='date', values='amount').fillna(0).unstack().reset_index(name='amount')
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
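With the gaps filled in, the diff / percent change from the question can then be computed per product, for example:
out['diff'] = out.groupby('product')['amount'].diff()
out['pct_change'] = out.groupby('product')['amount'].pct_change()
which gives -3.0 and -1.0 (i.e. -100%) for product c on 2020-6-7.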
You could use the complete function from pyjanitor to explicitly expose the missing values and combine with fillna to fill the missing values with 0:
# pip install pyjanitor
# import janitor
df.complete(['date', 'product']).fillna(0)
date product amount
0 2020-6-6 a 1.0
1 2020-6-6 b 2.0
2 2020-6-6 c 3.0
3 2020-6-7 a 5.0
4 2020-6-7 b 2.0
5 2020-6-7 c 0.0
Another way is to create a cartesian product of your products & dates, then join that to your main dataframe to get the missing values.
#df['date'] = pd.to_datetime(df['date'])
#ensure you have a proper datetime object.
s = pd.merge(df[['product']].drop_duplicates().assign(ky=-1),
             df[['date']].drop_duplicates().assign(ky=-1),
             on=['ky']).drop(columns='ky')
df1 = pd.merge(df, s,
               on=['product', 'date'],
               how='outer').fillna(0)
print(df1)
product amount date
0 a 1.0 2020-06-06
1 b 2.0 2020-06-06
2 c 3.0 2020-06-06
3 a 5.0 2020-06-07
4 b 2.0 2020-06-07
5 c 0.0 2020-06-07
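Regarding the follow-up note about three grouping columns (store, date & product): the cartesian-product idea extends, but crossing all three columns independently would also create store/product pairs that never occur. A sketch that keeps only the (store, product) pairs actually present and crosses them with every date (it assumes a store column as described and pandas >= 1.2 for how='cross'):
# existing (store, product) pairs crossed with every date in the data
pairs = df[['store', 'product']].drop_duplicates()
dates = df[['date']].drop_duplicates()
full = pairs.merge(dates, how='cross')
df1 = (full.merge(df, on=['store', 'product', 'date'], how='left')
           .fillna({'amount': 0}))
Note this fills every missing date for a pair with 0, not only the first date after a product disappears, so a further filter would be needed for that exact requirement. For millions of rows this is usually manageable as long as the pairs x dates cross product fits in memory.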
I was wondering if there is an efficient way to add rows to a DataFrame that, e.g., contain the average or a predefined value in case there are not enough rows for a specific value in another column. I guess the description of the problem is not the best, which is why you find an example below:
Say we have the Dataframe
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
And we want to have 2 rows for each client A, B, C, D, no matter whether these 2 rows already exist or not. So for clients A and B we can just keep the existing rows. For C we want to add a row with Client = C, NumberOfProducts = average of the existing rows = 9, and an ID that is not of interest (so we could set it to the smallest existing ID - 1 = 0; any other value, even NaN, would also be possible). For client D there is not a single existing row, so we want to add 2 rows where NumberOfProducts is equal to the constant 2.5. The output should then look like this:
df1
Client NumberOfProducts ID
A 1 2
A 5 1
B 1 2
B 6 1
C 9 1
C 9 0
D 2.5 NaN
D 2.5 NaN
What I have done so far is to loop through the dataframe and add rows where necessary. Since this is pretty inefficient any better solution would be highly appreciated.
Use:
clients = ['A','B','C','D']
N = 2
#test only values from list and also filter only 2 rows for each client if necessary
df = df[df['Client'].isin(clients)].groupby('Client').head(N)
#create helper counter and reshape by unstack
df1 = df.set_index(['Client',df.groupby('Client').cumcount()]).unstack()
#set first if only 1 row per client - replace second NumberOfProducts by first
df1[('NumberOfProducts',1)] = df1[('NumberOfProducts',1)].fillna(df1[('NumberOfProducts',0)])
# ... replace second ID by first subtracted by 1
df1[('ID',1)] = df1[('ID',1)].fillna(df1[('ID',0)] - 1)
#add missing clients by reindex
df1 = df1.reindex(clients)
#replace NumberOfProducts by constant 2.5
df1['NumberOfProducts'] = df1['NumberOfProducts'].fillna(2.5)
print (df1)
NumberOfProducts ID
0 1 0 1
Client
A 1.0 5.0 2.0 1.0
B 1.0 6.0 2.0 1.0
C 9.0 9.0 1.0 0.0
D 2.5 2.5 NaN NaN
#last reshape to original
df2 = df1.stack().reset_index(level=1, drop=True).reset_index()
print (df2)
Client NumberOfProducts ID
0 A 1.0 2.0
1 A 5.0 1.0
2 B 1.0 2.0
3 B 6.0 1.0
4 C 9.0 1.0
5 C 9.0 0.0
6 D 2.5 NaN
7 D 2.5 NaN
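The same idea can also be written with a reindex on a MultiIndex of (Client, row number); a sketch with the same clients list, N and fill rules as above, which should reproduce the df2 output:
full_idx = pd.MultiIndex.from_product([clients, range(N)], names=['Client', 'obs'])

out = (df.groupby('Client').head(N)
         .assign(obs=lambda d: d.groupby('Client').cumcount())
         .set_index(['Client', 'obs'])
         .reindex(full_idx))

# NumberOfProducts: per-client mean where a row is missing, constant 2.5 for fully missing clients
out['NumberOfProducts'] = (out.groupby('Client')['NumberOfProducts']
                              .transform(lambda s: s.fillna(s.mean()))
                              .fillna(2.5))
# ID: smallest existing ID per client minus 1 (stays NaN for fully missing clients)
out['ID'] = out['ID'].fillna(out.groupby('Client')['ID'].transform('min') - 1)

out = out.reset_index(level='obs', drop=True).reset_index()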
I'm trying to calculate a rolling aggregate rate for a time series.
The way to think about the data is that it is the results of a bunch of multigame series against different teams. We don't know who wins a series until its last game. I'm trying to calculate the win rate as it evolves against each of the opposing teams.
series_id date opposing_team won_series
1 1/1/2000 a 0
1 1/3/2000 a 0
1 1/5/2000 a 1
2 1/4/2000 a 0
2 1/7/2000 a 0
2 1/9/2000 a 0
3 1/6/2000 b 0
Becomes:
series_id date opposing_team won_series percent_win_against_team
1 1/1/2000 a 0 NA
1 1/3/2000 a 0 NA
1 1/5/2000 a 1 100
2 1/4/2000 a 0 NA
2 1/7/2000 a 0 100
2 1/9/2000 a 0 50
3 1/6/2000 b 0 0
I still don't feel like I understand the rule for how you decide when a series is over. Is 3 over? Why is it NA, I would have thought 1/3rd. Still, here is a way to keep track of the number of completed series and (a) win rate.
Define 26472215table.csv:
series_id,date,opposing_team,won_series
1,1/1/2000,a,0
1,1/3/2000,a,0
1,1/5/2000,a,1
2,1/4/2000,a,0
2,1/7/2000,a,0
2,1/9/2000,a,0
3,1/6/2000,b,0
Code:
import pandas as pd
import numpy as np
df = pd.read_csv('26472215table.csv')
grp2 = df.groupby(['series_id'])
sr = grp2['date'].max()
sr.name = 'LastGame'
df2 = df.join( sr, on=['series_id'], how='left')
df2 = df2.sort_values('date')
df2['series_comp'] = df2['date'] == df2['LastGame']
df2['running_sr_cnt'] = df2.groupby(['opposing_team'])['series_comp'].cumsum()
df2['running_win_cnt'] = df2.groupby(['opposing_team'])['won_series'].cumsum()
winrate = lambda x: x['running_win_cnt'] / x['running_sr_cnt'] if x['running_sr_cnt'] > 0 else None
df2['winrate'] = df2[['running_sr_cnt', 'running_win_cnt']].apply(winrate, axis = 1)
Results df2[['date', 'winrate']]:
date winrate
0 1/1/2000 NaN
1 1/3/2000 NaN
3 1/4/2000 NaN
2 1/5/2000 1.0
6 1/6/2000 0.0
4 1/7/2000 1.0
5 1/9/2000 0.5
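If the percent_win_against_team column from the question is wanted, the rate can simply be scaled:
df2['percent_win_against_team'] = df2['winrate'] * 100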