Finding the streak of days in Python

So I have a set of 50 dates; I have specified 7 here as an example:
df4 = pd.DataFrame({'CreatedDate': ['09-08-16 0:00', '22-08-16 0:00', '23-08-16 0:00', '28-08-16 0:00', '29-08-16 0:00', '30-08-16 0:00', '31-08-16 0:00']})
df4["CreatedDate"] = pd.to_datetime(df4.CreatedDate)
df4["DAY"] = df4.CreatedDate.dt.day
How do I find the runs of consecutive days and count them into the streak ranges [1-3], [4-7], [8-15], [>=16]?
Streak  Count
1-3     2      # (9) and (22,23) are streaks in range [1-3]
4-7     1      # (28,29,30,31) is a streak in range [4-7]
8-15    0
>=16    0
Let's say the product (a pen) was launched two years ago and we are taking the dataset for the last 10 months from today. I want to find out whether people are buying that pen continuously: if they buy it for 1, 2 or 3 consecutive days, the streak is counted in [1-3]; if they buy it continuously for 4, 5, 6 or 7 days, the streak is counted in [4-7]; and so on for the other ranges.
I don't know which condition to specify to match these criteria.

I believe you need:
df4 = pd.DataFrame({'CreatedDate':['09-08-16 0:00','22-08-16 0:00','23-08-16 0:00','28-08-16 0:00','29-08-16 0:00','30-08-16 0:00','31-08-16 0:00']})
df4["CreatedDate"] = pd.to_datetime(df4.CreatedDate)
df4 = df4.sort_values("CreatedDate")
count = df4.groupby((df4["CreatedDate"].diff().dt.days > 1).cumsum()).size()
print (count)
CreatedDate
0 2
1 4
2 1
dtype: int64
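To make the grouping key concrete: a new streak id opens whenever the gap to the previous (sorted) date exceeds one day, so every run of consecutive days shares one id. A minimal sketch of the intermediates, reusing the frame above:
gap = df4["CreatedDate"].diff().dt.days  # NaN for the first row, then day gaps between sorted dates
new_streak = gap > 1                     # True wherever a streak is broken
streak_id = new_streak.cumsum()          # constant within each run of consecutive days
print(pd.DataFrame({"date": df4["CreatedDate"], "gap": gap, "streak_id": streak_id}))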
a = (pd.cut(count, bins=[0, 3, 7, 15, 31], labels=['1-3', '4-7', '8-15', '>=16'])
       .value_counts()
       .sort_index()
       .rename_axis('Streak')
       .reset_index(name='Count'))
print (a)
Streak Count
0 1-3 2
1 4-7 1
2 8-15 0
3 >=16 0

Here's an attempt; the binning is the same as @jezrael's (except the last bin, which I'm not sure should be capped at 31... is there a way to have open intervals with pd.cut?)
import pandas as pd
df = pd.DataFrame({ "CreatedDate": ['09-08-16 0:00','22-08-16 0:00','23-08-16 0:00','28-08-16 0:00','29-08-16 0:00','30-08-16 0:00','31-08-16 0:00']})
df["CreatedDate"] = pd.to_datetime(df.CreatedDate)
# sort by date
df = df.sort_values("CreatedDate")
# group consecutive dates
oneday = pd.Timedelta("1 day")
df["groups"] = (df.diff() > oneday).cumsum()
counts = df.groupby("groups").count()["CreatedDate"]
# bin
streaks = (pd.cut(counts, bins=[0, 3, 7, 15, 1000000], labels=['1-3', '4-7', '8-15', '>=16'])
             .value_counts()
             .rename_axis("streak")
             .reset_index(name="count"))
print(streaks)
streak count
0 1-3 2
1 4-7 1
2 >=16 0
3 8-15 0
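In answer to the open-interval question above: pd.cut does accept an infinite bin edge, which avoids the arbitrary 1000000 cap. A small hedged variant of the binning step:
import numpy as np
# np.inf as the last edge makes the final bin open-ended, so '>=16' has no upper cap
streaks_open = pd.cut(counts, bins=[0, 3, 7, 15, np.inf], labels=['1-3', '4-7', '8-15', '>=16'])
print(streaks_open.value_counts())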

Related

How to find the total play time of each week for a given date in Python?

I have a data frame that looks like the one below:
k = {'user_id': [1,1,1,1,1,2,2,2,3,3,3,3,3,4,4,4,5,5],
     'created': ['2/09/2021','2/10/2021','2/16/2021','2/17/2021','3/09/2021','3/10/2021','3/18/2021','3/19/2021',
                 '2/19/2021','2/20/2021','2/26/2021','2/27/2021','3/09/2021','2/10/2021','2/18/2021','3/19/2021',
                 '3/24/2021','3/30/2021'],
     'stop_time': [11,12,13,14,15,25,26,27,6,7,8,9,10,11,12,13,25,26],
     'play_time': [10,11,12,13,14,24,25,26,5,6,7,8,9,10,11,13,24,25]}
df = pd.DataFrame(data=k)
df['created'] = pd.to_datetime(df['created'], format='%m/%d/%Y')
df['total_play_time'] = df['stop_time'] - df['play_time']
Now we need to use the first date for each user_id as that user's first-week start date: for example, '2/09/2021' is the first-week start date for user_id 1 and '3/10/2021' is the first-week start date for user_id 2.
We need to sum the total play time of each week per user_id, continuing week by week up to the current date (for example, if the report is run today, it has to give the sum for each week until today), and produce a result like the one below:
ID week1 week2 week3 week4 week5 week6 week7 week8 week9 week10 week11 week12
1 3 2 0 0 0 0 0 0 0 0 0 0
2 1 2 0 0 0 0 0
import numpy as np

# Get a list of unique ids
user_ids = df["user_id"].unique()
# Get the start date of each user
start_dates = [min(df[df["user_id"] == usr]["created"]) for usr in user_ids]
# We will subtract the start date to have a common baseline for all users
df["time_since_start"] = None
for i, usr in enumerate(user_ids):
    df.loc[df["user_id"] == usr, "time_since_start"] = df.loc[df["user_id"] == usr, "created"] - start_dates[i]
# we got Timedelta objects, but the raw integer value is easier to bin
df['t'] = [x.value for x in df["time_since_start"]]
# get the maximum time any user has ever ..played? to make our bins
max_time = df["time_since_start"].max()
# convert it from nanoseconds to weeks, rounding up (8.64e13 ns per day)
max_weeks = int(np.ceil(max_time.value / 8.64e+13 / 7))
# make the bins and add corresponding readable labels
bins = [pd.Timedelta(weeks=wk).value for wk in range(max_weeks + 1)]
labels = ["week " + str(wk + 1) for wk in range(max_weeks)]
# bin the data and aggregate the result
df["bin"] = pd.cut(df['t'], bins, labels=labels)
df.groupby(['user_id','bin'])['total_play_time'].sum()
user_id  bin
1        week 1    2
         week 2    1
         week 3    0
         week 4    1
         week 5    0
         week 6    0
2        week 1    0
         week 2    2
         week 3    0
         week 4    0
         week 5    0
         week 6    0
3        week 1    2
         week 2    1
         week 3    1
         week 4    0
         week 5    0
         week 6    0
4        week 1    0
         week 2    1
         week 3    0
         week 4    0
         week 5    0
         week 6    0
5        week 1    1
         week 2    0
         week 3    0
         week 4    0
         week 5    0
         week 6    0
Name: total_play_time, dtype: int64
You can then reshape the dataframe to a wide format if you really need to.
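For instance, a hedged sketch of that reshape, unstacking the bin level into columns (one column per week label):
wide = (df.groupby(['user_id', 'bin'])['total_play_time']
          .sum()
          .unstack('bin', fill_value=0))
print(wide)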

How to create N groups based on conditions in columns?

I need to create groups using two columns. For example, I took shop_id and week. Here is the df:
shop_id week
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 3 2
6 1 5
Imagine that each group is some promo which took place in each shop consecutively (week by week). So my attempt was to sort, shift by 1 to get the last week, build booleans, and then iterate over them, incrementing the group each time the condition is not met:
test_df = pd.DataFrame({'shop_id': [1,1,1,2,2,3,1], 'week': [1,2,3,1,2,2,5]})

def createGroups(df, shop_id, week, group):
    '''Create groups where the shop_id is the same and the weeks are consecutive'''
    periods = []
    period = 0
    # sorting to create chronological order
    df = df.sort_values(by=[shop_id, week], ignore_index=True)
    last_week = df[week].shift(+1) == df[week] - 1
    last_shop = df[shop_id].shift(+1) == df[shop_id]
    # here I iterate over the booleans and increment the group by 1
    # if the shop is different or the last period is more than 1 week ago
    for p, s in zip(last_week, last_shop):
        if p and s:
            periods.append(period)
        else:
            period += 1
            periods.append(period)
    df[group] = periods
    return df

createGroups(test_df, 'shop_id', 'week', 'promo')
And I get the grouping I need:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
3 1 5 2
4 2 1 3
5 2 2 3
6 3 2 4
However, the function seems to be overkill. Any ideas on how to get the same result without a for-loop, using native pandas functions? I saw .ngroup() in the docs but have no idea how to apply it to my case. Even better would be to vectorise it somehow, but I don't know how to achieve this :(
First we want to identify the promotions (runs of consecutive weeks), then use groupby().ngroup() to enumerate the promotions:
df = df.sort_values('shop_id')
s = df['week'].diff().ne(1).groupby(df['shop_id']).cumsum()
df['promo'] = df.groupby(['shop_id',s]).ngroup() + 1
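As a hedged trace of what the intermediate key s holds on the sample data (assuming the frame is sorted by shop_id and week):
df = df.sort_values(['shop_id', 'week'])
breaks = df['week'].diff().ne(1)            # True where the week gap is not exactly 1
s = breaks.groupby(df['shop_id']).cumsum()  # per-shop running promo counter
print(pd.DataFrame({'shop_id': df['shop_id'], 'week': df['week'], 's': s}))
# s is 1, 1, 1, 2 for shop 1, then 1, 1 for shop 2 and 1 for shop 3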
Update: This is based on your solution:
df = df.sort_values(['shop_id','week'])
s = df[['shop_id', 'week']]
df['promo'] = (s['shop_id'].ne(s['shop_id'].shift()) |
               s['week'].diff().ne(1)).cumsum()
Output:
shop_id week promo
0 1 1 1
1 1 2 1
2 1 3 1
6 1 5 2
3 2 1 3
4 2 2 3
5 3 2 4

Transition count within a column from one value to another value in Pandas

I have the below dataframe.
df = pd.DataFrame({'Player': [1,1,1,1,2,2,2,3,3,3,4,5], "Team": ['X','X','X','Y','X','X','Y','X','X','Y','X','Y'],'Month': [1,1,1,2,1,1,2,2,2,3,4,5]})
Input:
Player Team Month
0 1 X 1
1 1 X 1
2 1 X 1
3 1 Y 2
4 2 X 1
5 2 X 1
6 2 Y 2
7 3 X 2
8 3 X 2
9 3 Y 3
10 4 X 4
11 5 Y 5
The data frame consists of Players, the team they belong to and the month. You can have multiple entries for the same player on a given month. Some players move from Team X to Team Y on a particular month, some don’t move at all and some directly join Team Y.
I am looking for the total count of people who moved from Team X to Team Y in a given month, and the output should be like below, i.e. the month of transition and the total count of transitions. In this case, Players 1 and 2 moved in Month 2 and Player 3 moved in Month 3. Players 4 and 5 didn't move.
Expected Output:
Month Count
0 2 2
1 3 1
I am able to get this done in the below fashion.
###find all the people who moved from Team X to Y###
s1 = df.drop_duplicates(['Team','Player'])
s2 = s1.groupby('Player').size().reset_index(name='counts')
s2 = s2[s2['counts']>1]
####Tie them to the original df so that I can find the month in which they moved###
s3 = s1.groupby("Player").last().reset_index()
s4 = s3[s3['Player'].isin(s2['Player'])]
s5 = s4.groupby('Month').size().reset_index(name='Count')
I am pretty sure there is a better way than what I did here. Just looking for some help to make it more efficient.
First pick out the entries which (1) change team but (2) are not the first row of a player. Then compute the size grouped by each month.
mask = df["Team"].shift().ne(df["Team"]) & df["Player"].shift().eq(df["Player"])
out = df[mask].groupby("Month").size()
Output:
print(out) # a Series
Month
2 2
3 1
dtype: int64
# series to dataframe (optional)
out.to_frame(name="count").reset_index()
Month count
0 2 2
1 3 1
Edit: the first groupby in mask is redundant so removed.
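To make the two conditions in the mask concrete, here is a hedged trace on the sample frame:
changed_team = df["Team"].shift().ne(df["Team"])     # True at every team change (and at row 0)
same_player = df["Player"].shift().eq(df["Player"])  # False on each player's first row
print(df.assign(changed_team=changed_team, same_player=same_player,
                moved=changed_team & same_player))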
An option is to self merge on Player, Month and check for the players that move:
s = df.drop_duplicates()
t = (s.merge(s.assign(Month=s.Month+1), on=['Player', 'Month'], how='right')
      .assign(Count=lambda x: x.Team_x.eq('Y') & x.Team_y.eq('X'))
      .groupby('Month', as_index=False)['Count'].sum()
    )
print(t.loc[t['Count'] != 0])
Output:
Month Count
0 2 2
1 3 1

Consolidate time periods data in pandas

How do I consolidate time periods data in Python pandas?
I want to manipulate data from
person start end
1 2001-1-8 2002-2-14
1 2002-2-14 2003-3-1
2 2001-1-5 2002-2-16
2 2002-2-17 2003-3-9
to
person start end
1 2001-1-8 2002-3-1
2 2001-1-5 2002-3-9
I want to check first whether the previous end and the new start are within one day of each other. If not, keep the original rows; if so, consolidate them.
df.sort_values(["person", "start", "end"], inplace=True)
def condense(df):
df['prev_end'] = df["end"].shift(1)
df['dont_condense'] = (abs(df['prev_end'] - df['start']) > timedelta(days=1))
df["group"] = df['dont_condense'].fillna(False).cumsum()
return df.groupby("group").apply(lambda x: pd.Series({"person": x.iloc[0].person,
"start": x.iloc[0].start,
"end": x.iloc[-1].end}))
df.groupby("person").apply(condense).reset_index(drop=True)
You can use the following if each group contains only 2 rows, the allowed gap is 1 or 0 days, and all data are sorted:
print (df)
person start end
0 1 2001-1-8 2002-2-14
1 1 2002-2-14 2003-3-1
2 2 2001-1-5 2002-2-16
3 2 2002-2-17 2003-3-9
4 3 2001-1-2 2002-2-14
5 3 2002-2-17 2003-3-10
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)
def f(x):
    # if only a 0-day difference should be merged, use
    # a = (x['start'] - x['end'].shift()) == pd.Timedelta(days=0)
    a = (x['start'] - x['end'].shift()).isin([pd.Timedelta(days=1), pd.Timedelta(days=0)])
    if a.any():
        x.end = x['end'].shift(-1)
    return (x)
df1 = df.groupby('person').apply(f).dropna().reset_index(drop=True)
print (df1)
person start end
0 1 2001-01-08 2003-03-01
1 2 2001-01-05 2003-03-09
2 3 2001-01-02 2002-02-14
3 3 2002-02-17 2003-03-10
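For the general case (any number of rows per person), here is a vectorised sketch of the same idea, hedged as an illustration rather than a drop-in replacement; it assumes the frame above with start and end already parsed as datetimes:
df = df.sort_values(['person', 'start'])
# a new group starts whenever the gap to the previous period of the same person exceeds one day
prev_end = df.groupby('person')['end'].shift()
grp = ((df['start'] - prev_end) > pd.Timedelta(days=1)).cumsum().rename('grp')
out = (df.groupby(['person', grp])
         .agg({'start': 'min', 'end': 'max'})
         .reset_index()
         .drop(columns='grp'))
print(out)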

Calculating the number of consecutive periods that match a condition

Given the data in the Date and Close columns, I'd like to calculate the values in the ConsecPeriodsUp column. This column gives the number of consecutive two-week periods that the Close value has increased.
Date Close UpThisPeriod ConsecPeriodsUp
23/12/2015 3 1 1
16/12/2015 2 0 0
09/12/2015 1 0 0
02/12/2015 3 1 1
25/11/2015 2 0 0
18/11/2015 1 0 0
11/11/2015 7 1 3
04/11/2015 6 1 3
28/10/2015 5 1 2
21/10/2015 4 1 2
14/10/2015 3 1 1
07/10/2015 2 NaN NaN
30/09/2015 1 NaN NaN
I've written the following code to produce the UpThisPeriod column, but I can't see how I would aggregate that to get the ConsecPeriodsUp column, or whether there is a single calculation I'm missing that would do it.
import pandas as pd

def up_over_period(s):
    return s[0] >= s[-1]

df = pd.read_csv("test_data.csv")
period = 3  # one more than the number of weeks
df['UpThisPeriod'] = pd.rolling_apply(
    df['Close'],
    window=period,
    func=up_over_period,
).shift(-period + 1)
This can be done by adapting the groupby, shift and cumsum trick described in the Pandas Cookbook under "Grouping like Python's itertools.groupby". The main change is dividing by the period length minus 1 and then using the ceil function to round up to the next integer.
from math import ceil
...
s = df['UpThisPeriod'][::-1]
df['ConsecPeriodsUp'] = (s.groupby((s != s.shift()).cumsum()).cumsum() / (period - 1)).apply(ceil)
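The run-labelling step is the part worth internalising: (s != s.shift()).cumsum() assigns one id per run of equal values, and a grouped cumsum then counts within each run. A hedged toy example of just that step:
import pandas as pd

s = pd.Series([1, 1, 0, 1, 1, 1, 0])
run_id = (s != s.shift()).cumsum()   # 1, 1, 2, 3, 3, 3, 4: a new id at every value change
within = s.groupby(run_id).cumsum()  # 1, 2, 0, 1, 2, 3, 0: running total inside each run
print(pd.DataFrame({'s': s, 'run_id': run_id, 'within': within}))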
