Find how many consecutive days have a specific value in pandas - python

I have the following pandas dataframe:
Date Value
2019-01-01 0
2019-01-02 0
2019-01-03 0
2019-01-04 0
2019-01-05 1
2019-01-06 1
2019-01-10 1
2019-01-11 0
2019-01-12 0
2019-01-13 0
2019-01-14 0
I would like to have a start date and end date of each group of consecutive days that have value equal to 0 and obtain something like this:
Start Date  End Date    N Days
2019-01-01  2019-01-04  4
2019-01-11  2019-01-14  4

Create the subgroups with cumsum, then groupby with agg:
s = df.Value.ne(0).cumsum()
out = df[df.Value.eq(0)].groupby(s).Date.agg(['first','last','count'])
out
Out[295]:
first last count
Value
0 2019-01-01 2019-01-04 4
3 2019-01-11 2019-01-14 4
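To match the column names in the desired output, a possible follow-up (not part of the original answer) is to rename the aggregated columns and drop the group key:
out = out.rename(columns={'first': 'Start Date', 'last': 'End Date', 'count': 'N Days'}).reset_index(drop=True)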
Update
To also start a new group whenever the dates themselves are not consecutive (a gap of more than one day), add the date difference to the grouping key:
s = (df.Value.ne(0) | df.Date.diff().dt.days.ne(1)).cumsum()
out = df[df.Value.eq(0)].groupby(s).Date.agg(['first','last','count'])
out
Out[306]:
first last count
1 2019-01-01 2019-01-04 4
4 2019-01-11 2019-01-14 4
5 2020-01-01 2020-01-01 1
Input data
Date Value
0 2019-01-01 0
1 2019-01-02 0
2 2019-01-03 0
3 2019-01-04 0
4 2019-01-05 1
5 2019-01-06 1
6 2019-01-10 1
7 2019-01-11 0
8 2019-01-12 0
9 2019-01-13 0
10 2019-01-14 0
11 2020-01-01 0
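For reference, a minimal sketch to rebuild this input (an assumption: Date must be datetime64 so df.Date.diff().dt.days in the update works):
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
        '2019-01-05', '2019-01-06', '2019-01-10', '2019-01-11',
        '2019-01-12', '2019-01-13', '2019-01-14', '2020-01-01']),
    'Value': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0],
})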

Related

Cumulative groupby with condition on datetime pandas

I need to calculate cumulative sums for different columns in a pandas dataframe based on a column playerId and a datetime column. My dataframe looks like this:
eventId playerId goal shot header dateutc
0 0 100 0 1 0 2020-11-08 17:00:00
1 1 100 0 0 1 2020-11-08 17:00:00
2 2 100 1 1 0 2020-11-08 17:00:00
3 3 200 0 1 0 2020-11-08 17:00:00
4 4 100 1 0 1 2020-11-15 17:00:00
5 5 100 1 1 0 2020-11-15 17:00:00
6 6 200 1 1 0 2020-11-15 17:00:00
Now I need to calculate, for each player, cumulative sums covering the current date and all previous dates, so my final dataframe will look like this:
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
Hopefully someone can help me :)
First drop eventId so it is not included in the sum (it is numeric), then aggregate with sum and take the cumulative sum per player:
df1 = (df.drop('eventId', axis=1)
         .groupby(['playerId','dateutc'], sort=False)
         .sum()
         .groupby(level=0, sort=False)
         .cumsum()
         .reset_index())
print (df1)
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
If you need to restrict the processing to specific columns:
df1 = (df.groupby(['playerId','dateutc'], sort=False)[['goal', 'shot', 'header']]
         .sum()
         .groupby(level=0, sort=False)
         .cumsum()
         .reset_index())
Try:
out = df.groupby(['playerId', 'dateutc'], sort=False)[['goal', 'shot', 'header']].sum()
out = out.groupby(level='playerId').cumsum().reset_index()
Output:
>>> out
playerId dateutc goal shot header
0 100 2020-11-08 17:00:00 1 2 1
1 200 2020-11-08 17:00:00 0 1 0
2 100 2020-11-15 17:00:00 3 3 2
3 200 2020-11-15 17:00:00 1 2 0
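Both versions accumulate in the order the rows appear (sort=False keeps order of appearance), so they assume each player's rows are already chronological. If that is not guaranteed, a possible extra step (an assumption, not part of the original answers) is to sort first:
df = df.sort_values(['playerId', 'dateutc'])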

Python - Sum of column values between 2 dates

I am trying to create a new column in my dataframe:
Let X be a variable number of days.
   Date                 Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00  5
1  2019-01-01 15:00:00  4
2  2019-01-05 11:00:00  1
3  2019-01-12 12:00:00  3
4  2019-01-15 15:00:00  2
5  2019-02-04 18:00:00  7
For each row, I need to sum up units sold + all the units sold in the last 10 days (letting x = 10 days)
Desired Result:
   Date                 Units Sold  Total Units sold in the last X days
0  2019-01-01 19:00:00  5           5
1  2019-01-01 15:00:00  4           9
2  2019-01-05 11:00:00  1           10
3  2019-01-12 12:00:00  3           4
4  2019-01-15 15:00:00  2           6
5  2019-02-04 18:00:00  7           7
I have used the .rolling(window=) method before with integer periods, and I think something like
df = df.rolling("10D").sum()
could help, but I can't get the syntax right!
Please help!
Try:
df["Total Units sold in the last 10 days"] = df.rolling(on="Date", window="10D", closed="both").sum()["Units Sold"]
print(df)
Prints:
Date Units Sold Total Units sold in the last 10 days
0 2019-01-01 5 5.0
1 2019-01-01 4 9.0
2 2019-01-05 1 10.0
3 2019-01-12 3 4.0
4 2019-01-15 2 6.0
5 2019-02-04 7 7.0
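Note that a time-based window like "10D" requires the on column to be datetime-like and monotonically increasing. If your Date column is stored as strings or is unsorted, a hedged preparation step (an assumption about your data, not shown in the original) would be:
df["Date"] = pd.to_datetime(df["Date"])  # assumption: Date is not already datetime
df = df.sort_values("Date")              # rolling on a column requires monotonic order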

Choosing the minimum distance part 2

This question is already here, but now I have added an extra part to the previous question.
I have the following dataframe:
data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00',
                       '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
I have been trying to calculate, for each 15-minute time window, the shortest time difference between the orders and the window's midpoint (00:07:30). For example, I compute the difference between the first order '2019-01-01 00:00:00' and 00:07:30, and between the second order '2019-01-01 00:11:00' and 00:07:30, and keep only the order that is closer to 00:07:30 for each day.
I did the following:
t = 0
s = pd.Time.datetime.fromtimestamp(t).strftime('%H:%M:%S')
#x = '00:00:00'
#y = '00:15:00'
tw = 900
g = 0
a = []
for k in range(30):
    begin = pd.Timestamp(s).to_pydatetime()
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    for i in range(1, len(df_data)):
        #g +=1
        if x <= df_data.iat[i-1, 4] <= y:
            half_time = (pd.Timestamp(y) - pd.Timstamp(x).to_pydatetime()) / 2
            half_window = (half_time + pd.Timestamp(x).to_pydatetime()).strftime('%H:%M:%S')
            for l in df_data['day_order']:
                for k in df_data['time_order']:
                    if l == k.strftime('%Y-%m-%d')
                        distance1 = abs(pd.Timestamp(df_data.iat[i-1, 4].to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                        distance2 = abs(pd.Timestamp(df_data.iat[i, 4].to_pydatetime() - pd.Timestamp(half_window).to_pydatetime())
                        if distance1 < distance2:
                            d = distance1
                        else:
                            d = distance2
                        a.append(d.seconds)
So the expected result for the first day is abs(00:11:00 - 00:07:30) = 00:03:30, which is less than abs(00:00:00 - 00:07:30) = 00:07:30. In other words, I want to keep only the shorter time distance (00:03:30) and ignore the first order of that day, and do this for each day. I tried it with my code above, but it doesn't work. Any idea would be very appreciated. Thanks in advance.
Update:
I have added an extra step to the code above so that the time window shifts by one minute each iteration, e.g. from 00:00:00-00:15:00 to 00:01:00-00:16:00, and within each window I look for the shortest distance as previously described, ignoring times that do not fall inside that window. I tried this procedure for 30 minutes and it worked with your suggested solution; however, it also picked up times that do not belong to that period.
import pandas as pd
import datetime

data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00',
                       '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')

x = '00:00:00'
y = '00:15:00'
s = '00:00:00'
tw = 900
begin = pd.Timestamp(s).to_pydatetime()
for k in range(10):  # 10 times shift will happen
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    print('\n========\n', x, y)
    diff = (pd.Timedelta(y) - pd.Timedelta(x)) / 2
    df_data2 = df_data[(last >= pd.to_datetime(df_data['time'])) & (pd.to_datetime(df_data['time']) > begin1)].copy()
    #print(df_data2)
    df_data2['diff'] = abs(df_data2['time'] - (diff + pd.Timedelta(x)))
    mins = df_data2.groupby('day_order').apply(lambda z: z[z['diff'] == min(z['diff'])])
    mins.reset_index(drop=True, inplace=True)
    print(mins)
Output after first 10 shifts:
========
00:00:00 00:15:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:03:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:00:30
========
00:01:00 00:16:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:04:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:02:00 00:17:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:05:30
2 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:05:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:03:00 00:18:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:04:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:04:00 00:19:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:03:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:05:00 00:20:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:02:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:02:30
========
00:06:00 00:21:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:03:30
========
00:07:00 00:22:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:04:30
========
00:08:00 00:23:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:04:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:05:30
========
00:09:00 00:24:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:05:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:06:30
Now, in some iterations the output contained four rows. Looking at the diff column you can see that pairs of rows can end up with the same time difference, because a positive and a negative difference are treated the same (we take the absolute value).
For example, in the second iteration above (00:01:00 to 00:16:00) there are two entries for 2019-01-03:
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
This is because both of their differences are 00:01:30.
The midpoint of this range is 00:01:00 + 00:07:30 = 00:08:30:
00:07:00 <----(- 01:30)---- 00:08:30 ----(+ 01:30)----> 00:10:00
And that's why both orders were displayed.
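For a single fixed window, a more compact variant is sketched below (not from the original answer; it assumes the same df_data as above, treats both window endpoints as inclusive, and on ties keeps only the first row because idxmin picks a single index per day):
lo, hi = pd.Timedelta('00:00:00'), pd.Timedelta('00:15:00')
mid = lo + (hi - lo) / 2                                  # window midpoint, 00:07:30
td = pd.to_timedelta(df_data['time'])                     # time of day as Timedelta
in_window = df_data[td.between(lo, hi)].copy()
in_window['dist'] = (pd.to_timedelta(in_window['time']) - mid).abs()
closest = in_window.loc[in_window.groupby('day_order')['dist'].idxmin()]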

Drop overlapping periods less than 6 months in pandas dataframe

I have the following Pandas dataframe and, for each customer, I want to drop the rows whose Date is less than 6 months after the last kept date. For example, for the customer with ID 1 I want to keep only the dates 2017-07-01, 2018-01-01, and 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]
        res.append(stRow)
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
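For reference, a minimal sketch to reproduce this result (assumptions: Date is parsed to datetime so the pd.DateOffset comparison works, and each customer's dates are already sorted ascending):
import pandas as pd

df = pd.DataFrame({
    'Customer_ID': [1]*12 + [2]*6,
    'Date': pd.to_datetime([
        '2017-07-01', '2017-08-01', '2017-09-01', '2017-10-01', '2017-11-01', '2017-12-01',
        '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-06-01', '2018-08-01',
        '2018-11-01', '2019-02-01', '2019-03-01', '2019-05-01', '2020-02-01', '2020-05-01']),
})
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)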

Get cumulative mean among groups in Python

I am trying to get a cumulative mean in python among different groups.
I have data as follows:
id date value
1 2019-01-01 2
1 2019-01-02 8
1 2019-01-04 3
1 2019-01-08 4
1 2019-01-10 12
1 2019-01-13 6
2 2019-01-01 4
2 2019-01-03 2
2 2019-01-04 3
2 2019-01-06 6
2 2019-01-11 1
The output I'm trying to get looks like this:
id date value cumulative_avg
1 2019-01-01 2 NaN
1 2019-01-02 8 2
1 2019-01-04 3 5
1 2019-01-08 4 4.33
1 2019-01-10 12 4.25
1 2019-01-13 6 5.8
2 2019-01-01 4 NaN
2 2019-01-03 2 4
2 2019-01-04 3 3
2 2019-01-06 6 3
2 2019-01-11 1 3.75
I need the cumulative average to restart with each new id.
I can get a variation of what I'm looking for with a single id; for example, if the data set only had the rows where id = 1, then I could use:
df['cumulative_avg'] = df['value'].expanding().mean().shift(1)
I try to add a group by into it but I get an error:
df['cumulative_avg'] = df.groupby('id')['value'].expanding().mean().shift(1)
TypeError: incompatible index of inserted column with frame index
Also tried:
df.set_index(['account']
ValueError: cannot handle a non-unique multi-index!
The actual data I have has millions of rows, and thousands of unique ids'. Any help with a speedy/efficient way to do this would be appreciated.
For many groups this will perform better because it ditches the apply. Take the cumulative sum divided by the cumulative count, subtracting off the current value so each row's average covers only the previous rows (the analog of the shifted expanding mean). Conveniently, pandas interprets the 0/0 at each group's first row as NaN.
gp = df.groupby('id')['value']
df['cum_avg'] = (gp.cumsum() - df['value'])/gp.cumcount()
id date value cum_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
After a groupby you can't simply chain methods like that: in your example the shift is no longer applied per group, so you would not get the expected result, and the group-wise result does not align with the original index, so you can't assign it as a column directly. So you can do:
df['cumulative_avg'] = df.groupby('id')['value'].apply(lambda x: x.expanding().mean().shift(1))
print (df)
id date value cumulative_avg
0 1 2019-01-01 2 NaN
1 1 2019-01-02 8 2.000000
2 1 2019-01-04 3 5.000000
3 1 2019-01-08 4 4.333333
4 1 2019-01-10 12 4.250000
5 1 2019-01-13 6 5.800000
6 2 2019-01-01 4 NaN
7 2 2019-01-03 2 4.000000
8 2 2019-01-04 3 3.000000
9 2 2019-01-06 6 3.000000
10 2 2019-01-11 1 3.750000
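If you prefer to keep the expanding/shift formulation while avoiding apply, a hedged alternative (assuming the original index is unique) is to shift within each group and then drop the group level so the result aligns back to the frame:
exp_mean = df.groupby('id')['value'].expanding().mean()                 # MultiIndex: (id, original index)
df['cumulative_avg'] = exp_mean.groupby(level=0).shift(1).droplevel(0)  # shift per id, realign by original index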
