How to get median between dates in a datetime series pandas - python

Given that I have a pandas dataframe:
waterflow_id created_at
0 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-20 13:19:21.430816+00:00
1 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
2 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-21 13:19:21.430819+00:00
3 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
4 5ff86588-594e-458f-9d29-385ee2e128e4 2022-03-22 13:19:21.430821+00:00
How do I get the median of days between created_at so that I can have a dataframe of days in between waterflow ids having something like:
waterflow days_median
1 0
2 4
3 6
4 7
5 10
Basically here, waterflow represents the unique occurrence of waterflow_id's
With the latest answer I tried
meddata = waterflow_df.groupby("waterflow_id")['created_at'].apply(lambda s: s.diff().median())
print(meddata)
And I recieved:
waterflow_id
0788a658-06d9-4b61-9ac4-2728ace02a86 0 days
1f8752f8-f667-44ec-84b9-acad02d384c0 0 days
2655b525-8b2c-4a53-abdc-5208cb95d96e 0 days
8d3cd7e3-900c-4996-b202-f66eb41ac37b 0 days
9d02b939-f295-4d36-8f72-e9984a52dbd9 0 days
d8d8fb70-d755-48c3-8c19-8032864719da 0 days
dc1da5e1-6974-4145-a0d8-615e08506ebf 0 days
f39366f5-c9e2-415a-baec-530bb8bd2f07 0 days
Whats weird is that I have dates spanning up to 6 months.

The output is unclear, but IIUC, you could use a GroupBy.agg:
from itertools import count
c = count(1)
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df
.groupby('waterflow_id')
.agg(**{'waterflow': ('waterflow_id', lambda s: next(c)),
'days_median': ('created_at', lambda s: s.diff().median()
.total_seconds()//(3600*24))
})
)
or using factorize to number the groups:
df['created_at'] = pd.to_datetime(df['created_at'])
(df.assign(waterflow_id=df['waterflow_id'].factorize()[0]+1)
.groupby('waterflow_id')
.agg(**{'waterflow': ('waterflow_id', 'first'),
'days_median': ('created_at', lambda s: s.diff().median()
.total_seconds()//(3600*24))
})
)
output:
waterflow days_median
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 1 0.0
Simple version with just the median:
df['created_at'] = pd.to_datetime(df['created_at'])
out = (df.groupby('waterflow_id')['created_at']
.apply(lambda s: s.diff().median()
.total_seconds()//(3600*24))
)
output:
waterflow_id
5ff86588-594e-458f-9d29-385ee2e128e4 0.0
Name: created_at, dtype: float64

Related

Select rows based on column condition and date time

From below picture, we see that serial C was failed on 3rd January, and A failed on 5th January within 6 days period. I am interested to take samples for 3 days before the failure of each serial number.
My codes:
from pickle import TRUE
import pandas as pd
import numpy as np
import datetime
from datetime import date, timedelta
df = pd.read_csv('https://gist.githubusercontent.com/JishanAhmed2019/e464ca4da5c871428ca9ed9264467aa0/raw/da3921c1953fefbc66dddc3ce238dac53142dba8/failure.csv',sep='\t')
df['date'] = pd.to_datetime(df['date'])
#df.drop(columns=df.columns[0], axis=1, inplace=True)
df = df.sort_values(by="date")
d = datetime.timedelta(days = 3)
df_fail_date = df[df['failure']==1].groupby(['serial_number'])['date'].min()
df_fail_date = df_fail_date - d
df_fail_date
I was not not able to move further to sample my data. I am interested to get the following data, that is 3 days before the failure. Serial C had only 1 day available before failure so I wanna keep that one as well. It would be nice to add duration column to count the days before failure occurred. I appreciate your suggestions. Thanks!
Expected output dataframe:
You can use a groupby.rolling to get the dates/serials to keep, then merge to select:
df['date'] = pd.to_datetime(df['date'])
N = 3
m = (df.sort_values(by='date')
.loc[::-1]
.groupby('serial_number', group_keys=False)
.rolling(f'{N+1}d', on='date')
['failure'].max().eq(1)
.iloc[::-1]
)
out = df.merge(m[m], left_on=['serial_number', 'date'],
right_index=True, how='right')
Output:
date serial_number failure_x smart_5_raw smart_187_raw failure_y
2 2014-01-01 C 0 0 80 True
8 2014-01-02 C 0 0 200 True
4 2014-01-03 C 1 0 120 True
7 2014-01-02 A 0 0 180 True
5 2014-01-03 A 0 0 140 True
9 2014-01-04 A 0 0 280 True
14 2014-01-05 A 1 0 400 True
Another possible solution:
N = 4
df['date'] = pd.to_datetime(df['date'])
(df[df.groupby('serial_number')['failure'].transform(sum) == 1]
.sort_values(by=['serial_number', 'date'])
.groupby('serial_number')
.apply(lambda g:
g.assign(duration=1+np.arange(min(0, min(N, len(g))-len(g)), min(N, len(g)))))
.loc[lambda x: x['duration'] > 0]
.reset_index(drop=True))
Output:
date serial_number failure smart_5_raw smart_187_raw duration
0 2014-01-02 A 0 0 180 1
1 2014-01-03 A 0 0 140 2
2 2014-01-04 A 0 0 280 3
3 2014-01-05 A 1 0 400 4
4 2014-01-01 C 0 0 80 1
5 2014-01-02 C 0 0 200 2
6 2014-01-03 C 1 0 120 3

Calculate some date questions

I have a problem. I want to answer some question (see below). Unfortunately I got an error ValueError: Wrong number of items passed 0, placement implies 1. How could I determine the questions?
When was the last interactivity how many days ago (from today)?
How many orders has the customer placed ?
When was the fastest interactivity ?
When was the shortest interactivity ?
How much was the average of the interactivity ? (For that I calculated the days)
From 2021-02-10 to 2021-02-22 = 12
From 2021-02-22 to 2021-03-18 = 24
From 2021-03-18 to 2021-03-22 = 4
From 2021-03-22 to 2021-09-07 = 109
From 2021-09-07 to 2022-01-18 = 193
-------
68,4 (average days)
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
Code
import pandas as pd
d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2],
'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
'2021-02-10', '2021-09-07', None, '2022-01-18', '2021-05-17']
}
df = pd.DataFrame(data=d)
display(df)
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
#df = df['fromDate'].groupby('customerId', group_keys=False)
df = df.sort_values(['customerId','fromDate'],ascending=False)#.groupby('customerId')
df_new = pd.DataFrame()
df_new['average'] = df.groupby('customerId').mean()
[OUT] AttributeError: 'DataFrameGroupBy' object has no attribute 'groupby'
df_new = pd.DataFrame()
df_new['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate'].max()
[OUT] TypeError: '>=' not supported between instances of 'datetime.date' and 'float'
What I want
customerId lastInteractivity howMany shortest Longest Average
2 371 1 None None None
1 125 5 4 193 68,4
# shortes, longest, average is None because customer with the Id 2 had only 1 date
Use:
#converting to datetimes
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
#for correct add missing dates is sorting ascending by both columns
df = df.sort_values(['customerId','fromDate'])
#new column per customerId
df['lastInteractivity'] = pd.to_datetime('today').normalize() - df['fromDate']
#added missing dates per customerId, also count removed missing rows with NaNs
df = (df.dropna(subset=['fromDate'])
.set_index('fromDate')
.groupby('customerId')['lastInteractivity']
.apply(lambda x: x.asfreq('d'))
.reset_index())
#count how many missing dates
m = df['lastInteractivity'].notna()
df1 = (df[~m].groupby(['customerId', m.cumsum()])['customerId']
.size()
.add(1)
.reset_index(name='Count'))
print (df1)
customerId lastInteractivity Count
0 1 1 12
1 1 2 24
2 1 3 4
3 1 4 169
4 1 5 133
df1 = df1.groupby('customerId').agg(howMany=('Count','size'),
shortest=('Count','min'),
Longest=('Count','max'),
Average=('Count','mean'))
#get last lastInteractivity and joined df1
df = (df.groupby('customerId')['lastInteractivity']
.last()
.dt.days
.sort_index(ascending=False)
.to_frame()
.join(df1)
.reset_index())
print (df)
customerId lastInteractivity howMany shortest Longest Average
0 2 371 NaN NaN NaN NaN
1 1 125 5.0 4.0 169.0 68.4

Agg across columns based on multiple conditions

I would like to count all product_id depending on following condition:
shared_product==1
exclusive_product_storeA ==1
exclusive_product_storeB ==1
Main df
date product_id shared_product exclusive_product_storeA exclusive_product_storeB
2019-01-01 34434 1 0 0
2019-01-01 43546 1 0 0
2019-01-01 53288 1 0 0
2019-01-01 23444 0 1 0
2019-01-01 25344 0 1 0
2019-01-01 42344 0 0 1
Output DF
date count_shared_product count_exclusive_product_storeA count_exclusive_product_storeB
2019-01-01 3 2 1
This is what I have tried - but this does not give me the desired output df:
df.pivot_table(index=['shared_product','exclusive_product_storeA','exclusive_product_storeB'],aggfunc=['count'],values='product_id')
The idea here is to exclude rows that have a value of 0, groupby date and the resulting column, and finally unstack to get your final result
(
df.drop("product_id", axis=1)
.set_index("date")
.stack()
.loc[lambda x: x == 1]
.groupby(level=[0, 1])
.sum()
.unstack()
.rename_axis(index=None)
)
exclusive_product_storeA exclusive_product_storeB shared_product
2019-01-01 2 1 3
A shorter path would be to exclude the product_id, groupby date and sum the columns :
df.drop("product_id", axis=1).groupby("date").sum().rename_axis(None)

How to add a column with conditions on another Dataframe?

Motivation: I want to check if customers have bought anything during 2 months since first purchase. (retention)
Resources: I have 2 tables:
Buy date, ID and purchase code
Id and first day he bought
Sample data:
Table1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_login_date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
First convert columns to datetimes if necessary, then add first days by DataFrame.merge and create new columns by compare with Series.le or Series.gt and converting to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])
df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
.le(df['Date'])
.astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
.gt(df['Date'])
.astype(int))
Last aggregate by GroupBy.agg and max (if need only 0 or 1 output) and sum for count values:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
.agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing for filter first and then groupby and aggregate size.
Outputs are concated and add reindex for add missing rows filled by 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.datetime.today()
print (now)
oneyarbeforenow = now - pd.offsets.DateOffset(years=1)
oneyarafternow = now + pd.offsets.DateOffset(years=1)
#first filter
a = df[df['Date'].between(oneyarbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyarafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
EDIT:
If need compare each date by first date add or subtract year offset per group need custom function with condition and sum Trues:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of missing inbetween years. See. year-2015
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only year in columns
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1

Categories