How to add a column with conditions on another DataFrame? - python

Motivation: I want to check whether customers have bought anything during the 2 months since their first purchase (retention).
Resources: I have 2 tables:
Table 1: buy date, ID and purchase code
Table 2: ID and first purchase date
Sample data:
Table 1
Date ID Purchase_code
2019-01-01 1 AQT1
2019-01-02 1 TRR1
2019-03-01 1 QTD1
2019-02-01 2 IGJ5
2019-02-05 2 ILW2
2019-02-20 2 WET2
2019-02-28 2 POY6
Table 2
ID First_Buy_Date
1 2019-01-01
2 2019-02-01
The expected result:
ID First_Buy_Date Retention Frequency_buy_at_first_month
1 2019-01-01 1 2
2 2019-02-01 0 4
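For reference, the two tables can be rebuilt as DataFrames like this (a sketch based on the sample data above; the answer below refers to them as df1 and df2):
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2019-01-01', '2019-01-02', '2019-03-01', '2019-02-01',
             '2019-02-05', '2019-02-20', '2019-02-28'],
    'ID': [1, 1, 1, 2, 2, 2, 2],
    'Purchase_code': ['AQT1', 'TRR1', 'QTD1', 'IGJ5', 'ILW2', 'WET2', 'POY6'],
})
df2 = pd.DataFrame({
    'ID': [1, 2],
    'First_Buy_Date': ['2019-01-01', '2019-02-01'],
})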

First convert the columns to datetimes if necessary, then add the first purchase dates with DataFrame.merge and create the new columns by comparing with Series.le or Series.gt and converting the booleans to integers:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['First_Buy_Date'] = pd.to_datetime(df2['First_Buy_Date'])

df = df1.merge(df2, on='ID', how='left')
df['Retention'] = (df['First_Buy_Date'].add(pd.DateOffset(months=2))
                                       .le(df['Date'])
                                       .astype(int))
df['Frequency_buy_at_first_month'] = (df['First_Buy_Date'].add(pd.DateOffset(months=1))
                                                          .gt(df['Date'])
                                                          .astype(int))
Finally, aggregate with GroupBy.agg, using max for Retention (if only a 0 or 1 output is needed) and sum to count the purchases in the first month:
df1 = (df.groupby(['ID','First_Buy_Date'], as_index=False)
         .agg({'Retention':'max', 'Frequency_buy_at_first_month':'sum'}))
print (df1)
ID First_Buy_Date Retention Frequency_buy_at_first_month
0 1 2019-01-01 1 2
1 2 2019-02-01 0 4

Related

How to get smallest index in dataframe after using groupby

If the create_date does not fall within the period between from_date and to_date, I want to keep only the record with the largest index per 'indicator' group; records whose create_date does fall between from_date and to_date are kept as they are.
from_date = '2022-01-01'
to_date = '2022-04-10'
indicator create_date
0 A 2022-01-03
1 B 2021-12-30
2 B 2021-07-11
3 C 2021-02-10
4 C 2021-09-08
5 C 2021-07-24
6 C 2021-01-30
Here is the result I want:
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
I've been looking for a solution for a long time, but I only found "How to get the index of the smallest value", and I can't find a way to compare index numbers.
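For reference, a sketch rebuilding the sample frame shown above:
import pandas as pd

df = pd.DataFrame({
    'indicator': ['A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'create_date': ['2022-01-03', '2021-12-30', '2021-07-11', '2021-02-10',
                    '2021-09-08', '2021-07-24', '2021-01-30'],
})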
You can create a helper column with the index values, take the maximal index per indicator with DataFrameGroupBy.idxmax, and then select the rows with DataFrame.loc:
df2 = df.loc[df.assign(tmp=df.index).groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
EDIT: If you need the maximal index only among the rows that do not fall between from_date and to_date, use boolean indexing and then join the pieces with concat:
from_date = '2022-01-01'
to_date = '2022-04-10'
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df2 = df.loc[df.assign(tmp=df.index)[~m].groupby('indicator')['tmp'].idxmax()]
print (df2)
indicator create_date
2 B 2021-07-11
6 C 2021-01-30
df = pd.concat([df[m], df2])
print (df)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30
You can try
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df_ = (df[~m].groupby('indicator', as_index=False)
             .apply(lambda g: g.loc[[max(g.index)]])
             .droplevel(level=0))
out = pd.concat([df[m], df_], axis=0).sort_index()
print(out)
indicator create_date
0 A 2022-01-03
2 B 2021-07-11
6 C 2021-01-30

Count cases by dates and save it in a new dataframe

In one data frame (called X) I have the Patient_admitted_id, Date and Hospital_ID of patients who tested positive for covid (shown below). I want to generate a separate data frame (called Y) with calendar dates, the total number of covid cases per date, and the cumulative cases.
I don't know how to generate the Cases column.
X data frame:
import pandas as pd

data = {'Patient_admitted_id': [214321, 224323, 3234234, 23423],
        'Date': ['2021-01-22', '2021-01-22', '2021-01-22', '2021-01-20'],  # just an example created here; the real X data frame contains proper date values generated with datetime
        'Hospital_ID': ['1', '2', '3', '2'],
        }
X = pd.DataFrame(data, columns=['Patient_admitted_id', 'Date', 'Hospital_ID'])
X
X
Patient_admitted_id Date Hospital_ID
0 214321 2021-01-22 1
1 224323 2021-01-22 2
2 3234234 2021-01-22 3
3 23423 2021-01-20 2
...
Desirable Y data frame:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
...
Use DataFrame.resample by day, count with Resampler.size, and add Series.cumsum for the cumulative counts:
X['Date']= pd.to_datetime(X['Date'])
df = X.resample('D', on='Date').size().reset_index(name='Cases')
df['Cumulative'] = df['Cases'].cumsum()
print (df)
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4
You can use groupby on the Date column and call size to get the count for each date; you can then simply call cumsum on Cases to get the desired output.
out = X.groupby('Date').size().to_frame('Cases').reset_index()
out['Cumulative'] = out['Cases'].cumsum()
The out variable holds the desired dataframe.
OUTPUT:
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-22 3 4
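Note that groupby only produces rows for dates that actually occur in the data, which is why 2021-01-21 is missing above. A sketch of one way to fill in the missing calendar days afterwards (reusing the out variable from the snippet above):
out['Date'] = pd.to_datetime(out['Date'])
full = pd.date_range(out['Date'].min(), out['Date'].max(), freq='D')
out = (out.set_index('Date')
          .reindex(full, fill_value=0)   # insert missing days with 0 cases
          .rename_axis('Date')
          .reset_index())
out['Cumulative'] = out['Cases'].cumsum()  # recompute after inserting rows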
Just adding a solution with pd.Grouper:
X['Date']= pd.to_datetime(X['Date'])
df = X.groupby(pd.Grouper(key='Date', freq='D')).size().reset_index(name='Cases')
df['Cumulative'] = df.Cases.cumsum()
df
Output
Date Cases Cumulative
0 2021-01-20 1 1
1 2021-01-21 0 1
2 2021-01-22 3 4

Find max number of consecutive days

The code below groups the dataframe by a key.
df = pd.DataFrame(data, columns=['id', 'date', 'cnt'])
df['date'] = pd.to_datetime(df['date'])

for c_id, group in df.groupby('id'):
    print(c_id)
    print(group)
This produces a result like this:
id date cnt
1 2019-01-02 1
1 2019-01-03 2
1 2019-01-04 3
1 2019-01-05 1
1 2019-01-06 2
1 2019-01-07 1
id date cnt
2 2019-01-01 478964
2 2019-01-02 749249
2 2019-01-03 1144842
2 2019-01-04 1540846
2 2019-01-05 1444918
2 2019-01-06 1624770
2 2019-01-07 2227589
id date cnt
3 2019-01-01 41776
3 2019-01-02 82322
3 2019-01-03 93467
3 2019-01-04 56674
3 2019-01-05 47606
3 2019-01-06 41448
3 2019-01-07 145827
id date cnt
4 2019-01-01 41776
4 2019-01-02 82322
4 2019-01-06 93467
4 2019-01-07 56674
From this result, I want to find the maximum consecutive number of days for each id. So id 1 would be 6, id 2 would be 7, id 3 would be 7, and id 4 would be 2.
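The data variable is not shown in the question; here is a sketch reconstructing it from the printed groups above:
import pandas as pd

data = [
    (1, '2019-01-02', 1), (1, '2019-01-03', 2), (1, '2019-01-04', 3),
    (1, '2019-01-05', 1), (1, '2019-01-06', 2), (1, '2019-01-07', 1),
    (2, '2019-01-01', 478964), (2, '2019-01-02', 749249), (2, '2019-01-03', 1144842),
    (2, '2019-01-04', 1540846), (2, '2019-01-05', 1444918), (2, '2019-01-06', 1624770),
    (2, '2019-01-07', 2227589),
    (3, '2019-01-01', 41776), (3, '2019-01-02', 82322), (3, '2019-01-03', 93467),
    (3, '2019-01-04', 56674), (3, '2019-01-05', 47606), (3, '2019-01-06', 41448),
    (3, '2019-01-07', 145827),
    (4, '2019-01-01', 41776), (4, '2019-01-02', 82322), (4, '2019-01-06', 93467),
    (4, '2019-01-07', 56674),
]
df = pd.DataFrame(data, columns=['id', 'date', 'cnt'])
df['date'] = pd.to_datetime(df['date'])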
Use:
m = (df.assign(date=pd.to_datetime(df['date']))   # if necessary convert, else drop this step
       .groupby('id')['date']
       .diff()
       .gt(pd.Timedelta('1D'))
       .cumsum())

df.groupby(['id', m]).size().groupby(level='id').max()
Output
id
1 6
2 7
3 7
4 2
dtype: int64
To get your result, run:
result = df.groupby('id').apply(lambda grp: grp.groupby(
    (grp.date.shift() + pd.Timedelta(1, 'd') != grp.date).cumsum())
    .id.count().max())
Details:
df.groupby('id') - First level grouping (by id).
grp.groupby(...) - Second level grouping (by sequences of consecutive dates).
grp.date.shift() - Date from the previous row.
+ pd.Timedelta(1, 'd') - Shifted by 1 day.
!= grp.date - Not equal to the current date. The result is a Series with True at the start of each sequence of consecutive dates.
cumsum() - Convert the above (bool) Series to a Series of int - consecutive numbers of the above sequences, starting from 1.
id - Take the id column from each (second level) group.
count() - Compute the size of the current group.
.max() - Take the max of the sizes of the second level groups (within the current level 1 group).
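For intuition, a short sketch of what the inner grouping key looks like for id 4 (dates 01-01, 01-02, 01-06, 01-07), assuming the reconstructed df from above:
grp = df[df['id'] == 4]
key = (grp.date.shift() + pd.Timedelta(1, 'd') != grp.date).cumsum()
print(key.tolist())  # [1, 1, 2, 2] -> two runs of consecutive days, the longest has length 2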

Agg across columns based on multiple conditions

I would like to count all product_id values depending on the following conditions:
shared_product==1
exclusive_product_storeA ==1
exclusive_product_storeB ==1
Main df
date product_id shared_product exclusive_product_storeA exclusive_product_storeB
2019-01-01 34434 1 0 0
2019-01-01 43546 1 0 0
2019-01-01 53288 1 0 0
2019-01-01 23444 0 1 0
2019-01-01 25344 0 1 0
2019-01-01 42344 0 0 1
Output DF
date count_shared_product count_exclusive_product_storeA count_exclusive_product_storeB
2019-01-01 3 2 1
This is what I have tried, but it does not give me the desired output df:
df.pivot_table(index=['shared_product', 'exclusive_product_storeA', 'exclusive_product_storeB'],
               aggfunc=['count'], values='product_id')
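For reference, a sketch rebuilding the main df from the table above:
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01'] * 6,
    'product_id': [34434, 43546, 53288, 23444, 25344, 42344],
    'shared_product': [1, 1, 1, 0, 0, 0],
    'exclusive_product_storeA': [0, 0, 0, 1, 1, 0],
    'exclusive_product_storeB': [0, 0, 0, 0, 0, 1],
})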
The idea here is to exclude rows that have a value of 0, group by date and the resulting column, and finally unstack to get your final result:
(
    df.drop("product_id", axis=1)
      .set_index("date")
      .stack()
      .loc[lambda x: x == 1]
      .groupby(level=[0, 1])
      .sum()
      .unstack()
      .rename_axis(index=None)
)
exclusive_product_storeA exclusive_product_storeB shared_product
2019-01-01 2 1 3
A shorter path would be to exclude product_id, group by date and sum the columns:
df.drop("product_id", axis=1).groupby("date").sum().rename_axis(None)

Count number of rows for each ID within 1 year

I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year, and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations, but there must be a better way!
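For reference, a sketch rebuilding the sample frame shown above:
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2016', '05/01/2016', '10/05/2017',
             '05/05/2014', '07/09/2014', '12/08/2017'],
    'ID': ['a', 'a', 'a', 'b', 'b', 'b'],
})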
I think you need between with boolean indexing to filter first, and then groupby with size for aggregation.
The outputs are concatenated, and reindex adds the missing rows filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (dayfirst=True because the first number is the day)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

now = pd.Timestamp.today()
print (now)
oneyearbeforenow = now - pd.offsets.DateOffset(years=1)
oneyearafternow = now + pd.offsets.DateOffset(years=1)
#first filter by the date window, then count per ID
a = df[df['Date'].between(oneyearbeforenow, now)].groupby('ID').size()
b = df[df['Date'].between(now, oneyearafternow)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
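The concatenated result keeps the default column labels 0 and 1; if you want descriptive names, a small optional addition (the names here are just illustrative):
df1.columns = ['entries_last_year', 'entries_next_year']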
EDIT:
If you need to compare each date against a per-group reference date (here the group's last date) plus or minus a one-year offset, use a custom function and sum the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(),
                         (x < x.iat[-1] + offs).sum()], index=['last', 'next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
Using resample takes care of the missing in-between years; see year 2015.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the columns:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
