I'm looking to count the number of interactions in the last 12 months for each unique ID. The 12-month window is measured backwards from the latest date within each ID group.
ID date
001 2022-02-01
002 2018-03-26
001 2021-08-05
001 2019-05-01
002 2019-02-01
003 2018-07-01
Output is something like the below.
ID Last_12_Months_Count
001 2
002 2
003 1
How can I achieve this in Pandas? Is there a function that counts the rows falling within 12 months of the latest date per group?
Use:
# True for dates falling within 1 year of each ID's most recent date
m = df['date'].gt(df.groupby('ID')['date'].transform('max')
                    .sub(pd.offsets.DateOffset(years=1)))
df1 = df[m]
df1 = df1.groupby('ID').size().reset_index(name='Last_12_Months_Count')
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
Or:
df1 = (df.groupby('ID')['date']
         .agg(lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum())
         .reset_index(name='Last_12_Months_Count'))
print (df1)
ID Last_12_Months_Count
0 1 2
1 2 2
2 3 1
To count across multiple columns, use named aggregation:
df['date1'] = df['date']
f = lambda x: x.gt(x.max() - pd.offsets.DateOffset(years=1)).sum()
df1 = (df.groupby('ID')
         .agg(Last_12_Months_Count_date=('date', f),
              Last_12_Months_Count_date1=('date1', f))
         .reset_index())
print (df1)
ID Last_12_Months_Count_date Last_12_Months_Count_date1
0 1 2 2
1 2 2 2
2 3 1 1
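For reference, a minimal setup that reproduces the outputs above (a sketch, assuming integer IDs and string dates that need converting first):

import pandas as pd

df = pd.DataFrame({
    'ID':   [1, 2, 1, 1, 2, 3],
    'date': ['2022-02-01', '2018-03-26', '2021-08-05',
             '2019-05-01', '2019-02-01', '2018-07-01'],
})
# the .gt()/.max() comparisons above only work on real datetimes
df['date'] = pd.to_datetime(df['date'])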
I have 2 Dataframes like this:
ID Date1
1 2018-02-01
2 2019-03-01
3 2005-09-02
4 2021-11-09
And then I have this Dataframe:
ID Date2
4 2003-02-01
4 2004-03-11
3 1998-02-11
2 1999-02-11
1 2000-09-25
What I want to do is find the difference in days between the dates that share the same ID across the two DataFrames, using this function:
from datetime import datetime

def days_between(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    return abs((d2 - d1).days)
and then summing up the differences for the corresponding ID.
The expected output would be:
where Date is the summed-up difference in days for the corresponding ID:
ID Date
1 6338
2 7323
3 2760
4 13308
Solution if df1.ID has no duplicates (only df2.ID does): map each ID to its Date1 with Series.map into a new column, subtract with Series.sub, convert the timedeltas to days with Series.dt.days, and finally aggregate with sum:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2['Date'] = df2['ID'].map(df1.set_index('ID')['Date1']).sub(df2['Date2']).dt.days
print (df2)
ID Date2 Date
0 4 2003-02-01 6856
1 4 2004-03-11 6452
2 3 1998-02-11 2760
3 2 1999-02-11 7323
4 1 2000-09-25 6338
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
Or use DataFrame.merge instead of map:
df1['Date1'] = pd.to_datetime(df1['Date1'])
df2['Date2'] = pd.to_datetime(df2['Date2'])
df2 = df1.merge(df2, on='ID')
df2['Date'] = df2['Date1'].sub(df2['Date2']).dt.days
print (df2)
ID Date1 Date2 Date
0 1 2018-02-01 2000-09-25 6338
1 2 2019-03-01 1999-02-11 7323
2 3 2005-09-02 1998-02-11 2760
3 4 2021-11-09 2003-02-01 6856
4 4 2021-11-09 2004-03-11 6452
df3 = df2.groupby('ID', as_index=False)['Date'].sum()
print (df3)
ID Date
0 1 6338
1 2 7323
2 3 2760
3 4 13308
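If you would rather keep the days_between helper from the question instead of vectorized subtraction, it can be applied row-wise after a merge. A sketch, assuming Date1 and Date2 are still plain '%Y-%m-%d' strings (before any pd.to_datetime conversion) and days_between is defined as in the question:

merged = df1.merge(df2, on='ID')   # pair every Date2 row with its ID's Date1
merged['Date'] = merged.apply(lambda r: days_between(r['Date1'], r['Date2']), axis=1)
df3 = merged.groupby('ID', as_index=False)['Date'].sum()

This gives the same totals, just more slowly, since apply loops row by row instead of working on whole columns.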
Does this work:
d = pd.merge(df1, df2)
d[['Date1','Date2']] = d[['Date1','Date2']].apply(pd.to_datetime, format = '%Y-%m-%d')
d['Date'] = d['Date1'] - d['Date2']
d.groupby('ID')['Date'].sum().reset_index()
ID Date
0 1 6338 days
1 2 7323 days
2 3 2760 days
3 4 13308 days
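To match the plain integer output from the question rather than Timedeltas, take .dt.days before summing; a small tweak to the snippet above:

d['Date'] = (d['Date1'] - d['Date2']).dt.days   # integer days instead of Timedeltas
d.groupby('ID')['Date'].sum().reset_index()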
I created this dataframe and calculated the price gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace that 0 with the difference to the next lower price in the same group?
for example:
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:5
neighboorhood:a, bed:1, bath:1, price:3
neighboorhood:a, bed:1, bath:1, price:2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data = [
    [1,'a',1,1,5], [2,'a',1,1,5], [3,'a',1,1,4], [4,'a',1,1,2],
    [5,'b',1,2,6], [6,'b',1,2,6], [7,'b',1,2,3],
]
df = pd.DataFrame(data, columns=['id', 'neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you can first remove duplicates across all the columns used in the groupby plus price, compute diff on that filtered data to create the new column, and finally merge it back to the original with a left join:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
Or you can use a lambda function that back-fills the 0 values within each group, which avoids wrong output for one-row groups (values bleeding in from other groups):
import numpy as np

df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
id neighborhoodname beds baths price difference_price
0 1 a 1 1 5 1.0
1 2 a 1 1 5 1.0
2 3 a 1 1 4 2.0
3 4 a 1 1 2 NaN
4 5 b 1 2 6 3.0
5 6 b 1 2 6 3.0
6 7 b 1 2 3 NaN
I already got an answer to this question in R; I'm wondering how this can be implemented in Python.
Let's say we have a pandas DataFrame like this:
import pandas as pd
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
which displays like this:
2019Q1 2019Q2 2019Q3
0 1 2 3
How can I transform it to looks like this:
Year Quarter Value
2019 1 1
2019 2 2
2019 3 3
Use Series.str.split with expand=True to create a MultiIndex in the columns, then reshape with DataFrame.unstack, and finish the cleanup with Series.reset_index and Series.rename_axis:
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
d.columns = d.columns.str.split('Q', expand=True)
df = (d.unstack(0)
       .reset_index(level=2, drop=True)
       .rename_axis(('Year','Quarter'))
       .reset_index(name='Value'))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Thank you @Jon Clements for another solution:
df = (d.melt()
       .variable
       .str.extract(r'(?P<Year>\d{4})Q(?P<Quarter>\d)')
       .assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Alternative with split:
df = (d.melt()
       .variable
       .str.split('Q', expand=True)
       .rename(columns={0:'Year', 1:'Quarter'})
       .assign(Value=d.T.values.flatten()))
print (df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Using DataFrame.stack with DataFrame.pop and Series.str.split:
df = d.stack().reset_index(level=1).rename(columns={0:'Value'})
df[['Year', 'Quarter']] = df.pop('level_1').str.split('Q', expand=True)
Value Year Quarter
0 1 2019 1
0 2 2019 2
0 3 2019 3
If you care about the order of columns, use reindex:
df = df.reindex(['Year', 'Quarter', 'Value'], axis=1)
Year Quarter Value
0 2019 1 1
0 2019 2 2
0 2019 3 3
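One small follow-up that applies to any of the variants above: Year and Quarter come out as strings (str.split and str.extract return object dtype), so cast them if you need numbers:

df = df.astype({'Year': 'int64', 'Quarter': 'int64'})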
I have a dataframe like this:
ID day purchase
ID1 1 10
ID1 2 15
ID1 4 13
ID2 2 11
ID2 4 11
ID2 5 24
ID2 6 10
Desired output:
ID day purchase Txn
ID1 1 10 1
ID1 2 15 2
ID1 4 13 3
ID2 2 11 1
ID2 4 11 2
ID2 5 24 3
ID2 6 10 4
So for each ID, I want to create a counter to keep track of its transactions. In SAS, I would do something like: if First.ID then Txn=1; else Txn+1;
How do I do something like this in Python?
I got the idea of sorting by ID and day, but how do I create the customized counter?
Here is one solution. As you suggest, it involves sorting by ID and day (in case your original dataframe isn't already sorted), then grouping by ID and creating a counter within each ID:
# Make sure your dataframe is sorted properly (first by ID, then by day)
df = df.sort_values(['ID', 'day'])
# group by ID
by_id = df.groupby('ID')
# Make a custom counter using the default index of dataframes (adding 1)
df['txn'] = by_id.apply(lambda x: x.reset_index()).index.get_level_values(1)+1
>>> df
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4
If your dataframe started out as not properly sorted, you can get back to the original order like this:
df = df.sort_index()
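For reference, GroupBy.cumcount produces the same counter more directly; a minimal sketch, assuming the frame is sorted by ID and day as above:

df = df.sort_values(['ID', 'day'])
df['txn'] = df.groupby('ID').cumcount() + 1   # cumcount is 0-based, so shift to start at 1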
The simplest method I could come up with, definitely not the most efficient though.
df['txn'] = [0] * len(df)
prev_ID = None
# assumes df is already sorted by ID and day
for index, row in df.iterrows():
    if row['ID'] == prev_ID:
        df.loc[index, 'txn'] = counter
        counter += 1
    else:
        prev_ID = row['ID']
        df.loc[index, 'txn'] = 1
        counter = 2
outputs
ID day purchase txn
0 ID1 1 10 1
1 ID1 2 15 2
2 ID1 4 13 3
3 ID2 2 11 1
4 ID2 4 11 2
5 ID2 5 24 3
6 ID2 6 10 4
I would like to remove all sessions after user conversion (and also remove the sessions that happened on the day of conversion).
full_sessions = pd.DataFrame(data={'user_id':[1,1,2,3,3], 'visit_no':[1,2,1,1,2], 'date':['20180307','20180308','20180307','20180308','20180308'], 'result':[0,1,1,0,0]})
print(full_sessions)
date result user_id visit_no
0 20180307 0 1 1
1 20180308 1 1 2
2 20180307 1 2 1
3 20180308 0 3 1
4 20180308 0 3 2
When did people convert?
conversion = full_sessions[full_sessions['result'] == 1][['user_id','date']]
print(conversion)
user_id date
0 1 20180308
2 2 20180307
Ideal output:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
What I want, expressed in SQL, would be:
SELECT * FROM (
SELECT * FROM full_sessions
LEFT JOIN conversion
ON
full_sessions.user_id = conversion.user_id AND full_sessions.date < conversion.date
UNION ALL
SELECT * FROM full_sessions
WHERE user_id NOT IN (SELECT user_id FROM conversion)
)
IIUC, using merge in pandas:
(full_sessions.merge(conversion, on='user_id', how='left')
              .loc[lambda x: (x.date_y > x.date_x) | (x.date_y.isnull())]
              .dropna(axis=1))
Out[397]:
date_x result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
You can join the dataframes and then filter the rows matching your criteria this way:
df_join = full_sessions.join(conversion, lsuffix='', rsuffix='_right',
                             how='left', on='user_id')
print(df_join)
date result user_id visit_no user_id_right date_right
0 20180307 0 1 1 1.0 20180308
1 20180308 1 1 2 1.0 20180308
2 20180307 1 2 1 2.0 20180307
3 20180308 0 3 1 NaN NaN
4 20180308 0 3 2 NaN NaN
And then just keep the rows where date_right is NaN or where date is smaller than date_right:
>>> df_join[df_join.apply(lambda x: x.date < x.date_right
                          if not pd.isna(x.date_right)
                          else True, axis=1)][['date','visit_no','user_id']]
date visit_no user_id
0 20180307 1 1
3 20180308 1 3
4 20180308 2 3
Here is a method which maps a Series instead of using the join/merge alternatives.
fs = full_sessions  # shorthand for the frame from the question
fs['date'] = pd.to_numeric(fs['date'])
s = fs[fs['result'] == 1].set_index('user_id')['date']
result = fs.loc[fs['date'] < fs['user_id'].map(s).fillna(fs['date'].max() + 1)]
Result
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2
Explanation
Create a mapping from user_id to conversion date, store it in a series s.
Then just filter on dates prior to conversion dates mapped via user_id.
If no conversion date, then data will be included since we fillna with a maximal date.
Consider using datetime objects. I have converted to numeric above for simplicity.
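Following up on that last note, the same approach with real datetimes would look something like this (a sketch, assuming the YYYYMMDD strings from the question):

fs = full_sessions.copy()
fs['date'] = pd.to_datetime(fs['date'], format='%Y%m%d')

# conversion date per user, as a Series indexed by user_id
s = fs.loc[fs['result'] == 1].set_index('user_id')['date']

# users without a conversion get a far-future cutoff so all their rows survive
cutoff = fs['user_id'].map(s).fillna(pd.Timestamp.max)
result = fs.loc[fs['date'] < cutoff]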
Using groupby and apply plus some final cleanup with reset_index, you can express it in one (very long) statement:
full_sessions.groupby('user_id', as_index=False).apply(
    lambda x: x[:(x.result==1).values.argmax()] if any(x.result==1) else x
).reset_index(level=0, drop=True)
outputs:
date result user_id visit_no
0 20180307 0 1 1
3 20180308 0 3 1
4 20180308 0 3 2