I have a problem. Starting from a date, for example 2022-06-01, I want to calculate how many touches the customer with customerId == 1 had in the previous 6 months. He had two touches, on 2022-05-25 and 2022-05-20. I have already calculated the cut-off date (count_from_date) up to which the data should be taken into account. However, I don't know how to group by customer and count, for each row, how many touches that customer had between count_from_date and that row's date.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd
from dateutil.relativedelta import relativedelta

d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]}
df = pd.DataFrame(data=d)
print(df)

def find_last_date(date):
    # the cut-off date: 6 calendar months before the given date
    return date + relativedelta(months=-6)

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(find_last_date)
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # none in the last 6 months
4 1 2021-09-05 2021-03-05 0 # none in the last 6 months
5 2 2022-06-02 2021-12-02 0 # none in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # none in the last 6 months
You can group by customerId and loop through the rows of each subgroup, counting how many fromDate values fall between that row's count_from_date and fromDate:
def count(g):
    # one boolean column per row: which dates in the group fall strictly
    # between that row's count_from_date and fromDate
    m = pd.concat([g['fromDate'].between(d1, d2, inclusive='neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g
out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
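If the per-group Python loop gets slow on larger frames, a self-merge can produce the same strict-bounds count in one vectorized pass. A sketch under the same data; the helper column names index and other are only for illustration:
tmp = df.reset_index().merge(
    df[['customerId', 'fromDate']].rename(columns={'fromDate': 'other'}),
    on='customerId')
# strictly between count_from_date and fromDate, matching the 'neither' bounds above
within = tmp['other'].gt(tmp['count_from_date']) & tmp['other'].lt(tmp['fromDate'])
df['occur_last_6_months'] = within.groupby(tmp['index']).sum()
print(df)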
For this problem, the challenge for a performant solution is to reshape the data into a structure suitable for rolling-window operations.
First of all, we need to avoid duplicate indices. In your case, this means aggregating multiple touch points on a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now we can set the index to fromDate, sort it, and group by customerId so that we can use rolling windows. Here I use a '180D' rolling window as an approximation of 6 months:
>>> roll_df = (df.set_index(['fromDate'])
                 .sort_index()
                 .groupby('customerId')
                 .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important: rolling with a time-based offset window requires a monotonically increasing index.
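Without the sort, pandas refuses the offset window outright; a minimal sketch of the failure:
import pandas as pd

s = pd.Series([1, 1], index=pd.to_datetime(['2022-05-25', '2022-05-20']))
try:
    s.rolling('180D').sum()
except ValueError as e:
    print(e)  # the exact message varies by version, e.g. "index values must be monotonic"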
However, the rolling sum also counts the touch on the day itself, which does not seem to be what you want, so we subtract 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide the adjusted counts by the initial per-day counts to get back to per-touch values:
>>> (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId  fromDate
1           2021-09-05    0.0
            2022-05-20    0.0
            2022-05-25    1.0
            2022-06-01    3.0
2           2022-06-02    0.0
3           2021-02-01    0.0
            2021-03-01    1.0
Name: count_from_date, dtype: float64
You can always .reset_index() at the end.
The one-liner solution is:
(df.set_index(['fromDate'])
.sort_index()
.groupby('customerId')
.apply(lambda s: s['count_from_date'].rolling('180D').sum())
- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
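As noted above, a final .reset_index() turns the result back into a flat frame; a quick sketch, with occur_last_6_months as the column name from the question:
result = (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
print(result.rename('occur_last_6_months').reset_index())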
Related
I have the following dataframe:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-10 1
1 2 2 2022-02-02 2022-02-05 2
2 1 2 2022-01-11 2022-01-15 3
3 2 2 2022-02-06 2022-02-10 4
4 2 2 2022-02-11 2022-02-15 5
5 2 3 2022-01-14 2022-01-17 6
6 2 3 2022-01-19 2022-01-22 7
There are several records that follow one after another. For example, rows 1 and 3: row 3 has the same A and B values and starts the day after row 1 ends. I want to compress this dataframe into the following form:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
That is, I keep a single record for each run in which every record's start_date is exactly 1 day after the previous record's end_date; its end_date becomes the end_date of the last record in the run.
You can use a custom grouper to join the successive dates per group:
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)
# start a new group whenever start_date is not exactly previous end_date + 1 day
m = (df['start_date'].sub(df.groupby(['A', 'B'])['end_date'].shift()
                            .add(pd.Timedelta('1d')))
       .ne(pd.Timedelta(0))
       .groupby([df['A'], df['B']]).cumsum()
     )
out = (df
.groupby(['A', 'B', m], as_index=False)
.agg({'start_date': 'first', 'end_date': 'last'})
.assign(id=lambda d: range(1, len(d)+1))
)
Output:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
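To see what the custom grouper computes, it helps to print the intermediate values; a sketch reusing the df and m defined above, where gap is just an illustrative name:
gap = df['start_date'].sub(df.groupby(['A', 'B'])['end_date'].shift()
                             .add(pd.Timedelta('1d')))
print(gap.ne(pd.Timedelta(0)).tolist())
# [True, True, False, False, False, True, True]  -> True starts a new run
print(m.tolist())
# [1, 1, 1, 1, 1, 1, 2]  -> per-(A, B) run labels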
def function1(dd: pd.DataFrame):
    # a new run starts whenever the gap to the previous end_date is not exactly 1 day
    col1 = dd.start_date - dd.end_date.shift()
    dd1 = dd.assign(col1=col1.ne("1 days").cumsum())
    return dd1.groupby("col1").agg(start_date=("start_date", "min"),
                                   end_date=("end_date", "max"))

df.groupby(["A", "B"]).apply(function1).reset_index().assign(id=lambda dd: dd.index + 1)
Output:
A B col1 start_date end_date id
0 1 2 1 2022-01-01 2022-01-15 1
1 2 2 1 2022-02-02 2022-02-15 2
2 2 3 1 2022-01-14 2022-01-17 3
3 2 3 2 2022-01-19 2022-01-22 4
I have a problem. I have a dataframe that contains a customerId and a date fromDate. Now I want to calculate, for each customer individually, when the next delivery is. For example, the customer with customerId = 1 bought something on 2021-03-18; I would now like to find the next later date, e.g. 2021-03-22, and output the distance in days, here 4. In simple terms, I want the next date in the future minus the current date, i.e. n - (n-1). If a date has no next date, the result should be None, e.g. for 2022-01-18.
The problem is that I get a lot of None values, and moreover each customer should be looked at separately. How can I do this?
Mathematically, with an example:
n - (n-1) = next_day_in_days
e.g.
2021-03-22 - 2021-03-18 = 4
[OUT]
customerId fromDate next_day_in_days
1 1 2021-03-18 4
Dataframe
customerId fromDate
0 1 2021-02-22
1 1 2021-03-18
2 1 2021-03-22
3 1 2021-02-10
4 1 2021-09-07
5 1 None
6 1 2022-01-18
7 2 2021-05-17
8 3 2021-05-17
9 3 2021-07-17
10 3 2021-02-22
11 3 2021-02-22
Code
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3],
     'fromDate': ['2021-02-22', '2021-03-18', '2021-03-22',
                  '2021-02-10', '2021-09-07', None, '2022-01-18',
                  '2021-05-17', '2021-05-17', '2021-07-17', '2021-02-22', '2021-02-22']}
df = pd.DataFrame(data=d)
print(df)

def nearest(items, pivot):
    # finds the date closest to pivot (including pivot itself); the NaT in the
    # column makes the arithmetic raise, so every row falls into the except
    try:
        return min(items, key=lambda x: abs(x - pivot))
    except:
        return None

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce').dt.date
df["next_day_in_days"] = df['fromDate'].apply(lambda x: nearest(df['fromDate'], x))
Output
[OUT]
customerId fromDate next_day_in_days
0 1 2021-02-22 None
1 1 2021-03-18 None
2 1 2021-03-22 None
3 1 2021-02-10 None
4 1 2021-09-07 None
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 None
9 3 2021-07-17 None
10 3 2021-02-22 None
11 3 2021-02-22 None
What I want
customerId fromDate next_day_in_days
0 1 2021-02-22 24
1 1 2021-03-18 4
2 1 2021-03-22 169
3 1 2021-02-10 12
4 1 2021-09-07 133
5 1 NaT None
6 1 2022-01-18 None
7 2 2021-05-17 None
8 3 2021-05-17 61
9 3 2021-07-17 None
10 3 2021-02-22 84
11 3 2021-02-22 84
First, sort by customerId and fromDate. Because duplicates are possible, drop them on the same columns, and then use DataFrameGroupBy.diff with Series.dt.days:
df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df = df.sort_values(['customerId','fromDate'])
df['next_day_in_days'] = (df.drop_duplicates(['customerId', 'fromDate'])
                            .groupby('customerId')['fromDate']
                            .diff(-1)
                            .dt.days
                            .abs())
Restore the original ordering of the index if necessary:
df = df.sort_index()
Last, fill the duplicated ['customerId', 'fromDate'] rows (here the last value, 84.0) with GroupBy.ffill:
df['next_day_in_days'] = df.groupby(['customerId', 'fromDate'])['next_day_in_days'].ffill()
print (df)
customerId fromDate next_day_in_days
0 1 2021-02-22 24.0
1 1 2021-03-18 4.0
2 1 2021-03-22 169.0
3 1 2021-02-10 12.0
4 1 2021-09-07 133.0
5 1 NaT NaN
6 1 2022-01-18 NaN
7 2 2021-05-17 NaN
8 3 2021-05-17 61.0
9 3 2021-07-17 NaN
10 3 2021-02-22 84.0
11 3 2021-02-22 84.0
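The same idea reads a little more literally with shift(-1): take each customer's next distinct date and subtract the current one. A sketch, where dedup and nxt are illustrative names:
dedup = df.sort_values(['customerId', 'fromDate']).drop_duplicates(['customerId', 'fromDate'])
nxt = dedup.groupby('customerId')['fromDate'].shift(-1)
# next distinct date minus current date; the last date per customer stays NaN
dedup['next_day_in_days'] = nxt.sub(dedup['fromDate']).dt.days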
I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract from each row's Date the Date of the previous row that has the same Count. Rows without such a previous row get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought process is to filter out Count-ID pairs and then do the calculation. I was wondering if there is a better way around this?
You can use groupby() to group by the ID and Count columns and get the difference in days with .diff(), then fill the NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by=['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
# combine_first falls back to the row's own date where there is no previous
# row in the group, which yields the required 0 there
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
I have the following dataframe:
df = pd.DataFrame({'No': [0, 0, 0, 1, 1, 2, 2],
                   'date': ['2020-01-15', '2019-12-16', '2021-03-01', '2018-05-19',
                            '2016-04-08', '2020-01-02', '2020-03-07']})
df.date = pd.to_datetime(df.date)
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
3 1 2018-05-19
4 1 2016-04-08
5 2 2020-01-02
6 2 2020-03-07
I want to drop the rows if all the date values are earlier than 2020-01-01 for each unique number in No column, i.e. I want to drop rows with the indices 3 and 4.
Is it possible to do it without a for loop?
Use groupby and transform:
>>> df[df.groupby('No')['date'].transform('max')>='2020-01-01']
No date
0 0 2020-01-15
1 0 2019-12-16
2 0 2021-03-01
5 2 2020-01-02
6 2 2020-03-07
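To see what the filter compares, transform('max') broadcasts each group's latest date back onto its rows; a quick sketch:
>>> df.groupby('No')['date'].transform('max')
0   2021-03-01
1   2021-03-01
2   2021-03-01
3   2018-05-19
4   2018-05-19
5   2020-03-07
6   2020-03-07
Name: date, dtype: datetime64[ns]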
I have the following data frame:
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
And would like to generate the interval column: the minutes between rows, but only within the same id and the same day, just like in the example. In SQL I would partition by id and day and use LAG to get the interval from the previous row. How can I do it in pandas?
You can convert the datetime column with to_datetime, then use groupby with diff and convert the timedelta to minutes with astype:
print(df)
id datetime interval
0 1 20160101 070000 NaN
1 1 20160101 080000 60
2 1 20160102 070000 NaN
3 1 20160102 073000 30
4 2 20160101 071500 NaN
5 2 20160101 071600 1
df['datetime'] = pd.to_datetime(df['datetime'])
# group by id and calendar day (dt.date rather than dt.day, so the same
# day-of-month in different months is not mixed together)
df['new'] = (df.groupby(['id', df['datetime'].dt.date])['datetime']
               .diff()
               .astype('timedelta64[m]'))
print(df)
id datetime interval new
0 1 2016-01-01 07:00:00 NaN NaN
1 1 2016-01-01 08:00:00 60 60
2 1 2016-01-02 07:00:00 NaN NaN
3 1 2016-01-02 07:30:00 30 30
4 2 2016-01-01 07:15:00 NaN NaN
5 2 2016-01-01 07:16:00 1 1
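Note that on recent pandas versions (2.x) the astype('timedelta64[m]') conversion is no longer supported; a version-stable sketch of the same computation uses total_seconds:
# convert the timedelta to minutes explicitly
df['new'] = (df.groupby(['id', df['datetime'].dt.date])['datetime']
               .diff()
               .dt.total_seconds()
               .div(60))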