Merge records that follow one another within group - python

I have the following dataframe:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-10 1
1 2 2 2022-02-02 2022-02-05 2
2 1 2 2022-01-11 2022-01-15 3
3 2 2 2022-02-06 2022-02-10 4
4 2 2 2022-02-11 2022-02-15 5
5 2 3 2022-01-14 2022-01-17 6
6 2 3 2022-01-19 2022-01-22 7
Several records follow one after the other. For example, the rows with id 1 and 3: the row with id 3 has the same A and B values and starts the day after the row with id 1 ends. I want to compress this dataframe into the following form:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4
That is, whenever the difference between the start_date of the next record and the end_date of the previous one is 1 day, I keep a single record whose end_date is the end_date of the last record in such a sequence.
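For reference, a minimal, hypothetical setup of the sample frame above (the dates are parsed to datetime in the answers below):
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 1, 2, 2, 2, 2],
    'B': [2, 2, 2, 2, 2, 3, 3],
    'start_date': ['2022-01-01', '2022-02-02', '2022-01-11', '2022-02-06',
                   '2022-02-11', '2022-01-14', '2022-01-19'],
    'end_date': ['2022-01-10', '2022-02-05', '2022-01-15', '2022-02-10',
                 '2022-02-15', '2022-01-17', '2022-01-22'],
    'id': [1, 2, 3, 4, 5, 6, 7],
})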

You can use a custom grouper to join the successive dates per group:
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(pd.to_datetime)

# a new run starts whenever start_date is not exactly one day after the
# previous end_date within the same (A, B) group
m = (df['start_date']
     .sub(df.groupby(['A', 'B'])['end_date'].shift().add(pd.Timedelta('1d')))
     .ne(pd.Timedelta(0))
     .groupby([df['A'], df['B']]).cumsum()
     )

out = (df
       .groupby(['A', 'B', m], as_index=False)
       .agg({'start_date': 'first', 'end_date': 'last'})
       .assign(id=lambda d: range(1, len(d)+1))
       )
Output:
A B start_date end_date id
0 1 2 2022-01-01 2022-01-15 1
1 2 2 2022-02-02 2022-02-15 2
2 2 3 2022-01-14 2022-01-17 3
3 2 3 2022-01-19 2022-01-22 4

def function1(dd: pd.DataFrame):
    # start a new run whenever the gap to the previous end_date is not exactly 1 day
    col1 = dd.start_date - dd.end_date.shift()
    dd1 = dd.assign(col1=col1.ne(pd.Timedelta("1 days")).cumsum())
    return dd1.groupby("col1").agg(start_date=("start_date", "min"),
                                   end_date=("end_date", "max"))

df1.groupby(["A", "B"]).apply(function1).reset_index().assign(id=lambda dd: dd.index + 1)
out
A B col1 start_date end_date id
0 1 2 1 2022-01-01 2022-01-15 1
1 2 2 1 2022-02-02 2022-02-15 2
2 2 3 1 2022-01-14 2022-01-17 3
3 2 3 2 2022-01-19 2022-01-22 4
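If you want the result to match the target output exactly (without the helper column), a possible variant of the same approach simply drops col1 after the reset_index:
out = (df1.groupby(["A", "B"]).apply(function1)
          .reset_index()
          .drop(columns="col1")
          .assign(id=lambda dd: dd.index + 1))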

Calculate how many touch points the customer had in X months

I have a problem. I want to calculate, from a given date such as 2022-06-01, how many touches the customer with customerId == 1 had in the last 6 months. That customer had touches on 2022-05-25 (twice) and 2022-05-20. I have already calculated the cut-off date (count_from_date) up to which the data should be taken into account. However, I don't know how to group by customer and count, for each row, how many touches fall between count_from_date and fromDate.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
     }
df = pd.DataFrame(data=d)
print(df)

from datetime import date
from dateutil.relativedelta import relativedelta

def find_last_date(date):
    six_months = date + relativedelta(months=-6)
    return six_months

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # none in the last 6 months
4 1 2021-09-05 2021-03-05 0 # none in the last 6 months
5 2 2022-06-02 2021-12-02 0 # none in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # none in the last 6 months
You can try grouping by customerId and, within each subgroup, counting how many fromDate values fall between each row's count_from_date and fromDate:
def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, 'neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g
out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
For this problem, the challenge for a performant solution is to reshape the data into a structure suitable for rolling window operations.
First of all, we need to avoid having duplicate indices. In your case, this means aggregating multiple touch points in a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now, we can set the index to fromDate, sort it and group by customerId so that we can use rolling windows. Here I use a 180D rolling window (roughly 6 months):
>>> roll_df = (df.set_index(['fromDate'])
...              .sort_index()
...              .groupby('customerId')
...              .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure your data is monotonically increasing.
However, this also counts the touch on the day itself, which seems not what you want, so we remove 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide by the initial counts to get back to the original structure:
>>> (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId fromDate count_from_date
0 1 2021-09-05 0.0
1 1 2022-05-20 0.0
2 1 2022-05-25 1.0
3 1 2022-06-01 3.0
4 2 2022-06-02 0.0
5 3 2021-02-01 0.0
6 3 2021-03-01 1.0
You can always .reset_index() at the end.
The one-liner solution is:
((df.set_index(['fromDate'])
    .sort_index()
    .groupby('customerId')
    .apply(lambda s: s['count_from_date'].rolling('180D').sum())
  - 1)
 / df.set_index(['customerId', 'fromDate'])['count_from_date'])
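To turn that result back into a flat frame with a named column, one possible finishing step (a sketch, assuming roll_df and the aggregated df from above) is:
counts = (roll_df - 1).rename('occur_last_6_months').reset_index()
result = df.merge(counts, on=['customerId', 'fromDate'], how='left')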

Count days by ID - Pandas

Given the following table, how can I count the days by ID, without using a for loop (or any loop), since the data is large?
ID Date
a 01/01/2020
a 05/01/2020
a 08/01/2020
a 10/01/2020
b 05/05/2020
b 08/05/2020
b 12/05/2020
c 08/08/2020
c 22/08/2020
to have this result
ID Date Days Evolved Since Initial Date
a 01/01/2020 1
a 05/01/2020 4
a 08/01/2020 7
a 10/01/2020 9
b 05/05/2020 1
b 08/05/2020 3
b 12/05/2020 7
c 08/08/2020 1
c 22/08/2020 14
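The answers below assume Date has already been parsed to datetime; a minimal, hypothetical setup from the sample above (note the day-first format) could be:
import pandas as pd

df = pd.DataFrame({
    'ID': list('aaaabbbcc'),
    'Date': ['01/01/2020', '05/01/2020', '08/01/2020', '10/01/2020',
             '05/05/2020', '08/05/2020', '12/05/2020',
             '08/08/2020', '22/08/2020'],
})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dd/mm/yyyy dates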
Use GroupBy.transform with 'first' to broadcast each group's first value so you can subtract it. Then, if there are no duplicated datetimes, you can replace 0 with 1:
df['new'] = df['Date'].sub(df.groupby("ID")['Date'].transform('first')).dt.days.replace(0, 1)
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
Or set 1 for the first value of each group using Series.where and Series.duplicated:
df['new'] = (df['Date'].sub(df.groupby("ID")['Date'].transform('first'))
                       .dt.days.where(df['ID'].duplicated(), 1))
print (df)
ID Date new
0 a 2020-01-01 1
1 a 2020-01-05 4
2 a 2020-01-08 7
3 a 2020-01-10 9
4 b 2020-05-05 1
5 b 2020-05-08 3
6 b 2020-05-12 7
7 c 2020-08-08 1
8 c 2020-08-22 14
You could do something like this (with df your dataframe):
def days_evolved(sdf):
    sdf["Days_evolved"] = sdf.Date - sdf.Date.iat[0]
    sdf["Days_evolved"].iat[0] = pd.Timedelta(days=1)
    return sdf

df = df.groupby("ID", as_index=False, sort=False).apply(days_evolved)
Result for the sample:
ID Date Days_evolved
0 a 2020-01-01 1 days
1 a 2020-01-05 4 days
2 a 2020-01-08 7 days
3 a 2020-01-10 9 days
4 b 2020-05-05 1 days
5 b 2020-05-08 3 days
6 b 2020-05-12 7 days
7 c 2020-08-08 1 days
8 c 2020-08-22 14 days
If you want int instead of pd.Timedelta then do
df["Days_evolved"] = df["Days_evolved"].dt.days
at the end.

Subtract previous row from preceding row by group WITH condition

I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract the Date of the previous row that has the same Count from each row's Date. Rows without a previous row with the same Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought was to filter out the Count-ID pairs and then do the calculation. I was wondering if there is a better way around this.
You can use groupby() on the ID and Count columns, get the difference in days with .diff(), and fill the NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by = ['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
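If you then want the expected Days column exactly (original row order, zeros where there is no matching previous row), a possible finishing step, assuming the df2 from above, might be:
# combine_first substitutes the row's own date for the first row of each
# (ID, Count) pair, so those rows already come out as 0
df2['Days'] = df2['diff']
out = df2.drop(columns=['shift1', 'diff']).sort_index()  # restore original row order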

pandas get a sum column for next 7 days

I want to get the sum of values for the next 7 days of a column.
my dataframe:
date value
0 2021-04-29 1
1 2021-05-03 2
2 2021-05-06 1
3 2021-05-15 1
4 2021-05-17 2
5 2021-05-18 1
6 2021-05-21 2
7 2021-05-22 5
8 2021-05-24 4
I tried to make a new column that contains the date 7 days from the current date:
df['temp'] = df['date'] + timedelta(days=7)
then calculate the sum of value within that date range:
df['next_7days'] = df[(df.date > df.date) & (df.date <= df.temp)].value.sum()
But this gives 0 for every row.
intended result:
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
The method I am currently using is quite tedious; are there any better methods to get the intended result?
With a list comprehension:
tomorrow_dates = df.date + pd.Timedelta("1 day")
next_week_dates = df.date + pd.Timedelta("7 days")
df["next_7days"] = [df.value[df.date.between(tomorrow, next_week)].sum()
for tomorrow, next_week in zip(tomorrow_dates, next_week_dates)]
where we first define tomorrow's and next week's dates and store them. Then we zip them together and use pd.Series.between to get a boolean mask marking the dates within the desired range, use boolean indexing to select the corresponding values, and sum them. We do this for each date pair.
to get
date value next_7days
0 2021-04-29 1 3
1 2021-05-03 2 1
2 2021-05-06 1 0
3 2021-05-15 1 10
4 2021-05-17 2 12
5 2021-05-18 1 11
6 2021-05-21 2 9
7 2021-05-22 5 4
8 2021-05-24 4 0
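If the frame is large, a vectorized alternative (not from the answer above, just a sketch that assumes date is already datetime and sorted ascending) is to combine numpy.searchsorted with a cumulative sum:
import numpy as np

dates = df['date'].to_numpy()
vals = df['value'].to_numpy()
csum = np.concatenate(([0], vals.cumsum()))

# for each row, sum the values whose date falls in (date, date + 7 days]
left = np.searchsorted(dates, dates, side='right')
right = np.searchsorted(dates, dates + np.timedelta64(7, 'D'), side='right')
df['next_7days'] = csum[right] - csum[left]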

to_datetime assemblage error due to extra keys

My pandas version is 0.23.4.
I tried to run this code:
df['date_time'] = pd.to_datetime(df[['year','month','day','hour_scheduled_departure','minute_scheduled_departure']])
and the following error appeared:
extra keys have been passed to the datetime assemblage: [hour_scheduled_departure, minute_scheduled_departure]
Any ideas of how to get the job done by pd.to_datetime?
@anky_91:
The image (not included here) shows an extract of the first 10 rows. First column [int32]: year; second column [int32]: month; third column [int32]: day; fourth column [object]: hour; fifth column [object]: minute. The object values have length 2.
Another solution:
>>> pd.concat([df.A,
...            pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date')
...                             .map(lambda x: '0'.join(map(str, x))))],
...           axis=1)
A Date
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
For the example you added as an image (I have skipped the last 3 columns to save time):
>>> df.month = df.month.map("{:02}".format)
>>> df.day = df.day.map("{:02}".format)
>>> pd.concat([df.A,
...            pd.to_datetime(pd.Series(df[df.columns[1:]].fillna('').values.tolist(), name='Date')
...                             .map(lambda x: ''.join(map(str, x))))],
...           axis=1)
A Date
0 a 2015-01-01 00:05:00
1 b 2015-01-01 00:01:00
2 c 2015-01-01 00:02:00
3 d 2015-01-01 00:02:00
4 e 2015-01-01 00:25:00
5 f 2015-01-01 00:25:00
You can rename the columns so that pandas.to_datetime sees the expected column names year, month, day, hour and minute:
df = pd.DataFrame({
    'A': list('abcdef'),
    'year': [2002, 2002, 2002, 2002, 2002, 2002],
    'month': [7, 8, 9, 4, 2, 3],
    'day': [1, 3, 5, 7, 1, 5],
    'hour_scheduled_departure': [5, 3, 6, 9, 2, 4],
    'minute_scheduled_departure': [7, 8, 9, 4, 2, 3]
})
print (df)
A year month day hour_scheduled_departure minute_scheduled_departure
0 a 2002 7 1 5 7
1 b 2002 8 3 3 8
2 c 2002 9 5 6 9
3 d 2002 4 7 9 4
4 e 2002 2 1 2 2
5 f 2002 3 5 4 3
cols = ['year','month','day','hour_scheduled_departure','minute_scheduled_departure']
d = {'hour_scheduled_departure':'hour','minute_scheduled_departure':'minute'}
df['date_time'] = pd.to_datetime(df[cols].rename(columns=d))
# if necessary, remove the original columns
df = df.drop(cols, axis=1)
print (df)
A date_time
0 a 2002-07-01 05:07:00
1 b 2002-08-03 03:08:00
2 c 2002-09-05 06:09:00
3 d 2002-04-07 09:04:00
4 e 2002-02-01 02:02:00
5 f 2002-03-05 04:03:00
Detail:
print (df[cols].rename(columns=d))
year month day hour minute
0 2002 7 1 5 7
1 2002 8 3 3 8
2 2002 9 5 6 9
3 2002 4 7 9 4
4 2002 2 1 2 2
5 2002 3 5 4 3
