Python, pandas: groupby the groupby dataframe

I have a dataframe on which I have already done a groupby().agg() using
df = df.groupby(['Date','Time', 'ID_Code']).agg({'ID_Code':['count','nunique']}).reset_index()
Now it looks like this:
Date Time ID_Code count nunique
0 2021-01-04 10:50:00 CA_031 2 1
1 2021-01-04 12:40:00 CA_021 8 1
2 2021-01-04 13:20:00 CA_044 4 1
3 2021-01-04 13:30:00 CA_045 4 1
4 2021-01-04 13:36:00 CA_040 13 1
.. ... ... ... ... ...
433 2021-12-28 12:12:00 CA_805 3 1
434 2021-12-28 12:40:00 CA_802 3 1
435 2021-12-28 15:35:00 CA_003 22 1
436 2021-12-28 8:29:00 CA_806 3 1
What I now need is, for each ID_Code, the sum of count and how many times that ID_Code occurs.
# the below line removed the multi index header.
df.columns = ['Date', 'Time', 'ID_Code', 'count', 'nunique']
# the below line does not appear to work. Why and how would I fix it?
df = df.groupby('ID_Code').agg(total_count = ('count','sum'), frequency = ('nunique','sum'),)
What I want is:
ID_Code total_count frequency
0 CA_031 242 12
1 CA_021 89 9
2 CA_044 148 11
3 CA_045 76 7
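A minimal sketch of one way to get that table, assuming the flattened column names set above (if the columns were still a MultiIndex from the first agg, the second groupby would not find a plain 'count' column):
result = (df.groupby('ID_Code', as_index=False)
            .agg(total_count=('count', 'sum'),    # sum of the per-timestamp counts
                 frequency=('count', 'size')))    # number of (Date, Time) rows per ID_Code
Summing the nunique column instead (it is always 1 per row after the first groupby) would give the same frequency as counting rows.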

Related

Calculate how many touch points the customer had in X months

I have a problem. I want to calculate, from a given date such as 2022-06-01, how many touches the customer with customerId == 1 had in the previous 6 months. He had two touches, on 2022-05-25 and 2022-05-20. I have already calculated the cutoff date (count_from_date) up to which data should be taken into account. However, I don't know how to group by customer and count, for each row, how many of that customer's touches fall between count_from_date and the row's own date.
Dataframe
customerId fromDate
0 1 2022-06-01
1 1 2022-05-25
2 1 2022-05-25
3 1 2022-05-20
4 1 2021-09-05
5 2 2022-06-02
6 3 2021-03-01
7 3 2021-02-01
import pandas as pd

d = {'customerId': [1, 1, 1, 1, 1, 2, 3, 3],
     'fromDate': ["2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20", "2021-09-05",
                  "2022-06-02", "2021-03-01", "2021-02-01"]
     }
df = pd.DataFrame(data=d)
print(df)

from datetime import date
from dateutil.relativedelta import relativedelta

def find_last_date(date):
    six_months = date + relativedelta(months=-6)
    return six_months

df['fromDate'] = pd.to_datetime(df['fromDate'], errors='coerce')
df['count_from_date'] = df['fromDate'].apply(lambda x: find_last_date(x))
print(df)
What I have
customerId fromDate count_from_date
0 1 2022-06-01 2021-12-01
1 1 2022-05-25 2021-11-25
2 1 2022-05-25 2021-11-25
3 1 2022-05-20 2021-11-20
4 1 2021-09-05 2021-03-05
5 2 2022-06-02 2021-12-02
6 3 2021-03-01 2020-09-01
7 3 2021-02-01 2020-08-01
What I want
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3 # 2022-05-25, 2022-05-25, 2022-05-20 = 3
1 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
2 1 2022-05-25 2021-11-25 1 # 2022-05-20 = 1
3 1 2022-05-20 2021-11-20 0 # none in the last 6 months
4 1 2021-09-05 2021-03-05 0 # none in the last 6 months
5 2 2022-06-02 2021-12-02 0 # none in the last 6 months
6 3 2021-03-01 2020-09-01 1 # 2021-02-01 = 1
7 3 2021-02-01 2020-08-01 0 # none in the last 6 months
You can try grouping by customerId and looping through the rows in each subgroup to count the number of fromDate values that fall between count_from_date and fromDate:
def count(g):
    m = pd.concat([g['fromDate'].between(d1, d2, 'neither')
                   for d1, d2 in zip(g['count_from_date'], g['fromDate'])], axis=1)
    g = g.assign(occur_last_6_months=m.sum().tolist())
    return g
out = df.groupby('customerId').apply(count)
print(out)
customerId fromDate count_from_date occur_last_6_months
0 1 2022-06-01 2021-12-01 3
1 1 2022-05-25 2021-11-25 1
2 1 2022-05-25 2021-11-25 1
3 1 2022-05-20 2021-11-20 0
4 1 2021-09-05 2021-03-05 0
5 2 2022-06-02 2021-12-02 0
6 3 2021-03-01 2020-09-01 1
7 3 2021-02-01 2020-08-01 0
For this problem, the challenge for a performant solution is to reshape the data into a structure suitable for rolling window operations.
First of all, we need to avoid having duplicate indices. In your case, this means aggregating multiple touch points in a single day:
>>> df = df.groupby(['customerId', 'fromDate'], as_index=False).count()
customerId fromDate count_from_date
0 1 2021-09-05 1
1 1 2022-05-20 1
2 1 2022-05-25 2
3 1 2022-06-01 1
4 2 2022-06-02 1
5 3 2021-02-01 1
6 3 2021-03-01 1
Now, we can set the index to fromDate, sort it, and group by customerId so we can use rolling windows. Here I use a 180D rolling window (roughly 6 months):
>>> roll_df = (df.set_index(['fromDate'])
                 .sort_index()
                 .groupby('customerId')
                 .apply(lambda s: s['count_from_date'].rolling('180D').sum()))
The sort_index step is important to ensure your data is monotonically increasing.
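(A quick optional check, since time-based rolling requires a monotonically increasing index within each group, which the sort guarantees:)
# after set_index + sort_index, the overall datetime index is monotonic,
# so every customer's sub-index is monotonic as well
assert df.set_index('fromDate').sort_index().index.is_monotonic_increasing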
However, the rolling sum also counts the touch on the day itself, which does not seem to be what you want, so we subtract 1 from the result:
>>> roll_df - 1
customerId fromDate
1 2021-09-05 0.0
2022-05-20 0.0
2022-05-25 2.0
2022-06-01 3.0
2 2022-06-02 0.0
3 2021-02-01 0.0
2021-03-01 1.0
Name: count_from_date, dtype: float64
Finally, we divide by the initial counts to get back to the original structure:
>>> (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
customerId fromDate count_from_date
0 1 2021-09-05 0.0
1 1 2022-05-20 0.0
2 1 2022-05-25 1.0
3 1 2022-06-01 3.0
4 2 2022-06-02 0.0
5 3 2021-02-01 0.0
6 3 2021-03-01 1.0
You can always .reset_index() at the end.
The one-liner solution is:
(df.set_index(['fromDate'])
.sort_index()
.groupby('customerId')
.apply(lambda s: s['count_from_date'].rolling('180D').sum())
- 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
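For completeness, a hedged sketch (not part of the original answer) of tidying the result back into a flat frame with the column name used in the question:
result = (roll_df - 1) / df.set_index(['customerId', 'fromDate'])['count_from_date']
out = (result.rename('occur_last_6_months')   # name the values column
             .reset_index())                  # back to ordinary columns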

Fill in dataframe values based on group criteria without for loop?

I need to add some values to a dataframe based on the ID and DATE_TWO columns. Whenever DATE_TWO >= DATE_ONE, fill in any subsequent (empty) DATE_TWO values for that ID with that first DATE_TWO value. Here is the original dataframe:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021
1   43     3/7/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021
3   15     3/14/2021
Here is what the table looks like after the transformation:
ID  EVENT  DATE_ONE   DATE_TWO
1   13     3/1/2021
1   20     3/5/2021   3/5/2021
1   32     3/6/2021   3/5/2021
1   43     3/7/2021   3/5/2021
2   1      3/3/2021
2   2      4/5/2021
3   1      3/1/2021
3   12     3/7/2021   3/7/2021
3   13     3/9/2021   3/7/2021
3   15     3/14/2021  3/7/2021
This could be done with a for loop, but I know that in Python, particularly with dataframes, for loops can be slow. Is there some other, more pythonic and computationally faster way to accomplish what I am seeking?
data = {'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        'EVENT': [12, 20, 32, 43, 1, 2, 1, 12, 13, 15],
        'DATE_ONE': ['3/1/2021', '3/5/2021', '3/6/2021', '3/7/2021', '3/3/2021', '4/5/2021',
                     '3/1/2021', '3 /7/2021', '3/9/2021', '3/14/2021'],
        'DATE_TWO': ['', '3/5/2021', '', '', '', '', '', '3/7/2021', '', '']}
I slightly changed your data so we can see how it works.
Data
import pandas as pd
import numpy as np

data = {'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        'EVENT': [12, 20, 32, 43, 1, 2, 1, 12, 13, 15],
        'DATE_ONE': ['3/1/2021', '3/5/2021', '3/6/2021', '3/7/2021', '3/3/2021', '4/5/2021',
                     '3/1/2021', '3 /7/2021', '3/9/2021', '3/14/2021'],
        'DATE_TWO': ['', '3/5/2021', '', '', '', '', '3/7/2021', '', '3/7/2021', '']}
df = pd.DataFrame(data)
df["DATE_ONE"] = pd.to_datetime(df["DATE_ONE"])
df["DATE_TWO"] = pd.to_datetime(df["DATE_TWO"])
# We better sort DATE_ONE
df = df.sort_values(["ID", "DATE_ONE"]).reset_index(drop=True)
FILL with condition
df["COND"] = np.where(df["DATE_ONE"].le(df["DATE_TWO"]).eq(True),
1,
np.where(df["DATE_TWO"].notnull() &
df["DATE_ONE"].gt(df["DATE_TWO"]),
0,
np.nan))
grp = df.groupby("ID")
df["COND"] = grp["COND"].fillna(method='ffill').fillna(0)
df["FILL"] = grp["DATE_TWO"].fillna(method='ffill')
df["DATE_TWO"] = np.where(df["COND"].eq(1), df["FILL"], df["DATE_TWO"])
df = df.drop(columns=["COND", "FILL"])
ID EVENT DATE_ONE DATE_TWO
0 1 12 2021-03-01 NaT
1 1 20 2021-03-05 2021-03-05
2 1 32 2021-03-06 2021-03-05
3 1 43 2021-03-07 2021-03-05
4 2 1 2021-03-03 NaT
5 2 2 2021-04-05 NaT
6 3 1 2021-03-01 2021-03-07
7 3 12 2021-03-07 2021-03-07
8 3 13 2021-03-09 2021-03-07
9 3 15 2021-03-14 NaT

Subtract previous row from preceding row by group WITH condition

I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract from each row's date the date of the previous row that has the same Count. Rows without an earlier matching Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought process is to filter out Count-ID pairs and then do the calculation. I was wondering if there is a better way to work around this?
You can use groupby() on columns ID and Count, get the difference in days with .diff(), and fill the NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by=['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
# previous Date within the same (ID, Count) pair
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
# fall back to the row's own Date so unmatched rows get 0 days
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
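A small follow-up sketch (not from the original answer): the helper column can be dropped and the diff column renamed to match the expected output:
df2 = df2.drop(columns='shift1').rename(columns={'diff': 'Days'})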

Calculate time blocked within a timerange with pandas

I have a list of products produced or processes finished like this one:
Name       Timestamp Start      Timestamp Stop
Product 1  2021-01-01 15:15:00  2021-01-01 15:37:00
Product 1  2021-01-01 15:30:00  2021-01-01 15:55:00
Product 1  2021-01-02 15:05:00  2021-01-02 15:22:00
Product 1  2021-01-03 15:45:00  2021-01-03 15:55:00
...        ...                  ...
What I want to do is to calculate the amount of time where no product/process happened in a given timeframe, for example from 15:00 to 16:00 and, to be more specific, each day.
The output could be "amount of idle minutes/time where nothing happened" or "percentage of idle time".
import pandas as pd
import datetime

df = pd.read_csv('example_data.csv')

# generate list of products
listOfProducts = df['NAME'].drop_duplicates().tolist()

# define timeframe for each day
startTime = datetime.time(15, 0)
stopTime = datetime.time(16, 0)

# define daterange to look for
startDay = datetime.datetime(2021, 1, 1)
stopDay = datetime.datetime(2021, 1, 5)

# do it for every product
for i in listOfProducts:
    # filter dataframe by product
    df_product = df[df['NAME'] == i]
    # sort dataframe by start
    df_product = df_product.sort_values(by='started')
    # ... how to proceed?
The wanted output should look like this or similar:
Day         Time idle
2021-01-01  00:20:00
2021-02-01  00:43:00
2021-03-01  00:50:00
...         ...
Here are some notes that are important:
Time ranges of products can overlap each other; in this case they should only "count once".
Time ranges of products can extend past the borders (15:00 or 16:00 in this case); in this case only the time within the borders should be counted.
I struggle to implement this in a pandas way, because these border cases prevent me from simply adding up Timedeltas.
In the past, I solved this issue by iterating row by row and adding up the minutes or seconds. But I'm sure there is a more pandas-like way, maybe with the .groupby() function?
Input data:
>>> df
Name Start Stop
0 Product 1 2021-01-01 14:49:00 2021-01-01 15:04:00 # OK (overlap 4')
1 Product 1 2021-01-01 15:15:00 2021-01-01 15:37:00 # OK
2 Product 1 2021-01-01 15:30:00 2021-01-01 15:55:00 # OK
3 Product 1 2021-01-02 15:05:00 2021-01-02 15:22:00 # OK
4 Product 1 2021-01-03 15:45:00 2021-01-03 15:55:00 # OK
5 Product 1 2021-01-03 15:51:00 2021-01-03 16:23:00 # OK (overlap 9')
6 Product 1 2021-01-04 14:28:00 2021-01-04 17:12:00 # OK (overlap 60')
7 Product 1 2021-01-05 11:46:00 2021-01-05 13:40:00 # Out of bounds
8 Product 1 2021-01-05 17:20:00 2021-01-05 19:11:00 # Out of bounds
First, remove data out of bounds (7 & 8):
import datetime
START = datetime.time(15)
STOP = datetime.time(16)
df1 = df.loc[(df["Start"].dt.floor(freq="H").dt.time <= START)
             & (START <= df["Stop"].dt.floor(freq="H").dt.time),
             ["Start", "Stop"]]
Extract the minute from the Start and Stop datetimes. If the process began before 15:00, set the minute to 0, because we only want to keep the overlapping part. If the process ended after 16:00, set the minute to 59.
import numpy as np
df1["m1"] = np.where(df1["Start"].dt.time > START,
df1["Start"].sub(df1["Start"].dt.floor(freq="H"))
.dt.seconds // 60, 0)
df1["m2"] = np.where(df1["Stop"].dt.time < STOP,
df1["Stop"].sub(df1["Stop"].dt.floor(freq="H"))
.dt.seconds // 60, 59)
>>> df1
Start Stop m1 m2
0 2021-01-01 14:49:00 2021-01-01 15:04:00 0 4
1 2021-01-01 15:15:00 2021-01-01 15:37:00 15 37
2 2021-01-01 15:30:00 2021-01-01 15:55:00 30 55
3 2021-01-02 15:05:00 2021-01-02 15:22:00 5 22
4 2021-01-03 15:45:00 2021-01-03 15:55:00 45 55
5 2021-01-03 15:51:00 2021-01-03 16:23:00 51 59
6 2021-01-04 14:28:00 2021-01-04 17:12:00 0 59
Create an empty table len(df1)x60' to store process usage:
out = pd.DataFrame(0, index=df1.index, columns=pd.RangeIndex(60))
Fill the out dataframe:
for idx, (i1, i2) in df1[["m1", "m2"]].iterrows():
    out.loc[idx, i1:i2] = 1
>>> out
0 1 2 3 4 5 6 ... 53 54 55 56 57 58 59
0 1 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 # 4'
1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
3 0 0 0 0 0 1 1 ... 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
5 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 # full hour
[7 rows x 60 columns]
Finally, compute the idle minutes, counting a minute as busy if at least one process covers it:
>>> 60 - out.groupby(df1["Start"].dt.date).sum().gt(0).sum(axis="columns")
Start
2021-01-01 14
2021-01-02 42
2021-01-03 45
2021-01-04 0
dtype: int64
Note: you have to decide whether the Stop timestamp should be treated as inclusive or not.
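The question also mentions a percentage of idle time as a possible output; a small hedged follow-up, assuming the per-day idle minutes from the last step are stored in a variable:
# idle minutes per day within the 60-minute window (same expression as above)
idle = 60 - out.groupby(df1["Start"].dt.date).sum().gt(0).sum(axis="columns")
# share of the 15:00-16:00 window that was idle, in percent
idle_pct = idle / 60 * 100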

Groupby and get value offset by one year in pandas

My goal today is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1 and 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year or belongs to Category 0, I'll get a NaN. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift(), that would be easy, but I'm not sure how to get the value offset by a year within each group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create one with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period-pd.offsets.DateOffset(years=1) # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
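One small note: df.merge(...) on the last line returns a new frame rather than modifying df in place, so to keep the lagged column you would assign the result back (a minor addition, not part of the original answer):
df = df.merge(df2, on=['Period', 'ID', 'Category'], how='left')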
