Dynamic date difference calculation in Pandas - Python

Let's say I have data like this:
customer_id  Order_date
1            2015-01-16
1            2015-01-19
2            2014-12-21
2            2015-01-10
1            2015-01-10
3            2018-01-18
3            2017-03-04
4            2019-11-05
4            2010-01-01
3            2019-02-03
Basically, for an e-commerce firm, some people buy regularly, some buy once a year, some buy once a month, and so on. I need to find the difference in days between consecutive transactions for each customer.
This will be a dynamic list, since some people will have transacted a thousand times, some once, some ten times, etc. Any ideas on how to achieve this?
Output needed:
customer_id  Order_date_Difference_in_days
1            6,3      # difference between the first two dates (2015-01-10 and
                      # 2015-01-16) is 6 days; the next two consecutive dates
                      # (2015-01-16 and 2015-01-19) are 3 days apart
2            20
3            320,381
4            3595
Basically, these are the differences between consecutive dates after first sorting the dates for each customer id.
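For reference, a minimal snippet to reproduce the sample data, assuming the column names above:
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 1, 3, 3, 4, 4, 3],
    'Order_date': ['2015-01-16', '2015-01-19', '2014-12-21', '2015-01-10',
                   '2015-01-10', '2018-01-18', '2017-03-04', '2019-11-05',
                   '2010-01-01', '2019-02-03'],
})
df['Order_date'] = pd.to_datetime(df['Order_date'])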

You can also use the following to get the desired output:
m = (df.assign(Diff=df.sort_values(['customer_id', 'Order_date'])
                      .groupby('customer_id')['Order_date'].diff().dt.days)
       .dropna())
m = m.assign(Diff=m['Diff'].astype(str)).groupby('customer_id')['Diff'].agg(','.join)
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
Name: Diff, dtype: object
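If you'd rather not have the trailing .0 in the joined strings, cast the day counts to int before converting to str; a small variation on the above (not part of the original answer):
m = (df.assign(Diff=df.sort_values(['customer_id', 'Order_date'])
                      .groupby('customer_id')['Order_date'].diff().dt.days)
       .dropna())
out = m['Diff'].astype(int).astype(str).groupby(m['customer_id']).agg(','.join)
# customer_id
# 1        6,3
# 2         20
# 3    320,381
# 4       3595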

First we need to sort the data by customer id and order date. Make sure Order_date is a proper datetime first:
df['Order_date'] = pd.to_datetime(df['Order_date'])
df.sort_values(['customer_id', 'Order_date'], inplace=True)
Then take the difference between consecutive dates within each customer group (this uses numpy, so import numpy as np):
df["days"] = df.groupby("customer_id")["Order_date"].apply(
    lambda x: (x - x.shift()) / np.timedelta64(1, "D")
)
print(df)
customer_id Order_date days
4 1 2015-01-10 NaN
0 1 2015-01-16 6.0
1 1 2015-01-19 3.0
2 2 2014-12-21 NaN
3 2 2015-01-10 20.0
6 3 2017-03-04 NaN
5 3 2018-01-18 320.0
9 3 2019-02-03 381.0
8 4 2010-01-01 NaN
7 4 2019-11-05 3595.0
Then you can do a simple agg, but you'll need to convert the values into strings first:
df.dropna().groupby("customer_id")["days"].agg(
    lambda x: ",".join(x.astype(str))
).to_frame()
days
customer_id
1 6.0,3.0
2 20.0
3 320.0,381.0
4 3595.0
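As a side note, the same days column can be computed without the numpy import, since .diff() on a datetime column already yields timedeltas:
df['days'] = df.groupby('customer_id')['Order_date'].diff().dt.days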

Related

Python - pandas: create a separate row for each recurrence of a record

I have date-interval data with a "periodicity" column representing how frequently the date interval recurs:
Weekly: same weekdays every week
Biweekly: same weekdays every other week
Monthly: same DATES every month
Moreover, I have a "recurring_until" column specifying when the recurrence should stop.
What I need to accomplish is:
creating a separate row for each recurrence until "recurring_until" has been reached.
I have been trying with various for loops without much success. Here is the sample data:
import pandas as pd
data = {'id': ['1', '2', '3', '4'],
        'from': ['5/31/2020', '6/3/2020', '6/18/2020', '6/10/2020'],
        'to': ['6/5/2020', '6/3/2020', '6/19/2020', '6/10/2020'],
        'periodicity': ['weekly', 'weekly', 'biweekly', 'monthly'],
        'recurring_until': ['7/25/2020', '6/9/2020', '12/30/2020', '7/9/2020']}
df = pd.DataFrame(data)
First of all, preprocess:
df.set_index("id", inplace=True)
df["from"] = pd.to_datetime(df["from"])
df["to"] = pd.to_datetime(df["to"])
df["recurring_until"] = pd.to_datetime(df["recurring_until"])
Next, compute all the periodic from dates:
new_from = df.apply(lambda x: pd.date_range(x["from"], x.recurring_until), axis=1)  # generate all days between from and recurring_until
new_from[df.periodicity == "weekly"] = new_from[df.periodicity == "weekly"].apply(lambda x: x[::7])       # slice by week
new_from[df.periodicity == "biweekly"] = new_from[df.periodicity == "biweekly"].apply(lambda x: x[::14])  # slice by fortnight
new_from[df.periodicity == "monthly"] = new_from[df.periodicity == "monthly"].apply(lambda x: x[x.day == x.day[0]])  # select only days equal to the first day
new_from = new_from.explode()  # explode to obtain a series
new_from.name = "from"         # name the series
After this, new_from looks like this:
id
1 2020-05-31
1 2020-06-07
1 2020-06-14
1 2020-06-21
1 2020-06-28
1 2020-07-05
1 2020-07-12
1 2020-07-19
2 2020-06-03
3 2020-06-18
3 2020-07-02
3 2020-07-16
3 2020-07-30
3 2020-08-13
3 2020-08-27
3 2020-09-10
3 2020-09-24
3 2020-10-08
3 2020-10-22
3 2020-11-05
3 2020-11-19
3 2020-12-03
3 2020-12-17
4 2020-06-10
Name: from, dtype: datetime64[ns]
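An aside on the monthly rule above: x[x.day == x.day[0]] keeps only the dates whose day-of-month matches the start date, which silently skips months lacking that day (e.g. a start on the 31st); it is fine for this sample. A sketch of an alternative that builds the monthly dates directly instead of filtering every calendar day (note that DateOffset arithmetic can drift after short months, so it is not a drop-in replacement in every case):
monthly = df.periodicity == "monthly"
new_from[monthly] = df[monthly].apply(
    lambda x: pd.date_range(x["from"], x.recurring_until,
                            freq=pd.DateOffset(months=1)),
    axis=1,
)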
Now let's compute all the periodic to dates:
new_to = new_from+(df.to-df["from"]).loc[new_from.index]
new_to.name = "to"
and we have new_to like this:
id
1 2020-06-05
1 2020-06-12
1 2020-06-19
1 2020-06-26
1 2020-07-03
1 2020-07-10
1 2020-07-17
1 2020-07-24
2 2020-06-03
3 2020-06-19
3 2020-07-03
3 2020-07-17
3 2020-07-31
3 2020-08-14
3 2020-08-28
3 2020-09-11
3 2020-09-25
3 2020-10-09
3 2020-10-23
3 2020-11-06
3 2020-11-20
3 2020-12-04
3 2020-12-18
4 2020-06-10
Name: to, dtype: datetime64[ns]
We can finally concatenate these two series and join them to the initial dataframe:
periodic_df = pd.concat([new_from, new_to], axis=1).join(df[["periodicity", "recurring_until"]]).reset_index()
The resulting periodic_df has one row per occurrence; its first rows look like this:
  id       from         to periodicity recurring_until
0  1 2020-05-31 2020-06-05      weekly      2020-07-25
1  1 2020-06-07 2020-06-12      weekly      2020-07-25
2  1 2020-06-14 2020-06-19      weekly      2020-07-25
3  1 2020-06-21 2020-06-26      weekly      2020-07-25
...

How to calculate time difference between two successive rows with groupby?

I have the following dataset of students taking multiple SAT exams:
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A A B B B C'.split(),
                   'exam_date': [datetime.datetime(2013,4,1), datetime.datetime(2013,6,1),
                                 datetime.datetime(2013,7,1), datetime.datetime(2013,10,2),
                                 datetime.datetime(2014,1,1), datetime.datetime(2013,11,2),
                                 datetime.datetime(2014,2,2), datetime.datetime(2014,5,2),
                                 datetime.datetime(2014,5,2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-07-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
8 C 2014-05-02
I want to create a new column diff with the difference between successive exam dates for each individual student, and then filter on a particular threshold, e.g. 75 days. If a student doesn't have two successive dates, we need to drop that student.
I am trying the following script to create the new column:
df['exam_date'] = df.groupby('student')['exam_date'].apply(lambda x: x.sort_values())
df['diff'] = df.groupby('student')['exam_date'].diff() / np.timedelta64(1, 'D')
print(df)
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
5 B 2013-11-02 NaN
6 B 2014-02-02 92.0
7 B 2014-05-02 89.0
8 C 2014-05-02 NaN
Then I'm using query to filter the value and get the output:
df_new = df.query('diff <= 75')
print(df_new)
student exam_date diff
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
This correctly selects student A and removes students B and C. However, I'm missing the earliest date for student A.
Though I get the desired result using df[df['student'].isin(studentList)], it's too much work.
Is there a better way of getting the desired output, maybe using diff() and le()? Any suggestions would be appreciated. Thanks!
What you want is to filter students, but you are filtering exam records.
After you get df_new, just find the set of students and use it to select from df:
df[df.student.isin(df_new.student.unique())]
and you'll get:
student exam_date diff
0 A 2013-04-01 NaN
1 A 2013-06-01 61.0
2 A 2013-07-01 30.0
3 A 2013-10-02 93.0
4 A 2014-01-01 91.0
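A one-step alternative (a sketch, not from the original answer): keep every row of any student whose smallest gap is within the threshold, using a groupby transform, so no intermediate df_new is needed:
# min() skips NaN, and NaN <= 75 evaluates False, so students with a single
# exam (all-NaN diffs) are dropped as well
out = df[df.groupby('student')['diff'].transform('min').le(75)]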

Getting Subsequent Dates In Different Columns [duplicate]

This question already has answers here: How can I pivot a dataframe? (5 answers)
I have the following dataset of students taking multiple SAT exams:
import datetime
import pandas as pd

df = pd.DataFrame({'student': 'A A A A A B B B'.split(),
                   'exam_date': [datetime.datetime(2013,4,1), datetime.datetime(2013,6,1),
                                 datetime.datetime(2013,8,1), datetime.datetime(2013,10,2),
                                 datetime.datetime(2014,1,1), datetime.datetime(2013,11,2),
                                 datetime.datetime(2014,2,2), datetime.datetime(2014,5,2)]})
print(df)
student exam_date
0 A 2013-04-01
1 A 2013-06-01
2 A 2013-08-01
3 A 2013-10-02
4 A 2014-01-01
5 B 2013-11-02
6 B 2014-02-02
7 B 2014-05-02
I want to make a dataset of each student with their first exam date, second exam date, and so on.
I am trying groupby and min to get the first date, but I'm not sure about the subsequent dates.
# Find earliest time
df.groupby('student')['exam_date'].agg('min').reset_index()
I tried rank to get the desired result, but it seems like too much work.
# Rank
df['rank'] = df.groupby('student')['exam_date'].rank(ascending=True)
print(df)
student exam_date rank
0 A 2013-04-01 1.0
1 A 2013-06-01 2.0
2 A 2013-08-01 3.0
3 A 2013-10-02 4.0
4 A 2014-01-01 5.0
5 B 2013-11-02 1.0
6 B 2014-02-02 2.0
7 B 2014-05-02 3.0
Is there any better way of getting the desired output? Any suggestions would be appreciated. Thanks!
Desired Output:
student exam_01 exam_02 exam_03 exam_04
0 A 2013-04-01 2013-06-01 2013-08-01 2013-10-02
1 B 2013-11-02 2014-02-02 2014-05-02 NA
You can use groupby + cumcount to generate a helper column, then pivot.
NB: this assumes the dates are sorted; if not, use sort_values first (see the sketch after the output).
(df.assign(id=df.groupby('student').cumcount().add(1))
   .pivot(index='student', columns='id', values='exam_date')
   .add_prefix('exam_')
)
Output:
id exam_1 exam_2 exam_3 exam_4 exam_5
student
A 2013-04-01 2013-06-01 2013-08-01 2013-10-02 2014-01-01
B 2013-11-02 2014-02-02 2014-05-02 NaT NaT
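If the rows may arrive unsorted, the sort_values mentioned above can be folded into the same chain; a minimal sketch:
out = (df.sort_values(['student', 'exam_date'])
         .assign(id=lambda d: d.groupby('student').cumcount().add(1))
         .pivot(index='student', columns='id', values='exam_date')
         .add_prefix('exam_'))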

Transform Duration data in multiple formats to a common '%H:%M:%S' format; the %M part (minutes) is inconsistent

I have Duration data stored as an object with multiple formats, particularly in the minutes part between the colons. Any idea how I can transform this data? I tried everything imaginable with regex (except for the correct answer :) ), which was the main part I was struggling with. For example, below is my attempt to zero-pad the minutes column.
df['temp'] = df['temp'].replace(':?:', ':0?:', regex=True)
Input:
Duration
0 00:0:00
1 00:00:00
2 00:8:00
3 00:08:00
4 00:588:00
5 09:14:00
Expected Output Option #1 (Time format):
Duration
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
My end goal is to get the minutes, so another acceptable format would be:
Expected Output Option #2 (Minutes - integer or float):
Minutes
0 0
1 0
2 8
3 8
4 588
5 554
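For reference, a minimal setup to reproduce the sample, assuming the column is named Duration as shown:
import pandas as pd

df = pd.DataFrame({'Duration': ['00:0:00', '00:00:00', '00:8:00',
                                '00:08:00', '00:588:00', '09:14:00']})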
We can just do pd.to_timedelta:
pd.to_timedelta(df.Duration)
Output:
0 00:00:00
1 00:00:00
2 00:08:00
3 00:08:00
4 09:48:00
5 09:14:00
Name: Duration, dtype: timedelta64[ns]
Or Option 2 - Minutes:
pd.to_timedelta(df.Duration).dt.total_seconds()/60
Output:
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
Name: Duration, dtype: float64
We can also do str.split with mul, converting everything to minutes directly:
df.Duration.str.split(':', expand=True).astype(int).mul([60, 1, 1/60]).sum(1)
0 0.0
1 0.0
2 8.0
3 8.0
4 588.0
5 554.0
dtype: float64
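And since the original attempt was regex-based: zero-padding just the single-digit minutes can be done with a capture group (a sketch; note it leaves out-of-range values such as 588 untouched, so it only yields a valid %H:%M:%S when minutes are below 60):
# ':8:' -> ':08:'; '00:588:00' is unchanged because 588 is not a single digit
df['Duration'] = df['Duration'].str.replace(r':(\d):', r':0\1:', regex=True)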

Year to date average in dataframe

I have a dataframe in which I am trying to calculate the year-to-date average for my value columns. Below is a sample dataframe.
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
I want to create new columns (values_ytd & values2_ytd) that will average the values from January to the latest period within the same year (April in sample data). I will need to group the data by year & name when calculating the averages. I am looking for an output similar to this.
date name values values2 values2_ytd values_ytd
0 2019-01-01 a 1 1 1 1
1 2019-02-01 a 3 3 2 2
2 2019-03-01 a 2 2 2 2
3 2019-04-01 a 6 2 2 3
I have tried unsuccessfully to use expanding().mean(), but most likely I was doing it wrong. My main dataframe has numerous name categories and many more columns. Here is the code I was attempting to use
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).expanding().mean().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
but am receiving the following error.
NotImplementedError: ops for Expanding for this dtype datetime64[ns] are not implemented
Note: the code below works perfectly when substituting cumsum() for .expanding().mean() to create a year-to-date sum of the values, but I can't figure it out for averages
df1.groupby([df1['name'], df1['date'].dt.year], as_index=False).cumsum().loc[:, 'values':'values2'].add_suffix('_ytd').reset_index(drop=True,level=0)
Any help is greatly appreciated.
Try this (note the assignment lists the new columns in the same order as the source columns, so values_ytd is computed from values and values2_ytd from values2):
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df[['values_ytd', 'values2_ytd']] = df.groupby([df.index.year, 'name'])[['values', 'values2']].expanding().mean().reset_index(level=[0, 1], drop=True)
df
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
Example using multiple names and years:
date name values values2
0 2019-01-01 a 1 1
1 2019-02-01 a 3 3
2 2019-03-01 a 2 2
3 2019-04-01 a 6 2
4 2019-01-01 b 1 4
5 2019-02-01 b 3 4
6 2020-01-01 a 1 1
7 2020-02-01 a 3 3
8 2020-03-01 a 2 2
9 2020-04-01 a 6 2
Output:
name values values2 values_ytd values2_ytd
date
2019-01-01 a 1 1 1.0 1.0
2019-02-01 a 3 3 2.0 2.0
2019-03-01 a 2 2 2.0 2.0
2019-04-01 a 6 2 3.0 2.0
2019-01-01 b 1 4 1.0 4.0
2019-02-01 b 3 4 2.0 4.0
2020-01-01 a 1 1 1.0 1.0
2020-02-01 a 3 3 2.0 2.0
2020-03-01 a 2 2 2.0 2.0
2020-04-01 a 6 2 3.0 2.0
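Since cumsum() already works for you, the same average can also be computed as a running sum divided by a running count, avoiding expanding() entirely; a sketch under the same grouping (not part of the original answer):
g = df.groupby([df.index.year, 'name'])[['values', 'values2']]
# running mean = running sum / running count within each (year, name) group
ytd = g.cumsum().div(g.cumcount().add(1), axis=0).add_suffix('_ytd')
df[ytd.columns] = ytd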
You could also set the date column as the index, df.set_index('date', inplace=True), and then use df.groupby('name').resample('AS').mean(); note that this gives one annual average per year rather than a running year-to-date average.
