I have a pandas dataframe with a lot of columns, some of which have values on weekends.
I'm now trying to remove all weekend rows, but need to add the values I remove to the respective following Monday.
Thu: 4
Fri: 5
Sat: 2
Sun: 1
Mon: 4
Tue: 3
needs to become
Thu: 4
Fri: 5
Mon: 7
Tue: 3
I have figured out how to slice only the weekdays (using df.index.dayofweek), but can't think of a clever way to aggregate before doing so.
Here's some dummy code to start:
import datetime
import numpy as np
import pandas as pd

index = pd.date_range(datetime.datetime.now().date() - datetime.timedelta(20),
                      periods=20,
                      freq='D')
df = pd.DataFrame({'Val_1': np.random.rand(20),
                   'Val_2': np.random.rand(20),
                   'Val_3': np.random.rand(20)},
                  index=index)
df['Weekday'] = df.index.dayofweek
Any help on this would be much appreciated!
Setup
I included a random seed for reproducibility:
np.random.seed([3, 1415])
index = pd.date_range(datetime.datetime.now().date() - datetime.timedelta(20),
                      periods=20,
                      freq='D')
df = pd.DataFrame({'Val_1': np.random.rand(20),
                   'Val_2': np.random.rand(20),
                   'Val_3': np.random.rand(20)},
                  index=index)
df['day_name'] = df.index.day_name()
df.head(6)
Val_1 Val_2 Val_3 day_name
2018-07-18 0.444939 0.278735 0.651676 Wednesday
2018-07-19 0.407554 0.609862 0.136097 Thursday
2018-07-20 0.460148 0.085823 0.544838 Friday
2018-07-21 0.465239 0.836997 0.035073 Saturday
2018-07-22 0.462691 0.739635 0.275079 Sunday
2018-07-23 0.016545 0.866059 0.706685 Monday
Solution
I fill in a series of dates, replacing Saturdays and Sundays with the subsequent Monday. That series is then used as the key in a groupby operation.
weekdays = df.index.to_series().mask(df.index.dayofweek >= 5).bfill()
d_ = df.groupby(weekdays).sum(numeric_only=True)  # day_name is non-numeric, so exclude it
d_
Val_1 Val_2 Val_3
2018-07-18 0.444939 0.278735 0.651676
2018-07-19 0.407554 0.609862 0.136097
2018-07-20 0.460148 0.085823 0.544838
2018-07-23 0.944475 2.442691 1.016837
2018-07-24 0.850445 0.691271 0.713614
2018-07-25 0.817744 0.377185 0.776050
2018-07-26 0.777962 0.225146 0.542329
2018-07-27 0.757983 0.435280 0.836541
2018-07-30 2.645824 2.198333 1.375860
2018-07-31 0.926879 0.018688 0.746060
2018-08-01 0.721535 0.700566 0.373741
2018-08-02 0.117642 0.900749 0.603536
2018-08-03 0.145906 0.764869 0.775801
2018-08-06 0.738110 1.580137 1.266593
Compare
df.join(d_, rsuffix='_')
Val_1 Val_2 Val_3 day_name Val_1_ Val_2_ Val_3_
2018-07-18 0.444939 0.278735 0.651676 Wednesday 0.444939 0.278735 0.651676
2018-07-19 0.407554 0.609862 0.136097 Thursday 0.407554 0.609862 0.136097
2018-07-20 0.460148 0.085823 0.544838 Friday 0.460148 0.085823 0.544838
2018-07-21 0.465239 0.836997 0.035073 Saturday NaN NaN NaN
2018-07-22 0.462691 0.739635 0.275079 Sunday NaN NaN NaN
2018-07-23 0.016545 0.866059 0.706685 Monday 0.944475 2.442691 1.016837
2018-07-24 0.850445 0.691271 0.713614 Tuesday 0.850445 0.691271 0.713614
2018-07-25 0.817744 0.377185 0.776050 Wednesday 0.817744 0.377185 0.776050
2018-07-26 0.777962 0.225146 0.542329 Thursday 0.777962 0.225146 0.542329
2018-07-27 0.757983 0.435280 0.836541 Friday 0.757983 0.435280 0.836541
2018-07-28 0.934829 0.700900 0.538186 Saturday NaN NaN NaN
2018-07-29 0.831104 0.700946 0.185523 Sunday NaN NaN NaN
2018-07-30 0.879891 0.796487 0.652151 Monday 2.645824 2.198333 1.375860
2018-07-31 0.926879 0.018688 0.746060 Tuesday 0.926879 0.018688 0.746060
2018-08-01 0.721535 0.700566 0.373741 Wednesday 0.721535 0.700566 0.373741
2018-08-02 0.117642 0.900749 0.603536 Thursday 0.117642 0.900749 0.603536
2018-08-03 0.145906 0.764869 0.775801 Friday 0.145906 0.764869 0.775801
2018-08-04 0.199844 0.253200 0.091238 Saturday NaN NaN NaN
2018-08-05 0.437564 0.548054 0.504035 Sunday NaN NaN NaN
2018-08-06 0.100702 0.778883 0.671320 Monday 0.738110 1.580137 1.266593
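If you want the day names back on the aggregated frame, a small sketch (day_name was excluded from the sum above):
d_['day_name'] = d_.index.day_name()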
Setup data using a simple series so that the weekend roll value is obvious:
index = pd.date_range(start='2018-07-18', periods=20, freq='D')
df = pd.DataFrame({'Val_1': [1] * 20,
                   'Val_2': [2] * 20,
                   'Val_3': [3] * 20},
                  index=index)
You can take the cumulative sum of the relevant columns in your dataframe and then difference the results using a weekday boolean filter. You need some special logic to account for the first day(s) correctly, depending on whether the series starts on a weekday, a Saturday or a Sunday.
The correct roll behavior can be observed using an index start date of July 21st (Saturday) and the 22nd (Sunday).
In addition, you may need to account for the situation where the last day or two falls on a weekend. As is, those values would be lost. Depending on the situation, you may wish to roll them forwards to the following Monday (in which case you would need to extend your index) or else roll them back to the preceding Friday; a sketch of the roll-back option follows the example output below.
weekdays = df.index.dayofweek < 5
df2 = df.cumsum()[weekdays].diff()
if weekdays[0]:
    # First day is a weekday, so just use its value.
    df2.iloc[0, :] = df.iloc[0, :]
elif weekdays[1]:
    # First day must be a Sunday.
    df2.iloc[0, :] = df.iloc[0:2, :].sum()
else:
    # First day must be a Saturday.
    df2.iloc[0, :] = df.iloc[0:3, :].sum()
>>> df2.head(14)
Val_1 Val_2 Val_3
2018-07-18 1 2 3
2018-07-19 1 2 3
2018-07-20 1 2 3
2018-07-23 3 6 9
2018-07-24 1 2 3
2018-07-25 1 2 3
2018-07-26 1 2 3
2018-07-27 1 2 3
2018-07-30 3 6 9
2018-07-31 1 2 3
2018-08-01 1 2 3
2018-08-02 1 2 3
2018-08-03 1 2 3
2018-08-06 3 6 9
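As for trailing weekend days, here is a minimal sketch of the roll-back option, reusing the df, weekdays and df2 variables from above; it adds any leftover Saturday/Sunday values onto the last weekday row (typically the preceding Friday):
# Weekend rows after the last weekday would otherwise be dropped silently.
tail = df.loc[~weekdays & (df.index > df2.index[-1])]
if not tail.empty:
    df2.iloc[-1, :] += tail.sum()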
Related
I have a dataframe like this:
I want to convert the 'start_year', 'start_month', 'start_day' columns to one date
and the 'end_year', 'end_month', 'end_day' columns to another date.
Is there a way to do that?
Thank you.
Given a dataframe like this:
year month day
0 2019.0 12.0 29.0
1 2020.0 9.0 15.0
2 2018.0 3.0 1.0
You can convert them to date strings using a type cast and str.zfill:
df.apply(lambda x: f'{int(x["year"])}-{str(int(x["month"])).zfill(2)}-{str(int(x["day"])).zfill(2)}', axis=1)
Output:
0 2019-12-29
1 2020-09-15
2 2018-03-01
dtype: object
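If you need actual datetime values rather than strings, note that pd.to_datetime also accepts a frame whose columns are named year, month and day (floats included), so for the same sample:
# to_datetime understands a DataFrame with year/month/day columns
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])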
Here's an approach:
simulate some data, since your data was posted as an image
use apply against each row, building the dates with datetime.datetime()
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "start_year": np.random.choice(range(2018, 2022), 10),
        "start_month": np.random.choice(range(1, 13), 10),
        "start_day": np.random.choice(range(1, 28), 10),
        "end_year": np.random.choice(range(2018, 2022), 10),
        "end_month": np.random.choice(range(1, 13), 10),
        "end_day": np.random.choice(range(1, 28), 10),
    }
)
# Series.append was removed in pandas 2.0, so build the extra columns with pd.concat
df = df.apply(
    lambda r: pd.concat([r, pd.Series({f"{startend}_date": dt.datetime(*(r[f"{startend}_{part}"]
                                                                         for part in ["year", "month", "day"]))
                                       for startend in ["start", "end"]})]),
    axis=1)
df
   start_year  start_month  start_day  end_year  end_month  end_day           start_date             end_date
0        2018            9          6      2020          1        3  2018-09-06 00:00:00  2020-01-03 00:00:00
1        2018           11          6      2020          7        2  2018-11-06 00:00:00  2020-07-02 00:00:00
2        2021            8         13      2020         11        2  2021-08-13 00:00:00  2020-11-02 00:00:00
3        2021            3         15      2021          3        6  2021-03-15 00:00:00  2021-03-06 00:00:00
4        2019            4         13      2021         11        5  2019-04-13 00:00:00  2021-11-05 00:00:00
5        2021            2          5      2018          8       17  2021-02-05 00:00:00  2018-08-17 00:00:00
6        2020            4         19      2020          9       18  2020-04-19 00:00:00  2020-09-18 00:00:00
7        2020            3         27      2020         10       20  2020-03-27 00:00:00  2020-10-20 00:00:00
8        2019           12         23      2018          5       11  2019-12-23 00:00:00  2018-05-11 00:00:00
9        2021            7         18      2018          5       10  2021-07-18 00:00:00  2018-05-10 00:00:00
An interesting feature of the pandas to_datetime function is that instead of
a sequence of strings you can pass it a whole DataFrame.
In this case, the requirement is that the DataFrame has columns
named year, month and day. They can also be of float type, like your source
DataFrame sample.
So a quite elegant solution is to:
take a part of the source DataFrame (3 columns with the respective year,
month and day),
rename its columns to year, month and day,
use it as the argument to to_datetime,
save the result as a new column.
To do it, start by defining a lambda function, to be used as the rename
function below:
colNames = lambda x: x.split('_')[1]
Then just call:
df['Start'] = pd.to_datetime(df.loc[:, 'start_year' : 'start_day']
.rename(columns=colNames))
df['End'] = pd.to_datetime(df.loc[:, 'end_year' : 'end_day']
.rename(columns=colNames))
For a sample of your source DataFrame, the result is:
start_year start_month start_day evidence_method_dating end_year end_month end_day Start End
0 2019.0 12.0 9.0 Historical Observations 2019.0 12.0 9.0 2019-12-09 2019-12-09
1 2019.0 2.0 18.0 Historical Observations 2019.0 7.0 28.0 2019-02-18 2019-07-28
2 2018.0 7.0 3.0 Seismicity 2019.0 8.0 20.0 2018-07-03 2019-08-20
Maybe the next step should be to remove the columns holding the parts of both
"start" and "end" dates. Your choice; a sketch of the removal follows.
Edit
To avoid saving the lambda (anonymous) function under a variable, define
this function as a regular (named) function:
def colNames(x):
return x.split('_')[1]
I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below is causing ColumnC to have mixed formats. I'm not sure how to get rid of all of those long integers (nanosecond timestamps). I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
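For reference, a simpler equivalent that avoids np.where entirely is fillna, which keeps the datetime64 dtype and so sidesteps the nanosecond-integer display (a sketch, assuming both columns are already datetimes):
# fillna preserves datetime64 dtype, so nothing degrades to integers
df['ColumnC'] = df['ColumnA'].fillna(df['ColumnB'])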
A B C
0 2001-01-13 10:00:00 Saturday
1 2001-01-14 12:33:00 Sunday
2 2001-01-20 15:10:00 Saturday
3 2001-01-24 13:15:00 Wednesday
4 2001-01-24 16:56:00 Wednesday
5 2001-01-24 19:09:00 Wednesday
6 2001-01-28 19:14:00 Sunday
7 2001-01-29 11:00:00 Monday
8 2001-01-29 23:50:00 Monday
9 2001-01-30 11:50:00 Tuesday
10 2001-01-30 13:00:00 Tuesday
11 2001-02-02 16:14:00 Wednesday
12 2001-02-02 09:25:00 Friday
I want to create a new df containing rows between all periods from Mondays at 12:00:00 to Wednesdays at 17:00:00
The output would be:
A B C
3 2001-01-24 13:15:00 Wednesday
5 2001-01-24 16:56:00 Wednesday
8 2001-01-29 23:50:00 Monday
9 2001-01-30 11:50:00 Tuesday
10 2001-01-30 13:00:00 Tuesday
11 2001-02-02 16:14:00 Wednesday
I tried with
df[(df["B"] >= "12:00:00") & (df["B"] <= "17:00:00")] & df[(df["C"] >= "Monday") & (df["C"] <= "Wednesday")]
But this is not what I want.
Thank you.
You can create 3 boolean masks and filter by boolean indexing - the first for the starting day with its start time, the second for the whole days in between, and the last for the final day with its end time:
from datetime import time
#if necessary convert to datetime
df['A'] = pd.to_datetime(df['A'])
#if necessary convert to times
df['B'] = pd.to_datetime(df['B']).dt.time
m1 = (df['B']>=time(12)) & (df['C'] == 'Monday')
m2 = (df['C'] == 'Tuesday')
m3 = (df['B']<=time(17)) & (df['C'] == 'Wednesday')
df = df[m1 | m2 | m3]
print (df)
A B C
3 2001-01-24 13:15:00 Wednesday
4 2001-01-24 16:56:00 Wednesday
8 2001-01-29 23:50:00 Monday
9 2001-01-30 11:50:00 Tuesday
10 2001-01-30 13:00:00 Tuesday
12 2001-02-02 09:25:00 Wednesday
Another solution, with the same times but spanning Monday to Friday:
from datetime import time
df['A'] = pd.to_datetime(df['A'])
df['B'] = pd.to_datetime(df['B']).dt.time
m1 = (df['B']>=time(12)) & (df['C'] == 'Monday')
m2 = df['C'].isin(['Tuesday', 'Wednesday'])
m3 = (df['B']<=time(17)) & (df['C'] == 'Friday')
df = df[m1 | m2 | m3]
print (df)
A B C
3 2001-01-24 13:15:00 Wednesday
4 2001-01-24 16:56:00 Wednesday
5 2001-01-24 19:09:00 Wednesday
8 2001-01-29 23:50:00 Monday
9 2001-01-30 11:50:00 Tuesday
10 2001-01-30 13:00:00 Tuesday
11 2001-02-02 16:14:00 Friday
12 2001-02-02 09:25:00 Wednesday
Use the OR operator (|) and equality (==) for the day names, instead of chaining <= and >= on strings. Hope it helps. Thanks.
Old: df[(df["B"] >= "12:00:00") & (df["B"] <= "17:00:00")] & df[(df["C"] >= "Monday") & (df["C"] <= "Wednesday")]
New: df[(df["B"] >= "12:00:00") & (df["B"] <= "17:00:00") & ((df["C"] == "Monday") | (df["C"] == "Tuesday") | (df["C"] == "Wednesday"))]
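A slightly tidier spelling of that same filter uses isin (a sketch; like the line above, it assumes B holds zero-padded time strings and applies the time window to all three days):
mask_time = (df["B"] >= "12:00:00") & (df["B"] <= "17:00:00")
mask_day = df["C"].isin(["Monday", "Tuesday", "Wednesday"])
df[mask_time & mask_day]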
I need to get from 0 days 08:00:00 to 08:00:00.
code:
import pandas as pd

df = pd.DataFrame({
    'Slot_no': [1, 2, 3, 4, 5, 6, 7],
    'start_time': ['0:01:00', '8:01:00', '10:01:00', '12:01:00', '14:01:00', '18:01:00', '20:01:00'],
    'end_time': ['8:00:00', '10:00:00', '12:00:00', '14:00:00', '18:00:00', '20:00:00', '0:00:00'],
    'location_type': ['not considered', 'Food', 'Parks & Outdoors', 'Food',
                      'Arts & Entertainment', 'Parks & Outdoors', 'Food']})
# reindex_axis was removed from pandas; reindex(columns=...) is the modern spelling
df = df.reindex(columns=['Slot_no', 'start_time', 'end_time', 'location_type', 'loc_set'])
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
output:
print (df)
Slot_no start_time end_time location_type loc_set
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
You can use to_datetime with dt.time:
df['end_time_times'] = pd.to_datetime(df['end_time']).dt.time
print (df)
Slot_no start_time end_time location_type loc_set \
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
end_time_times
0 08:00:00
1 10:00:00
2 12:00:00
3 14:00:00
4 18:00:00
5 20:00:00
6 00:00:00
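An alternative sketch that also wraps the 1 days 00:00:00 row back to midnight and yields plain HH:MM:SS strings (column names as in the setup above):
# Wrap at 24 hours, then render the timedeltas as HH:MM:SS strings
wrapped = df['end_time'] % pd.Timedelta(hours=24)
df['end_time_str'] = (pd.Timestamp('1970-01-01') + wrapped).dt.strftime('%H:%M:%S')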
I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However, I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return the difference as int if there are no missing values (NaT), and as float if there are.
Pandas has rich documentation on Time series / date functionality and Time deltas.
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is float rather than int because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
This is the frequency conversion approach from the pandas timedelta docs.
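If you want integers while keeping the missing values, a hedged sketch using pandas' nullable Int64 dtype (available since pandas 0.24; the column name is just illustrative):
# .dt.days gives floats when NaT is present; Int64 keeps <NA> alongside ints
df_test['Difference_int'] = (df_test['First_Date'] - df_test['Second Date']).dt.days.astype('Int64')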
You can use the datetime module to help here. Also, as a side note, a simple date subtraction works as shown below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change the type to datetime.timedelta, and then use the .days attribute on valid timedelta objects.
In [226]: df_test['Difference_days'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Difference_days
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around a year boundary. This is useful when measuring proximity to a date by day of year. To do these row operations, I did the following (I used this in a business setting for renewing customer subscriptions).
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date.
        # There's some tricky logic in here to calculate the date difference
        # the other way around (Dec -> Jan is 1 month rather than 11).
        sub_start_date = int(row[x].strftime('%j'))  # day of year (1-366)
        close_date = int(row[y].strftime('%j'))      # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
Then the function can be applied like this:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
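A quick sanity check of the wrap-around behaviour on a tiny illustrative frame:
demo = pd.DataFrame({'customer_since_date': pd.to_datetime(['2020-12-20']),
                     'renewal_date': pd.to_datetime(['2021-01-10'])})
demo['date_difference'] = demo.apply(get_date_difference,
                                     x='customer_since_date',
                                     y='renewal_date', axis=1)
print(demo['date_difference'])  # 20 (wrapped across New Year), not 345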
Create a vectorized method
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    # Infer the time unit from the first row, then convert the whole column at once
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, months and years.
The open_time and end_time column names need to change according to your df.
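A minimal usage sketch under those assumptions (hypothetical hourly data; the first row's spacing determines the unit):
df = pd.DataFrame({'open_time': pd.to_datetime(['2021-01-01 00:00', '2021-01-02 00:00']),
                   'end_time': pd.to_datetime(['2021-01-01 06:00', '2021-01-02 06:00'])})
df['x'] = calc_xb_minus_xa(df)  # each row -> 6.0, i.e. hours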