Assume the DataFrame df below:
Start_Date End_Date
0 20201101 20201130
1 20201201 20201231
2 20210101 20210131
3 20210201 20210228
4 20210301 20210331
How do I calculate the time difference between two date columns in days?
Required Output
Start_Date End_Date Diff_in_Days
0 20201101 20201130 29
1 20201201 20201231 30
2 20210101 20210131 30
3 20210201 20210228 27
4 20210301 20210331 30
The first idea is to convert the columns to datetimes, take the difference, and convert the timedeltas to days with Series.dt.days:
df['Diff_in_Days'] = (pd.to_datetime(df['End_Date'], format='%Y%m%d')
                        .sub(pd.to_datetime(df['Start_Date'], format='%Y%m%d'))
                        .dt.days)
print (df)
Start_Date End_Date Diff_in_Days
0 20201101 20201130 29
1 20201201 20201231 30
2 20210101 20210131 30
3 20210201 20210228 27
4 20210301 20210331 30
Another solution, better if you need to process the datetimes later, is to reassign the converted columns back and use the same subtraction:
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%Y%m%d')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%Y%m%d')
df['Diff_in_Days'] = df['End_Date'].sub(df['Start_Date']).dt.days
print (df)
Start_Date End_Date Diff_in_Days
0 2020-11-01 2020-11-30 29
1 2020-12-01 2020-12-31 30
2 2021-01-01 2021-01-31 30
3 2021-02-01 2021-02-28 27
4 2021-03-01 2021-03-31 30
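For reference, a minimal sketch that rebuilds the sample frame, assuming the dates are stored as YYYYMMDD integers as shown in the question:
import pandas as pd

# sample data from the question (dates as YYYYMMDD integers)
df = pd.DataFrame({'Start_Date': [20201101, 20201201, 20210101, 20210201, 20210301],
                   'End_Date':   [20201130, 20201231, 20210131, 20210228, 20210331]})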
Related
I have a dataframe:
   a  b
0  7  2019-05-01 00:00:01
1  6  2019-05-02 00:15:01
2  1  2019-05-06 00:10:01
3  3  2019-05-09 01:00:01
4  8  2019-05-09 04:20:01
5  9  2019-05-12 01:10:01
6  4  2019-05-16 03:30:01
And
l = [datetime.datetime(2019, 5, 2), datetime.datetime(2019, 5, 10), datetime.datetime(2019, 5, 22)]
I want to add a column with the following:
for each row, find the last date from l that falls before it, and add the number of days between them.
If no date in l is earlier, use the delta from the earliest one.
So the new column will be:
   a  b                    delta  date
0  7  2019-05-01 00:00:01     -1  datetime.datetime(2019, 5, 2)
1  6  2019-05-02 00:15:01      0  datetime.datetime(2019, 5, 2)
2  1  2019-05-06 00:10:01      4  datetime.datetime(2019, 5, 2)
3  3  2019-05-09 01:00:01      7  datetime.datetime(2019, 5, 2)
4  8  2019-05-09 04:20:01      7  datetime.datetime(2019, 5, 2)
5  9  2019-05-12 01:10:01      2  datetime.datetime(2019, 5, 10)
6  4  2019-05-16 03:30:01      6  datetime.datetime(2019, 5, 10)
How can I do it?
Using merge_asof to align df['b'] and the list (as Series), then computing the difference:
# ensure datetime
df['b'] = pd.to_datetime(df['b'])
# craft Series for merging (could be combined with line below)
s = pd.Series(l, name='l')
# merge and fillna with minimum date
ref = pd.merge_asof(df['b'], s, left_on='b', right_on='l')['l'].fillna(s.min())
# compute the delta as days
df['delta'] = (df['b'] - ref).dt.days
output:
a b delta
0 7 2019-05-01 00:00:01 -1
1 6 2019-05-02 00:15:01 0
2 1 2019-05-06 00:10:01 4
3 3 2019-05-09 01:00:01 7
4 8 2019-05-09 04:20:01 7
5 9 2019-05-12 01:10:01 2
6 4 2019-05-16 03:30:01 6
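One caveat: merge_asof requires both keys to be sorted, which the sample data already satisfies. A sketch for the general case:
# merge_asof needs sorted keys on both sides; sort first if necessary
df = df.sort_values('b')
s = pd.Series(sorted(l), name='l')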
Here's a one-line solution if your b column holds datetime objects. Otherwise, convert it to datetime first.
df['delta'] = df.apply(lambda x: min([x.b - i for i in l if i <= x.b], default=x.b - min(l)).days, axis=1)
Explanation: to each row you apply a function that:
Computes the timedelta between the row's datetime and every date in l that does not come after it, then stores them in a list
Takes the smallest of those deltas, i.e. the delta to the closest preceding date (falling back to the delta from the earliest date in l when no date precedes the row)
Returns its days. Note that apply works row by row, so the merge_asof approach above scales better on large frames.
This code separates the timestamp of this dataset into components, for example:
weekday Friday
year 2014
day 01
hour 00
minute 03
rides['weekday'] = rides.timestamp.dt.strftime("%A")
rides['year'] = rides.timestamp.dt.strftime("%Y")
rides['day'] = rides.timestamp.dt.strftime("%d")
rides['hour'] = rides.timestamp.dt.strftime("%H")
rides["minute"] = rides.timestamp.dt.strftime("%M")
I have a data frame dft:
Date Total Value
02/01/2022 2
03/01/2022 6
N/A 4
03/11/2022 4
03/15/2022 4
05/01/2022 4
For each date in the data frame, I want to calculate how many days it is from today, and I want to add these calculated values in a new column called Days.
I have tried the following code:
newdft = []
for item in dft:
temp = item.copy()
timediff = datetime.now() - datetime.strptime(temp["Date"], "%m/%d/%Y")
temp["Days"] = timediff.days
newdft.append(temp)
But the third date value is N/A, which caused an error. What should I add to my code so that I conduct the calculation only when the date value is valid?
I would convert the whole Date column to a datetime object using pd.to_datetime() with errors set to coerce, which replaces the 'N/A' string with NaT (Not a Timestamp):
dft['Date'] = pd.to_datetime(dft['Date'], errors='coerce')
So the column will now look like this:
0 2022-02-01
1 2022-03-01
2 NaT
3 2022-03-11
4 2022-03-15
5 2022-05-01
Name: Date, dtype: datetime64[ns]
You can then subtract that column from the current date in one go, which will automatically produce NaT for the missing value, and assign this as a new column:
dft['Days'] = datetime.now() - dft['Date']
This will make dft look like below:
Date Total Value Days
0 2022-02-01 2 148 days 15:49:03.406935
1 2022-03-01 6 120 days 15:49:03.406935
2 NaT 4 NaT
3 2022-03-11 4 110 days 15:49:03.406935
4 2022-03-15 4 106 days 15:49:03.406935
5 2022-05-01 4 59 days 15:49:03.406935
If you just want the number instead of 59 days 15:49:03.406935, you can do the below instead:
dft['Days'] = (datetime.now() - dft['Date']).dt.days
Which will give you:
Date Total Value Days
0 2022-02-01 2 148.0
1 2022-03-01 6 120.0
2 NaT 4 NaN
3 2022-03-11 4 110.0
4 2022-03-15 4 106.0
5 2022-05-01 4 59.0
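If you'd rather keep whole numbers than floats here, a sketch using pandas' nullable integer dtype:
# .dt.days upcasts to float when NaT is present; Int64 keeps integers plus <NA>
dft['Days'] = (datetime.now() - dft['Date']).dt.days.astype('Int64')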
In contrast to Emi OB's excellent answer, if you did actually need to process individual values, it's usually easier to use apply to create a new Series from an existing one. You'd just need to filter out 'N/A'.
df['Days'] = (
    df['Date']
    [lambda d: d != 'N/A']
    .apply(lambda d: (datetime.now() - datetime.strptime(d, "%m/%d/%Y")).days)
)
Result:
Date Total Value Days
0 02/01/2022 2 148.0
1 03/01/2022 6 120.0
2 N/A 4 NaN
3 03/11/2022 4 110.0
4 03/15/2022 4 106.0
5 05/01/2022 4 59.0
And for what it's worth, another option is date.today() instead of datetime.now():
.apply(lambda d: date.today() - datetime.strptime(d, "%m/%d/%Y").date())
And the result is a timedelta instead of float:
Date Total Value Days
0 02/01/2022 2 148 days
1 03/01/2022 6 120 days
2 N/A 4 NaT
3 03/11/2022 4 110 days
4 03/15/2022 4 106 days
5 05/01/2022 4 59 days
See also: How do I select rows from a DataFrame based on column values?
Following up on the excellent answer by Emi OB, I would suggest using Series.mask() to update the dataframe without type coercion.
import datetime
import pandas as pd
dft = pd.DataFrame({'Date': ['02/01/2022',
                             '03/01/2022',
                             None,
                             '03/11/2022',
                             '03/15/2022',
                             '05/01/2022'],
                    'Total Value': [2, 6, 4, 4, 4, 4]})
dft['today'] = datetime.datetime.now()
dft['Days'] = 0
dft['Days'].mask(dft['Date'].notna(),
                 (dft['today'] - pd.to_datetime(dft['Date'])).dt.days,
                 axis=0, inplace=True)
dft.drop(columns=['today'], inplace=True)
This would result in integer values in the Days column:
Date Total Value Days
0 02/01/2022 2 148
1 03/01/2022 6 120
2 None 4 None
3 03/11/2022 4 110
4 03/15/2022 4 106
5 05/01/2022 4 59
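For comparison, a sketch of the same result without the helper column or the mask (None becomes NaT through pd.to_datetime, and the nullable Int64 dtype keeps the missing value):
dft['Days'] = (datetime.datetime.now()
               - pd.to_datetime(dft['Date'])).dt.days.astype('Int64')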
I have a DataFrame with this kind of date:
Year  Day  Hour  Minute
2017  244     0       0
2017  244     0       1
2017  244     0       2
I want to create a new column on this DataFrame showing the date plus hour and minute, but I don't know how to convert the day of year into a month and unify everything.
I tried something using pd.to_datetime, like the code below.
line['datetime'] = pd.to_datetime(line['Year'] + line['Day'] + line['Hour'] + line['Minute'], format= '%Y%m%d %H%M')
I would like to have something like this:
Year  Month  Day  Hour  Minute
2017      9    1     0       0
2017      9    1     0       1
2017      9    1     0       2
So in your case, join the columns as strings and parse them with %j (day of year):
df['date'] = pd.to_datetime(df.astype(str).agg(' '.join, axis=1), format='%Y %j %H %M')
Out[294]:
0 2017-09-01 00:00:00
1 2017-09-01 00:01:00
2 2017-09-01 00:02:00
dtype: datetime64[ns]
#df['month'] = df['date'].dt.month
#df['day'] = df['date'].dt.day
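A minimal sketch to try this end to end, rebuilding the example frame from the values in the question:
import pandas as pd

df = pd.DataFrame({'Year': [2017, 2017, 2017],
                   'Day': [244, 244, 244],
                   'Hour': [0, 0, 0],
                   'Minute': [0, 1, 2]})
# day 244 of 2017 is September 1
df['date'] = pd.to_datetime(df.astype(str).agg(' '.join, axis=1),
                            format='%Y %j %H %M')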
Try:
s = pd.to_datetime(df['Year'], format='%Y') \
    + pd.TimedeltaIndex(df['Day'] - 1, unit='D')
print(s)
# Output
0 2017-09-01
1 2017-09-01
2 2017-09-01
dtype: datetime64[ns]
Now you can insert your columns:
df.insert(1, 'Month', s.dt.month)
df['Day'] = s.dt.day
print(df)
# Output
Year Month Day Hour Minute
0 2017 9 1 0 0
1 2017 9 1 0 1
2 2017 9 1 0 2
df["Month"]=round(df["Day"]/30+.5).astype(int)
This establishes a new column and populates taht column by using the day column to calculate the month (total days / 30), rounding up by adding .5 and inserting it as an integer using astype
I have a dataframe object like this:
Date ID Delta
2019-10-16 16:43:46 BA9565P 0 days 00:00:00
2019-10-17 05:28:36 BA9565P 0 days 12:44:50
2019-10-16 16:43:13 BA9565X 0 days 00:00:00
2019-10-17 03:26:52 BA9565X 0 days 10:43:39
2019-10-10 19:17:17 BABRGNR 0 days 00:00:00
2019-10-12 19:43:56 BABRGNR 2 days 00:26:39
2019-10-31 00:48:52 BABRGR8 0 days 00:00:00
2019-11-01 14:33:41 BABRGR8 1 days 13:44:49
If rows with the same ID are within 3 days of each other, then I only need the latest one. However, if rows with the same ID are more than 3 days apart, then I want to keep both records. So far I have done this:
df2 = df[df.duplicated(['ID'], keep = False)][['Date', 'ID']]
df2["Date"] = pd.to_datetime(df2["Date"])
df2["Delta"] = df2.groupby(['ID']).diff()
df2["Delta"] = df2["Delta"].fillna(datetime.timedelta(seconds=0))
However I am not sure how should I continue. I have tried:
df2["Delta2"] = (df2["Delta"] < datetime.timedelta(days=3)
The condition would be True for the first element of the group whether they are within 3 days or not.
df2.groupby(['ID']).filter(lambda x: ((x["Delta"] < datetime.timedelta(days=3)) & \
                                      (x["Delta"] != datetime.timedelta(seconds=0))).any())
Again, it has a similar problem because .diff() always returns NaT for the first element of each group. Is there a way to access the last element of the group? Or is there a better way than using groupby().diff()?
Solution: select all rows of a group if any difference within the group is 3 days or more, otherwise keep only the last row of the group:
print (df)
Date ID Delta
0 2019-10-16 16:43:46 BA9565P 0 days 00:00:00
1 2019-10-17 05:28:36 BA9565P 0 days 12:44:50
2 2019-10-16 16:43:13 BA9565X 0 days 00:00:00
3 2019-10-20 03:26:52 BA9565X 0 days 10:43:39 <- changed data sample to 2019-10-20
4 2019-10-10 19:17:17 BABRGNR 0 days 00:00:00
5 2019-10-12 19:43:56 BABRGNR 2 days 00:26:39
6 2019-10-31 00:48:52 BABRGR8 0 days 00:00:00
7 2019-11-01 14:33:41 BABRGR8 1 days 13:44:49
#if not sorted dates
#df = df.sort_values(['ID','Date'])
df2 = df[df.duplicated(['ID'], keep=False)].copy() # copy so the assignments below do not warn
#get differences
df2["Delta"] = df2.groupby(['ID'])['Date'].diff().fillna(pd.Timedelta(0))
#compare by 3 days
mask = df2["Delta"] < pd.Timedelta(days=3)
#test if all Trues per groups
mask1 = mask.groupby(df2['ID']).transform('all')
#get last row per ID
mask2 = ~df2["ID"].duplicated(keep='last')
#filtering
df2 = df2[~mask1 | mask2]
print (df2)
Date ID Delta
1 2019-10-17 05:28:36 BA9565P 0 days 12:44:50
2 2019-10-16 16:43:13 BA9565X 0 days 00:00:00
3 2019-10-20 03:26:52 BA9565X 3 days 10:43:39
5 2019-10-12 19:43:56 BABRGNR 2 days 00:26:39
7 2019-11-01 14:33:41 BABRGR8 1 days 13:44:49
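For completeness, a sketch rebuilding the (modified) sample frame used above:
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-10-16 16:43:46', '2019-10-17 05:28:36',
                            '2019-10-16 16:43:13', '2019-10-20 03:26:52',
                            '2019-10-10 19:17:17', '2019-10-12 19:43:56',
                            '2019-10-31 00:48:52', '2019-11-01 14:33:41']),
    'ID': ['BA9565P', 'BA9565P', 'BA9565X', 'BA9565X',
           'BABRGNR', 'BABRGNR', 'BABRGR8', 'BABRGR8']})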
I have a dataframe with values per day (see df below).
I want to group the "Forecast" field per week but with Monday as the first day of the week.
Currently I can do it via pd.TimeGrouper('W'), but it groups the week starting on Sundays (see df_final below).
import pandas as pd
data = [("W1","G1",1234,pd.to_datetime("2015-07-1"),8),
("W1","G1",1234,pd.to_datetime("2015-07-30"),2),
("W1","G1",1234,pd.to_datetime("2015-07-15"),2),
("W1","G1",1234,pd.to_datetime("2015-07-2"),4),
("W1","G2",2345,pd.to_datetime("2015-07-5"),5),
("W1","G2",2345,pd.to_datetime("2015-07-7"),1),
("W1","G2",2345,pd.to_datetime("2015-07-9"),1),
("W1","G2",2345,pd.to_datetime("2015-07-11"),3)]
labels = ["Site","Type","Product","Date","Forecast"]
df = pd.DataFrame(data,columns=labels).set_index(["Site","Type","Product","Date"])
df
Forecast
Site Type Product Date
W1 G1 1234 2015-07-01 8
2015-07-30 2
2015-07-15 2
2015-07-02 4
G2 2345 2015-07-05 5
2015-07-07 1
2015-07-09 1
2015-07-11 3
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.TimeGrouper('W')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
df_final
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-05 12 6
1 W1 1234 2015-07-19 2 6
2 W1 1234 2015-08-02 2 6
3 W1 2345 2015-07-05 5 6
4 W1 2345 2015-07-12 5 6
Use W-MON instead of W; check anchored offsets:
df_final = (df
            .reset_index()
            .set_index("Date")
            .groupby(["Site","Product",pd.Grouper(freq='W-MON')])["Forecast"].sum()
            .astype(int)
            .reset_index())
df_final["DayOfWeek"] = df_final["Date"].dt.dayofweek
print (df_final)
Site Product Date Forecast DayOfWeek
0 W1 1234 2015-07-06 12 0
1 W1 1234 2015-07-20 2 0
2 W1 1234 2015-08-03 2 0
3 W1 2345 2015-07-06 5 0
4 W1 2345 2015-07-13 5 0
I have three solutions to this problem, described below. First, I should state that the previously accepted answer is incorrect. Here is why:
# let's create an example df of length 9, 2020-03-08 is a Sunday
s = pd.DataFrame({'dt': pd.date_range('2020-03-08', periods=9, freq='D'),
                  'counts': 0})
> s
   dt                   counts
0  2020-03-08 00:00:00       0
1  2020-03-09 00:00:00       0
2  2020-03-10 00:00:00       0
3  2020-03-11 00:00:00       0
4  2020-03-12 00:00:00       0
5  2020-03-13 00:00:00       0
6  2020-03-14 00:00:00       0
7  2020-03-15 00:00:00       0
8  2020-03-16 00:00:00       0
These nine days span three Monday-to-Sunday weeks. The weeks of March 2nd, 9th, and 16th. Let's try the accepted answer:
# the accepted answer
> s.groupby(pd.Grouper(key='dt',freq='W-Mon')).count()
                     counts
dt
2020-03-09 00:00:00       2
2020-03-16 00:00:00       7
This is wrong because the OP wants to have "Monday as the first day of the week" (not as the last day of the week) in the resulting dataframe. Let's see what we get when we try with freq='W'
> s.groupby(pd.Grouper(key='dt', freq='W')).count()
                     counts
dt
2020-03-08 00:00:00       1
2020-03-15 00:00:00       7
2020-03-22 00:00:00       1
This grouper actually grouped as we wanted (Monday to Sunday) but labeled the 'dt' with the END of the week, rather than the start. So, to get what we want, we can move the index by 6 days like:
w = s.groupby(pd.Grouper(key='dt', freq='W')).count()
w.index -= pd.Timedelta(days=6)
or alternatively we can do:
s.groupby(pd.Grouper(key='dt',freq='W-Mon',label='left',closed='left')).count()
A third solution, arguably the most readable one, is converting dt to a period first, then grouping, and finally (if needed) converting back to timestamp:
s.groupby(s['dt'].dt.to_period('W'))['counts'].count().to_timestamp()
# a variant of this solution is: s.set_index('dt').to_period('W').groupby(pd.Grouper(freq='W')).count().to_timestamp()
All of these solutions return what the OP asked for:
                     counts
dt
2020-03-02 00:00:00       1
2020-03-09 00:00:00       7
2020-03-16 00:00:00       1
Explanation: when freq is provided to pd.Grouper, both the closed and label kwargs default to 'right'. Setting freq to W (short for W-SUN) works because we want our week to end on Sunday (Sunday included; closed='right' handles this). Unfortunately, the pd.Grouper docstring does not show the default values, but you can see them like this:
g = pd.Grouper(key='dt', freq='W')
print(g.closed, g.label)
> right right
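As a final sanity check, a sketch (reusing s from above) showing that the two Monday-anchored variants agree:
# both should label the weeks 2020-03-02, 2020-03-09 and 2020-03-16
print(s.groupby(pd.Grouper(key='dt', freq='W-MON',
                           label='left', closed='left'))['counts'].count())
print(s.groupby(s['dt'].dt.to_period('W'))['counts'].count().to_timestamp())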