Difference Between Dates - Integer results [duplicate] - python

I need to calculate the difference between two columns of type datetime, and the result must be in days (integer format). However, what I am getting is a timedelta result showing days, hours, minutes, and seconds.
id date_1 date_2 date_3 date_result_2-1 date_result_3-1
0 C_ID_92a2005557 2017-06-01 2017-06-27 14:18:08 2018-04-29 11:23:05 26 days 14:18:08 332 days 11:23:05
1 C_ID_3d0044924f 2017-01-01 2017-01-06 16:29:42 2018-03-30 06:48:26 5 days 16:29:42 453 days 06:48:26
2 C_ID_d639edf6cd 2016-08-01 2017-01-11 08:21:22 2018-04-28 17:43:11 163 days 08:21:22 635 days 17:43:11
3 C_ID_186d6a6901 2017-09-01 2017-09-26 16:22:21 2018-04-18 11:00:11 25 days 16:22:21 229 days 11:00:11
4 C_ID_cdbd2c0db2 2017-11-01 2017-11-12 00:00:00 2018-04-28 18:50:25 11 days 00:00:00 178 days 18:50:25
The last two columns are the result I obtained by simply subtracting the date columns. I would like these columns to contain only the number of days, as integers.
I tried to convert with astype(int) but got a result that I could not understand.
Any suggestion? Thank you very much in advance.

If you need only days, try this:
import pandas as pd

df = pd.DataFrame(data={"date": ['2000-05-07', '1965-01-30', 'NaT'],
                        "date_2": ["2019-01-19 12:26:00", "2019-03-21 02:23:12", "2018-11-02 18:30:10"]})
# normalize() truncates the time-of-day but keeps the datetime64 dtype
# (unlike .dt.date), so the subtraction below still yields a timedelta64
# column that supports the .dt accessor
df['date'] = pd.to_datetime(df['date']).dt.normalize()
df['date_2'] = pd.to_datetime(df['date_2']).dt.normalize()
df['days'] = (df['date'] - df['date_2']).dt.days
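For the frame in the question, where the plain subtraction has already produced timedelta64 columns, .dt.days can be applied directly. A minimal sketch, assuming the question's frame is named df and uses the column names shown above:

# assumes date_result_2-1 and date_result_3-1 are timedelta64 columns,
# as produced by the subtraction shown in the question
df['date_result_2-1'] = df['date_result_2-1'].dt.days
df['date_result_3-1'] = df['date_result_3-1'].dt.days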

Related

converting datetime column to total hours and offsetting from reference time

I have the following datetime object:
import pandas as pd
from datetime import datetime
t0=datetime.strptime("01/01/2011 00:00:00", "%d/%m/%Y %H:%M:%S")
here, t0 is my reference or start time of the simulation. I wanted to convert it into total hours (but failed) so that I can add them to my Hours column and finally obtain a datetime column starting from 2021-05-01.
I have the following Hours column, which counts hours from the start time t0:
My model results in hours:
Hours
0 44317.0
1 44317.250393519
2 44317.500138889
3 44317.750462963
4 44318.00005787
5 44318.250266204
6 44318.500543981
7 44318.7503125
8 44319.000520833
9 44319.250729167
10 44319.500428241
In Excel, if I convert these hours into date format, the first value becomes 2021-05-01, which is my expected output:
My expected output:
Hours
1 5/1/21 0:00
2 5/1/21 6:00
3 5/1/21 12:00
4 5/1/21 18:00
5 5/2/21 0:00
6 5/2/21 6:00
7 5/2/21 12:00
8 5/2/21 18:00
9 5/3/21 0:00
10 5/3/21 0:00
However, in Python, if I convert this Hours column into a datetime column named date using pd.to_datetime(df.Hours), it starts from 1970-01-01.
My python output which I don't want:
Hours
0 1970-01-01 00:00:00.000044317
1 1970-01-01 00:00:00.000044317
2 1970-01-01 00:00:00.000044317
3 1970-01-01 00:00:00.000044317
4 1970-01-01 00:00:00.000044318
5 1970-01-01 00:00:00.000044318
6 1970-01-01 00:00:00.000044318
7 1970-01-01 00:00:00.000044318
8 1970-01-01 00:00:00.000044319
9 1970-01-01 00:00:00.000044319
10 1970-01-01 00:00:00.000044319
Please let me know how to convert it so that it starts from 1st May, 2021.
Solution: From Michael S.'s answer below:
The Hours column is actually not hours but days, and using pd.to_datetime(df.Hours, unit='d', origin='1900-01-01') gives the right results. The software that I am using also uses an Excel-like epoch of '1900-01-01' but mistakenly labels the days as hours.
Here is an update to the answer with OP's edits and inputs. Excel is weird with dates: Pandas and Excel have different "start of time" epochs, which is why you are seeing 1970 instead of 2021. Your 44317 etc. numbers are actually days, not hours, and you have to add them to Excel's day-zero date. Because Excel's serial numbering wrongly treats 1900 as a leap year, day zero for modern dates is 1899-12-30 rather than 1899-12-31:
import pandas as pd
from datetime import datetime

hours = [44317.0, 44317.250393519, 44317.500138889, 44317.750462963,
         44318.00005787, 44318.250266204, 44318.500543981, 44318.7503125,
         44319.000520833, 44319.250729167, 44319.500428241]
df = pd.DataFrame({"Hours": hours})
# interpret the values as day counts from Excel's day-zero date
df["Actual Date"] = pd.TimedeltaIndex(df['Hours'], unit='d') + datetime(1899, 12, 30)
# Alternative: pd.to_datetime(df.Hours, unit='d', origin='1899-12-30')
Output:
Hours Actual Date
0 44317.000000 2021-05-01 00:00:00.000000000
1 44317.250394 2021-05-01 06:00:34.000041600
2 44317.500139 2021-05-01 12:00:12.000009600
3 44317.750463 2021-05-01 18:00:40.000003200
4 44318.000058 2021-05-02 00:00:04.999968000
5 44318.250266 2021-05-02 06:00:23.000025600
6 44318.500544 2021-05-02 12:00:46.999958400
7 44318.750313 2021-05-02 18:00:27.000000000
8 44319.000521 2021-05-03 00:00:44.999971199
9 44319.250729 2021-05-03 06:01:03.000028799
10 44319.500428 2021-05-03 12:00:37.000022400
There are ways to clean up the format, but this is the correct time as you wanted.
To match your output exactly, you can do this; just be aware that the cells in the "Corrected Format" column are now string values, not datetime values. If you want to use them as datetimes again, you'll have to convert them back (see the sketch after the output below):
df["Corrected Format"] = df["Actual Date"].dt.strftime("%d/%m/%Y %H:%M")
Output:
Hours Actual Date Corrected Format
0 44317.000000 2021-05-01 00:00:00.000000000 01/05/2021 00:00
1 44317.250394 2021-05-01 06:00:34.000041600 01/05/2021 06:00
2 44317.500139 2021-05-01 12:00:12.000009600 01/05/2021 12:00
3 44317.750463 2021-05-01 18:00:40.000003200 01/05/2021 18:00
4 44318.000058 2021-05-02 00:00:04.999968000 02/05/2021 00:00
5 44318.250266 2021-05-02 06:00:23.000025600 02/05/2021 06:00
6 44318.500544 2021-05-02 12:00:46.999958400 02/05/2021 12:00
7 44318.750313 2021-05-02 18:00:27.000000000 02/05/2021 18:00
8 44319.000521 2021-05-03 00:00:44.999971199 03/05/2021 00:00
9 44319.250729 2021-05-03 06:01:03.000028799 03/05/2021 06:01
10 44319.500428 2021-05-03 12:00:37.000022400 03/05/2021 12:00
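If the float-precision noise in the seconds should be removed, round before formatting; and if the formatted strings are needed as datetimes again later, parse them back. A minimal sketch continuing the frame above:

# round away float-precision noise (apply before the strftime step)
df["Actual Date"] = df["Actual Date"].dt.round("min")
# parse the formatted strings back into datetime values when needed
df["Corrected Format"] = pd.to_datetime(df["Corrected Format"], format="%d/%m/%Y %H:%M")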

How to calculate the quantity of business days between two dates using Pandas

I created a pandas df with columns named start_date and current_date. Both columns have a dtype of datetime64[ns]. What's the best way to find the quantity of business days between the current_date and start_date column?
I've tried:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = len(pd.date_range(start=projects_df['start_date'], end=projects_df['current_date'], freq=us_bd))
I get the following error message:
Cannot convert input....start_date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
I'm using Python version 3.10.4.
pd.date_range's parameters need to be datetimes, not Series; that is why your call raises.
For this reason, we can use df.apply to run the calculation row by row.
In addition, pandas has bdate_range, which is just date_range with freq defaulting to business days: exactly what you need.
Using apply and a lambda function, we can create a new Series with the business-day count between each row's start and current date.
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = projects_df.apply(lambda row: len(pd.bdate_range(row['start_date'], row['current_date'])), axis=1)
Using a random sample of 10 date pairs, my output is the following:
start_date current_date days_count
0 2022-01-03 17:08:04 2022-05-20 00:53:46 100
1 2022-04-18 09:43:02 2022-06-10 16:56:16 40
2 2022-09-01 12:02:34 2022-09-25 14:59:29 17
3 2022-04-02 14:24:12 2022-04-24 21:05:55 15
4 2022-01-31 02:15:46 2022-07-02 16:16:02 110
5 2022-08-02 22:05:15 2022-08-17 17:25:10 12
6 2022-03-06 05:30:20 2022-07-04 08:43:00 86
7 2022-01-15 17:01:33 2022-08-09 21:48:41 147
8 2022-06-04 14:47:53 2022-12-12 18:05:58 136
9 2022-02-16 11:52:03 2022-10-18 01:30:58 175
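If holidays must be excluded, as in the question's original attempt with USFederalHolidayCalendar, numpy's np.busday_count is a vectorized alternative to the row-wise apply. A sketch assuming the same frame and column names; note that np.busday_count treats the end date as exclusive, whereas len(pd.bdate_range(...)) includes both endpoints:

import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# holiday list covering the data's span (the 2022 range here is an assumption)
holidays = USFederalHolidayCalendar().holidays(start='2022-01-01', end='2022-12-31')

# vectorized business-day count: begin dates inclusive, end dates exclusive
projects_df['days_count'] = np.busday_count(
    projects_df['start_date'].values.astype('datetime64[D]'),
    projects_df['current_date'].values.astype('datetime64[D]'),
    holidays=holidays.values.astype('datetime64[D]'),
)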

Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works OK for month 0, but month 1 already has a different time and for month 2 onwards has the wrong date.
[Image: output table where the early months have the correct date, but month 2, where I would expect June 1st, actually shows May 31st]
It must be adding incomplete months, but I'm not sure how to fix it.
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]
The problem with adding months as an offset to a date is that not all months are equally long (28-31 days), so you need pd.DateOffset, which handles that ambiguity for you. .astype("timedelta64[M]"), on the other hand, only adds the average month length (30 days 10:29:06).
Example:
import pandas as pd

# a synthetic example since you didn't provide a minimal reproducible example
df = pd.DataFrame({'start_date': 7*['2017-04-01'],
                   'month_offset': range(7)})
# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])
# add the month offset row by row
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
start_date month_offset new_date
0 2017-04-01 0 2017-04-01
1 2017-04-01 1 2017-05-01
2 2017-04-01 2 2017-06-01
3 2017-04-01 3 2017-07-01
4 2017-04-01 4 2017-08-01
5 2017-04-01 5 2017-09-01
6 2017-04-01 6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
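If the dates always fall on the first of the month, as in the example above, a vectorized alternative is to go through monthly periods instead of apply. A sketch under that assumption:

# to_period('M') keeps only the month, discarding any day-of-month
# information, so this only suits dates that sit on the 1st
df['new_date'] = (df['start_date'].dt.to_period('M')
                  + df['month_offset']).dt.to_timestamp()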

monthly resampling pandas with specific start day

I'm creating a pandas DataFrame with random dates and random integers values and I want to resample it by month and compute the average value of integers. This can be done with the following code:
import numpy as np
import pandas as pd

def random_dates(start='2018-01-01', end='2019-01-01', n=300):
    # accept strings or Timestamps; .value requires a Timestamp
    start_u = pd.to_datetime(start).value // 10**9
    end_u = pd.to_datetime(end).value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2018-01-01')
end = pd.to_datetime('2019-01-01')
dates = random_dates(start, end)
ints = np.random.randint(100, size=300)
df = pd.DataFrame({'Month': dates, 'Integers': ints})
print(df.resample('M', on='Month').mean())
The thing is that the resampled months always start from day one, and I want all months to start from day 15. I'm using pandas 1.1.4 and I've tried origin='15/01/2018' and offset='15', but neither works with the 'M' resample rule (they do work with '30D', but that is of no use here). I've also tried '2SM', but it doesn't work either.
So my question is: is there a way of changing the resample rule, or will I have to add an offset to my data?
Assume that the source DataFrame is:
Month Amount
0 2020-05-05 1
1 2020-05-14 1
2 2020-05-15 10
3 2020-05-20 10
4 2020-05-30 10
5 2020-06-15 20
6 2020-06-20 20
To compute your "shifted" resample, first shift the Month column so that the 15th day of each month becomes the 1st:
df.Month = df.Month - pd.Timedelta('14D')
and then resample:
res = df.resample('M', on='Month').mean()
The result is:
Amount
Month
2020-04-30 1
2020-05-31 10
2020-06-30 20
If you want, change dates in the index to month periods:
res.index = res.index.to_period('M')
Then the result will be:
Amount
Month
2020-04 1
2020-05 10
2020-06 20
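If the index should show the actual window start (the 15th) rather than the shifted month end, the labels can be mapped back. A small sketch continuing the example above:

# month-end label of the shifted data -> the 15th of the same month,
# i.e. the true start of each unshifted window
res.index = res.index.to_period('M').to_timestamp() + pd.Timedelta('14D')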
Edit: Not a working solution for OP's request. See short discussion in the comments.
Interesting problem. I suggest resampling with 'SMS', the semi-month start frequency (1st and 15th). Instead of keeping just the mean values, keep the count and sum, and recalculate the weighted mean for each monthly period from its two sub-periods (for example, 15/1 to 15/2 is composed of 15/1-31/1 and 1/2-15/2).
The advantage here is that, unlike with an (improper use of an) offset, we are certain we always cover the 15th of one month through the 14th of the next.
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm
Integers
sum count
Month
2018-01-01 876 16
2018-01-15 864 16
2018-02-01 412 10
2018-02-15 626 12
...
2018-12-01 492 10
2018-12-15 638 16
Rolling sum and rolling count; Find the mean out of them:
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_sm
Integers sum_rolling count_rolling mean
sum count
Month
2018-01-01 876 16 NaN NaN NaN
2018-01-15 864 16 1740.0 32.0 54.375000
2018-02-01 412 10 1276.0 26.0 49.076923
2018-02-15 626 12 1038.0 22.0 47.181818
...
2018-12-01 492 10 1556.0 27.0 57.629630
2018-12-15 638 16 1130.0 26.0 43.461538
Now, just filter the odd indices of df_sm:
df_sm.iloc[1::2]['mean']
Month
2018-01-15 54.375000
2018-02-15 47.181818
2018-03-15 51.000000
2018-04-15 44.897436
2018-05-15 52.450000
2018-06-15 33.722222
2018-07-15 41.277778
2018-08-15 46.391304
2018-09-15 45.631579
2018-10-15 54.107143
2018-11-15 58.058824
2018-12-15 43.461538
Freq: 2SMS-15, Name: mean, dtype: float64
The code:
df_sm = df.resample('SMS', on='Month').aggregate(['sum', 'count'])
df_sm['sum_rolling'] = df_sm['Integers']['sum'].rolling(2).sum()
df_sm['count_rolling'] = df_sm['Integers']['count'].rolling(2).sum()
df_sm['mean'] = df_sm['sum_rolling'] / df_sm['count_rolling']
df_out = df_sm.iloc[1::2]['mean']
Edit: Changed the name of one of the columns to make it clearer.

Count values greater than threshold and assign to appropriate year pandas

I have a dataframe that looks like this:
Date DFW
242 2000-05-01 00:00:00 75.92
243 2000-05-01 12:00:00 75.02
244 2000-05-02 00:00:00 71.96
245 2000-05-02 12:00:00 75.92
246 2000-05-03 00:00:00 71.96
... ... ...
14991 2020-07-09 12:00:00 93.90
14992 2020-07-10 00:00:00 91.00
14993 2020-07-10 12:00:00 93.00
14994 2020-07-11 00:00:00 89.10
14995 2020-07-11 12:00:00 97.00
The df contains the max temperature at a specific location every 12 hours, from May 1 through July 11, for each year from 2000 to 2020. I want to count the number of times the value exceeds 90 and store that count in a column whose rows are the years. Should I use groupby to accomplish this?
Expected output:
Year count
2000 x
2001 y
... ...
2019 z
2020 a
You can do it with groupby:
# extract the years from dates
years = df['Date'].dt.year
# compare `DFW` with `90`
# gt90 will be just True or False
gt90 = df['DFW'].gt(90)
# sum the `True` by years
output = gt90.groupby(years).sum()
# set the years as normal column:
output = output.reset_index()
All that in one line:
df['DFW'].gt(90).groupby(df['Date'].dt.year).sum().reset_index()
One possible approach is to extract the year into a new column (say, 'year') and then:
df[df['DFW'] > 90].groupby('year').count().reset_index()
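Spelled out with the intermediate column, assuming the question's column names (the 'year' column name is just a placeholder):

# create the year column the step above refers to
df['year'] = df['Date'].dt.year
# count the 12-hourly readings above 90 for each year;
# note: years with no readings above 90 will be absent from the result
counts = df[df['DFW'] > 90].groupby('year')['DFW'].count().reset_index(name='count')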
