converting datetime column to total hours and offsetting from reference time - python

I have the following datetime object:
import pandas as pd
from datetime import datetime
t0=datetime.strptime("01/01/2011 00:00:00", "%d/%m/%Y %H:%M:%S")
Here, t0 is my reference (start) time of the simulation. I wanted to convert it into total hours (but failed) so that I could add it to my Hours column and finally convert the result into a datetime column that starts from 2021-01-01.
I have the following Hours column, which counts hours from the start time t0:
My model results in hours:
Hours
0 44317.0
1 44317.250393519
2 44317.500138889
3 44317.750462963
4 44318.00005787
5 44318.250266204
6 44318.500543981
7 44318.7503125
8 44319.000520833
9 44319.250729167
10 44319.500428241
In Excel, if I convert these hours into date format, the first value becomes 2021-05-01, which is my expected output:
My expected output:
Hours
1 5/1/21 0:00
2 5/1/21 6:00
3 5/1/21 12:00
4 5/1/21 18:00
5 5/2/21 0:00
6 5/2/21 6:00
7 5/2/21 12:00
8 5/2/21 18:00
9 5/3/21 0:00
10 5/3/21 0:00
However, in Python, if I convert this Hours column into a datetime column using pd.to_datetime(df.Hours), it starts from 1970-01-01.
My python output which I don't want:
Hours
0 1970-01-01 00:00:00.000044317
1 1970-01-01 00:00:00.000044317
2 1970-01-01 00:00:00.000044317
3 1970-01-01 00:00:00.000044317
4 1970-01-01 00:00:00.000044318
5 1970-01-01 00:00:00.000044318
6 1970-01-01 00:00:00.000044318
7 1970-01-01 00:00:00.000044318
8 1970-01-01 00:00:00.000044319
9 1970-01-01 00:00:00.000044319
10 1970-01-01 00:00:00.000044319
Please let me know how to convert it so that it starts from 1st May, 2021.
Solution: From Michael S.'s answer below:
The Hours column is actually not hours but days, and using pd.to_datetime(df.Hours, unit='D', origin='1899-12-30') gives the right results. The software that I am using also uses an Excel-like epoch (nominally 1900-01-01, which works out to an origin of 1899-12-30 for these day numbers) and mistakenly labels the days as hours.

Here is an update to the answer with OP's edits and inputs. Excel is weird with dates, so to convert your numbers (44317 etc.) to the dates Excel shows, you have to apply an odd-looking offset. Pandas and Excel count from different "start of time" dates, which is why you are seeing 1970 instead of 2021. Your 44317 etc. numbers are actually days, and you have to add them to 1899-12-30:
import pandas as pd
from datetime import datetime
hours = [44317.0, 44317.250393519, 44317.500138889, 44317.750462963,
         44318.00005787, 44318.250266204, 44318.500543981, 44318.7503125,
         44319.000520833, 44319.250729167, 44319.500428241]
df = pd.DataFrame({"Hours": hours})
t0 = datetime.strptime("01/01/2011 00:00:00", "%d/%m/%Y %H:%M:%S")
# The values are day counts, so turn them into timedeltas and add the origin date
df["Actual Date"] = pd.to_timedelta(df["Hours"], unit="D") + datetime(1899, 12, 30)
# Alternative: pd.to_datetime(df["Hours"], unit="D", origin="1899-12-30")
Output:
Hours Actual Date
0 44317.000000 2021-05-01 00:00:00.000000000
1 44317.250394 2021-05-01 06:00:34.000041600
2 44317.500139 2021-05-01 12:00:12.000009600
3 44317.750463 2021-05-01 18:00:40.000003200
4 44318.000058 2021-05-02 00:00:04.999968000
5 44318.250266 2021-05-02 06:00:23.000025600
6 44318.500544 2021-05-02 12:00:46.999958400
7 44318.750313 2021-05-02 18:00:27.000000000
8 44319.000521 2021-05-03 00:00:44.999971199
9 44319.250729 2021-05-03 06:01:03.000028799
10 44319.500428 2021-05-03 12:00:37.000022400
There are ways to clean up the format, but this is the correct time as you wanted.
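As one possible clean-up, here is a minimal sketch that rounds the converted values to whole seconds, which drops the floating point noise in the nanoseconds. The names serial_days, actual and cleaned are only for illustration, and the three values are taken from the Hours column above:

```python
import pandas as pd

# A few of the serial day numbers from the question
serial_days = pd.Series([44317.0, 44317.250393519, 44318.7503125])

# Interpret the numbers as days counted from 1899-12-30
actual = pd.to_datetime(serial_days, unit="D", origin="1899-12-30")

# Round to the nearest second to remove the sub-second noise
cleaned = actual.dt.round("s")
print(cleaned)
```

The result is still a datetime column, so it can be used for further date arithmetic.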
To match your output exactly, you can do this, just be aware that the contents of the cells in the column "Corrected Format" are now string values and not datetime values. If you want to use them as datetime values then you'll have to convert them back again:
df["Corrected Format"] = df["Actual Date"].dt.strftime("%d/%m/%Y %H:%M")
Output
Hours Actual Date Corrected Format
0 44317.000000 2021-05-01 00:00:00.000000000 01/05/2021 00:00
1 44317.250394 2021-05-01 06:00:34.000041600 01/05/2021 06:00
2 44317.500139 2021-05-01 12:00:12.000009600 01/05/2021 12:00
3 44317.750463 2021-05-01 18:00:40.000003200 01/05/2021 18:00
4 44318.000058 2021-05-02 00:00:04.999968000 02/05/2021 00:00
5 44318.250266 2021-05-02 06:00:23.000025600 02/05/2021 06:00
6 44318.500544 2021-05-02 12:00:46.999958400 02/05/2021 12:00
7 44318.750313 2021-05-02 18:00:27.000000000 02/05/2021 18:00
8 44319.000521 2021-05-03 00:00:44.999971199 03/05/2021 00:00
9 44319.250729 2021-05-03 06:01:03.000028799 03/05/2021 06:01
10 44319.500428 2021-05-03 12:00:37.000022400 03/05/2021 12:00
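If you later need datetime values again, the strings can be parsed back with the same format string. A minimal sketch, where formatted and restored are illustrative names and the sample strings are copied from the "Corrected Format" column:

```python
import pandas as pd

# Strings in the same day/month/year layout as the "Corrected Format" column
formatted = pd.Series(["01/05/2021 00:00", "01/05/2021 06:00", "03/05/2021 06:01"])

# Passing the format explicitly avoids any day/month ambiguity
restored = pd.to_datetime(formatted, format="%d/%m/%Y %H:%M")
print(restored)
```

Note that the seconds were dropped by strftime, so they are not recovered by this round trip.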

Related

Passing dataframe column containing date and time to scatterplot is generating error [duplicate]

I want a scatter plot duration(mins) versus start time like this (which is a time of day, irrespective of what date it was on):
I have a CSV file commute.csv which looks like this:
date, prediction, start, stop, duration, duration(mins), Day of week
14/08/2015, , 08:02:00, 08:22:00, 00:20:00, 20, Fri
25/08/2015, , 18:16:00, 18:27:00, 00:11:00, 11, Tue
26/08/2015, , 08:26:00, 08:46:00, 00:20:00, 20, Wed
26/08/2015, , 18:28:00, 18:46:00, 00:18:00, 18, Wed
The full CSV file is here.
I can import the CSV file like so:
import pandas as pd
times = pd.read_csv('commute.csv', parse_dates=[[0, 2], [0, 3]], dayfirst=True)
times.head()
Out:
date_start date_stop prediction duration duration(mins) Day of week
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00 20 Fri
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00 11 Tue
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00 20 Wed
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00 18 Wed
4 2015-08-28 08:37:00 2015-08-28 08:52:00 NaN 00:15:00 15 Fri
I am now struggling to plot duration(mins) versus start time (without the date). Please help!
@jezrael has been a great help. One of the comments on issue 8113 proposes using a variant of df.plot(x=x, y=y, style="."). I tried it:
times.plot(x='start', y='duration(mins)', style='.')
However, it doesn't show the same as my intended plot: the output is incorrect because the X axis has been stretched so that each data point is the same distance apart in X:
Is there no way to plot against time?
I think the problem is using time values in a scatter plot (see issue 8113).
But you can use hour:
df['hours'] = df.date_start.dt.hour
print(df)
date_start date_stop prediction duration \
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00
duration(mins) Dayofweek hours
0 20 Fri 8
1 11 Tue 18
2 20 Wed 8
3 18 Wed 18
df.plot.scatter(x='hours', y='duration(mins)')
Another solution with counting time in minutes:
df['time'] = df.date_start.dt.hour * 60 + df.date_start.dt.minute
print(df)
date_start date_stop prediction duration \
0 2015-08-14 08:02:00 2015-08-14 08:22:00 NaN 00:20:00
1 2015-08-25 18:16:00 2015-08-25 18:27:00 NaN 00:11:00
2 2015-08-26 08:26:00 2015-08-26 08:46:00 NaN 00:20:00
3 2015-08-26 18:28:00 2015-08-26 18:46:00 NaN 00:18:00
duration(mins) Dayofweek time
0 20 Fri 482
1 11 Tue 1096
2 20 Wed 506
3 18 Wed 1108
df.plot.scatter(x='time', y='duration(mins)')
To follow up, as this question is close to the top of the search results and it's difficult to fit the necessary answer into a comment:
To set the proper time tick labels along the horizontal axis for start time granularity of minutes, you need to set the frequency of the tick labels then convert to datetime.
This code sample has the horizontal axis datetime as the index of the DataFrame, although of course that could equally be a column rather than an index; notice that when it is a DatetimeIndex you access the minute & hour directly rather than through the dt attribute of a datetime column.
This code interprets the tick values as UTC datetimes using datetime.utcfromtimestamp(); see https://stackoverflow.com/a/44572082/437948 for a subtly different approach.
You could add handling of second granularity according to a similar theme.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
# One week of dummy values at 10-minute intervals
df = pd.DataFrame({'value': np.random.randint(0, 11, 6 * 24 * 7)},
                  index=pd.date_range(start='2018-10-03', freq='600s',
                                      periods=6 * 24 * 7))
# Minutes since midnight
df['time'] = 60 * df.index.hour + df.index.minute
f, a = plt.subplots(figsize=(20, 10))
df.plot.scatter(x='time', y='value', style='.', ax=a)
# One tick per hour, labelled as HH:MM
plt.xticks(np.arange(0, 25 * 60, 60))
a.set_xticklabels([datetime.utcfromtimestamp(ts * 60).strftime('%H:%M')
                   for ts in a.get_xticks()])
In the end, I wrote a function to turn hours, minutes and seconds into a floating point number of hours.
def to_hours(dt):
    """Return floating point number of hours through the day in `datetime` dt."""
    return dt.hour + dt.minute / 60 + dt.second / 3600
# Unit test the to_hours() function
import datetime
dt = datetime.datetime(2010, 4, 23) # Dummy date for testing
assert to_hours(dt) == 0
assert to_hours(dt.replace(hour=1)) == 1
assert to_hours(dt.replace(hour=2, minute=30)) == 2.5
assert to_hours(dt.replace(minute=15)) == 0.25
assert to_hours(dt.replace(second=30)) == 30 / 3600
Then create a column of the floating point number of hours:
# Convert start and stop times to hours
commutes['start_hour'] = commutes['start_date'].map(to_hours)
The full example is in my Jupyter notebook.
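For larger frames, the same calculation can probably be done without map() by using the dt accessor on the whole column. This is only a sketch: start and start_hour are illustrative names, and the sample timestamps are the first three start times from the CSV in the question.

```python
import pandas as pd

# First three commute start times from the question's CSV
start = pd.Series(pd.to_datetime([
    "2015-08-14 08:02:00",
    "2015-08-25 18:16:00",
    "2015-08-26 08:26:00",
]))

# Vectorized equivalent of to_hours(): fractional hours since midnight
start_hour = start.dt.hour + start.dt.minute / 60 + start.dt.second / 3600
print(start_hour)
```

The resulting float column can be passed as the x values of a scatter plot in the same way as the hours and time columns in the other answers.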

Date Time Format Issues Python

I am currently having issues with date-time format, particularly converting string input to the correct python datetime format
Date/Time Dry_Temp[C] Wet_Temp[C] Solar_Diffuse_Rate[[W/m2]] \
0 01/01 00:10:00 8.45 8.237306 0.0
1 01/01 00:20:00 7.30 6.968360 0.0
2 01/01 00:30:00 6.15 5.710239 0.0
3 01/01 00:40:00 5.00 4.462898 0.0
4 01/01 00:50:00 3.85 3.226244 0.0
These are examples of the timestamps I currently have in my data. I have tried splitting date and time, such that I now have the following columns:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time
0 55.553640 18 26 1900-01-01 00:10:00
1 54.204342 18 26 1900-01-01 00:20:00
2 51.896272 18 26 1900-01-01 00:30:00
3 49.007770 18 26 1900-01-01 00:40:00
4 45.825810 18 26 1900-01-01 00:50:00
I have managed to get the year into datetime format, but there are still 2 problems to resolve:
the data was not recorded in 1900, so I would like to change the year in the Date,
I get the following error when trying to convert the time into Python datetime format:
pandas/_libs/tslibs/strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()
ValueError: time data '00:00:00' does not match format ' %m/%d %H:%M:%S' (match)
I tried having 24:00:00, however, python didn't like that either...
preferences:
I would prefer if they were both in the same cell without having to split this information into two columns.
I would also like to get rid of the seconds data as the data was recorded in 10 min intervals so there is no need for seconds in my case.
Any help would be greatly appreciated.
the data was not recorded in 1900, so I would like to change the year in the Date,
The replace method of a datetime.datetime instance can be used for this task. Consider the following example:
import pandas as pd
df = pd.DataFrame({"when":pd.to_datetime(["1900-01-01","1900-02-02","1900-03-03"])})
df["when"] = df["when"].apply(lambda x:x.replace(year=2000))
print(df)
output
when
0 2000-01-01
1 2000-02-02
2 2000-03-03
Note that it can also be used without pandas, for example:
import datetime
d = datetime.datetime.strptime("","") # use all default values which result in midnight of Jan 1 of year 1900
print(d) # 1900-01-01 00:00:00
d = d.replace(year=2000)
print(d) # 2000-01-01 00:00:00
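For the other points in the question (keeping date and time in one cell, the 24:00:00 values, and dropping the seconds), something along these lines might work. It is only a sketch under several assumptions: the year 2000 is a placeholder like in the example above, the strings are month/day as the format in the error message suggests, and 24:00:00 is taken to mean midnight at the end of that day. The names raw, YEAR, stamps and labels are made up for illustration.

```python
import pandas as pd

# Sample strings in the layout of the Date/Time column (month/day, then time)
raw = pd.Series([" 01/01  00:10:00", " 01/01  24:00:00", " 12/31  24:00:00"])
YEAR = 2000  # assumption: replace with the year the data was recorded in

# Split on whitespace: column 0 is the date part, column 1 is the time part
parts = raw.str.split(expand=True)

# "24:00:00" is not a valid time, so parse it as 00:00:00 and add a day afterwards
is_end_of_day = parts[1] == "24:00:00"
time_fixed = parts[1].where(~is_end_of_day, "00:00:00")

stamps = pd.to_datetime(str(YEAR) + "/" + parts[0] + " " + time_fixed,
                        format="%Y/%m/%d %H:%M:%S")
stamps = stamps + pd.to_timedelta(is_end_of_day.astype(int), unit="D")

# Text labels without seconds, if only minute resolution is wanted for display
labels = stamps.dt.strftime("%Y-%m-%d %H:%M")
print(labels)
```

The stamps column stays a real datetime column in a single cell per row; labels is just a string version for display.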

How do I create a new column with a set timeframe using Pandas datetime64

I’m trying to look at some sales data for a small store. I have a timestamp of when the settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This is giving me data correct for some days and incorrect for others, as anything after midnight should be for the day before. I couldn’t find the correct pandas documentation for what I’m looking for.
Is there an if else solution to create a new column, loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight, but before 3pm: set the day before ; else set the day). Every time I write something it either runs forever, or it crashes jupyter.
Data:
What I did is create another series which says when a day should be offset back by one day, and multiply it by a pd.Timedelta object, such that 0 turns into "0 days" and 1 turns into "1 day". Subtracting the two series gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this keeps only the date component of every datetime (time reset to midnight)
dates = timestamps.dt.normalize()
# this compares the hours with 15, and returns True if the hour is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='D') * flag_is_day_before.astype(int)
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05

Difference Between Dates - Integer results [duplicate]

I need to calculate the difference between two columns of type datetime, and the result must be in days (integer format). However, what I am getting is the result in day / month / year hour and minute.
id date_1 date_2 date_3 date_result_2-1 date_result_3-1
0 C_ID_92a2005557 2017-06-01 2017-06-27 14:18:08 2018-04-29 11:23:05 26 days 14:18:08 332 days 11:23:05
1 C_ID_3d0044924f 2017-01-01 2017-01-06 16:29:42 2018-03-30 06:48:26 5 days 16:29:42 453 days 06:48:26
2 C_ID_d639edf6cd 2016-08-01 2017-01-11 08:21:22 2018-04-28 17:43:11 163 days 08:21:22 635 days 17:43:11
3 C_ID_186d6a6901 2017-09-01 2017-09-26 16:22:21 2018-04-18 11:00:11 25 days 16:22:21 229 days 11:00:11
4 C_ID_cdbd2c0db2 2017-11-01 2017-11-12 00:00:00 2018-04-28 18:50:25 11 days 00:00:00 178 days 18:50:25
The last two columns are the result that I obtained with a simple subtraction between two columns. I would like these columns to be in integer format, containing only the number of days.
I tried to convert with astype (int) but I got a result that I could not understand.
Any suggestion? Thank you very much in advance.
If you need only the days, try this:
import pandas as pd
df = pd.DataFrame(data={"date": ['2000-05-07', '1965-01-30', 'NaT'],
                        "date_2": ["2019-01-19 12:26:00", "2019-03-21 02:23:12", "2018-11-02 18:30:10"]})
# normalize() drops the time of day but keeps the datetime dtype
df['date'] = pd.to_datetime(df['date']).dt.normalize()
df['date_2'] = pd.to_datetime(df['date_2']).dt.normalize()
# rows with a missing date give NaN here
df['days'] = (df['date'] - df['date_2']).dt.days
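If the time of day should stay in the columns, the whole days can also be taken from the difference directly. A small sketch using the first two rows of the question's data (date_1, date_2, delta, whole_days and also_days are illustrative names):

```python
import pandas as pd

# First two rows of date_1 and date_2 from the question
date_1 = pd.to_datetime(pd.Series(["2017-06-01", "2017-01-01"]))
date_2 = pd.to_datetime(pd.Series(["2017-06-27 14:18:08", "2017-01-06 16:29:42"]))

delta = date_2 - date_1  # e.g. 26 days 14:18:08

# Either of these keeps only the whole days as integers
whole_days = delta.dt.days
also_days = delta // pd.Timedelta(days=1)
print(whole_days.tolist())
```

Both variants cut off the partial day rather than rounding it.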

cut time spells into calendar months in pandas

I have data on spells (hospital stays), each with a start and end date, but I want to count the number of days spent in hospital for calendar months. Of course, this number can be zero for months not appearing in a spell. But I cannot just attribute the length of each spell to the starting month, as longer spells run over to the following month (or more).
Basically, it would suffice for me if I could cut spells at turn-of-month datetimes, getting from the data in the first example to the data in the second:
id start end
1 2011-01-01 10:00:00 2011-01-08 16:03:00
2 2011-01-28 03:45:00 2011-02-04 15:22:00
3 2011-03-02 11:04:00 2011-03-05 05:24:00
id start end month stay
1 2011-01-01 10:00:00 2011-01-08 16:03:00 2011-01 7
2 2011-01-28 03:45:00 2011-01-31 23:59:59 2011-01 4
2 2011-02-01 00:00:00 2011-02-04 15:22:00 2011-02 4
3 2011-03-02 11:04:00 2011-03-05 05:24:00 2011-03 3
I read up on the Time Series / Date functionality of pandas, but I do not see a straightforward solution to this. How can one accomplish the slicing?
It's simpler than you think: just subtract the dates. The result is a time span. See Add column with number of days between dates in DataFrame pandas
You even get to do this for the entire frame at once:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.subtract.html
Update, now that I understand the problem better.
Add a new column: take the spell's end date; if the start date is in a different month, then set this new date's day to 01 and the time to 00:00.
This is the cut DateTime you can use to compute the portion of the stay attributable to each month. cut - start is the first month; end - cut is the second.
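The cutting idea can be sketched for spells of any length by cutting at every month start that falls inside a spell. This is only a rough sketch, not a tested solution for real hospital data: the function name split_at_month_starts and the frame names are made up, each piece ends at the next month's 00:00 rather than at 23:59:59, and stay is left as a timedelta instead of the day counts shown in the question.

```python
import pandas as pd

# The three spells from the question
spells = pd.DataFrame({
    "id": [1, 2, 3],
    "start": pd.to_datetime(["2011-01-01 10:00", "2011-01-28 03:45", "2011-03-02 11:04"]),
    "end": pd.to_datetime(["2011-01-08 16:03", "2011-02-04 15:22", "2011-03-05 05:24"]),
})

def split_at_month_starts(row):
    """Cut one spell into pieces that each lie within a single calendar month."""
    # First month start strictly after the spell's start
    first_cut = row["start"].normalize() + pd.offsets.MonthBegin(1)
    # All month starts up to the end of the spell (empty if the spell stays in one month)
    cuts = pd.date_range(first_cut, row["end"], freq="MS")
    cuts = cuts[cuts < row["end"]]
    edges = [row["start"], *cuts, row["end"]]
    return [
        {"id": row["id"], "start": a, "end": b, "month": a.strftime("%Y-%m")}
        for a, b in zip(edges[:-1], edges[1:])
    ]

pieces = pd.DataFrame(
    [piece for _, row in spells.iterrows() for piece in split_at_month_starts(row)]
)
pieces["stay"] = pieces["end"] - pieces["start"]
print(pieces)
```

Summing stay per month, for example with pieces.groupby("month")["stay"].sum(), then gives the time spent in hospital for each calendar month that appears in a spell.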
