Convert custom object to standard datetime object in Pandas - python

I have a bit of an odd Series full of dates and times which I want to convert to datetime so that I can do some manipulation:
allSubs['Subscribed']
0 12th December, 08:08
1 11th December, 14:57
2 10th December, 21:40
3 7th December, 21:39
4 5th December, 14:51
5 30th November, 15:36
When I call pd.to_datetime(allSubs['Subscribed']) on it, I get the error 'Out of bounds nanosecond timestamp: 1-12-12 08:08:00'. I tried the param errors='coerce', but this just returns a series of NaT. I want to convert the series into a pandas datetime column with format YYYY-MM-DD.
I've looked into using datetime.strptime but couldn't find an efficient way to run this against a series.
Any help much appreciated!

Use:
from dateutil import parser
allSubs['Subscribed'] = allSubs['Subscribed'].apply(parser.parse)
print (allSubs)
Subscribed
0 2018-12-12 08:08:00
1 2018-12-11 14:57:00
2 2018-12-10 21:40:00
3 2018-12-07 21:39:00
4 2018-12-05 14:51:00
5 2018-11-30 15:36:00
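One caveat (not part of the original answer): parser.parse fills any missing fields, including the year, from the current date, so the 2018 values above assume the code is run in 2018. If you need a fixed year, dateutil's default argument can pin it - a small sketch, with 2018 assumed purely for illustration:
from datetime import datetime
from dateutil import parser

# "default" supplies any fields missing from the string - here, the year 2018
allSubs['Subscribed'] = allSubs['Subscribed'].apply(
    lambda s: parser.parse(s, default=datetime(2018, 1, 1))
)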
Or use Series.str.replace with a regex; it is also necessary to specify a year, then use to_datetime with a custom format - http://strftime.org/:
s = allSubs['Subscribed'].str.replace(r'(\d)(st|nd|rd|th)', r'\1 2018', regex=True)  # regex=True needed on newer pandas, where the default is a literal replace
allSubs['Subscribed'] = pd.to_datetime(s, format='%d %Y %B, %H:%M')
print (allSubs)
Subscribed
0 2018-12-12 08:08:00
1 2018-12-11 14:57:00
2 2018-12-10 21:40:00
3 2018-12-07 21:39:00
4 2018-12-05 14:51:00
5 2018-11-30 15:36:00
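Since the question asks for a YYYY-MM-DD result, here is a small follow-up sketch for once Subscribed is a datetime column (the new column names are only for illustration):
# keep datetime dtype but drop the time of day (timestamps at midnight)
allSubs['Subscribed_date'] = allSubs['Subscribed'].dt.normalize()
# or render plain YYYY-MM-DD strings
allSubs['Subscribed_str'] = allSubs['Subscribed'].dt.strftime('%Y-%m-%d')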

Related

converting datetime column to total hours and offsetting from reference time

I have the following datetime object:
import pandas as pd
from datetime import datetime
t0=datetime.strptime("01/01/2011 00:00:00", "%d/%m/%Y %H:%M:%S")
Here, t0 is my reference or start time of the simulation. I wanted to convert it into total hours (but failed) so that I could add them to my Hours column and finally convert that into a datetime column that could start from 2021-01-01.
I have the following Hours column, which gives hours from the start time t0:
My model results in hours:
Hours
0 44317.0
1 44317.250393519
2 44317.500138889
3 44317.750462963
4 44318.00005787
5 44318.250266204
6 44318.500543981
7 44318.7503125
8 44319.000520833
9 44319.250729167
10 44319.500428241
In Excel, if I convert these hours into date format, the first value becomes 2021-05-01, which is my expected output:
My expected output:
Hours
1 5/1/21 0:00
2 5/1/21 6:00
3 5/1/21 12:00
4 5/1/21 18:00
5 5/2/21 0:00
6 5/2/21 6:00
7 5/2/21 12:00
8 5/2/21 18:00
9 5/3/21 0:00
10 5/3/21 0:00
However, in Python, if I convert this Hours column into a datetime column named date using pd.to_datetime(df.Hours), it starts from 1970-01-01.
My python output which I don't want:
Hours
0 1970-01-01 00:00:00.000044317
1 1970-01-01 00:00:00.000044317
2 1970-01-01 00:00:00.000044317
3 1970-01-01 00:00:00.000044317
4 1970-01-01 00:00:00.000044318
5 1970-01-01 00:00:00.000044318
6 1970-01-01 00:00:00.000044318
7 1970-01-01 00:00:00.000044318
8 1970-01-01 00:00:00.000044319
9 1970-01-01 00:00:00.000044319
10 1970-01-01 00:00:00.000044319
Please let me know how to convert it so that it starts from 1st May, 2021.
Solution (from Michael S.'s answer below):
The Hours column is actually not hours but days, and using pd.to_datetime(df.Hours, unit='d', origin='1900-01-01') gives the right results. The software that I am using also uses an Excel-like epoch of '1900-01-01' and mistakenly labels the days as hours.
Here is an update to the answer with the OP's edits and inputs. Excel is weird with dates, so if you have to convert your timestamps (44317 etc.) to Excel's dates, you have to do some odd additions to put the dates in line with Excel's (pandas and Excel have different "start of time" dates; that's why you are seeing the different values, e.g. 1970 vs 2021). Your 44317 etc. numbers are actually days, and you have to add them to 1899-12-30 (Excel's effective day zero for modern dates, since Excel incorrectly treats 1900 as a leap year):
import pandas as pd
from datetime import datetime

hours = [44317.0, 44317.250393519, 44317.500138889, 44317.750462963,
         44318.00005787, 44318.250266204, 44318.500543981, 44318.7503125,
         44319.000520833, 44319.250729167, 44319.500428241]
df = pd.DataFrame({"Hours": hours})
t0 = datetime.strptime("01/01/2011 00:00:00", "%d/%m/%Y %H:%M:%S")  # OP's reference time, not needed below
df["Actual Date"] = pd.TimedeltaIndex(df['Hours'], unit='d') + datetime(1899, 12, 30)
# Alternative: pd.to_datetime(df.Hours, unit='d', origin='1899-12-30')
Output:
Hours Actual Date
0 44317.000000 2021-05-01 00:00:00.000000000
1 44317.250394 2021-05-01 06:00:34.000041600
2 44317.500139 2021-05-01 12:00:12.000009600
3 44317.750463 2021-05-01 18:00:40.000003200
4 44318.000058 2021-05-02 00:00:04.999968000
5 44318.250266 2021-05-02 06:00:23.000025600
6 44318.500544 2021-05-02 12:00:46.999958400
7 44318.750313 2021-05-02 18:00:27.000000000
8 44319.000521 2021-05-03 00:00:44.999971199
9 44319.250729 2021-05-03 06:01:03.000028799
10 44319.500428 2021-05-03 12:00:37.000022400
There are ways to clean up the format, but these are the correct times you wanted.
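For instance, one way to tidy them up while keeping datetime values is to round away the floating-point jitter - a sketch using the "Actual Date" column built above:
# minute precision matches the source data, so rounding loses nothing real
df["Actual Date"] = df["Actual Date"].dt.round("min")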
To match your output exactly, you can do the following; just be aware that the contents of the cells in the column "Corrected Format" are now string values and not datetime values. If you want to use them as datetime values, then you'll have to convert them back again:
df["Corrected Format"] = df["Actual Date"].dt.strftime("%d/%m/%Y %H:%M")
Output
Hours Actual Date Corrected Format
0 44317.000000 2021-05-01 00:00:00.000000000 01/05/2021 00:00
1 44317.250394 2021-05-01 06:00:34.000041600 01/05/2021 06:00
2 44317.500139 2021-05-01 12:00:12.000009600 01/05/2021 12:00
3 44317.750463 2021-05-01 18:00:40.000003200 01/05/2021 18:00
4 44318.000058 2021-05-02 00:00:04.999968000 02/05/2021 00:00
5 44318.250266 2021-05-02 06:00:23.000025600 02/05/2021 06:00
6 44318.500544 2021-05-02 12:00:46.999958400 02/05/2021 12:00
7 44318.750313 2021-05-02 18:00:27.000000000 02/05/2021 18:00
8 44319.000521 2021-05-03 00:00:44.999971199 03/05/2021 00:00
9 44319.250729 2021-05-03 06:01:03.000028799 03/05/2021 06:01
10 44319.500428 2021-05-03 12:00:37.000022400 03/05/2021 12:00

Date Time Format Issues Python

I am currently having issues with date-time formats, particularly converting string input to the correct Python datetime format.
Date/Time Dry_Temp[C] Wet_Temp[C] Solar_Diffuse_Rate[[W/m2]] \
0 01/01 00:10:00 8.45 8.237306 0.0
1 01/01 00:20:00 7.30 6.968360 0.0
2 01/01 00:30:00 6.15 5.710239 0.0
3 01/01 00:40:00 5.00 4.462898 0.0
4 01/01 00:50:00 3.85 3.226244 0.0
These are examples of the timestamps I have in my data. I have tried splitting date and time such that I now have the following columns:
WC_Humidity[%] WC_Htgsetp[C] WC_Clgsetp[C] Date Time
0 55.553640 18 26 1900-01-01 00:10:00
1 54.204342 18 26 1900-01-01 00:20:00
2 51.896272 18 26 1900-01-01 00:30:00
3 49.007770 18 26 1900-01-01 00:40:00
4 45.825810 18 26 1900-01-01 00:50:00
I have managed to get the year into datetime format, but there are still 2 problems to resolve:
the data was not recorded in 1900, so I would like to change the year in the Date,
I get the following error when trying to convert the time into Python datetime format:
pandas/_libs/tslibs/strptime.pyx in pandas._libs.tslibs.strptime.array_strptime()
ValueError: time data '00:00:00' does not match format ' %m/%d %H:%M:%S' (match)
I tried having 24:00:00; however, Python didn't like that either...
Preferences:
I would prefer if they were both in the same cell without having to split this information into two columns.
I would also like to get rid of the seconds data as the data was recorded in 10 min intervals so there is no need for seconds in my case.
Any help would be greatly appreciated.
the data was not recorded in 1900, so I would like to change the year in the Date,
The datetime.datetime.replace method of a datetime.datetime instance is used for this task; consider the following example:
import pandas as pd
df = pd.DataFrame({"when":pd.to_datetime(["1900-01-01","1900-02-02","1900-03-03"])})
df["when"] = df["when"].apply(lambda x:x.replace(year=2000))
print(df)
output
when
0 2000-01-01
1 2000-02-02
2 2000-03-03
Note that it can also be used without pandas, for example:
import datetime
d = datetime.datetime.strptime("","") # use all default values which result in midnight of Jan 1 of year 1900
print(d) # 1900-01-01 00:00:00
d = d.replace(year=2000)
print(d) # 2000-01-01 00:00:00
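Applied to the question's Date/Time strings, the same idea could look like the sketch below; the format string and the year 2021 are assumptions, since the real recording year isn't stated:
import pandas as pd

s = pd.Series(["01/01 00:10:00", "01/01 00:20:00", "01/01 00:30:00"])
# the format has no year, so strptime-style parsing fills in 1900 by default
dt = pd.to_datetime(s, format="%m/%d %H:%M:%S")
# swap in the real year, then drop the always-zero seconds from the display
dt = pd.to_datetime(dt.apply(lambda x: x.replace(year=2021)))
print(dt.dt.strftime("%m/%d/%Y %H:%M"))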

How to define a 4-4-5 week period in Pandas

My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4 weeks long, except every 3rd month, which is 5 weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
import numpy as np
import pandas as pd

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
    index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4 weeks starting on Sunday results in the following. The first three month start dates are correct, but I need every third month to be 5 weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly 52-53 week offsets (aka the 4-4-5 calendar).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
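As a minimal illustration of the native offset (a hedged sketch; output dates are not reproduced here), adding FY5253Quarter to a timestamp rolls it forward to the next Sunday-anchored 52/53-week quarter boundary:
import pandas as pd

q = pd.tseries.offsets.FY5253Quarter(weekday=6)   # Sunday-anchored fiscal quarters
d = pd.Timestamp("2020-03-29")
for _ in range(4):
    d = d + q          # step forward one fiscal quarter at a time
    print(d)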
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So you do indeed need some more work to get down to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while start < end:
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a regular frequency period, so it allows moving from date to date following the offsets in the given array:
offset_445 = [
    pd.tseries.offsets.FY5253Quarter(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
    4 * pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, it's back to aggregation logic to get the data into the right row buckets. Assuming that you want the mean for each 4- or 5-week period starting at the dates in the df_index_445 you have generated, it could look like this:
# map each day to the index of the 4-4-5 period that contains it
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# drop any rows that fall before the first period start
res = res[res.index >= 0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
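As a quick sanity check (not part of the original answer), the gaps between consecutive entries of df_index_445 should come out as 28, 28, 35 days, repeating - i.e. the 4-4-5 pattern:
# differences between consecutive period starts, in days (the first entry is NaN)
print(df_index_445.diff().dt.days.tolist())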

ValueError: time data 'May 11, 2015' does not match format '%m%y%d' (match) [duplicate]

I'm getting a ValueError saying my data does not match the format when it does. Not sure if this is a bug or I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a CSV, read it back in, and then call the function below, it will convert the date, so I'm not sure why it doesn't work without writing to a CSV.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors:
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because they aren't all double-digit days? What is the string format code for single-digit days? It looks like this could be the cause, but I'm not sure why it would error on the '27' though.
End solution (it was Unicode, not a plain string):
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and replace them manually (for the first three dates) then the code works:
pd.to_datetime(df1['Date'], errors='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually something else; just clean your source data and you're good to go.
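If you would rather fix the data in place than retype it, the offending character (it appears to be U+2011, a non-breaking hyphen) can be normalised before parsing - a sketch along those lines, with "data" being the original DataFrame:
import pandas as pd

# replace the non-breaking hyphen with a plain ASCII hyphen, then parse
data['Date'] = data['Date'].str.replace('\u2011', '-', regex=False)
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
Single-digit days such as 2‑Jul‑2018 are not the problem; %d accepts them without a leading zero, as the output in the next answer shows.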
You've got a special mark here; it is not '-':
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]

Pandas - convert float to proper datetime or time object

I have an observational data set which contains weather information. Each column contains a specific field, and date and time are in two separate columns. The time column contains hourly times like 0000, 0600, and so on up to 2300. What I am trying to do is filter the data set based on a certain time frame, for example between 0000 UTC and 0600 UTC. When I read the data file into a pandas DataFrame, the time column is read as float by default. When I try to convert it into a datetime object, it produces a format which I am unable to work with. A code example is given below:
import pandas as pd
import datetime as dt
df = pd.read_excel("test.xlsx")
df.head()
which produces the following result:
tdate itime moonph speed ... qnh windir maxtemp mintemp
0 01-Jan-17 1000.0 NM7 5 ... $1,011.60 60.0 $32.60 $22.80
1 01-Jan-17 1000.0 NM7 2 ... $1,015.40 999.0 $32.60 $22.80
2 01-Jan-17 1030.0 NM7 4 ... $1,015.10 60.0 $32.60 $22.80
3 01-Jan-17 1100.0 NM7 3 ... $1,014.80 999.0 $32.60 $22.80
4 01-Jan-17 1130.0 NM7 5 ... $1,014.60 270.0 $32.60 $22.80
Then I extracted the time column with the following line:
df["time"] = df.itime
df["time"]
0 1000.0
1 1000.0
2 1030.0
3 1100.0
4 1130.0
5 1200.0
6 1230.0
7 1300.0
8 1330.0
.
.
3261 2130.0
3262 2130.0
3263 600.0
3264 630.0
3265 730.0
3266 800.0
3267 830.0
3268 1900.0
3269 1930.0
3270 2000.0
Name: time, Length: 3279, dtype: float64
Then I tried to convert the time column to a datetime object:
df["time"] = pd.to_datetime(df.itime)
which produced the following result:
df["time"]
0 1970-01-01 00:00:00.000001000
1 1970-01-01 00:00:00.000001000
2 1970-01-01 00:00:00.000001030
3 1970-01-01 00:00:00.000001100
It appears to have successfully converted the data to datetime objects. However, it interpreted the hour values as nanoseconds since the epoch, which makes filtering difficult for me.
The final data format I would like to get is either:
1970-01-01 06:00:00
or
06:00
Any help is appreciated.
When you read the Excel file, specify the dtype of the itime column as str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
Then you will have a time column of strings, looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
Just an addition to Chris's answer: if you are unable to convert because there is no zero at the front, apply the following to the DataFrame.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is basically because the original format does not always have a leading zero (4 digits), e.g. 945 instead of 0945.
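Putting the two suggestions together: if itime has already been read in as a float, you might convert it like this (a sketch with made-up sample values):
import pandas as pd

df = pd.DataFrame({'itime': [945.0, 1000.0, 2130.0, 600.0]})
# float -> int -> zero-padded 4-character string -> time of day
df['Time'] = pd.to_datetime(
    df['itime'].astype(int).astype(str).str.zfill(4), format='%H%M'
).dt.time
print(df)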
Try
df["time"] = pd.to_datetime(df.itime).dt.strftime('%Y-%m-%d %H:%M:%S')
df["time"] = pd.to_datetime(df.itime).dt.strftime('%H:%M:%S')
for the first and second output formats you want, respectively.
Best!
