How to pad with trailing zeroes when datetime formatting in pandas - python

I have a pandas DataFrame that looks like this:
pta ptd tpl_num
4 05:17 05:18 0
6 05:29:30 05:30 1
9 05:42 05:44:30 2
11 05:53 05:54 3
12 06:03 06:05:30 4
I'm trying to format pta and ptd to %H:%M:%S using this:
df['pta'] = pandas.to_datetime(df['pta'], format="%H:%M:%S")
df['ptd'] = pandas.to_datetime(df['ptd'], format="%H:%M:%S")
This gives:
ValueError: time data '05:17' does not match format '%H:%M:%S' (match)
Makes sense, as some of my timestamps don't have :00 in the seconds column. Is there any way to pad these at the end? Or will I need to pad my input data manually/before adding it to the DataFrame? I've seen plenty of answers that pad leading zeroes, but couldn't find one for this.

Some dates do not match the specified format and hence are not correctly parsed. Let pandas parse them for you, and then use dt.strftime to format them as you want:
df['pta'] = pd.to_datetime(df['pta']).dt.strftime("%H:%M:%S")
df['ptd'] = pd.to_datetime(df['ptd']).dt.strftime("%H:%M:%S")
print(df)
pta ptd tpl_num
4 05:17:00 05:18:00 0
6 05:29:30 05:30:00 1
9 05:42:00 05:44:30 2
11 05:53:00 05:54:00 3
12 06:03:00 06:05:30 4

If you only want the padded strings, you can do:
df['pta'].add(':00').str[:8]
Output:
4 05:17:00
6 05:29:30
9 05:42:00
11 05:53:00
12 06:03:00
Name: pta, dtype: object
Also, for time only, you should consider using pd.to_timedelta instead of pd.to_datetime.
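A minimal sketch combining the two suggestions above, assuming the column holds strings of the form HH:MM or HH:MM:SS. The sample values are copied from the question; the variable names are only illustrative. Padding first and then parsing with one explicit format avoids relying on format inference:

```python
import pandas as pd

# Sample values from the question: some entries lack the seconds part.
pta = pd.Series(["05:17", "05:29:30", "05:42", "05:53", "06:03"], name="pta")

# Append ":00" and keep the first 8 characters, so "05:17" becomes
# "05:17:00" while "05:29:30" is left unchanged.
padded = pta.add(":00").str[:8]

# Every value now matches a single explicit format.
as_strings = pd.to_datetime(padded, format="%H:%M:%S").dt.strftime("%H:%M:%S")

# For time-of-day values without a date, a timedelta column is an alternative.
as_timedelta = pd.to_timedelta(padded)

print(as_strings.tolist())
print(as_timedelta.iloc[0])
```

The timedelta version keeps a numeric dtype, so it can be compared and subtracted, whereas the strftime version is a column of plain strings.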

Related

ValueError: time data 'May 11, 2015' does not match format '%m%y%d' (match) [duplicate]

I'm getting a value error saying my data does not match the format when it does. Not sure if this is a bug or I'm missing something here. I'm referring to this documentation for the string format. The weird part is if I write the 'data' Dataframe to a csv and read it in then call the function below it will convert the date so I'm not sure why it doesn't work without writing to a csv.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because they all aren't double digit days? What is the string format value for single digit days? Looks like this could be the cause but I'm not sure why it would error on the '27' though.
End solution (It was unicode & not a string) -
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I delete the hyphens and retype them manually (for the first three dates), the code works:
pd.to_datetime(df1['Date'], errors='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually a different character. Clean your source data and you're good to go.
You have a special character here; it is not a regular hyphen (-):
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
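A short sketch of how the look-alike character could be identified and replaced. It assumes the separator is U+2011 (non-breaking hyphen), which is what the sample dates appear to contain; the two sample strings and the variable names are illustrative only:

```python
import unicodedata

import pandas as pd

# Two sample dates written with U+2011 instead of the ASCII hyphen.
dates = pd.Series(["2\u2011Jul\u20112018", "27\u2011Aug\u20112018"])

# List every non-ASCII character in the first value together with its name.
suspects = {ch: unicodedata.name(ch) for ch in dates.iloc[0] if ord(ch) > 127}
print(suspects)

# Replace the look-alike with a plain hyphen, then parse with the
# format from the question.
cleaned = dates.str.replace("\u2011", "-", regex=False)
parsed = pd.to_datetime(cleaned, format="%d-%b-%Y")
print(parsed)
```

Checking `ord(ch) > 127` is a quick way to spot any character outside plain ASCII, not only this particular hyphen.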

parse multiple date format pandas

I'm stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values in a format like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps,
assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps:
first we parse the regular dates into a new Series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our Series; these correspond to your dates that are missing a day.
We then apply to_datetime again with the other format, and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
then we re-assign to your dataframe.
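The step-by-step approach above can be sketched as a small helper that tries a list of formats in order. The helper name `parse_with_formats` is hypothetical and not part of pandas, and the final `strftime` call is an assumption about how the %Y-%m strings requested in the question could be produced:

```python
import pandas as pd


def parse_with_formats(values, formats):
    """Try each format in turn, filling only entries that are still NaT."""
    result = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        result = result.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return result


# Sample values from the question.
df = pd.DataFrame(
    {"date": ["2001-12-25", "2002-9-27", "2001-2-24", "2001-5-3", "200510", "20078"]}
)

df["date_fixed"] = parse_with_formats(df["date"], ["%Y-%m-%d", "%Y%m"])

# Year-month strings, which is the output format the question asks for.
df["year_month"] = df["date_fixed"].dt.strftime("%Y-%m")
print(df)
```

Because `errors='coerce'` turns every non-matching value into NaT, the order of the formats matters only when a string could match more than one of them.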
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard(sep=r"\s{2,}", header=None, names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only the year and month were supplied.

How to remove specific day timestamps from a big dataframe

I have a big dataframe consisting of 600 days' worth of data. Each day has 100 timestamps. I have a separate list of 30 days whose data I want to remove. How do I remove the data for these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method, but I don't know how to implement it.
df  # main dataframe, which has many columns and rows; the index is a timestamp
df['dates'] = df.index.strftime('%Y-%m-%d')  # date part of the timestamp, stored as a new column
# Instead of the index, I want to use this column for comparing with the bad list.
bad_list  # a list of bad dates
for i in range(0,len(df)):
    for j in range(0,len(bad_list)):
        if str(df['dates'][i])== bad_list[j]:
            df.drop(df[i].index,inplace=True)
You can do the following:
df['dates'] = df.index.strftime('%Y-%m-%d')
# bad_list must hold strings in the same '%Y-%m-%d' format as the new column.
newdf = df[~df['dates'].isin(bad_list)]
# The ~ negates the mask, so only rows whose date is "not in" the list are kept.
# If Jan 1, 2000 is a bad date, it should be in the list as '2000-01-01'.
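As a variation on the answer above, the comparison can be done on datetimes instead of strings by normalizing the index to midnight. This is a sketch on a made-up toy frame; the column name `value` and the dates in `bad_list` are assumptions for illustration:

```python
import pandas as pd

# Toy frame: 4 days with 3 timestamps each (the question has 600 days x 100).
index = pd.date_range("2019-03-18", periods=12, freq="8h")
df = pd.DataFrame({"value": range(12)}, index=index)

bad_list = ["2019-03-19", "2019-03-21"]  # hypothetical bad dates

# Convert the bad dates once, then compare them with the date part of the index.
bad_days = pd.to_datetime(bad_list)
mask = df.index.normalize().isin(bad_days)

# Keep only the rows whose day is not in the bad list.
clean = df[~mask]
print(clean)
```

This avoids the nested loop entirely and does not need the extra `dates` column.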
You can perform simple comparison:
>>> import numpy as np
>>> import pandas as pd
>>> from time import time
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.

Pandas - convert float to proper datetime or time object

I have an observational data set that contains weather information. Each column contains a specific field, and the date and time are in two separate columns. The time column contains hourly times like 0000, 0600 ... up to 2300. What I am trying to do is filter the data set on a certain time frame, for example between 0000 UTC and 0600 UTC. When I read the data file into a pandas DataFrame, the time column is read as float by default. When I try to convert it to a datetime object, it produces a format that I am unable to work with. A code example is given below:
import pandas as pd
import datetime as dt
df = pd.read_excel("test.xlsx")
df.head()
which produces the following result:
tdate itime moonph speed ... qnh windir maxtemp mintemp
0 01-Jan-17 1000.0 NM7 5 ... $1,011.60 60.0 $32.60 $22.80
1 01-Jan-17 1000.0 NM7 2 ... $1,015.40 999.0 $32.60 $22.80
2 01-Jan-17 1030.0 NM7 4 ... $1,015.10 60.0 $32.60 $22.80
3 01-Jan-17 1100.0 NM7 3 ... $1,014.80 999.0 $32.60 $22.80
4 01-Jan-17 1130.0 NM7 5 ... $1,014.60 270.0 $32.60 $22.80
Then I extracted the time column with following line:
df["time"] = df.itime
df["time"]
0 1000.0
1 1000.0
2 1030.0
3 1100.0
4 1130.0
5 1200.0
6 1230.0
7 1300.0
8 1330.0
.
.
3261 2130.0
3262 2130.0
3263 600.0
3264 630.0
3265 730.0
3266 800.0
3267 830.0
3268 1900.0
3269 1930.0
3270 2000.0
Name: time, Length: 3279, dtype: float64
Then I tried to convert the time column to datetime object:
df["time"] = pd.to_datetime(df.itime)
which produced the following result:
df["time"]
0 1970-01-01 00:00:00.000001000
1 1970-01-01 00:00:00.000001000
2 1970-01-01 00:00:00.000001030
3 1970-01-01 00:00:00.000001100
It appears that the data was successfully converted to datetime objects. However, the time value ended up in the fractional seconds, which makes filtering difficult for me.
The final data format I would like to get is either:
1970-01-01 06:00:00
or
06:00
Any help is appreciated.
When you read the Excel file, specify the dtype of the itime column as str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
then you will have a time column of strings looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
As an add-on to Chris's answer: if the conversion fails because the leading zero is missing, apply the following to the dataframe.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is needed when the original values are not zero-padded to four digits, for example 945 instead of 0945.
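Try the following to get the first and second output formats you want, respectively. The float column is zero-padded to a four-digit string first, because passing the floats straight to pd.to_datetime treats them as nanoseconds since the epoch. Note that parsing with '%H%M' fills in 1900-01-01 as the date:
hhmm = df.itime.astype(int).astype(str).str.zfill(4)
df["time"] = pd.to_datetime(hhmm, format='%H%M').dt.strftime('%Y-%m-%d %H:%M:%S')
df["time"] = pd.to_datetime(hhmm, format='%H%M').dt.strftime('%H:%M:%S')
Best!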
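To tie the answers above back to the filtering goal in the question, here is a sketch that turns HHMM floats into both `datetime.time` values and timedeltas, then keeps the rows between 0000 and 0600. The sample values are made up, and the sketch assumes the column has no missing values (`astype(int)` would fail on NaN):

```python
import pandas as pd

# Made-up sample of HHMM values read as floats, as in the question.
df = pd.DataFrame({"itime": [1000.0, 600.0, 630.0, 0.0, 2300.0]})

# Option 1: zero-pad to a four-character string and parse with '%H%M'.
hhmm = df["itime"].astype(int).astype(str).str.zfill(4)
df["time"] = pd.to_datetime(hhmm, format="%H%M").dt.time

# Option 2: split into hours and minutes arithmetically and build a timedelta.
df["offset"] = pd.to_timedelta(df["itime"] // 100, unit="h") + pd.to_timedelta(
    df["itime"] % 100, unit="m"
)

# Keep rows from 00:00 to 06:00, both ends included.
mask = df["offset"].between(pd.Timedelta(hours=0), pd.Timedelta(hours=6))
print(df[mask])
```

The timedelta column is convenient for range filters like this one, while the `datetime.time` column is mostly useful for display.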

Pandas - convert strings to time without date

I've read loads of SO answers but can't find a clear solution.
I have this data in a df called day1 which represents hours:
1 10:53
2 12:17
3 14:46
4 16:36
5 18:39
6 20:31
7 22:28
Name: time, dtype: object>
I want to convert it into a time format. But when I do this:
day1.time = pd.to_datetime(day1.time, format='H%:M%')
The result includes today's date:
1 2015-09-03 10:53:00
2 2015-09-03 12:17:00
3 2015-09-03 14:46:00
4 2015-09-03 16:36:00
5 2015-09-03 18:39:00
6 2015-09-03 20:31:00
7 2015-09-03 22:28:00
Name: time, dtype: datetime64[ns]>
It seems the format argument isn't working - how do I get the time as shown here without the date?
Update
The following formats the time correctly, but somehow the column is still an object type. Why doesn't it convert to datetime64?
day1['time'] = pd.to_datetime(day1['time'], format='%H:%M').dt.time
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: time, dtype: object>
After performing the conversion you can use the datetime accessor dt to access just the hour or time component:
In [51]:
df['hour'] = pd.to_datetime(df['time'], format='%H:%M').dt.hour
df
Out[51]:
time hour
index
1 10:53 10
2 12:17 12
3 14:46 14
4 16:36 16
5 18:39 18
6 20:31 20
7 22:28 22
Also, your format string H%:M% is malformed; it's likely to raise a ValueError: ':' is a bad directive in format 'H%:M%'
Regarding your last comment: the elements are datetime.time objects, not datetime, so the column dtype stays object:
In [53]:
df['time'].iloc[0]
Out[53]:
datetime.time(10, 53)
You can use to_timedelta (here df is the Series of time strings):
pd.to_timedelta(df+':00')
Out[353]:
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: Time, dtype: timedelta64[ns]
I recently also struggled with this problem. My method is close to EdChum's method and the result is the same as YOBEN_S's answer.
Just like EdChum illustrated, using dt.time will give you datetime.time objects (and dt.hour plain integers), which are probably only good for display. I can barely do any comparison or calculation on datetime.time objects. So if you need any further comparison or calculation operations on the result columns, it's better to avoid such data formats.
My method is to simply subtract the date from the to_datetime result:
c = pd.Series(['10:23', '12:17', '14:46'])
pd.to_datetime(c, format='%H:%M') - pd.to_datetime(c, format='%H:%M').dt.normalize()
The result is
0 10:23:00
1 12:17:00
2 14:46:00
dtype: timedelta64[ns]
dt.normalize() basically sets every time component to 00:00:00, so only the date is displayed, while keeping the datetime64 dtype, which makes it possible to do calculations with it.
My answer is by no means better than the other two. I just want to provide a different approach and hope it helps.
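To illustrate the point about calculations, here is a small sketch contrasting the two representations, using the sample Series from the answer above. The variable names are illustrative only:

```python
import pandas as pd

c = pd.Series(["10:23", "12:17", "14:46"])
parsed = pd.to_datetime(c, format="%H:%M")

as_time = parsed.dt.time                     # object column of datetime.time
as_delta = parsed - parsed.dt.normalize()    # timedelta64 column

# Timedeltas support vectorised arithmetic and comparison.
gaps = as_delta.diff()
after_noon = as_delta[as_delta > pd.Timedelta(hours=12)]

# datetime.time objects cannot be subtracted from each other.
try:
    as_time.iloc[1] - as_time.iloc[0]
    subtraction_works = True
except TypeError:
    subtraction_works = False

print(gaps)
print(after_noon)
print(subtraction_works)
```

Here `gaps` holds the time elapsed between consecutive rows, which is the kind of calculation that is awkward with `datetime.time` values.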
