Pandas - convert strings to time without date - python

I've read loads of SO answers but can't find a clear solution.
I have this data in a df called day1 which represents hours:
1 10:53
2 12:17
3 14:46
4 16:36
5 18:39
6 20:31
7 22:28
Name: time, dtype: object
I want to convert it into a time format. But when I do this:
day1.time = pd.to_datetime(day1.time, format='H%:M%')
The result includes today's date:
1 2015-09-03 10:53:00
2 2015-09-03 12:17:00
3 2015-09-03 14:46:00
4 2015-09-03 16:36:00
5 2015-09-03 18:39:00
6 2015-09-03 20:31:00
7 2015-09-03 22:28:00
Name: time, dtype: datetime64[ns]
It seems the format argument isn't working - how do I get the time as shown here without the date?
Update
The following formats the time correctly, but somehow the column is still an object type. Why doesn't it convert to datetime64?
day1['time'] = pd.to_datetime(day1['time'], format='%H:%M').dt.time
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: time, dtype: object

After performing the conversion you can use the datetime accessor dt to access just the hour or time component:
In [51]:
df['hour'] = pd.to_datetime(df['time'], format='%H:%M').dt.hour
df
Out[51]:
time hour
index
1 10:53 10
2 12:17 12
3 14:46 14
4 16:36 16
5 18:39 18
6 20:31 20
7 22:28 22
Also, your format string H%:M% is malformed; it's likely to raise ValueError: ':' is a bad directive in format 'H%:M%'. The percent signs should come before the letters, i.e. '%H:%M'.
Regarding your last comment, the elements have type datetime.time rather than datetime, which is why the column dtype stays object:
In [53]:
df['time'].iloc[0]
Out[53]:
datetime.time(10, 53)

You can use to_timedelta after appending ':00' so the strings include seconds:
pd.to_timedelta(df+':00')
Out[353]:
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: Time, dtype: timedelta64[ns]
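
For example, applied to the day1 frame from the question, a minimal sketch (reconstructing the sample data shown above) could look like this:
import pandas as pd

# rebuild the example column from the question
day1 = pd.DataFrame({'time': ['10:53', '12:17', '14:46', '16:36',
                              '18:39', '20:31', '22:28']})

# append ':00' for the seconds, then convert to timedelta64[ns]
day1['time'] = pd.to_timedelta(day1['time'] + ':00')

print(day1['time'].dtype)   # timedelta64[ns]
print(day1['time'].diff())  # timedeltas support arithmetic, e.g. gaps between rows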

I also struggled with this problem recently. My method is close to EdChum's, and the result is the same as YOBEN_S's answer.
As EdChum illustrated, using dt.hour or dt.time gives you datetime.time objects, which are really only good for display. You can hardly do any comparison or calculation on them, so if you need further comparisons or calculations on the result columns, it's better to avoid that format.
My method simply subtracts the date from the to_datetime result:
c = pd.Series(['10:23', '12:17', '14:46'])
pd.to_datetime(c, format='%H:%M') - pd.to_datetime(c, format='%H:%M').dt.normalize()
The result is
0 10:23:00
1 12:17:00
2 14:46:00
dtype: timedelta64[ns]
dt.normalize() basically sets the time component to 00:00:00, keeping only the date while staying in the datetime64 format, so the subtraction leaves a timedelta64 column that you can actually do calculations with.
My answer is by no means better than the other two. I just want to provide a different approach and hope it helps.
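
As a small follow-up sketch of why the timedelta64 result is handier than datetime.time objects (reusing the c Series from above; the 12-hour cutoff is just an arbitrary example):
import pandas as pd

c = pd.Series(['10:23', '12:17', '14:46'])
t = pd.to_datetime(c, format='%H:%M') - pd.to_datetime(c, format='%H:%M').dt.normalize()

# comparisons and arithmetic work directly on timedelta64[ns]
print(t > pd.Timedelta(hours=12))  # boolean mask for times after noon
print(t.diff())                    # gaps between consecutive times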

Related

ValueError: time data 'May 11, 2015' does not match format '%m%y%d' (match) [duplicate]

I'm getting a ValueError saying my data does not match the format when it does. Not sure if this is a bug or if I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a CSV, read it back in, and then call the function below, it converts the date fine, so I'm not sure why it doesn't work without the round trip through a CSV.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because the days aren't all double digits? What is the format directive for single-digit days? That looked like the cause, but then I'm not sure why it would error on '27'.
End solution (it was Unicode, not a plain string):
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
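
For what it's worth, a more targeted variant of the same cleanup, assuming the stray character really is the non-breaking hyphen U+2011 seen in the sample strings, would be to replace that code point directly and keep the original format:
# replace the non-breaking hyphen (U+2011) with a plain ASCII hyphen-minus
data['Date'] = data['Date'].str.replace('\u2011', '-', regex=False)
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')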
There seems to be an issue with your date strings. I replicated the problem with your sample data; if I remove the hyphens and retype them manually (for the first three dates), then the code works:
pd.to_datetime(df1['Date'] ,errors ='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually a different character. Clean your source data and you're good to go.
You have a special character here; it is not a regular -:
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-':
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]

parse multiple date format pandas

I've got stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in a format %Y-%m
What I tried was
def parse(date):
    if len(date) <= 5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass

df['Date'] = parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; dates like 2001-12-25 came back as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps.
Assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps: first we parse the regular dates into a new series called s:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in the series; these correspond to your dates that are missing a day.
We then reapply the same method with the other format and fill the result into the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
Then we reassign the result to your dataframe.
You could use a regex to pull out the year and month, and convert to datetime:
df = pd.read_clipboard("\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically converts the day to 1, since only year and month was supplied.
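
If you need the literal %Y-%m text the question asks for rather than full timestamps, you can format the parsed column afterwards with dt.strftime (shown here on the Dates column from the snippet above; the year_month name is just an illustrative choice):
# turn the parsed datetimes back into 'YYYY-MM' strings
df['year_month'] = df['Dates'].dt.strftime('%Y-%m')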

How to pad with trailing zeroes when datetime formatting in pandas

I have a pandas DataFrame that looks like this:
pta ptd tpl_num
4 05:17 05:18 0
6 05:29:30 05:30 1
9 05:42 05:44:30 2
11 05:53 05:54 3
12 06:03 06:05:30 4
I'm trying to format pta and ptd to %H:%M:%S using this:
df['pta'] = pandas.to_datetime(df['pta'], format="%H:%M:%S")
df['ptd'] = pandas.to_datetime(df['ptd'], format="%H:%M:%S")
This gives:
ValueError: time data '05:17' does not match format '%H:%M:%S' (match)
Makes sense, as some of my timestamps don't have :00 in the seconds column. Is there any way to pad these at the end? Or will I need to pad my input data manually/before adding it to the DataFrame? I've seen plenty of answers that pad leading zeroes, but couldn't find one for this.
Some dates do not match the specified format and hence are not correctly parsed. Let pandas parse them for you, and then use dt.strftime to format them as you want:
df['pta'] = pd.to_datetime(df['pta']).dt.strftime("%H:%M:%S")
df['ptd'] = pd.to_datetime(df['ptd']).dt.strftime("%H:%M:%S")
print(df)
pta ptd tpl_num
4 05:17:00 05:18:00 0
6 05:29:30 05:30:00 1
9 05:42:00 05:44:30 2
11 05:53:00 05:54:00 3
12 06:03:00 06:05:30 4
If you only want the padded strings, you can do:
df['pta'].add(':00').str[:8]
Output:
4 05:17:00
6 05:29:30
9 05:42:00
11 05:53:00
12 06:03:00
Name: pta, dtype: object
Also, for time only, you should consider using pd.to_timedelta instead of pd.to_datetime.
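A minimal sketch of that timedelta route, reusing the padding trick from above so '05:17' becomes '05:17:00' before conversion:
# pad to HH:MM:SS, then parse as timedelta64[ns] instead of datetime
df['pta'] = pd.to_timedelta(df['pta'].add(':00').str[:8])
df['ptd'] = pd.to_timedelta(df['ptd'].add(':00').str[:8])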

Pandas error on unix datetime conversion -- OutOfBoundsDatetime: cannot convert input with unit 's'

I am getting this error
File "pandas/_libs/tslib.pyx", line 356, in pandas._libs.tslib.array_with_unit_to_datetime
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: cannot convert input with unit 's'
when trying to convert pandas column to datetime format.
I checked this answer Convert unix time to readable date in pandas dataframe
but it did not help me to solve the problem.
There is an issue on GitHub that seems to be closed, but at the same time people keep reporting problems:
https://github.com/pandas-dev/pandas/issues/10987
The DataFrame column is in Unix time format; here is a printout of the top rows:
0 1420096800
1 1420096800
2 1420097100
3 1420097100
4 1420097400
5 1420097400
6 1420093800
7 1420097700
8 1420097700
9 1420098000
10 1420098480
11 1420098600
12 1420099200
13 1420099500
14 1420099500
15 1420100100
16 1420100400
17 1420096800
18 1420100700
19 1420100820
20 1420101840
Any ideas about how I might solve it?
I tried changing units from s to ms, but it did not help.
pd.__version__
'0.24.2'
The line that raises the error:
df[key] = pd.to_datetime(df[key], unit='s')
It works if you add the origin='unix' parameter:
pd.to_datetime(df['date'], origin='unix', unit='s')
0 2015-01-01 07:20:00
1 2015-01-01 07:20:00
2 2015-01-01 07:25:00
3 2015-01-01 07:25:00
4 2015-01-01 07:30:00
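
If that alone doesn't fix it, one more thing worth checking (an assumption about the data, since the question only shows the printed values) is whether the column really holds numbers; stringified or mixed values can trip up unit='s', and coercing to numeric first sidesteps that:
# make sure the epoch values are numeric before converting;
# anything unparseable becomes NaN and then NaT
df['date'] = pd.to_datetime(pd.to_numeric(df['date'], errors='coerce'),
                            unit='s', origin='unix')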

how to import dates in python

I have a pandas dataframe containing a column (dtype: object) in which dates are expressed as:
0 2014-11-07 14:08:00
1 2014-10-18 16:53:00
2 2014-10-27 11:57:00
3 2014-10-27 11:57:00
4 2014-10-08 16:35:00
5 2014-10-24 16:36:00
6 2014-11-06 15:34:00
7 2014-11-11 10:30:00
8 2014-10-31 13:20:00
9 2014-11-07 13:15:00
10 2014-09-20 14:36:00
11 2014-11-07 17:21:00
12 2014-09-23 08:53:00
13 2014-11-05 09:37:00
14 2014-10-26 18:48:00
...
Name: ts_placed, Length: 13655, dtype: object
What I want to do is read the column as dates and then split the dataset by week.
What I tried to do is:
data["ts_placed"] = pd.to_datetime(data.ts_placed)
data.sort('ts_placed')
It did not work:
TypeError: unorderable types: str() > datetime.datetime()
Does anybody know a way to import dates in Python when they are stored as objects?
Thank you very much
Use Series.dt methods.
For the date, you can use Series.dt.date:
data['Date Column'] = data['Date Column'].dt.date
For the week, you can use Series.dt.weekofyear:
data['Week'] = data['Date Column'].dt.weekofyear
Then you would create new data based on week:
weekdata = data[data['Week'] == week_number]
The sort should also work now.
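If the goal is to split the frame into per-week chunks rather than just label the rows, a groupby sketch along the same lines, using the ts_placed column from the question (freq='W' groups by week ending on Sunday; that choice is an assumption, not something the question pins down):
data['ts_placed'] = pd.to_datetime(data['ts_placed'])

# iterate over one sub-frame per calendar week
for week_end, weekdata in data.groupby(pd.Grouper(key='ts_placed', freq='W')):
    print(week_end, len(weekdata))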
It looks like to_datetime doesn't work on the series here, but an element-wise (manually vectorized) version does:
data['ts_placed'] = [pd.to_datetime(strD) for strD in data.ts_placed]
data.sort('ts_placed')
UPDATE: I wanted my accepted answer to match the solution worked out in the comments. If the vectorized version of to_datetime is run, it will not convert the vector to datetime objects unless all of the strings can be converted; the version above will convert those that can be converted. In either case you should check whether all values have been converted.
Using the vectorized version, one could check with:
data.ts_placed = pd.to_datetime(data.ts_placed)
if not isinstance(data.ts_placed[0], pd.lib.Timestamp):
    print 'Dates not converted correctly'
Using the manually vectorized version as above:
if sum(not isinstance(strD, datetime.datetime) for strD in data.ts_placed) > 0:
    print 'Dates not converted correctly'
