I have a pandas dataframe containing a column (dtype: object) in which dates are expressed as:
0 2014-11-07 14:08:00
1 2014-10-18 16:53:00
2 2014-10-27 11:57:00
3 2014-10-27 11:57:00
4 2014-10-08 16:35:00
5 2014-10-24 16:36:00
6 2014-11-06 15:34:00
7 2014-11-11 10:30:00
8 2014-10-31 13:20:00
9 2014-11-07 13:15:00
10 2014-09-20 14:36:00
11 2014-11-07 17:21:00
12 2014-09-23 08:53:00
13 2014-11-05 09:37:00
14 2014-10-26 18:48:00
...
Name: ts_placed, Length: 13655, dtype: object
What I want to do is to read the column as dates and then split the dataset by week.
What I tried to do is:
data["ts_placed"] = pd.to_datetime(data.ts_placed)
data.sort('ts_placed')
It did not work
TypeError: unorderable types: str() > datetime.datetime()
Does anybody know a way to parse dates in Python when they are stored as objects?
Thank you very much
Use Series.dt methods.
For the date, you can use Series.dt.date:
data['Date Column'] = data['Date Column'].dt.date
For the week, you can use Series.dt.weekofyear (removed in recent pandas in favour of Series.dt.isocalendar().week):
data['Week'] = data['Date Column'].dt.weekofyear
Then you would create new data based on week:
weekdata = data[data['Week'] == week_number]  # week_number: the week you want
The sort should also work now.
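Putting the pieces together, a minimal sketch with sample values taken from the question; since dt.weekofyear was removed in recent pandas, this uses dt.isocalendar().week (ISO weeks, which can differ around New Year):

```python
import pandas as pd

# Sample values from the question; ts_placed starts as object dtype.
data = pd.DataFrame({"ts_placed": ["2014-11-07 14:08:00",
                                   "2014-10-18 16:53:00",
                                   "2014-10-27 11:57:00"]})

data["ts_placed"] = pd.to_datetime(data["ts_placed"])
data["Date"] = data["ts_placed"].dt.date

# dt.weekofyear was removed in recent pandas; dt.isocalendar().week is the
# current equivalent.
data["Week"] = data["ts_placed"].dt.isocalendar().week

weekdata = data[data["Week"] == 44]  # rows falling in ISO week 44
```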
It seems to_datetime did not convert the whole Series here, but a manually vectorized version works:
data['ts_placed'] = [pd.to_datetime(strD) for strD in data.ts_placed]
data.sort_values('ts_placed')  # DataFrame.sort() in pandas before 0.20
UPDATE: Wanted my accepted answer to match the solution figured out in the comments. If the vectorized to_datetime is run on a Series in which not all strings can be parsed, it will not convert the vector to datetime objects; the element-by-element version above converts the values that can be parsed. In either case you should check whether all values were actually converted.
Using the vectorized version one could check using:
data.ts_placed = pd.to_datetime(data.ts_placed)
if not isinstance(data.ts_placed.iloc[0], pd.Timestamp):  # pd.lib.Timestamp in older pandas
    print('Dates not converted correctly')
Using the manually vectorized version as above:
import datetime

if sum(not isinstance(strD, datetime.datetime) for strD in data.ts_placed) > 0:
    print('Dates not converted correctly')
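An alternative check, as a sketch with made-up strings: passing errors='coerce' to to_datetime turns unparseable values into NaT, which is easy to count afterwards:

```python
import pandas as pd

# Made-up sample strings, one of them unparseable.
s = pd.Series(["2014-11-07 14:08:00", "not a date", "2014-10-18 16:53:00"])

# errors='coerce' turns values that cannot be parsed into NaT, so counting
# NaT afterwards tells you whether everything converted.
converted = pd.to_datetime(s, errors="coerce")
bad = converted.isna().sum()
if bad:
    print(f"{bad} value(s) not converted correctly")
```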
Related
I have a column that contains only times. After reading the CSV file I converted that column to the datetime dtype, since it was read as object in my Jupyter notebook. When I try to filter I get an error like below:
TypeError: Index must be DatetimeIndex
code
newdata = newdata['APPOINTMENT_TIME'].between_time('14:30:00', '20:00:00')
sample_data
APPOINTMENT_TIME Id
13:30:00 1
15:10:00 2
18:50:00 3
14:10:00 4
14:00:00 5
Here I am trying to display the rows whose APPOINTMENT_TIME is between 14:30:00 and 20:00:00.
datatype info
Could anyone help? Thanks in advance.
between_time is a special method that works on a DatetimeIndex, which is not your case. It would apply if you had full datetimes like 2021-12-21 13:30:00.
In your case, you can just use the between method on strings and the fact that times with your format HH:MM:SS will be naturally sorted:
filtered_data = newdata[newdata['APPOINTMENT_TIME'].between('14:30:00', '20:00:00')]
Output:
APPOINTMENT_TIME Id
1 15:10:00 2
2 18:50:00 3
NB. You can't use a range that starts before midnight and ends after.
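If you do want between_time itself, a sketch of the DatetimeIndex route (frame rebuilt from the question's sample; pandas attaches today's date when parsing bare times):

```python
import pandas as pd

# Frame rebuilt from the question's sample data.
newdata = pd.DataFrame({"APPOINTMENT_TIME": ["13:30:00", "15:10:00", "18:50:00",
                                             "14:10:00", "14:00:00"],
                        "Id": [1, 2, 3, 4, 5]})

# between_time needs a DatetimeIndex: parse the bare times (pandas fills in
# today's date) and move the result into the index.
indexed = newdata.set_index(pd.to_datetime(newdata["APPOINTMENT_TIME"]))
filtered = indexed.between_time("14:30", "20:00")
```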
I'm getting a ValueError saying my data does not match the format when it does. Not sure if this is a bug or I'm missing something here. I'm referring to this documentation for the string format. The weird part is that if I write the 'data' DataFrame to a CSV, read it back in, and then call the function below, it converts the date fine, so I'm not sure why it doesn't work without the CSV round trip.
Any ideas?
data['Date'] = pd.to_datetime(data['Date'], format='%d-%b-%Y')
I'm getting two errors
TypeError: Unrecognized value type: <class 'str'>
ValueError: time data '27‑Aug‑2018' does not match format '%d-%b-%Y' (match)
Example dates -
2‑Jul‑2018
27‑Aug‑2018
28‑May‑2018
19‑Jun‑2017
5‑Mar‑2018
15‑Jan‑2018
11‑Nov‑2013
23‑Nov‑2015
23‑Jun‑2014
18‑Jun‑2018
30‑Apr‑2018
14‑May‑2018
16‑Apr‑2018
26‑Feb‑2018
19‑Mar‑2018
29‑Jun‑2015
Is it because the days aren't all double-digit? What is the format directive for single-digit days? That looked like a possible cause, but then it shouldn't error on '27'.
End solution (it was Unicode, not a plain string) -
data['Date'] = data['Date'].apply(unidecode.unidecode)
data['Date'] = data['Date'].apply(lambda x: x.replace("-", "/"))
data['Date'] = pd.to_datetime(data['Date'], format="%d/%b/%Y")
There seems to be an issue with your date strings. I replicated your issue with your sample data, and if I remove the hyphens and retype them manually (for the first three dates), then the code works:
pd.to_datetime(df1['Date'], errors='coerce')
output:
0 2018-07-02
1 2018-08-27
2 2018-05-28
3 NaT
4 NaT
5 NaT
6 NaT
7 NaT
8 NaT
9 NaT
10 NaT
11 NaT
12 NaT
13 NaT
14 NaT
15 NaT
Bottom line: your hyphens look like regular ones but are actually something else; just clean your source data and you're good to go.
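A quick way to spot such look-alike characters is to print each character's code point; a sketch, assuming the culprit is U+2011 (the non-breaking hyphen, as in this question):

```python
import pandas as pd

date_str = "27\u2011Aug\u20112018"  # renders like '27-Aug-2018' but is not

# Print each character with its code point to spot look-alikes:
# the hyphens here are U+2011, not the ASCII hyphen U+002D.
for ch in date_str:
    print(repr(ch), hex(ord(ch)))

# Replace the non-breaking hyphen before parsing.
cleaned = pd.Series([date_str]).str.replace("\u2011", "-")
parsed = pd.to_datetime(cleaned, format="%d-%b-%Y")
```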
You have a special character here; it is not a plain -
df.iloc[0,0][2]
Out[287]: '‑'
Replace it with '-'
pd.to_datetime(df.iloc[:,0].str.replace('‑','-'),format='%d-%b-%Y')
Out[288]:
0 2018-08-27
1 2018-05-28
2 2017-06-19
3 2018-03-05
4 2018-01-15
5 2013-11-11
6 2015-11-23
7 2014-06-23
8 2018-06-18
9 2018-04-30
10 2018-05-14
11 2018-04-16
12 2018-02-26
13 2018-03-19
14 2015-06-29
Name: 2‑Jul‑2018, dtype: datetime64[ns]
I have a pandas DataFrame that looks like this:
pta ptd tpl_num
4 05:17 05:18 0
6 05:29:30 05:30 1
9 05:42 05:44:30 2
11 05:53 05:54 3
12 06:03 06:05:30 4
I'm trying to format pta and ptd to %H:%M:%S using this:
df['pta'] = pandas.to_datetime(df['pta'], format="%H:%M:%S")
df['ptd'] = pandas.to_datetime(df['ptd'], format="%H:%M:%S")
This gives:
ValueError: time data '05:17' does not match format '%H:%M:%S' (match)
Makes sense, as some of my timestamps don't have :00 in the seconds column. Is there any way to pad these at the end? Or will I need to pad my input data manually/before adding it to the DataFrame? I've seen plenty of answers that pad leading zeroes, but couldn't find one for this.
Some dates do not match the specified format and hence are not correctly parsed. Let pandas parse them for you, and then use dt.strftime to format them as you want:
df['pta'] = pd.to_datetime(df['pta']).dt.strftime("%H:%M:%S")
df['ptd'] = pd.to_datetime(df['ptd']).dt.strftime("%H:%M:%S")
print(df)
pta ptd tpl_num
4 05:17:00 05:18:00 0
6 05:29:30 05:30:00 1
9 05:42:00 05:44:30 2
11 05:53:00 05:54:00 3
12 06:03:00 06:05:30 4
If you only want the padded strings, you can do:
df['pta'].add(':00').str[:8]
Output:
4 05:17:00
6 05:29:30
9 05:42:00
11 05:53:00
12 06:03:00
Name: pta, dtype: object
Also, for time only, you should consider using pd.to_timedelta instead of pd.to_datetime.
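A sketch of that to_timedelta route, reusing the padding idiom above (sample values from the question):

```python
import pandas as pd

# Sample values from the question.
df = pd.DataFrame({"pta": ["05:17", "05:29:30", "05:42"]})

# Appending ':00' and truncating to 8 characters pads HH:MM to HH:MM:SS and
# leaves HH:MM:SS untouched; timedeltas then support arithmetic and
# comparisons, unlike datetime.time objects.
td = pd.to_timedelta(df["pta"].add(":00").str[:8])
```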
I've read loads of SO answers but can't find a clear solution.
I have this data in a df called day1 which represents hours:
1 10:53
2 12:17
3 14:46
4 16:36
5 18:39
6 20:31
7 22:28
Name: time, dtype: object>
I want to convert it into a time format. But when I do this:
day1.time = pd.to_datetime(day1.time, format='H%:M%')
The result includes today's date:
1 2015-09-03 10:53:00
2 2015-09-03 12:17:00
3 2015-09-03 14:46:00
4 2015-09-03 16:36:00
5 2015-09-03 18:39:00
6 2015-09-03 20:31:00
7 2015-09-03 22:28:00
Name: time, dtype: datetime64[ns]>
It seems the format argument isn't working - how do I get the time as shown here without the date?
Update
The following formats the time correctly, but somehow the column is still an object type. Why doesn't it convert to datetime64?
day1['time'] = pd.to_datetime(day1['time'], format='%H:%M').dt.time
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: time, dtype: object>
After performing the conversion you can use the datetime accessor dt to access just the hour or time component:
In [51]:
df['hour'] = pd.to_datetime(df['time'], format='%H:%M').dt.hour
df
Out[51]:
time hour
index
1 10:53 10
2 12:17 12
3 14:46 14
4 16:36 16
5 18:39 18
6 20:31 20
7 22:28 22
Also, your format string H%:M% is malformed; it's likely to raise ValueError: ':' is a bad directive in format 'H%:M%'
Regarding your last comment the dtype is datetime.time not datetime:
In [53]:
df['time'].iloc[0]
Out[53]:
datetime.time(10, 53)
You can use to_timedelta
pd.to_timedelta(df+':00')
Out[353]:
1 10:53:00
2 12:17:00
3 14:46:00
4 16:36:00
5 18:39:00
6 20:31:00
7 22:28:00
Name: Time, dtype: timedelta64[ns]
I recently also struggled with this problem. My method is close to EdChum's method and the result is the same as YOBEN_S's answer.
Just like EdChum illustrated, using dt.time will give you datetime.time objects (dt.hour just gives an integer hour), which are probably only good for display. I can barely do any comparison or calculation on these objects, so if you need further comparisons or calculations on the result column, it's better to avoid that format.
My method just subtracts the date from the to_datetime result:
c = pd.Series(['10:23', '12:17', '14:46'])
pd.to_datetime(c, format='%H:%M') - pd.to_datetime(c, format='%H:%M').dt.normalize()
The result is
0 10:23:00
1 12:17:00
2 14:46:00
dtype: timedelta64[ns]
dt.normalize() sets the time components to 00:00:00, so it shows only the date while keeping the datetime64 format; subtracting it from the parsed values therefore leaves a timedelta64 you can do calculations with.
My answer is by no means better than the other two. I just want to provide a different approach and hope it helps.
I have a pandas data frame with a column that represents dates as:
Name: ts_placed, Length: 13631, dtype: datetime64[ns]
It looks like this:
0 2014-10-18 16:53:00
1 2014-10-27 11:57:00
2 2014-10-27 11:57:00
3 2014-10-08 16:35:00
4 2014-10-24 16:36:00
5 2014-11-06 15:34:00
6 2014-11-11 10:30:00
....
I know how to group it in general using the function:
grouped = data.groupby('ts_placed')
What I want to do is to use the same function but to group the rows by week.
Pass
pd.DatetimeIndex(df.date).week
as the argument to groupby. This is the ordinal week in the year (the .week attribute was removed in recent pandas in favour of isocalendar().week); see DatetimeIndex for other definitions of week.
You can also use TimeGrouper:
df.set_index(your_date).groupby(pd.TimeGrouper('W')).size()
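pd.TimeGrouper was removed in later pandas; a sketch of the current spelling with pd.Grouper, on a hypothetical frame in the question's shape:

```python
import pandas as pd

# Hypothetical frame in the question's shape.
data = pd.DataFrame({
    "ts_placed": pd.to_datetime(["2014-10-18 16:53:00",
                                 "2014-10-27 11:57:00",
                                 "2014-10-27 11:57:00"]),
})

# pd.Grouper(freq='W') replaces pd.TimeGrouper('W'); 'W' bins end on Sunday,
# and empty weeks show up as zero-size groups.
weekly = data.groupby(pd.Grouper(key="ts_placed", freq="W")).size()
```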