How to plot timedelta data from a pandas DataFrame?

I am trying to plot a Series (a column from a DataFrame, to be precise). It seems to hold valid data in the format hh:mm:ss (timedelta64).
In [14]: x5.task_a.describe()
Out[14]:
count 165
mean 0 days 03:35:41.121212
std 0 days 07:07:40.950819
min 0 days 00:00:06
25% 0 days 00:37:13
50% 0 days 01:28:17
75% 0 days 03:41:32
max 2 days 12:32:26
Name: task_a, dtype: object
In [15]: x5.task_a.head()
Out[15]:
wbdqueue_id
26868 00:26:11
26869 02:08:28
26872 00:26:07
26874 00:48:22
26875 00:26:17
Name: task_a, dtype: timedelta64[ns]
But when I try to plot it, I get an error saying there is no numeric data in the Empty 'DataFrame'.
I've tried:
x5.task_a.plot.kde()
and
x5.plot()
where x5 is the DataFrame with several Series of such timedelta data.
TypeError: Empty 'DataFrame': no numeric data to plot
I see that one can generate series of random values and plot it.
What am I doing wrong?

Convert to any logical numeric values, like hours or minutes, and then use .plot.kde()
(x5.task_a / np.timedelta64(1, 'h')).plot.kde()
Details
In [149]: x5
Out[149]:
task_a
0 0 days 22:27:46.684800
1 1 days 00:20:43.036800
2 0 days 12:16:24.873600
3 1 days 11:10:14.880000
4 1 days 03:31:05.548800
5 1 days 05:20:52.944000
6 1 days 00:09:09.590400
7 0 days 13:53:50.179200
8 1 days 04:08:57.695999
9 0 days 14:14:53.088000
In [150]: x5.task_a / np.timedelta64(1, 'h') # convert to hours
Out[150]:
0 22.462968
1 24.345288
2 12.273576
3 35.170800
4 27.518208
5 29.348040
6 24.152664
7 13.897272
8 28.149360
9 14.248080
Name: task_a, dtype: float64
Or to minutes
In [151]: x5.task_a / np.timedelta64(1, 'm')
Out[151]:
0 1347.77808
1 1460.71728
2 736.41456
3 2110.24800
4 1651.09248
5 1760.88240
6 1449.15984
7 833.83632
8 1688.96160
9 854.88480
Name: task_a, dtype: float64
Another way, using total_seconds:
In [153]: x5.task_a.dt.total_seconds() / 60
Out[153]:
0 1347.77808
1 1460.71728
2 736.41456
3 2110.24800
4 1651.09248
5 1760.88240
6 1449.15984
7 833.83632
8 1688.96160
9 854.88480
Name: task_a, dtype: float64
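If the whole DataFrame holds timedelta columns (the x5.plot() case from the question), the same conversion can be applied column by column before plotting. This is only a minimal sketch: the frame below is a hypothetical stand-in for x5, and timedeltas_to_hours is a helper name made up for illustration, not a pandas function.

```python
import pandas as pd

# Hypothetical stand-in for x5: two timedelta64 columns.
x5 = pd.DataFrame({
    "task_a": pd.to_timedelta(["00:26:11", "02:08:28", "00:26:07"]),
    "task_b": pd.to_timedelta(["01:00:00", "00:30:00", "2 days 12:32:26"]),
})


def timedeltas_to_hours(frame):
    """Return a copy with every timedelta column expressed as float hours."""
    out = frame.copy()
    for name in out.select_dtypes(include="timedelta").columns:
        out[name] = out[name].dt.total_seconds() / 3600
    return out


hours = timedeltas_to_hours(x5)
print(hours.dtypes)  # both columns are now float64
```

After that, hours.plot() or hours.task_a.plot.kde() should have numeric data to work with (kde additionally needs scipy installed).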

You can convert the TimedeltaIndex to total_seconds
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
idx = pd.date_range('20140101', '20140201')
df = pd.DataFrame(index=idx)
df['col0'] = np.random.randn(len(idx))
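# An index can't be diffed against a shifted copy of itself directly, so wrap it in a Series first.
# diff() gives the gap to the previous timestamp; the first gap is NaT, so fill it with zero.
diff_idx = pd.Series(idx, index=idx).diff().fillna(pd.Timedelta(0)).dt.total_seconds()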
df['diff_dt'] = diff_idx
df['diff_dt'].plot()

Related

Convert string hours to minute pd.eval

I want to convert all rows of my DataFrame, which contain hours and minutes, into minutes only.
I have a dataframe that looks like this:
df=
time
0 8h30
1 14h07
2 08h30
3 7h50
4 8h0
5 8h15
6 6h15
I'm using the following method to convert:
df['time'] = pd.eval(
df['time'].replace(['h'], ['*60+'], regex=True))
Output
SyntaxError: invalid syntax
I think the error comes from the format of the hour; maybe pd.eval can't accept 08h30 or 8h0. How can I solve this problem?
Pandas can already handle such strings if the units are included in the string. While 14h07 can't be parsed (why assume 07 is minutes?), 14h07m can be converted to a Timedelta:
>>> pd.to_timedelta("14h07m")
Timedelta('0 days 14:07:00')
Given this dataframe :
d1 = pd.DataFrame(['8h30m', '14h07m', '08h30m', '8h0m'],
columns=['time'])
You can convert the time series into a Timedelta series with pd.to_timedelta :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d1
time tm
0 8h30m 0 days 08:30:00
1 14h07m 0 days 14:07:00
2 08h30m 0 days 08:30:00
3 8h0m 0 days 08:00:00
To handle the missing minutes unit in the original data, just append m:
d1['tm'] = pd.to_timedelta(d1['time'] + 'm')
Once you have a Timedelta you can calculate hours and minutes.
The components of the values can be retrieved with Timedelta.components
>>> d1.tm.dt.components.hours
0 8
1 14
2 8
3 8
Name: hours, dtype: int64
To get the total minutes, seconds or hours, change the frequency to minutes:
>>> d1.tm.astype('timedelta64[m]')
0 510.0
1 847.0
2 510.0
3 480.0
Name: tm, dtype: float64
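A caveat on the astype call above: as far as I can tell, newer pandas releases (2.0 and later) no longer allow casting to 'timedelta64[m]' and raise a TypeError instead, so treat that line as version-dependent. A sketch of two alternatives that should not depend on the version, using the same d1 data (the total_minutes_alt column name is just for illustration):

```python
import pandas as pd

d1 = pd.DataFrame({"time": ["8h30m", "14h07m", "08h30m", "8h0m"]})
d1["tm"] = pd.to_timedelta(d1["time"])

# Option 1: divide by a one-minute Timedelta; the result is a plain float.
d1["total_minutes"] = d1["tm"] / pd.Timedelta(minutes=1)

# Option 2: go through total seconds.
d1["total_minutes_alt"] = d1["tm"].dt.total_seconds() / 60

print(d1)
```

Swapping pd.Timedelta(minutes=1) for pd.Timedelta(hours=1) or pd.Timedelta(seconds=1) gives total hours or seconds the same way.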
Bringing all the operations together :
>>> d1['tm'] = pd.to_timedelta(d1['time'])
>>> d2 = (d1.assign(h=d1.tm.dt.components.hours,
... m=d1.tm.dt.components.minutes,
... total_minutes=d1.tm.astype('timedelta64[m]')))
>>>
>>> d2
time tm h m total_minutes
0 8h30m 0 days 08:30:00 8 30 510.0
1 14h07m 0 days 14:07:00 14 7 847.0
2 08h30m 0 days 08:30:00 8 30 510.0
3 8h0m 0 days 08:00:00 8 0 480.0
To avoid having to trim leading zeros, an alternative approach:
df[['h', 'm']] = df['time'].str.split('h', expand=True).astype(int)
df['total_min'] = df['h']*60 + df['m']
Result:
time h m total_min
0 8h30 8 30 510
1 14h07 14 7 847
2 08h30 8 30 510
3 7h50 7 50 470
4 8h0 8 0 480
5 8h15 8 15 495
6 6h15 6 15 375
Just to give an alternative approach with kind of the same elements as above you could do:
df = pd.DataFrame(data=["8h30", "14h07", "08h30", "7h50", "8h0 ", "8h15", "6h15"],
columns=["time"])
First split your column on the "h":
hm = df["time"].str.split("h", expand=True)
Then combine the columns again, but zero-pad the hours and minutes in order to make valid time strings:
df2 = hm[0].str.strip().str.zfill(2) + hm[1].str.strip().str.zfill(2)
Then convert the string column with proper values to a date time column:
df3 = pd.to_datetime(df2, format="%H%M")
Finally, calculate the number of minutes by subtracting a zero time (to get timedeltas) and dividing by a one-minute timedelta:
zerotime= pd.to_datetime("0000", format="%H%M")
df['minutes'] = (df3 - zerotime) / pd.Timedelta(minutes=1)
The results look like:
time minutes
0 8h30 510.0
1 14h07 847.0
2 08h30 510.0
3 7h50 470.0
4 8h0 480.0
5 8h15 495.0
6 6h15 375.0

Pandas and DateTime TypeError: cannot compare a TimedeltaIndex with type float

I have a pandas Series of time differences that looks like:
print(delta_t)
1 0 days 00:00:59
3 0 days 00:04:22
6 0 days 00:00:56
8 0 days 00:01:21
19 0 days 00:01:09
22 0 days 00:00:36
...
(the full DataFrame had a bunch of NaNs which I dropped).
I'd like to know which delta_t's are less than 1 day, 1 hour, 1 minute,
so I tried:
delta_t_lt1day = delta_t[np.where(delta_t < 30.)]
but then got a:
TypeError: cannot compare a TimedeltaIndex with type float
Little help?!?!
Assuming your Series is in timedelta format, you can skip the np.where, and index using something like this, where you compare your actual values to other timedeltas, using the appropriate units:
delta_t_lt1day = delta_t[delta_t < pd.Timedelta(1,'D')]
delta_t_lt1hour = delta_t[delta_t < pd.Timedelta(1,'h')]
delta_t_lt1minute = delta_t[delta_t < pd.Timedelta(1,'m')]
You'll get the following series:
>>> delta_t_lt1day
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1hour
0
1 00:00:59
3 00:04:22
6 00:00:56
8 00:01:21
19 00:01:09
22 00:00:36
Name: 1, dtype: timedelta64[ns]
>>> delta_t_lt1minute
0
1 00:00:59
6 00:00:56
22 00:00:36
Name: 1, dtype: timedelta64[ns]
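For reference, here is a self-contained sketch of the same filtering that can be pasted and run. The delta_t Series is rebuilt from the values shown in the question, and the thresholds dict is only an illustration. If you really want to compare against a plain number, convert to seconds first, as in the last lines.

```python
import pandas as pd

# Rebuild the sample data from the question (index labels included).
delta_t = pd.Series(
    pd.to_timedelta(
        ["00:00:59", "00:04:22", "00:00:56", "00:01:21", "00:01:09", "00:00:36"]
    ),
    index=[1, 3, 6, 8, 19, 22],
)

# Compare timedeltas with timedeltas, never with bare floats.
thresholds = {
    "1 day": pd.Timedelta(days=1),
    "1 hour": pd.Timedelta(hours=1),
    "1 minute": pd.Timedelta(minutes=1),
}
counts = {label: int((delta_t < limit).sum()) for label, limit in thresholds.items()}
print(counts)

# Numeric alternative: turn the values into float seconds, then compare with a float.
under_30s = delta_t[delta_t.dt.total_seconds() < 30.0]
print(len(under_30s))
```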
You could use the Timedelta class:
import pandas as pd
deltas = pd.to_timedelta(['0 days 00:00:59',
'0 days 00:04:22',
'0 days 00:00:56',
'0 days 00:01:21',
'0 days 00:01:09',
'0 days 00:31:09',
'0 days 00:00:36'])
for e in deltas[deltas < pd.Timedelta(value=30, unit='m')]:
    print(e)
Output
0 days 00:00:59
0 days 00:04:22
0 days 00:00:56
0 days 00:01:21
0 days 00:01:09
0 days 00:00:36
Note that this filters out '0 days 00:31:09', as expected. The expression pd.Timedelta(value=30, unit='m') creates a time delta of 30 minutes.

Times read in as time deltas with large number of days in front

I'm working with a data set of my sleep for the past year or so. I've read the CSV into a pandas DataFrame. In it is a column called 'Duration'. I convert it into a Timedelta as follows:
df.Duration = pd.to_timedelta(df.Duration)
df.Duration.head()
Which outputs
0 17711 days 08:27:00
1 17711 days 07:56:00
2 17711 days 04:22:00
3 17711 days 07:29:00
4 17711 days 06:46:00
Name: Duration, dtype: timedelta64[ns]
I sort of understand why I get 17711 days in front of the hours, but all I really want is the hours. To solve this, I could write
df.Duration = (df.Duration - pd.Timedelta('17711 days'))
Which gives me
0 08:27:00
1 07:56:00
2 04:22:00
3 07:29:00
4 06:46:00
Name: Duration, dtype: timedelta64[ns]
However this is a pretty brittle method. Is there a better method of getting just the hours I want?
datetime.timedelta objects store days, seconds and microseconds as attributes. We can access them on a pandas Series through the dt accessor:
Setting up some dummy data
import datetime as dt
import pandas as pd
df = pd.DataFrame(
data=(
dt.timedelta(days=17711, hours=i, minutes=i, seconds=i) for i in range(0, 10)
),
columns=['Duration']
)
print(df['Duration'])
Duration
0 17711 days 00:00:00
1 17711 days 01:01:01
2 17711 days 02:02:02
3 17711 days 03:03:03
4 17711 days 04:04:04
5 17711 days 05:05:05
6 17711 days 06:06:06
7 17711 days 07:07:07
8 17711 days 08:08:08
9 17711 days 09:09:09
Name: Duration, dtype: timedelta64[ns]
Accessing seconds and turning them into hours
print(df['Duration'].dt.seconds / 3600)
0 0.000000
1 1.016944
2 2.033889
3 3.050833
4 4.067778
5 5.084722
6 6.101667
7 7.118611
8 8.135556
9 9.152500
Name: Duration, dtype: float64
Only hours
print(df['Duration'].dt.seconds // 3600)
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
Name: Duration, dtype: int64
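If you would rather keep a timedelta column (just without the days) instead of a float or integer number of hours, one option is to rebuild it from the seconds component. A minimal sketch, assuming sub-second precision does not matter, since dt.seconds drops microseconds; the durations and time_only names are only for illustration:

```python
import pandas as pd

# Sample values copied from the question.
durations = pd.Series(pd.to_timedelta([
    "17711 days 08:27:00",
    "17711 days 07:56:00",
    "17711 days 04:22:00",
]))

# dt.seconds is the within-day seconds component (0 to 86399), so the days vanish.
time_only = pd.to_timedelta(durations.dt.seconds, unit="s")
print(time_only)
```

Unlike subtracting a hard-coded '17711 days', this does not care how many days each row carries.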
Using split() with regex should do what you're looking for I think:
df[['Days', 'Time']] = df['Duration'].str.split('.* days', expand=True)
This will split the column into two and then you can just call it using the "Time" key.
Code:
>>> import pandas as pd
>>> d = ['17711 days 08:27:00',
... '17711 days 07:56:00',
... '17711 days 04:22:00',
... '17711 days 07:29:00',
... '17711 days 06:46:00']
>>> df = pd.DataFrame({'Duration': d})
>>> df[['Days', 'Time']] = df['Duration'].str.split('.* days', expand=True)
>>> df.Time = pd.to_timedelta(df.Time)
>>> df.Time.head()
0 08:27:00
1 07:56:00
2 04:22:00
3 07:29:00
4 06:46:00
Name: Time, dtype: timedelta64[ns]

How can I create a datetime column without 'date' part?

I have a dataframe with a column named 'Time' in it, like the one below (HH:MM:SS:fffff).
>>> df['Time']
0 09:42:29:75284
1 09:42:29:95584
2 09:42:31:15036
3 09:42:35:15138
4 09:42:35:95491
5 09:42:43:55414
6 09:42:45:35866
7 09:42:46:74638
8 09:42:47:35582
9 09:42:47:74774
10 09:42:48:94582
...
Name: Time, Length: 18924, dtype: object
I want to change its type as datetime, in order to make it easier to calculate. Is it possible to change its type, using pandas.to_datetime, as datetime without date?
You can convert it to timedelta64[ns] dtype:
Source DF:
In [164]: df
Out[164]:
Time
0 09:42:29:75284
1 09:42:29:95584
2 09:42:31:15036
3 09:42:35:15138
4 09:42:35:95491
5 09:42:43:55414
6 09:42:45:35866
7 09:42:46:74638
8 09:42:47:35582
9 09:42:47:74774
10 09:42:48:94582
In [165]: df.dtypes
Out[165]:
Time object # <-------- NOTE!
dtype: object
Converted:
In [166]: df.Time = pd.to_timedelta(df.Time.str.replace(r'\:(\d+)$', r'.\1', regex=True),  # regex=True is required on newer pandas, where str.replace is literal by default
                                    errors='coerce')
In [167]: df
Out[167]:
Time
0 09:42:29.752840
1 09:42:29.955840
2 09:42:31.150360
3 09:42:35.151380
4 09:42:35.954910
5 09:42:43.554140
6 09:42:45.358660
7 09:42:46.746380
8 09:42:47.355820
9 09:42:47.747740
10 09:42:48.945820
In [168]: df.dtypes
Out[168]:
Time timedelta64[ns] # <-------- NOTE!
dtype: object
Please refer to the pandas to_datetime documentation.
import pandas as pd
df = pd.DataFrame({'Time': ['09:42:29:75284','09:42:29:95584','09:42:31:15036']})
df
Out[]:
Time
0 09:42:29:75284
1 09:42:29:95584
2 09:42:31:15036
You can convert this into datetime format by specifying format as follows:
pd.to_datetime(df['Time'], format='%H:%M:%S:%f')
Out[]:
0 1900-01-01 09:42:29.752840
1 1900-01-01 09:42:29.955840
2 1900-01-01 09:42:31.150360
Name: Time, dtype: datetime64[ns]
But doing this will also add the date 1900-01-01.
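If the placeholder date gets in the way, one possible follow-up is to subtract midnight of that same date, which leaves a timedelta holding only the time of day. A sketch under that assumption; the elapsed and gap column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Time": ["09:42:29:75284", "09:42:29:95584", "09:42:31:15036"]})

# Parse with an explicit format; pandas fills in the placeholder date 1900-01-01.
parsed = pd.to_datetime(df["Time"], format="%H:%M:%S:%f")

# Subtracting midnight of the same day leaves only the time of day, as a timedelta.
df["elapsed"] = parsed - parsed.dt.normalize()

# Timedeltas support arithmetic directly, e.g. the gap between consecutive rows.
df["gap"] = df["elapsed"].diff()
print(df)
```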

Dividing a series containing datetime by a series containing an integer in Pandas

I have a series s1 which is of type datetime and has a time which represents a range between a start time and an end time - typical values are 7 days, 4 hours 5 mins etc. I have series s2 which contains integers for the number of events that happened in that time range.
I want to calculate the event frequency by:
event_freq = s1 / s2
I get the error:
cannot operate on a series with out a rhs of a series/ndarray of type datetime64[ns] or a timedelta
What's the best way to fix this?
Thanks in advance!
EXAMPLE of s1 is:
some_id
1 2012-09-02 09:18:40
3 2012-04-02 09:36:39
4 2012-02-02 09:58:02
5 2013-02-09 14:31:52
6 2012-01-09 12:59:20
EXAMPLE of s2 is:
some_id
1 3
3 1
4 1
5 2
6 1
8 1
10 3
12 2
This might possibly be a bug but what works is to operate on the underlying numpy array like so:
import pandas as pd
from pandas import Series
startdate = Series(pd.date_range('2013-01-01', '2013-01-03'))
enddate = Series(pd.date_range('2013-03-01', '2013-03-03'))
s1 = enddate - startdate
s2 = Series([2, 3, 4])
event_freq = Series(s1.values / s2)
Here are the Series:
>>> s1
0 59 days, 00:00:00
1 59 days, 00:00:00
2 59 days, 00:00:00
dtype: timedelta64[ns]
>>> s2
0 2
1 3
2 4
dtype: int64
>>> event_freq
0 29 days, 12:00:00
1 19 days, 16:00:00
2 14 days, 18:00:00
dtype: timedelta64[ns]
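On recent pandas versions the detour through .values should not be needed, since a timedelta Series can be divided by an integer Series directly; I have not checked where exactly this changed, so treat it as an assumption about your installed version. Note that s1 has to hold durations (end minus start), not timestamps like in the question's example. A sketch reusing the data above, with time_per_event and events_per_day as illustrative names:

```python
import pandas as pd

startdate = pd.Series(pd.date_range("2013-01-01", "2013-01-03"))
enddate = pd.Series(pd.date_range("2013-03-01", "2013-03-03"))
s1 = enddate - startdate      # durations, dtype timedelta64
s2 = pd.Series([2, 3, 4])     # number of events in each range

# Average time per event: timedelta / int stays a timedelta.
time_per_event = s1 / s2

# Events per day: turn the durations into float days first.
events_per_day = s2 / (s1 / pd.Timedelta(days=1))

print(time_per_event)
print(events_per_day)
```

Both Series are aligned on their index before dividing, so in real data make sure s1 and s2 share the same labels (the question's s2 has ids that s1 lacks, which would yield NaT/NaN for those rows).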
