I have a dataframe df and its first column is timedelta64
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 6 columns):
0 686 non-null timedelta64[ns]
1 686 non-null object
2 686 non-null object
3 686 non-null object
4 686 non-null object
5 686 non-null object
If I print(df[0][2]), for example, it gives me 0 days 05:01:11. However, I don't want the 0 days part. I only want 05:01:11 to be printed. Could someone teach me how to do this? Thanks so much!
It is possible with string slicing:
df['duration1'] = df['duration'].astype(str).str[-18:-10]
But this solution is not general: if the input is 3 days 05:01:11, the 3 days part is removed as well.
So it works correctly only for timedeltas shorter than one day.
A more general solution is to create a custom format:
import numpy as np
import pandas as pd

N = 10
np.random.seed(11230)
rng = pd.date_range('2017-04-03 15:30:00', periods=N, freq='13.5H')
df = pd.DataFrame({'duration': np.abs(np.random.choice(rng, size=N) -
                                      np.random.choice(rng, size=N))})

df['duration1'] = df['duration'].astype(str).str[-18:-10]

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

df['duration2'] = df['duration'].apply(f)
print (df)
duration duration1 duration2
0 2 days 06:00:00 06:00:00 54:00:00
1 2 days 19:30:00 19:30:00 67:30:00
2 1 days 03:00:00 03:00:00 27:00:00
3 0 days 00:00:00 00:00:00 0:00:00
4 4 days 12:00:00 12:00:00 108:00:00
5 1 days 03:00:00 03:00:00 27:00:00
6 0 days 13:30:00 13:30:00 13:30:00
7 1 days 16:30:00 16:30:00 40:30:00
8 0 days 00:00:00 00:00:00 0:00:00
9 1 days 16:30:00 16:30:00 40:30:00
Here's a short and robust version using apply():
df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.hours:02d}:{x.components.minutes:02d}:{x.components.seconds:02d}'
    if not pd.isnull(x) else ''
)
This leverages the components attribute of pandas Timedelta objects and also handles empty values (NaT).
If the timediff column does not contain pandas Timedelta objects, you can convert it:
df['timediff'] = pd.to_timedelta(df['timediff'])
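Note that components.hours is only the hour-of-day part (0-23), so days get dropped just like in the slicing approach above; if total hours are wanted instead, folding the days in should work (a sketch under that assumption):
df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.days * 24 + x.components.hours:02d}:'
              f'{x.components.minutes:02d}:{x.components.seconds:02d}'
    if not pd.isnull(x) else ''
)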
datetime.timedelta already formats the way you'd like. The crux of this issue is that Pandas internally converts to numpy.timedelta64.
import pandas as pd
from datetime import timedelta
time_1 = timedelta(days=3, seconds=3400)
time_2 = timedelta(days=0, seconds=3400)
print(time_1)
print(time_2)
times = pd.Series([time_1, time_2])
# Times are converted to Numpy timedeltas.
print(times)
# Convert to string after converting to datetime.timedelta.
times = times.apply(
lambda numpy_td: str(timedelta(seconds=numpy_td.total_seconds())))
print(times)
So, convert to a datetime.timedelta and then str (to prevent conversion back to numpy.timedelta64) before printing.
3 days, 0:56:40
0:56:40
0 3 days 00:56:40
1 0 days 00:56:40
dtype: timedelta64[ns]
0 3 days, 0:56:40
1 0:56:40
dtype: object
I came here looking for answers to the same question, so I felt I should add further clarification. : )
You can convert it into a Python timedelta, then to str and finally back to a Series:
pd.Series(df["duration"].dt.to_pytimedelta().astype(str), name="start_time")
Given the OP is OK with an object column (a little verbose):
def splitter(td):
    td = str(td).split(' ')[-1:][0]
    return td
df['split'] = df['timediff'].apply(splitter)
Basically we're taking the timedelta column, transforming the contents to a string, then splitting the string (creates a list) and taking the last item of that list, which would be the hh:mm:ss component.
Note that specifying ' ' for what to split by is redundant here.
Alternative one liner:
df['split2'] = df['timediff'].astype('str').str.split().str[-1]
which is very similar, but not very pretty IMHO. Also, the output includes milliseconds, which is not the case in the first solution. I'm not sure what the reason for that is (please comment if you do). If your data is big it might be worthwhile to time these different approaches.
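A likely reason for the milliseconds (hedged, since it depends on the pandas version): .astype('str') renders the raw timedelta64[ns] values with their fixed-width fractional-second part, while str() on an individual pd.Timedelta (which is what apply(splitter) sees) drops a fractional part that is zero. A small sketch, plus one way to strip the fraction either way:
td = pd.Series(pd.to_timedelta(['0 days 00:13:30']))
print(td.astype(str).iloc[0])   # may include '.000000000' depending on the pandas version
print(str(td.iloc[0]))          # '0 days 00:13:30'
td.astype(str).str.split().str[-1].str.split('.').str[0]   # '00:13:30' either way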
If you want to drop all components that are zero (not only the days), you can do it like this:
def pd_td_fmt(td):
    import pandas as pd
    abbr = {'days': 'd', 'hours': 'h', 'minutes': 'min', 'seconds': 's', 'milliseconds': 'ms',
            'microseconds': 'us', 'nanoseconds': 'ns'}
    fmt = lambda td: "".join(f"{v}{abbr[k]}" for k, v in td.components._asdict().items() if v != 0)
    if isinstance(td, pd.Timedelta):
        return fmt(td)
    elif isinstance(td, pd.TimedeltaIndex):
        return td.map(fmt)
    else:
        raise ValueError
If you can be sure that your timedelta is less than a day, this might work. To do this in as few lines as possible, I convert the timedelta to a datetime by adding the Unix epoch 0 and then use the .dt accessor of the resulting datetime column to format it.
df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%M:%S')
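If the hours are needed too, the same trick should work with an hour directive added to the format string (same less-than-a-day assumption):
df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%H:%M:%S')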
Related
I have a dataframe which has a timestamp column in the format YYYY-MM-DD HH:MM:SS.sss. An example is shown below:
0 2019-12-17 21:17:39.424
1 2019-12-17 21:17:41.065
2 2019-12-17 21:18:06.640
3 2019-12-17 21:18:07.229
4 2019-12-17 21:18:07.858
...
1072 2019-12-17 22:54:54.052
1073 2019-12-17 22:54:56.075
1074 2019-12-17 22:55:23.040
1075 2019-12-17 22:55:23.040
1076 2019-12-17 22:55:26.363
Name: time_stamp, Length: 1077, dtype: datetime64[ns]
There are more than a thousand rows that I am reading from a csv file. What I have been trying to do is find the time interval (timedelta) between each successive timestamp. Since the difference between every successive pair is not greater than a few seconds, I just want to retrieve that part (discarding the date, hour and minute parts, which are 0 anyway).
I can perform the simple subtraction iteratively inside a loop, but the result that I get is a string for each calculation. An example is shown below:
> 0 0 days 00:00:03.988000
1 0 days 00:00:01.641000
2 0 days 00:00:25.575000
3 0 days 00:00:00.589000
4 0 days 00:00:00.629000
...
1072 0 days 00:00:36.084000
1073 0 days 00:00:02.023000
1074 0 days 00:00:26.965000
1075 0 days 00:00:00
1076 0 days 00:00:03.323000
Name: arr_time, Length: 1077, dtype: object
Now, as you can see, the datatype is string, which prevents me from performing various operations related to the timedelta or datetime datatypes. I am unable to change its datatype. I am so confused between the datetime, timestamp and timedelta concepts that I cannot figure out which operations or methods are supported in each case.
I can provide the raw csv file.
Can someone please help me retrieve just the seconds and milliseconds parts of each timedelta value into a Series or DataFrame?
Your data contains date/time information (for example as a string like "2019-12-17T21:17:39.424") - you parse that to datetime e.g. like
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
# gives dtype: datetime64[ns]
An individual element of this column (pd.Series) would be a Timestamp. If you subtract two timestamps from one another, you get a timedelta:
# the difference between timestamps are timedeltas:
df['dt'] = df['time_stamp'].diff()
# df['dt']
# 0 NaT
# 1 0 days 00:00:01.641000
# 2 0 days 00:00:25.575000
# 3 0 days 00:00:00.589000
# 4 0 days 00:00:00.629000
# Name: dt, dtype: timedelta64[ns]
Now that you have a column of dtype timedelta, you can work with that to get seconds and milliseconds:
# get the seconds fraction by flooring the total_seconds() of the timedelta
df['dt_s'] = np.floor(df['dt'].dt.total_seconds())
# df['dt_s']
# 0 NaN
# 1 1.0
# 2 25.0
# 3 0.0
# 4 0.0
# Name: dt_s, dtype: float64
# get the milliseconds by converting total_seconds() to milliseconds and taking modulo 1000:
df['dt_ms'] = (df['dt'].dt.total_seconds()*1000) % 1000
# df['dt_ms']
# 0 NaN
# 1 641.0
# 2 575.0
# 3 589.0
# 4 629.0
# Name: dt_ms, dtype: float64
If desired, you could format the seconds and millisecond components to a string column:
# format to ss:fff output:
df['s_ms'] = (df['dt_s'].fillna(0).apply(lambda s: f'{int(s):02d}') +
':' +
df['dt_ms'].fillna(0).apply(lambda s: f'{int(s):03d}'))
# df['s_ms']
# 0 00:000
# 1 01:641
# 2 25:575
# 3 00:589
# 4 00:629
# Name: s_ms, dtype: object
Want to calculate the difference of days between pandas date series -
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
and today's date.
I tried but could not come up with logical solution.
Please help me with the code. Actually I am new to python and there are lot of syntactical errors happening while applying any function.
You could do something like
# generate time data
data = pd.to_datetime(pd.Series(["2018-09-1", "2019-01-25", "2018-10-10"]))
pd.to_datetime("now") > data
returns:
0 False
1 True
2 False
you could then use that to select the data
data[pd.to_datetime("now") > data]
Hope it helps.
Edit: I misread it but you can easily alter this example to calculate the difference:
data - pd.to_datetime("now")
returns:
0 -122 days +13:10:37.489823
1 24 days 13:10:37.489823
2 -83 days +13:10:37.489823
dtype: timedelta64[ns]
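Since the question asks for the difference in days, the resulting timedelta series also has a .dt accessor, so the whole-day part can be pulled out directly (a small sketch reusing the example above):
(data - pd.to_datetime("now")).dt.days
# 0   -122
# 1     24
# 2    -83
# dtype: int64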
You can try as follows:
>>> from datetime import datetime
>>> df
col1
0 2013-02-16
1 2013-01-29
2 2013-02-21
3 2013-02-22
4 2013-03-01
5 2013-03-14
6 2013-03-18
7 2013-03-21
Make sure to convert the column to datetime:
>>> df['col1'] = pd.to_datetime(df['col1'], infer_datetime_format=True)
Set the current datetime in order to then get the difference:
>>> curr_time = pd.to_datetime("now")
Now get the difference as follows:
>>> df['col1'] - curr_time
0 -2145 days +07:48:48.736939
1 -2163 days +07:48:48.736939
2 -2140 days +07:48:48.736939
3 -2139 days +07:48:48.736939
4 -2132 days +07:48:48.736939
5 -2119 days +07:48:48.736939
6 -2115 days +07:48:48.736939
7 -2112 days +07:48:48.736939
Name: col1, dtype: timedelta64[ns]
With numpy you can solve it as in difference-two-dates-days-weeks-months-years-pandas-python-2. Bottom line:
df['diff_days'] = df['First dates column'] - df['Second Date column']
# for days use 'D' for weeks use 'W', for month use 'M' and for years use 'Y'
df['diff_days']=df['diff_days']/np.timedelta64(1,'D')
print(df)
if you want days as int and not as float use
df['diff_days']=df['diff_days']//np.timedelta64(1,'D')
From the pandas docs under Converting To Timestamps you will find:
"Converting to Timestamps To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the to_datetime function"
I haven't used pandas before but this suggests your pandas date series (a list-like object) is iterable and each element of this series is an instance of a class which has a to_datetime function.
Assuming my assumptions are correct, the following function would take such a series and return a list of timedeltas (each a timedelta object representing the difference between two datetime objects).
from datetime import datetime

def convert(pandas_series):
    # get the current date
    now = datetime.now()
    # Use a list comprehension and the element-wise to_datetime method to calculate timedeltas.
    return [now - pandas_element.to_datetime() for pandas_element in pandas_series]

# assuming 'some_pandas_series' is a list-like pandas series object
list_of_timedeltas = convert(some_pandas_series)
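For completeness, a vectorized sketch of the same calculation, assuming the series can be parsed with pd.to_datetime:
import pandas as pd
dates = pd.to_datetime(some_pandas_series)        # parse to datetime64[ns]
days_ago = (pd.Timestamp.now() - dates).dt.days   # whole days between today and each date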
When I compute the difference between two pandas datetime64 dates I get np.timedelta64. Is there any easy way to convert these deltas into representations like hours, days, weeks, etc.?
I could not find any methods in np.timedelta64 that facilitate conversions between different units, but it looks like Pandas seems to know how to convert these units to days when printing timedeltas (e.g. I get 29 days, 23:20:00 in the string representation of dataframes). Any way to access this functionality?
Update:
Strangely, none of the following work:
> df['column_with_times'].days
> df['column_with_times'].apply(lambda x: x.days)
but this one does:
df['column_with_times'][0].days
pandas stores timedelta data in the numpy timedelta64[ns] type, but also provides the Timedelta type to wrap this for more convenience (e.g. to provide accessors for the days, hours, ... and other components).
In [41]: timedelta_col = pd.Series(pd.timedelta_range('1 days', periods=5, freq='2 h'))
In [42]: timedelta_col
Out[42]:
0 1 days 00:00:00
1 1 days 02:00:00
2 1 days 04:00:00
3 1 days 06:00:00
4 1 days 08:00:00
dtype: timedelta64[ns]
To access the different components of a full column (series), you have to use the .dt accessor. For example:
In [43]: timedelta_col.dt.components.hours
Out[43]:
0 0
1 2
2 4
3 6
4 8
dtype: int64
With timedelta_col.dt.components you get a frame with all the different components (days to nanoseconds) as different columns.
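For instance, a quick sketch (exact integer dtypes can vary by pandas version):
comps = timedelta_col.dt.components
# comps is a DataFrame with columns: days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds
comps[['days', 'hours']]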
When accessing one value of the column above, this gives back a Timedelta, and on this you don't need to use the dt accessor, but you can access directly the components:
In [45]: timedelta_col[0]
Out[45]: Timedelta('1 days 00:00:00')
In [46]: timedelta_col[0].days
Out[46]: 1L
So the .dt accessor provides access to the attributes of the Timedelta scalar, but on the full column. That is the reason you see that df['column_with_times'][0].days works but df['column_with_times'].days does not.
The reason that df['column_with_times'].apply(lambda x: x.days) does not work is that apply is given the timedelta64 values (and not the Timedelta pandas type), and these don't have such attributes.
I have a Series of dates in datetime64 format.
I want to convert them to a series of Period with a monthly frequency. (Essentially, I want to group dates into months for analytical purposes).
There must be a way of doing this - I just cannot find it quickly.
Note: these dates are not the index of the data frame - they are just a column of data in the data frame.
Example input data (as a Series)
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-12-01']))
print (data)
My current kludge/work around looks like
data = pd.to_datetime(pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.DatetimeIndex(data).to_period('M')
data = pd.Series(data.year).astype('str') + '-' + pd.Series((data.month).astype('int')).map('{:0>2d}'.format)
data = data.where(data != '2262-04', other='No Date')
print (data)
There are some issues currently (even in master) dealing with NaT in PeriodIndex, so your approach won't work like that. But it seems that you simply want to resample, so do this. You can of course specify a different function for how if you want.
In [57]: data
Out[57]:
0 2014-10-01
1 2014-10-01
2 2014-10-31
3 2014-11-15
4 2014-11-30
5 NaT
6 2014-12-01
dtype: datetime64[ns]
In [58]: df = DataFrame(dict(A = data, B = np.arange(len(data))))
In [59]: df.dropna(how='any',subset=['A']).set_index('A').resample('M',how='count')
Out[59]:
B
A
2014-10-31 3
2014-11-30 2
2014-12-31 1
import pandas as pd
import numpy as np
from datetime import datetime

data = pd.to_datetime(
    pd.Series(['2014-10-01', '2014-10-01', '2014-10-31', '2014-11-15', '2014-11-30', np.NaN, '2014-01-01']))
data = pd.Series(['{}-{:02d}'.format(x.year, x.month) if isinstance(x, datetime) else "Nat"
                  for x in pd.DatetimeIndex(data).to_pydatetime()])
0 2014-10
1 2014-10
2 2014-10
3 2014-11
4 2014-11
5 Nat
6 2014-01
dtype: object
Best I could come up with. If the only non-datetime objects possible are floats, you can change if isinstance(x, datetime) to if not isinstance(x, float).
I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate ages of people based on the time difference between 'entry_date' and 'dob', and to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I do not seem to be able to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the Pandas type Timedelta available since v0.15.0 you also can do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
You need 0.11 for this (0.11rc1 is out, final prob next week)
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
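In later pandas versions this apply should no longer be necessary, since timedelta columns expose a .dt accessor; a hedged sketch of the equivalent:
df['years'] = df['diff'].dt.days / 365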
Not sure if you still need it, but in Pandas 0.14 I usually use the .astype('timedelta64[X]') method
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's specify that you have a pandas series named time_difference which has type
numpy.timedelta64[ns]
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
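In newer pandas versions pd.tslib is no longer public; the same idea can be written with the top-level pd.Timedelta, or more simply with the .dt accessor (a sketch, assuming time_difference has dtype timedelta64[ns]):
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
# or, vectorized:
just_day = time_difference.dt.days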
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494