Timestamp, timedelta and conversion in Python - python

I have a dataframe with a timestamp column in the format YYYY-MM-DD HH:MM:SS.sss. An example is shown below:
0 2019-12-17 21:17:39.424
1 2019-12-17 21:17:41.065
2 2019-12-17 21:18:06.640
3 2019-12-17 21:18:07.229
4 2019-12-17 21:18:07.858
...
1072 2019-12-17 22:54:54.052
1073 2019-12-17 22:54:56.075
1074 2019-12-17 22:55:23.040
1075 2019-12-17 22:55:23.040
1076 2019-12-17 22:55:26.363
Name: time_stamp, Length: 1077, dtype: datetime64[ns]
There are more than a thousand rows, which I am reading from a CSV file. I am trying to find the time interval (timedelta) between each pair of successive timestamps. Since the difference between successive timestamps is never more than a few seconds, I just want to retrieve that part (discarding the days, hours and minutes parts, which are 0 anyway).
I can perform the subtraction iteratively inside a loop, but the result that I get is a string for each calculation. An example is shown below:
0 0 days 00:00:03.988000
1 0 days 00:00:01.641000
2 0 days 00:00:25.575000
3 0 days 00:00:00.589000
4 0 days 00:00:00.629000
...
1072 0 days 00:00:36.084000
1073 0 days 00:00:02.023000
1074 0 days 00:00:26.965000
1075 0 days 00:00:00
1076 0 days 00:00:03.323000
Name: arr_time, Length: 1077, dtype: object
Now, as you can see, the dtype is object (strings), which prevents me from performing the operations available for timedelta or datetime data. I am unable to change its datatype. I am so confused by the datetime, timestamp and timedelta concepts that I cannot figure out which operations or methods are supported in each case.
I can provide the raw csv file.
Can someone please help me retrieve just the seconds and milliseconds parts of each timedelta value into a Series or DataFrame?

Your data contains date/time information (for example as a string like "2019-12-17T21:17:39.424"). You can parse that to datetime, for example:
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
# gives dtype: datetime64[ns]
An individual element of this column (pd.Series) would be a Timestamp. If you subtract two timestamps from one another, you get a timedelta:
# the difference between timestamps are timedeltas:
df['dt'] = df['time_stamp'].diff()
# df['dt']
# 0 NaT
# 1 0 days 00:00:01.641000
# 2 0 days 00:00:25.575000
# 3 0 days 00:00:00.589000
# 4 0 days 00:00:00.629000
# Name: dt, dtype: timedelta64[ns]
Now that you have a column of dtype timedelta, you can work with that to get seconds and milliseconds:
# get the whole seconds by flooring the total_seconds() of the timedelta
df['dt_s'] = np.floor(df['dt'].dt.total_seconds())
# df['dt_s']
# 0 NaN
# 1 1.0
# 2 25.0
# 3 0.0
# 4 0.0
# Name: dt_s, dtype: float64
# get the milliseconds by converting total_seconds() to milliseconds and taking modulo 1000:
df['dt_ms'] = (df['dt'].dt.total_seconds()*1000) % 1000
# df['dt_ms']
# 0 NaN
# 1 641.0
# 2 575.0
# 3 589.0
# 4 629.0
# Name: dt_ms, dtype: float64
If desired, you could format the seconds and millisecond components to a string column:
# format to ss:fff output:
df['s_ms'] = (df['dt_s'].fillna(0).apply(lambda s: f'{int(s):02d}') +
              ':' +
              df['dt_ms'].fillna(0).apply(lambda s: f'{int(s):03d}'))
# df['s_ms']
# 0 00:000
# 1 01:641
# 2 25:575
# 3 00:589
# 4 00:629
# Name: s_ms, dtype: object
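As a hedged alternative to the floating-point arithmetic above, the integer fields exposed by .dt.components can be read directly. This is a minimal sketch on three of the sample timestamps; the variable names are illustrative, and the NaT row is dropped so the columns stay integer.

```python
import pandas as pd

# Three of the sample timestamps from the question.
stamps = pd.Series(pd.to_datetime([
    "2019-12-17 21:17:39.424",
    "2019-12-17 21:17:41.065",
    "2019-12-17 21:18:06.640",
]), name="time_stamp")

# diff() yields timedelta64 values; the first row is NaT and is dropped here.
dt = stamps.diff().dropna()

# .dt.components is a DataFrame with one integer column per field
# (days, hours, minutes, seconds, milliseconds, ...).
parts = dt.dt.components
seconds = parts["seconds"]
millis = parts["milliseconds"]
print(pd.DataFrame({"s": seconds, "ms": millis}))
```

Note that the seconds component only runs from 0 to 59, so for gaps of a minute or more it differs from the floored total_seconds() used above.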

Related

remove date in pandas [duplicate]

I have a dataframe df and its first column is timedelta64
df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 6 columns):
0 686 non-null timedelta64[ns]
1 686 non-null object
2 686 non-null object
3 686 non-null object
4 686 non-null object
5 686 non-null object
If I print(df[0][2]), for example, it will give me 0 days 05:01:11. However, I don't want the 0 days field. I only want 05:01:11 to be printed. Could someone teach me how to do this? Thanks so much!
It is possible by:
df['duration1'] = df['duration'].astype(str).str[-18:-10]
But this solution is not general: if the input is 3 days 05:01:11, it removes the 3 days too.
So it only works correctly for timedeltas of less than one day.
A more general solution is to create a custom format:
import numpy as np
import pandas as pd

N = 10
np.random.seed(11230)
rng = pd.date_range('2017-04-03 15:30:00', periods=N, freq='13.5H')
df = pd.DataFrame({'duration': np.abs(np.random.choice(rng, size=N) -
                                      np.random.choice(rng, size=N))})
df['duration1'] = df['duration'].astype(str).str[-18:-10]

def f(x):
    # split the total seconds into hours, minutes and seconds
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))

df['duration2'] = df['duration'].apply(f)
print(df)
duration duration1 duration2
0 2 days 06:00:00 06:00:00 54:00:00
1 2 days 19:30:00 19:30:00 67:30:00
2 1 days 03:00:00 03:00:00 27:00:00
3 0 days 00:00:00 00:00:00 0:00:00
4 4 days 12:00:00 12:00:00 108:00:00
5 1 days 03:00:00 03:00:00 27:00:00
6 0 days 13:30:00 13:30:00 13:30:00
7 1 days 16:30:00 16:30:00 40:30:00
8 0 days 00:00:00 00:00:00 0:00:00
9 1 days 16:30:00 16:30:00 40:30:00
Here's a short and robust version using apply():
df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.hours:02d}:{x.components.minutes:02d}:{x.components.seconds:02d}'
    if not pd.isnull(x) else ''
)
This leverages the components attribute of pandas Timedelta objects and also handles empty values (NaT).
If the timediff column does not contain pandas Timedelta objects, you can convert it:
df['timediff'] = pd.to_timedelta(df['timediff'])
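To make the behaviour of components concrete, here is a small sketch on a single Timedelta. It assumes a value longer than one day to show that the hours field alone drops the days, and that they can be folded into the hour count if wanted. The variable names are illustrative.

```python
import pandas as pd

td = pd.Timedelta(days=3, hours=5, minutes=1, seconds=11)
c = td.components  # namedtuple: days, hours, minutes, seconds, ...

# Hours, minutes and seconds only; the 3 days are silently dropped.
hms = f"{c.hours:02d}:{c.minutes:02d}:{c.seconds:02d}"

# Same fields, but with the days folded into the hour count.
total_hms = f"{c.days * 24 + c.hours}:{c.minutes:02d}:{c.seconds:02d}"

print(hms)
print(total_hms)
```

Which of the two is appropriate depends on whether durations longer than a day can occur in the column.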
datetime.timedelta already formats the way you'd like. The crux of this issue is that pandas internally converts to numpy.timedelta64.
import pandas as pd
from datetime import timedelta
time_1 = timedelta(days=3, seconds=3400)
time_2 = timedelta(days=0, seconds=3400)
print(time_1)
print(time_2)
times = pd.Series([time_1, time_2])
# Times are converted to Numpy timedeltas.
print(times)
# Convert to string after converting to datetime.timedelta.
times = times.apply(
    lambda numpy_td: str(timedelta(seconds=numpy_td.total_seconds())))
print(times)
So, convert to a datetime.timedelta and then to str (to prevent conversion back to numpy.timedelta64) before printing.
3 days, 0:56:40
0:56:40
0 3 days 00:56:40
1 0 days 00:56:40
dtype: timedelta64[ns]
0 3 days, 0:56:40
1 0:56:40
dtype: object
I came here looking for answers to the same question, so I felt I should add further clarification. : )
You can convert it into a Python timedelta, then to str and finally back to a Series:
pd.Series(df["duration"].dt.to_pytimedelta().astype(str), name="start_time")
Given OP is ok with an object column (a little verbose):
def splitter(td):
    td = str(td).split(' ')[-1:][0]
    return td

df['split'] = df['timediff'].apply(splitter)
Basically we're taking the timedelta column, transforming the contents to a string, then splitting the string (creates a list) and taking the last item of that list, which would be the hh:mm:ss component.
Note that specifying ' ' for what to split by is redundant here.
Alternative one liner:
df['split2'] = df['timediff'].astype('str').str.split().str[-1]
which is very similar, but not very pretty IMHO. Also, the output includes milliseconds, which is not the case in the first solution. I'm not sure what the reason for that is (please comment if you do). If your data is big it might be worthwhile to time these different approaches.
If you want to drop all zero-valued components (not only days), you can do it like this:
def pd_td_fmt(td):
    import pandas as pd
    abbr = {'days': 'd', 'hours': 'h', 'minutes': 'min', 'seconds': 's', 'milliseconds': 'ms', 'microseconds': 'us',
            'nanoseconds': 'ns'}
    fmt = lambda td: "".join(f"{v}{abbr[k]}" for k, v in td.components._asdict().items() if v != 0)
    if isinstance(td, pd.Timedelta):
        return fmt(td)
    elif isinstance(td, pd.TimedeltaIndex):
        return td.map(fmt)
    else:
        raise ValueError
If you can be sure that your timedelta is less than a day, this might work. To do it in as few lines as possible, I convert the timedelta to a datetime by adding it to the Unix epoch (pd.to_datetime(0)) and then use the .dt.strftime method of the resulting datetime column to format it. The format below includes %H so that the hours asked for in the question are kept:
df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%H:%M:%S')

getting week number from date python

I have code as below. My questions:
Why is it assigning week 1 to both '2014-1-1' and '2014-12-29'? Why is it not assigning week 53 to '2014-12-29'?
How could I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53, '2015-1-15' to have week 55, etc.
x=pd.DataFrame(data=['2014-1-1','2014-12-29','2015-1-1','2015-1-15'],columns=['date'])
x['week_number']=pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014 -- see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get its day number, and subtract that from the cumulative day differences:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min().day + 1 # .day of the earliest date (1) plus 1 gives 2
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
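Next, we need to find the minimum of your date column and take its day number plus 1 as the offset to subtract. Note that .day is the day of the month (1 here), so the result is 2, which happens to match the zero-based weekday of that Wednesday (Monday is 0):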
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting day number, you'll get the days since the starting date, in this case 2014-01-01, as if the Wednesday was day 0 of that first week (this is because when we calculate the number of weeks since that starting date, we need to compensate for the fact that Wednesday was the middle of that week):
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Finally use .astype() to convert the column to integers instead of floats.
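If the goal is the numbering the asker described (53, 53, 55), one possible sketch is to count whole weeks from the Monday of the week containing the earliest date. This is an alternative to the diff/cumsum approach above, not a restatement of it; week_start and weeks_from_start are illustrative names, and weeks are assumed to start on Monday.

```python
import pandas as pd

x = pd.DataFrame({"date": pd.to_datetime(
    ["2014-1-1", "2014-12-29", "2015-1-1", "2015-1-15"])})

# Monday of the week that contains the earliest date (dayofweek: Monday=0).
first = x["date"].min()
week_start = first - pd.Timedelta(days=first.dayofweek)

# Whole weeks elapsed since that Monday, counted from 1.
x["weeks_from_start"] = (x["date"] - week_start).dt.days // 7 + 1
print(x)
```

Because every date is measured against the same fixed Monday, the result does not depend on the rows being sorted.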

parse multiple date format pandas

I'm stuck with the following format:
0 2001-12-25
1 2002-9-27
2 2001-2-24
3 2001-5-3
4 200510
5 20078
What I need is the date in the format %Y-%m.
What I tried was
def parse(date):
    if len(date)<=5:
        return "{}-{}".format(date[:4], date[4:5], date[5:])
    else:
        pass
df['Date']= parse(df['Date'])
However, I only succeeded in parsing 20078 to 2007-8; values like 2001-12-25 came out as None.
So, how can I do it? Thank you!
We can use pd.to_datetime with errors='coerce' to parse the dates in steps,
assuming your column is called date:
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
df['date_fixed'] = s
print(df)
date date_fixed
0 2001-12-25 2001-12-25
1 2002-9-27 2002-09-27
2 2001-2-24 2001-02-24
3 2001-5-3 2001-05-03
4 200510 2005-10-01
5 20078 2007-08-01
In steps,
first we cast the regular datetimes to a new series called s
s = pd.to_datetime(df['date'],errors='coerce',format='%Y-%m-%d')
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 NaT
5 NaT
Name: date, dtype: datetime64[ns]
As you can see, we have two NaT values (null datetimes) in our series; these correspond to the entries that are missing a day.
We then apply the same to_datetime call with the other format, and use the result to fill the missing values of s:
s = s.fillna(pd.to_datetime(df['date'],format='%Y%m',errors='coerce'))
print(s)
0 2001-12-25
1 2002-09-27
2 2001-02-24
3 2001-05-03
4 2005-10-01
5 2007-08-01
then we re-assign to your dataframe.
You could use a regex to pull out the year and month, and convert to datetime :
df = pd.read_clipboard(sep=r"\s{2,}",header=None,names=["Dates"])
pattern = r"(?P<Year>\d{4})[-]*(?P<Month>\d{1,2})"
df['Dates'] = pd.to_datetime([f"{year}-{month}" for year, month in df.Dates.str.extract(pattern).to_numpy()])
print(df)
Dates
0 2001-12-01
1 2002-09-01
2 2001-02-01
3 2001-05-01
4 2005-10-01
5 2007-08-01
Note that pandas automatically sets the day to 1, since only year and month were supplied.
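Since the question asked for %Y-%m strings rather than datetimes, a hedged variation on the regex idea is to build the strings directly from the extracted groups. This sketch assumes the column holds strings in one of the two shapes shown in the question; the names dates, parts and year_month are illustrative.

```python
import pandas as pd

dates = pd.Series(["2001-12-25", "2002-9-27", "2001-2-24",
                   "2001-5-3", "200510", "20078"])

# Four-digit year, an optional hyphen, then a one- or two-digit month.
parts = dates.str.extract(r"(?P<year>\d{4})-?(?P<month>\d{1,2})")

# Zero-pad the month so every value has the shape YYYY-MM.
year_month = parts["year"] + "-" + parts["month"].str.zfill(2)
print(year_month)
```

The result is a column of strings; calling pd.to_datetime on it would bring back the day-1 datetimes shown above.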

Cumulative elapsed minutes from Pandas datetime Series

I have a column of datetime stamps. I need a column of total minutes elapsed from the first to the last value.
I have:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
>>> df
timestamp
0 2001-01-01 06:00:00
1 2001-01-01 06:01:00
2 2001-01-01 06:15:00
I need to add a column that gives the running total:
timestamp minutes
1-1-2001 6:00 0
1-1-2001 6:01 1
1-1-2001 6:15 15
1-1-2001 7:00 60
1-1-2001 7:35 95
Having a hard time manipulating the datetime Series to allow me to total up the timestamp.
I've looked at a lot of posts and can't find anything that does what I'm trying to do. Would appreciate any ideas!
You can chain a few methods together:
>>> df['minutes'] = df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()\
... .cumsum().div(60).astype(int)
>>> df
timestamp minutes
0 2001-01-01 06:00:00 0
1 2001-01-01 06:01:00 1
2 2001-01-01 06:15:00 15
Creation:
>>> df = pd.DataFrame({'timestamp': [
... pd.Timestamp('2001-01-01 06:00:00'),
... pd.Timestamp('2001-01-01 06:01:00'),
... pd.Timestamp('2001-01-01 06:15:00')
... ]})
Walkthrough
The easiest way to break this down is to separate each intermediate method call.
df['timestamp'].diff() gives you a Series of pandas Timedelta values (the pandas equivalent of Python's datetime.timedelta): the difference between each value and the one before it.
>>> df['timestamp'].diff()
0 NaT
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
This contains an N/A value (NaT/not a time) because there's nothing to subtract from the first value. You can simply fill it with the zero-value for timedeltas:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0))
0 00:00:00
1 00:01:00
2 00:14:00
Name: timestamp, dtype: timedelta64[ns]
Now you need to get an actual integer (minutes) from these objects. In .dt.total_seconds(), .dt is an "accessor" that is a way of accessing a bunch of methods that let you work with datetime-like data:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds()
0 0.0
1 60.0
2 840.0
Name: timestamp, dtype: float64
The result is the incremental second-change as a float. You need this on a cumulative basis, in minutes, and as an integer. That's what the final 3 operations do:
>>> df['timestamp'].diff().fillna(pd.Timedelta(0)).dt.total_seconds().cumsum().div(60).astype(int)
0 0
1 1
2 15
Name: timestamp, dtype: int64
Note that astype(int) truncates toward zero rather than rounding, so any leftover seconds that don't make up a full minute are dropped.
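Since the cumulative sum of successive differences is just the distance from the first value, the same column can also be sketched by subtracting the first timestamp directly. This is an alternative formulation rather than the chain used above, and it assumes the rows are already in chronological order.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2001-01-01 06:00:00",
    "2001-01-01 06:01:00",
    "2001-01-01 06:15:00",
])})

# Elapsed time since the first row, as timedelta64 values (no NaT to fill).
elapsed = df["timestamp"] - df["timestamp"].iloc[0]

# Whole minutes elapsed; floor division drops any leftover seconds.
df["minutes"] = (elapsed.dt.total_seconds() // 60).astype(int)
print(df)
```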

Convert integer series to timedelta in pandas

I have a data frame in pandas which includes number of days since an event occurred. I want to create a new column that calculates the date of the event by subtracting the number of days from the current date. Every time I attempt to apply pd.offsets.Day or pd.Timedelta I get an error stating that Series are an unsupported type. This also occurs when I use apply. When I use map I receive a runtime error saying "maximum recursion depth exceeded while calling a Python object".
For example, assume my data frame looked like this:
index days_since_event
0 5
1 7
2 3
3 6
4 0
I want to create a new column with the date of the event, so my expected outcome (using today's date of 12/29/2015)
index days_since_event event_date
0 5 2015-12-24
1 7 2015-12-22
2 3 2015-12-26
3 6 2015-12-23
4 0 2015-12-29
I have attempted multiple ways to do this, but have received errors for each.
One method I tried was:
now = pd.datetime.date(pd.datetime.now())
df['event_date'] = now - df.days_since_event.apply(pd.offsets.Day)
With this I received an error saying that Series are an unsupported type.
I tried the above with .map instead of .apply, and received the error that "maximum recursion depth exceeded while calling a Python object".
I also attempted to convert the days into timedelta, such as:
df.days_since_event = (dt.timedelta(days = df.days_since_event)).apply
This also received an error referencing the series being an unsupported type.
First, to convert the column with integers to a timedelta, you can use to_timedelta:
In [60]: pd.to_timedelta(df['days_since_event'], unit='D')
Out[60]:
0 5 days
1 7 days
2 3 days
3 6 days
4 0 days
Name: days_since_event, dtype: timedelta64[ns]
Then you can create a new column with the current date and subtract those timedeltas:
In [62]: df['event_date'] = pd.Timestamp('2015-12-29')
In [63]: df['event_date'] = df['event_date'] - pd.to_timedelta(df['days_since_event'], unit='D')
In [64]: df['event_date']
Out[64]:
0 2015-12-24
1 2015-12-22
2 2015-12-26
3 2015-12-23
4 2015-12-29
dtype: datetime64[ns]
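The two steps can also be collapsed into one, because a scalar Timestamp can be combined with a whole timedelta Series. The sketch below hard-codes the date from the question so the output is reproducible; for a live run, pd.Timestamp.today().normalize() could be swapped in as an assumption about what "current date" should mean.

```python
import pandas as pd

df = pd.DataFrame({"days_since_event": [5, 7, 3, 6, 0]})

# Fixed reference date taken from the question.
today = pd.Timestamp("2015-12-29")

# Scalar Timestamp minus a timedelta Series gives a datetime64 Series.
df["event_date"] = today - pd.to_timedelta(df["days_since_event"], unit="D")
print(df)
```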
Just to follow up with joris' response, you can convert an int or a float into whatever time unit you want with pd.to_timedelta(x, unit=''), changing only the entry for unit=:
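# Years, Months (older pandas only; recent versions reject 'Y' and 'M' because their length is ambiguous).
# The outputs below correspond to 3 whole units, so the .5 was dropped:
pd.to_timedelta(3.5, unit='Y') # returned '1095 days 17:27:36'
pd.to_timedelta(3.5, unit='M') # returned '91 days 07:27:18'
# Days: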
pd.to_timedelta(3.5, unit='D') # returns '3 days 12:00:00'
# Hours, Minutes, Seconds:
pd.to_timedelta(3.5, unit='h') # returns '0 days 03:30:00'
pd.to_timedelta(3.5, unit='m') # returns '0 days 00:03:30'
pd.to_timedelta(3.5, unit='s') # returns '0 days 00:00:03.500000'
Note that arithmetic works once the values have been converted to timedeltas:
pd.to_timedelta(3.5, unit='h') - pd.to_timedelta(3.25, unit='h') # returns '0 days 00:15:00'
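Calendar months and years do not have a fixed length, so for month- or year-sized shifts a calendar-aware offset may be a better fit than a timedelta. The sketch below uses pd.DateOffset for that and contrasts it with a fixed 90-day Timedelta; the reference date is borrowed from the question and the variable names are illustrative.

```python
import pandas as pd

today = pd.Timestamp("2015-12-29")

# Calendar-aware shifts: same day of the month, different month or year.
three_months_ago = today - pd.DateOffset(months=3)
one_year_ago = today - pd.DateOffset(years=1)

# Fixed-length shift: exactly 90 days, which is not the same as 3 months.
ninety_days_ago = today - pd.Timedelta(days=90)

print(three_months_ago, one_year_ago, ninety_days_ago)
```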
