Efficient timedelta calculator - python

I have time series data from a data logger that puts timestamps, precise to the microsecond, on each recorded sample. The timestamps look like --[ 29.08.2018 16:26:31.406 ] --, i.e. DD.MM.YYYY HH:MM:SS.fff, where fff is the fractional-seconds part. As you can imagine, a file recorded over just a few minutes can be very big (hundreds of megabytes). I need to plot a bunch of data from this file vs. time, ideally in milliseconds.
The data looks like below:
So I need to parse these dates in Python and compute timedeltas to find the time elapsed between samples, then generate plots. For example, subtracting the two timestamps --[ 29.08.2018 16:23:41.052 ] -- and --[ 29.08.2018 16:23:41.114 ] -- should give 62 milliseconds elapsed.
Currently I am using dateparser (import dateparser as dp), which returns a datetime after parsing; I can then subtract two of those to get a timedelta and convert it into ms or seconds as needed.
But this function is taking too long and is the bottleneck of my post-processing script.
Could anyone suggest a library that is more efficient at parsing dates and calculating timedeltas?
Here's the piece of code that is not so efficient:
import dateparser as dp

def timedelta_local(date1, date2):
    delta = dp.parse(date2) - dp.parse(date1)
    timediff = {'us': delta.microseconds + delta.seconds * 1000000 + delta.days * 24 * 60 * 60 * 1000000,
                'ms': delta.microseconds / 1000 + delta.seconds * 1000 + delta.days * 24 * 60 * 60 * 1000,
                'sec': delta.microseconds / 1000000 + delta.seconds + delta.days * 24 * 60 * 60,
                'minutes': delta.microseconds / 1000000 / 60 + delta.seconds / 60 + delta.days * 24 * 60}
    return timediff
Thanks in advance

@zvone is correct here. pandas is your best friend for this. Here is some sample code that will hopefully get you on the right track. It assumes your data is in a CSV file with a header line like the one you show in your example. I wasn't sure whether you wanted to keep the time difference as a timedelta object (easy for doing further math with) or simplify it to a float, so I did both.
import pandas as pd
df = pd.read_csv("test.csv", parse_dates=[0])
# What are the data types after the initial import?
print(f'{df.dtypes}\n\n')
# What are the contents of the data frame?
print(f'{df}\n\n')
# Create a new column that strips away leading and trailing characters
# that surround the data we want
df['Clean Time Stamp'] = df['Time Stamp'].apply(lambda x: x[3:-4])
# Convert to a pandas Timestamp. Use infer_datetime_format for speed.
df['Real Time Stamp'] = pd.to_datetime(df['Clean Time Stamp'], infer_datetime_format=True)
# Calculate time difference between successive rows
df['Delta T'] = df['Real Time Stamp'].diff()
# Convert pandas timedelta to a floating point value in milliseconds.
df['Delta T ms'] = df['Delta T'].dt.total_seconds() * 1000
print(f'{df.dtypes}\n\n')
print(df)
The output looks like this. Note that the printed dataframe wraps columns onto additional lines - this is just a printing artifact.
Time Stamp object
Limit A int64
Value A float64
Limit B int64
Value B float64
dtype: object
Time Stamp Limit A Value A Limit B Value B
0 --[ 29.08.2018 16:23:41.052 ] -- 15 3.109 30 2.907
1 --[ 29.08.2018 16:23:41.114 ] -- 15 3.020 30 8.242
Time Stamp object
Limit A int64
Value A float64
Limit B int64
Value B float64
Clean Time Stamp object
Real Time Stamp datetime64[ns]
Delta T timedelta64[ns]
Delta T ms float64
dtype: object
Time Stamp Limit A Value A Limit B Value B \
0 --[ 29.08.2018 16:23:41.052 ] -- 15 3.109 30 2.907
1 --[ 29.08.2018 16:23:41.114 ] -- 15 3.020 30 8.242
Clean Time Stamp Real Time Stamp Delta T \
0 29.08.2018 16:23:41.052 2018-08-29 16:23:41.052 NaT
1 29.08.2018 16:23:41.114 2018-08-29 16:23:41.114 00:00:00.062000
Delta T ms
0 NaN
1 62.0
If your files are large you may gain some efficiency by editing columns in place rather than creating new ones like I did.
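If you'd rather stay in plain Python for this step, most of dateparser's cost is its format detection. A minimal sketch (assuming the timestamps always follow the fixed --[ DD.MM.YYYY HH:MM:SS.fff ] -- layout shown above) that swaps dateparser for datetime.strptime with an explicit format string:

```python
from datetime import datetime, timedelta

FMT = "%d.%m.%Y %H:%M:%S.%f"

def timedelta_ms(date1: str, date2: str) -> float:
    """Milliseconds elapsed between two logger timestamps.

    Strips the surrounding '--[ ... ] --' decoration, then parses with
    a fixed format string instead of dateparser's format detection.
    """
    t1 = datetime.strptime(date1[4:-5], FMT)
    t2 = datetime.strptime(date2[4:-5], FMT)
    # Dividing by a 1 ms timedelta gives an exact float ratio.
    return (t2 - t1) / timedelta(milliseconds=1)

print(timedelta_ms("--[ 29.08.2018 16:23:41.052 ] --",
                   "--[ 29.08.2018 16:23:41.114 ] --"))  # 62.0
```

This avoids pandas entirely, but for hundred-megabyte files the vectorized pd.to_datetime approach above will still be faster per row.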

Related

Python Datetime conversion for excel dataframe

Hello,
I am trying to extract the date and time column from my Excel data. I get the column as a DataFrame of float values, and after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output as total seconds in a DataFrame, with the first cell starting at 0 sec and each following cell stepping onward in seconds (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract that from the rest of the column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (duration) to do the same, e.g. like
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
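If you also want the datetimes to match what Excel displays (01.01.1901 rather than 1971), the serial-day numbers can be anchored with to_datetime's origin parameter. A sketch, assuming the file uses Excel's 1900 date system, whose day 0 is conventionally 1899-12-30:

```python
import pandas as pd

df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
# Excel's 1900 date system counts days from 1899-12-30 (serial day 0),
# so passing that as origin recovers the dates Excel displays.
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d", origin="1899-12-30")
print(df["datetime"].iloc[0])  # 1901-01-01 00:00:00
```

The total-seconds calculation is unchanged either way, since shifting the origin cancels out in the subtraction.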

How to set a random time in a datetime

My instructions are as follows:
Read the date columns in as timestamps, convert them to YYYY/MM/DD
hours:minutes:seconds format, where you set hours minutes and seconds to random
values appropriate to their range
Here is column of the data frame we are suppose to alter to datetime:
Order date
11/12/2016
11/24/2016
6/12/2016
10/12/2016
...
And here is the date time I need
2016/11/12 (random) hours:minutes:seconds
2016/11/24 (random) hours:minutes:seconds
...
My main question is how do I get random hours, minutes, and seconds. The rest I can figure out from the documentation.
You can generate random integers between 0 and 86399 (the number of seconds in a day minus 1) and convert them to a TimeDelta with pandas.to_timedelta:
import numpy as np
import pandas as pd

# randint's upper bound is exclusive, so this yields values in 0..86399
time = pd.to_timedelta(np.random.randint(0, 60*60*24, size=len(df)), unit='s')
df['Order date'] = pd.to_datetime(df['Order date']).add(time)
Output:
Order date
0 2016-11-12 02:21:53
1 2016-11-24 13:26:00
2 2016-06-12 15:13:03
3 2016-10-12 14:45:12
You're trying to read the data in '%Y-%m-%d' format but the data is in '%m/%d/%Y' format. See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior to find out how to convert the date to your desired format.
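Putting the pieces together, here is a sketch of the whole round trip (assuming month-first input dates, as the sample output above suggests): parse the column, add a random time of day, and format it back out as YYYY/MM/DD:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Order date": ["11/12/2016", "11/24/2016", "6/12/2016"]})

# One random second-of-day per row; integers' upper bound is exclusive,
# so this yields values in 0..86399.
rng = np.random.default_rng(0)
seconds = pd.to_timedelta(rng.integers(0, 60 * 60 * 24, size=len(df)), unit="s")

stamped = pd.to_datetime(df["Order date"], format="%m/%d/%Y") + seconds
df["Order date"] = stamped.dt.strftime("%Y/%m/%d %H:%M:%S")
print(df["Order date"].iloc[0])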

How do I convert entire columns of data (date HH:mm:ss.ms, or just H:MM:SS.ms) to a decimal (e.g. hours) in Python/pandas?

I work with a variety of instruments, and one is particularly troublesome in that the exported data is in XLS or XLSX format with multiple pages, and multiple columns. I only want some pages and some columns, I have achieved reading this into pandas already.
I want to convert time (see below) into a decimal number of hours. This would be measured from an initial time (in the timestamp data) at the top of the column, so a timedelta in hours is probably the more correct value. I am only concerned with this column. How do I convert an entire column of data from one format to another?
date/time (absolute time), timestamped format YYYY-MM-DD HH:MM:SS
I have found quite a few answers, but they don't seem to apply to this particular case, mostly focusing on individual cells or small, manually entered data sets. My thousands of data files each have as many as 500,000 lines, so something more automated is preferred. There is no upper limit to the number of hours.
What might be part of the same question (someone asked me): the data is already in a pandas DataFrame, so should it be converted before or after being read in?
This might seem an amateur-ish question, and it is. I've avoided code writing for years; now I have to learn to data-wrangle for my job, and it's frustrating, so go easy on me.
Going about it the usual way, trying to adapt most of the solutions I found to a column, I get errors.
This is the code which works:
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime # not used
import time # not used
import numpy as np # Not used
loc1 = r"path\file.xls"
pd.read_excel(loc1)
filename=Path(loc1).stem
str_1=filename
df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
*** I NEED CODE HERE TO CONVERT TIMESTAMPS TO HOURS (decimal), most likely a form of timedelta ***
df.plot(x='Relative Time(h:min:s.ms)',y='Voltage(V)', color='blue')
plt.xlabel("relative time") # This is a specific value
plt.ylabel("voltage (V)")
plt.title(str_1) # filename is used in each sample as a graph title
plt.show()
Image of relevant information (already described above)
You should provide a minimal reproducible example, to help us understand exactly what issues you are facing.
Setup
Reading between the lines, here is a setup that hopefully exemplifies the kind of data you have:
import pandas as pd

vals = pd.Series([
    '2019-10-21 17:22:06',         # absolute date
    '2019-10-21 23:22:06.236',     # absolute date, with milliseconds
    '2019-10-21 12:00:00.236145',  # absolute date, with microseconds
    '5:10:10',                     # timedelta
    '40:10:10.123',                # timedelta, with milliseconds
    '345:10:10.123456',            # timedelta, with microseconds
])
Solution
Now, we can use two great tools that Pandas offers to quickly convert string series into Timestamps (pd.to_datetime) and Timedelta (pd.to_timedelta), for absolute date-times and durations, respectively.
In both cases, we use errors='coerce' to convert what is convertible, and leave the rest to NaN.
origin = pd.Timestamp('2019-01-01 00:00:00') # origin for absolute dates
a = pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce') - origin
b = pd.to_timedelta(vals, errors='coerce')
tdelta = a.where(~a.isna(), b)
hours = tdelta.dt.total_seconds() / 3600
With the above:
>>> hours
0 7049.368333
1 7055.368399
2 7044.000066
3 5.169444
4 40.169479
5 345.169479
dtype: float64
Explanation
Let's examine some of the pieces above. a handles absolute date-times. Before subtraction of origin to obtain a Timedelta, it is still a Series of Timestamps:
>>> pd.to_datetime(vals, format='%Y-%m-%d %H:%M:%S.%f', errors='coerce')
0 2019-10-21 17:22:06.000000
1 2019-10-21 23:22:06.236000
2 2019-10-21 12:00:00.236145
3 NaT
4 NaT
5 NaT
dtype: datetime64[ns]
b handles values that are already expressed as durations:
>>> b
0 NaT
1 NaT
2 NaT
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
tdelta is the merge of the non-NaN values of a and b:
>>> tdelta
0 293 days 17:22:06
1 293 days 23:22:06.236000
2 293 days 12:00:00.236145
3 0 days 05:10:10
4 1 days 16:10:10.123000
5 14 days 09:10:10.123456
dtype: timedelta64[ns]
Of course, you can change your origin to be any particular date of reference.
Addendum
After clarifying comments, it seems that the main issue is how to adapt the solution above (or any similar existing example) to their specific problem.
Using the names seen in the images of the edited question, I would suggest:
# (...)
# df = pd.concat(pd.read_excel(loc1, sheet_name=[3,4,5,6,7,8,9]), ignore_index=False)
# note: if df['Absolute Time'] is still of dtypes str, then do this:
# (adapt format as needed; hard to be sure from the image)
df['Absolute Time'] = pd.to_datetime(
    df['Absolute Time'],
    format='%m.%d.%Y %H:%M:%S.%f',
    errors='coerce')
# origin of time; this may have to be taken over multiple sheets
# if all experiments share an absolute origin
origin = df['Absolute Time'].min()
df['Time in hours'] = (df['Absolute Time'] - origin).dt.total_seconds() / 3600

Pandas convert series of one to datetime object

I have a data frame with a lot of columns and rows, the index column contains datetime objects.
date_time column1 column2
10-10-2010 00:00:00 1 10
10-10-2010 00:00:03 1 10
10-10-2010 00:00:06 1 10
Now I want to calculate the difference in time between the first and last datetime object. Therefore:
start = df["date_time"].head(1)
stop = df["date_time"].tail(1)
However, I now want to extract these datetime values so that I can use .total_seconds() to calculate the number of seconds between the two datetime objects, something like:
delta_t_seconds = (start - stop).total_seconds()
This, however, doesn't give the desired result, since start and stop are still Series with only one member each.
Please help.
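A minimal sketch of one way to do this: pull scalar Timestamps out of the single-element Series with .iloc before subtracting, so the difference is a plain Timedelta with a total_seconds() method.

```python
import pandas as pd

df = pd.DataFrame(
    {"date_time": pd.to_datetime(["10-10-2010 00:00:00",
                                  "10-10-2010 00:00:03",
                                  "10-10-2010 00:00:06"])}
)

# .iloc[0] / .iloc[-1] return scalar Timestamps (unlike head(1)/tail(1),
# which return one-element Series), so subtraction yields a Timedelta.
start = df["date_time"].iloc[0]
stop = df["date_time"].iloc[-1]
delta_t_seconds = (stop - start).total_seconds()
print(delta_t_seconds)  # 6.0
```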

Drop certain character in Object before converting to Datetime column in Pandas

My dataframe has a column which measures time difference in the format HH:MM:SS.000
The DataFrame is created from an Excel file; the column which stores the time difference has object dtype. Some entries have a negative time difference. The negative sign doesn't matter to me and needs to be removed from the time, as it breaks a condition I'm filtering on.
Note: I only have the negative time difference there because of the issue I'm currently having.
I've tried the following functions, but I get errors because some of the time-difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column is in the form 00:00:00.000? And then how would I remove the '-' from some of the data?
Here's a sample of the time-diff column. I can't convert this column to datetime because some of the entries don't have 3 digits after the decimal. Is there a way to iterate through the column and add a 0 if the length of the value isn't equal to 12 digits?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse to timedelta, e.g. with the following function:
import pandas as pd
def clean_td_string(s):
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s
Applied to a DataFrame column, this looks like:
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
