I am importing the datetime library in my Python program and taking the duration of multiple events. Below is my code for that:
d1 = datetime.datetime.strptime(starttime, '%Y-%m-%d:%H:%M:%S')
d2 = datetime.datetime.strptime(endtime, '%Y-%m-%d:%H:%M:%S')
duration = d2 - d1
print(str(duration))
Now I have a value in the variable "duration". The output of this will be:
0:00:15
0:00:15
0:00:15
0:00:15
0:00:15
0:00:05
0:00:05
0:00:05
0:00:05
0:00:05
0:00:10
0:00:10
0:00:10
0:00:10
0:45:22
I want to take the standard deviation of all the durations and determine if there is an anomaly. For example, the 0:45:22 is an anomaly and I want to detect that. I could do this if I knew what format the duration was in, but it doesn't appear to be digits or anything. I was thinking about splitting the values on ':' and using the parts in between, but there might be a better way.
Ideas?
You have datetime.timedelta() objects. These have .microseconds, .seconds and .days attributes, all three integers. The str() representation formats these as [D day[s], ][H]H:MM:SS[.UUUUUU], as needed to fit the values present.
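For example, a quick interpreter session showing the attributes and the string form:
>>> from datetime import timedelta
>>> td = timedelta(days=1, seconds=5, microseconds=250)
>>> td.days, td.seconds, td.microseconds
(1, 5, 250)
>>> str(td)
'1 day, 0:00:05.000250'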
You can use simple arithmetic on these objects. Summing and division work as expected, for example:
>>> (timedelta(seconds=100) + timedelta(seconds=200)) / 2
datetime.timedelta(0, 150)
Unfortunately, you cannot multiply two timedeltas, so calculating a standard deviation directly on them becomes tricky (no squaring of offsets).
Instead, I'd use the .total_seconds() method, to give you a floating point value that is calculated from the days, seconds and microseconds values, then use those values to calculate a standard deviation.
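A minimal sketch using the standard-library statistics module (assuming durations is your list of timedelta objects; the 2-sigma cutoff is an arbitrary choice):
import statistics

seconds = [td.total_seconds() for td in durations]
mean = statistics.mean(seconds)
std = statistics.pstdev(seconds)  # population standard deviation
# flag anything more than 2 standard deviations from the mean
anomalies = [td for td in durations
             if abs(td.total_seconds() - mean) > 2 * std]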
The duration objects you are getting are timedelta objects, i.e. durations from one timestamp to another. To convert one to a total number of microseconds, use:
def timedelta_to_microtime(td):
    return abs(td.microseconds + (td.seconds + td.days * 86400) * 1000000)
Then calculate the standard deviation:
import math

def calc_std(L):
    n = len(L)
    mean = sum(L) / float(n)
    dev = [x - mean for x in L]
    dev2 = [x * x for x in dev]
    return math.sqrt(sum(dev2) / n)
So:
timedeltas = [your timedeltas here..]
microtimes = [timedelta_to_microtime(td) for td in timedeltas]
mean = sum(microtimes) / float(len(microtimes))
std = calc_std(microtimes)
# flag durations more than X standard deviations from the mean
print([(td, mstime)
       for (td, mstime) in zip(timedeltas, microtimes)
       if abs(mstime - mean) > X * std])
To convert a time string to the number of milliseconds it represents, I created the following function:
time_str = '1:16.435'

def milli(time_str):
    m, s = time_str.split(':')
    return int(int(m) * 60000 + float(s) * 1000)

milli(time_str)
But I'm wondering if there is a native Python function to do this directly.
You can easily make it longer and more complicated with datetime:
import datetime

dateobj = datetime.datetime.strptime("1:16.435", "%M:%S.%f")
timeobj = dateobj.time()
print(timeobj.minute * 60000 + timeobj.second * 1000 + timeobj.microsecond / 1000)
76435.0
Now you have two additions, two multiplications, and even a division. And bonus points for importing a module, of course. I like your original code more.
Since you're looking for functions to do this for you, you can take advantage of the timedelta object, which has .total_seconds(). This way you don't have to do that calculation yourself. Just create your datetime objects, then subtract them:
from datetime import datetime
datetime_obj = datetime.strptime("1:16.435", "%M:%S.%f")
start_time = datetime(1900, 1, 1)
print((datetime_obj - start_time).total_seconds() * 1000)
output:
76435.0
The reason for choosing datetime(1900, 1, 1) is that when you use strptime with that format, it fills in the missing fields to give this form: 1900-01-01 00:01:16.435000.
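You can see that default fill-in directly in the interpreter:
>>> from datetime import datetime
>>> datetime.strptime("1:16.435", "%M:%S.%f")
datetime.datetime(1900, 1, 1, 0, 1, 16, 435000)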
If your string changes to include an hour, for example, you just need to change your format and it works as expected. No need to change your formula or add another calculation:
datetime_obj = datetime.strptime("1:1:16.435", "%H:%M:%S.%f")
start_time = datetime(1900, 1, 1)
print((datetime_obj - start_time).total_seconds() * 1000)
I have two datetime64 objects, a and b, and I want to determine if they are within a certain range of each other. However, the range is not symmetrical. If a is within -30 and 120 minutes of b (a is between half an hour earlier and two hours later than b), the two are within the desired range. My datetime objects look like %m/%d/%Y %H:%M. I tried saying:
difference = a - b
if (-30 <= difference <= 120):
#Do stuff
However, this doesn't work because difference is not in minutes. I am not sure how to perform this comparison. I know timedelta can be used for datetime comparisons, but I don't know how to use it with an asymmetric range like this one.
Thanks.
Compare the timedelta difference to two other timedeltas:
from datetime import timedelta
if timedelta(minutes=-30) <= difference <= timedelta(minutes=120):
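A self-contained sketch with made-up timestamps, just to show the comparison in context:
from datetime import datetime, timedelta

# hypothetical example values
a = datetime(2023, 1, 1, 12, 45)
b = datetime(2023, 1, 1, 12, 0)

difference = a - b  # a is 45 minutes later than b
if timedelta(minutes=-30) <= difference <= timedelta(minutes=120):
    print("within range")  # this branch is taken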
You can, I think, build upon the accepted answer at Time difference in seconds from numpy.timedelta64.
>>> import numpy as np
>>> a = np.datetime64('2012-05-01T01:00:00.000000+0000')
>>> b = np.datetime64('2012-05-15T01:00:00.000000+0000')
>>> diff=b-a
>>> diff.item().total_seconds()
1209600.0
>>> minutes = 1209600.0/60
>>> minutes
20160.0
>>> -30 <= minutes <= 120
False
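If I remember correctly, you can also skip the conversion to minutes entirely and compare against np.timedelta64 bounds directly (a sketch using naive timestamps):
import numpy as np

a = np.datetime64('2012-05-01T01:00:00')
b = np.datetime64('2012-05-15T01:00:00')
diff = a - b  # note the order: a minus b, as in the question

# timedelta64 values compare across units, so no manual conversion is needed
if np.timedelta64(-30, 'm') <= diff <= np.timedelta64(120, 'm'):
    print("within range")  # not printed here: a is 14 days earlier than b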
You could also convert all values to seconds and then compare those values. Note that the lower bound must be negative, since a may be up to 30 minutes earlier than b:
difference_in_seconds = difference.total_seconds()
minus_thirty_minutes_in_seconds = -1800
one_hundred_and_twenty_minutes_in_seconds = 7200

if minus_thirty_minutes_in_seconds <= difference_in_seconds <= one_hundred_and_twenty_minutes_in_seconds:
    # Do stuff
I'm trying to generate a pandas.DatetimeIndex with a sample frequency of 5120 Hz. That gives a period of increment=0.0001953125 seconds.
If you use pandas.date_range(), you need to specify the frequency (parameter freq) as a str or as a pandas.DateOffset. The former can only handle an accuracy of up to 1 ns; the latter has terrible performance compared to the str, and an even worse error.
When using the string, I construct it as follows:
freq = str(int(increment * 1e9)) + 'N'
which parses my 270 MB file in less than 2 seconds, but gives an error (in the DatetimeIndex) of about 1500 µs after 3 million records.
When using the pandas.DateOffset, like this
freq=pd.DateOffset(seconds=increment)
it parses the file in 1 minute and 14 seconds, but has an error of about a second.
I also tried constructing the DatetimeIndex using
starttime + pd.to_timedelta(cumulativeTimes, unit='s')
This sum also takes ages to complete, but it's the only approach without the error in the resulting DatetimeIndex.
How can I achieve performant generation of the DatetimeIndex while keeping my accuracy?
I used a pure numpy implementation to fix this:
import numpy as np

# offset, increment, periods and starttime are known from the file being read
accuracy = 'ns'
relativeTime = np.linspace(
    offset,
    offset + (periods - 1) * increment,
    periods)

def unit_correction(u):
    # factor to convert seconds into the given unit
    if u == 's':
        return 1e0
    elif u == 'ms':
        return 1e3
    elif u == 'us':
        return 1e6
    elif u == 'ns':
        return 1e9

# Because numpy only knows ints as its date datatype,
# convert to the chosen accuracy.
index = (np.datetime64(starttime)
         + (relativeTime * unit_correction(accuracy)).astype(
             "timedelta64[" + accuracy + "]"
         ))
(this is the github pull request for people interested: https://github.com/adamreeve/npTDMS/pull/31)
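As a rough usage sketch for the snippet above (the values here are hypothetical, not from the pull request), a 5120 Hz signal would use inputs like:
starttime = '2017-01-01T00:00:00'  # hypothetical start timestamp
increment = 1.0 / 5120             # sample period in seconds, 0.0001953125
offset = 0.0
periods = 3000000
Because linspace computes each point independently rather than accumulating increments, the error does not grow with the number of records.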
I think I reach a similar result with the function below (although it uses only nanosecond precision):
import numpy as np
import pandas as pd

def date_range_fs(duration, fs, start=0):
    """Create a DatetimeIndex based on sampling frequency and duration.

    Args:
        duration: number of seconds contained in the DatetimeIndex
        fs: sampling frequency
        start: timestamp at which the DatetimeIndex starts (defaults to
            the POSIX epoch)

    Returns: the corresponding DatetimeIndex
    """
    return pd.to_datetime(
        np.linspace(0, 1e9 * duration, num=int(fs * duration), endpoint=False),
        unit='ns',
        origin=start)
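For example (hypothetical numbers), one second of samples at 5120 Hz starting at the POSIX epoch:
idx = date_range_fs(duration=1, fs=5120)
print(idx[0], idx[1] - idx[0])  # epoch start, samples 1/5120 s apart (to the nearest ns)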
I have times in SQLite in the form of '2012-02-21 00:00:00.000000' and would like to average times of day together. Dates don't matter--just times. So, e.g., if the data is:
'2012-02-18 20:00:00.000000'
'2012-02-19 21:00:00.000000'
'2012-02-20 22:00:00.000000'
'2012-02-21 23:00:00.000000'
The average of 20, 21, 22, and 23 should be 21.5, or 21:30 (or 9:30pm in the U.S.).
Q1) Is there a best way to do this in a SELECT query in SQLite?
But more difficult: what if one or more of the datetimes crosses midnight? They definitely will in my data set. Example:
'2012-02-18 22:00:00.000000'
'2012-02-19 23:00:00.000000'
'2012-02-21 01:00:00.000000'
Now the average seems like it should be (22 + 23 + 1)/3 = 15.33 or 15:20 (3:20pm). But that would misrepresent the data, as these events are all happening at night, from 22:00 to 01:00 (10pm to 1am). Really, the better approach would be to average them like (22 + 23 + 25)/3 = 23.33 or 23:20 (11:20pm).
Q2) Is there anything I should do to my SELECT query to take this into account, or is this something I have to code in Python?
What do you really want to compute?
Datetimes (or times within one day) are usually represented as real numbers.
Time coordinates on a 24-hour clock, however, behave like points on a circle, i.e. complex numbers.
An average of the real-number representations of the times will give you dubious results...
I don't know what you want to do with edge cases like [1:00, 13:00], but let's consider the following example: [01:30, 06:30, 13:20, 15:30, 16:15, 16:45, 17:10]
I suggest implementing this algorithm, in Python:
convert the times to complex numbers, i.e. compute their coordinates on a circle of radius 1
compute the average using vector addition
convert the result vector's angle back to minutes, and compute the relevance of the result (e.g. the relevance of the average of [1:00, 13:00] should be 0, whatever angle is computed, because of rounding errors)
import math

def complex_average(minutes):
    # first convert the times from minutes (0:00 - 23:59) to radians
    # so we get a list of quasi-polar coordinates (1, radians)
    # (no point in rotating/flipping to get real polar coordinates)
    # 180° = 1/2 day = 24*60/2 minutes
    radians = [t * math.pi / (24 * 60 / 2) for t in minutes]
    xs = []
    ys = []
    for r in radians:
        # convert polar coordinates (1, r) to cartesian (x, y);
        # the vectors start at (0, 0) and end at (x, y)
        x, y = math.cos(r), math.sin(r)
        xs.append(x)
        ys.append(y)
    # result vector = vector addition
    sum_x, sum_y = sum(xs), sum(ys)
    # convert result vector coordinates to radians, then to minutes
    # note the cumulative ROUNDING ERRORS, however
    result_radians = math.atan2(sum_y, sum_x)
    result_minutes = int(result_radians / math.pi * (24 * 60 / 2))
    if result_minutes < 0:
        result_minutes += 24 * 60
    # relevance = magnitude of the result vector / number of data points
    # (<0.0001 means the vectors cancel each other out, e.g. [1:00, 13:00]
    # => result_minutes would be random due to rounding error)
    # FYI: standard_deviation = 6*60 - 6*60*relevance
    relevance = round(math.sqrt(sum_x**2 + sum_y**2) / len(minutes), 4)
    return result_minutes, relevance
And test it like this:
# let's say the select returned a bunch of integers in minutes representing times
selected_times = [90, 390, 800, 930, 975, 1005, 1030]
# or create other test data:
#selected_times = [hour * 60 for hour in [23, 22, 1]]

complex_avg_minutes, relevance = complex_average(selected_times)
print("complex_avg_minutes = {:02}:{:02}".format(complex_avg_minutes // 60,
                                                 complex_avg_minutes % 60),
      "(relevance = {}%)".format(int(round(relevance * 100))))

simple_avg = int(sum(selected_times) / len(selected_times))
print("simple_avg = {:02}:{:02}".format(simple_avg // 60,
                                        simple_avg % 60))

hh_mm = ["{:02}:{:02}".format(t // 60, t % 60) for t in selected_times]
print("\ntimes = {}".format(hh_mm))
Output for my example:
complex_avg_minutes = 15:45 (relevance = 44%)
simple_avg = 12:25
I'm not sure you can average dates.
What I would do is get the average of the difference in hours between the row values and a fixed date, then add that average to the fixed date. Using minutes may cause an int overflow and require some type conversion.
Sort of (note this is SQL Server syntax; SQLite has no dateadd/datediff):
select dateadd(hh, avg(datediff(hh, getdate(), myrow)), getdate())
from mytable;
If I understand correctly, you want to get the average distance of the times from midnight?
How about this?
SELECT SUM(mins) / COUNT(*) FROM
  ( SELECT
      CASE
        WHEN strftime('%H', t) * 1 BETWEEN 0 AND 11
        THEN strftime('%H', t) * 60 + strftime('%M', t)
        ELSE strftime('%H', t) * 60 + strftime('%M', t) - 24 * 60
      END mins
    FROM timestamps
  );
So we calculate each timestamp's offset in minutes from midnight: after noon we get a negative value, before noon a positive one. The outer SELECT averages them and gives us a result in minutes. Converting that back to a hh:mm time is left as an "exercise for the student" ;-)
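That said, the conversion back is only a couple of lines in Python; a small sketch (negative values from the query mean "before midnight", so wrap around):
def minutes_to_hhmm(mins):
    # wrap negative "before midnight" offsets onto the 24-hour clock
    mins = int(round(mins)) % (24 * 60)
    return "{:02}:{:02}".format(mins // 60, mins % 60)

print(minutes_to_hhmm(-40))   # 23:20, the night-time example from the question
print(minutes_to_hhmm(1290))  # 21:30, the first example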
The Rosetta Code site has a task and code on this subject, and in researching that I came across a Wikipedia article on the topic. Check out the talk/discussion pages too for discussions on applicability etc.
I have a list of timestamps in "%H:%M:%S" format. For example
09:50:08.650000
09:50:08.665000
09:50:08.820000
09:50:08.877000
09:50:09.897000
09:50:09.907000
09:50:09.953000
09:50:10.662000
09:50:10.662000
I need to efficiently compute, in Python, the time difference in milliseconds between each consecutive pair of lines.
%H:%M:%S.%f is the format string to be used when parsing the times. See http://docs.python.org/library/datetime.html#strftime-strptime-behavior
import datetime
times = """
09:50:08.650000
09:50:08.665000
09:50:08.820000
09:50:08.877000
09:50:09.897000
09:50:09.907000
09:50:09.953000
09:50:10.662000
09:50:10.662000
""".split()
# parse all times
times = [datetime.datetime.strptime(x, "%H:%M:%S.%f") for x in times]
for i in range(len(times) - 1):
    # compute timedelta between current and next time in the list
    print(times[i + 1] - times[i])
The result:
0:00:00.015000
0:00:00.155000
0:00:00.057000
0:00:01.020000
0:00:00.010000
0:00:00.046000
0:00:00.709000
0:00:00
To output the difference in milliseconds:
delta = times[i + 1] - times[i]
# integer division keeps the result in whole milliseconds
print((delta.days * 24 * 60 * 60 + delta.seconds) * 1000 + delta.microseconds // 1000)
Note that timedelta only stores days, seconds and microseconds internally. Other units are converted.
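On Python 2.7 and later you can also let timedelta do the unit conversion for you via total_seconds(); a shorter equivalent:
delta = times[i + 1] - times[i]
print(int(delta.total_seconds() * 1000))  # milliseconds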
Have you tried the datetime.strptime() function? It will read the datetime in as a string and convert it to a datetime object.
You can then subtract two datetime objects to get a datetime.timedelta and compute the difference in milliseconds from that.
Documentation here: http://docs.python.org/library/datetime.html#strftime-strptime-behavior