Get average time of day in SQLite from datetimes - python

I have times in SQLite in the form of '2012-02-21 00:00:00.000000' and would like to average times of day together. Dates don't matter--just times. So, e.g., if the data is:
'2012-02-18 20:00:00.000000'
'2012-02-19 21:00:00.000000'
'2012-02-20 22:00:00.000000'
'2012-02-21 23:00:00.000000'
The average of 20, 21, 22, and 23 should be 21.5, or 21:30 (or 9:30pm in the U.S.).
Q1) Is there a best way to do this in a SELECT query in SQLite?
But more difficult: what if one or more of the datetimes crosses midnight? They definitely will in my data set. Example:
'2012-02-18 22:00:00.000000'
'2012-02-19 23:00:00.000000'
'2012-02-21 01:00:00.000000'
Now the average seems like it should be (22 + 23 + 1)/3 = 15.33 or 15:20 (3:20pm). But that would misrepresent the data, as these events are all happening at night, from 22:00 to 01:00 (10pm to 1am). Really, the better approach would be to average them like (22 + 23 + 25)/3 = 23.33 or 23:20 (11:20pm).
Q2) Is there anything I should do to my SELECT query to take this into account, or is this something I have to code in Python?

What do you really want to compute?
Datetimes (or times within one day) are usually represented as real numbers, but times on a 24-hour clock are really points on a circle (complex numbers).
Averaging the real-number representations of the times will give you dubious results...
I don't know what you want to do with edge cases like [1:00, 13:00], but let's consider the following example: [01:30, 06:30, 13:20, 15:30, 16:15, 16:45, 17:10]
I suggest implementing this algorithm - in Python:
convert times to complex numbers - e.g. compute their coordinates on a circle of radius = 1
compute the average using vector addition
convert the result vector's angle to minutes + compute the relevance of this result (e.g. the relevance of the average of [1:00, 13:00] should be 0 whatever angle is computed, because of rounding errors)
import math

def complex_average(minutes):
    # first convert the times from minutes (0:00 - 23:59) to radians,
    # so we get a list of quasi polar coordinates (1, radians)
    # (no point in rotating/flipping to get real polar coordinates)
    # 180° = 1/2 day = 24*60/2 minutes
    radians = [t * math.pi / (24 * 60 / 2) for t in minutes]
    xs = []
    ys = []
    for r in radians:
        # convert polar coordinates (1, r) to cartesian (x, y);
        # the vectors start at (0, 0) and end at (x, y)
        x, y = math.cos(r), math.sin(r)
        xs.append(x)
        ys.append(y)
    # result vector = vector addition
    sum_x, sum_y = sum(xs), sum(ys)
    # convert result vector coordinates to radians, then to minutes
    # (note the cumulative ROUNDING ERRORS, however)
    result_radians = math.atan2(sum_y, sum_x)
    result_minutes = int(result_radians / math.pi * (24 * 60 / 2))
    if result_minutes < 0:
        result_minutes += 24 * 60
    # relevance = magnitude of the result vector / number of data points
    # (< 0.0001 means the vectors cancel each other out, e.g. [1:00, 13:00],
    # so result_minutes would be random due to rounding error)
    # FYI: standard_deviation = 6*60 - 6*60*relevance
    relevance = round(math.sqrt(sum_x**2 + sum_y**2) / len(minutes), 4)
    return result_minutes, relevance
And test it like this:
# let's say the select returned a bunch of integers in minutes representing times
selected_times = [90, 390, 800, 930, 975, 1005, 1030]
# or create other test data:
#selected_times = [hour*60 for hour in [23, 22, 1]]
complex_avg_minutes, relevance = complex_average(selected_times)
print("complex_avg_minutes = {:02}:{:02}".format(complex_avg_minutes // 60,
                                                 complex_avg_minutes % 60),
      "(relevance = {}%)".format(int(round(relevance * 100))))
simple_avg = int(sum(selected_times) / len(selected_times))
print("simple_avg = {:02}:{:02}".format(simple_avg // 60,
                                        simple_avg % 60))
hh_mm = ["{:02}:{:02}".format(t // 60, t % 60) for t in selected_times]
print("\ntimes = {}".format(hh_mm))
Output for my example:
complex_avg_minutes = 15:45 (relevance = 44%)
simple_avg = 12:25

I'm not sure you can average dates.
What I would do is take the average of the difference in hours between the row values and a fixed date, then add that average back to the fixed date. Using minutes may cause an int overflow and require some type conversion.
sort of...
select dateadd(hh,avg(datediff(hh,getdate(),myrow)),getdate())
from mytable;
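dateadd, datediff and getdate are SQL Server functions, though, and don't exist in SQLite. A rough SQLite sketch of the same idea (table name mytable and column myrow assumed from the snippet above, not from the question) can be built on julianday:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (myrow TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [("2012-02-18 20:00:00",), ("2012-02-19 21:00:00",),
     ("2012-02-20 22:00:00",), ("2012-02-21 23:00:00",)],
)
# julianday(myrow) - julianday(date(myrow)) is the time of day as a
# fraction of a day; time() turns the averaged fraction back into hh:mm:ss
# (the + 0.5 shifts from noon-based julian days to midnight-based days)
(avg_time,) = conn.execute("""
    SELECT time(0.5 + AVG(julianday(myrow) - julianday(date(myrow))))
    FROM mytable
""").fetchone()
print(avg_time)  # → 21:30:00
```

Note this still has the midnight-wraparound problem from Q2; it averages plain time-of-day fractions.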

If I understand correctly, you want to get the average distance of the times from midnight?
How about this?
SELECT SUM(mins) / COUNT(*) FROM
  ( SELECT
      CASE
        WHEN strftime('%H', t) * 1 BETWEEN 0 AND 11
          THEN strftime('%H', t) * 60 + strftime('%M', t)
        ELSE strftime('%H', t) * 60 + strftime('%M', t) - 24 * 60
      END mins
    FROM timestamps
  );
So we calculate the minutes offset from midnight: after noon we get a negative value, before noon is positive. The first line averages them and gives us a result in minutes. Converting that back to a hh:mm time is left as an "exercise for the student" ;-)
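For what it's worth, the "exercise" of turning the (possibly negative) minute offset back into hh:mm might look like this in Python:

```python
def minutes_to_hhmm(mins):
    # negative offsets (times after noon) wrap back onto the 24-hour clock
    mins = int(mins) % (24 * 60)
    return "{:02}:{:02}".format(mins // 60, mins % 60)

print(minutes_to_hhmm(-40))  # → 23:20
print(minutes_to_hhmm(90))   # → 01:30
```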

The Rosetta Code site has a task and code on this subject, and in researching that I came across this wikipedia link. Check out the talk/discussion pages too for discussions on applicability etc.

Related

How can I ensure that a pandas date range has even spacing?

I am writing some code to interpolate some data over space (x, y) and time. The data needs to be on a regular grid. I can't seem to write a generalized function that finds a date range with regular spacing. The range that fails for me is:
date_min = numpy.datetime64('2022-10-24T00:00:00.000000000')
date_max = numpy.datetime64('2022-11-03T00:00:00.000000000')
And it needs to roughly match the current values of times I have, which for this case is 44.
periods = 44
I tried testing if the time difference is divisible by 2 and then adding 1 to the number of periods, which worked for a lot of cases, but it doesn't seem to really work for this time range:
def unique_diff(x):
    return numpy.unique(numpy.diff(x))

unique_diff(pd.date_range(date_min, date_max, periods=periods))
Out[31]: array([20093023255813, 20093023255814], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods+1))
Out[32]: array([19636363636363, 19636363636364], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods-1))
Out[33]: array([20571428571428, 20571428571429], dtype='timedelta64[ns]')
However, it does work for +2:
unique_diff(pd.date_range(date_min, date_max, periods=periods+2))
Out[34]: array([19200000000000], dtype='timedelta64[ns]')
I could just keep trying different period deltas until I get a solution, but I would rather know why this is happening and how I can generalize this problem for any min/max times with a target number of periods
Your date range doesn't divide evenly by the periods in nanosecond resolution:
# as the range contains both start and end, there's one step fewer than there are periods
steps = periods - 1
int(date_max - date_min) / steps
# 20093023255813.953
A solution could be to round up (or down) your max date, to make it divide evenly in nanosecond resolution:
date_max_r = (date_min +
              int(numpy.ceil(int(date_max - date_min) / steps) * steps))
unique_diff(pd.date_range(date_min, date_max_r, periods=periods))
# array([20093023255814], dtype='timedelta64[ns]')

Python function that mimics the distribution of my dataset

I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:
Date                | Item Name     | Price
--------------------|---------------|------
2021-10-09 07:10:00 | Water Bottle  | 1.5
2021-10-09 12:30:60 | Pizza         | 12
2021-10-09 17:07:56 | Chocolate bar | 3
Those orders are time-dependent. Nobody will eat a pizza at midnight, usually. There will be more 3PM Sunday orders than there will be 3PM Monday orders (because people are at work). I want to extract the daily order distribution for each weekday (Monday till Sunday) from those few thousand orders so I can generate new orders later that fits this distribution. I do not want to fill in the gaps in my dataset.
How can I do so?
I want to create a generate_order_date() function that would generate a random hours:minutes:seconds depending on the day. I can already identify which weekday a date corresponds to. I just need to extract the 7 daily distributions so I can call my function like this:
generate_order_date(day=Monday, nb_orders=1)
[12:30:00]
generate_order_date(day=Friday, nb_orders=5)
[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]
Generated timestamps do not have to be in chronological order. Just like if I were calling
np.random.normal(mu, sigma, 1000)
Try np.histogram(data):
https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
The first element of the returned tuple gives you the counts, which (normalised) would be your distribution. You can visualise it with
plt.plot(np.histogram(data)[0])
data here would be the time of day a particular item was ordered. For this approach, I would suggest rounding your times to 5-minute intervals or more, depending on the frequency. For example, round 12:34pm to 12:30 and 12:36pm to 12:35pm. Choose a suitable frequency.
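The suggested rounding can be done with pandas' dt.floor, assuming the order times are in a datetime column; the timestamps here are made up:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(["2021-10-09 12:34:00",
                                  "2021-10-09 12:36:00",
                                  "2021-10-09 12:33:10"]))
# snap each order time down to its 5-minute bin
binned = times.dt.floor("5min")
print(binned.dt.strftime("%H:%M").tolist())  # → ['12:30', '12:35', '12:30']
```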
Another method would be scipy.stats.gaussian_kde. This uses a Gaussian kernel. Below is an implementation I have used previously:
def get_kde(df: pd.DataFrame) -> list:
    xs = np.round(np.linspace(-1, 1, 3000), 3)
    kde = gaussian_kde(df.values)
    kde_vals = np.round(kde(xs), 3)
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data
where df.values is your data. There are plenty more kernels which you can use to get a density estimate. The most suitable one depends on the nature of your data.
Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation
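To actually draw new order times from a histogram-based estimate like the one above, one option (a sketch with made-up data) is to sample bin centers weighted by the bin counts:

```python
import numpy as np

rng = np.random.default_rng(0)
# made-up order times for one weekday, in minutes since midnight
observed = rng.choice([750, 780, 1155], size=500, p=[0.5, 0.3, 0.2])
# 48 half-hour bins over the 24-hour day
counts, edges = np.histogram(observed, bins=48, range=(0, 1440))
centers = (edges[:-1] + edges[1:]) / 2
# draw 5 new order times from the empirical distribution
new_times = rng.choice(centers, size=5, p=counts / counts.sum())
print(["{:02.0f}:{:02.0f}".format(t // 60, t % 60) for t in new_times])
```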
Here's a rough sketch of what you could do:
Assumption: column Date of the DataFrame df contains datetimes. If not, do df.Date = pd.to_datetime(df.Date) first.
First 2 steps:
import pickle
from datetime import datetime, timedelta

import numpy as np
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)
STEP 1: Build a normalised column Seconds (values between 0 and 1). Then group over the weekdays (numbered 0, ..., 6) and prepare, for every day of the week, the data for kernel density estimation.
STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from Scikit-learn and pickle the results.
Based on these estimates build the desired sample function:
# STEP 3: Sampling
with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime(fmt)
        for s in kde.sample(orders)[:, 0]
    ]
I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.

Poisson in sales

I see that Poisson is often used to estimate the number of sales in a certain time period (month, for example).
from scipy import stats

monthly_average_sales = 30
current_month_sales = 35
mu = monthly_average_sales
x = current_month_sales
up_to_35 = stats.poisson.pmf(x, mu)
above_35 = 1 - up_to_35
Suppose I want to estimate the probability that a specific order will close this month. Is this possible? For example, today is the 15th. If a customer initially called me on the 1st of the month, what is the probability that they will place the order before the month is over? They might place the order tomorrow (the 16th) or on the last day of the month. I don't care when, as long as it's by the end of this month.
from scipy import stats

monthly_average_sales = 30
current_sale_days_open = 15
number_of_days_this_month = 31
equivalent_number_of_sales = number_of_days_this_month / current_sale_days_open
mu = monthly_average_sales
x = equivalent_number_of_sales
up_to_days_open = stats.poisson.pmf(x, mu)
above_days_open = 1 - up_to_days_open
I don't want to abuse statistics to the point that they become meaningless (I'm not a politician!). Am I going about this the right way?
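One caveat on the first snippet above: poisson.pmf(x, mu) is the probability of exactly x sales, while the probability of at most x sales is the cumulative distribution, poisson.cdf. A quick comparison:

```python
from scipy import stats

mu = 30  # monthly average sales

p_exactly_35 = stats.poisson.pmf(35, mu)  # P(X == 35)
p_up_to_35 = stats.poisson.cdf(35, mu)    # P(X <= 35)
p_above_35 = 1 - p_up_to_35               # P(X > 35)
print(p_exactly_35, p_up_to_35, p_above_35)
```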

pandas.date_range accurate freq parameter

I'm trying to generate a pandas.DateTimeIndex with a sample frequency of 5120 Hz. That gives a period of increment=0.0001953125 seconds.
If you use pandas.date_range(), you need to specify the frequency (parameter freq) as a str or as a pandas.DateOffset. The former can only handle an accuracy of up to 1 ns; the latter has terrible performance compared to the str and an even worse error.
When using the string, I construct it as follows:
freq=str(int(increment*1e9)) + 'N'
which parses my 270 MB file in less than 2 seconds, but gives an error (in the DateTimeIndex) of about 1500 µs after 3 million records.
When using the pandas.DateOffset, like this
freq=pd.DateOffset(seconds=increment)
it parses the file in 1 minute and 14 seconds, but has an error of about a second.
I also tried constructing the DateTimeIndex using
starttime + pd.to_timedelta(cumulativeTimes, unit='s')
This sum also takes ages to complete, but is the only approach that doesn't have the error in the resulting DateTimeIndex.
How can I achieve a performant generation of the DateTimeIndex, keeping my accuracy?
I used a pure numpy implementation to fix this:
accuracy = 'ns'
relativeTime = np.linspace(
    offset,
    offset + (periods - 1) * increment,
    periods)

def unit_correction(u):
    # note: compare strings with ==, not `is`
    if u == 's':
        return 1e0
    elif u == 'ms':
        return 1e3
    elif u == 'us':
        return 1e6
    elif u == 'ns':
        return 1e9

# Because numpy only knows ints as its date datatype,
# convert to the chosen accuracy.
return (np.datetime64(starttime)
        + (relativeTime * unit_correction(accuracy)).astype(
            "timedelta64[" + accuracy + "]"
        ))
(this is the github pull request for people interested: https://github.com/adamreeve/npTDMS/pull/31)
I think I reach a similar result with the function below (although it uses only nanosecond precision):
def date_range_fs(duration, fs, start=0):
    """Create a DatetimeIndex based on sampling frequency and duration

    Args:
        duration: number of seconds contained in the DatetimeIndex
        fs: sampling frequency
        start: timestamp at which the DatetimeIndex starts (defaults to
            POSIX epoch)

    Returns: the corresponding DatetimeIndex
    """
    return pd.to_datetime(
        np.linspace(0, 1e9 * duration, num=int(fs * duration), endpoint=False),
        unit='ns',
        origin=start)

take standard deviation of datetime in python

I am importing the datetime library in my python program and am taking the duration of multiple events. Below is my code for that:
d1 = datetime.datetime.strptime(starttime, '%Y-%m-%d:%H:%M:%S')
d2 = datetime.datetime.strptime(endtime, '%Y-%m-%d:%H:%M:%S')
duration = d2 - d1
print str(duration)
Now I have a value in the variable "duration". The output of this will be:
0:00:15
0:00:15
0:00:15
0:00:15
0:00:15
0:00:05
0:00:05
0:00:05
0:00:05
0:00:05
0:00:10
0:00:10
0:00:10
0:00:10
0:45:22
I want to take the standard deviation of all the durations and determine if there is an anomaly. For example, the 0:45:22 is an anomaly and I want to detect that. I could do this if I knew what format the values were in, but they don't appear to be plain digits or strings... I was thinking about splitting the values on ':' and using the parts in between, but there might be a better way.
Ideas?
You have datetime.timedelta() objects. These have .microseconds, .seconds and .days attributes, all 3 integers. The str() string representation represents those as [D day[s], ][H]H:MM:SS[.UUUUUU] as needed to fit all values present.
You can use simple arithmetic on these objects. Summing and division work as expected, for example:
>>> (timedelta(seconds=100) + timedelta(seconds=200)) / 2
datetime.timedelta(0, 150)
Unfortunately, you cannot multiply two timedeltas and calculating a standard deviation thus becomes tricky (no squaring of offsets).
Instead, I'd use the .total_seconds() method, to give you a floating point value that is calculated from the days, seconds and microseconds values, then use those values to calculate a standard deviation.
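A sketch of that .total_seconds() approach using the durations from the question (and the standard library's statistics module):

```python
import statistics
from datetime import timedelta

durations = ([timedelta(seconds=15)] * 5 + [timedelta(seconds=5)] * 5 +
             [timedelta(seconds=10)] * 4 + [timedelta(minutes=45, seconds=22)])
secs = [td.total_seconds() for td in durations]
mean = statistics.mean(secs)
std = statistics.pstdev(secs)
# flag any duration more than 2 standard deviations from the mean
anomalies = [td for td, s in zip(durations, secs) if abs(s - mean) > 2 * std]
print(anomalies)  # → [datetime.timedelta(seconds=2722)]
```

Here only the 0:45:22 outlier is flagged; the 5-, 10- and 15-second durations all sit well within two standard deviations.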
The duration objects you are getting are timedelta objects, i.e. durations from one timestamp to another. To convert one to a total number of microseconds use:
def timedelta_to_microtime(td):
    return abs(td.microseconds + (td.seconds + td.days * 86400) * 1000000)
Then calculate the standard deviation:
def calc_std(L):
    n = len(L)
    mean = sum(L) / float(n)
    dev = [x - mean for x in L]
    dev2 = [x * x for x in dev]
    return math.sqrt(sum(dev2) / n)
So:
timedeltas = [your timedeltas here..]
microtimes = [timedelta_to_microtime(td) for td in timedeltas]
std = calc_std(microtimes)
print [(td, mstime)
       for (td, mstime) in zip(timedeltas, microtimes)
       if mstime - std > X]
