pandas.date_range accurate freq parameter - python

I'm trying to generate a pandas.DatetimeIndex with a sampling frequency of 5120 Hz. That gives a sample period of increment=0.0001953125 seconds.
If you use pandas.date_range(), you need to specify the frequency (parameter freq) as a str or as a pandas.DateOffset. The former can only resolve down to 1 ns; the latter performs far worse than the str and produces an even larger error.
When using the string, I construct it as follows:
freq=str(int(increment*1e9))+'N'
which processes my 270 MB file in less than 2 seconds, but after 3 million records the DatetimeIndex is off by about 1500 µs.
When using the pandas.DateOffset, like this
freq=pd.DateOffset(seconds=increment)
it parses the file in 1 minute and 14 seconds, but has an error of about a second.
I also tried constructing the DatetimeIndex using
starttime + pd.to_timedelta(cumulativeTimes, unit='s')
This sum also takes ages to complete, but it is the only approach that doesn't introduce an error in the resulting DatetimeIndex.
How can I generate the DatetimeIndex quickly while keeping the accuracy?

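I used a pure numpy implementation to fix this:
import numpy as np

# offset, periods, increment and starttime come from the surrounding code
accuracy = 'ns'
relativeTime = np.linspace(
    offset,
    offset + (periods - 1) * increment,
    periods)

def unit_correction(u):
    # Compare strings with ==; `is` tests identity and is not reliable here
    if u == 's':
        return 1e0
    elif u == 'ms':
        return 1e3
    elif u == 'us':
        return 1e6
    elif u == 'ns':
        return 1e9
    raise ValueError("unsupported unit: " + u)

# Because numpy stores dates as integers,
# convert the float offsets to the chosen accuracy.
timestamps = (np.datetime64(starttime)
              + (relativeTime * unit_correction(accuracy)).astype(
                  "timedelta64[" + accuracy + "]"
              )
              )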
(this is the github pull request for people interested: https://github.com/adamreeve/npTDMS/pull/31)

I think I reach a similar result with the function below (although it uses only nanosecond precision):
import numpy as np
import pandas as pd

def date_range_fs(duration, fs, start=0):
    """Create a DatetimeIndex based on sampling frequency and duration.
    Args:
        duration: number of seconds contained in the DatetimeIndex
        fs: sampling frequency
        start: Timestamp at which the DatetimeIndex starts (defaults to POSIX
            epoch)
    Returns: the corresponding DatetimeIndex
    """
    return pd.to_datetime(
        # num must be an integer, so cast the sample count explicitly
        np.linspace(0, 1e9*duration, num=int(fs*duration), endpoint=False),
        unit='ns',
        origin=start)
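A note on where the drift in the question comes from: at 5120 Hz the sample period is 195312.5 ns, which is not a whole number of nanoseconds. Truncating it to 195312 ns loses 0.5 ns per sample, and 3 million samples times 0.5 ns is the 1500 µs error reported. Below is a minimal sketch that avoids the accumulation by computing every offset with integer arithmetic. The helper name `sample_index` and its parameters are illustrative and not part of pandas; it assumes `fs` is a whole number of Hz and that `periods * 1e9` fits in an int64.

```python
import numpy as np
import pandas as pd


def sample_index(start, fs, periods):
    """Hypothetical helper: DatetimeIndex of `periods` samples taken at `fs` Hz."""
    n = np.arange(periods, dtype=np.int64)
    # Offset of sample n in whole nanoseconds: floor(n * 1e9 / fs).
    # Each offset is computed on its own, so rounding never accumulates.
    offsets_ns = (n * 1_000_000_000) // int(fs)
    return pd.Timestamp(start) + pd.to_timedelta(offsets_ns, unit="ns")


idx = sample_index("2020-01-01", fs=5120, periods=3_000_001)
print(idx[-1] - idx[0])
```

With these inputs the last sample lands exactly 3,000,000 / 5120 = 585.9375 s after the first; individual timestamps are still floored to the nanosecond, so any single value is off by less than 1 ns.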

Related

How can I ensure that a pandas date range has even spacing?

I am writing some code to interpolate some data with space (x, y) and time. The data needs to be on a regular grid. I can't seem to make a generalized function to find a date range with regular spacing. The range that fails for me is:
date_min = numpy.datetime64('2022-10-24T00:00:00.000000000')
date_max = numpy.datetime64('2022-11-03T00:00:00.000000000')
And it needs to roughly match the current values of times I have, which for this case is 44.
periods = 44
I tried testing if the time difference is divisible by 2 and then adding 1 to the number of periods, which worked for a lot of cases, but it doesn't seem to really work for this time range:
def unique_diff(x):
    return numpy.unique(numpy.diff(x))
unique_diff(pd.date_range(date_min, date_max, periods=periods))
Out[31]: array([20093023255813, 20093023255814], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods+1))
Out[32]: array([19636363636363, 19636363636364], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods-1))
Out[33]: array([20571428571428, 20571428571429], dtype='timedelta64[ns]')
However, it does work for +2:
unique_diff(pd.date_range(date_min, date_max, periods=periods+2))
Out[34]: array([19200000000000], dtype='timedelta64[ns]')
I could just keep trying different period deltas until I get a solution, but I would rather know why this is happening and how to generalize the fix for any min/max times and target number of periods.
Your date range doesn't divide evenly by the periods in nanosecond resolution:
# as the range contains both start and end, there is one step fewer than there are periods
steps = periods - 1
int(date_max - date_min) / steps
# 20093023255813.953
A solution could be to round up (or down) your max date, to make it divide evenly in nanosecond resolution:
date_max_r = (date_min +
              int(numpy.ceil(int(date_max - date_min) / (steps)) * (steps)))
unique_diff(pd.date_range(date_min, date_max_r, periods=periods))
# array([20093023255814], dtype='timedelta64[ns]')
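The rounding idea can be generalized along these lines. This is only a sketch: `even_date_range` and `nearest_even_periods` are hypothetical helper names, the first rounds the end up so the span divides evenly, and the second searches outward from the target for a period count that divides the original span exactly (the `max_delta` search limit is an arbitrary assumption).

```python
import numpy as np
import pandas as pd

ONE_NS = pd.Timedelta(1, unit="ns")


def even_date_range(date_min, date_max, periods):
    """Hypothetical helper: evenly spaced range, end rounded up if needed."""
    start = pd.Timestamp(date_min)
    span_ns = (pd.Timestamp(date_max) - start) // ONE_NS  # span as a Python int
    steps = periods - 1
    step_ns = -(-span_ns // steps)  # integer ceiling division, no floats
    offsets = np.arange(periods, dtype=np.int64) * step_ns
    return start + pd.to_timedelta(offsets, unit="ns")


def nearest_even_periods(date_min, date_max, target, max_delta=100):
    """Hypothetical helper: period count nearest `target` that divides evenly."""
    span_ns = (pd.Timestamp(date_max) - pd.Timestamp(date_min)) // ONE_NS
    for delta in range(max_delta + 1):
        for periods in (target + delta, target - delta):
            if periods >= 2 and span_ns % (periods - 1) == 0:
                return periods
    return None  # nothing found within max_delta of the target


date_min = np.datetime64("2022-10-24T00:00:00.000000000")
date_max = np.datetime64("2022-11-03T00:00:00.000000000")
idx = even_date_range(date_min, date_max, 44)
print(np.unique(np.diff(idx.asi8)))
print(nearest_even_periods(date_min, date_max, 44))
```

For the range in the question, the first call yields a single step of 20093023255814 ns and the second returns 46, which agrees with the periods+2 result shown in the question.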

How to incorporate time string to classifier

One of the columns in my dataset looks like the following:
I'm wondering what the best practice is to incorporate this information into the classifier models.
Please help me with it. The project is done in python with the Jupyter Notebook.
You can convert the string to seconds and use that as one of the features:
df = pd.DataFrame(['25m 34s', '1m', '22s'], columns=['game_length'])
df['game_length_seconds'] = pd.to_timedelta(df['game_length']).apply(lambda x: x.seconds)
>>> df
  game_length  game_length_seconds
0     25m 34s                 1534
1          1m                   60
2         22s                   22
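One caveat with the `.seconds` attribute used above: it holds only the seconds component of a timedelta and ignores whole days, so it is safe only while every duration is shorter than 24 hours. A small sketch of the difference, using made-up durations rather than the asker's data:

```python
import pandas as pd

# Illustrative values: one short duration and one longer than a day.
durations = pd.Series(pd.to_timedelta(["0 days 00:25:34", "1 days 00:00:05"]))

seconds_component = durations.dt.seconds      # drops the days part
total_seconds = durations.dt.total_seconds()  # full length, as floats

print(seconds_component.tolist())
print(total_seconds.tolist())
```

For game lengths measured in minutes both give the same numbers; `total_seconds()` is simply the safer default if longer durations can occur.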
You can write a function and map it over that column:
from datetime import datetime

def duration_from_epoc(time: str):
    # Parse the string into a datetime; strptime fills in the missing date,
    # so '25m 28s' becomes 1900-01-01 00:25:28
    date_time = datetime.strptime(time, "%Mm %Ss")
    # Starting datetime to measure from; here you can CHANGE your starting time
    date_time_from = datetime(1970, 1, 1)
    # Return the difference in total seconds
    return (date_time - date_time_from).total_seconds()
Then you can call,
df["game_length"].map(duration_from_epoc)
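For example, the result for '25m 28s' (str) would be -2208987272.0 (float). The value is negative because "25m 28s" is parsed as the datetime 1900-01-01 00:25:28, which lies before the 1970-01-01 starting time.
Overall, the solution calculates the seconds between a reference datetime and the datetime parsed from the duration string. Using datetime(1900, 1, 1) as the reference instead would give the duration itself, 1528.0 seconds. Note that the format string expects both a minutes and a seconds part, so values such as '1m' or '22s' raise a ValueError and need separate handling.
Finally, rather than storing only the duration, consider storing the start and end time of each game so that you can always compute the duration on the fly.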
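Because `strptime` with the "%Mm %Ss" format needs both parts to be present, a small regular-expression parser is one way to handle strings like '1m' or '22s' as well. This is a sketch under the assumption that the column only ever contains an optional minutes part followed by an optional seconds part; the function name `game_length_seconds` is made up for illustration.

```python
import re

# Optional "<digits>m", optional whitespace, optional "<digits>s".
_GAME_LENGTH = re.compile(r"(?:(\d+)m)?\s*(?:(\d+)s)?")


def game_length_seconds(text):
    """Hypothetical helper: convert strings like '25m 34s' to whole seconds."""
    match = _GAME_LENGTH.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized game length: {text!r}")
    minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return minutes * 60 + seconds


lengths = [game_length_seconds(s) for s in ["25m 34s", "1m", "22s"]]
print(lengths)
```

It can be applied to the column in the same way, e.g. `df["game_length"].map(game_length_seconds)`.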

How to get a period of time with Numpy?

If a np.datetime64 type data is given, how to get a period of time around the time?
For example, if np.datetime64('2020-04-01T21:32') is given to a function, I want the function to return np.datetime64('2020-04-01T21:30') and np.datetime64('2020-04-01T21:39') - 10 minutes around the given time.
Is there any way to do this with numpy?
Numpy does not have a built in time period like Pandas does.
If all you want are two time stamps the following function should work.
def ten_minutes_around(t):
    t = ((t.astype('<M8[m]')   # Truncate time to integer minute representation
          .astype('int')       # Convert to integer representation
          // 10) * 10          # Remove any sub 10 minute minutes
         ).astype('<M8[m]')    # convert back to minute timestamp
    return np.array([t, t + np.timedelta64(10, 'm')]).T
For example:
for t in [np.datetime64('2020-04-01T21:32'), np.datetime64('2652-02-03T13:56:03.172')]:
    s, e = ten_minutes_around(t)
    print(s, t, e)
gives:
2020-04-01T21:30 2020-04-01T21:32 2020-04-01T21:40
2652-02-03T13:50 2652-02-03T13:56:03.172 2652-02-03T14:00
and
ten_minutes_around(np.array([
np.datetime64('2020-04-01T21:32'),
np.datetime64('2652-02-03T13:56:03.172'),
np.datetime64('1970-04-01'),
]))
gives
array([['2020-04-01T21:30', '2020-04-01T21:40'],
       ['2652-02-03T13:50', '2652-02-03T14:00'],
       ['1970-04-01T00:00', '1970-04-01T00:10']], dtype='datetime64[m]')
Alternatively, take the minute of the given time modulo 10 and subtract that from the given time to get the start of the period, then add 9 minutes to get the end of the period.
import numpy as np
time = '2020-04-01T21:32'
dt = np.datetime64(time)
base = (dt.tolist().time().minute) % 10  # base would be 2 in this case
start = dt - np.timedelta64(base,'m')
end = start + np.timedelta64(9,'m')
print(start,end,sep='\n')
I hope this helps.
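The two answers above differ in whether the end of the window is exclusive (21:40) or inclusive (21:39). A minimal sketch that combines the integer floor-division approach with the inclusive end the question asked for, and makes the window width a parameter; `window_around` and its `minutes` argument are hypothetical names, and windows are aligned to the Unix epoch, which matches clock time only when the width divides an hour evenly.

```python
import numpy as np


def window_around(t, minutes=10):
    """Hypothetical helper: inclusive [start, end] window containing t."""
    # Truncate to minute resolution, then view the value as minutes since epoch.
    ticks = np.asarray(t).astype("datetime64[m]").astype("int64")
    # Floor to a multiple of `minutes`; floor division also handles pre-1970 dates.
    start = (ticks // minutes * minutes).astype("datetime64[m]")
    end = start + np.timedelta64(minutes - 1, "m")
    return start, end


start, end = window_around(np.datetime64("2020-04-01T21:32"))
print(start, end)

starts, ends = window_around(
    np.array(["2020-04-01T21:32", "1969-12-31T23:57"], dtype="datetime64[m]")
)
print(starts, ends)
```

The function accepts either a single `np.datetime64` or an array of them, like `ten_minutes_around` above.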

Binning custom data based on a unix timestamp

I have a dict of data entries with a UNIX epoch timestamp as the key, and some value (this could be Boolean, int, float, enumerated string). I'm trying to set up a method that takes a start time, end time, and bin size (x minutes, x hours or x days), puts the values in the dict into the array of one of the bins between these times.
Essentially, I'm trying to convert data from the real world measured at a certain time to data occurring on a time-step, starting at time=0 and going until time=T, where the length of the time step can be set when calling the method.
I'm trying to make something along the lines of:
def binTimeSeries(dict, startTime, endTime, timeStep):
    bins = []
    # floor begin time to a timeStep increment
    # ceil end time to a timeStep increment
    for key in dict.keys():
        if key > floorStartTime and key < ceilEndTime:
            timeDiff = (key - floorStartTime)
            binIndex = floor(timeDiff/timeStep)
            bins[binIndex].append(dict[key])
I'm having trouble working out what time format is suitable to do the conversion from UNIX epoch timestamp to, that can handle the floor, ceil and modulo operations given a variable timeStep interval, and then how to actually perform those operations. I've searched for this, but am getting confused with the formalisms for datetime, pandas, and which might be more suitable for this.
Maybe something like this? Instead of asking for a bin size (the interval of each bin), I think it makes more sense to ask how many bins you'd like instead. That way you're guaranteed each bin will be the same size (cover the same interval).
In my example below, I generated some fake data, which I called data. The start- and end-timestamps I picked arbitrarily, as well as the number of bins. I calculate the difference between the end- and start-timestamps, which I'm calling the duration - this yields the total duration between the two timestamps (I realize it's a bit silly to recalculate this value, seeing as how I hardcoded it earlier in the end_time_stamp definition, but it's just there for completeness). The bin_interval (in seconds) can be calculated by dividing the duration by the number of bins.
I ended up doing everything just using plain old UNIX / POSIX timestamps, without any conversion. However, I will mention that datetime.datetime has a method called fromtimestamp, which accepts a POSIX timestamp and returns a datetime object populated with the year, month, seconds, etc.
In addition, in my example, all I end up adding to the bins are the keys - just for demonstration - you'll have to modify it to suit your needs.
def main():
    import time

    values = ["A", "B", "C", "D", "E", "F", "G"]
    data = {time.time() + (offset * 32): value for offset, value in enumerate(values)}

    start_time_stamp = time.time() + 60
    end_time_stamp = start_time_stamp + 75
    number_of_bins = 12

    assert end_time_stamp > start_time_stamp

    duration = end_time_stamp - start_time_stamp
    bin_interval = duration / number_of_bins

    bins = [[] for _ in range(number_of_bins)]

    for key, value in data.items():
        if not (start_time_stamp <= key <= end_time_stamp):
            continue
        for bin_index, current_bin in enumerate(bins):
            if start_time_stamp + (bin_index * bin_interval) <= key < start_time_stamp + ((bin_index + 1) * bin_interval):
                current_bin.append(key)
                break

    print("Original data:")
    for key, value in data.items():
        print(key, value)

    print(f"\nStart time stamp: {start_time_stamp}")
    print(f"End time stamp: {end_time_stamp}\n")
    print(f"Bin interval: {bin_interval}")

    print("Bins:")
    for current_bin in bins:
        print(current_bin)

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Original data:
1573170895.1871762 A
1573170927.1871762 B
1573170959.1871762 C
1573170991.1871762 D
1573171023.1871762 E
1573171055.1871762 F
1573171087.1871762 G
Start time stamp: 1573170955.1871762
End time stamp: 1573171030.1871762
Bin interval: 6.25
Bins:
[1573170959.1871762]
[]
[]
[]
[]
[1573170991.1871762]
[]
[]
[]
[]
[1573171023.1871762]
[]
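If a fixed bin size is preferred after all, as in the question, the floor and ceil steps can also be done directly on the raw POSIX timestamps. The following is only a sketch of that variant: `bin_time_series` is a hypothetical name, `time_step` is assumed to be in seconds (e.g. 60 for one minute, 3600 for one hour), and the bins are aligned to multiples of `time_step` since the epoch.

```python
import math


def bin_time_series(data, start_time, end_time, time_step):
    """Hypothetical helper: group values into fixed-width bins of time_step seconds."""
    # Snap the range outward to whole multiples of time_step.
    floor_start = math.floor(start_time / time_step) * time_step
    ceil_end = math.ceil(end_time / time_step) * time_step
    n_bins = int(round((ceil_end - floor_start) / time_step))
    bins = [[] for _ in range(n_bins)]
    for timestamp, value in data.items():
        if floor_start <= timestamp < ceil_end:
            index = int((timestamp - floor_start) // time_step)
            # Guard against float rounding pushing a value past the last bin.
            bins[min(index, n_bins - 1)].append(value)
    return floor_start, bins


# Made-up data: timestamps in seconds mapped to arbitrary values.
data = {0: "a", 59: "b", 60: "c", 125: "d", 300: "e"}
first_bin_start, bins = bin_time_series(data, start_time=30, end_time=170, time_step=60)
print(first_bin_start, bins)
```

Bin `i` covers the half-open interval from `first_bin_start + i * time_step` up to, but not including, `first_bin_start + (i + 1) * time_step`, which gives the time-step index the question describes.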

take standard deviation of datetime in python

I am importing the datetime library in my python program and am taking the duration of multiple events. Below is my code for that:
d1 = datetime.datetime.strptime(starttime, '%Y-%m-%d:%H:%M:%S')
d2 = datetime.datetime.strptime(endtime, '%Y-%m-%d:%H:%M:%S')
duration = d2 - d1
print(str(duration))
Now I have a value in the variable "duration". The output of this will be:
0:00:15
0:00:15
0:00:15
0:00:15
0:00:15
0:00:05
0:00:05
0:00:05
0:00:05
0:00:05
0:00:10
0:00:10
0:00:10
0:00:10
0:45:22
I want to take the standard deviation of all the durations and determine if there is an anomaly. For example, the 00:45:22 is an anomaly and I want to detect that. I could do this if I knew what format datetime was in, but it doesn't appear to be digits or anything. I was thinking about splitting the values on ':' and using all the values in between, but there might be a better way.
Ideas?
You have datetime.timedelta() objects. These have .microseconds, .seconds and .days attributes, all 3 integers. The str() string representation represents those as [D day[s], ][H]H:MM:SS[.UUUUUU] as needed to fit all values present.
You can use simple arithmetic on these objects. Summing and division work as expected, for example:
>>> (timedelta(seconds=100) + timedelta(seconds=200)) / 2
datetime.timedelta(0, 150)
Unfortunately, you cannot multiply two timedeltas and calculating a standard deviation thus becomes tricky (no squaring of offsets).
Instead, I'd use the .total_seconds() method, to give you a floating point value that is calculated from the days, seconds and microseconds values, then use those values to calculate a standard deviation.
The duration objects you are getting are timedelta objects, that is, durations from one timestamp to another. To convert them to a total number of microseconds use:
def timedelta_to_microtime(td):
    return abs(td.microseconds + (td.seconds + td.days * 86400) * 1000000)
Then calculate the standard deviation:
import math

def calc_std(L):
    n = len(L)
    mean = sum(L) / float(n)
    dev = [x - mean for x in L]
    dev2 = [x*x for x in dev]
    # population standard deviation (divides by n)
    return math.sqrt(sum(dev2) / n)
So:
timedeltas = [...]  # your timedeltas here
microtimes = [timedelta_to_microtime(td) for td in timedeltas]
std = calc_std(microtimes)
# X is a threshold of your choosing, in microseconds
print([(td, mstime)
       for (td, mstime) in zip(timedeltas, microtimes)
       if mstime - std > X])
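Putting the `.total_seconds()` suggestion together with the standard library's `statistics` module, a compact sketch could look like the following. The durations are the ones listed in the question; the function name `find_anomalies` and the cutoff of 2 standard deviations from the mean are assumptions chosen for illustration, not something either answer prescribes.

```python
import statistics
from datetime import timedelta

# The durations from the question: 5 x 15 s, 5 x 5 s, 4 x 10 s and one 45:22.
durations = (
    [timedelta(seconds=15)] * 5
    + [timedelta(seconds=5)] * 5
    + [timedelta(seconds=10)] * 4
    + [timedelta(minutes=45, seconds=22)]
)


def find_anomalies(deltas, k=2.0):
    """Hypothetical helper: durations more than k std devs from the mean."""
    seconds = [td.total_seconds() for td in deltas]
    mean = statistics.mean(seconds)
    std = statistics.pstdev(seconds)  # population standard deviation
    return [td for td, s in zip(deltas, seconds) if abs(s - mean) > k * std]


anomalies = find_anomalies(durations)
print(anomalies)
```

With this data only the 45:22 duration is flagged. A single large outlier inflates both the mean and the standard deviation, so for noisier data a median-based measure may be a more robust choice.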
