How can I ensure that a pandas date range has even spacing? - python

I am writing some code to interpolate some data with space (x, y) and time. The data needs to be on a regular grid. I can't seem to make a generalized function to find a date range with regular spacing. The range that fails for me is:
date_min = numpy.datetime64('2022-10-24T00:00:00.000000000')
date_max = numpy.datetime64('2022-11-03T00:00:00.000000000')
And it needs to roughly match the current values of times I have, which for this case is 44.
periods = 44
I tried testing if the time difference is divisible by 2 and then adding 1 to the number of periods, which worked for a lot of cases, but it doesn't seem to really work for this time range:
def unique_diff(x):
    return numpy.unique(numpy.diff(x))

unique_diff(pd.date_range(date_min, date_max, periods=periods))
Out[31]: array([20093023255813, 20093023255814], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods+1))
Out[32]: array([19636363636363, 19636363636364], dtype='timedelta64[ns]')
unique_diff(pd.date_range(date_min, date_max, periods=periods-1))
Out[33]: array([20571428571428, 20571428571429], dtype='timedelta64[ns]')
However, it does work for +2:
unique_diff(pd.date_range(date_min, date_max, periods=periods+2))
Out[34]: array([19200000000000], dtype='timedelta64[ns]')
I could just keep trying different period deltas until I get a solution, but I would rather know why this is happening and how I can generalize this problem for any min/max times with a target number of periods

Your date range doesn't divide evenly into the number of steps at nanosecond resolution:
# as the range contains both start and end, there's one step fewer than there are periods
steps = periods - 1
int(date_max - date_min) / steps
# 20093023255813.953
A solution could be to round up (or down) your max date, to make it divide evenly in nanosecond resolution:
date_max_r = (date_min +
              int(numpy.ceil(int(date_max - date_min) / steps) * steps))
unique_diff(pd.date_range(date_min, date_max_r, periods=periods))
# array([20093023255814], dtype='timedelta64[ns]')
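The rounding above can be wrapped into a small helper; this is a sketch, and `even_date_range` is a name made up for illustration:

```python
import numpy as np
import pandas as pd

def even_date_range(date_min, date_max, periods):
    """Round the end date up so the nanosecond span divides
    evenly into (periods - 1) steps, then build the range."""
    steps = periods - 1
    span_ns = int(date_max - date_min)   # total span in nanoseconds
    step_ns = -(-span_ns // steps)       # ceiling division
    date_max_r = date_min + np.timedelta64(step_ns * steps, 'ns')
    return pd.date_range(date_min, date_max_r, periods=periods)

date_min = np.datetime64('2022-10-24T00:00:00.000000000')
date_max = np.datetime64('2022-11-03T00:00:00.000000000')
rng = even_date_range(date_min, date_max, periods=44)
print(np.unique(np.diff(rng)))  # a single timedelta, i.e. even spacing
```

The end date moves by at most `steps - 1` nanoseconds, which is negligible for interpolation purposes.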

Related

How to get daily difference in time series values when time delta index is irregular in pandas?

I have a dataframe containing a time series indexed by time but with irregular time deltas as below
df
time                      x
2018-08-18 17:45:08  1.4562
2018-08-18 17:46:55  1.4901
2018-08-18 17:51:21  1.8012
...
2020-03-21 04:17:19  0.7623
2020-03-21 05:01:02  0.8231
2020-03-21 05:02:34  0.8038
What I want to do is get the daily difference between the two (chronologically) closest values, i.e. the closest time the next day. For example, if we have a sample at time 2018-08-18 17:45:08, and the next day we do not have a sample at the same time, but the closest sample is at, say, 2018-08-19 17:44:29, then I want to get the difference in x between these two times. How is that possible in pandas?
There will always be a sample for every single day between first day and last day in the time series.
The difference should be taken as (current x) - (past x) e.g. x_day2 - x_day1
The output's first n rows will be NaN given how the difference is taken, where n is the number of samples in the first day
EDIT: The code below works if the time deltas are regular
def get_daily_diff(data):
    """
    Calculate daily difference in time series

    Args:
        data (pandas.Series): a pandas series of time series values indexed by pandas.Timestamp

    Returns:
        pandas.Series: daily difference in values
    """
    df0 = data.index.searchsorted(data.index - pd.Timedelta(days=1))
    df0 = df0[df0 > 0]
    df0 = pd.Series(data.index[df0 - 1], index=data.index[data.shape[0] - df0.shape[0]:])
    out = data.loc[df0.index] - data.loc[df0.values]
    return out
However, with irregular time deltas a ValueError is thrown when defining the variable out, because of a length mismatch between data.loc[df0.index] and data.loc[df0.values]. So the issue is to expand this function to work when the time deltas are irregular.
I would use pd.merge_asof with direction='nearest':
df['time_1d'] = df['time'] + pd.Timedelta('1D')
tmp = pd.merge_asof(df, df, left_on='time', right_on='time_1d',
                    direction='nearest', tolerance=pd.Timedelta('12H'),
                    suffixes=('', '_y'))
tmp['delta'] = tmp['x'] - tmp['x_y']  # current x minus previous day's x
tmp = tmp[['time', 'x', 'delta']]
Here I have used a tolerance of 12H to make sure to have NaN for first days but you could use a more appropriate value.
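For a self-contained illustration (the four sample rows here are invented, and the delta follows the asker's current-minus-past convention):

```python
import pandas as pd

# made-up sample: two rows on day 1, two on day 2
df = pd.DataFrame({
    'time': pd.to_datetime([
        '2018-08-18 17:45:08', '2018-08-18 17:46:55',
        '2018-08-19 17:44:29', '2018-08-19 17:50:00']),
    'x': [1.4562, 1.4901, 1.5120, 1.5333]})

# shift each timestamp forward a day, then asof-match each row
# against the shifted copy: the nearest match is yesterday's sample
df['time_1d'] = df['time'] + pd.Timedelta('1D')
tmp = pd.merge_asof(df, df, left_on='time', right_on='time_1d',
                    direction='nearest', tolerance=pd.Timedelta('12H'),
                    suffixes=('', '_y'))
tmp['delta'] = tmp['x'] - tmp['x_y']  # current x minus previous day's x
tmp = tmp[['time', 'x', 'delta']]
print(tmp)  # day-1 rows get NaN delta; day-2 rows show the daily change
```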

python getting histogram bins for datetime objects

I have two lists.
The list times is a list of datetimes from 2018-04-10 00:00 to
2018-04-10 23:59.
For each item in times I have a corresponding label of 0 or 1 recorded in the list labels.
My goal is to get the mean label value (between 0 and 1) for every minute interval.
times = [Timestamp('2018-04-10 00:00:00.118000'),
         Timestamp('2018-04-10 00:00:00.547000'),
         Timestamp('2018-04-10 00:00:00.569000'),
         Timestamp('2018-04-10 00:00:00.690000'),
         ...
         Timestamp('2018-04-10 23:59:59.999000')]
labels = [0, 1, 1, 0, 1, 0, ..., 1]
where len(times) == len(labels)
For every minute interval between 2018-04-10 00:00 and 2018-04-10 23:59, the min and max times in the list respectively, I am trying to get two lists:
1) The start time of the minute interval.
2) The mean average label value of all the datetimes in that interval.
In particular I am having trouble with (2).
Note: the times list is not necessarily chronologically ordered
First, here is how I generated data in the above format:
import numpy as np
import pandas as pd
from datetime import datetime

size = int(1e6)
timestamp_a_day = np.linspace(datetime.now().timestamp(), datetime.now().timestamp() + 24*60*60, size)
dummy_sec = np.random.rand(size)
timestamp_series = pd.Series(timestamp_a_day + dummy_sec)\
    .sort_values().reset_index(drop=True)\
    .apply(lambda x: datetime.fromtimestamp(x))
data = pd.DataFrame(timestamp_series, columns=['timestamp'])
data['label'] = np.random.randint(0, 2, size)
Let's solve this problem !!!
(I hope I understand your question precisely hahaha)
1) data['start_interval'] = data['timestamp'].dt.floor('min')
2) data.groupby('start_interval')['label'].mean()
1) zip times and labels, then sort;
2) write a function that returns the date, hour, and minute of a Timestamp;
3) groupby that function;
4) sum and average the labels for each group.
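Those steps might look something like this in plain Python (the three sample rows are made up):

```python
from collections import defaultdict
from datetime import datetime

# made-up sample: two samples in minute 00:00, one in minute 00:01
times = [datetime(2018, 4, 10, 0, 0, 0, 118000),
         datetime(2018, 4, 10, 0, 0, 30, 547000),
         datetime(2018, 4, 10, 0, 1, 5)]
labels = [0, 1, 1]

# 1) zip and sort; 2) key each timestamp by its minute;
# 3) group; 4) average each group's labels
groups = defaultdict(list)
for t, lab in sorted(zip(times, labels)):
    groups[t.replace(second=0, microsecond=0)].append(lab)

starts = sorted(groups)                                     # interval start times
means = [sum(groups[s]) / len(groups[s]) for s in starts]   # mean label per minute
print(starts, means)  # → minute starts and [0.5, 1.0]
```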

Python, improving for loop performance

I have made a class called localSun. I've taken a simplified model of the Earth-Sun system and tried to compute the altitude angle of the sun for any location on Earth at any time. When I run the code for the current time and check against timeanddate it matches well, so it works.
But then I wanted to go through one year in 1 minute intervals and store all the altitude angles into a numpy array for a specific location.
Here's my very first naive attempt which I'm fairly certain is not good for performance. I just wanted to test for performance anyways.
import numpy as np
from datetime import datetime
from datetime import date
from datetime import timedelta
...
...
altitudes = np.zeros(int(year/60))
m = datetime(2018, 5, 29, 15, 21, 0)

for i in range(0, len(altitudes)):
    n = m + timedelta(minutes=i+1)
    nn = localSun(30, 0, n)
    altitudes[i] = nn.altitude()  # .altitude() is a method in localSun
altitudes is the array to which I want to store all the altitudes and its size is 525969 which is basically the amount of minutes in a year.
The localSun() object takes 3 parameters: colatitude (30 deg), longitude (0 deg) and a datetime object which has the time from a bit over an hour ago (when this is posted)
So the question is: what would be an efficient way of going through a year in 1 minute intervals and computing the altitude angle at each time? This seems rather slow. Should I use map to update the altitude values instead of a for loop? I presume I'll have to create a new localSun object each time, and it's probably bad to keep re-creating the variables n and nn.
We can assume the localSun objects all methods work fine. I'm just asking what is an efficient way (if there is) of going through a year in 1 minute intervals and updating the array with the altitude. The code I have should reveal enough information.
I may later want to do this in 1 second intervals, so it would be great to know an efficient way; I tried that with this code and it takes very long.
This piece of code took about a minute to do on a university computer which are quite fast as far as I know.
I'd greatly appreciate if someone can answer. Thanks in advance!
Numpy has native datetime and timedelta support, so you could take an approach like this:
start = datetime.datetime(2018,5,29,15,21,0)
end = datetime.datetime(2019,5,29,15,21,0)
n = np.arange(start, end, dtype='datetime64[m]') # [m] specifies the interval as minutes
altitudes = np.vectorize(lambda x, y, z: localSun(x, y, z).altitude())(30,0,n)
np.vectorize is not fast at all, but gets this working until you can modify 'localSun' to work with arrays of datetimes.
Since you are already using numpy you can go one step further with pandas. It has powerful date and time manipulation routines such as pd.date_range:
import pandas as pd
start = pd.Timestamp(year=2018, month=1, day=1)
stop = pd.Timestamp(year=2018, month=12, day=31)
dates = pd.date_range(start, stop, freq='min')
altitudes = localSun(30, 0, dates)
You would then need to adapt your localSun to work with an array of pd.Timestamp rather than a single datetime.datetime.
Changing from minutes to seconds would then be as simple as changing freq='min' to freq='S'.
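To see why the array route pays off, here is a toy sketch; the sinusoidal formula is a made-up stand-in for localSun.altitude(), not the real model:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2018-01-01', '2019-01-01', freq='min')

# evaluate one vectorized expression over every timestamp at once,
# instead of constructing ~525k objects in a Python-level loop
minute_of_day = dates.hour.values * 60 + dates.minute.values
altitudes = 90 * np.sin(2 * np.pi * minute_of_day / 1440)

print(len(dates), altitudes.shape)
```

The real work is then rewriting localSun so its internals accept arrays like `dates` rather than a single datetime.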

Dataset statistics with custom begin of the year

I would like to do some annual statistics (cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so we get an offset of one day per leap year, and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums don't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use a xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()

pandas.date_range accurate freq parameter

I'm trying to generate a pandas.DateTimeIndex with a samplefrequency of 5120 Hz. That gives a period of increment=0.0001953125 seconds.
If you try to use pandas.date_range(), you need to specify the frequency (parameter freq) as str or as pandas.DateOffset. The first one can only handle an accuracy up to 1 ns, the latter has a terrible performance compared to the str and has even a worse error.
When using the string, I construct is as follows:
freq=str(int(increment*1e9)) + 'N'
which parses my 270 MB file in less than 2 seconds, but after 3 million records I have an error (in the DateTimeIndex) of about 1500 µs.
When using the pandas.DateOffset, like this
freq=pd.DateOffset(seconds=increment)
it parses the file in 1 minute and 14 seconds, but has an error of about a second.
I also tried constructing the DateTimeIndex using
starttime + pd.to_timedelta(cumulativeTimes, unit='s')
This sum takes also ages to complete, but is the only one which doesn't have the error in the resulting DateTimeIndex.
How can I achieve a performant generation of the DateTimeIndex, keeping my accuracy?
I used a pure numpy implementation to fix this:
accuracy = 'ns'
relativeTime = np.linspace(
    offset,
    offset + (periods - 1) * increment,
    periods)

def unit_correction(u):
    # note: compare strings with ==, not "is"
    if u == 's':
        return 1e0
    elif u == 'ms':
        return 1e3
    elif u == 'us':
        return 1e6
    elif u == 'ns':
        return 1e9

# Because numpy only knows ints as its date datatype,
# convert to the chosen accuracy.
return (np.datetime64(starttime)
        + (relativeTime * unit_correction(accuracy)).astype(
            "timedelta64[" + accuracy + "]"
        ))
(this is the github pull request for people interested: https://github.com/adamreeve/npTDMS/pull/31)
I think I reach a similar result with the function below (although it uses only nanosecond precision):
def date_range_fs(duration, fs, start=0):
    """Create a DatetimeIndex based on sampling frequency and duration

    Args:
        duration: number of seconds contained in the DatetimeIndex
        fs: sampling frequency
        start: Timestamp at which the DatetimeIndex starts (defaults to POSIX
            epoch)

    Returns: the corresponding DatetimeIndex
    """
    return pd.to_datetime(
        np.linspace(0, 1e9*duration, num=fs*duration, endpoint=False),
        unit='ns',
        origin=start)
