I want to window a pandas Series which has a DatetimeIndex to the last X seconds. Usually I'd use pandas.Series.rolling for windowing. However, the datetime indices are not equidistant, which means I cannot reliably calculate the number of data points needed to cover the window. How can I implement time-based windowing (e.g. by implementing a BaseIndexer subclass and passing it to the window parameter of rolling())?
The easiest way I came up with to get the last X seconds from a datetime-indexed series is: get the newest timestamp with newest_timestamp = series.index.max(), calculate the oldest timestamp to consider from it with oldest_timestamp = newest_timestamp - pd.to_timedelta(<X-seconds>, unit='s'), and slice the window with windowed_series = series[oldest_timestamp:newest_timestamp]. Because oldest_timestamp is calculated rather than extracted from the series, the slice bounds will usually not match an existing index entry exactly. However, this does not matter, because pandas handles inexact slice bounds on a sorted DatetimeIndex automatically.
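Put together, a minimal sketch of this approach (using a 30-second window purely as an example):
import pandas as pd

newest_timestamp = series.index.max()
# Window length of 30 seconds, as an example
oldest_timestamp = newest_timestamp - pd.to_timedelta(30, unit='s')
# Inexact bounds are fine: pandas returns everything that falls within the range
windowed_series = series[oldest_timestamp:newest_timestamp]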
NOTE: series.rolling() is usually used in a time-series pre-processing context (e.g. weighting samples within a window according to some function, as part of a forecasting application), not for plain windowing use cases.
I need to use multivariate linear regression for my project, where I have two dependent variables: mean_1 and mean_2. The independent variable is a date in YYYY-mm-dd format. I have been going through various Stack Overflow posts to understand how to use a date as a variable in regression. Some suggest converting the date to a numerical value (https://stackoverflow.com/a/40217971/13713750), while the other option is to convert the date to dummy variables.
What I don't understand is how to convert every date in the dataset to a dummy variable and use it as an independent variable. Is that even possible, or are there better ways to use a date as an independent variable?
Note: I would prefer keeping the date in date format so it would be easy to plot and analyse the results of the regression. Also, I am working with PySpark, but I can switch to pandas if necessary, so any example implementations would be helpful. Thanks!
You could create new columns year, month, day_of_year, day_of_month, day_of_week. You could also add some binary columns like is_weekday, is_holiday. In some cases it is beneficial to add third-party data, like daily weather statistics (I was working on a case where extra daily weather data proved very useful). It really depends on the domain you're working in. Any of those columns could unveil some pattern behind your data.
As for dummy variables, converting month and day_of_week to dummies makes sense.
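For illustration, a minimal pandas sketch (assuming a dataframe df with a date column in YYYY-mm-dd format; the column names are just examples):
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_year'] = df['date'].dt.dayofyear
df['day_of_month'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6
df['is_weekday'] = (df['day_of_week'] < 5).astype(int)
# Dummy-encode the categorical calendar columns
df = pd.get_dummies(df, columns=['month', 'day_of_week'])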
Another option is to build a separate model for each month.
If you want to transform a date to a numeric value (though I don't recommend it), you can convert it to an integer timestamp:
pd.to_datetime(df.date).astype('int64')  # nanoseconds since the Unix epoch
You can do the same but with the total number of seconds:
pd.to_datetime(df.date).astype('int64') // 10**9  # whole seconds since the Unix epoch
Also, you can pick a baseline date and subtract it from your date variable to obtain the number of days between them. This gives you an integer that behaves sensibly: a bigger value means a date further in the future, while a smaller value means an older date. This value makes sense to me as an independent variable in a model.
First, we create a baseline date (it can be whatever you want) and add it to the dataframe in the column static:
import datetime
df['static'] = pd.to_datetime(datetime.date(2017, 6, 28))
Then we obtain the difference in days between your date and the static date:
df['days'] = (df['date'] - df['static']).dt.days  # bigger value = further in the future
And there you have a number ready to be used as an independent variable.
I have a 100-by-2 matrix. The first column has dates in numerical format. The dates are not necessarily sequentially increasing or decreasing. The granularity of the dates is 5 minutes, so there could be rows whose year, month, day and hour are the same but whose minutes differ. I need to do some operations on the matrix; how can I do that? Is there any way to store date and time in the matrix?
Yes, that all depends on the data structure that you want to use:
numpy has a datetime dtype (datetime64): doc here
pandas too: tutorial here
You can also choose to store them as Unix timestamps, which are basically integers counting the number of seconds since 1/1/1970.
If you choose to use built-in types instead, such as lists and dictionaries, then you can use the datetime library, which provides datetime objects.
If you want more information, a simple google search for "python datetime" will probably shed some light...
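For illustration, a minimal sketch of the first three options (the example values are hypothetical; your numeric dates would need to be parsed into timestamps first):
import numpy as np
import pandas as pd

dates = ['2021-03-01T10:00', '2021-03-01T10:05']  # hypothetical 5-minute data
arr = np.array(dates, dtype='datetime64[m]')      # numpy datetime64, minute resolution
idx = pd.to_datetime(dates)                       # pandas DatetimeIndex
unix = idx.astype('int64') // 10**9               # Unix timestamps in seconds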
I have a pandas Series series with a lot of entries; now I calculate
series.rolling(window=25000, center=False).mean()
This calculation can take a very long time, since there is a lot of data.
Now I want to add new data to this series (data that only becomes available over time, not before the original calculation).
If I calculate the rolling mean again, it will again take a very long time. Is there a way to speed up the second calculation by reusing data from the first one (possibly with another library)?
Assume you have the new data in new_series. To extend the rolling mean you only need the last window - 1 = 24999 points of the old series, because each rolling value for the new points depends on at most that many old points:
r1 = series.rolling(window=25000, center=False).mean()
# Prepend the last 24999 old points so the first new point gets a full window,
# then drop the leading NaNs from the warm-up region
r2 = pd.concat([series.iloc[-24999:], new_series]).rolling(window=25000, center=False).mean().dropna()
r = pd.concat([r1, r2])
(Series.append was removed in pandas 2.0, hence pd.concat.)
As far as I'm aware, TSFRESH expects a number of column IDs (entities), each with one continuous set of time series data.
If I've got a number of different discrete datasets of time series data for each entity, can TSFRESH use them? These datasets are from the same sensor but are essentially repeats of the same event multiple times.
Yes, that is possible. You can use the kind attribute to assign multiple types of time series to each entity. We have an exemplary notebook/dataset where we show how to do that; see https://github.com/blue-yonder/tsfresh/blob/master/notebooks/robot_failure_example.ipynb.
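For illustration, a minimal sketch of that long format (the column names and values here are made up, not required by tsfresh):
import pandas as pd
from tsfresh import extract_features

# One entity (id=1) with two repeats of the same event, told apart by 'kind'
df = pd.DataFrame({
    'id':    [1, 1, 1, 1],
    'time':  [0, 1, 0, 1],
    'kind':  ['run_1', 'run_1', 'run_2', 'run_2'],
    'value': [0.1, 0.4, 0.2, 0.5],
})
features = extract_features(df, column_id='id', column_sort='time',
                            column_kind='kind', column_value='value')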
I am using convolutional networks for forecasting time series. For this I am using rolling windows to take the last t points and use them as a time series. Every feature is going to be a channel, so I have a set of multiple time series. The data needs to be in 3 dimensions, [n_samples, window_size, features]. The original dataset I have is [n_samples, features]. The data is already in ascending time order. My problem is that the way I am creating my 3D tensor crashes my computer, given I have close to 500k rows. This is the code I am using.
import numpy as np

prueba = x_data    # This dataset has shape [500k, 20]
window_size = 100  # Taking the last 100 days
n_units, n_features = prueba.shape
n_samples = n_units - window_size + 1  # Number of samples produced by the rolling windows
data_list = []
for init_index in range(n_samples):
    fin_index = window_size + init_index
    window_set = prueba[init_index:fin_index, :]
    window_flat = np.reshape(window_set, (1, window_size * n_features))
    data_list.append(window_flat)
features_tensor = np.concatenate(data_list, axis=0)
features_tensor = np.reshape(features_tensor, (n_samples, window_size, n_features))  # This breaks my computer
The problem is that my computer crashes when I use np.concatenate to put together all the individual windows I create. Does anyone know a faster way to do this? I am trying to think of a way to avoid using np.concatenate, but so far I haven't been able to figure one out.
Using the approach you have here (which ends in np.concatenate) is quite inefficient, since you are duplicating every data point (roughly) window_size times. That is almost certainly a waste of memory: whatever operation acts on this dataset should, ideally, be able to work on a rolling basis, going through the time series without ever having to see the fully expanded, vastly duplicated dataset in tensor format.
So, I suggest that the better approach is to find a way to avoid building this redundant tensor in the first place.
Since we don't know what you are doing with this tensor, it's not possible to give a definitive answer. However, here are a few things to consider:
One "right" way to do this is to use pandas, which has a rolling window feature df.rolling()docs here. This does exactly what you want (performs computations on a rolling window, without a big redundant tensor), but of course only if that works with the downstream code.
If you are using TensorFlow, then you'll be better served by creating a generator that yields each window when called, which can be put into a tf.data.Dataset (see the .from_generator() method and example here); a sketch follows after this list.
In Keras, try TimeseriesGenerator, which has this capability (docs here).
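For illustration, here is a minimal sketch of the generator approach (assuming TensorFlow 2.4+ and the shapes from the question):
import numpy as np
import tensorflow as tf

def window_generator(data, window_size):
    # Yield one [window_size, n_features] slice at a time instead of
    # materializing the full [n_samples, window_size, n_features] tensor
    for start in range(len(data) - window_size + 1):
        yield data[start:start + window_size]

dataset = tf.data.Dataset.from_generator(
    lambda: window_generator(x_data, 100),  # x_data has shape [500k, 20]
    output_signature=tf.TensorSpec(shape=(100, 20), dtype=tf.float64),
)
dataset = dataset.batch(32)  # batches of windows, built lazily during training
This way only one batch of windows exists in memory at a time, rather than the whole duplicated tensor.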