I am working on a regression problem where I have 12 sensor data columns (independent variables) and 1 output column, all sampled at 48 kHz. I have 420 seconds of training data in total. In the test dataset, I have the 12 sensor columns and need to predict the output.
So far, I have tried classical machine learning algorithms without considering the time feature. I am new to time series and not sure whether this is actually a time series forecasting problem.
I am not sure whether I can treat this as a multivariate time series problem and try LSTMs/RNNs.
I have been following https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845 but I am not able to understand how to predict on the test data.
Do I need to append a new column to convert the test data from (length, 12) to (length, 13) and then predict row by row, feeding each output into the next iteration?
Also, is the above a correct approach for this kind of problem, or do I have to think about something else?
UPDATE
Updating my question based on the comments below.
Let's say my training data looks like the table below (headings updated just to explain better). I am training an LSTM network the same way as in the link above. I have created Y(t), Y(t-1), x1(t-1), x2(t-1), x3(t-1), x4(t-1), x5(t-1), x6(t-1) using the series_to_supervised function.
Y x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2
Now, I have test data without the Y column.
As an example,
x1 x2 x3 x4 x5 x6
date
2010-01-02 00:00:00 -11 -6.0 1020.0 SE 1.79 0
2010-01-02 01:00:00 -12 -1.0 1020.0 SE 2.68 0
2010-01-02 02:00:00 -10 -4.0 1021.0 SE 3.57 0
2010-01-02 03:00:00 -7 -2.0 1022.0 SE 5.36 1
2010-01-02 04:00:00 -7 -5.0 1022.0 SE 6.25 2
What I have done: I have appended a fake Y column padded with zeros and replaced its first value with the mean of the training Y column. My idea is to use the predicted value at t-1 in the next prediction. I don't know how to do this easily, so I came up with the following logic.
Code snippet
# test_pd is a pandas DataFrame of shape (N, 6) -- the six test feature columns x1..x6
# train_pd is a pandas DataFrame whose first column is the training target Y
test_pd.insert(0, 'Y', 0)                    # add a fake Y column as the first column, padded with zeros
train_out_mean = train_pd.iloc[:, 0].mean()  # mean of the training Y column
test_pd.iloc[0, 0] = train_out_mean          # seed the first fake Y with the training mean

# reshape to the 3D layout the LSTM expects: [samples, timesteps, features]
test_arr = test_pd.values.reshape((test_pd.shape[0], 1, test_pd.shape[1]))

out_list = [train_out_mean]
for i in range(test_arr.shape[0]):
    # predict one row at a time
    y = loaded_model.predict(test_arr[i].reshape(1, test_arr.shape[1], test_arr.shape[2]))
    y = y[0][0]  # assuming the model outputs a single value of shape (1, 1)
    out_list.append(y)
    if i + 1 >= test_arr.shape[0]:
        break
    # feed the prediction for row i back in as the fake Y of the next row
    test_arr[i + 1][0][0] = y
I have two follow-up questions.
Is the above approach theoretically correct for solving the problem?
If yes, is there any better way to predict on the test dataset?
I would consider starting with a simpler approach before going for more complex algorithms like an LSTM.
Here on StackOverflow you should ask objective questions about code, so if you share some of your code here, we can try to help you.
Considering that you have a time series like this (the example from your link):
pollution dew temp press wnd_dir wnd_spd snow rain
date
2010-01-02 00:00:00 129.0 -16 -4.0 1020.0 SE 1.79 0 0
2010-01-02 01:00:00 148.0 -15 -4.0 1020.0 SE 2.68 0 0
2010-01-02 02:00:00 159.0 -11 -5.0 1021.0 SE 3.57 0 0
2010-01-02 03:00:00 181.0 -7 -5.0 1022.0 SE 5.36 1 0
2010-01-02 04:00:00 138.0 -7 -5.0 1022.0 SE 6.25 2 0
Simpler approach: MLP Regressor
In a simpler approach, assuming you want to predict the pollution, you can build an MLP regressor. During the training phase, you separate the data into 7 features (dew, temp, press, wnd_dir, wnd_spd, snow, rain) and use them to predict the pollution. Here is an example:
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics

data = dataset.values  # dataset is the pollution DataFrame shown above

# integer encode wind direction (column 4)
encoder = LabelEncoder()
data[:, 4] = encoder.fit_transform(data[:, 4])

# scale all columns to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# first column (pollution) is the target, the rest are features
y, X = np.split(scaled, [1], axis=1)

mlp = MLPRegressor(learning_rate_init=0.001)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

mlp.fit(X_train, y_train.ravel())
y_prediction = mlp.predict(X_test)
print("R2 score:", metrics.r2_score(y_test, y_prediction))
Output:
R2 score: 0.30376681842945985
In LSTM (you need 3D input: [samples, timesteps, features])
NOW, suppose that some feature (wind, air pressure, etc.) at the moments t-1 and t-2 (1 hour and 2 hours ago) has some influence on the moment t. So now you intend to solve your problem as a time series, capturing the evolution of wind speed (for example) over some period. Now it makes sense to use an LSTM.
So, the function series_to_supervised (from the example in your link) will help you create these new features...
The function series_to_supervised has 4 arguments (a sketch of the function itself follows the list):
data: Sequence of observations as a list or 2D NumPy array.
n_in: Number of lag observations as input (X). Values may be between [1..len(data)]
n_out: Number of observations as output (y). Values may be between [0..len(data)-1].
dropnan: Boolean whether or not to drop rows with NaN values
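For reference, here is a minimal sketch of such a function, along the lines of the one in the linked post:
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # frame a time series as a supervised learning dataset
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = [], []
    # input sequence (t-n, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ['var%d(t-%d)' % (j + 1, i) for j in range(n_vars)]
    # forecast sequence (t, t+1, ..., t+n_out-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += ['var%d(t)' % (j + 1) for j in range(n_vars)]
        else:
            names += ['var%d(t+%d)' % (j + 1, i) for j in range(n_vars)]
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop the rows with NaN values introduced by the shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg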
So, suppose this series with only one feature X and the label y:
X y
2018-01-01 00:00:00 1 2
2018-01-01 01:00:00 2 3
2018-01-01 02:00:00 3 4
2018-01-01 03:00:00 4 5
2018-01-01 04:00:00 5 6
2018-01-01 05:00:00 6 7
2018-01-01 06:00:00 7 8
2018-01-01 07:00:00 8 9
2018-01-01 08:00:00 9 10
2018-01-01 09:00:00 10 11
Using series_to_supervised(df.values, n_in=2, n_out=1, dropnan=False) you will get something like this (I renamed the columns to make it easier to understand):
X(t-2) y(t-2) X(t-1) y(t-1) X(t) y(t)
2018-01-01 00:00:00 NaN NaN NaN NaN 1 2
2018-01-01 01:00:00 NaN NaN 1.0 2.0 2 3
2018-01-01 02:00:00 1.0 2.0 2.0 3.0 3 4
2018-01-01 03:00:00 2.0 3.0 3.0 4.0 4 5
2018-01-01 04:00:00 3.0 4.0 4.0 5.0 5 6
2018-01-01 05:00:00 4.0 5.0 5.0 6.0 6 7
2018-01-01 06:00:00 5.0 6.0 6.0 7.0 7 8
2018-01-01 07:00:00 6.0 7.0 7.0 8.0 8 9
2018-01-01 08:00:00 7.0 8.0 8.0 9.0 9 10
2018-01-01 09:00:00 8.0 9.0 9.0 10.0 10 11
So, in this approach we are assuming that, to predict y(t), we need at least the two previous records X(t-2), X(t-1) and y(t-2), y(t-1).
Why do you need to do this? Now I can start answering your question. For an LSTM you need to transform your data from 2D into 3D.
So, after that you need to reshape the input to be 3D [samples, timesteps, features] before using an LSTM; transforming your data with this function is just the preparation step.
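As an illustration only (reframed, n_lag and n_features are placeholder names, not from the original post), the preparation typically ends with something like:
values = reframed.values                          # output of series_to_supervised
n_obs = n_lag * n_features                        # number of lagged input columns
X, y = values[:, :n_obs], values[:, -1]           # lagged columns as input, value at t as target
X = X.reshape((X.shape[0], n_lag, n_features))    # 3D: [samples, timesteps, features]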
Answering your question: you don't just need to append one column. You need to transform your data so that you have new features at t-n, ..., t-3, t-2, t-1 in order to predict some feature at t.
I recommend you first follow the steps of the pollution case (the one you cited) on that blog, before trying to adapt them to your own case.
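For completeness, here is a minimal Keras sketch in the spirit of that blog post (not your exact setup), assuming train_X has already been reshaped to (samples, timesteps, features) and train_y/test_y hold the target at time t:
from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))                          # single regression output
model.compile(loss='mae', optimizer='adam')
model.fit(train_X, train_y, epochs=50, batch_size=72,
          validation_data=(test_X, test_y), verbose=2, shuffle=False)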
Related
Original Post
I am working with time series data and I want to apply a function to each data frame chunk for rolling time intervals/windows. When I use rolling() and apply() on a Pandas DataFrame, it applies the function iteratively for each column given a time interval. Here's example code:
Sample data
In:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6],
                   'B': [2, 4, 6, 8, 10, 12]},
                  index=pd.date_range('2019-01-01', periods=6, freq='5T'))
print(df)
Out:
A B
2019-01-01 00:00:00 1 2
2019-01-01 00:05:00 2 4
2019-01-01 00:10:00 3 6
2019-01-01 00:15:00 4 8
2019-01-01 00:20:00 5 10
2019-01-01 00:25:00 6 12
Output when using the combination of rolling() and apply():
In:
print(df.rolling('15T', min_periods=2).apply(lambda x: x.sum().sum()))
Out:
A B
2019-01-01 00:00:00 NaN NaN
2019-01-01 00:05:00 3.0 6.0
2019-01-01 00:10:00 6.0 12.0
2019-01-01 00:15:00 9.0 18.0
2019-01-01 00:20:00 12.0 24.0
2019-01-01 00:25:00 15.0 30.0
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 9.0
2019-01-01 00:10:00 18.0
2019-01-01 00:15:00 27.0
2019-01-01 00:20:00 36.0
2019-01-01 00:25:00 45.0
Freq: 5T, dtype: float64
Currently, I am using a for loop to do the job, but I am looking for a more efficient way to handle this operation. I would appreciate it if you could provide a solution within the Pandas framework, or even with other libraries.
Note: please do not take the example function (summation) seriously; assume that the function of interest requires iterating over the chunks of the dataset as is, i.e., with no prior per-column operations.
Thanks in advance!
Edit/Update
After reading the responses, I have realized that the example I gave was not adequate. Here I go again using the same sample data:
In:
print(
    df.rolling('15T', min_periods=2).apply(
        lambda x: x['A'].mean() / x['B'].std()
    )
)
Out:
KeyError: 'A'
Desired Out:
2019-01-01 00:00:00 NaN
2019-01-01 00:05:00 1.06
2019-01-01 00:10:00 1.00
2019-01-01 00:15:00 1.50
2019-01-01 00:20:00 2.00
2019-01-01 00:25:00 2.50
Freq: 5T, dtype: float64
Again, I want to point out that the main objective of my question is to find an efficient way to iterate over chunks of the DataFrame. For example, I do not want the following solution:
df['A'].rolling('15T', min_periods=2).mean() / df['B'].rolling('15T', min_periods=2).std()
And, for those who are interested in the real problem rather than the simple example, you can check it out here at mlfactor, where the triple barrier method is explained.
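A hedged sketch of one possible workaround (not from the original post): roll over the integer positions of the rows instead of the values, then slice the full frame per window; df, A and B are the sample data above.
import numpy as np
import pandas as pd

# a Series of row positions carrying the same DatetimeIndex as df
positions = pd.Series(np.arange(len(df)), index=df.index)

def chunk_func(pos):
    # pos holds the integer positions of the rows in the current window
    chunk = df.iloc[int(pos[0]): int(pos[-1]) + 1]
    return chunk['A'].mean() / chunk['B'].std()

result = positions.rolling('15T', min_periods=2).apply(chunk_func, raw=True)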
In my dataframe I have a timestamp column formatted as 2021-11-18 00:58:22.705.
I wish to create a column that displays the time elapsed from the initial time (the first row) to each row.
There are two ways I can think of doing this, but I don't know how to make either of them happen.
Method 1:
Subtract each timestamp from the one in the row above.
df["difference"]= df["timestamp"].diff()
Once this time difference has been calculated, I would like to create another column that keeps a running sum of the differences (i.e., the elapsed time from the start of the process).
Method 2:
I guess another way would be to subtract the initial timestamp (the first one) from the timestamp of each row.
I do not know how I would do that.
Thanks in advance.
I have not completely understood which type of difference you need, so I am adding both, which I think are reasonable:
import pandas as pd

# 20 timestamps spaced 30 minutes apart
times = pd.date_range('2022-05-23', periods=20, freq='30min')
df = pd.DataFrame({'Timestamp': times})

# elapsed minutes from the first timestamp to each row
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).dt.total_seconds() / 60
# running total of those differences
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
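For Method 1 from the question (per-row diff, then a running total), here is a minimal sketch assuming pandas is imported as pd and the column is named timestamp as in the question:
# per-row difference to the previous timestamp (the first row becomes NaT, filled with 0)
df['difference'] = df['timestamp'].diff().fillna(pd.Timedelta(0))
# running total of those differences = elapsed time since the first row
df['elapsed'] = df['difference'].cumsum()
# optionally expressed in minutes
df['elapsed_min'] = df['elapsed'].dt.total_seconds() / 60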
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column to fill the missing data. For user and total, I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do for each column. I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and whether there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, yet if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here, as it would apply to the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but it feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd

dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
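A hedged sketch of one possible workaround for the expected output (not a pure agg solution), using the example df above: aggregate the summable column with resample, then fill the remaining columns after resampling. This assumes last() followed by ffill() is an acceptable stand-in for the intended forward fill.
# sum the value column per 5-minute bin (empty bins sum to 0)
resampled = df.resample('5T').agg({'value': 'sum'})
# take the last observation per bin for the other columns (empty bins become NaN), then forward-fill
filled = df[['user', 'total']].resample('5T').last().ffill()
resampled[['user', 'total']] = filled
resampled = resampled[['user', 'value', 'total']]  # restore the original column order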
I have a dataset that has 8 mixed features (6 numeric and 2 categorical). Since the numeric values have different ranges, I will have to normalize the dataset as a whole to be able to perform further steps such as machine learning algorithms and dimensionality reduction (feature extraction).
My original dataset:
time v1 v2 v3 ... v7 v8
00:00:01 15435 0.7 13 ... High True
00:00:06 24356 3.6 23 ... High True
00:00:11 25567 8.3 82 ... LOW False
00:00:16 12345 5.4 110 ... LOW True
00:00:21 43246 1.7 93 ... High False
................................................
23:23:59 23456 3.8 45 ... LOW False
where v1 to v6 are numerical variables whose values lie in different ranges, as can be seen above. Moreover, v7 and v8 are categorical variables that have only two levels (for v7 {High, Low} and for v8 {True, False}).
I did label encoding for the categorical variables (v7 and v8), where High and True were encoded as 1 and Low and False were encoded as 0.
The following illustrates how the dataset looks after the label encoding:
time v1 v2 v3 ... v7 v8
00:00:01 15435 0.7 13 ... 1 1
00:00:06 24356 3.6 23 ... 1 1
00:00:11 25567 8.3 82 ... 0 0
00:00:16 12345 5.4 110 ... 0 1
00:00:21 43246 1.7 93 ... 1 0
................................................
23:23:59 23456 3.8 45 ... 0 0
My question is as follows: it is easy to standardize the numerical features v1 to v6. However, I am not sure whether to standardize the categorical features as well and, if so, what the best way to do that would be.
You can use UNIX time, for example:
import pandas as pd
import numpy as np

date = pd.date_range('1/1/2011', periods=10, freq='H')
df = pd.DataFrame({'date': date})
# nanoseconds since the epoch, integer-divided down to seconds
df['unix_time'] = df['date'].astype(np.int64) // 10**9
df
output:
date unix_time
0 2011-01-01 00:00:00 1293840000
1 2011-01-01 01:00:00 1293843600
2 2011-01-01 02:00:00 1293847200
3 2011-01-01 03:00:00 1293850800
4 2011-01-01 04:00:00 1293854400
5 2011-01-01 05:00:00 1293858000
6 2011-01-01 06:00:00 1293861600
7 2011-01-01 07:00:00 1293865200
8 2011-01-01 08:00:00 1293868800
9 2011-01-01 09:00:00 1293872400
Now your machine learning algorithms can compare dates; you can also convert the date back:
pd.to_datetime(df['unix_time'], unit='s')
output:
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
Name: unix_time, dtype: datetime64[ns]
Normalization rescales the values to the range 0 to 1. Your encoded categorical values are already in this range; you would only have needed to normalize them if the cardinality were really, really high, so for now you can keep them as they are. I would also suggest normalizing your whole dataset: then all values will be in the same range, and the algorithm will not erroneously learn anything by giving preference to features with larger numerical values. You can find both normalization and scaling in scikit-learn itself.
from sklearn import preprocessing

X = your_data  # your numeric feature matrix
# note: preprocessing.normalize rescales each sample (row) to unit norm,
# which is not the same as rescaling each feature to the [0, 1] range
normalized_X = preprocessing.normalize(X)
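If the goal is specifically the feature-wise 0-to-1 rescaling described above (rather than the per-row unit-norm scaling that normalize performs), here is a minimal sketch using MinMaxScaler, with X as above:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
# rescales each column (feature) to [0, 1]; columns that are already 0/1,
# such as the label-encoded categoricals, are effectively unchanged
scaled_X = scaler.fit_transform(X)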
This is my data:
time id w
0 2018-03-01 00:00:00 39.0 1176.000000
1 2018-03-01 00:15:00 39.0 NaN
2 2018-03-01 00:30:00 39.0 NaN
3 2018-03-01 00:45:00 39.0 NaN
4 2018-03-01 01:00:00 39.0 NaN
5 2018-03-01 01:15:00 39.0 NaN
6 2018-03-01 01:30:00 39.0 NaN
7 2018-03-01 01:45:00 39.0 1033.461538
8 2018-03-01 02:00:00 39.0 1081.066667
9 2018-03-01 02:15:00 39.0 1067.909091
10 2018-03-01 02:30:00 39.0 NaN
11 2018-03-01 02:45:00 39.0 1051.866667
12 2018-03-01 03:00:00 39.0 1127.000000
13 2018-03-01 03:15:00 39.0 1047.466667
14 2018-03-01 03:30:00 39.0 1037.533333
I want to get index: 10
Because I need to know which times are not continuous, and I need to fill in values there.
For each 'time', I want to know whether there is a NaN both before and after it; if there is not (i.e., the NaN is isolated), I need to know its index so I can add a value for it.
My data is very large. I need a faster way.
I really need your help. Many thanks.
Not sure if I understood you correctly. If you want the indexes of the time column where the gap to the previous row is more than 15 minutes, you will get more than a single index, and you can do it like this:
# parse the time column and compute the gap to the previous row
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
df['Delta'] = df['time'].subtract(df['time'].shift(1))
df['Delta'] = df['Delta'].astype(str)
# indexes where the gap is not exactly 15 minutes
print(df.index[df['Delta'] != '0 days 00:15:00.000000000'].tolist())
And the output is:
[4561, 4723, 5154, 5220, 5293, 5437, 5484]
Edit
Again, if I understood you right, just use this:
df.index[(pd.isnull(df['w'])) & (pd.notnull(df['w'].shift(1))) & (pd.notnull(df['w'].shift(-1)))].tolist()
Output:
[10]
This should work pretty fast:
import numpy as np
index = np.array([4561,4723,4724,4725,4726,5154,5220,5221,5222,5223,5224,5293,5437,5484,5485,5486,5487])
continuous = np.diff(index) == 1
# check on both 'sides'; +1 because you 'lose' one index in the diff operation
not_continuous = np.where(~continuous[1:] & ~continuous[:-1])[0] + 1
index[not_continuous]
array([5154, 5293, 5437])
It doesn't handle the first value well but this is quite ambiguous since you don't have a preceding value to check against. Up to you to add this extra check if it matters to you... Same for last value, potentially.