How to test on new values in LSTM in Python

I have just created an LSTM model that predicts a multidimensional numpy array using a time frame of 7. The column index ranges from 1 because the 0th column is a date value. The model does pretty well on the test set up to March 2018, for which I have ground-truth values. Now I want to predict the next year, but I am stuck on this prediction part because I don't have ground truth to feed into the model; I only have the following dates. Could you please help me understand how this prediction can be achieved? Let me know if you need any more details besides the data.
Please find the code below:
def build_model(NanWah):
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, Dropout

    NanWah_data_model1 = NanWah

    # First 80% of the rows for training, the remainder for testing
    list_range = int(NanWah_data_model1.shape[0] * 0.8)
    rest_list_range = NanWah_data_model1.shape[0] - list_range

    # Drop the date column (index 0) and keep the 13 numeric features
    NanWah_training_set = NanWah_data_model1.iloc[:list_range, 1:].values

    # Scale every feature to the range [0, 1]
    sc = MinMaxScaler(feature_range=(0, 1))
    NanWah_training_set_scaled = sc.fit_transform(NanWah_training_set)

    # Sliding windows of 7 time steps as inputs, the following row as the target
    X_train = []
    y_train = []
    for i in range(7, list_range):
        X_train.append(NanWah_training_set_scaled[i - 7:i, :])
        y_train.append(NanWah_training_set_scaled[i])
    X_train, y_train = np.array(X_train), np.array(y_train)

    # LSTM expects 3-D input: (samples, timesteps, features)
    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 13))

    # Stacked LSTM regressor predicting all 13 features of the next time step
    regressor = Sequential()
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 13)))
    # regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50, return_sequences=True))
    # regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50, return_sequences=True))
    # regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50, return_sequences=True))
    # regressor.add(Dropout(0.20))
    regressor.add(LSTM(units=50, return_sequences=False))
    regressor.add(Dense(units=13))
    regressor.compile(optimizer="adam", loss="mean_squared_error")
    regressor.fit(X_train, y_train, epochs=5, batch_size=10)

    # Build test windows the same way and predict
    NanWah_test_set = NanWah_data_model1.iloc[list_range:, 1:].values
    inputs = NanWah_test_set
    inputs = sc.transform(inputs)
    X_test = []
    for i in range(7, rest_list_range):
        X_test.append(inputs[i - 7:i, :])
    X_test = np.array(X_test)
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 13))

    predicted_values = regressor.predict(X_test)
    predicted_values = sc.inverse_transform(predicted_values)
    predicted_water_m3 = predicted_values[:, 9:10]
    predicted_electricity_kwh = predicted_values[:, 7:8]
Thank you in advance

I found the answer to this question.
Here is what I did.
1. Keep a numpy array that holds the last n values (where n is the number of time steps the LSTM looks back).
2. Append the 0th index of the numpy array again as a placeholder.
3. Reshape it.
4. Predict the value using the model built.
5. Delete the last index of the numpy array and append the predicted value instead.
6. Continue this until you have the desired number of predicted records.
Sample code is given below:
inputs = Test.values          # this contains the last 60 values of the training record
inputs = inputs.reshape(-1, 1)
# Scale inputs but not actual test values
inputs = sc.transform(inputs)

# 60-step look-back, forecasting only 5 records one at a time
for test in range(0, 5):
    # Append a placeholder value so the sliding window stays 60 entries long
    inputs = np.append(inputs, inputs[0])
    inputs = inputs.reshape(-1, 1)
    print(inputs.shape)

    # Take the most recent 60-value window as the model input
    X_test = []
    for i in range(60, 61):
        X_test.append(inputs[test:i + test, 0])
    # make list to array
    X_test = np.array(X_test)
    print(X_test)
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

    # Predict the next value
    predicted_stock_price = regressor.predict(X_test)
    print("for iteration {}: {}".format(test, predicted_stock_price))

    # Replace the placeholder with the predicted value so the next
    # iteration's window includes it
    inputs = np.delete(inputs, len(inputs) - 1, axis=0)
    inputs = np.append(inputs, predicted_stock_price)
    inputs = inputs.reshape(-1, 1)
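The same rolling idea extends to the multivariate setup in the original question (7-step look-back, 13 features): each predicted row of 13 values is appended to the window before the next step. A minimal sketch, assuming regressor and sc are the fitted model and scaler from build_model, and last_window is an illustrative variable holding the last 7 scaled rows (not a name from the original code):

import numpy as np

# last_window: shape (7, 13), the most recent 7 scaled rows (assumed available)
window = last_window.copy()
future_scaled = []

for step in range(365):                        # e.g. one year of daily steps
    x = window.reshape(1, 7, 13)               # (samples, timesteps, features)
    next_row = regressor.predict(x)            # shape (1, 13)
    future_scaled.append(next_row[0])
    # Slide the window forward: drop the oldest row, append the prediction
    window = np.vstack([window[1:], next_row])

# Back to the original units
future = sc.inverse_transform(np.array(future_scaled))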

Related

Multivariate Keras Prediction Model With LSTM: Which index is used when predicting?

Apologies as I am new to using Keras and working with LSTM predictions in general. I am writing code that takes in a CSV file whose columns are float or int values which are related in some way, uses a Keras LSTM model to train against these columns, and attempts to predict one of the columns as an output. I am following this guide:
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
In the example, the relevant predicted column is the amount of air pollution, and all other values are used to predict this value. Adapting the example code for my CSV seems straightforward: I change the size of the training data and the number of columns appropriately.
My issue is that I don't understand why the example code is outputting predicted values for the "pollution" column and not some other column. I could just make the column I want to predict the second column in my formatted input CSV, but I would like to understand what is actually happening in the example as much as possible. Looking at the documentation for Model.predict(), it says the input value can be a list of arrays if the model has multiple inputs, but the return value is only described as "numpy array(s) of predictions", and it doesn't specify how I can make it return "arrays" versus an array. When I print out the result of this function, I only get an array of predictions for the pollution variable, so it seems like that column is selected somewhere before this point.
How can I change which column is returned by predict()?
Which column is returned by predict() depends on what you select as your output data (y). When the author preprocessed their data, they made the current pollution the last column of their dataset. Then, when selecting an output, y, to evaluate the model, they ran these lines of code:
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
The input (X) arrays include all rows and every column except the last, as denoted by the index, whereas the output (y) arrays include all rows but only the last column, which is the pollution variable.
When the model is training, it is trying to use the inputs, in this case the previous timestep inputs, to accurately predict the output, which in this case is the pollution at the current time. Therefore, when the model makes predictions, it will use this function that it learned to relate the two datasets to predict pollution.
So, in summary, select the column that you want your model to predict as the train_y and test_y datasets. Hope this helps!
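As a hedged illustration (the target_col index below is hypothetical, not from the tutorial), selecting a different column as the output would look like this:

import numpy as np

target_col = 3  # hypothetical column index; adjust to your CSV layout
# Inputs are every column except the target, output is the target column
train_X, train_y = np.delete(train, target_col, axis=1), train[:, target_col]
test_X, test_y = np.delete(test, target_col, axis=1), test[:, target_col]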

How can I get predicted the following value of stock using predict method of Tensorflow?

I am wondering how to predict future time series data after model training. I would like to get the values N steps ahead, and I wonder whether the time series data has been properly learned and predicted. How do I do this correctly to get the following (next) values? I want to get the next value using model.predict or something similar.
I have x_test and x_test[-1] == t, so the next values mean t+1, t+2, ..., t+n. In this example I want to get t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is as shown below.
The results from X_test[-20:] and the following 20 predictions look the same. I'm wondering whether this is the correct way to train and predict, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something is wrong, I tried using another official data so I used the time series in the Tensorflow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove the first (oldest) value
    a = np.append(a, tmp)  # insert the predicted value
The predicted results come out as a linear-regression-like shape, very different from the real data.
The output is an abnormal, almost linear line that is independent of the real data:
full source (After the 25th line is my code.)
I'm really curious how I can predict the following values of a time series using the TensorFlow predict method.
I'm not wondering if this works or not theoretically. I'm just wondering how to get the following n steps using the predict method.
Thank you for reading the long question; I would appreciate your advice.
In the second approach, the output is not as expected because, as per my understanding, there is a small mistake in the code.
The line of code,
a = y_val[-look_back:]
should be replaced by
look_back = 20
x = x_val_uni
a = x[-look_back:]
a.shape
In other words, we should send the X values as inputs to the model for the prediction, not the Y values.
However, we can compare its predictions with the Y values, with the code:
y = y_val_uni[-20:]
plt.plot(y)
plt.plot(tmp)
plt.show()
This would result in the plot shown below:
Please find the Complete Working Code in this Google Colab Gist.
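Putting the fix and a recursive multi-step loop together, a minimal sketch (assuming a univariate model trained on windows of look_back single-feature steps; series here is an illustrative 1-D numpy array holding the most recent observed values, not a name from the tutorial):

import numpy as np

look_back = 20
n_steps = 20                                    # how many future values to forecast

window = series[-look_back:].astype("float32")  # seed with the last observed X values
predictions = []

for _ in range(n_steps):
    x = window.reshape(1, look_back, 1)         # (samples, timesteps, features)
    next_val = float(model.predict(x)[0, 0])
    predictions.append(next_val)
    window = np.append(window[1:], next_val)    # slide the window forward

print(predictions)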

Trend "Predictor" in Python?

I'm currently working with data frames (in pandas) that have 2 columns: the first column is some numeric quantitative data, like weight, amount of money spent on some day, GPA, etc., and the second column are date values, i.e. the date on which the corresponding column 1 entry was added on.
I was wondering, is there a way to "predict" what the next value after time X is going to be in Python? E.g. if I have 100 weight entries spanning 2-3 months (not all entries have the same time difference, so one entry could be on Day 3, the next on Day 5, and the next on Day 10), and wanted to "predict" what my next entry after 1 month would be, is there a way to do that?
I think this has something to do with Time Series Analysis, but my statistical background isn't very strong, so I don't know if that's the right approach. If it is, how could I apply it to my data frames (i.e. which packages)? Would there be any significance to the value it potentially returns, or would it be meaningless in the context of what I'm working with? Thank you.
For predicting time-series data, I feel the best choice would be an LSTM, a type of recurrent neural network that is well suited to time-series regression.
If you don't want to dive deep into the backend of neural networks, I suggest using the Keras library, which is a wrapper for the Tensorflow framework.
Let's say you have a 1-D array of values and you want to predict the next value. Code in Keras could look like:
# start off by building the training data, let arr = the list of values
X = []
y = []
for i in range(len(arr) - 100 - 1):
    X.append(arr[i:i + 100])  # get prev 100 values for the X
    y.append(arr[i + 100])    # predict next value for Y
Since an LSTM takes a 3-D input, we want to reshape our X data to have 3 dimensions:
import numpy as np
X = np.array(X)
X = X.reshape(len(X), len(X[0]), 1)
Now X is in the form (samples, timesteps, features)
Here we can build a neural network using Keras:
from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(50, input_shape=(len(X[0]), 1)))  # 50 units chosen arbitrarily; input is 3-D time-series data
model.add(Dense(1))                              # output a single predicted value
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, np.array(y))
And voilà, you can use your model to predict the next values in your data.
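A short usage sketch, assuming arr is the same 1-D list used above: feed the last 100 values, reshaped to (samples, timesteps, features), to get the next prediction.

last_window = np.array(arr[-100:]).reshape(1, 100, 1)  # most recent 100 values
next_value = model.predict(last_window)[0, 0]
print(next_value)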
Statsmodels is a Python module that provides one of the most famous methods in time series forecasting (ARIMA).
An example can be seen in the following link: https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
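For reference, a minimal statsmodels ARIMA sketch using the current statsmodels API (the (5, 1, 0) order is an arbitrary placeholder, and series is assumed to be a pandas Series of observations):

from statsmodels.tsa.arima.model import ARIMA

# series: a pandas Series of observations indexed by date (assumed available)
model = ARIMA(series, order=(5, 1, 0))   # (p, d, q) chosen arbitrarily here
fitted = model.fit()
forecast = fitted.forecast(steps=30)     # predict the next 30 periods
print(forecast)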
Other methods for time series forecasting are available in some libraries, like support vector regression, Holt-Winters and Simple Exponential Smoothing.
Spark-ts (https://github.com/sryza/spark-timeseries) is one time series library that supports Python, and provides methods like ARIMA, Holt-Winters and Exponentially Weighted Moving Average.
Libsvm (https://github.com/cjlin1/libsvm) provides support vector regression methods.

Error: Out Of Memory, tensorflow cnn

I am using the getting-started example of the TensorFlow CNN and updating the parameters to my own data, but since my model is large (244 * 244 features) I get an OutOfMemory error.
I am running the training on Ubuntu 14.04 with 4 CPUs and 16 GB of RAM.
Is there a way to shrink my data so I don't get this OOM error?
My code looks like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="path/to/model")

# Load the data
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    batch_size=5,
    shuffle=True)

# Train the model
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
Is there a way to shrink my data so I don't get this OOM error?
You can slice your training_set to obtain just a portion of the dataset. Something like:
x={"x": np.array(training_set.data)[:(len(training_set)/2)]},
y=np.array(training_set.target)[:(len(training_set)/2)],
In this example you are getting the first half of your dataset (you can select up to what point of your dataset you want to load).
Edit: Another way you can do this is to obtain a random subset of your training dataset. This you can achieve by masking elements on your dataset array. For example:
import numpy as np
from random import random as rn

# Obtain a boolean mask to filter out some elements.
# Here you can define your sample %.
r = 0.5  # say, filter out half the elements
mask = [rn() >= r for _ in range(len(training_set))]

# Finally, mask out those elements;
# the result will have roughly (1 - r) times the original elements
reduced_ds = training_set[mask]

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so it has shape (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict
a value 1 timestep after the last data point. How do I reshape it into the 3-D form for an LSTM model in Keras?
Also, it would be much more helpful if a small sample program were written. There doesn't seem to be any example/tutorial where the input has more than one feature (and is not NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np

recording_length = 5000
n_features = 4
prediction_context = 10  # Change here

# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((recording_length, 1))

# Make lists of training examples
X_in = []
Y_out = []

# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i + prediction_context, :])
    Y_out.append(to_predict[i + prediction_context])

# Convert them to numpy arrays
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
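To round out the small sample program the question asks for, here is a minimal model sketch on top of X_train and Y_train (the 32 units, epochs and batch size are arbitrary placeholders, not tuned values):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(prediction_context, n_features)))  # 32 units, arbitrary
model.add(Dense(1))  # one value predicted per window
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, Y_train, epochs=10, batch_size=32)

# Predict the value one timestep after the last observed window
last_window = X_data[-prediction_context:].reshape(1, prediction_context, n_features)
next_value = model.predict(last_window)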
